Recover failed YB-TServer and YB-Master

A cluster might be running while a YB-TServer process, YB-Master process, or a node fails.

The following sample steps demonstrate how to recover a process if you have a N-node setup with replication factor (RF)=3.

Monitor processes

It is recommended to have a cron or systemd setup to ensure that the YB-TServer and YB-Master processes are restarted if they are not running.

This handles transient failures, such as a node rebooting or process crash due to an unexpected behavior.

Node failure

Typically, if a node has failed, the system automatically recovers and continues to function with the remaining N-1 nodes. If the failed node does not recover soon enough, and N-1 >= 3, then the under-replicated tablets will be re-replicated automatically to return to RF=3 on the remaining N-1 nodes.

If a node has experienced a permanent failure on a YB-TServer, you should start another YB-TServer process on a new node. This node will join the cluster, and the load balancer will automatically take the new YB-TServer into consideration and start rebalancing tablets to it.

Master failure

If a new YB-Master needs to be started to replace a failed one, the master quorum needs to be updated. Suppose, the original YB-Masters were n1, n2, n3. And n3 needs to be replaced with a new YB-Master n4. Then you need to use the yb-admin subcommand change_master_config, as follows:

./bin/yb-admin -master_addresses n1:7100,n2:7100 change_master_config REMOVE_SERVER n3 7100
./bin/yb-admin -master_addresses n1:7100,n2:7100 change_master_config ADD_SERVER n4 7100

The YB-TServer's in-memory state automatically learns of the new master after the ADD_SERVER step and does not need a restart.

You should update the configuration file of all YB-TServer processes which specifies the master addresses to reflect the new quorum of n1, n2, n4.

This is to handle the case if the yb-tserver restarts at some point in the future.

Planned cluster changes

You might choose to perform planned cluster changes, such as moving the entire cluster to a brand new set of nodes (for example, move from machines of type A to type B). For instructions on how to do this, see Change cluster configuration.