Master fault tolerance in Kudu 1.0

Posted 24 Jun 2016 by Adar Dembo

This blog post describes how the 1.0 release of Apache Kudu (incubating) will support fault tolerance for the Kudu master, finally eliminating Kudu’s last single point of failure.

As those of you who follow this blog know by now, replication is a signature feature in Kudu. Replication is used to provide fault tolerance for all loaded data. By implementing the Raft consensus protocol, Kudu guarantees that a tablet replicated 2N+1 times can tolerate up to N failures.
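To make the 2N+1 arithmetic concrete, here is a minimal sketch in Python. It is not Kudu code, and the function name is made up for illustration. It only encodes the rule that a Raft group keeps working while a strict majority of its replicas is alive.

```python
def max_tolerated_failures(replicas: int) -> int:
    """Failures a Raft group can absorb while a strict majority survives."""
    if replicas < 1:
        raise ValueError("a Raft group needs at least one replica")
    majority = replicas // 2 + 1   # smallest strict majority
    return replicas - majority     # replicas that may fail


if __name__ == "__main__":
    for replicas in (1, 3, 5, 7):
        print(replicas, "replicas tolerate", max_tolerated_failures(replicas), "failures")
```

Note that an even replica count buys nothing extra: four replicas tolerate one failure, the same as three.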

What you may not know is that Kudu replicates its metadata, too. That is, the Kudu master stores all table and tablet metadata in a single “master” tablet. As a regular Kudu tablet itself, this master tablet may be replicated with Raft. As such, the Kudu master is a special kind of tablet server whose primary job is to host a single replica of the master tablet.

When we launched Kudu’s first beta, support for replicated masters had been implemented but was too fragile to be anything but experimental. One of our goals for Kudu’s 1.0 release is to improve replicated master support so that it can be safely enabled in production clusters.

How master replication works

To use replicated masters, a Kudu operator must deploy some number of Kudu masters, providing the hostname and port number of each master in the group via the --master_addresses command line option. For example, each master in a three-node deployment should be started with --master_addresses=<host1:port1>,<host2:port2>,<host3:port3>. In Raft parlance, this group of masters is known as a Raft configuration.
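As a small illustration, the shared flag value (spelled --master_addresses in Kudu) could be assembled as below. This is only a sketch: the helper name and the hostnames are hypothetical, and port 7051 is used on the assumption that the masters listen on Kudu's default master RPC port.

```python
from typing import List, Tuple


def master_addresses_flag(masters: List[Tuple[str, int]]) -> str:
    """Build the flag string that every master in the group is started with."""
    if not masters:
        raise ValueError("at least one master is required")
    joined = ",".join(f"{host}:{port}" for host, port in masters)
    return f"--master_addresses={joined}"


if __name__ == "__main__":
    # Hypothetical hosts; every master receives the same, complete list.
    group = [
        ("master-1.example.com", 7051),
        ("master-2.example.com", 7051),
        ("master-3.example.com", 7051),
    ]
    print(master_addresses_flag(group))
```

The point to notice is that the list is identical on every master, including the entry for the master itself.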

At startup, a Raft configuration of masters will hold a leader election and elect one master as the leader. The leader master is responsible for servicing both tablet server heartbeats and client requests. The remaining masters are followers: they participate in Raft consensus and replicate writes sent by the leader, but are otherwise idle. Any client requests they receive are rejected. Likewise, all tablet server heartbeats they receive are ignored. If the leader master ever dies or steps down, the remaining replicas hold an election to determine the new leader.
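The division of labor between leader and followers can be sketched as a toy model. Everything here is an assumption made for illustration: the class and function names are invented, and the "election" simply picks the first live master once a majority is alive, whereas real Raft elections use terms and majority votes.

```python
from enum import Enum


class Role(Enum):
    LEADER = "leader"
    FOLLOWER = "follower"


class Master:
    def __init__(self, name: str):
        self.name = name
        self.role = Role.FOLLOWER
        self.alive = True
        self.heartbeats = []  # tablet servers this master has heard from

    def on_client_request(self, request: str):
        # Followers reject client requests outright.
        if self.role is not Role.LEADER:
            return ("rejected", None)
        return ("ok", f"{self.name}:{request}")

    def on_heartbeat(self, tserver: str) -> bool:
        # Followers ignore heartbeats; only the leader records them.
        if self.role is not Role.LEADER:
            return False
        self.heartbeats.append(tserver)
        return True


def elect(masters):
    """Toy election: requires a live majority, then promotes the first live master."""
    live = [m for m in masters if m.alive]
    if len(live) <= len(masters) // 2:
        raise RuntimeError("no majority of masters alive; cannot elect a leader")
    for m in masters:
        m.role = Role.FOLLOWER
    live[0].role = Role.LEADER
    return live[0]
```

Marking the leader as dead and calling elect() again models a failover: one of the surviving masters takes over, as long as a majority is still alive.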

All persistent master metadata is stored in the single replicated “master” tablet. Every row in this tablet represents either a table or a tablet. Table records include unique table identifiers, the table’s schema, and other bits of information. Tablet records include a unique identifier, the tablet’s Raft configuration, and other information.

What master metadata is replicated?

  1. Table and tablet existence, via CreateTable() and DeleteTable(). Every new tablet record also includes an initial Raft configuration.
  2. Schema changes, via AlterTable() and tablet server heartbeats.
  3. Tablet server Raft configuration changes, via tablet server heartbeats. These include both the list of Raft peers (which may have changed due to under-replication) and the current leader (which may have changed due to an election).

Scanning the master tablet to service every heartbeat or client request would be slow, so the leader master caches all master metadata in memory. The caches are only updated after a metadata change is successfully replicated; in this way they are always consistent with the on-disk tablet. When a new leader master is elected, it scans the entire master tablet and uses the metadata to rebuild its in-memory caches.
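A minimal sketch of this caching discipline follows. It is not Kudu's implementation: the classes are hypothetical stand-ins, and the healthy flag is an invented switch that simulates a failed replication. It shows the two rules from the paragraph above: the cache changes only after a write has been replicated, and a newly elected leader rebuilds the cache with a full scan.

```python
class MasterTablet:
    """Stand-in for the replicated, on-disk master tablet."""

    def __init__(self):
        self.rows = {}
        self.healthy = True  # invented switch to simulate replication failure

    def replicate_write(self, key, value) -> bool:
        # In a real system this succeeds only once a Raft majority acknowledges.
        if not self.healthy:
            return False
        self.rows[key] = value
        return True

    def scan(self):
        return dict(self.rows)


class LeaderMetadataCache:
    def __init__(self, tablet: MasterTablet):
        self.tablet = tablet
        self.cache = {}

    def on_elected(self):
        # A new leader rebuilds its cache from the entire master tablet.
        self.cache = self.tablet.scan()

    def update(self, key, value):
        # Replicate first; touch the cache only after replication succeeds.
        if not self.tablet.replicate_write(key, value):
            raise RuntimeError("replication failed; cache left unchanged")
        self.cache[key] = value

    def lookup(self, key):
        # Reads are served from memory and never scan the tablet.
        return self.cache.get(key)
```

Ordering the write before the cache update is what keeps the two in agreement: a failed replication leaves both untouched.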

Communication with replicated masters

All tablet servers start up with location information for the entire master Raft configuration and will periodically heartbeat to every master. Similarly, clients are also configured with the locations of all masters. Unlike tablet servers, however, clients communicate only with the leader master, because follower masters reject client requests. To do this, a client must determine which master is the leader before sending its first request, and again whenever a request fails with a NOT_THE_LEADER error.
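The client-side behavior can be sketched as a small retry loop. This is an illustration under stated assumptions, not the Kudu client: FakeCluster, Client, and the NotTheLeader exception are invented stand-ins for the real RPC layer and the NOT_THE_LEADER error, and the retry limit is arbitrary.

```python
class NotTheLeader(Exception):
    """Stand-in for a request rejected with NOT_THE_LEADER."""


class FakeCluster:
    """Invented stand-in for the RPC layer; tracks which master leads."""

    def __init__(self, masters, leader):
        self.masters = list(masters)
        self.leader = leader

    def is_leader(self, addr) -> bool:
        return addr == self.leader

    def send(self, addr, request):
        if addr != self.leader:
            raise NotTheLeader(addr)
        return f"{addr} served {request}"


class Client:
    def __init__(self, cluster: FakeCluster):
        self.cluster = cluster
        self.leader = None  # unknown until the first request

    def _find_leader(self):
        # Ask each configured master whether it is currently the leader.
        for addr in self.cluster.masters:
            if self.cluster.is_leader(addr):
                return addr
        raise RuntimeError("no leader master found")

    def request(self, payload, max_attempts=3):
        for _ in range(max_attempts):
            if self.leader is None:
                self.leader = self._find_leader()
            try:
                return self.cluster.send(self.leader, payload)
            except NotTheLeader:
                # Leadership moved; forget the cached leader and rediscover.
                self.leader = None
        raise RuntimeError("gave up after repeated NOT_THE_LEADER errors")
```

Caching the leader's address keeps the common case to a single round trip; rediscovery happens only on first use or after a rejection.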

Remaining work for Kudu 1.0

KUDU-422 tracks the remaining master replication work. The guts of this feature were implemented as far back as early 2015; the remaining work has focused on fixing bugs that manifest only under specific conditions. For example, we’ve observed failures in DDL operations (e.g. CreateTable()) that only materialize upon the completion of a master leader election. These failures highlight some of the gaps in our testing regimen: we need a robust stress test that repeatedly performs such operations while holding master leader elections.

That said, there is one remaining work item of larger scope: there’s no mechanism with which to perform a Raft configuration change for replicated masters. Such a mechanism would have multiple uses:

  1. Migrating from a single-node master deployment to a fully replicated three-node (or five-node) deployment.
  2. Replacing a failed master with a new one.

This is being tracked by KUDU-1474, and there’s been some discussion around a design, but nothing has been implemented yet. Stay tuned!