What happens when a ZooKeeper dies?

This post is motivated by a basic question I was wondering about a couple of days ago, and summarizes what I've learned so far.

Apache ZooKeeper is a system used to coordinate distributed systems (including HBase, YARN, and Kafka) by providing a highly available, filesystem-like service that can store configuration data. The ZooKeeper service is like a key-value store in which the keys form a hierarchy of tree nodes (znodes). ZooKeeper's utility comes from maintaining state about partitions and leadership among hosts in a client distributed system. A client system like Apache Kafka relies on ZooKeeper to track this state reliably, and ZooKeeper keeps that state consistent across its own servers via a consensus protocol called Zab (ZooKeeper Atomic Broadcast). An important characteristic of ZooKeeper is that every mutation is totally ordered. Every update is stamped with a zxid, a globally monotonically increasing 64-bit integer encoding its place in the total ordering, and each znode records the zxids of the updates that created and last modified it. More precisely, the high 32 bits of the zxid hold the epoch, which identifies the leader election cycle (discussed below), and the low 32 bits hold the transaction number within that epoch.
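To make the zxid layout concrete, here is a small sketch of packing and unpacking one. The function names `make_zxid` and `split_zxid` are my own, not part of any ZooKeeper client library; the only thing taken from ZooKeeper is the layout itself, with the epoch in the high 32 bits and the per-epoch counter in the low 32 bits.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a per-epoch counter into one 64-bit zxid."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> tuple:
    """Return (epoch, counter) for a zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


if __name__ == "__main__":
    zxid = make_zxid(epoch=2, counter=5)
    print(hex(zxid))         # 0x200000005
    print(split_zxid(zxid))  # (2, 5)
```

Because the epoch sits in the high bits, any transaction from a newer epoch compares greater than every transaction from an older one, which is what lets a plain integer comparison stand in for the total ordering.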

ZooKeeper ensures high availability by replicating its own state across multiple servers organized in a single-leader configuration. When the ZooKeeper state is to be modified, the write is routed to the leader, which broadcasts it to the followers as a proposal. Once a quorum of nodes (a majority of the cluster, counting the leader) has acknowledged the proposal, the leader commits it and tells the followers to commit it as well.

So, what happens when a ZooKeeper server falls over? Since ZooKeeper requires a quorum (a strict majority) of nodes to operate, a cluster of three nodes can lose one and continue. In order to lose two nodes and continue to accept writes, the cluster must have a total size of five. This is why ZooKeeper clusters are typically configured with odd numbers of servers: a fourth node raises the quorum to three, so a four-node cluster still tolerates only one failure.
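The arithmetic is easy to check with a few lines of Python. This is just the majority rule written out, with helper names (`quorum_size`, `tolerated_failures`) that I made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a quorum is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in range(3, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Going from three servers to four, or from five to six, adds a machine without adding any failure tolerance.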

If a ZooKeeper follower fails, that's that. As long as the remaining servers form a quorum, the service will continue to operate. However, if a ZooKeeper leader fails, a leader election takes place. A simplified description of the process is as follows:

  1. The surviving nodes broadcast their highest zxid, tracking the highest seen thus far.
  2. They then vote for the ZooKeeper node with the highest zxid, i.e. the one with the most recent view of the world, as leader, breaking ties by server ID.
  3. Once a quorum of surviving ZooKeeper hosts has agreed on a new leader and it acknowledges, it begins accepting and broadcasting updates.
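The simplified steps above can be sketched as a single function. This is a toy model of the outcome, not the real election protocol, which exchanges votes over several rounds and also compares election epochs. The name `elect_leader` and its inputs are hypothetical, and I am assuming that the higher server ID wins a tie.

```python
from typing import Dict, Optional


def elect_leader(last_zxid: Dict[int, int], ensemble_size: int) -> Optional[int]:
    """Pick a leader among the surviving servers.

    last_zxid maps each surviving server's ID to the highest zxid it has seen.
    Returns the winning server ID, or None if the survivors are not a quorum.
    """
    quorum = ensemble_size // 2 + 1
    if len(last_zxid) < quorum:
        return None  # too few survivors to agree on anything
    # Highest zxid wins; ties go to the higher server ID (an assumption here).
    return max(last_zxid, key=lambda sid: (last_zxid[sid], sid))


if __name__ == "__main__":
    # Five-server cluster in which servers 4 and 5 (one of them the old leader) died.
    survivors = {1: 0x100000005, 2: 0x100000007, 3: 0x100000007}
    print(elect_leader(survivors, ensemble_size=5))  # 3
```

Servers 2 and 3 have seen the same latest transaction, so the tie falls to server 3. With only two survivors out of five, the function returns `None`, mirroring a cluster that cannot elect a leader or accept writes.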

I found this Jepsen test helpful for following this process in the face of a (forced) network partition.