What happens when a ZooKeeper dies?
This post is motivated by a basic question I was wondering about a couple of days ago, and summarizes what I've learned so far.
Apache ZooKeeper is a system used to coordinate distributed systems (including HBase, YARN, and Kafka) by providing a highly-available filesystem-like service that can store configuration data. The ZooKeeper service is like a key-value store, in which the keys can form a hierarchy of tree nodes (znodes). The utility of ZooKeeper comes in maintaining state about partitions and leadership among hosts in a client distributed system. A client system like Apache Kafka relies on ZooKeeper to reliably track this state via a consensus protocol called ZAB. An important characteristic of ZooKeeper is that every mutation is totally ordered. Every znode receives a zxid, which is a global monotonically increasing 64-bit integer encoding its place in the total ordering. Actually, the zxid uses 32-bits to represent the leader election cycle (discussed below) called an epoch, and the remaining 32-bits to represent the transaction number within that epoch.
ZooKeeper ensures high availability by duplicating its own state across multiple servers organized in a single leader configuration. When the ZooKeeper state is to be modified, the write goes to the leader node and is broadcast to the follower nodes. When a response comes back from a quorum of nodes, it is finalized on the leader. Once the leader has finalized the result, the followers replicate it.
So, what happens when a ZooKeeper server falls over? Since ZooKeeper requires a quorum of nodes to operate, a cluster of three nodes can lose one and continue. In order to lose two nodes and continue to accept writes, the cluster must have a total size of five. This is why ZooKeeper clusters are typically configured with odd numbers of servers.
If a ZooKeeper follower fails, that's that. As long as the remaining servers form a quorum, the service will continue to operate. However, if a ZooKeeper leader fails, a leader election takes place. A simplified description of the process is as follows:
- The surviving nodes broadcast their highest zxid, tracking the highest seen thus far
- They then vote for the ZooKeeper node with the highest zxid i.e. the one with the most recent view of the world as leader, breaking ties by server ID.
- Once a quorum of surviving ZooKeeper hosts have agreed on a new leader and it acknowledges, it begin accepting and broadcasting updates.
I found this Jepsen test to be helpful for this process along in the face of a (forced) network partition.