Monday, July 23, 2012

Master election in Neo4j High Availability Clusters

Let me take you on a journey through one of the most challenging parts of Neo4j's current HA architecture - the master election algorithm. We will start by exploring why a master exists, pass through the high-level algorithm that master election is based on, and end up understanding how it is implemented. After we arrive at our destination, we'll reflect upon some recent changes that don't alter the base algorithm but significantly improve its performance characteristics.

Getting in the mindset: What we need the master for

The HA implementation of Neo4j is a way to consistently replicate the graph across multiple machines, eventually propagating writes through the whole cluster. The key point here is replication - HA is not about decentralizing the graph (a solution known as sharding). This replication is based on an authoritative source that receives all updates and hands them out to the rest of the database instances as needed to keep them up to date. That machine is also needed for keeping track of some other things, such as locks and transactional context. So, you guessed it, that machine is the master - the provider of the ground truth about what the graph and the current mutating operations are.

Filling up our backpack: Getting hold of the basic concepts

When a Neo4j cluster is set up it consists of a set of pretty much equivalent, random machines - there is no special requirement for designating a machine as master beforehand. Additionally, it wouldn't be all that Highly Available if only one particular machine could be master - if that machine fails, what then? So, we need a way to elect a master dynamically and on demand.
To do that we depend on a number of primitives, the implementation of which I will not share right now - it is a mechanical thing, definitely interesting but not directly relevant. We'll talk about them later. What they are, though - now that is extremely relevant. The basis for all operations, and actually the only primitive needed for implementing any distributed system, is Atomic Broadcast (AB) - essentially a method for any instance to reliably send a specific message to the whole cluster. Reliably in our case means no message loss, duplication or corruption, and guaranteed message ordering. But I'll agree if you say that this is too generic to be useful, so let's narrow it down a bit. First, we need a heartbeat facility, a way for each instance participating in an HA cluster to broadcast its existence to the rest of the cluster. Not broadcasting this heartbeat is equivalent to being in a failed state and not currently participating in cluster operations. Second, we need a way to manage cluster membership - finding out which machines currently comprise the cluster. While very similar to the heartbeat mechanism, it is not the same concern, though the final implementation may be the same. Indeed, in Neo4j HA we use the same solution for both and that is how we'll treat it for the rest of this discussion. Third, we need an event notification mechanism that allows instances to broadcast alerts to the rest of the cluster. Finally, we require a method for instances to distribute key pieces of information that are needed by all other instances to make decisions.
Let me repeat - all the above are specialized cases of AB. Having them categorized like this, however, makes it easier to discuss implementation specifics later on. Also, since the implementation is a detail, we tend to refer to this service as the Coordinator - that is what neo4j-coordinator start provides.
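To make the four primitives a bit more concrete, here is a minimal sketch of how they could be expressed as a Java interface. This is purely illustrative - the interface, the method names and the use of a server id as the instance identifier are assumptions made for the example, not anything taken from the Neo4j codebase.

```java
import java.util.Set;

// Hypothetical sketch of the four primitives, all expressible on top of
// Atomic Broadcast. Names and signatures are illustrative only.
public interface ClusterPrimitives
{
    // Heartbeat: periodically announce that this instance is alive.
    void sendHeartbeat( int serverId );

    // Cluster membership: which instances currently comprise the cluster?
    Set<Integer> getAliveMembers();

    // Event notification: broadcast an alert to the whole cluster,
    // e.g. "I elected server 2 as master".
    void broadcastEvent( String eventType, byte[] payload );

    // Information distribution: publish a key piece of data (e.g. the latest
    // TxID) so that every other instance can read it when making decisions.
    void publish( String key, byte[] value );

    byte[] read( int serverId, String key );
}
```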

Taking the first steps: How the master is elected

As already mentioned, the master is the provider of ground truth - the keeper of the knowledge of the real graph. Among other things, that means the master must have the most up to date version of the database. In Neo4j land the means to determine the version of the graph is the id of the latest transaction executed on it. You see, since transaction application has to be serialized either way, we can, at no additional cost, assign to each transaction a monotonically increasing id. The id of the latest transaction committed on the graph essentially characterizes the version of the database. Of course, that information is generated internally by the system and as such cannot be used alone for election purposes. An externally provided, priority-enforcing piece of information is also needed, at the very least for breaking ties in case two machines have the same latest TxID - a very common occurrence, since most of the time the instances are not lagging behind the master. This information can be anything, from each instance's IP address to any guaranteed unique number present on the machine (CPUID for example). In Neo4j's case we chose to have this as a configuration option, known as ha.server_id. The lower the server id of an instance, the higher its priority in being elected as master.
By now the master election algorithm should be clear - let me just fill in the gaps. When an instance realizes that a master is needed, it gathers the latest transaction id for each machine in the cluster. The machine that has the largest one is elected master, with ties broken by the server id value. After the machine chooses the master, it notifies the cluster of the result, and the same algorithm is executed on all other machines, including the new master. If all machines reach the same conclusion, no new notification takes place (since that would lead to infinite election rounds).
Notice how we have taken for granted and integrated in the above the primitives outlined in the previous section. The need for a master election is communicated by the lack of heartbeats from the current master. The cluster membership mechanism is used to answer the question of which machines participate in the cluster and should be considered for election. The event notification is used to communicate the result of the master election a machine performed. Finally, the information distribution channel is used to gather the latest transaction id for each cluster member prior to doing the election.
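As a rough illustration of the decision itself, here is a hedged sketch in Java: the highest latest transaction id wins, with ties broken by the lowest server id. The class and method names are invented for the example; this is not the actual Neo4j implementation.

```java
import java.util.Map;

// Hedged sketch of the election decision: the highest latest TxID wins,
// with ties broken by the lowest ha.server_id. Not the actual Neo4j code.
public class MasterElection
{
    public static int electMaster( Map<Integer, Long> latestTxIdPerServer )
    {
        int bestServerId = -1;
        long bestTxId = -1;

        for ( Map.Entry<Integer, Long> entry : latestTxIdPerServer.entrySet() )
        {
            int serverId = entry.getKey();
            long txId = entry.getValue();

            // Prefer the larger TxID; on a tie, prefer the smaller server id.
            if ( txId > bestTxId || ( txId == bestTxId && serverId < bestServerId ) )
            {
                bestTxId = txId;
                bestServerId = serverId;
            }
        }
        return bestServerId;
    }
}
```

Since every instance runs the same decision over the same gathered data, they normally all reach the same conclusion, which is why a single notification of the result is enough.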

Deeper in the jungle: Using ZooKeeper as a guide

It's that time when we need to stop being abstract and get down and dirty. In particular, we need to talk about how the required primitives (or actually the one primitive, AB) are implemented in an HA cluster. The answer is Apache ZooKeeper. ZooKeeper is an implementation of ZooKeeper Atomic Broadcast (ZAB), an AB protocol that (unlike Paxos, for example) guarantees that messages from a client are delivered in the order they are issued. A very interesting concept ZooKeeper implements is a filesystem abstraction for its operations. A ZooKeeper quorum is viewed as a distributed, highly available hierarchical filesystem with some additional perks. The most interesting of those perks is the notion of watches. A client can set a watch on a file in the ZooKeeper filesystem and receive a notification when that node's contents change, one of its children is deleted, or a node is added as a child (ZK nodes can have both contents and child nodes, making them a file/directory hybrid, a concept not encountered in everyday filesystems). Also, ZooKeeper has the notion of ephemeral nodes, which are bound to the lifetime of a client. If the ZooKeeper quorum loses contact with a client, all nodes that client created with the ephemeral flag are automatically removed.
Having said that, let's see how Neo4j structures its file layout in ZooKeeper and how the primitives are implemented. If a machine does not find any structure in place (it is the first one), it creates a node that is the cluster root. Then it adds an ephemeral node as the root's child, named after the instance's server id and containing as contents the latest transaction id of its database. It also creates some more administrative nodes, the most interesting of which is master-notify, to which it adds a watch. A watch is also added to the root node. Each subsequent machine will find the root and hence perform the same steps, apart from creating the root and administrative nodes. All nodes and their contents are readable and writable by all cluster instances.
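To give a feel for what that looks like against the plain ZooKeeper client API, here is a rough sketch of an instance registering itself. The paths (/neo4j-ha, /neo4j-ha/master-notify) and the overall structure are invented for illustration and do not reflect Neo4j's actual layout; error handling and the race between checking for and creating the root are glossed over.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch of an instance registering itself through the plain ZooKeeper
// client API. Paths and structure are illustrative, not Neo4j's actual layout.
public class ClusterRegistration
{
    public void register( ZooKeeper zk, int serverId, long latestTxId, Watcher watcher )
            throws KeeperException, InterruptedException
    {
        // First instance to arrive creates the root and the administrative nodes.
        // (A real implementation would handle the race where another instance
        // creates them concurrently.)
        if ( zk.exists( "/neo4j-ha", false ) == null )
        {
            zk.create( "/neo4j-ha", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT );
            zk.create( "/neo4j-ha/master-notify", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT );
        }

        // Ephemeral node named after our server id, containing our latest TxID.
        // It disappears automatically if ZooKeeper loses contact with us.
        zk.create( "/neo4j-ha/" + serverId, Long.toString( latestTxId ).getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL );

        // Watch the root for members coming and going, and master-notify for
        // election results.
        zk.getChildren( "/neo4j-ha", watcher );
        zk.getData( "/neo4j-ha/master-notify", watcher, null );
    }
}
```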
We are set. While an instance is available, its ephemeral node will be present. When it fails, its ephemeral node will be removed from the root's children and all instances will receive a node deleted event. If that instance was the master, a master election will be triggered, and the first instance to complete it will broadcast its result by writing it to the master-notify node. Note that since the latest transaction id is shared via ephemeral nodes, all machines that share their latest transaction id are also assumed reachable and eligible for election.
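The failure handling side might, under the same made-up layout, look roughly like the sketch below. It reuses the hypothetical MasterElection.electMaster from earlier; the helper methods, paths and structure are again assumptions, and a real implementation would also re-register ZooKeeper's one-shot watches.

```java
import java.util.Collections;
import java.util.Map;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch of reacting to the master's ephemeral node disappearing. Paths,
// helpers and the overall structure are hypothetical.
public class MasterFailureWatcher implements Watcher
{
    private final ZooKeeper zk;

    public MasterFailureWatcher( ZooKeeper zk )
    {
        this.zk = zk;
    }

    @Override
    public void process( WatchedEvent event )
    {
        if ( event.getType() == Event.EventType.NodeDeleted
                && ( "/neo4j-ha/" + currentMasterId() ).equals( event.getPath() ) )
        {
            try
            {
                // Elect over the TxIDs published by the surviving members and
                // broadcast the result through the master-notify node.
                int newMaster = MasterElection.electMaster( gatherLatestTxIds() );
                zk.setData( "/neo4j-ha/master-notify",
                        Integer.toString( newMaster ).getBytes(), -1 );
            }
            catch ( Exception e )
            {
                // A real implementation would retry and/or log here.
            }
        }
    }

    // Placeholder helpers: a real implementation tracks the current master and
    // reads each member's latest TxID from its ephemeral node.
    private int currentMasterId()
    {
        return 1;
    }

    private Map<Integer, Long> gatherLatestTxIds()
    {
        return Collections.emptyMap();
    }
}
```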

Almost home: The brutalities of the real world

We are actually there. We have explained, with little handwaving, how master election is performed. One detail we glossed over is the fact that a master election may happen not because the master became unreachable, but because an instance joins a running cluster and actually has to decide which machine is the master. This means that all machines must keep their Coordinator node data up to date. In particular, the master must keep its latest transaction id really up to date, lest it lose the election and branched data occur. To guard against this scenario, the master must update the Coordinator synchronously with each committed transaction. So, HA deployments (until 1.8M05) bottleneck transaction commits on writes to the Coordinator. M06 (and of course 1.8GA) removes this bottleneck by adding an extra watched node which is written to by any instance that wants to do master election. When the watch is triggered, each instance starts synchronously flushing its transaction id. When the master election round ends, the watch is triggered again and all instances stop flushing their TxIDs. That limits Coordinator writes to the periods where they are actually necessary, without any loss in cluster consistency or change in master election semantics. The performance increase is highly site-dependent and affects only write operations, but it can easily surpass 2-3x the previous performance.
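One way to picture this change is a watched "election in progress" flag that controls whether the commit path also pushes the latest TxID to the Coordinator. The sketch below is speculative - the flag node path (/neo4j-ha/flush-requested) and the class are invented for illustration and are not the actual implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Speculative sketch of "flush TxIDs only while an election is running".
// The flag node path and the class itself are invented for illustration.
public class TxIdFlushToggle implements Watcher
{
    private final ZooKeeper zk;
    private final int ownServerId;
    private final AtomicBoolean electionInProgress = new AtomicBoolean( false );

    public TxIdFlushToggle( ZooKeeper zk, int ownServerId )
    {
        this.zk = zk;
        this.ownServerId = ownServerId;
    }

    @Override
    public void process( WatchedEvent event )
    {
        if ( event.getType() == Event.EventType.NodeDataChanged
                && "/neo4j-ha/flush-requested".equals( event.getPath() ) )
        {
            try
            {
                // Re-arm the watch and flip the flag: non-empty contents mean
                // an election round is in progress somewhere in the cluster.
                byte[] data = zk.getData( "/neo4j-ha/flush-requested", this, null );
                electionInProgress.set( data != null && data.length > 0 );
            }
            catch ( Exception e )
            {
                // A real implementation would retry and/or log here.
            }
        }
    }

    // Called from the commit path after a transaction has been applied locally.
    public void onTransactionCommitted( long txId ) throws Exception
    {
        if ( electionInProgress.get() )
        {
            // Pay the Coordinator round trip only while an election may use the value.
            zk.setData( "/neo4j-ha/" + ownServerId, Long.toString( txId ).getBytes(), -1 );
        }
    }
}
```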

Putting our feet up: Lessons learned and future travel

You should now have a mental picture of how Neo4j HA works and what to expect if you venture into the HA codebase, at least its master-related parts. You should also have understood why you need to configure, run and maintain the Coordinator service that comes bundled with Neo4j Enterprise edition. You are expected to have more questions now than when you started reading - that is good, it means that understanding has been achieved - keep in mind I have purposefully glossed over some large chunks.
Our understanding of how distributed systems must work and be built continuously evolves, and we want to stop depending on third-party components for such critical infrastructure. We are therefore in the process of implementing our own AB solution, based on some well-studied and battle-proven algorithms, hopefully coming soon in a milestone release. Until then, keep tending to your HA clusters, direct questions to the Neo4j forums, and keep an eye on the horizon for future travels in Neo4j land.

This work is licensed under a Creative Commons Attribution 3.0 Unported License.
