The goal of leader election is to give special powers to specific entities in a distributed system, such as processes, hosts, threads, objects, or people. These powers may include the ability to delegate tasks, the ability to modify data, or the responsibility to handle all system requests. Leader election can be a useful tool for improving efficiency, minimizing coordination, simplifying architecture, and reducing overhead, but it can also introduce additional failure modes and scaling challenges, and make it more difficult to assess the effectiveness of a system.
To achieve this, the leader elects an algorithm to select a processor to coordinate the operation of the distributed system. Leaders are usually selected based on criteria, such as choosing a processor with the highest identifier. When a leader is selected, the other processors enter a terminated state. In the leader election algorithm, the termination state is divided into electoral and non-elective states. Once the processor enters an elective or non-elective state, it remains in that state at all times.
Leader election algorithms must meet security and liveness conditions to be effective. The active condition indicates that each processor will eventually enter an elective or non-elective state. The security conditions stipulate that only one processor is allowed to enter the elective state and become the leader of the distributed network.
There are three scenarios to consider when determining whether a leader election is appropriate:
When each node is approximately the same and there is no clear permanent assignment of leader contenders: in this case, any node can be elected as the leader of the system, and there is no single point of failure.
When a cluster performs complex tasks that require close collaboration: Coordination can involve dividing the work, assigning the work to specific nodes, and combining the results of different nodes. For example, in a scientific calculation that determines how proteins fold, it may be necessary to require the leader node to assign each node to a specific part of the calculation and then combine the results to obtain a fully folded protein configuration.
When the system performs multiple distributed writes to data and requires good consistency: Consistency means that the user always has the latest version of the data, regardless of which node processes the request. In this case, the leader ensures consistency by being the true ** of the current state of the system, and the leader election algorithm must properly maintain this. For example, a bank may need strong consistency to ensure that no matter which server responds to a user's online banking request, the user's bank account balance is accurate and that multiple transactions on the same bank account do not conflict with each other.
Leader election is a common pattern in distributed systems because it has several benefits:
It's easier to think about a system with a single leader because it consolidates all the concurrency of the system, reduces the risk of partial failure, and provides a single location for logs and metrics.
A single leader can be more efficient because it can simply communicate changes to other systems without the need to reach a consensus on those changes.
A single leader can easily provide consistency to customers by observing and controlling all changes in the system's state.
A single leader can improve performance or reduce costs by providing a single, consistent data cache that can be used simultaneously.
Writing software for a single leader may be simpler than writing software for a quorum because a single leader doesn't need to think about other systems that might be working in the same state at the same time.
There are also some significant drawbacks to the election of leaders:
A single leader creates a single point of failure because if the system is unable to detect or correct a faulty leader, the entire system can become unavailable.
A single leader limits the size of the system in terms of data size and request rates, and if the system needs to scale beyond a single leader, a complete re-architecture may be required.
Leaders represent a single point of trust, as problematic leaders can quickly propagate problems throughout the system and make a huge impact (known as the "radius").
Leader-Elected systems can be difficult to partially deploy, which is used in many of Amazon's software security policies, such as one-box, A-B testing, blue-green deployments, incremental deployments with automatic rollbacks, etc.
The leader election algorithm guides the cluster to collectively agree on a node as the leader and minimizes back-and-forth interactions. Typically, the algorithm assigns one of three states to each node: leader, follower, or candidate. The leader also needs to pass a "health check" or "heartbeat" on a regular basis so that the follower node can determine if the leader is unavailable or invalid, and can elect a new leader.
The type of leader election mechanism used depends on whether the cluster is synchronous or asynchronous. In a synchronous cluster, nodes synchronize to the same clock and send messages in a possible order. In an asynchronous cluster, messages cannot be reliably sent within a specific time frame or in any particular order.
Asynchronous algorithms cannot guarantee both security (ensuring that only one leader is elected) and liveness (ensuring that every node is elected) because any number of nodes in an asynchronous cluster can lag indefinitely. In practice, implementation prioritizes security as it can have more serious consequences for the service.
Synchronization algorithms are easier to understand and may be preferable because they guarantee security and liveness. However, synchronous clusters need to impose additional restrictions on how they operate, which may not be possible or scalable in practice.
The bully algorithm is a simple synchronous leader election technique that requires each node to have a unique numeric ID and that all nodes in the cluster know each other's IDs. The election process begins when a node starts or the current leader health check fails.
There are two possible outcomes:
If a node has the highest ID, it declares itself the winner and sends this message to the other nodes.
If one node has a low ID, it will send a message to all nodes with a higher ID. If it doesn't receive a response, it assumes that the nodes have failed or become unavailable and declares itself the winner.
Paxos is a general-purpose consensus protocol that can be used for asynchronous leader elections. The idea behind the Paxos algorithm is to reach a consensus among the nodes in the network on which node should be the leader. In the algorithm, one node proposes itself as a leader, and other nodes in the network vote for or against it. If a majority of nodes vote for that node, that node becomes the leader. If not, the node that failed to become a leader tries again. The process continues until the node receives a majority of votes and becomes the leader.
Raft is a popular alternative to Paxos because it's easier to understand, implement, and use. It is a non-blocking algorithm that involves each node in the raft consensus keeping track of the current "election term". When the leader election begins, each node increments a copy of its term number and listens for messages from other nodes. If the node does not receive any messages after a random interval, it becomes a candidate leader and requests votes from other nodes.
If the candidate wins a majority of votes, he becomes the leader. If another candidate with a higher term number sends a message, the original candidate acknowledges. If the election occurs or the timer times out without consensus, the algorithm will restart. Restarts are rare due to random timeouts, which prevents node collisions.
Apache ZooKeeper is a "self-distributed and highly reliable" centralized orchestration service. It is designed to help distributed systems handle coordination tasks. The idea behind Apache Zookeeper is that coordination is difficult, so it's best to have a shared, open-source implementation with all the necessary components so that services don't have to reinvent everything from scratch. This is especially useful in large distributed systems.
Apache ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol to manage leader election, replication order guarantees, and node recovery. ZAB is an asynchronous algorithm that ensures that writes are consistent and propagated to all nodes by "broadcasting" state changes from the leader to the followers.
Leader election is a powerful technique that can be used in a system to help make the system more fault-tolerant and easier to use. However, when we adopt leader elections, we pay close attention to the guarantees that each protocol provides and, more crucially, the guarantees that are not provided.
List of high-quality authors