Last time, when we covered the CAP theorem, we learned that in the case of a network partition we have to choose between consistency and availability, and more precisely between strong consistency and eventual consistency. Around eventual consistency we usually hear about techniques such as read repair, write repair, and anti-entropy. Around strong consistency, on the other hand, we usually hear about algorithms attributed to specific papers or authors. The reason is that a strong consistency algorithm needs a rigorous mathematical proof before people will trust it, and it also needs to be validated at large scale in production before it is widely accepted, so only a few strong consistency algorithms are widely used in practice, such as Paxos, ZooKeeper's ZAB, and Raft. Since some scenarios, such as financial payments and ticketing systems, must guarantee strong consistency, this article introduces this family of strong consistency algorithms.
Any discussion of consensus algorithms has to mention Paxos. When Paxos was first proposed it lacked engineering implementation details and was more of a theoretical framework, with the result that later algorithms that do spell out implementation details still end up looking like Paxos; some people even say there is only one consensus algorithm in the world, and that is Paxos. I won't expand on the details of Paxos here, because the algorithm is notoriously difficult to understand, so much so that its author, Leslie Lamport, later published "Paxos Made Simple" to explain it. Paxos is mentioned here because its importance lies in its rigorous mathematical proof. If you really want to understand Paxos, I recommend first studying other members of the Paxos family, such as the Raft algorithm covered in this article; and if you are interested in the engineering side of Paxos, see the Google team's practical summary, "Paxos Made Live — An Engineering Perspective".
From the perspective of the problems they solve, there are two kinds of consensus algorithms: Byzantine fault tolerance (BFT) algorithms and crash fault tolerance (CFT) algorithms. Byzantine fault-tolerant algorithms deal with how to keep the cluster state in sync when some nodes may act maliciously; a common example is PBFT. Crash fault-tolerant algorithms deal with keeping the cluster state consistent when nodes fail or run into network issues; common examples include ZAB and Raft.
Although the Byzantine Generals Problem and the Byzantine fault-tolerant algorithm PBFT were proposed quite early, it was arguably not until the rise of blockchains that they found large-scale applications; most systems inside enterprises rely on crash fault-tolerant algorithms instead, such as Chubby, Google's distributed lock service that reaches consensus via Paxos. Today we will introduce the consensus algorithm most commonly seen in enterprise systems: Raft.
Raft was created by Diego Ongaro, then a Ph.D. student at Stanford, who co-authored "In Search of an Understandable Consensus Algorithm" with his advisor John Ousterhout in 2013; the paper received the Best Paper Award at the 2014 USENIX Annual Technical Conference.
The biggest advantage of an easy-to-understand algorithm is that it is hard to get wrong in engineering, which is why new systems built after 2013 that need strong consistency usually reach for Raft first, such as etcd in 2013, Consul and InfluxDB, and IPFS and CockroachDB in 2015. Another advantage of understandability is that there are plenty of implementations to use as references; if you search GitHub for paxos and raft, you will find nearly a three-fold difference in the number of repositories.
Next, I'll go into the details of the Raft algorithm, starting with what each node records (its state, term, log, and data state machine), and then the two core modules: leader election and log replication.
In Raft a node can be in one of three states: leader, candidate, and follower. Raft is a strong leader model, so a Raft cluster can have only one leader, and the other nodes defer to it: whatever the leader says goes. This is also why Raft can only provide crash fault tolerance (CFT), not Byzantine fault tolerance (BFT).
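To make the three roles concrete, here is a minimal Go sketch (the type and constant names are my own, not taken from any particular Raft library) of how a node might record which role it is currently playing:

```go
// NodeState is the role a raft node is currently playing.
type NodeState int

const (
	Follower  NodeState = iota // default role; passively replicates the leader's log
	Candidate                  // asking other nodes for votes after an election timeout
	Leader                     // the single node per term that accepts client writes
)
```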
Node states
A term sounds like something only the leader would need to care about, but for fault tolerance any node in the cluster may become a candidate and take part in the leader election after the leader fails, so every node needs to know the current term.
A term is a strictly increasing number, and because Raft is a strong leader model, there is at most one leader per term, and the cluster can serve requests only while a leader exists. As the figure below shows, each term starts with a leader election (the dark blue interval), followed by the period in which the cluster can serve requests (the light blue interval). A new election is only started after the current leader fails, so terms are not of fixed length, and an election can also fail (as in term 3), which means that term has no leader and the next round of leader election begins immediately.
A log entry consists of an index, a term, and a command: the index is a strictly increasing number, the term records during which term the entry was written, and the command describes the operation to perform. In the figure below, the red box marks the entry at log index 4, written during term 2, whose command sets x to 2.
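As a rough sketch of that structure (the field names are illustrative, not taken from a specific implementation), a log entry could be modeled like this in Go:

```go
// LogEntry is one record in the replicated log.
type LogEntry struct {
	Index   uint64 // strictly increasing position in the log
	Term    uint64 // the term in which the leader appended this entry
	Command []byte // the client operation, e.g. "set x = 2"
}
```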
Log entries
The usual name for this component is simply the state machine, but here we call it the data state machine to distinguish it from the node states above. Raft records the commands sent by clients in the log, but writing an entry into the log is only a record; it does not mean the state of the data has actually changed. The section on log replication explains the conditions for going from appending a log entry to changing the data state; for now, just remember that appending a log entry is not the same thing as changing the state of the data.
The previous article mentioned that CP systems usually use two-phase commit (2PC), and this is exactly why Raft separates the log from the data state machine: writing to the log is the first phase, and changing the data state machine is the second phase. For example, if the condition for changing the data state is satisfied every time a new entry is appended, the state of the data will follow the commands in the log.
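A minimal sketch of that separation, reusing the LogEntry type above with a caller-supplied apply function (both names are assumptions of this sketch): a node may hold many appended entries, but only the ones up to the committed index are executed against the data state machine.

```go
// applyCommitted advances the data state machine, but only up to commitIndex;
// entries beyond it have finished "phase one" (written to the log) and are
// not yet reflected in the data state.
func applyCommitted(log []LogEntry, lastApplied, commitIndex uint64, apply func(LogEntry)) uint64 {
	if commitIndex > uint64(len(log)) {
		commitIndex = uint64(len(log)) // never apply entries we do not have
	}
	for lastApplied < commitIndex {
		lastApplied++
		apply(log[lastApplied-1]) // raft log indexes start at 1, slices at 0
	}
	return lastApplied
}
```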
Relationship between the log and the data state machine
We mentioned above that a node has three states: follower, candidate, and leader. Let's walk through Raft's leader election mechanism by looking at each state and the events it may encounter.
Every node starts out as a follower, and a follower keeps a countdown timer for the leader's heartbeat messages. Depending on the timer, there are two possibilities:
Continue as a follower: if, before the countdown reaches zero, the node receives a heartbeat message from the leader or a vote request from a candidate (RequestVote RPC), it resets the timer and remains a follower.
Become a candidate: if the countdown reaches zero without any leader heartbeat or candidate message, the follower concludes that the cluster has no leader, starts an election, and becomes a candidate. When a node goes from follower to candidate it increments its term by one and votes for itself (node A in the figure below). As mentioned above, there is at most one leader per term, so incrementing the term when starting an election means the node considers the previous term over and moves on to the next one. (A simplified follower loop is sketched after the figure below.)
A follower that receives no heartbeat becomes a candidate and sends out vote requests
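Here is that follower behavior as a simplified Go loop (assuming the NodeState type from the earlier sketch, the standard time package, and channels whose names are made up for this example): reset the timer on a heartbeat or vote request, or become a candidate when the timer fires.

```go
// runFollower is a simplified follower loop: every heartbeat or vote request
// resets the election timer; if the timer fires first, the node becomes a
// candidate.
func runFollower(heartbeat <-chan struct{}, voteReq <-chan struct{}, timeout time.Duration) NodeState {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-heartbeat: // AppendEntries from the current leader
			timer.Reset(timeout)
		case <-voteReq: // RequestVote from a candidate
			timer.Reset(timeout)
		case <-timer.C: // no leader heard from: start an election
			return Candidate
		}
	}
}
```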
As soon as a node becomes a candidate, it sends a RequestVote RPC to every other node and keeps an election timeout timer; one of three things can then happen (a simplified candidate loop is sketched after the figure below):
Become the leader: as soon as the candidate obtains more than half of the votes, it changes its state to leader (node A in the figure below) and starts sending heartbeat messages to the other nodes.
Fall back to follower: if during the election the candidate discovers that there is already a leader for the same term, or a leader with a higher term, it changes its state back to follower.
Election timeout: if the countdown reaches zero and the candidate has neither become the leader nor received a heartbeat from another leader, it concludes that the election has failed and starts the next one: it increments its term by one to mark the start of a new term, resets its vote count to 1 (its own vote), and sends vote requests to the other nodes again.
When a candidate receives a majority of the votes, it becomes the leader and starts sending heartbeat messages
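And a matching candidate loop sketch (again with made-up channel names), covering the three outcomes just listed: win with a majority, step down when a valid leader appears, or time out and retry.

```go
// runCandidate is a simplified candidate loop: the caller has already
// incremented the term, and the candidate starts with its own vote.
func runCandidate(votes <-chan bool, leaderSeen <-chan struct{}, clusterSize int, timeout time.Duration) NodeState {
	granted := 1 // the candidate always votes for itself
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case ok := <-votes: // a RequestVote reply arrived
			if ok {
				granted++
			}
			if granted > clusterSize/2 { // more than half: win the election
				return Leader
			}
		case <-leaderSeen: // a valid leader (same or higher term) appeared
			return Follower
		case <-timer.C: // split vote: give up and let the caller retry with term+1
			return Candidate
		}
	}
}
```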
While a node is the leader, it keeps sending heartbeat messages to the other nodes to stop them from starting elections. A network partition can still leave the cluster with two leaders, but once the partition heals, one of the leaders will discover that the other has a higher term; the leader with the lower term steps back down to follower, and the cluster returns to having a single leader.
The leader keeps sending heartbeats to prevent other nodes from starting elections
The above covers the situations a node may encounter in each of the three states. Next is how a node decides whether to vote when it receives a vote request (RequestVote RPC):
A node with a higher term will not vote for a candidate with a lower term, and a node whose log is more up to date will not vote for a candidate whose log is less up to date; only the node with the most complete log can become the leader.
If a candidate satisfies the previous rule, the node votes for the first such candidate that asks.
Each node casts at most one vote per term.
(A sketch of these rules as a vote handler follows below.)
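These rules can be sketched as a vote handler in Go. The struct and function names here are my own; note that Raft's actual "more up to date" check compares the last log term first and uses the last log index only as a tie-breaker, which is what this sketch does.

```go
// VoteRequest is a trimmed-down RequestVote RPC for this sketch.
type VoteRequest struct {
	Term         uint64 // candidate's term
	LastLogIndex uint64 // index of the candidate's last log entry
	LastLogTerm  uint64 // term of the candidate's last log entry
	CandidateID  string
}

// shouldGrantVote applies the rules above; votedFor is "" if this node has
// not voted in the current term yet.
func shouldGrantVote(req VoteRequest, myTerm, myLastLogIndex, myLastLogTerm uint64, votedFor string) bool {
	if req.Term < myTerm {
		return false // rule 1: a higher term never votes for a lower term
	}
	if req.LastLogTerm < myLastLogTerm ||
		(req.LastLogTerm == myLastLogTerm && req.LastLogIndex < myLastLogIndex) {
		return false // rule 1: our log is more complete than the candidate's
	}
	if votedFor != "" && votedFor != req.CandidateID {
		return false // rule 3: at most one vote per term
	}
	return true // rule 2: the first qualified candidate to ask gets the vote
}
```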
As mentioned above, there is a condition for going from the log to the data state machine, and this section explains how Raft replicates the log and advances the data state machine. Because Raft is a strong leader model, only the leader accepts write requests from clients: after receiving a request, the leader writes the client's command into its log and sends log replication requests (AppendEntries RPC) to the other nodes; as soon as more than half of the nodes respond successfully, the leader executes the command, updates its data state machine, and replies to the client.
That is the ideal case. In practice, the logs of individual followers can fall out of sync for all kinds of reasons (as in the figure below). AppendEntries RPC is designed so that a follower is not allowed to copy the newest entries directly while skipping entries it has not replicated yet. In the figure below, when the leader sends the replication request for index 8 to the first follower, that follower's log only goes up to index 5, so it rejects the request; the leader then sends the request for index 7, which is rejected again, and so on until the leader sends the request for index 6, which the follower accepts, and then it resynchronizes from index 6 up to index 8. A sketch of this consistency check follows below.
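A sketch of the follower-side check (trimmed-down types, reusing the LogEntry sketch from earlier): each replication request carries the index and term of the entry that should immediately precede the new ones, and the follower rejects the request if its own log does not contain a matching entry, which is what forces the leader to walk backwards until the logs line up.

```go
// AppendRequest is a trimmed-down log replication request for this sketch.
type AppendRequest struct {
	Term         uint64     // leader's term
	PrevLogIndex uint64     // index of the entry the new ones should follow
	PrevLogTerm  uint64     // term of that entry
	Entries      []LogEntry // the entries to replicate (empty for a heartbeat)
}

// acceptEntries returns false when the follower is missing (or disagrees on)
// the entry at PrevLogIndex, e.g. a follower whose log stops at index 5 keeps
// rejecting until the leader backs up to the request starting right after 5.
func acceptEntries(log []LogEntry, req AppendRequest) bool {
	if req.PrevLogIndex == 0 {
		return true // replicating from the very beginning of the log
	}
	if uint64(len(log)) < req.PrevLogIndex {
		return false // the follower has not replicated that far yet
	}
	return log[req.PrevLogIndex-1].Term == req.PrevLogTerm
}
```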
For a Raft cluster to serve requests, a majority of its nodes must therefore have up-to-date logs, because a node with an incomplete log cannot successfully acknowledge the request for the newest log entry.
A follower that has not fully replicated the leader's log
Now that we have covered how Raft works, let's look back at a few of Raft's design details: random election timeouts, the two-phase commit optimization, and partition fault tolerance.
In the figure of a follower becoming a candidate, you can see that each node's countdown is different, so some nodes become candidates sooner while others are still counting down. This design keeps nodes from starting elections at the same time and splitting the vote, which would make the election fail. Remember that in each term the cluster can only serve requests once there is a leader, so Raft tries hard to make elections succeed; to keep nodes from starting elections too frequently, the Raft paper recommends setting the election timeout to somewhere between 150 and 300 ms.
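A minimal way to express this in Go (using math/rand and time; the exact range is just the paper's suggestion, not a hard requirement):

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a fresh timeout in the 150-300 ms range; since
// every node draws a different value, they rarely become candidates at the
// same moment, which keeps the votes from being split.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}
```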
The previous article mentioned that in two-phase commit the client should only get a reply after the cluster finishes the commit phase, but in Raft only the leader replies to the client after it finishes the execution phase, which saves half of the message propagation; this is one of Raft's optimizations. You may be wondering how the followers know when they can move on to the execution phase. Remember the heartbeat messages mentioned earlier? A heartbeat is in fact also a log replication message (AppendEntries RPC), and every such message carries a leaderCommit field, which is the newest log index the leader has already moved into the execution phase. So when the leader broadcasts heartbeats, it not only tells the followers not to start elections, it also drives log replication and keeps the data state machines in sync, all while reducing the number of messages.
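As a sketch (the function name is mine), this is roughly what a follower does with the leaderCommit value carried by every heartbeat: it advances its own commit point to the smaller of leaderCommit and the last index it actually holds, and the apply step sketched earlier can then execute those entries.

```go
// advanceCommit returns the follower's new commit index after seeing the
// leaderCommit value in a heartbeat: never move backwards, and never commit
// past the last entry the follower actually holds.
func advanceCommit(commitIndex, leaderCommit, lastLogIndex uint64) uint64 {
	if leaderCommit <= commitIndex {
		return commitIndex
	}
	if leaderCommit < lastLogIndex {
		return leaderCommit
	}
	return lastLogIndex
}
```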
After the previous article you may be curious how Raft handles partition tolerance. Take the figure below as an example: after the Raft cluster is partitioned it can indeed end up with two leaders, i.e. a split brain. But as we saw above, log replication only succeeds, and the client only gets a success reply, after a majority of nodes respond, and no matter how the cluster is partitioned, at most one partition can contain more than half of the nodes. So at most one leader can keep serving requests; the other leader can only reply with failures even if it receives client requests.
With a partition, at most one side of the cluster can keep serving requests
The previous article mentioned that two-phase commit keeps the state of the majority of nodes in a cluster consistent and achieves strong consistency. In the Byzantine fault-tolerant algorithm PBFT, two-phase commit is extended to a three-phase commit, with the process adjusted so that malicious nodes cannot prevent the cluster from reaching consensus.
I hope this article has helped you understand how the Raft algorithm works. One reason Raft was adopted so quickly by so many systems after its publication in 2014 is that it is easy to understand; another is that its implementation is cleanly decoupled from any particular system, so new systems can be built on top of it, whereas other consensus mechanisms, such as ZooKeeper's ZAB, were designed for ZooKeeper from the start, which makes it harder for other systems to build on top of ZAB.
There are many tutorials on Raft and implementations in all kinds of languages. If you want to dig deeper, I recommend reading the paper directly; if that feels too hard, you can also watch the talk in which the Raft author explains Raft at the University of Illinois. Finally, if you want to study source code, I suggest the hashicorp implementation, since that version is used in major systems such as Consul, IPFS, and InfluxDB.