1. Why Raft?
- Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members.
- Paxos (one of the Consensus Algorithm)has dominated the discussion of consensus algorithms over the last decade. But it is hard to understand.
- Raft define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos.
2. Main features in Raft
- Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
- Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
- Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
3. Replicated state machines
Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems.
đź’ˇ Replicated state machines are typically implemented using a replicated log
Each server stores a log containing a series of commands, which its state machine executes in order
Each log contains the same commands in the same order, so each state machine processes the same sequence of commands
State machines are deterministic, each computes the same state and the same sequence of outputs
- Each server stores a log containing a series of commands, which its state machine executes in order
- Each log contains the same commands in the same order, so each state machine processes the same sequence of commands
- State machines are deterministic, each computes the same state and the same sequence of outputs
Keeping the replicated log consistent is the job of the consensus algorithm.
- The consensus module on a server receives commands from clients and adds them to its log.
- It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail.
- Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients.
Consensus algorithms for practical systems typically have the following properties:
- They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
- They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
- They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
- In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
4. Weakness of Paxos
Over the last ten years, Leslie Lamport’s Paxos protocol has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point.
đź’ˇ Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos)
- Too difficult to understand.
- does not provide a good foundation for building practical implementations.
Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is timeconsuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem.
5. Idea within Raft
- Problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently.
- Simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible
- There are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.
6. The Raft consensus algorithm
First electing a distinguished leader
- leader has complete responsibility for managing the replicated log.
- leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
- leader can fail or become disconnected from the other servers, in which case a new leader is elected
Given the leader approch, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:
- Leader election: a new leader must be chosen when an existing leader failss
- Log replication: the leader must accept log entries from clients and replicate them across the cluster.
- Safety: the key safety property for Raft is the State Machine Safety Property. if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.
term
Each server stores a current term number, which increases monotonically over time.
- Current terms are exchanged whenever servers communicate.
- If one server’s current term is smaller than the other’s, then it updates its current term to the larger value.
- If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state.
- If a server receives a request with a stale term number, it rejects the request.
RPCs
- RequestVote RPCs are initiated by candidates during elections.
- AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat.
- A third RPC for transferring snapshots between servers
- Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
7. Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a
follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
Begin an election
follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster.
A candidate continues in this state until one of three things happens:
**a.** it wins the election
A candidate wins an election if it receives votes from a majority.
**b.** another server establishes itself as leader
Each server will vote for at most one candidate in a given term, on a first-come-first-served basis
a candidate may receive AppendEntries RPC from another server claiming to be leader. If the leader’s term is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.
**c.** a period of time goes by with no winner
candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.
Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly
Why not use ranking?
a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader.