Leader Election + Log Replication + Safety
Paxos, But Readable
Raft was published in 2014 by Diego Ongaro and John Ousterhout with one goal: understandability. Paxos (Lamport 1989) had been the textbook consensus algorithm for two decades, but engineers found it baffling. Most production implementations were ad-hoc Paxos variants with subtle bugs. Raft solves the same problem (replicated state machine under crash failures) with the same correctness — but decomposes it into three independent mechanisms, each of which fits in a few pages.
The decomposition: Leader Election + Log Replication + Safety. Memorise that triple.
The Failure Model
Raft is fail-stop only. Servers can crash; messages can be delayed or lost; the network can do anything. But servers don't lie. (For Byzantine tolerance you need PBFT, HotStuff, or Tendermint — separate protocols.)
The cluster makes progress as long as a majority is up. With 5 servers, 3 must be alive; with 3 servers, 2 must be alive.
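The quorum arithmetic fits in one line; here it is as a Go sketch (the helper name majority is just illustrative):

```go
package raft

// majority is the smallest number of servers that forms a quorum: with 5
// servers it is 3, with 3 servers it is 2. Any two quorums overlap in at
// least one server, which is what the safety argument later relies on.
func majority(clusterSize int) int {
	return clusterSize/2 + 1
}
```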
Server States
Three states only:
- Leader: sends heartbeats; receives client commands; replicates to followers.
- Follower: passively receives + replicates from leader.
- Candidate: transitional state during election.
A server is in exactly one state at any time. Transitions follow strict rules.
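The later sketches in these notes all extend one hypothetical, stripped-down node type. Field names follow the paper's state table (currentTerm, votedFor, log, commitIndex), but the struct itself is illustrative, not etcd's or any other library's API:

```go
package raft

// State is one of the three Raft roles. A server is in exactly one at a time.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// LogEntry is one replicated command, stamped with the term of the leader
// that created it.
type LogEntry struct {
	Term    int
	Command []byte
}

// server is a deliberately minimal node used by the sketches below.
type server struct {
	id          int
	state       State
	currentTerm int        // latest term this server has seen (monotonic)
	votedFor    int        // candidate voted for in currentTerm (-1 = none)
	log         []LogEntry // log[0] is a dummy entry so real indices start at 1
	commitIndex int        // highest log index known to be committed
}
```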
Leader Election — Random Timeouts + Terms
Every follower has a randomised election timeout (typically 150–300 ms). When it expires without a heartbeat from the current leader, the follower:
1. Transitions to Candidate.
2. Increments its term (a monotonically increasing logical epoch).
3. Votes for itself.
4. Sends RequestVote RPCs to all other servers.
A candidate that gets votes from a majority (in that term) becomes Leader and starts heartbeating. Each heartbeat resets followers' timeouts.
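A sketch of the candidate side, extending the hypothetical server above. The randomized timeout range and the RequestVote fields mirror the paper; the function names (becomeCandidate, wonElection) are made up for illustration, and the RPC transport is omitted:

```go
package raft

import (
	"math/rand"
	"time"
)

// RequestVoteArgs is what a candidate sends to every other server.
type RequestVoteArgs struct {
	Term         int // the candidate's new term
	CandidateID  int
	LastLogIndex int // last log position; used by the election restriction
	LastLogTerm  int
}

// electionTimeout picks a fresh randomized timeout in the typical
// 150-300 ms range, so that usually only one follower starts an election.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

// becomeCandidate runs when the timeout fires without a heartbeat: steps 1-3
// of the list above. It returns the RequestVote arguments that step 4 would
// broadcast to all peers.
func (s *server) becomeCandidate() RequestVoteArgs {
	s.state = Candidate
	s.currentTerm++   // new, strictly higher term
	s.votedFor = s.id // vote for self
	last := len(s.log) - 1
	return RequestVoteArgs{
		Term:         s.currentTerm,
		CandidateID:  s.id,
		LastLogIndex: last,
		LastLogTerm:  s.log[last].Term,
	}
}

// wonElection reports whether the collected votes (including the candidate's
// own) form a majority of the whole cluster for this term.
func wonElection(votes, clusterSize int) bool {
	return votes > clusterSize/2
}
```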
Why random timeouts? With fixed timeouts, several followers might time out simultaneously → all become candidates → split votes → no majority → another election → split votes again. Livelock. Random timeouts make one candidate likely to win each round.
Terms are central to safety. Each server stores its current term. If a server sees a higher term in any RPC, it immediately becomes a Follower and updates its term. This is how Raft detects obsolete leaders (e.g., a leader from an older term rejoining after a partition heals — it sees a higher term and steps down).
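The step-down rule is mechanical enough to write out, again on the sketch above (observeTerm is an illustrative name):

```go
package raft

// observeTerm is the universal term rule: any RPC, request or reply, that
// carries a term higher than ours makes us adopt that term and revert to
// Follower. This is how a stale leader rejoining after a partition steps down.
func (s *server) observeTerm(rpcTerm int) {
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm
		s.votedFor = -1 // no vote cast yet in the new term
		s.state = Follower
	}
}
```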
Log Replication — The Happy Path
A client sends a command to the leader.
1. Leader appends the entry to its own log (uncommitted).
2. Leader sends AppendEntries RPCs to all followers in parallel.
3. Once the entry is stored on a majority of servers (the leader's own copy counts), the leader marks it committed, applies it to its state machine, and replies to the client.
4. Followers learn about the commit on the next AppendEntries (via the leader's committed-index field) and apply the entry to their own state machines.
If a follower is unreachable, the leader keeps retrying.
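The leader-side bookkeeping for steps 1 and 3, sketched on the same hypothetical struct. matchIndex (the highest index known to be replicated on each follower) is the paper's name for the tracking state; actually sending the AppendEntries RPCs is omitted:

```go
package raft

// appendFromClient is step 1 of the happy path: the command lands in the
// leader's own log, still uncommitted, and its index is what the leader
// watches for a majority on.
func (s *server) appendFromClient(cmd []byte) int {
	s.log = append(s.log, LogEntry{Term: s.currentTerm, Command: cmd})
	return len(s.log) - 1
}

// replicatedOnMajority is the heart of step 3. matchIndex[i] is the highest
// log index known to be stored on follower i; the leader's own copy counts
// toward the majority as well.
func replicatedOnMajority(index int, matchIndex []int, clusterSize int) bool {
	count := 1 // the leader itself
	for _, m := range matchIndex {
		if m >= index {
			count++
		}
	}
	return count > clusterSize/2
}
```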
AppendEntries Consistency Check
Each AppendEntries includes the index and term of the entry immediately before the new entries. The follower checks: do I have an entry at (prevIndex, prevTerm)?
- Match → accept, append the new entries (overwriting any conflicting later entries).
- Mismatch → reject. Leader decrements nextIndex for this follower and retries — backing up entry by entry until logs agree.
Once a match is found, the leader's entries replace any conflicting suffix of the follower's log. The leader's view always wins.
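The follower side of that check, sketched on the same struct. The argument names mirror the paper's AppendEntries fields; the truncation here is slightly coarser than the paper's (see the comment):

```go
package raft

// AppendEntriesArgs carries the new entries plus the (index, term) of the
// entry that immediately precedes them in the leader's log.
type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int // leader's commitIndex; how followers learn about commits
}

// handleAppendEntries is the follower-side consistency check. Reject stale
// leaders and mismatched logs; on a match, take the leader's entries.
// (Simplification: the paper truncates only from the first actually
// conflicting entry, which matters when RPCs arrive out of order. A real
// follower would also call observeTerm(a.Term) and reset its election timer.)
func (s *server) handleAppendEntries(a AppendEntriesArgs) bool {
	if a.Term < s.currentTerm {
		return false // stale leader
	}
	if a.PrevLogIndex >= len(s.log) || s.log[a.PrevLogIndex].Term != a.PrevLogTerm {
		return false // mismatch: the leader will decrement nextIndex and retry
	}
	// Match: discard any conflicting suffix and append the leader's entries.
	s.log = append(s.log[:a.PrevLogIndex+1], a.Entries...)
	// Advance our commit index, but never past the end of our own log.
	if a.LeaderCommit > s.commitIndex {
		s.commitIndex = a.LeaderCommit
		if last := len(s.log) - 1; last < s.commitIndex {
			s.commitIndex = last
		}
	}
	return true
}
```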
Safety — Election Restriction
Without the election restriction, Raft could lose committed entries. The rule:
A candidate is granted a vote only if its log is at least as up-to-date as the voter's.
"Up-to-date" means: compare last entry's term; if equal, the longer log wins.
Why this matters: any committed entry exists on a majority of servers. Any winning candidate has votes from a majority. The two majorities intersect (any two majorities of the cluster share at least one server), so some server that voted for the new leader also holds the committed entry. Because that server granted its vote, the election restriction says the candidate's log was at least as up-to-date as its own, so the new leader's log contains the entry.
Conclusion: a new leader has every committed entry from every prior term. No commits are lost.
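The comparison itself is short enough to write down. This sketch reuses the RequestVoteArgs and observeTerm pieces from earlier; the handler and helper names are illustrative:

```go
package raft

// logIsUpToDate is the election restriction's comparison: the higher last
// term wins; with equal last terms, the longer log wins.
func logIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

// handleRequestVote shows where the restriction plugs in: at most one vote
// per term, and never for a candidate whose log is less up-to-date than ours.
func (s *server) handleRequestVote(a RequestVoteArgs) bool {
	s.observeTerm(a.Term) // adopt a higher term first (this resets votedFor)
	if a.Term < s.currentTerm {
		return false // candidate from an obsolete term
	}
	if s.votedFor != -1 && s.votedFor != a.CandidateID {
		return false // already voted for someone else in this term
	}
	last := len(s.log) - 1
	if !logIsUpToDate(a.LastLogTerm, a.LastLogIndex, s.log[last].Term, last) {
		return false // election restriction: candidate's log is behind ours
	}
	s.votedFor = a.CandidateID
	return true
}
```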
The Figure 8 Pitfall — Prior-Term Commitment
There's a subtle bug Raft avoids. Consider this scenario:
- Term 2: the leader replicates an entry to some followers, but crashes before the entry reaches a majority.
- Term 3: a new leader that never received that entry appends its own entry at the same index, then crashes before replicating it anywhere.
- Term 4: the original leader recovers, wins the election, and keeps replicating its term-2 entry until it sits on a majority. If it now declared that entry committed just by counting replicas, the term-3 server could still win a later election (its last log term is higher) and overwrite the entry on every server. A "committed" entry would be lost.
Raft's rule: a new leader does NOT directly commit entries from prior terms. It commits a prior-term entry only by committing a NEW entry from its current term on top (which transitively commits the prior ones).
This rule + the election restriction together give Raft its safety property.
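In code, the guard is one extra condition in the leader's commit loop (reusing replicatedOnMajority from the happy-path sketch; the method name is illustrative):

```go
package raft

// maybeAdvanceCommit is the leader's commit rule with the Figure 8 guard:
// commitIndex only advances to an index whose entry carries the leader's
// CURRENT term, even if an older entry already sits on a majority. Committing
// a current-term entry then commits everything before it transitively.
func (s *server) maybeAdvanceCommit(matchIndex []int, clusterSize int) {
	for n := len(s.log) - 1; n > s.commitIndex; n-- {
		if s.log[n].Term != s.currentTerm {
			break // never count replicas for a prior-term entry directly
		}
		if replicatedOnMajority(n, matchIndex, clusterSize) {
			s.commitIndex = n
			return
		}
	}
}
```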
Why Random Election Timeouts Are Critical
Without them, multiple followers time out simultaneously, all become candidates, split votes, no majority, retry, repeat — livelock. Random timeouts ensure one candidate usually starts and wins before others time out. Typically 150–300 ms.
Log Compaction And Membership Changes
Log compaction: old log entries already applied to the state machine can be collapsed into a snapshot (a copy of the current state machine state). New or badly lagging servers receive the snapshot instead of replaying the full log. Keeps log size bounded.
Membership changes: changing the cluster size (e.g., 3 → 5) is tricky — naively you could have two disjoint majorities momentarily. Raft uses joint consensus: a transitional configuration requiring agreement in BOTH old and new configs. Once joint is committed, switch to new config alone.
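Only the shape of a snapshot is sketched here; once a prefix is discarded, real implementations also offset log positions by the snapshot's last included index, a detail the earlier sketches ignore. Joint consensus does not compress into a few lines and is left to the paper:

```go
package raft

// Snapshot is the compacted log prefix: the state machine's complete state as
// of LastIncludedIndex, plus that entry's term so the AppendEntries
// consistency check still works across the compaction boundary. Entries up to
// and including LastIncludedIndex can then be discarded; a follower that has
// fallen behind that boundary is sent the snapshot whole (the paper's
// InstallSnapshot RPC) instead of the discarded entries.
type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	StateMachineState []byte
}
```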
Where Raft Lives In Production
- etcd (Kubernetes' coordination layer).
- Consul (HashiCorp's service discovery).
- CockroachDB (distributed SQL).
- TiKV (the Rust-based KV layer of TiDB).
These are all production-grade replicated state machines built on Raft.
Raft vs Paxos vs 2PC
2PC: distributed transaction commit; blocks on coordinator failure; not a replicated state machine.
Paxos: same consensus problem as Raft; equivalent guarantees and fault tolerance; harder to reason about; production implementations diverge.
Raft: simpler decomposition; same correctness; same fault tolerance; the go-to choice for new replicated state machines.
What You Walk In Carrying
- Replicated state machine problem. Fail-stop assumption.
- Three sub-problems (Election + Replication + Safety). Three states (Follower / Candidate / Leader).
- Terms are monotonic; higher term observed → step down.
- Random election timeouts prevent livelock.
- Log replication via AppendEntries + majority commit.
- AppendEntries consistency check backs up on mismatch; leader's log wins.
- Election restriction = candidate's log ≥ voter's log.
- New leaders don't directly commit prior-term entries (Figure 8).
- Log compaction via snapshots. Joint consensus for membership changes.
- Raft is used in etcd, Consul, CockroachDB, TiKV.