Distributed Systems

CS3.401
Prof. Kishore Kothapalli · Monsoon 2025-26 · 4 credits
Revision Notes/Unit 9 — Raft Consensus/Leader Election + Log Replication + Safety

Leader Election + Log Replication + Safety

Intuition

Raft is a consensus algorithm designed for understandability (Ongaro and Ousterhout, 2014). It offers the same guarantees as Paxos with a simpler decomposition. The cluster keeps a replicated log via three relatively independent mechanisms: random timeouts elect a single leader per term; the leader replicates each new entry to a majority before committing; an 'election restriction' ensures the new leader has every committed entry from prior terms.

Explanation

What problem does Raft solve? Building a replicated state machine: a cluster of servers maintaining identical replicas of a log so they execute the same commands in the same order. Gives clients a single-system image even though it's distributed. System progresses as long as a majority is up.

Failure model. Fail-stop (crash) failures only, not Byzantine. Messages may be lost, reordered, or delayed arbitrarily. Servers have persistent storage, so a crashed server can restart and rejoin with its log intact.

Why Raft instead of Paxos? Paxos is notoriously hard to understand and has gaps between theory and practical implementation. Raft designed for understandability — same correctness, simpler decomposition, less state-space complexity.

Three sub-problems Raft decomposes consensus into. Leader election — pick one server as leader; detect crashes; elect a new one. Log replication — leader accepts client commands, appends to its log, replicates to followers (overwriting any inconsistencies). Safety — keep the log consistent across crashes; only servers with up-to-date logs can become leaders.

Server states. Each server is in exactly one of: Leader, Follower, Candidate.

Leader Election protocol. Every follower has a randomised election timeout. If a follower receives no heartbeat from the leader before the timeout, it transitions to Candidate, increments its term, votes for itself, and sends RequestVote RPCs to all others. Each server grants at most one vote per term. A candidate that receives votes from a majority of servers in that term becomes the Leader and starts sending periodic heartbeats (AppendEntries with no log entries). Receiving a heartbeat resets a follower's election timeout. Random timeouts make split votes unlikely; if no majority is achieved, a new election begins after another random timeout.
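The candidate side of the election can be sketched as follows. This is a minimal illustration, not the course's reference code: the helper names (`election_timeout_ms`, `majority`, `wins_election`) are hypothetical, and the 150–300 ms range is simply the typical value quoted in these notes.

```python
import random

def election_timeout_ms(rng: random.Random, low: float = 150.0, high: float = 300.0) -> float:
    """Draw a fresh randomised election timeout (the range is an assumption)."""
    return rng.uniform(low, high)

def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1

def wins_election(cluster_size: int, votes_from_peers: int) -> bool:
    """A candidate votes for itself, then needs a majority of the whole cluster."""
    return 1 + votes_from_peers >= majority(cluster_size)

# Five servers each draw a timeout; the earliest one to fire starts the election.
rng = random.Random(0)
timeouts = {f"S{i}": election_timeout_ms(rng) for i in range(1, 6)}
first_candidate = min(timeouts, key=timeouts.get)
```

With five servers, the candidate's own vote plus two peer votes is enough, which is why a single early candidate usually wins before anyone else times out.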

Terms — what and why. A term is a logical period that begins with an election; one leader (or none, in case of split vote) per term. Each server stores its current term; terms are compared on every RPC. If a server sees a higher term in any message, it immediately becomes a follower and updates its term. Terms are how Raft detects obsolete leaders (a leader from an older term who rejoins after partition steps down on contact).
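The term rule above can be sketched as a single check run on every incoming RPC. The `Server` class and `observe_term` function are hypothetical names for illustration; a real implementation would also persist the term and vote before replying.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Server:
    current_term: int = 0
    role: str = "follower"            # "follower", "candidate", or "leader"
    voted_for: Optional[str] = None

def observe_term(server: Server, message_term: int) -> bool:
    """Apply the term rule. Returns True if the message is stale and should be rejected."""
    if message_term > server.current_term:
        server.current_term = message_term
        server.role = "follower"      # any role steps down on seeing a higher term
        server.voted_for = None       # new term, so the vote is available again
        return False
    return message_term < server.current_term

# An old leader at term 4 hears from a server that is already at term 5.
old_leader = Server(current_term=4, role="leader", voted_for="S1")
stale = observe_term(old_leader, 5)
```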

Log replication (normal operation). Client sends command to leader. Leader appends entry to its log (uncommitted). Leader sends AppendEntries RPC to all followers in parallel. When a majority of servers (the leader counts as one) have written the entry to their logs, leader marks it committed, applies it to its state machine, and replies to the client. Subsequent AppendEntries RPCs inform followers of the latest commit index; they then apply the entry to their own state machines. If a follower is unreachable, leader keeps retrying until success.

AppendEntries consistency check. Each AppendEntries RPC includes the index and term of the entry immediately preceding the new entries. If a follower's log doesn't match (no entry at that index, or different term), it rejects the RPC. The leader then decrements nextIndex for that follower and retries — backing up entry by entry until logs agree. The leader's log OVERWRITES the follower's conflicting entries.
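A minimal sketch of the follower side of this check, assuming the log is a Python list of `(term, command)` pairs with 1-based Raft indices (entry `i` lives at `log[i - 1]`). The function name and data layout are assumptions for illustration; term handling and commit-index updates are left out.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)

def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower-side consistency check. prev_index 0 means 'start of the log'."""
    if prev_index > len(log):
        return False                              # no entry at prev_index
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False                              # entry exists but its term differs
    for offset, entry in enumerate(entries):
        pos = prev_index + offset                 # 0-based slot for this new entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]                         # conflict: drop it and everything after
        if pos >= len(log):
            log.append(entry)
    return True

# The follower holds a conflicting term-2 entry at index 3; the leader overwrites it.
follower_log = [(1, "a"), (1, "b"), (2, "x")]
accepted = append_entries(follower_log, prev_index=2, prev_term=1,
                          entries=[(3, "c"), (3, "d")])
```

Entries that already match are left alone, so a retransmitted or reordered RPC does not truncate a log that is already correct.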

Safety: how does Raft ensure committed entries are never lost? Election restriction: a candidate is granted a vote only if its log is at least as up-to-date as the voter's (compare last-entry term; if equal, longer log wins). This guarantees the newly elected leader has every entry committed in earlier terms (since any committed entry exists on a majority, and any winning candidate must intersect that majority). Commitment rule: a new leader does NOT immediately commit entries from prior terms; it commits an entry from a previous term only by committing a NEW entry from its current term on top.
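The 'at least as up-to-date' comparison can be sketched as below. The function names and the `(last_term, last_index)` tuple layout are assumptions for illustration, and the sketch assumes the voter has already reconciled terms with the candidate.

```python
from typing import Optional, Tuple

LastEntry = Tuple[int, int]  # (term of last log entry, index of last log entry)

def log_is_up_to_date(candidate: LastEntry, voter: LastEntry) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    cand_term, cand_index = candidate
    voter_term, voter_index = voter
    if cand_term != voter_term:
        return cand_term > voter_term      # later last term wins
    return cand_index >= voter_index       # same last term: longer (or equal) log wins

def grant_vote(voted_for: Optional[str], candidate_id: str,
               candidate: LastEntry, voter: LastEntry) -> bool:
    """One vote per term, and only for a sufficiently up-to-date candidate."""
    voted_elsewhere = voted_for is not None and voted_for != candidate_id
    return (not voted_elsewhere) and log_is_up_to_date(candidate, voter)
```

Note that a shorter log can still win if its last entry has a later term: `log_is_up_to_date((3, 5), (2, 9))` is true.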

Why the prior-term commit caveat? Direct commitment of prior-term entries by majority count is unsafe (Figure 8 in the Raft paper): a prior-term entry could be replicated to a majority, then a different new leader could overwrite it. Safe rule: the new leader commits an entry from a previous term only by committing a new entry from its current term on top (transitively commits the earlier ones).
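The safe commitment rule can be sketched as a leader-side function that advances the commit index only to an entry from the leader's own term. The names (`advance_commit_index`, `log_terms`, `match_index`) are illustrative assumptions; `match_index` here lists the highest replicated index on every server, including the leader.

```python
from typing import List

def advance_commit_index(log_terms: List[int], match_index: List[int],
                         commit_index: int, current_term: int) -> int:
    """log_terms[i - 1] is the term of entry i in the leader's log (1-based indices)."""
    quorum = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):    # try the highest index first
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= quorum and log_terms[n - 1] == current_term:
            return n          # everything up to n is committed, prior terms included
    return commit_index

# Leader at term 4: the term-2 entry at index 2 is on 3 of 5 servers, but that
# alone does not commit it.
before = advance_commit_index([1, 2], [2, 2, 2, 1, 1], commit_index=1, current_term=4)
# After a term-4 entry at index 3 reaches a majority, indices 2 and 3 commit together.
after = advance_commit_index([1, 2, 4], [3, 3, 3, 1, 1], commit_index=1, current_term=4)
```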

Log compaction. Old log entries that have been applied to the state machine can be compacted into a snapshot (the current state). New servers receive the snapshot instead of replaying the full log. Keeps log size bounded and speeds up recovery.
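A toy sketch of compaction, assuming a key-value state machine whose commands are hypothetical `(term, key, value)` 'set' operations and a log that starts at index 1 with no earlier snapshot. The `Snapshot` fields mirror the idea of remembering the index and term of the last entry folded in, so that later consistency checks still have something to compare against.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Entry = Tuple[int, str, int]  # (term, key, value): a hypothetical 'set key = value' command

@dataclass
class Snapshot:
    last_index: int            # index of the last entry folded into the snapshot
    last_term: int             # term of that entry
    state: Dict[str, int]      # state machine contents at that point

def compact(log: List[Entry], last_applied: int) -> Tuple[Snapshot, List[Entry]]:
    """Fold entries 1..last_applied into a snapshot and keep the remaining suffix."""
    state: Dict[str, int] = {}
    for _term, key, value in log[:last_applied]:
        state[key] = value
    last_term = log[last_applied - 1][0] if last_applied > 0 else 0
    return Snapshot(last_applied, last_term, state), log[last_applied:]

log = [(1, "x", 1), (1, "y", 2), (2, "x", 3), (2, "z", 4)]
snapshot, remaining = compact(log, last_applied=3)
```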

Cluster membership changes. Uses a joint consensus transitional configuration: while moving from old to new config, decisions require agreement in BOTH old-config majority AND new-config majority. This prevents two disjoint majorities being possible during transition. Once the joint config is committed, the leader replicates the new config on its own; when that is committed, the old config no longer matters.
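The joint-consensus quorum rule can be sketched as two separate majority checks. The function names and the set-of-server-ids representation are assumptions for illustration.

```python
from typing import Set

def has_majority(acks: Set[str], config: Set[str]) -> bool:
    """True if the acknowledging servers include a majority of this configuration."""
    return len(acks & config) >= len(config) // 2 + 1

def joint_quorum(acks: Set[str], old_config: Set[str], new_config: Set[str]) -> bool:
    """During the transition a decision needs a majority of BOTH configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)

old = {"A", "B", "C"}
new = {"C", "D", "E"}
only_old_side = joint_quorum({"A", "B"}, old, new)         # majority of old only
both_sides = joint_quorum({"A", "B", "C", "D"}, old, new)  # majority of each
```

Here `{"A", "B"}` and `{"D", "E"}` are disjoint majorities of the old and new configurations respectively; requiring both at once is what rules out two independent decisions.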

Raft vs 2PC vs Paxos. 2PC: blocks on coordinator failure; not designed for replicated state. Paxos: solves the same consensus problem; mathematically equivalent but harder to reason about. Raft: practical, understandable, used in production (etcd, Consul, CockroachDB).

Definitions

  • Replicated state machine: A cluster of servers maintaining identical log replicas so they execute the same commands in the same order. Gives a single-system image to clients.
  • Term (Raft): Monotonic logical period beginning with an election. At most one leader per term. Servers update their term whenever they see a higher one.
  • Leader / Follower / Candidate: Server states. Leader sends heartbeats + accepts client commands. Follower passively replicates. Candidate is between (during election).
  • Election timeout: Random interval (typically 150–300 ms) after which a Follower assumes the Leader has crashed and becomes a Candidate. Randomisation makes split votes unlikely.
  • AppendEntries RPC: Leader → Follower: 'append these entries; prev entry was at (index, term).' Doubles as a heartbeat when entry list is empty. Followers reject on mismatch, leader backs up.
  • Election restriction: Safety rule: voter grants vote only if candidate's log is at least as up-to-date as voter's (compare last entry's term; if equal, longer log wins).
  • Commit: An entry from the leader's current term is committed once it is stored on a majority of servers; all earlier entries in the leader's log become committed with it. The leader then applies it to the state machine and replies to the client.
  • Log compaction (snapshot): Compress old applied entries into a single snapshot of the state machine. New servers receive snapshot instead of replaying full log.
  • Joint consensus (membership change): Transitional phase requiring agreement in BOTH old-config majority AND new-config majority. Prevents two disjoint majorities during transition.

Formulas

Derivations

Why majority quorums. Any two majorities of servers intersect (pigeonhole: $|Q_1| + |Q_2| > N$, so $Q_1 \cap Q_2 \neq \emptyset$). So any committed entry exists on at least one server in any future quorum. Election quorum + commit quorum intersect ⇒ some voter for the new leader has the entry.
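The intersection claim can be written out as a short counting argument. The symbols $N$ (cluster size) and $Q_1, Q_2$ (any two majorities) are notation chosen here for illustration.

```latex
\[
  |Q_i| \;\ge\; \left\lfloor \tfrac{N}{2} \right\rfloor + 1 \qquad (i = 1, 2)
\]
\[
  |Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2|
  \;\ge\; 2\left(\left\lfloor \tfrac{N}{2} \right\rfloor + 1\right) - N
  \;\ge\; 1
\]
```

The last step holds because $2\lfloor N/2 \rfloor + 2 - N$ equals 2 when $N$ is even and 1 when $N$ is odd. For $N = 5$, two quorums of size 3 must share at least one server.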

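Why prior-term entries can't be directly committed (Figure 8). In term 2, leader $S_1$ replicates entry $e_2$ at index 2 to only a minority ($S_1$, $S_2$) and crashes. $S_5$ wins term 3 with votes from $S_3$, $S_4$ and itself, appends a different entry $e_3$ at index 2, then crashes before replicating it. $S_1$ recovers, becomes leader again at term 4, and finishes replicating $e_2$, which is now on a majority ($S_1$, $S_2$, $S_3$). If $S_1$ declared $e_2$ committed by majority count alone and then crashed, $S_5$ could still win an election (its last entry has term 3 > 2, so $S_2$, $S_3$, $S_4$ would vote for it) and overwrite $e_2$ with $e_3$: a 'committed' entry would be lost. Solution: only commit prior-term entries transitively by committing a current-term entry on top. Once $S_1$ has a term-4 entry on a majority, $S_5$ can no longer win an election, so $e_2$ is safe.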

Why randomised timeouts avoid livelock. With fixed timeouts, multiple followers time out simultaneously, all become candidates, split the vote, get no majority, and repeat. With random timeouts (typically 150–300 ms), one candidate usually starts and wins before the others time out.

Examples

  • Leader election. 5 servers $S_1, \dots, S_5$, term 4, leader $S_1$. $S_1$ crashes. $S_2$'s timeout fires first (say 200 ms vs $S_3$'s 240 ms). $S_2$ → Candidate, term = 5, votes self, sends RequestVote. $S_3$, $S_4$, $S_5$ each see term 5 > 4 → update term, check log-up-to-date; grant vote if $S_2$'s log ≥ theirs. $S_2$ collects 3 votes (self + 2) ≥ majority (3) → Leader for term 5. Sends heartbeats.
  • AppendEntries consistency repair. Follower $F$ missing some entries. Leader's nextIndex[$F$] = 10. AppendEntries with prevIndex = 9 rejected (no match). Leader decrements nextIndex[$F$] to 9. Retry: prevIndex = 8, also rejected. Keep backing up until match at prevIndex = 5. Now leader sends entries 6+ to $F$; overwrites any conflicting entries.
  • Higher term observed. Leader $L$ at term 4 sends AppendEntries. Follower $F$ replies with term 5 (it just participated in an election). $L$ sees term 5 > 4 → steps down to Follower; updates term to 5. Catches up by receiving heartbeats from the new leader.
  • Commit a prior-term entry. Entry $e_3$ from term 3 replicated on majority but never committed (old leader crashed). New leader (term 5) sees $e_3$ on majority. Cannot commit directly. Instead, replicate a new entry $e_5$ at term 5. Once $e_5$ is committed (current-term majority), $e_3$ is transitively committed.
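The consistency-repair example above can be replayed with a small leader-side sketch. Logs are modelled only by their per-entry terms, and the names (`matches`, `repair`) and the sample term values are assumptions chosen so that the logs agree up to index 5.

```python
from typing import List, Tuple

def matches(follower_terms: List[int], prev_index: int, prev_term: int) -> bool:
    """Follower's check: does it hold an entry at prev_index with term prev_term?"""
    if prev_index == 0:
        return True                                   # empty prefix always matches
    return prev_index <= len(follower_terms) and follower_terms[prev_index - 1] == prev_term

def repair(leader_terms: List[int], follower_terms: List[int],
           next_index: int) -> Tuple[int, List[int], List[int]]:
    """Back up next_index one entry at a time until the follower accepts."""
    rejected: List[int] = []
    while True:
        prev_index = next_index - 1
        prev_term = leader_terms[prev_index - 1] if prev_index > 0 else 0
        if matches(follower_terms, prev_index, prev_term):
            break
        rejected.append(prev_index)
        next_index -= 1
    # The follower keeps the agreed prefix and takes the leader's suffix.
    repaired = follower_terms[:prev_index] + leader_terms[prev_index:]
    return next_index, rejected, repaired

leader = [1, 1, 1, 2, 2, 3, 3, 3, 3]      # 9 entries on the leader
follower = [1, 1, 1, 2, 2, 2, 2]          # agrees through index 5, then diverges
next_index, rejected, repaired = repair(leader, follower, next_index=10)
```

In this run the probes at prevIndex 9, 8, 7 and 6 are rejected, the probe at 5 matches, and the leader then ships entries 6 onward.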

Diagrams

  • Raft state machine: Follower → (timeout) → Candidate → (majority votes) → Leader. Any state + higher term observed → Follower.
  • Term timeline: terms as monotonic logical periods; one leader per term; election failures shown as 'no leader' terms.
  • AppendEntries flow: leader sends prevIndex+term; follower accepts (match) or rejects (no match); leader backs up.
  • Commitment: leader replicates entry to majority + applies + replies client; followers apply on next heartbeat.
  • Figure-8 unsafe scenario: prior-term entry committed by majority count → new leader overwrites — illustrates why prior-term entries need transitive commit.

Edge cases

  • Split vote — multiple candidates split votes; no majority. Random timeouts make repeat split improbable.
  • Recovered old leader: sends AppendEntries with its old term; current followers reject it and reply with their higher term → old leader steps down.
  • Network partition isolating leader — minority side cannot commit (no majority); majority side elects new leader. When partition heals, old leader sees higher term → steps down.
  • Slow follower — leader retries indefinitely; follower can catch up via snapshot if too far behind.
  • All-server crash — Raft cannot make progress; data is durable on stable logs; restart yields a consistent log up to last committed entry.
  • Cluster membership transition — joint consensus phase ensures no two disjoint majorities possible.

Common mistakes

  • Saying 'Raft handles Byzantine failures'. No — Raft is fail-stop only. Byzantine Raft variants exist but are different protocols.
  • Saying 'majority commit means commit at $N/2$ servers'. No: majority = $\lfloor N/2 \rfloor + 1$. For $N = 5$, majority = 3.
  • Forgetting the election restriction. It's the central safety mechanism. State it explicitly: 'candidate's log must be at least as up-to-date as voter's.'
  • Saying 'new leader commits prior-term entries directly'. No — that's the Figure 8 bug. Commit prior-term entries only transitively, by committing a new current-term entry on top.
  • Confusing 'term' with 'index'. Term = logical period of one election. Index = position in log. Each entry has both.

Shortcuts

  • Failure model: fail-stop only (not Byzantine).
  • Three sub-problems: Election + Replication + Safety.
  • Three states: Follower / Candidate / Leader.
  • Election: random timeout → candidate → majority votes → leader.
  • Higher term → step down to Follower.
  • Majority replication → commit.
  • AppendEntries consistency: prevIndex+term must match; else back up.
  • Safety: election restriction = up-to-date log wins.
  • No direct commit of prior-term entries.

Proofs / Algorithms

Election restriction ⇒ committed entries preserved (sketch). Any committed entry $e$ is on a majority of servers. Any newly elected leader has votes from a majority. The two majorities intersect (pigeonhole: two sets each larger than $N/2$ must share a server). At the intersection server, $e$ is in its log. That server granted its vote, so by the election restriction the new leader's log is at least as up-to-date as that voter's, and so the new leader's log includes $e$ (the Raft paper makes this last step rigorous with a proof by contradiction). Hence the new leader knows about every committed entry from prior terms.

Why direct prior-term commit is unsafe. See Figure 8 in the Raft paper. Walk-through: leader $S_1$ at term 2 replicates entry $e_2$ to a minority only, then crashes. New leader $S_5$ (term 3) doesn't have $e_2$ and appends its own entry $e_3$ at the same index, then crashes. $S_1$ recovers, regains leadership at term 4, and replicates $e_2$ until it is on a majority. If $S_1$ commits $e_2$ directly and crashes, $S_5$ can still be elected (its last term 3 beats term 2) and overwrite $e_2$ with $e_3$, so a committed entry is lost. Safe rule: only commit prior-term entries when also committing a current-term entry on top.