Distributed Systems
CS3.401 • Prof. Kishore Kothapalli • Monsoon 2025-26 • 4 credits
Cheatsheet
Ultra-condensed. Revise a chapter in minutes.
Unit 1 — Introduction, Challenges & CAP Theorem
Foundations — Definition, Motivation, Challenges, CAP
One-liners
- Tanenbaum: 'many computers, appear as one'.
- Five features: many + concurrent + independent failure + no global clock + no shared memory.
- CAP: pick 2; P is unavoidable → trade C vs A.
- CP examples: Spanner, HBase. AP: Dynamo, Cassandra, DNS.
- Two-Generals: deterministic consensus over unreliable channels is impossible.
Formulas
Definitions
- Distributed system: appears as one computer to the user.
- Consistency: all replicas show the same recent value.
- Availability: every non-failing node responds.
- Partition tolerance: works through network partitions.
Algorithms
- Two-Generals: send msg → ack → ack-of-ack → ... never terminates with certainty.
Comparisons
- Parallel vs Distributed: Parallel: shared memory + single clock (tightly coupled). Distributed: no shared memory + no global clock (loosely coupled, message passing only).
- CP system vs AP system: CP blocks during partition to keep replicas in sync (Spanner). AP keeps serving stale data and reconciles later (Dynamo).
Keywords
Tanenbaum · CAP · consistency · availability · partition tolerance · Two-Generals · FLP · horizontal scaling · Brewer · PACELC
Unit 2 — Models, Events & Logical Time
Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
One-liners
- Lamport $\to$: same-process / send-recv / transitive. Concurrent = neither precedes.
- Scalar R2: $c_i = \max(c_i, t_m) + d$.
- Vector R2: componentwise max, then $V[i]$++.
- Strong consistency: vector ✓, matrix ✓, scalar ✗.
- Singhal-Kshemkalyani: send changed entries; needs FIFO.
- Matrix clock GC: $\min_k mt[k, i] \ge t$ → discard.
- Cristian's: $T_s + \text{RTT}/2$; Berkeley: average; NTP: stratum tree.
Formulas
- Scalar R1/R2: $c_i := c_i + d;\ c_i := \max(c_i, t_m) + d$
- Vector R1/R2: $V[i]$++; componentwise max then ++
- Comparison: $V_h < V_k \iff V_h \le V_k \land \exists i\ V_h[i] < V_k[i]$
- Matrix GC: $\min_k mt[k, i] \ge t$
Definitions
- Happened-before: same-process / send-recv / transitive.
- Strong consistency: $e_1 \to e_2 \Leftrightarrow C(e_1) < C(e_2)$ (both directions, not just forward).
- Vector clock entry: causal-event count for that process.
- Matrix entry $mt[j, k]$: i's knowledge of j's knowledge of k's clock.
Algorithms
- Scalar Lamport: R1 then R2 + tie-break (t, i).
- Vector clock: R1 ++ own; R2 max-then-++.
- Singhal-Kshemkalyani: send (k, V[k]) for k where LU[k] > LS[j].
- Cristian's: RTT measure → $T_s + \text{RTT}/2$.
- Berkeley: poll slaves → average → send deltas.
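The vector-clock rules R1/R2 and the comparison formula can be sketched in a few lines of Python (helper names are mine, not course pseudocode):

```python
# Vector clocks, sketched. R1: tick own entry on every local/send event.
# R2 (on receive): componentwise max with the message timestamp, then tick own entry.

def local_event(V, i):
    W = V.copy()
    W[i] += 1                                 # R1: tick own component
    return W

def send(V, i):
    W = local_event(V, i)                     # a send is an event: R1 first
    return W, W.copy()                        # new clock + timestamp carried by the message

def receive(V, i, Vm):
    W = [max(a, b) for a, b in zip(V, Vm)]    # R2: componentwise max...
    W[i] += 1                                 # ...then tick own entry
    return W

def lt(Vh, Vk):
    # V_h < V_k iff componentwise <= and strictly < in at least one entry
    return all(a <= b for a, b in zip(Vh, Vk)) and any(a < b for a, b in zip(Vh, Vk))

def concurrent(Vh, Vk):
    return not lt(Vh, Vk) and not lt(Vk, Vh)
```

`concurrent` pairs are exactly the ones a scalar Lamport clock cannot detect — the reason vector clocks are strongly consistent and scalar clocks are not.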
Comparisons
- Scalar clock vs Vector clock: Scalar: 1 int, consistent only (forward). Vector: n ints, strongly consistent. Vector detects concurrent events; scalar can't.
- Vector clock vs Matrix clock: Vector: causal precedence. Matrix: causal + second-order knowledge → enables obsolete-message GC.
- Cristian's vs Berkeley: Cristian's: single server, UTC, $T_s + \text{RTT}/2$. Berkeley: master polls all, computes average, no UTC, internal LAN agreement.
- NTP vs Cristian's: NTP: hierarchical strata; Internet-wide; uses 4 timestamps for offset+delay. Cristian's: 1-hop polling; single server; simple but vulnerable.
Keywords
happened-before · Lamport clock · vector clock · matrix clock · strong consistency · Singhal-Kshemkalyani · FIFO · Cristian's · Berkeley · NTP · stratum · UTC · RTT
Unit 3 — Global Snapshots
Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
One-liners
- Consistent cut: no message arrow future → past.
- C1: send recorded ⇒ msg in channel XOR received.
- C2: send NOT recorded ⇒ msg NOT in channel AND NOT received.
- Chandy-Lamport: FIFO, markers, $O(|E|)$ msgs.
- Lai-Yang: non-FIFO, white/red, heavy history.
- Acharya-Badrinath: causal, $2N$ msgs, SENT/RECD.
Formulas
- Acharya-Badrinath channel: $\{RECD_j[i]+1, \ldots, SENT_i[j]\}$
Definitions
- Snapshot = global state recorded without stopping the system.
- Marker (CL) = control msg separating pre/post on FIFO channel.
- White/red (Lai-Yang) = colour-based pre/post on non-FIFO channels.
- SENT/RECD (Acharya-B) = per-process message counters.
Algorithms
- Chandy-Lamport: initiator records → marker on outgoing; on first marker: record + propagate; on later marker: stop recording.
- Lai-Yang: white turns red on snapshot; red msg forces receiver's snapshot; channel state from history.
- Acharya-Badrinath: initiator broadcasts token; each replies with (LS, SENT, RECD); channel state computed.
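The Chandy-Lamport marker mechanics fit in a toy two-process simulation (a sketch: `Proc`, `start_snapshot`, `on_msg` are hypothetical names; process states are integer "units" so conservation is easy to check):

```python
from collections import deque

MARKER = object()   # control message separating pre- from post-snapshot traffic

class Proc:
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.state, self.recorded = 0, None
        self.chan_rec = {}    # peer -> messages recorded as in-flight
        self.recording = {}   # peer -> still recording that incoming channel?

    def start_snapshot(self, net):
        self.recorded = self.state                 # record local state first...
        for p in self.peers:                       # ...then marker on every outgoing channel
            self.recording[p] = True
            net[(self.pid, p)].append(MARKER)

    def on_msg(self, frm, msg, net):
        if msg is MARKER:
            if self.recorded is None:              # first marker: record + propagate
                self.start_snapshot(net)
            self.recording[frm] = False            # channel frm->self fully recorded
            self.chan_rec.setdefault(frm, [])
        else:
            self.state += msg
            if self.recording.get(frm):            # arrived post-snapshot, pre-marker
                self.chan_rec.setdefault(frm, []).append(msg)

# Two processes, FIFO channels; P0 holds 10 units, sends 5, then initiates.
net = {(0, 1): deque(), (1, 0): deque()}
p0, p1 = Proc(0, [1]), Proc(1, [0])
procs = {0: p0, 1: p1}
p0.state = 10
p0.state -= 5; net[(0, 1)].append(5)   # message in flight ahead of the marker
p0.start_snapshot(net)
while any(net.values()):               # deliver everything in FIFO order
    for (s, d), q in net.items():
        if q:
            procs[d].on_msg(s, q.popleft(), net)
```

The recorded cut conserves the 10 units (5 at P0, 5 at P1, empty channels), i.e. C1 and C2 hold even though nothing was ever frozen.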
Comparisons
- Chandy-Lamport vs Lai-Yang: CL: FIFO required, marker-based, light storage. Lai-Yang: non-FIFO OK, colour-based, heavy history.
- Chandy-Lamport vs Acharya-Badrinath: CL: FIFO, $O(|E|)$ msgs. Acharya-B: causal, $2N$ msgs — much cheaper but stronger channel assumption.
Keywords
snapshot · consistent cut · C1 · C2 · Chandy-Lamport · marker · FIFO · Lai-Yang · Acharya-Badrinath · causal channel
Unit 4 — Causal Order Message Delivery
BSS Algorithm + Causal vs FIFO vs Total Order
One-liners
- Causal: $\text{send}(m_1) \to \text{send}(m_2)$ ⇒ $m_1$ delivered before $m_2$.
- BSS (a): $V_m[j] = V_i[j] + 1$.
- BSS (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$.
- FIFO ⊊ Causal ⊊ Total order.
Formulas
- Cond (a): $V_m[j] = V_i[j] + 1$
- Cond (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$
Definitions
- Causal order delivery: cause-then-effect at every receiver.
- BSS: vector-clock-based causal delivery; buffer if conditions fail.
- Total order: all msgs in same global sequence at every receiver.
Algorithms
- BSS receive: check (a) AND (b); deliver or buffer; re-check buffered msgs after each delivery.
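Conditions (a) and (b) translate directly into a delivery predicate; a sketch (function names are mine):

```python
# BSS delivery test: deliver m from sender j with vector timestamp Vm at i iff
#   (a) Vm[j] == Vi[j] + 1           (next message expected from j)
#   (b) Vm[k] <= Vi[k] for k != j    (i has seen all of m's causal past)

def can_deliver(Vi, Vm, j):
    if Vm[j] != Vi[j] + 1:                                        # condition (a)
        return False
    return all(Vm[k] <= Vi[k] for k in range(len(Vi)) if k != j)  # condition (b)

def deliver(Vi, Vm):
    return [max(a, b) for a, b in zip(Vi, Vm)]   # update receiver's vector

def drain_buffer(Vi, buffered):
    # re-check buffered (sender, timestamp) pairs after each delivery
    progress = True
    while progress:
        progress = False
        for j, Vm in list(buffered):
            if can_deliver(Vi, Vm, j):
                Vi = deliver(Vi, Vm)
                buffered.remove((j, Vm))
                progress = True
    return Vi
```

If $m_2$ (causally after $m_1$) arrives first, condition (b) fails and it sits in the buffer until $m_1$ is delivered.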
Comparisons
- FIFO vs Causal: FIFO orders only same-sender pairs. Causal orders all pairs related by happened-before.
- Causal vs Total: Causal: orders causally-related pairs. Total: orders ALL pairs (including concurrent). Total requires consensus.
Keywords
causal order · BSS · Birman-Schiper-Stephenson · vector clock · FIFO · total order · state-machine replication
Unit 5 — Distributed Mutual Exclusion
Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
One-liners
- Lamport: 3(N-1) msgs, FIFO, L1+L2 entry.
- Ricart-Agrawala: 2(N-1) msgs, no FIFO, deferred REPLY.
- Maekawa: $K = D = \sqrt{N}$. V1 deadlocks; V2 adds FAILED/INQUIRE/YIELD; SD = 2T.
- Suzuki-Kasami: 0 or N msgs; freshness $RN[i] = LN[i]+1$.
- Raymond: O(log N), Holder pointer, root bottleneck.
- Lamport needs FIFO; R-A doesn't.
Formulas
- Lamport: $3(N-1)$
- R-A: $2(N-1)$
- Maekawa: $3\sqrt{N}$ → $5\sqrt{N}$
- S-K: $0$ or $N$
- Raymond: $O(\log N)$
Definitions
- Safety = ≤1 in CS; Liveness = eventual entry; Fairness = timestamp order.
- SD = time from one exit to next entry.
- Throughput = $1/(\text{SD}+E)$, where $E$ = CS execution time.
- Quorum (Maekawa) = set $R_i$ with pairwise intersection.
- Token freshness (S-K) = $RN[i] = LN[i] + 1$.
Algorithms
- Lamport entry: L1 (later msg from every site) ∧ L2 (own at top).
- R-A defer: in CS, OR requesting with smaller ts.
- Maekawa V2: FAILED (already higher replied), INQUIRE (higher arrived; ask), YIELD (relinquish).
- S-K: broadcast REQ; token to $P_i$ iff fresh.
- Raymond: Holder up, token down, requests aggregate.
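Two of the key predicates above, sketched as pure functions (signatures are mine):

```python
def ra_should_defer(in_cs, requesting, my_ts, my_id, req_ts, req_id):
    # Ricart-Agrawala: on REQUEST(req_ts, req_id), site i defers its REPLY iff it
    # is in the CS, or it is requesting with a smaller (timestamp, site-id) pair —
    # the Lamport tie-break makes priority a total order.
    return in_cs or (requesting and (my_ts, my_id) < (req_ts, req_id))

def sk_token_due(RN_i, LN_i):
    # Suzuki-Kasami freshness: site i has an outstanding (non-stale) request
    # iff RN[i] = LN[i] + 1, where LN[i] is the last request the token satisfied.
    return RN_i == LN_i + 1
```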
Comparisons
- Lamport vs Ricart-Agrawala: Lamport: 3(N-1) msgs, FIFO. R-A: 2(N-1) msgs, no FIFO, deferred REPLY replaces RELEASE.
- Maekawa V1 vs Maekawa V2: V1: 3√N msgs, deadlocks. V2: up to 5√N, three new messages (FAILED, INQUIRE, YIELD) break deadlock.
- Suzuki-Kasami vs Raymond: S-K: broadcast REQ, 0 or N msgs, scales worse. Raymond: tree-routed REQ, O(log N) msgs, root bottleneck.
- Non-token vs Token-based: Non-token: each request asks permission; no token to lose; more msgs per CS. Token: single privilege passes around; vulnerable to token loss.
Keywords
Lamport DME · Ricart-Agrawala · Roucairol-Carvalho · Maekawa · quorum · FAILED · INQUIRE · YIELD · Suzuki-Kasami · Raymond · Holder pointer · safety · liveness · fairness · synchronisation delay · throughput
Unit 6 — Distributed Deadlock Detection
Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
One-liners
- Cycle in WFG ⇒ deadlock (Single, AND).
- Knot (SCC, no outgoing) ⇒ deadlock (OR).
- Detection is the dominant strategy in DS.
- CMH probe $(i, j, k)$; $i = k$ on return ⇒ deadlock.
- Mitchell-Merritt probes go OPPOSITE WFG edges.
- Chandy diffusion handles OR via query/reply waves.
Formulas
- CMH msgs: $\le m(s-1)/2$
Definitions
- WFG: $P_i \to P_j$ iff $P_i$ blocks on $P_j$.
- Knot: SCC with no outgoing edges to non-knot.
- Phantom: detected cycle that never existed simultaneously.
- CMH probe = $(i, j, k) = (\text{init, sender, receiver})$.
Algorithms
- CMH: $P_i$ blocked → probe $(i, i, j)$ → forwarded along WFG → if $k=i$ on return, deadlock.
- Mitchell-Merritt: blocked → new label → transmit backward → own label returns ⇒ deadlock.
- Chandy diffusion: query forward + reply wave; all-branch reply ⇒ deadlock.
- Ho-Ramamoorthy: collect twice (2-phase) or use cross-table consistency (1-phase).
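The CMH edge-chasing rule can be sketched over a WFG given as an adjacency dict (a toy sequential walk, not a message-passing implementation):

```python
# AND-model CMH sketch. wfg: site -> sites it waits on. A probe (init, sender,
# receiver) follows WFG edges; if it ever arrives with receiver == init, the
# initiator lies on a cycle -> deadlock.

def cmh_detect(wfg, init):
    seen = set()                                  # (sender, receiver) edges probed
    frontier = [(init, init, j) for j in wfg.get(init, ())]
    while frontier:
        i, j, k = frontier.pop()
        if k == init:                             # probe returned to its initiator
            return True
        if (j, k) in seen:                        # forward along each edge only once
            continue
        seen.add((j, k))
        frontier.extend((i, k, nxt) for nxt in wfg.get(k, ()))
    return False
```

The `seen` set is what bounds the message count; without it, probes could circulate forever on a cycle that excludes the initiator.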
Comparisons
- Single/AND model vs OR model: Single/AND: cycle in WFG ⇒ deadlock. OR: only a KNOT (SCC with no outgoing edges) ⇒ deadlock.
- CMH probe direction vs Mitchell-Merritt direction: CMH probes follow WFG edges. Mitchell-Merritt probes go OPPOSITE — from waiter back toward holder.
- Prevention/Avoidance vs Detection: Prevention/Avoidance need global state — expensive in DS. Detection runs on-demand — dominant strategy.
Keywords
WFG · cycle · knot · Single resource · AND model · OR model · AND-OR · phantom · CMH probe · Mitchell-Merritt · Chandy diffusion · Ho-Ramamoorthy · edge-chasing · diffusion computation
Unit 7 — Consensus & Byzantine Agreement
Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
One-liners
- Crash: $f + 1$ rounds, $(f+1)n^2$ msgs.
- Byzantine: $n \ge 3f + 1$, $\ge f + 1$ rounds.
- FLP: no deterministic consensus in async even with 1 crash.
- OM(m): $n \ge 3f + 1$, exponential msgs, recursive.
- Phase King: $n \ge 4f + 1$, polynomial msgs, $2(f+1)$ rounds.
- N=3, f=1 Byzantine impossible (indistinguishability).
Formulas
- Crash msgs: $(f+1)n^2$
- Byz bounds: $n \ge 3f+1, R \ge f+1$
- Phase King: $n \ge 4f+1$
- PK decision: mult > $n/2 + f$ ⇒ keep majority; else king
Definitions
- Crash failure: silent halt.
- Byzantine failure: arbitrary including lies & conflicting msgs.
- Consensus: each has input, agree on one.
- Interactive consistency: agree on a vector.
- BA: source broadcasts; others agree on source value.
- FLP: no async deterministic consensus.
Algorithms
- Crash: $f+1$ rounds, each min over received.
- OM(m): recursive, each lieutenant runs OM(m-1), majority at end.
- Phase King: $f+1$ phases × 2 rounds; multiplicity threshold > $n/2 + f$.
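A toy lock-step simulation of crash consensus: $f+1$ rounds of flooding value sets, then decide min. The crash schedule is my own device for illustration — a process reaches only some receivers in its crash round, which is exactly what makes the extra round necessary:

```python
def crash_consensus(inputs, f, crash_schedule=None):
    # crash_schedule: pid -> (round it crashes in, receivers it reaches first)
    crash_schedule = crash_schedule or {}
    n = len(inputs)
    known = [{v} for v in inputs]
    crashed = set()
    for r in range(f + 1):
        outgoing = [set(s) for s in known]       # messages carry start-of-round knowledge
        for p in range(n):
            if p in crashed:
                continue
            if p in crash_schedule and crash_schedule[p][0] == r:
                reached = crash_schedule[p][1]   # partial broadcast, then crash
                crashed.add(p)
            else:
                reached = range(n)               # full broadcast
            for q in reached:
                known[q] |= outgoing[p]
    # every correct process decides min over everything it has seen
    return {min(known[p]) for p in range(n) if p not in crashed}
```

With $f=1$, a crash in round 0 can leave the minimum known to only one correct process; round 1 lets that process relay it, so all correct processes return the same decision (a singleton set below).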
Comparisons
- Crash failure vs Byzantine failure: Crash: silent halt; $n > f$ suffices, $f+1$ rounds. Byzantine: arbitrary lies; needs $n \ge 3f+1$ and $\ge f+1$ rounds; deterministic async consensus impossible (FLP).
- OM(m) vs Phase King: OM: $n ≥ 3f+1$, $f+1$ rounds, exponential msgs, complex recursion. PK: $n ≥ 4f+1$, $2(f+1)$ rounds, polynomial msgs, simple two-round phases.
- Consensus vs Byzantine Agreement: Consensus: each has own input. BA: source broadcasts. BA solves consensus + interactive consistency.
Keywords
consensus · Byzantine · crash failure · FLP · OM(m) · Lamport-Shostak-Pease · Phase King · interactive consistency · agreement · validity · termination · indistinguishability
Unit 8 — Distributed Transactions, 2PC & 3PC
ACID + 2PC + 3PC + Blocking & In-Doubt States
One-liners
- ACID = Atomicity + Consistency + Isolation + Durability.
- 2PC = PREPARE + DECIDE; coord drives.
- 2PC blocks when: all <ready> AND coord crashed.
- Recovery by last stable log record: <commit> redo / <abort> undo / <ready> ask / nothing abort.
- 3PC = PREPARE + PRE-COMMIT (K acks) + COMMIT.
- 3PC unused in practice: no-partition assumption unrealistic.
Formulas
Definitions
- Fail-stop: silent halt; no lies.
- <ready T>: participant voted yes (forced stable).
- In-doubt: <ready T> only; must hold all locks.
- <pre-commit T>: replicated decision intent in 3PC.
Algorithms
- 2PC: coord <prepare> → PREPARE → READY/NO → coord <commit/abort> → COMMIT/ABORT → ack.
- 3PC: 2PC + PRE-COMMIT phase with K acks.
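The coordinator's decision rule and the participant's crash-recovery rule, sketched straight from the definitions above:

```python
def decide(votes):
    # PREPARE phase collected one vote per participant: commit iff ALL said READY
    return 'COMMIT' if all(v == 'READY' for v in votes) else 'ABORT'

def recover(last_log_record):
    # crash recovery keyed on the last stable log record for T:
    # <commit> redo, <abort> undo, <ready> in-doubt -> ask, nothing -> abort
    return {'commit': 'redo', 'abort': 'undo',
            'ready': 'ask', None: 'abort'}[last_log_record]
```

The `'ask'` branch is the in-doubt state: the participant must hold all of T's locks until the coordinator or a peer tells it the outcome — the root of 2PC's blocking.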
Comparisons
- 2PC vs 3PC: 2PC blocks when coord crashes after all <ready>. 3PC's PRE-COMMIT replicates decision at K+1 sites → non-blocking under ≤K failures; assumes no partition.
- Blocking scenario (2PC) vs Network partition (2PC): Blocking: coord crashed + all <ready>. Partition: cut-off sites act as if coord crashed → may block, but no incorrect outcome.
Keywords
2PC · 3PC · ACID · fail-stop · PREPARE · READY · COMMIT · ABORT · PRE-COMMIT · in-doubt · blocking · coordinator · participant · stable log
Unit 9 — Raft Consensus
Leader Election + Log Replication + Safety
One-liners
- Raft = Paxos rewritten for understandability.
- Fail-stop only (not Byzantine). Majority must be up.
- Three sub-problems: Election + Replication + Safety.
- Three states: Follower / Candidate / Leader.
- Higher term observed → step down.
- Random election timeouts prevent split votes.
- Election restriction: candidate's log ≥ voter's log.
- No direct commit of prior-term entries.
Formulas
- Majority: $\lceil (N+1)/2 \rceil$
Definitions
- Term: monotonic logical period of one election.
- AppendEntries: leader → follower (entries + prev index/term).
- Commit: replicated to majority in current term.
- Election restriction: up-to-date log requirement for voting.
- Joint consensus: transitional membership-change config.
Algorithms
- Election: timeout → ++term → vote self → RequestVote → majority → Leader.
- Log replication: leader appends → AppendEntries → majority ack → commit → apply.
- AppendEntries consistency: prevIndex+term match → accept; mismatch → back up.
- Safety: prior-term entry committed only via new current-term entry on top.
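The election restriction is just a comparison on (last log term, last log index); a RequestVote sketch (simplified: assumes the follower has already stepped down if the candidate's term is higher, and tracks `voted_for` for the current term only):

```python
def log_up_to_date(cand_last_term, cand_last_idx, my_last_term, my_last_idx):
    # higher last term wins; equal last terms -> longer log wins
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_idx >= my_last_idx

def grant_vote(my_term, voted_for, cand_id, cand_term,
               cand_last_term, cand_last_idx, my_last_term, my_last_idx):
    if cand_term < my_term:
        return False            # stale candidate: reject (it will step down)
    if voted_for not in (None, cand_id):
        return False            # at most one vote per term
    return log_up_to_date(cand_last_term, cand_last_idx,
                          my_last_term, my_last_idx)
```

Because a candidate needs a majority, and committed entries sit on a majority of logs, any elected leader's log contains every committed entry — the safety argument in one line.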
Comparisons
- Raft vs 2PC: Raft: replicated state machine, leader-based, tolerates leader crash via re-election. 2PC: atomic commit across sites; blocks on coordinator crash.
- Raft vs Paxos: Same correctness, same fault tolerance. Raft simpler decomposition, more understandable; Paxos older and more theoretical.
Keywords
replicated state machine · fail-stop · leader election · term · AppendEntries · majority commit · election restriction · Figure 8 · joint consensus · etcd · Consul · CockroachDB
Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)
GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
One-liners
- Cut property ⇒ MWOE ∈ MST.
- Rule A: $L < L'$ → absorb.
- Rule B: $L = L'$ + mutual MWOE → merge level+1.
- Rule C: else WAIT.
- Max level ≤ $\log_2 N$.
- Complexity $O(N \log N + E)$ msgs.
- Test reply: reject (same F) / accept (diff F, $L_q \ge L_p$) / defer (diff F, $L_q < L_p$).
Formulas
- Fragment name after a Rule-B merge: $w(\text{core edge})$
- Max level: $\log_2 N$
- Complexity: $O(N \log N + E)$
Definitions
- Fragment: subtree of some MST.
- MWOE: lightest edge with one endpoint outside.
- Core edge: most recent Rule-B merger edge.
- Test reply: reject / accept / defer.
Algorithms
- GHS round: each fragment finds MWOE; apply A/B/C rules; repeat until single fragment.
- Test/Accept/Reject: probe basic edges in weight order; classify based on level/fragment.
Comparisons
- Prim vs GHS: Prim: single growing fragment, centralised. GHS: many parallel fragments (distributed Kruskal), each grows via MWOE.
- Rule A vs Rule B: A: $L < L'$ → smaller absorbs into bigger, level unchanged. B: $L = L'$ + mutual MWOE → merge into new fragment at level $L+1$.
Keywords
GHS · Gallager-Humblet-Spira · MST · MWOE · fragment · level · core edge · Rule A · Rule B · Rule C · cut property · cycle property · test · accept · reject · defer · initiate · connect · report · changeroot
Unit 11 — Google File System (GFS)
GFS — Architecture, Reads, Writes, Consistency, Recovery
One-liners
- Three components: single master + chunkservers + clients.
- 64 MB chunks, 3× replicas across racks.
- Master metadata in MEMORY; locations NOT logged (heartbeat-rebuilt).
- Lease (60 s) grants primary serialisation; data pipelined + control star.
- Four states: defined / consistent / undefined / inconsistent.
- Atomic record append at-least-once → defined interspersed with inconsistent.
- Snapshots: copy-on-write; chunk copy on first subsequent write.
- Stale replica detection: chunk version number via heartbeat.
- Deleted files retained 3 days hidden before GC.
Formulas
Definitions
- Chunk: 64 MB unit, 64-bit handle, 3× replicas.
- Lease: master → primary for 60 s; primary serialises writes.
- Atomic record append: GFS picks offset; at-least-once.
- Defined: consistent + reflects mutation entirely.
- Pipelined data flow: linear chain along closest replica path.
Algorithms
- Read: client → master (file, idx) → handle+replicas → client cache → nearest replica → data.
- Write: client → master (lease) → push data pipelined → client → primary (write req) → primary serialises + forwards → secondaries apply + ACK → primary → client SUCCESS.
- Snapshot: master duplicates metadata + ++refcount; chunk copy on first subsequent write.
- Stale GC: chunk version mismatch via heartbeat → master commands chunkserver to delete.
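The client side of the read path starts with pure arithmetic — a sketch with the 64 MB constant from the notes (helper names are mine):

```python
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks

def chunk_coords(byte_offset):
    # translate a file byte offset into (chunk index, offset within chunk);
    # the client sends (filename, chunk index) to the master, never raw offsets
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

def is_stale(replica_version, master_version):
    # the master bumps the chunk version on each new lease grant; a replica
    # with a lower version missed a mutation -> never served, GC'd instead
    return replica_version < master_version
```

Because clients cache the (handle, replica locations) answer, the single master stays off the data path for repeated reads of the same chunk.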
Comparisons
- Defined region vs Consistent-but-undefined region: Defined: same bytes everywhere AND reflects a single writer's mutation. Undefined: same bytes everywhere BUT mingled fragments from concurrent writers.
- Data flow vs Control flow: Data flow: pipelined linearly along closest replica chain (bandwidth optimal). Control flow: star, client → primary → secondaries (small msgs).
- GFS vs HDFS: GFS: 64 MB chunks, atomic record append. HDFS: 128 MB chunks, write-once-read-many (early). HDFS NameNode = GFS master; DataNode = chunkserver.
Keywords
GFS · Google File System · chunk · chunkserver · master · lease · atomic record append · pipelined data flow · consistency state · defined · consistent · undefined · inconsistent · copy-on-write snapshot · stale replica · chunk version · heartbeat · garbage collection · HDFS