
Distributed Systems

CS3.401
Prof. Kishore Kothapalli
Monsoon 2025-26
4 credits

Cheatsheet

Ultra-condensed. Revise a chapter in minutes.

Unit 1 — Introduction, Challenges & CAP Theorem

Foundations — Definition, Motivation, Challenges, CAP
One-liners
  • Tanenbaum: 'many computers, appear as one'.
  • Five features: many + concurrent + independent failure + no global clock + no shared memory.
  • CAP: pick 2; P is unavoidable → trade C vs A.
  • CP examples: Spanner, HBase. AP: Dynamo, Cassandra, DNS.
  • Two-Generals: deterministic consensus over unreliable channels is impossible.
Formulas
Definitions
  • Distributed system: appears as one computer to the user.
  • Consistency: all replicas show the same recent value.
  • Availability: every non-failing node responds.
  • Partition tolerance: works through network partitions.
Algorithms
  • Two-Generals: send msg → ack → ack-of-ack → ... never terminates with certainty.
Comparisons
  • Parallel vs Distributed: Parallel: shared memory + single clock (tightly coupled). Distributed: no shared memory + no global clock (loosely coupled, message passing only).
  • CP system vs AP system: CP blocks during partition to keep replicas in sync (Spanner). AP keeps serving stale data and reconciles later (Dynamo).
Keywords
Tanenbaum, CAP, consistency, availability, partition tolerance, Two-Generals, FLP, horizontal scaling, Brewer, PACELC

Unit 2 — Models, Events & Logical Time

Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
One-liners
  • Lamport $\to$: same-process / send-recv / transitive. Concurrent = neither precedes.
  • Scalar R2: $c_i = \max(c_i, t_m) + d$.
  • Vector R2: componentwise max, then $V[i]$++.
  • Strong consistency: vector ✓, matrix ✓, scalar ✗.
  • Singhal-Kshemkalyani: send changed entries; needs FIFO.
  • Matrix clock GC: $\min_k mt[k, i] \ge t$ → discard.
  • Cristian's: $T_s + \text{RTT}/2$; Berkeley: average; NTP: stratum tree.
Formulas
  • Scalar R1/R2: $c_i := c_i + d;\ c_i := \max(c_i, t_m) + d$
  • Vector R1/R2: $V[i]$++; componentwise max then ++
  • Comparison: $V_h < V_k \iff V_h \le V_k \land \exists i\ V_h[i] < V_k[i]$
  • Matrix GC: $\min_k mt[k, i] \ge t$
Definitions
  • Happened-before: same-process / send-recv / transitive.
  • Strong consistency: $e_i \to e_j \Leftrightarrow C(e_i) < C(e_j)$ (the implication holds in both directions).
  • Vector clock entry: causal-event count for that process.
  • Matrix entry $mt[j, k]$: i's knowledge of j's knowledge of k's clock.
Algorithms
  • Scalar Lamport: R1 then R2 + tie-break (t, i).
  • Vector clock: R1 ++ own; R2 max-then-++.
  • Singhal-Kshemkalyani: send (k, V[k]) for k where LU[k] > LS[j].
  • Cristian's: RTT measure → $T_s + \text{RTT}/2$.
  • Berkeley: poll slaves → average → send deltas.
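A minimal Python sketch of the scalar/vector clock rules above, assuming $d = 1$; the class and function names are illustrative, not from the course notes:

```python
# Vector-clock sketch implementing R1/R2 from above (d = 1).
class VectorClock:
    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.v = [0] * n

    def tick(self):                         # R1: increment own entry before an internal/send event
        self.v[self.pid] += 1
        return list(self.v)                 # timestamp piggybacked on the outgoing message

    def merge(self, t_msg):                 # R2: componentwise max with the message timestamp, then R1
        self.v = [max(a, b) for a, b in zip(self.v, t_msg)]
        self.tick()

def happened_before(vh, vk):                # V_h < V_k  iff  V_h <= V_k componentwise and V_h != V_k
    return all(a <= b for a, b in zip(vh, vk)) and vh != vk

def concurrent(vh, vk):
    return not happened_before(vh, vk) and not happened_before(vk, vh)

# Tiny usage: P0 sends to P1; send happened-before receive.
p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
t = p0.tick()                               # send event at P0
p1.merge(t)                                 # receive event at P1
print(happened_before(t, p1.v))             # True
```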
Comparisons
  • Scalar clock vs Vector clock: Scalar: 1 int, consistent only (forward). Vector: n ints, strongly consistent. Vector detects concurrent events; scalar can't.
  • Vector clock vs Matrix clock: Vector: causal precedence. Matrix: causal + second-order knowledge → enables obsolete-message GC.
  • Cristian's vs Berkeley: Cristian's: single server, UTC, $T_s + \text{RTT}/2$. Berkeley: master polls all, computes average, no UTC, internal LAN agreement.
  • NTP vs Cristian's: NTP: hierarchical strata; Internet-wide; uses 4 timestamps for offset+delay. Cristian's: 1-hop polling; single server; simple but vulnerable.
Keywords
happened-before, Lamport clock, vector clock, matrix clock, strong consistency, Singhal-Kshemkalyani, FIFO, Cristian's, Berkeley, NTP, stratum, UTC, RTT

Unit 3 — Global Snapshots

Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
One-liners
  • Consistent cut: no message arrow future → past.
  • C1: send recorded ⇒ msg in channel XOR received.
  • C2: send NOT recorded ⇒ msg NOT in channel AND NOT received.
  • Chandy-Lamport: FIFO, markers, $O(|E|)$ msgs.
  • Lai-Yang: non-FIFO, white/red, heavy history.
  • Acharya-Badrinath: causal, $2N$ msgs, SENT/RECD.
Formulas
  • Acharya-Badrinath channel: $\{RECD_j[i]+1, \ldots, SENT_i[j]\}$
Definitions
  • Snapshot = global state recorded without stopping the system.
  • Marker (CL) = control msg separating pre/post on FIFO channel.
  • White/red (Lai-Yang) = colour-based pre/post on non-FIFO channels.
  • SENT/RECD (Acharya-B) = per-process message counters.
Algorithms
  • Chandy-Lamport: initiator records → marker on outgoing; on first marker: record + propagate; on later marker: stop recording.
  • Lai-Yang: white turns red on snapshot; red msg forces receiver's snapshot; channel state from history.
  • Acharya-Badrinath: initiator broadcasts token; each replies with (LS, SENT, RECD); channel state computed.
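A toy Python sketch of the Chandy-Lamport marker rules above, with FIFO channels simulated as in-process lists; `Process`, `start_recording`, and the channel layout are illustrative:

```python
MARKER = "MARKER"

class Process:
    """Toy process; out_channels maps dest pid -> a shared FIFO list."""
    def __init__(self, pid, state, out_channels, in_pids):
        self.pid, self.state = pid, state
        self.out = out_channels
        self.in_pids = in_pids
        self.recorded_state = None          # local state at snapshot time
        self.chan_state = {}                # incoming pid -> messages recorded in transit
        self.done = set()                   # incoming channels whose marker already arrived

    def start_recording(self):
        self.recorded_state = self.state
        self.chan_state = {p: [] for p in self.in_pids}
        for ch in self.out.values():
            ch.append(MARKER)               # marker on every outgoing channel

    def on_receive(self, from_pid, msg):
        if msg == MARKER:
            if self.recorded_state is None: # first marker: record state and propagate
                self.start_recording()
            self.done.add(from_pid)         # channel state for from_pid is now final
        elif self.recorded_state is not None and from_pid not in self.done:
            self.chan_state[from_pid].append(msg)   # pre-marker message -> channel state
        else:
            self.state = msg                # ordinary computation step (toy)

# Usage: P0 initiates while "m" from P1 is still in flight on the FIFO channel 1 -> 0.
ch_01, ch_10 = [], []
p0 = Process(0, "s0", {1: ch_01}, in_pids=[1])
p1 = Process(1, "s1", {0: ch_10}, in_pids=[0])
ch_10.append("m")                           # m was sent before the snapshot started
p0.start_recording()                        # P0 records its state, marker goes onto ch_01
p0.on_receive(1, ch_10.pop(0))              # m arrives before P1's marker -> channel state
p1.on_receive(0, ch_01.pop(0))              # marker reaches P1 -> P1 records and propagates
print(p0.recorded_state, p0.chan_state)     # s0 {1: ['m']}
```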
Comparisons
  • Chandy-Lamport vs Lai-Yang: CL: FIFO required, marker-based, light storage. Lai-Yang: non-FIFO OK, colour-based, heavy history.
  • Chandy-Lamport vs Acharya-Badrinath: CL: FIFO, $O(|E|)$ msgs. Acharya-B: causal, $2N$ msgs — much cheaper but stronger channel assumption.
Keywords
snapshot, consistent cut, C1, C2, Chandy-Lamport, marker, FIFO, Lai-Yang, Acharya-Badrinath, causal channel

Unit 4 — Causal Order Message Delivery

BSS Algorithm + Causal vs FIFO vs Total Order
One-liners
  • Causal: $\text{send}(m_1) \to \text{send}(m_2)$ ⇒ $m_1$ delivered before $m_2$.
  • BSS (a): $V_m[j] = V_i[j] + 1$.
  • BSS (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$.
  • FIFO ⊊ Causal ⊊ Total order.
Formulas
  • Cond (a): $V_m[j] = V_i[j] + 1$
  • Cond (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$
Definitions
  • Causal order delivery: cause-then-effect at every receiver.
  • BSS: vector-clock-based causal delivery; buffer if conditions fail.
  • Total order: all msgs in same global sequence at every receiver.
Algorithms
  • BSS receive: check (a) AND (b); deliver or buffer; re-check buffered msgs after each delivery.
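A minimal Python sketch of the BSS delivery check above, assuming each message carries the sender id and its vector timestamp; the names are illustrative:

```python
# BSS causal-delivery sketch: deliver iff conditions (a) and (b) hold, else buffer.
def deliverable(v_recv, v_msg, sender):
    # (a) this is the next message expected from the sender
    # (b) the receiver has already delivered everything the sender had seen from others
    return (v_msg[sender] == v_recv[sender] + 1 and
            all(v_msg[k] <= v_recv[k] for k in range(len(v_recv)) if k != sender))

class CausalReceiver:
    def __init__(self, n):
        self.v = [0] * n                    # delivered-message counts per sender
        self.buffer = []                    # (sender, v_msg, payload) awaiting delivery
        self.delivered = []

    def receive(self, sender, v_msg, payload):
        self.buffer.append((sender, v_msg, payload))
        progress = True
        while progress:                     # re-check buffered msgs after every delivery
            progress = False
            for item in list(self.buffer):
                s, vm, p = item
                if deliverable(self.v, vm, s):
                    self.buffer.remove(item)
                    self.v[s] = vm[s]       # update delivered count for sender s
                    self.delivered.append(p)
                    progress = True

# Usage: the effect m2 arrives before its cause m1; m2 is buffered until m1 is delivered.
r = CausalReceiver(3)
r.receive(1, [1, 1, 0], "m2")               # P1 sent m2 after delivering P0's m1 -> buffered
r.receive(0, [1, 0, 0], "m1")               # cause arrives -> m1 then m2 delivered
print(r.delivered)                          # ['m1', 'm2']
```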
Comparisons
  • FIFO vs Causal: FIFO orders only same-sender pairs. Causal orders all pairs related by happened-before.
  • Causal vs Total: Causal: orders causally-related pairs. Total: orders ALL pairs (including concurrent). Total requires consensus.
Keywords
causal order, BSS, Birman-Schiper-Stephenson, vector clock, FIFO, total order, state-machine replication

Unit 5 — Distributed Mutual Exclusion

Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
One-liners
  • Lamport: 3(N-1) msgs, FIFO, L1+L2 entry.
  • Ricart-Agrawala: 2(N-1) msgs, no FIFO, deferred REPLY.
  • Maekawa: $K = D = \sqrt{N}$. V1 deadlocks; V2 adds FAILED/INQUIRE/YIELD; SD = 2T.
  • Suzuki-Kasami: 0 or N msgs; freshness $RN[i] = LN[i]+1$.
  • Raymond: O(log N), Holder pointer, root bottleneck.
  • Lamport needs FIFO; R-A doesn't.
Formulas
  • Lamport: $3(N-1)$
  • R-A: $2(N-1)$
  • Maekawa: $3\sqrt{N}$ → $5\sqrt{N}$
  • S-K: $0$ or $N$
  • Raymond: $O(\log N)$
Definitions
  • Safety = ≤1 in CS; Liveness = eventual entry; Fairness = timestamp order.
  • SD = time from one exit to next entry.
  • Throughput = $1/(\text{SD}+E)$.
  • Quorum (Maekawa) = set $R_i$ with pairwise intersection.
  • Token freshness (S-K) = $RN[i] = LN[i] + 1$.
Algorithms
  • Lamport entry: L1 (later msg from every site) ∧ L2 (own at top).
  • R-A defer: in CS, OR requesting with smaller ts.
  • Maekawa V2: FAILED (already higher replied), INQUIRE (higher arrived; ask), YIELD (relinquish).
  • S-K: broadcast REQ; token to $P_i$ iff fresh.
  • Raymond: Holder up, token down, requests aggregate.
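A minimal Python sketch of the Ricart-Agrawala deferred-REPLY decision above; `RASite` and the message-passing interface are illustrative, with transport abstracted away:

```python
# Ricart-Agrawala sketch: one site's request / reply / release logic.
class RASite:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.requesting = False
        self.in_cs = False
        self.my_ts = None                   # (timestamp, pid) of own outstanding request
        self.deferred = []                  # sites whose REPLY we are withholding

    def request_cs(self):
        self.clock += 1
        self.requesting = True
        self.my_ts = (self.clock, self.pid)
        return self.my_ts                   # broadcast REQUEST(my_ts) to all other sites

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts[0]) + 1
        # Defer the REPLY if we are in the CS, or requesting with an earlier (smaller) timestamp.
        if self.in_cs or (self.requesting and self.my_ts < (ts[0], sender)):
            self.deferred.append(sender)
            return None                     # reply withheld
        return "REPLY"                      # reply immediately

    def release_cs(self):
        self.in_cs = False
        replies, self.deferred = self.deferred, []
        return replies                      # send the deferred REPLYs now

# Usage: P2 is in the CS when P1's request arrives -> the REPLY is deferred until release.
s2 = RASite(2); s2.in_cs = True
print(s2.on_request((5, 1), sender=1), s2.release_cs())   # None [1]
```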
Comparisons
  • Lamport vs Ricart-Agrawala: Lamport: 3(N-1) msgs, FIFO. R-A: 2(N-1) msgs, no FIFO, deferred REPLY replaces RELEASE.
  • Maekawa V1 vs Maekawa V2: V1: 3√N msgs, deadlocks. V2: up to 5√N, three new messages (FAILED, INQUIRE, YIELD) break deadlock.
  • Suzuki-Kasami vs Raymond: S-K: broadcast REQ, 0 or N msgs, scales worse. Raymond: tree-routed REQ, O(log N) msgs, root bottleneck.
  • Non-token vs Token-based: Non-token: each request asks permission; no token to lose; more msgs per CS. Token: single privilege passes around; vulnerable to token loss.
Keywords
Lamport DME, Ricart-Agrawala, Roucairol-Carvalho, Maekawa, quorum, FAILED, INQUIRE, YIELD, Suzuki-Kasami, Raymond, Holder pointer, safety, liveness, fairness, synchronisation delay, throughput

Unit 6 — Distributed Deadlock Detection

Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
One-liners
  • Cycle in WFG ⇒ deadlock (Single, AND).
  • Knot (SCC, no outgoing) ⇒ deadlock (OR).
  • Detection is the dominant strategy in DS.
  • CMH probe $(i, j, k)$; $i = k$ on return ⇒ deadlock.
  • Mitchell-Merritt probes go OPPOSITE WFG edges.
  • Chandy diffusion handles OR via query/reply waves.
Formulas
  • CMH msgs: $\le m(s-1)/2$
Definitions
  • WFG: $P_i \to P_j$ iff $P_i$ blocks on $P_j$.
  • Knot: SCC with no outgoing edges to non-knot.
  • Phantom: detected cycle that never existed simultaneously.
  • CMH probe = $(i, j, k) = (\text{init, sender, receiver})$.
Algorithms
  • CMH: $P_i$ blocked → probe $(i, i, j)$ → forwarded along WFG → if $k=i$ on return, deadlock.
  • Mitchell-Merritt: blocked → new label → transmit backward → own label returns ⇒ deadlock.
  • Chandy diffusion: query forward + reply wave; all-branch reply ⇒ deadlock.
  • Ho-Ramamoorthy: collect twice (2-phase) or use cross-table consistency (1-phase).
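A Python sketch of the CMH probe forwarding above, simulated centrally over an explicit WFG for brevity; in the real algorithm each blocked process forwards probes itself, and the names here are illustrative:

```python
# Chandy-Misra-Haas edge-chasing sketch for the AND model.
# wfg maps pid -> set of pids it is blocked on.
def cmh_detect(initiator, wfg):
    """Return True iff a probe started by `initiator` comes back to it (deadlock)."""
    seen = set()
    frontier = [(initiator, initiator, j) for j in wfg.get(initiator, ())]  # probes (i, j, k)
    while frontier:
        i, j, k = frontier.pop()
        if k == i:                          # probe returned to the initiator -> cycle -> deadlock
            return True
        if (j, k) in seen:                  # each dependency edge carries the probe at most once
            continue
        seen.add((j, k))
        for nxt in wfg.get(k, ()):          # blocked receiver forwards the probe along the WFG
            frontier.append((i, k, nxt))
    return False

# Usage: P1 -> P2 -> P3 -> P1 is a cycle, so P1 detects deadlock.
print(cmh_detect(1, {1: {2}, 2: {3}, 3: {1}}))   # True
print(cmh_detect(1, {1: {2}, 2: {3}}))           # False
```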
Comparisons
  • Single/AND model vs OR model: Single/AND: cycle in WFG ⇒ deadlock. OR: only a KNOT (SCC with no outgoing edges) ⇒ deadlock.
  • CMH probe direction vs Mitchell-Merritt direction: CMH probes follow WFG edges. Mitchell-Merritt probes go OPPOSITE — from waiter back toward holder.
  • Prevention/Avoidance vs Detection: Prevention/Avoidance need global state — expensive in DS. Detection runs on-demand — dominant strategy.
Keywords
WFG, cycle, knot, Single resource, AND model, OR model, AND-OR, phantom, CMH probe, Mitchell-Merritt, Chandy diffusion, Ho-Ramamoorthy, edge-chasing, diffusion computation

Unit 7 — Consensus & Byzantine Agreement

Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
One-liners
  • Crash: $f + 1$ rounds, $(f+1)n^2$ msgs.
  • Byzantine: $n \ge 3f + 1$, $\ge f + 1$ rounds.
  • FLP: no deterministic consensus in async even with 1 crash.
  • OM(m): $n \ge 3f + 1$, exponential msgs, recursive.
  • Phase King: $n \ge 4f + 1$, polynomial msgs, $2(f+1)$ rounds.
  • N=3, f=1 Byzantine impossible (indistinguishability).
Formulas
  • Crash msgs: $(f+1)n^2$
  • Byz bounds: $n \ge 3f+1, R \ge f+1$
  • Phase King: $n \ge 4f+1$
  • PK decision: mult > $n/2 + f$ ⇒ keep majority; else king
Definitions
  • Crash failure: silent halt.
  • Byzantine failure: arbitrary including lies & conflicting msgs.
  • Consensus: each has input, agree on one.
  • Interactive consistency: agree on a vector.
  • BA: source broadcasts; others agree on source value.
  • FLP: no async deterministic consensus.
Algorithms
  • Crash: $f+1$ rounds, each min over received.
  • OM(m): recursive, each lieutenant runs OM(m-1), majority at end.
  • Phase King: $f+1$ phases × 2 rounds; multiplicity threshold > $n/2 + f$.
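A Python sketch of the crash-consensus algorithm above ($f+1$ rounds of flooding, decide the minimum), with crashes modelled at round boundaries for brevity; all names are illustrative:

```python
# Crash-consensus sketch: f+1 synchronous rounds, each process floods its known values.
def crash_consensus(inputs, f, crashed_after=None):
    """inputs: pid -> proposed value; crashed_after: pid -> last round in which it participates."""
    crashed_after = crashed_after or {}
    known = {p: {v} for p, v in inputs.items()}            # values each process has seen
    for r in range(1, f + 2):                              # rounds 1 .. f+1
        msgs = {p: set(vals) for p, vals in known.items()
                if crashed_after.get(p, f + 2) >= r}       # crashed processes send nothing
        for p in known:
            if crashed_after.get(p, f + 2) >= r:           # only live processes receive
                for vals in msgs.values():
                    known[p] |= vals
    return {p: min(vals) for p, vals in known.items()
            if crashed_after.get(p, f + 2) > f + 1}        # survivors decide min of what they saw

# Usage: 4 processes, f = 1, P3 crashes after round 1; all survivors still agree on 0.
print(crash_consensus({0: 0, 1: 3, 2: 5, 3: 2}, f=1, crashed_after={3: 1}))
```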
Comparisons
  • Crash failure vs Byzantine failure: Crash: silent halt; $n > f$ suffices. Byzantine: arbitrary lies; needs $n \ge 3f+1$ and $\ge f+1$ rounds; no deterministic async solution.
  • OM(m) vs Phase King: OM: $n ≥ 3f+1$, $f+1$ rounds, exponential msgs, complex recursion. PK: $n ≥ 4f+1$, $2(f+1)$ rounds, polynomial msgs, simple two-round phases.
  • Consensus vs Byzantine Agreement: Consensus: each has own input. BA: source broadcasts. BA solves consensus + interactive consistency.
Keywords
consensus, Byzantine, crash failure, FLP, OM(m), Lamport-Shostak-Pease, Phase King, interactive consistency, agreement, validity, termination, indistinguishability

Unit 8 — Distributed Transactions, 2PC & 3PC

ACID + 2PC + 3PC + Blocking & In-Doubt States
One-liners
  • ACID = Atomicity + Consistency + Isolation + Durability.
  • 2PC = PREPARE + DECIDE; coord drives.
  • 2PC blocks when: all <ready> AND coord crashed.
  • <commit> redo / <abort> undo / <ready> ask / nothing abort.
  • 3PC = PREPARE + PRE-COMMIT (K acks) + COMMIT.
  • 3PC unused in practice: no-partition assumption unrealistic.
Formulas
Definitions
  • Fail-stop: silent halt; no lies.
  • <ready T>: participant voted yes (forced stable).
  • In-doubt: <ready T> only; must hold all locks.
  • <pre-commit T>: replicated decision intent in 3PC.
Algorithms
  • 2PC: coord <prepare> → PREPARE → READY/NO → coord <commit/abort> → COMMIT/ABORT → ack.
  • 3PC: 2PC + PRE-COMMIT phase with K acks.
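A minimal Python sketch of the 2PC flow above, coordinator side, with participant votes abstracted as method calls; `two_phase_commit` and `Participant` are illustrative names, and real systems force each log record to stable storage:

```python
# Two-phase commit sketch: PREPARE phase, then decision propagation.
def two_phase_commit(participants, log):
    log.append("<prepare T>")                       # coordinator logs before asking
    votes = [p.prepare() for p in participants]     # phase 1: PREPARE -> READY / NO
    decision = "commit" if all(v == "READY" for v in votes) else "abort"
    log.append(f"<{decision} T>")                   # decision is final once this record is stable
    for p in participants:                          # phase 2: propagate decision, collect acks
        p.finish(decision)
    log.append("<complete T>")
    return decision

class Participant:
    def __init__(self, name, vote="READY"):
        self.name, self.vote, self.log = name, vote, []
    def prepare(self):
        self.log.append("<ready T>" if self.vote == "READY" else "<no T>")  # logged before replying
        return self.vote
    def finish(self, decision):
        self.log.append(f"<{decision} T>")

# Usage: one NO vote forces a global abort.
log = []
print(two_phase_commit([Participant("A"), Participant("B", vote="NO")], log), log)
```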
Comparisons
  • 2PC vs 3PC: 2PC blocks when coord crashes after all <ready>. 3PC's PRE-COMMIT replicates decision at K+1 sites → non-blocking under ≤K failures; assumes no partition.
  • Blocking scenario (2PC) vs Network partition (2PC): Blocking: coord crashed + all <ready>. Partition: cut-off sites act as if coord crashed → may block, but no incorrect outcome.
Keywords
2PC, 3PC, ACID, fail-stop, PREPARE, READY, COMMIT, ABORT, PRE-COMMIT, in-doubt, blocking, coordinator, participant, stable log

Unit 9 — Raft Consensus

Leader Election + Log Replication + Safety
One-liners
  • Raft = Paxos rewritten for understandability.
  • Fail-stop only (not Byzantine). Majority must be up.
  • Three sub-problems: Election + Replication + Safety.
  • Three states: Follower / Candidate / Leader.
  • Higher term observed → step down.
  • Random election timeouts prevent split votes.
  • Election restriction: candidate's log ≥ voter's log.
  • No direct commit of prior-term entries.
Formulas
  • Majority: $\lceil (N+1)/2 \rceil$
Definitions
  • Term: monotonic logical period of one election.
  • AppendEntries: leader → follower (entries + prev index/term).
  • Commit: replicated to majority in current term.
  • Election restriction: up-to-date log requirement for voting.
  • Joint consensus: transitional membership-change config.
Algorithms
  • Election: timeout → ++term → vote self → RequestVote → majority → Leader.
  • Log replication: leader appends → AppendEntries → majority ack → commit → apply.
  • AppendEntries consistency: prevIndex+term match → accept; mismatch → back up.
  • Safety: prior-term entry committed only via new current-term entry on top.
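A Python sketch of the follower-side AppendEntries consistency check above; entries are (term, command) pairs, the function name and return convention are illustrative, and it is slightly simplified in that it always truncates after prevIndex rather than only at the first conflicting entry:

```python
# Raft AppendEntries consistency check (follower side), sketch.
def append_entries(log, prev_index, prev_term, entries, leader_commit, commit_index):
    """Return (success, log, commit_index). prev_index is 1-based; 0 means an empty log prefix."""
    if prev_index > len(log) or (prev_index > 0 and log[prev_index - 1][0] != prev_term):
        return False, log, commit_index             # mismatch: leader backs up and retries
    log = log[:prev_index] + entries                # drop conflicting suffix, append leader's entries
    if leader_commit > commit_index:                # advance commit index, bounded by log length
        commit_index = min(leader_commit, len(log))
    return True, log, commit_index

# Usage: follower has a stale term-2 entry at index 2; the leader overwrites it with a term-3 entry.
follower_log = [(1, "x=1"), (2, "x=2")]
ok, follower_log, ci = append_entries(follower_log, prev_index=1, prev_term=1,
                                      entries=[(3, "x=9")], leader_commit=2, commit_index=1)
print(ok, follower_log, ci)                         # True [(1, 'x=1'), (3, 'x=9')] 2
```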
Comparisons
  • Raft vs 2PC: Raft: replicated state machine, leader-based, tolerates leader crash via re-election. 2PC: atomic commit across sites; blocks on coordinator crash.
  • Raft vs Paxos: Same correctness, same fault tolerance. Raft simpler decomposition, more understandable; Paxos older and more theoretical.
Keywords
replicated state machine, fail-stop, leader election, term, AppendEntries, majority commit, election restriction, Figure 8, joint consensus, etcd, Consul, CockroachDB

Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)

GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
One-liners
  • Cut property ⇒ MWOE ∈ MST.
  • Rule A: $L < L'$ → absorb.
  • Rule B: $L = L'$ + mutual MWOE → merge level+1.
  • Rule C: else WAIT.
  • Max level ≤ $\log_2 N$.
  • Complexity $O(N \log N + E)$ msgs.
  • Test reply: reject (same F) / accept (diff F, $L_q \ge L_p$) / defer (diff F, $L_q < L_p$).
Formulas
  • Fragment name after Rule B merge: $w(\text{core edge})$
  • Max level: $\log_2 N$
  • Complexity: $O(N \log N + E)$
Definitions
  • Fragment: subtree of some MST.
  • MWOE: lightest edge with one endpoint outside.
  • Core edge: most recent Rule-B merger edge.
  • Test reply: reject / accept / defer.
Algorithms
  • GHS round: each fragment finds MWOE; apply A/B/C rules; repeat until single fragment.
  • Test/Accept/Reject: probe basic edges in weight order; classify based on level/fragment.
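A small Python sketch of the Rule A/B/C decision and the test-reply classification above; function names are illustrative, and fragment identity is taken to be the weight of the core edge as in the formulas:

```python
# GHS decision sketches: which rule applies at a connect, and how a test is answered.
def connect_rule(level_p, level_q, mutual_mwoe):
    if level_p < level_q:
        return "A: absorb into the higher-level fragment (level unchanged)"
    if level_p == level_q and mutual_mwoe:
        return "B: merge; new level = L+1, new name = weight of the common MWOE (core edge)"
    return "C: wait until one of the other rules applies"

def test_reply(frag_p, frag_q, level_p, level_q):
    if frag_p == frag_q:
        return "reject"                     # same fragment: the edge is internal
    if level_q >= level_p:
        return "accept"                     # responder's level is high enough to answer
    return "defer"                          # responder's level too low; answer later

# Usage:
print(connect_rule(1, 2, mutual_mwoe=False))                  # Rule A
print(connect_rule(2, 2, mutual_mwoe=True))                   # Rule B
print(test_reply(frag_p=7, frag_q=7, level_p=2, level_q=2))   # reject
```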
Comparisons
  • Prim vs GHS: Prim: single growing fragment, centralised. GHS: many parallel fragments (distributed Kruskal), each grows via MWOE.
  • Rule A vs Rule B: A: $L < L'$ → smaller absorbs into bigger, level unchanged. B: $L = L'$ + mutual MWOE → merge into new fragment at level $L+1$.
Keywords
GHS, Gallager-Humblet-Spira, MST, MWOE, fragment, level, core edge, Rule A, Rule B, Rule C, cut property, cycle property, test, accept, reject, defer, initiate, connect, report, changeroot

Unit 11 — Google File System (GFS)

GFS — Architecture, Reads, Writes, Consistency, Recovery
One-liners
  • Three components: single master + chunkservers + clients.
  • 64 MB chunks, 3× replicas across racks.
  • Master metadata in MEMORY; locations NOT logged (heartbeat-rebuilt).
  • Lease (60 s) grants primary serialisation; data pipelined + control star.
  • Four states: defined / consistent / undefined / inconsistent.
  • Atomic record append at-least-once → defined interspersed with inconsistent.
  • Snapshots: copy-on-write; chunk copy on first subsequent write.
  • Stale replica detection: chunk version number via heartbeat.
  • Deleted files retained 3 days hidden before GC.
Formulas
Definitions
  • Chunk: 64 MB unit, 64-bit handle, 3× replicas.
  • Lease: master → primary for 60 s; primary serialises writes.
  • Atomic record append: GFS picks offset; at-least-once.
  • Defined: consistent + reflects mutation entirely.
  • Pipelined data flow: linear chain along closest replica path.
Algorithms
  • Read: client → master (file, idx) → handle+replicas → client cache → nearest replica → data.
  • Write: client → master (lease) → push data pipelined → client → primary (write req) → primary serialises + forwards → secondaries apply + ACK → primary → client SUCCESS.
  • Snapshot: master duplicates metadata + ++refcount; chunk copy on first subsequent write.
  • Stale GC: chunk version mismatch via heartbeat → master commands chunkserver to delete.
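A toy Python sketch of the stale-replica check in the GC step above, assuming the master compares chunk version numbers reported in heartbeats; the data layout and names (`master_versions`, `heartbeat`) are hypothetical:

```python
# Stale-replica detection sketch: compare reported chunk versions against the master's record.
master_versions = {"chunk-42": 7}           # latest version the master granted with a lease

def heartbeat(chunkserver_report):
    """chunkserver_report: chunk handle -> version held by the chunkserver."""
    to_delete = []
    for handle, version in chunkserver_report.items():
        latest = master_versions.get(handle)
        if latest is None or version < latest:
            to_delete.append(handle)        # stale (missed a mutation while down) or orphaned
        elif version > latest:
            master_versions[handle] = version   # master's record is behind; catch up
    return to_delete                        # master tells the chunkserver to garbage-collect these

print(heartbeat({"chunk-42": 6, "chunk-99": 3}))   # ['chunk-42', 'chunk-99']
```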
Comparisons
  • Defined region vs Consistent-but-undefined region: Defined: same bytes everywhere AND reflects a single writer's mutation. Undefined: same bytes everywhere BUT mingled fragments from concurrent writers.
  • Data flow vs Control flow: Data flow: pipelined linearly along closest replica chain (bandwidth optimal). Control flow: star, client → primary → secondaries (small msgs).
  • GFS vs HDFS: GFS: 64 MB chunks, atomic record append. HDFS: 128 MB chunks, write-once-read-many (early). HDFS NameNode = GFS master; DataNode = chunkserver.
Keywords
GFS, Google File System, chunk, chunkserver, master, lease, atomic record append, pipelined data flow, consistency state, defined, consistent, undefined, inconsistent, copy-on-write snapshot, stale replica, chunk version, heartbeat, garbage collection, HDFS