Distributed Systems
CS3.401 • Prof. Kishore Kothapalli • Monsoon 2025-26 • 4 credits
Cheatsheet
Ultra-condensed. Revise a chapter in minutes.
Unit 1 — Introduction, Challenges & CAP Theorem
Foundations — Definition, Motivation, Challenges, CAP
One-liners
- Tanenbaum: 'many computers, appear as one'.
- Five features: many + concurrent + independent failure + no global clock + no shared memory.
- CAP: pick 2; P is unavoidable → trade C vs A.
- CP examples: Spanner, HBase. AP: Dynamo, Cassandra, DNS.
- Two-Generals: deterministic consensus over unreliable channels is impossible.
Formulas
Definitions
- Distributed system: appears as one computer to the user.
- Consistency: all replicas show the same recent value.
- Availability: every non-failing node responds.
- Partition tolerance: works through network partitions.
Algorithms
- Two-Generals: send msg → ack → ack-of-ack → ... never terminates with certainty.
Comparisons
- Parallel vs Distributed: Parallel: shared memory + single clock (tightly coupled). Distributed: no shared memory + no global clock (loosely coupled, message passing only).
- CP system vs AP system: CP blocks during partition to keep replicas in sync (Spanner). AP keeps serving stale data and reconciles later (Dynamo).
Keywords
Tanenbaum · CAP · consistency · availability · partition tolerance · Two-Generals · FLP · horizontal scaling · Brewer · PACELC
Unit 2 — Models, Events & Logical Time
Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
One-liners
- Lamport $\to$: same-process / send-recv / transitive. Concurrent = neither precedes.
- Scalar R2: $c_i = \max(c_i, t_m) + d$.
- Vector R2: componentwise max, then $V[i]$++.
- Strong consistency: vector ✓, matrix ✓, scalar ✗.
- Singhal-Kshemkalyani: send changed entries; needs FIFO.
- Matrix clock GC: $\min_k mt[k, i] \ge t$ → discard.
- Cristian's: $T_s + \text{RTT}/2$; Berkeley: average; NTP: stratum tree.
Formulas
- Scalar R1/R2: $c_i := c_i + d;\ c_i := \max(c_i, t_m) + d$
- Vector R1/R2: $V[i]$++; componentwise max then ++
- Comparison: $V_h < V_k \iff V_h \le V_k \land \exists i\ V_h[i] < V_k[i]$
- Matrix GC: $\min_k mt[k, i] \ge t$
Definitions
- Happened-before: same-process / send-recv / transitive.
- Strong consistency: $e_1 \to e_2 \Leftrightarrow C(e_1) < C(e_2)$ (both directions, not just forward).
- Vector clock entry: causal-event count for that process.
- Matrix entry $mt[j, k]$: i's knowledge of j's knowledge of k's clock.
Algorithms
- Scalar Lamport: R1 then R2 + tie-break (t, i).
- Vector clock: R1 ++ own; R2 max-then-++.
- Singhal-Kshemkalyani: send (k, V[k]) for k where LU[k] > LS[j].
- Cristian's: RTT measure → $T_s + \text{RTT}/2$.
- Berkeley: poll slaves → average → send deltas.
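The vector-clock rules R1/R2 and the comparison formula can be sketched in a few lines of Python (helper names are mine, not course pseudocode):

```python
# Vector clocks, sketched. R1: tick own entry on every local/send event.
# R2 (on receive): componentwise max with the message timestamp, then tick own entry.

def local_event(V, i):
    W = V.copy()
    W[i] += 1                                 # R1: tick own component
    return W

def send(V, i):
    W = local_event(V, i)                     # a send is an event: R1 first
    return W, W.copy()                        # new clock + timestamp carried by the message

def receive(V, i, Vm):
    W = [max(a, b) for a, b in zip(V, Vm)]    # R2: componentwise max...
    W[i] += 1                                 # ...then tick own entry
    return W

def lt(Vh, Vk):
    # V_h < V_k iff componentwise <= and strictly < in at least one entry
    return all(a <= b for a, b in zip(Vh, Vk)) and any(a < b for a, b in zip(Vh, Vk))

def concurrent(Vh, Vk):
    return not lt(Vh, Vk) and not lt(Vk, Vh)
```

`concurrent` pairs are exactly the ones a scalar Lamport clock cannot detect — the reason vector clocks are strongly consistent and scalar clocks are not.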
Comparisons
- Scalar clock vs Vector clock: Scalar: 1 int, consistent only (forward). Vector: n ints, strongly consistent. Vector detects concurrent events; scalar can't.
- Vector clock vs Matrix clock: Vector: causal precedence. Matrix: causal + second-order knowledge → enables obsolete-message GC.
- Cristian's vs Berkeley: Cristian's: single server, UTC, $T_s + \text{RTT}/2$. Berkeley: master polls all, computes average, no UTC, internal LAN agreement.
- NTP vs Cristian's: NTP: hierarchical strata; Internet-wide; uses 4 timestamps for offset+delay. Cristian's: 1-hop polling; single server; simple but vulnerable.
Keywords
happened-before · Lamport clock · vector clock · matrix clock · strong consistency · Singhal-Kshemkalyani · FIFO · Cristian's · Berkeley · NTP · stratum · UTC · RTT
Unit 3 — Global Snapshots
Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
One-liners
- Consistent cut: no message arrow future → past.
- C1: send recorded ⇒ msg in channel XOR received.
- C2: send NOT recorded ⇒ msg NOT in channel AND NOT received.
- Chandy-Lamport: FIFO, markers, $O(|E|)$ msgs.
- Lai-Yang: non-FIFO, white/red, heavy history.
- Acharya-Badrinath: causal, $2N$ msgs, SENT/RECD.
Formulas
- Acharya-Badrinath channel: $\{RECD_j[i]+1, \ldots, SENT_i[j]\}$
Definitions
- Snapshot = global state recorded without stopping the system.
- Marker (CL) = control msg separating pre/post on FIFO channel.
- White/red (Lai-Yang) = colour-based pre/post on non-FIFO channels.
- SENT/RECD (Acharya-B) = per-process message counters.
Algorithms
- Chandy-Lamport: initiator records → marker on outgoing; on first marker: record + propagate; on later marker: stop recording.
- Lai-Yang: white turns red on snapshot; red msg forces receiver's snapshot; channel state from history.
- Acharya-Badrinath: initiator broadcasts token; each replies with (LS, SENT, RECD); channel state computed.
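The Chandy-Lamport marker mechanics fit in a toy two-process simulation (a sketch: `Proc`, `start_snapshot`, `on_msg` are hypothetical names; process states are integer "units" so conservation is easy to check):

```python
from collections import deque

MARKER = object()   # control message separating pre- from post-snapshot traffic

class Proc:
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.state, self.recorded = 0, None
        self.chan_rec = {}    # peer -> messages recorded as in-flight
        self.recording = {}   # peer -> still recording that incoming channel?

    def start_snapshot(self, net):
        self.recorded = self.state                 # record local state first...
        for p in self.peers:                       # ...then marker on every outgoing channel
            self.recording[p] = True
            net[(self.pid, p)].append(MARKER)

    def on_msg(self, frm, msg, net):
        if msg is MARKER:
            if self.recorded is None:              # first marker: record + propagate
                self.start_snapshot(net)
            self.recording[frm] = False            # channel frm->self fully recorded
            self.chan_rec.setdefault(frm, [])
        else:
            self.state += msg
            if self.recording.get(frm):            # arrived post-snapshot, pre-marker
                self.chan_rec.setdefault(frm, []).append(msg)

# Two processes, FIFO channels; P0 holds 10 units, sends 5, then initiates.
net = {(0, 1): deque(), (1, 0): deque()}
p0, p1 = Proc(0, [1]), Proc(1, [0])
procs = {0: p0, 1: p1}
p0.state = 10
p0.state -= 5; net[(0, 1)].append(5)   # message in flight ahead of the marker
p0.start_snapshot(net)
while any(net.values()):               # deliver everything in FIFO order
    for (s, d), q in net.items():
        if q:
            procs[d].on_msg(s, q.popleft(), net)
```

The recorded cut conserves the 10 units (5 at P0, 5 at P1, empty channels), i.e. C1 and C2 hold even though nothing was ever frozen.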
Comparisons
- Chandy-Lamport vs Lai-Yang: CL: FIFO required, marker-based, light storage. Lai-Yang: non-FIFO OK, colour-based, heavy history.
- Chandy-Lamport vs Acharya-Badrinath: CL: FIFO, $O(|E|)$ msgs. Acharya-B: causal, $2N$ msgs — much cheaper but stronger channel assumption.
Keywords
snapshot · consistent cut · C1 · C2 · Chandy-Lamport · marker · FIFO · Lai-Yang · Acharya-Badrinath · causal channel
Unit 4 — Causal Order Message Delivery
BSS Algorithm + Causal vs FIFO vs Total Order
One-liners
- Causal: $\text{send}(m_1) \to \text{send}(m_2)$ ⇒ $m_1$ delivered before $m_2$.
- BSS (a): $V_m[j] = V_i[j] + 1$.
- BSS (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$.
- FIFO ⊊ Causal ⊊ Total order.
Formulas
- Cond (a): $V_m[j] = V_i[j] + 1$
- Cond (b): $\forall k \ne j:\ V_m[k] \le V_i[k]$
Definitions
- Causal order delivery: cause-then-effect at every receiver.
- BSS: vector-clock-based causal delivery; buffer if conditions fail.
- Total order: all msgs in same global sequence at every receiver.
Algorithms
- BSS receive: check (a) AND (b); deliver or buffer; re-check buffered msgs after each delivery.
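Conditions (a) and (b) translate directly into a delivery predicate; a sketch (function names are mine):

```python
# BSS delivery test: deliver m from sender j with vector timestamp Vm at i iff
#   (a) Vm[j] == Vi[j] + 1           (next message expected from j)
#   (b) Vm[k] <= Vi[k] for k != j    (i has seen all of m's causal past)

def can_deliver(Vi, Vm, j):
    if Vm[j] != Vi[j] + 1:                                        # condition (a)
        return False
    return all(Vm[k] <= Vi[k] for k in range(len(Vi)) if k != j)  # condition (b)

def deliver(Vi, Vm):
    return [max(a, b) for a, b in zip(Vi, Vm)]   # update receiver's vector

def drain_buffer(Vi, buffered):
    # re-check buffered (sender, timestamp) pairs after each delivery
    progress = True
    while progress:
        progress = False
        for j, Vm in list(buffered):
            if can_deliver(Vi, Vm, j):
                Vi = deliver(Vi, Vm)
                buffered.remove((j, Vm))
                progress = True
    return Vi
```

If $m_2$ (causally after $m_1$) arrives first, condition (b) fails and it sits in the buffer until $m_1$ is delivered.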
Comparisons
- FIFO vs Causal: FIFO orders only same-sender pairs. Causal orders all pairs related by happened-before.
- Causal vs Total: Causal: orders causally-related pairs. Total: orders ALL pairs (including concurrent). Total requires consensus.
Keywords
causal order · BSS · Birman-Schiper-Stephenson · vector clock · FIFO · total order · state-machine replication
Unit 5 — Distributed Mutual Exclusion
Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
One-liners
- Lamport: 3(N-1) msgs, FIFO, L1+L2 entry.
- Ricart-Agrawala: 2(N-1) msgs, no FIFO, deferred REPLY.
- Maekawa: $K = D = \sqrt{N}$. V1 deadlocks; V2 adds FAILED/INQUIRE/YIELD; SD = 2T.
- Suzuki-Kasami: 0 or N msgs; freshness $RN[i] = LN[i]+1$.
- Raymond: O(log N), Holder pointer, root bottleneck.
- Lamport needs FIFO; R-A doesn't.
Formulas
- Lamport: $3(N-1)$
- R-A: $2(N-1)$
- Maekawa: $3\sqrt{N}$ → $5\sqrt{N}$
- S-K: $0$ or $N$
- Raymond: $O(\log N)$
Definitions
- Safety = ≤1 in CS; Liveness = eventual entry; Fairness = timestamp order.
- SD = time from one exit to next entry.
- Throughput = $1/(\text{SD}+E)$, where $E$ = CS execution time.
- Quorum (Maekawa) = set $R_i$ with pairwise intersection.
- Token freshness (S-K) = $RN[i] = LN[i] + 1$.
Algorithms
- Lamport entry: L1 (later msg from every site) ∧ L2 (own at top).
- R-A defer: in CS, OR requesting with smaller ts.
- Maekawa V2: FAILED (already higher replied), INQUIRE (higher arrived; ask), YIELD (relinquish).
- S-K: broadcast REQ; token to $P_i$ iff fresh.
- Raymond: Holder up, token down, requests aggregate.
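Two of the key predicates above, sketched as pure functions (signatures are mine):

```python
def ra_should_defer(in_cs, requesting, my_ts, my_id, req_ts, req_id):
    # Ricart-Agrawala: on REQUEST(req_ts, req_id), site i defers its REPLY iff it
    # is in the CS, or it is requesting with a smaller (timestamp, site-id) pair —
    # the Lamport tie-break makes priority a total order.
    return in_cs or (requesting and (my_ts, my_id) < (req_ts, req_id))

def sk_token_due(RN_i, LN_i):
    # Suzuki-Kasami freshness: site i has an outstanding (non-stale) request
    # iff RN[i] = LN[i] + 1, where LN[i] is the last request the token satisfied.
    return RN_i == LN_i + 1
```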
Comparisons
- Lamport vs Ricart-Agrawala: Lamport: 3(N-1) msgs, FIFO. R-A: 2(N-1) msgs, no FIFO, deferred REPLY replaces RELEASE.
- Maekawa V1 vs Maekawa V2: V1: 3√N msgs, deadlocks. V2: up to 5√N, three new messages (FAILED, INQUIRE, YIELD) break deadlock.
- Suzuki-Kasami vs Raymond: S-K: broadcast REQ, 0 or N msgs, scales worse. Raymond: tree-routed REQ, O(log N) msgs, root bottleneck.
- Non-token vs Token-based: Non-token: each request asks permission; no token to lose; more msgs per CS. Token: single privilege passes around; vulnerable to token loss.
Keywords
Lamport DME · Ricart-Agrawala · Roucairol-Carvalho · Maekawa · quorum · FAILED · INQUIRE · YIELD · Suzuki-Kasami · Raymond · Holder pointer · safety · liveness · fairness · synchronisation delay · throughput
Unit 6 — Distributed Deadlock Detection
Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
One-liners
- Cycle in WFG ⇒ deadlock (Single, AND).
- Knot (SCC, no outgoing) ⇒ deadlock (OR).
- Detection is the dominant strategy in DS.
- CMH probe $(i, j, k)$; $i = k$ on return ⇒ deadlock.
- Mitchell-Merritt probes go OPPOSITE WFG edges.
- Chandy diffusion handles OR via query/reply waves.
Formulas
- CMH msgs: $\le m(s-1)/2$
Definitions
- WFG: $P_i \to P_j$ iff $P_i$ blocks on $P_j$.
- Knot: SCC with no outgoing edges to non-knot.
- Phantom: detected cycle that never existed simultaneously.
- CMH probe = $(i, j, k) = (\text{init, sender, receiver})$.
Algorithms
- CMH: $P_i$ blocked → probe $(i, i, j)$ → forwarded along WFG → if $k=i$ on return, deadlock.
- Mitchell-Merritt: blocked → new label → transmit backward → own label returns ⇒ deadlock.
- Chandy diffusion: query forward + reply wave; all-branch reply ⇒ deadlock.
- Ho-Ramamoorthy: collect twice (2-phase) or use cross-table consistency (1-phase).
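The CMH edge-chasing rule can be sketched over a WFG given as an adjacency dict (a toy sequential walk, not a message-passing implementation):

```python
# AND-model CMH sketch. wfg: site -> sites it waits on. A probe (init, sender,
# receiver) follows WFG edges; if it ever arrives with receiver == init, the
# initiator lies on a cycle -> deadlock.

def cmh_detect(wfg, init):
    seen = set()                                  # (sender, receiver) edges probed
    frontier = [(init, init, j) for j in wfg.get(init, ())]
    while frontier:
        i, j, k = frontier.pop()
        if k == init:                             # probe returned to its initiator
            return True
        if (j, k) in seen:                        # forward along each edge only once
            continue
        seen.add((j, k))
        frontier.extend((i, k, nxt) for nxt in wfg.get(k, ()))
    return False
```

The `seen` set is what bounds the message count; without it, probes could circulate forever on a cycle that excludes the initiator.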
Comparisons
- Single/AND model vs OR model: Single/AND: cycle in WFG ⇒ deadlock. OR: only a KNOT (SCC with no outgoing edges) ⇒ deadlock.
- CMH probe direction vs Mitchell-Merritt direction: CMH probes follow WFG edges. Mitchell-Merritt probes go OPPOSITE — from waiter back toward holder.
- Prevention/Avoidance vs Detection: Prevention/Avoidance need global state — expensive in DS. Detection runs on-demand — dominant strategy.
Keywords
WFG · cycle · knot · Single resource · AND model · OR model · AND-OR · phantom · CMH probe · Mitchell-Merritt · Chandy diffusion · Ho-Ramamoorthy · edge-chasing · diffusion computation
Unit 7 — Consensus & Byzantine Agreement
Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
One-liners
- Crash: $f + 1$ rounds, $(f+1)n^2$ msgs.
- Byzantine: $n \ge 3f + 1$, $\ge f + 1$ rounds.
- FLP: no deterministic consensus in async even with 1 crash.
- OM(m): $n \ge 3f + 1$, exponential msgs, recursive.
- Phase King: $n \ge 4f + 1$, polynomial msgs, $2(f+1)$ rounds.
- N=3, f=1 Byzantine impossible (indistinguishability).
Formulas
- Crash msgs: $(f+1)n^2$
- Byz bounds: $n \ge 3f+1, R \ge f+1$
- Phase King: $n \ge 4f+1$
- PK decision: mult > $n/2 + f$ ⇒ keep majority; else king
Definitions
- Crash failure: silent halt.
- Byzantine failure: arbitrary including lies & conflicting msgs.
- Consensus: each has input, agree on one.
- Interactive consistency: agree on a vector.
- BA: source broadcasts; others agree on source value.
- FLP: no async deterministic consensus.
Algorithms
- Crash: $f+1$ rounds, each min over received.
- OM(m): recursive, each lieutenant runs OM(m-1), majority at end.
- Phase King: $f+1$ phases × 2 rounds; multiplicity threshold > $n/2 + f$.
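A toy lock-step simulation of crash consensus: $f+1$ rounds of flooding value sets, then decide min. The crash schedule is my own device for illustration — a process reaches only some receivers in its crash round, which is exactly what makes the extra round necessary:

```python
def crash_consensus(inputs, f, crash_schedule=None):
    # crash_schedule: pid -> (round it crashes in, receivers it reaches first)
    crash_schedule = crash_schedule or {}
    n = len(inputs)
    known = [{v} for v in inputs]
    crashed = set()
    for r in range(f + 1):
        outgoing = [set(s) for s in known]       # messages carry start-of-round knowledge
        for p in range(n):
            if p in crashed:
                continue
            if p in crash_schedule and crash_schedule[p][0] == r:
                reached = crash_schedule[p][1]   # partial broadcast, then crash
                crashed.add(p)
            else:
                reached = range(n)               # full broadcast
            for q in reached:
                known[q] |= outgoing[p]
    # every correct process decides min over everything it has seen
    return {min(known[p]) for p in range(n) if p not in crashed}
```

With $f=1$, a crash in round 0 can leave the minimum known to only one correct process; round 1 lets that process relay it, so all correct processes return the same decision (a singleton set below).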
Comparisons
- Crash failure vs Byzantine failure: Crash: silent halt; $n > f$ suffices, $f+1$ rounds. Byzantine: arbitrary lies; needs $n \ge 3f+1$ and $\ge f+1$ rounds; deterministic async consensus impossible (FLP).
- OM(m) vs Phase King: OM: $n ≥ 3f+1$, $f+1$ rounds, exponential msgs, complex recursion. PK: $n ≥ 4f+1$, $2(f+1)$ rounds, polynomial msgs, simple two-round phases.
- Consensus vs Byzantine Agreement: Consensus: each has own input. BA: source broadcasts. BA solves consensus + interactive consistency.
Keywords
consensus · Byzantine · crash failure · FLP · OM(m) · Lamport-Shostak-Pease · Phase King · interactive consistency · agreement · validity · termination · indistinguishability
Unit 8 — Distributed Transactions, 2PC & 3PC
ACID + 2PC + 3PC + Blocking & In-Doubt States
One-liners
- ACID = Atomicity + Consistency + Isolation + Durability.
- 2PC = PREPARE + DECIDE; coord drives.
- 2PC blocks when: all <ready> AND coord crashed.
- Recovery by last stable log record: <commit> redo / <abort> undo / <ready> ask / nothing abort.
- 3PC = PREPARE + PRE-COMMIT (K acks) + COMMIT.
- 3PC unused in practice: no-partition assumption unrealistic.
Formulas
Definitions
- Fail-stop: silent halt; no lies.
- <ready T>: participant voted yes (forced stable).
- In-doubt: <ready T> only; must hold all locks.
- <pre-commit T>: replicated decision intent in 3PC.
Algorithms
- 2PC: coord <prepare> → PREPARE → READY/NO → coord <commit/abort> → COMMIT/ABORT → ack.
- 3PC: 2PC + PRE-COMMIT phase with K acks.
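The coordinator's decision rule and the participant's crash-recovery rule, sketched straight from the definitions above:

```python
def decide(votes):
    # PREPARE phase collected one vote per participant: commit iff ALL said READY
    return 'COMMIT' if all(v == 'READY' for v in votes) else 'ABORT'

def recover(last_log_record):
    # crash recovery keyed on the last stable log record for T:
    # <commit> redo, <abort> undo, <ready> in-doubt -> ask, nothing -> abort
    return {'commit': 'redo', 'abort': 'undo',
            'ready': 'ask', None: 'abort'}[last_log_record]
```

The `'ask'` branch is the in-doubt state: the participant must hold all of T's locks until the coordinator or a peer tells it the outcome — the root of 2PC's blocking.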
Comparisons
- 2PC vs 3PC: 2PC blocks when coord crashes after all <ready>. 3PC's PRE-COMMIT replicates decision at K+1 sites → non-blocking under ≤K failures; assumes no partition.
- Blocking scenario (2PC) vs Network partition (2PC): Blocking: coord crashed + all <ready>. Partition: cut-off sites act as if coord crashed → may block, but no incorrect outcome.
Keywords
2PC · 3PC · ACID · fail-stop · PREPARE · READY · COMMIT · ABORT · PRE-COMMIT · in-doubt · blocking · coordinator · participant · stable log
Unit 9 — Raft Consensus
Leader Election + Log Replication + Safety
One-liners
- Raft = Paxos rewritten for understandability.
- Fail-stop only (not Byzantine). Majority must be up.
- Three sub-problems: Election + Replication + Safety.
- Three states: Follower / Candidate / Leader.
- Higher term observed → step down.
- Random election timeouts prevent split votes.
- Election restriction: candidate's log ≥ voter's log.
- No direct commit of prior-term entries.
Formulas
- Majority: $\lceil (N+1)/2 \rceil$
Definitions
- Term: monotonic logical period of one election.
- AppendEntries: leader → follower (entries + prev index/term).
- Commit: replicated to majority in current term.
- Election restriction: up-to-date log requirement for voting.
- Joint consensus: transitional membership-change config.
Algorithms
- Election: timeout → ++term → vote self → RequestVote → majority → Leader.
- Log replication: leader appends → AppendEntries → majority ack → commit → apply.
- AppendEntries consistency: prevIndex+term match → accept; mismatch → back up.
- Safety: prior-term entry committed only via new current-term entry on top.
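The election restriction is just a comparison on (last log term, last log index); a RequestVote sketch (simplified: assumes the follower has already stepped down if the candidate's term is higher, and tracks `voted_for` for the current term only):

```python
def log_up_to_date(cand_last_term, cand_last_idx, my_last_term, my_last_idx):
    # higher last term wins; equal last terms -> longer log wins
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_idx >= my_last_idx

def grant_vote(my_term, voted_for, cand_id, cand_term,
               cand_last_term, cand_last_idx, my_last_term, my_last_idx):
    if cand_term < my_term:
        return False            # stale candidate: reject (it will step down)
    if voted_for not in (None, cand_id):
        return False            # at most one vote per term
    return log_up_to_date(cand_last_term, cand_last_idx,
                          my_last_term, my_last_idx)
```

Because a candidate needs a majority, and committed entries sit on a majority of logs, any elected leader's log contains every committed entry — the safety argument in one line.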
Comparisons
- Raft vs 2PC: Raft: replicated state machine, leader-based, tolerates leader crash via re-election. 2PC: atomic commit across sites; blocks on coordinator crash.
- Raft vs Paxos: Same correctness, same fault tolerance. Raft simpler decomposition, more understandable; Paxos older and more theoretical.
Keywords
replicated state machine · fail-stop · leader election · term · AppendEntries · majority commit · election restriction · Figure 8 · joint consensus · etcd · Consul · CockroachDB
Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)
GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
One-liners
- Cut property ⇒ MWOE ∈ MST.
- Rule A: $L < L'$ → absorb.
- Rule B: $L = L'$ + mutual MWOE → merge level+1.
- Rule C: else WAIT.
- Max level ≤ $\log_2 N$.
- Complexity $O(N \log N + E)$ msgs.
- Test reply: reject (same F) / accept (diff F, $L_q \ge L_p$) / defer (diff F, $L_q < L_p$).
Formulas
- Fragment name after a Rule-B merge: $w(\text{core edge})$
- Max level: $\log_2 N$
- Complexity: $O(N \log N + E)$
Definitions
- Fragment: subtree of some MST.
- MWOE: lightest edge with one endpoint outside.
- Core edge: most recent Rule-B merger edge.
- Test reply: reject / accept / defer.
Algorithms
- GHS round: each fragment finds MWOE; apply A/B/C rules; repeat until single fragment.
- Test/Accept/Reject: probe basic edges in weight order; classify based on level/fragment.
Comparisons
- Prim vs GHS: Prim: single growing fragment, centralised. GHS: many parallel fragments (distributed Kruskal), each grows via MWOE.
- Rule A vs Rule B: A: $L < L'$ → smaller absorbs into bigger, level unchanged. B: $L = L'$ + mutual MWOE → merge into new fragment at level $L+1$.
Keywords
GHS · Gallager-Humblet-Spira · MST · MWOE · fragment · level · core edge · Rule A · Rule B · Rule C · cut property · cycle property · test · accept · reject · defer · initiate · connect · report · changeroot
Unit 11 — Google File System (GFS)
GFS — Architecture, Reads, Writes, Consistency, Recovery
One-liners
- Three components: single master + chunkservers + clients.
- 64 MB chunks, 3× replicas across racks.
- Master metadata in MEMORY; locations NOT logged (heartbeat-rebuilt).
- Lease (60 s) grants primary serialisation; data pipelined + control star.
- Four states: defined / consistent / undefined / inconsistent.
- Atomic record append at-least-once → defined interspersed with inconsistent.
- Snapshots: copy-on-write; chunk copy on first subsequent write.
- Stale replica detection: chunk version number via heartbeat.
- Deleted files retained 3 days hidden before GC.
Formulas
Definitions
- Chunk: 64 MB unit, 64-bit handle, 3× replicas.
- Lease: master → primary for 60 s; primary serialises writes.
- Atomic record append: GFS picks offset; at-least-once.
- Defined: consistent + reflects mutation entirely.
- Pipelined data flow: linear chain along closest replica path.
Algorithms
- Read: client → master (file, idx) → handle+replicas → client cache → nearest replica → data.
- Write: client → master (lease) → push data pipelined → client → primary (write req) → primary serialises + forwards → secondaries apply + ACK → primary → client SUCCESS.
- Snapshot: master duplicates metadata + ++refcount; chunk copy on first subsequent write.
- Stale GC: chunk version mismatch via heartbeat → master commands chunkserver to delete.
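The client side of the read path starts with pure arithmetic — a sketch with the 64 MB constant from the notes (helper names are mine):

```python
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks

def chunk_coords(byte_offset):
    # translate a file byte offset into (chunk index, offset within chunk);
    # the client sends (filename, chunk index) to the master, never raw offsets
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

def is_stale(replica_version, master_version):
    # the master bumps the chunk version on each new lease grant; a replica
    # with a lower version missed a mutation -> never served, GC'd instead
    return replica_version < master_version
```

Because clients cache the (handle, replica locations) answer, the single master stays off the data path for repeated reads of the same chunk.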
Comparisons
- Defined region vs Consistent-but-undefined region: Defined: same bytes everywhere AND reflects a single writer's mutation. Undefined: same bytes everywhere BUT mingled fragments from concurrent writers.
- Data flow vs Control flow: Data flow: pipelined linearly along closest replica chain (bandwidth optimal). Control flow: star, client → primary → secondaries (small msgs).
- GFS vs HDFS: GFS: 64 MB chunks, atomic record append. HDFS: 128 MB chunks, write-once-read-many (early). HDFS NameNode = GFS master; DataNode = chunkserver.
Keywords
GFS · Google File System · chunk · chunkserver · master · lease · atomic record append · pipelined data flow · consistency state · defined · consistent · undefined · inconsistent · copy-on-write snapshot · stale replica · chunk version · heartbeat · garbage collection · HDFS