Courses/Distributed Systems

Distributed Systems

CS3.401

Prof. Kishore Kothapalli•Monsoon 2025-26•4 credits

Definitions

Every term, every chapter. Toggle between the textbook wording and a plain-English version (when available).

94 terms · 0 have plain-English versions

Unit 1 — Introduction, Challenges & CAP Theorem

Foundations — Definition, Motivation, Challenges, CAP

Distributed system (Tanenbaum): A group of computers working together to appear as a single computer to the end user.
Consistency (CAP): Every node returns the same most-recent successful write — all nodes have the same view of data at every instant.
Availability (CAP): Every non-failing node responds to every request in reasonable time, with a non-error response.
Partition tolerance: System continues to operate despite arbitrary message loss or network partitions between groups of nodes.
Two-Generals problem: Classical impossibility result showing two parties cannot reach deterministic agreement over an unreliable channel — every ack needs its own ack ad infinitum.
Independent failure: One node crashing does not directly cause others to crash; the system as a whole keeps operating. Foundation of fault tolerance.

Unit 2 — Models, Events & Logical Time

Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync

Happened-before $(\to)$: Lamport's partial order on events: (a) same-process order, (b) send → recv, (c) transitive. Events not related by → are concurrent.
Logical clock consistency: $e \to e^{'} \Rightarrow C (e) < C (e^{'})$ . Required minimum; satisfied by scalar, vector, matrix.
Strong consistency of logical clocks: $e \to e^{'} \Leftrightarrow C (e) < C (e^{'})$ . Both directions. Vector and matrix satisfy this; scalar does not.
Scalar (Lamport) clock: Single integer per process. R1: increment before event. R2: max(local, msg ts) + d on receive. Total-order tie-break: (t, i).
Vector clock: n-dim vector per process. R1: V[i]++. R2: componentwise max, then R1. Compares with V_h < V_k iff componentwise ≤ and ∃ strict.
Matrix clock: n × n matrix per process. mt[i, k] = own knowledge of k's clock; mt[j, k] = i's knowledge of j's knowledge of k's clock. Enables obsolete-message GC.
Singhal-Kshemkalyani optimisation: Vector-clock optimisation: send only entries changed since last send to that recipient. Uses LS[j] (last sent value of V[i]) and LU[j] (last update of V[j]). Requires FIFO.
Cristian's algorithm: Client polls a UTC-aware server. Sets local clock to T_s + RTT/2. Single-master; UTC reference.
Berkeley algorithm: Master polls all slaves, computes average local time, sends each its delta. No UTC; for internal LAN agreement only.
NTP: Hierarchical Internet protocol. Stratum 0 (atomic / GPS), 1 (primary servers), 2+ (secondary). Computes offset+delay from four timestamps. Accuracy: ms WAN, sub-ms LAN.

Unit 3 — Global Snapshots

Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts

Global snapshot: A recorded state of all processes plus the messages in flight on all channels, captured without stopping the system.
Consistent cut: A cut where every recorded receive has its corresponding send also recorded. Equivalently, no message arrow goes future → past.
C1 condition: Send recorded ⇒ msg is either in channel state OR recorded as received (not both, not neither). Conservation of messages.
C2 condition: Send NOT recorded ⇒ msg NOT in channel state AND NOT recorded as received. Cause-before-effect.
Marker (Chandy-Lamport): Special control message that separates pre-snapshot from post-snapshot messages on a FIFO channel. First marker on a channel triggers recording; subsequent markers stop it.
White/Red colouring (Lai-Yang): Process colour state: white = pre-snapshot, red = post-snapshot. A red message forces the receiver to finalise its own snapshot. Used when channels are non-FIFO.
SENT / RECD arrays (Acharya-Badrinath): Per-process counters of messages sent to / received from each other process. Channel state derived as messages numbered between sent and received counts.

Unit 4 — Causal Order Message Delivery

BSS Algorithm + Causal vs FIFO vs Total Order

Causal order delivery: If $send (m_{1}) \to send (m_{2})$ and both target the same destination, $m_{1}$ is delivered before $m_{2}$ . Stronger than FIFO.
FIFO order: Messages from the same sender arrive in the order they were sent. Weaker than causal.
Total order: All processes deliver every message in the same global sequence — concurrent messages also get a consistent order. Required for state-machine replication.
BSS algorithm: Birman-Schiper-Stephenson causal delivery via vector clocks. Two conditions: $V_{m} [j] = V_{i} [j] + 1$ (FIFO from sender) AND $V_{m} [k] \leq V_{i} [k] \forall k \neq = j$ (causally prior msgs arrived).

Unit 5 — Distributed Mutual Exclusion

Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison

DME — safety / liveness / fairness: Safety: ≤ 1 in CS. Liveness: every requester eventually enters. Fairness: served in timestamp arrival order.
Synchronisation delay (SD): Time after one site leaves the CS before the next enters. Determines throughput as $1/ (SD + E)$ .
Lamport DME: Non-token. FIFO required. Per-site request queue. Entry: L1 (later-ts msg from every site) ∧ L2 (own at top). 3(N-1) msgs per CS.
Ricart-Agrawala: Non-token. No FIFO. REQUEST + REPLY only; REPLY deferred when own ts is smaller. 2(N-1) msgs per CS.
Roucairol-Carvalho: Optimisation of R-A: once you have a REPLY from $S_{j}$ , don't re-request from $S_{j}$ unless you've replied to $S_{j}$ in between. 0 to 2(N-1) msgs.
Maekawa quorum: Request set $R_{i}$ with $∣ R_{i} ∣ = K$ , every pair $R_{i} \cap R_{j} \neq = \emptyset$ , optimum $K = N$ . Quorum-based permission.
Maekawa V2 messages: FAILED (replied to higher), INQUIRE (higher arrived; ask 'in CS?'), YIELD (relinquish lock after FAILED). Breaks V1's cyclic deadlock.
Suzuki-Kasami: Token-based with broadcast REQUEST. Token holds Q + LN[]. Each site has RN[]. Token sent only on fresh request ( $R N [i] = L N [i] + 1$ ).
Raymond's tree algorithm: Token-based on a logical tree. Holder pointer at each node points toward root. Token migrates along Holder chain. $O (lo g N)$ in balanced tree.
Token-based vs non-token DME: Token: a single privilege passes around; requires fault-tolerant token. Non-token: each request asks permission from peers; no token to lose, but more messages per CS.

Unit 6 — Distributed Deadlock Detection

Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion

Resource models: Single (one pending request); AND (need all); OR (need any one); AND-OR (combination); P-of-Q (any p of q).
Wait-For Graph (WFG): Directed graph where nodes = processes and edge $P_{i} \to P_{j}$ means $P_{i}$ is blocked waiting on $P_{j}$ .
Knot: SCC where every node's only reachable nodes are within the SCC (no outgoing edges from the SCC). Deadlock criterion in OR model.
Phantom deadlock: Apparent deadlock detected from a non-atomic WFG snapshot, but never actually existed simultaneously.
Chandy-Misra-Haas (CMH) probe: Edge-chasing algorithm for AND-model deadlock. Probe $(i, j, k)$ travels along WFG edges; deadlock when initiator $i$ equals receiver $k$ on a returned probe.
Mitchell-Merritt algorithm: Single-resource model. Public/private labels at each node; probes propagate OPPOSITE to WFG edges. Deadlock when own public label returns.
Chandy diffusion (OR): Diffusion-based detection for OR model. Queries from initiator → dependent set; replies propagate back; deadlock when all branches reply (knot condition).
Ho-Ramamoorthy 2-phase: Centralised. Collect process status twice; build WFG from edges appearing in both snapshots. Reduces (but doesn't eliminate) false deadlocks.

Unit 7 — Consensus & Byzantine Agreement

Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP

Crash failure: Process halts (crashes) and stops sending msgs; never lies. Once crashed, stays crashed.
Byzantine failure: Process behaves arbitrarily — may lie, send conflicting msgs to different peers, stay silent, replay old msgs, etc.
Consensus: Each process has own initial value; all non-faulty agree on a single value; if all start with $v$ , decide $v$ .
Interactive consistency: Agree on a vector $(v_{1}, \dots, v_{n})$ such that if $P_{k}$ is non-faulty with value $v_{k}$ , the $k$ -th slot is $v_{k}$ .
Byzantine agreement (BA): Designated source broadcasts; all non-faulty agree on the source's value (or on a default if source faulty). Solves consensus + interactive consistency.
FLP impossibility (Fischer-Lynch-Paterson, 1985): No deterministic algorithm solves consensus in an asynchronous system, even with just ONE crash failure. Motivates randomised / partially-synchronous algorithms.
Lamport-Shostak-Pease OM(m): Recursive oral-messages algorithm for Byzantine agreement. OM(0): direct. OM(t): each lieutenant becomes general for OM(t-1); majority at end. $f + 1$ rounds; $n \geq 3 f + 1$ ; $O (N^{f + 1})$ messages.
Phase King: Polynomial Byzantine consensus algorithm. $n \geq 4 f + 1$ ; $f + 1$ phases × 2 rounds; adopt majority if multiplicity $> n /2 + f$ else trust king. At least one honest king guarantees convergence.

Unit 8 — Distributed Transactions, 2PC & 3PC

ACID + 2PC + 3PC + Blocking & In-Doubt States

Distributed transaction: Transaction touching data at multiple servers. Coordinator at originating site; participants at others.
ACID: Atomicity (all-or-nothing), Consistency (preserves invariants), Isolation (no partial views), Durability (committed survives failure).
Fail-stop model: Failed sites stop participating; never send incorrect messages; may recover. Used by 2PC and Raft.
<prepare T> / <ready T> / <commit T> / <abort T>: 2PC log records. <prepare>: coord initiated. <ready>: participant voted yes (forced stable). <commit>/<abort>: final decision (forced stable; point of no return).
In-doubt state (2PC): Participant has <ready T> but no decision record; must hold all T's locks until decision known. Blocks indefinitely if coord unreachable.
Blocking problem (2PC): All participants in <ready> state AND coordinator crashed = none can decide unilaterally → all block holding locks.
<pre-commit T> (3PC): Replicated 'intent to commit' record. Sent by coord after collecting READYs; persisted at $K + 1$ sites before final COMMIT. Enables non-blocking recovery.
3PC assumptions: No network partitions + at least 1 site up + at most $K$ failures. Strong → why 3PC is not used in practice.

Unit 9 — Raft Consensus

Leader Election + Log Replication + Safety

Replicated state machine: A cluster of servers maintaining identical log replicas so they execute the same commands in the same order. Gives a single-system image to clients.
Term (Raft): Monotonic logical period beginning with an election. At most one leader per term. Servers update their term whenever they see a higher one.
Leader / Follower / Candidate: Server states. Leader sends heartbeats + accepts client commands. Follower passively replicates. Candidate is between (during election).
Election timeout: Random interval (typically 150–300 ms) after which a Follower assumes the Leader has crashed and becomes a Candidate. Randomisation prevents split votes.
AppendEntries RPC: Leader → Follower: 'append these entries; prev entry was at (index, term).' Doubles as a heartbeat when entry list is empty. Followers reject on mismatch, leader backs up.
Election restriction: Safety rule: voter grants vote only if candidate's log is at least as up-to-date as voter's (compare last entry's term; if equal, longer log wins).
Commit: An entry is committed when it has been replicated to a majority of servers IN THE CURRENT TERM. The leader then applies it to the state machine and replies to the client.
Log compaction (snapshot): Compress old applied entries into a single snapshot of the state machine. New servers receive snapshot instead of replaying full log.
Joint consensus (membership change): Transitional phase requiring agreement in BOTH old-config majority AND new-config majority. Prevents two disjoint majorities during transition.

Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)

GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject

Spanning tree: An acyclic subgraph that connects all $N$ vertices using exactly $N - 1$ edges.
MST: Spanning tree with minimum total edge weight. Unique if edge weights are distinct.
Cut property: For any non-empty proper subset $X$ of vertices, the lightest edge crossing the cut is in the MST. Foundation of GHS.
Cycle property: For any cycle in $G$ , the heaviest edge is NOT in the MST.
Fragment: A subtree of some MST. Has a (name, level) pair. Initial single-node fragments are level 0 with name = node id.
Core edge: The most recently used combining edge of a fragment (Rule B merger). Its weight becomes the fragment's new name.
MWOE (Minimum-Weight Outgoing Edge): Lightest edge of a fragment with exactly one endpoint outside the fragment. Adding it preserves the fragment property.
Rule A (absorption): $L < L^{'}$ → $F$ absorbs into $F^{'}$ , taking $F^{'}$ 's name and level. Smaller fragment rushes into bigger.
Rule B (merger): $L = L^{'}$ AND mutual MWOE → merge into new fragment, level $L + 1$ , name = $w (MWOE)$ .
Rule C (wait): Else — wait until conditions for A or B are met.
GHS message types: connect, initiate, test, accept, reject, report, changeroot.
Test reply rules: Same fragment → reject. Different fragment AND $L_{q} \geq L_{p}$ → accept. Different fragment AND $L_{q} < L_{p}$ → defer.

Unit 11 — Google File System (GFS)

GFS — Architecture, Reads, Writes, Consistency, Recovery

GFS architecture: Single master (metadata + log) + chunkservers (64 MB chunks, 3× replicas across racks) + clients (library). Data never flows through master.
Chunk: Fixed-size 64 MB unit of storage. Each has a 64-bit unique handle. Replicated 3× by default across racks.
Master metadata: File and chunk namespaces (logged); file → chunk handle mapping (logged); chunk replica locations (NOT logged, rebuilt from heartbeats); version number (logged), current primary, lease expiry.
Lease: Master grants to one replica = primary, default 60 s. Primary serialises mutations during lease. Decouples master from per-write activity. Renewable via heartbeat.
Atomic record append: GFS picks the offset; appends record at least once. Avoids distributed locking for many-writer logs. May produce duplicates + padding (defined interspersed with inconsistent).
Consistency state — defined: Consistent (same bytes everywhere) AND reflects the writer's mutation in its entirety (no mingling).
Consistency state — consistent but undefined: All replicas see the same bytes, but the bytes don't reflect any single writer's mutation in full — mingled fragments from concurrent writers.
Consistency state — inconsistent: Replicas see DIFFERENT bytes. Result of a failed mutation or a stale replica.
Pipelined data flow: Bytes flow linearly through replica chain along closest path; control flow separately in star (client → primary → secondaries). Maximises bandwidth utilisation.
Copy-on-write snapshot: Master duplicates metadata + increments chunk refcounts. Chunks copied locally only on first subsequent write — snapshot is O(1) until something changes.
Stale replica detection: Each chunk has a version number; master increments on lease grant; replicas with old version detected via heartbeat and garbage-collected.
Garbage collection (GFS): Deleted files renamed hidden + kept 3 days; periodic scan removes them. Orphaned chunks + stale replicas reclaimed via heartbeat-driven master commands.