Distributed Systems
CS3.401Prof. Kishore Kothapalli•Monsoon 2025-26•4 credits
Definitions
Every term, every chapter. Toggle between the textbook wording and a plain-English version (when available).
94 terms · 0 have plain-English versions
Unit 1 — Introduction, Challenges & CAP Theorem
Foundations — Definition, Motivation, Challenges, CAP
- Distributed system (Tanenbaum)
- A group of computers working together to appear as a single computer to the end user.
- Consistency (CAP)
- Every node returns the same most-recent successful write — all nodes have the same view of data at every instant.
- Availability (CAP)
- Every non-failing node responds to every request in reasonable time, with a non-error response.
- Partition tolerance
- System continues to operate despite arbitrary message loss or network partitions between groups of nodes.
- Two-Generals problem
- Classical impossibility result showing two parties cannot reach deterministic agreement over an unreliable channel — every ack needs its own ack ad infinitum.
- Independent failure
- One node crashing does not directly cause others to crash; the system as a whole keeps operating. Foundation of fault tolerance.
Unit 2 — Models, Events & Logical Time
Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
- Happened-before $(\to)$
- Lamport's partial order on events: (a) same-process order, (b) send → recv, (c) transitive. Events not related by → are concurrent.
- Logical clock consistency
- . Required minimum; satisfied by scalar, vector, matrix.
- Strong consistency of logical clocks
- . Both directions. Vector and matrix satisfy this; scalar does not.
- Scalar (Lamport) clock
- Single integer per process. R1: increment before event. R2: max(local, msg ts) + d on receive. Total-order tie-break: (t, i).
- Vector clock
- n-dim vector per process. R1: V[i]++. R2: componentwise max, then R1. Compares with V_h < V_k iff componentwise ≤ and ∃ strict.
- Matrix clock
- n × n matrix per process. mt[i, k] = own knowledge of k's clock; mt[j, k] = i's knowledge of j's knowledge of k's clock. Enables obsolete-message GC.
- Singhal-Kshemkalyani optimisation
- Vector-clock optimisation: send only entries changed since last send to that recipient. Uses LS[j] (last sent value of V[i]) and LU[j] (last update of V[j]). Requires FIFO.
- Cristian's algorithm
- Client polls a UTC-aware server. Sets local clock to T_s + RTT/2. Single-master; UTC reference.
- Berkeley algorithm
- Master polls all slaves, computes average local time, sends each its delta. No UTC; for internal LAN agreement only.
- NTP
- Hierarchical Internet protocol. Stratum 0 (atomic / GPS), 1 (primary servers), 2+ (secondary). Computes offset+delay from four timestamps. Accuracy: ms WAN, sub-ms LAN.
Unit 3 — Global Snapshots
Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
- Global snapshot
- A recorded state of all processes plus the messages in flight on all channels, captured without stopping the system.
- Consistent cut
- A cut where every recorded receive has its corresponding send also recorded. Equivalently, no message arrow goes future → past.
- C1 condition
- Send recorded ⇒ msg is either in channel state OR recorded as received (not both, not neither). Conservation of messages.
- C2 condition
- Send NOT recorded ⇒ msg NOT in channel state AND NOT recorded as received. Cause-before-effect.
- Marker (Chandy-Lamport)
- Special control message that separates pre-snapshot from post-snapshot messages on a FIFO channel. First marker on a channel triggers recording; subsequent markers stop it.
- White/Red colouring (Lai-Yang)
- Process colour state: white = pre-snapshot, red = post-snapshot. A red message forces the receiver to finalise its own snapshot. Used when channels are non-FIFO.
- SENT / RECD arrays (Acharya-Badrinath)
- Per-process counters of messages sent to / received from each other process. Channel state derived as messages numbered between sent and received counts.
Unit 4 — Causal Order Message Delivery
BSS Algorithm + Causal vs FIFO vs Total Order
- Causal order delivery
- If and both target the same destination, is delivered before . Stronger than FIFO.
- FIFO order
- Messages from the same sender arrive in the order they were sent. Weaker than causal.
- Total order
- All processes deliver every message in the same global sequence — concurrent messages also get a consistent order. Required for state-machine replication.
- BSS algorithm
- Birman-Schiper-Stephenson causal delivery via vector clocks. Two conditions: (FIFO from sender) AND (causally prior msgs arrived).
Unit 5 — Distributed Mutual Exclusion
Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
- DME — safety / liveness / fairness
- Safety: ≤ 1 in CS. Liveness: every requester eventually enters. Fairness: served in timestamp arrival order.
- Synchronisation delay (SD)
- Time after one site leaves the CS before the next enters. Determines throughput as .
- Lamport DME
- Non-token. FIFO required. Per-site request queue. Entry: L1 (later-ts msg from every site) ∧ L2 (own at top). 3(N-1) msgs per CS.
- Ricart-Agrawala
- Non-token. No FIFO. REQUEST + REPLY only; REPLY deferred when own ts is smaller. 2(N-1) msgs per CS.
- Roucairol-Carvalho
- Optimisation of R-A: once you have a REPLY from , don't re-request from unless you've replied to in between. 0 to 2(N-1) msgs.
- Maekawa quorum
- Request set with , every pair , optimum . Quorum-based permission.
- Maekawa V2 messages
- FAILED (replied to higher), INQUIRE (higher arrived; ask 'in CS?'), YIELD (relinquish lock after FAILED). Breaks V1's cyclic deadlock.
- Suzuki-Kasami
- Token-based with broadcast REQUEST. Token holds Q + LN[]. Each site has RN[]. Token sent only on fresh request ().
- Raymond's tree algorithm
- Token-based on a logical tree. Holder pointer at each node points toward root. Token migrates along Holder chain. in balanced tree.
- Token-based vs non-token DME
- Token: a single privilege passes around; requires fault-tolerant token. Non-token: each request asks permission from peers; no token to lose, but more messages per CS.
Unit 6 — Distributed Deadlock Detection
Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
- Resource models
- Single (one pending request); AND (need all); OR (need any one); AND-OR (combination); P-of-Q (any p of q).
- Wait-For Graph (WFG)
- Directed graph where nodes = processes and edge means is blocked waiting on .
- Knot
- SCC where every node's only reachable nodes are within the SCC (no outgoing edges from the SCC). Deadlock criterion in OR model.
- Phantom deadlock
- Apparent deadlock detected from a non-atomic WFG snapshot, but never actually existed simultaneously.
- Chandy-Misra-Haas (CMH) probe
- Edge-chasing algorithm for AND-model deadlock. Probe travels along WFG edges; deadlock when initiator equals receiver on a returned probe.
- Mitchell-Merritt algorithm
- Single-resource model. Public/private labels at each node; probes propagate OPPOSITE to WFG edges. Deadlock when own public label returns.
- Chandy diffusion (OR)
- Diffusion-based detection for OR model. Queries from initiator → dependent set; replies propagate back; deadlock when all branches reply (knot condition).
- Ho-Ramamoorthy 2-phase
- Centralised. Collect process status twice; build WFG from edges appearing in both snapshots. Reduces (but doesn't eliminate) false deadlocks.
Unit 7 — Consensus & Byzantine Agreement
Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
- Crash failure
- Process halts (crashes) and stops sending msgs; never lies. Once crashed, stays crashed.
- Byzantine failure
- Process behaves arbitrarily — may lie, send conflicting msgs to different peers, stay silent, replay old msgs, etc.
- Consensus
- Each process has own initial value; all non-faulty agree on a single value; if all start with , decide .
- Interactive consistency
- Agree on a vector such that if is non-faulty with value , the -th slot is .
- Byzantine agreement (BA)
- Designated source broadcasts; all non-faulty agree on the source's value (or on a default if source faulty). Solves consensus + interactive consistency.
- FLP impossibility (Fischer-Lynch-Paterson, 1985)
- No deterministic algorithm solves consensus in an asynchronous system, even with just ONE crash failure. Motivates randomised / partially-synchronous algorithms.
- Lamport-Shostak-Pease OM(m)
- Recursive oral-messages algorithm for Byzantine agreement. OM(0): direct. OM(t): each lieutenant becomes general for OM(t-1); majority at end. rounds; ; messages.
- Phase King
- Polynomial Byzantine consensus algorithm. ; phases × 2 rounds; adopt majority if multiplicity else trust king. At least one honest king guarantees convergence.
Unit 8 — Distributed Transactions, 2PC & 3PC
ACID + 2PC + 3PC + Blocking & In-Doubt States
- Distributed transaction
- Transaction touching data at multiple servers. Coordinator at originating site; participants at others.
- ACID
- Atomicity (all-or-nothing), Consistency (preserves invariants), Isolation (no partial views), Durability (committed survives failure).
- Fail-stop model
- Failed sites stop participating; never send incorrect messages; may recover. Used by 2PC and Raft.
- <prepare T> / <ready T> / <commit T> / <abort T>
- 2PC log records. <prepare>: coord initiated. <ready>: participant voted yes (forced stable). <commit>/<abort>: final decision (forced stable; point of no return).
- In-doubt state (2PC)
- Participant has <ready T> but no decision record; must hold all T's locks until decision known. Blocks indefinitely if coord unreachable.
- Blocking problem (2PC)
- All participants in <ready> state AND coordinator crashed = none can decide unilaterally → all block holding locks.
- <pre-commit T> (3PC)
- Replicated 'intent to commit' record. Sent by coord after collecting READYs; persisted at sites before final COMMIT. Enables non-blocking recovery.
- 3PC assumptions
- No network partitions + at least 1 site up + at most failures. Strong → why 3PC is not used in practice.
Unit 9 — Raft Consensus
Leader Election + Log Replication + Safety
- Replicated state machine
- A cluster of servers maintaining identical log replicas so they execute the same commands in the same order. Gives a single-system image to clients.
- Term (Raft)
- Monotonic logical period beginning with an election. At most one leader per term. Servers update their term whenever they see a higher one.
- Leader / Follower / Candidate
- Server states. Leader sends heartbeats + accepts client commands. Follower passively replicates. Candidate is between (during election).
- Election timeout
- Random interval (typically 150–300 ms) after which a Follower assumes the Leader has crashed and becomes a Candidate. Randomisation prevents split votes.
- AppendEntries RPC
- Leader → Follower: 'append these entries; prev entry was at (index, term).' Doubles as a heartbeat when entry list is empty. Followers reject on mismatch, leader backs up.
- Election restriction
- Safety rule: voter grants vote only if candidate's log is at least as up-to-date as voter's (compare last entry's term; if equal, longer log wins).
- Commit
- An entry is committed when it has been replicated to a majority of servers IN THE CURRENT TERM. The leader then applies it to the state machine and replies to the client.
- Log compaction (snapshot)
- Compress old applied entries into a single snapshot of the state machine. New servers receive snapshot instead of replaying full log.
- Joint consensus (membership change)
- Transitional phase requiring agreement in BOTH old-config majority AND new-config majority. Prevents two disjoint majorities during transition.
Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)
GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
- Spanning tree
- An acyclic subgraph that connects all vertices using exactly edges.
- MST
- Spanning tree with minimum total edge weight. Unique if edge weights are distinct.
- Cut property
- For any non-empty proper subset of vertices, the lightest edge crossing the cut is in the MST. Foundation of GHS.
- Cycle property
- For any cycle in , the heaviest edge is NOT in the MST.
- Fragment
- A subtree of some MST. Has a (name, level) pair. Initial single-node fragments are level 0 with name = node id.
- Core edge
- The most recently used combining edge of a fragment (Rule B merger). Its weight becomes the fragment's new name.
- MWOE (Minimum-Weight Outgoing Edge)
- Lightest edge of a fragment with exactly one endpoint outside the fragment. Adding it preserves the fragment property.
- Rule A (absorption)
- → absorbs into , taking 's name and level. Smaller fragment rushes into bigger.
- Rule B (merger)
- AND mutual MWOE → merge into new fragment, level , name = .
- Rule C (wait)
- Else — wait until conditions for A or B are met.
- GHS message types
- connect, initiate, test, accept, reject, report, changeroot.
- Test reply rules
- Same fragment → reject. Different fragment AND → accept. Different fragment AND → defer.
Unit 11 — Google File System (GFS)
GFS — Architecture, Reads, Writes, Consistency, Recovery
- GFS architecture
- Single master (metadata + log) + chunkservers (64 MB chunks, 3× replicas across racks) + clients (library). Data never flows through master.
- Chunk
- Fixed-size 64 MB unit of storage. Each has a 64-bit unique handle. Replicated 3× by default across racks.
- Master metadata
- File and chunk namespaces (logged); file → chunk handle mapping (logged); chunk replica locations (NOT logged, rebuilt from heartbeats); version number (logged), current primary, lease expiry.
- Lease
- Master grants to one replica = primary, default 60 s. Primary serialises mutations during lease. Decouples master from per-write activity. Renewable via heartbeat.
- Atomic record append
- GFS picks the offset; appends record at least once. Avoids distributed locking for many-writer logs. May produce duplicates + padding (defined interspersed with inconsistent).
- Consistency state — defined
- Consistent (same bytes everywhere) AND reflects the writer's mutation in its entirety (no mingling).
- Consistency state — consistent but undefined
- All replicas see the same bytes, but the bytes don't reflect any single writer's mutation in full — mingled fragments from concurrent writers.
- Consistency state — inconsistent
- Replicas see DIFFERENT bytes. Result of a failed mutation or a stale replica.
- Pipelined data flow
- Bytes flow linearly through replica chain along closest path; control flow separately in star (client → primary → secondaries). Maximises bandwidth utilisation.
- Copy-on-write snapshot
- Master duplicates metadata + increments chunk refcounts. Chunks copied locally only on first subsequent write — snapshot is O(1) until something changes.
- Stale replica detection
- Each chunk has a version number; master increments on lease grant; replicas with old version detected via heartbeat and garbage-collected.
- Garbage collection (GFS)
- Deleted files renamed hidden + kept 3 days; periodic scan removes them. Orphaned chunks + stale replicas reclaimed via heartbeat-driven master commands.