Saral Shiksha Yojna
Courses/Distributed Systems

Distributed Systems

CS3.401
Prof. Kishore KothapalliMonsoon 2025-264 credits

Definitions

Every term, every chapter. Toggle between the textbook wording and a plain-English version (when available).

94 terms · 0 have plain-English versions

Unit 1 — Introduction, Challenges & CAP Theorem

Foundations — Definition, Motivation, Challenges, CAP
Distributed system (Tanenbaum)
A group of computers working together to appear as a single computer to the end user.
Consistency (CAP)
Every node returns the same most-recent successful write — all nodes have the same view of data at every instant.
Availability (CAP)
Every non-failing node responds to every request in reasonable time, with a non-error response.
Partition tolerance
System continues to operate despite arbitrary message loss or network partitions between groups of nodes.
Two-Generals problem
Classical impossibility result showing two parties cannot reach deterministic agreement over an unreliable channel — every ack needs its own ack ad infinitum.
Independent failure
One node crashing does not directly cause others to crash; the system as a whole keeps operating. Foundation of fault tolerance.

Unit 2 — Models, Events & Logical Time

Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync
Happened-before $(\to)$
Lamport's partial order on events: (a) same-process order, (b) send → recv, (c) transitive. Events not related by → are concurrent.
Logical clock consistency
. Required minimum; satisfied by scalar, vector, matrix.
Strong consistency of logical clocks
. Both directions. Vector and matrix satisfy this; scalar does not.
Scalar (Lamport) clock
Single integer per process. R1: increment before event. R2: max(local, msg ts) + d on receive. Total-order tie-break: (t, i).
Vector clock
n-dim vector per process. R1: V[i]++. R2: componentwise max, then R1. Compares with V_h < V_k iff componentwise ≤ and ∃ strict.
Matrix clock
n × n matrix per process. mt[i, k] = own knowledge of k's clock; mt[j, k] = i's knowledge of j's knowledge of k's clock. Enables obsolete-message GC.
Singhal-Kshemkalyani optimisation
Vector-clock optimisation: send only entries changed since last send to that recipient. Uses LS[j] (last sent value of V[i]) and LU[j] (last update of V[j]). Requires FIFO.
Cristian's algorithm
Client polls a UTC-aware server. Sets local clock to T_s + RTT/2. Single-master; UTC reference.
Berkeley algorithm
Master polls all slaves, computes average local time, sends each its delta. No UTC; for internal LAN agreement only.
NTP
Hierarchical Internet protocol. Stratum 0 (atomic / GPS), 1 (primary servers), 2+ (secondary). Computes offset+delay from four timestamps. Accuracy: ms WAN, sub-ms LAN.

Unit 3 — Global Snapshots

Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts
Global snapshot
A recorded state of all processes plus the messages in flight on all channels, captured without stopping the system.
Consistent cut
A cut where every recorded receive has its corresponding send also recorded. Equivalently, no message arrow goes future → past.
C1 condition
Send recorded ⇒ msg is either in channel state OR recorded as received (not both, not neither). Conservation of messages.
C2 condition
Send NOT recorded ⇒ msg NOT in channel state AND NOT recorded as received. Cause-before-effect.
Marker (Chandy-Lamport)
Special control message that separates pre-snapshot from post-snapshot messages on a FIFO channel. First marker on a channel triggers recording; subsequent markers stop it.
White/Red colouring (Lai-Yang)
Process colour state: white = pre-snapshot, red = post-snapshot. A red message forces the receiver to finalise its own snapshot. Used when channels are non-FIFO.
SENT / RECD arrays (Acharya-Badrinath)
Per-process counters of messages sent to / received from each other process. Channel state derived as messages numbered between sent and received counts.

Unit 4 — Causal Order Message Delivery

BSS Algorithm + Causal vs FIFO vs Total Order
Causal order delivery
If and both target the same destination, is delivered before . Stronger than FIFO.
FIFO order
Messages from the same sender arrive in the order they were sent. Weaker than causal.
Total order
All processes deliver every message in the same global sequence — concurrent messages also get a consistent order. Required for state-machine replication.
BSS algorithm
Birman-Schiper-Stephenson causal delivery via vector clocks. Two conditions: (FIFO from sender) AND (causally prior msgs arrived).

Unit 5 — Distributed Mutual Exclusion

Lamport, Ricart-Agrawala, Maekawa, Suzuki-Kasami, Raymond — Complete Comparison
DME — safety / liveness / fairness
Safety: ≤ 1 in CS. Liveness: every requester eventually enters. Fairness: served in timestamp arrival order.
Synchronisation delay (SD)
Time after one site leaves the CS before the next enters. Determines throughput as .
Lamport DME
Non-token. FIFO required. Per-site request queue. Entry: L1 (later-ts msg from every site) ∧ L2 (own at top). 3(N-1) msgs per CS.
Ricart-Agrawala
Non-token. No FIFO. REQUEST + REPLY only; REPLY deferred when own ts is smaller. 2(N-1) msgs per CS.
Roucairol-Carvalho
Optimisation of R-A: once you have a REPLY from , don't re-request from unless you've replied to in between. 0 to 2(N-1) msgs.
Maekawa quorum
Request set with , every pair , optimum . Quorum-based permission.
Maekawa V2 messages
FAILED (replied to higher), INQUIRE (higher arrived; ask 'in CS?'), YIELD (relinquish lock after FAILED). Breaks V1's cyclic deadlock.
Suzuki-Kasami
Token-based with broadcast REQUEST. Token holds Q + LN[]. Each site has RN[]. Token sent only on fresh request ().
Raymond's tree algorithm
Token-based on a logical tree. Holder pointer at each node points toward root. Token migrates along Holder chain. in balanced tree.
Token-based vs non-token DME
Token: a single privilege passes around; requires fault-tolerant token. Non-token: each request asks permission from peers; no token to lose, but more messages per CS.

Unit 6 — Distributed Deadlock Detection

Resource Models, WFG, CMH Probe, Mitchell-Merritt, Chandy Diffusion
Resource models
Single (one pending request); AND (need all); OR (need any one); AND-OR (combination); P-of-Q (any p of q).
Wait-For Graph (WFG)
Directed graph where nodes = processes and edge means is blocked waiting on .
Knot
SCC where every node's only reachable nodes are within the SCC (no outgoing edges from the SCC). Deadlock criterion in OR model.
Phantom deadlock
Apparent deadlock detected from a non-atomic WFG snapshot, but never actually existed simultaneously.
Chandy-Misra-Haas (CMH) probe
Edge-chasing algorithm for AND-model deadlock. Probe travels along WFG edges; deadlock when initiator equals receiver on a returned probe.
Mitchell-Merritt algorithm
Single-resource model. Public/private labels at each node; probes propagate OPPOSITE to WFG edges. Deadlock when own public label returns.
Chandy diffusion (OR)
Diffusion-based detection for OR model. Queries from initiator → dependent set; replies propagate back; deadlock when all branches reply (knot condition).
Ho-Ramamoorthy 2-phase
Centralised. Collect process status twice; build WFG from edges appearing in both snapshots. Reduces (but doesn't eliminate) false deadlocks.

Unit 7 — Consensus & Byzantine Agreement

Crash Consensus + Byzantine Agreement (OM(m), Phase King) + FLP
Crash failure
Process halts (crashes) and stops sending msgs; never lies. Once crashed, stays crashed.
Byzantine failure
Process behaves arbitrarily — may lie, send conflicting msgs to different peers, stay silent, replay old msgs, etc.
Consensus
Each process has own initial value; all non-faulty agree on a single value; if all start with , decide .
Interactive consistency
Agree on a vector such that if is non-faulty with value , the -th slot is .
Byzantine agreement (BA)
Designated source broadcasts; all non-faulty agree on the source's value (or on a default if source faulty). Solves consensus + interactive consistency.
FLP impossibility (Fischer-Lynch-Paterson, 1985)
No deterministic algorithm solves consensus in an asynchronous system, even with just ONE crash failure. Motivates randomised / partially-synchronous algorithms.
Lamport-Shostak-Pease OM(m)
Recursive oral-messages algorithm for Byzantine agreement. OM(0): direct. OM(t): each lieutenant becomes general for OM(t-1); majority at end. rounds; ; messages.
Phase King
Polynomial Byzantine consensus algorithm. ; phases × 2 rounds; adopt majority if multiplicity else trust king. At least one honest king guarantees convergence.

Unit 8 — Distributed Transactions, 2PC & 3PC

ACID + 2PC + 3PC + Blocking & In-Doubt States
Distributed transaction
Transaction touching data at multiple servers. Coordinator at originating site; participants at others.
ACID
Atomicity (all-or-nothing), Consistency (preserves invariants), Isolation (no partial views), Durability (committed survives failure).
Fail-stop model
Failed sites stop participating; never send incorrect messages; may recover. Used by 2PC and Raft.
<prepare T> / <ready T> / <commit T> / <abort T>
2PC log records. <prepare>: coord initiated. <ready>: participant voted yes (forced stable). <commit>/<abort>: final decision (forced stable; point of no return).
In-doubt state (2PC)
Participant has <ready T> but no decision record; must hold all T's locks until decision known. Blocks indefinitely if coord unreachable.
Blocking problem (2PC)
All participants in <ready> state AND coordinator crashed = none can decide unilaterally → all block holding locks.
<pre-commit T> (3PC)
Replicated 'intent to commit' record. Sent by coord after collecting READYs; persisted at sites before final COMMIT. Enables non-blocking recovery.
3PC assumptions
No network partitions + at least 1 site up + at most failures. Strong → why 3PC is not used in practice.

Unit 9 — Raft Consensus

Leader Election + Log Replication + Safety
Replicated state machine
A cluster of servers maintaining identical log replicas so they execute the same commands in the same order. Gives a single-system image to clients.
Term (Raft)
Monotonic logical period beginning with an election. At most one leader per term. Servers update their term whenever they see a higher one.
Leader / Follower / Candidate
Server states. Leader sends heartbeats + accepts client commands. Follower passively replicates. Candidate is between (during election).
Election timeout
Random interval (typically 150–300 ms) after which a Follower assumes the Leader has crashed and becomes a Candidate. Randomisation prevents split votes.
AppendEntries RPC
Leader → Follower: 'append these entries; prev entry was at (index, term).' Doubles as a heartbeat when entry list is empty. Followers reject on mismatch, leader backs up.
Election restriction
Safety rule: voter grants vote only if candidate's log is at least as up-to-date as voter's (compare last entry's term; if equal, longer log wins).
Commit
An entry is committed when it has been replicated to a majority of servers IN THE CURRENT TERM. The leader then applies it to the state machine and replies to the client.
Log compaction (snapshot)
Compress old applied entries into a single snapshot of the state machine. New servers receive snapshot instead of replaying full log.
Joint consensus (membership change)
Transitional phase requiring agreement in BOTH old-config majority AND new-config majority. Prevents two disjoint majorities during transition.

Unit 10 — Distributed Minimum Spanning Tree (GHS Algorithm)

GHS Rules A/B/C + Fragment Levels + Test/Accept/Reject
Spanning tree
An acyclic subgraph that connects all vertices using exactly edges.
MST
Spanning tree with minimum total edge weight. Unique if edge weights are distinct.
Cut property
For any non-empty proper subset of vertices, the lightest edge crossing the cut is in the MST. Foundation of GHS.
Cycle property
For any cycle in , the heaviest edge is NOT in the MST.
Fragment
A subtree of some MST. Has a (name, level) pair. Initial single-node fragments are level 0 with name = node id.
Core edge
The most recently used combining edge of a fragment (Rule B merger). Its weight becomes the fragment's new name.
MWOE (Minimum-Weight Outgoing Edge)
Lightest edge of a fragment with exactly one endpoint outside the fragment. Adding it preserves the fragment property.
Rule A (absorption)
absorbs into , taking 's name and level. Smaller fragment rushes into bigger.
Rule B (merger)
AND mutual MWOE → merge into new fragment, level , name = .
Rule C (wait)
Else — wait until conditions for A or B are met.
GHS message types
connect, initiate, test, accept, reject, report, changeroot.
Test reply rules
Same fragment → reject. Different fragment AND → accept. Different fragment AND → defer.

Unit 11 — Google File System (GFS)

GFS — Architecture, Reads, Writes, Consistency, Recovery
GFS architecture
Single master (metadata + log) + chunkservers (64 MB chunks, 3× replicas across racks) + clients (library). Data never flows through master.
Chunk
Fixed-size 64 MB unit of storage. Each has a 64-bit unique handle. Replicated 3× by default across racks.
Master metadata
File and chunk namespaces (logged); file → chunk handle mapping (logged); chunk replica locations (NOT logged, rebuilt from heartbeats); version number (logged), current primary, lease expiry.
Lease
Master grants to one replica = primary, default 60 s. Primary serialises mutations during lease. Decouples master from per-write activity. Renewable via heartbeat.
Atomic record append
GFS picks the offset; appends record at least once. Avoids distributed locking for many-writer logs. May produce duplicates + padding (defined interspersed with inconsistent).
Consistency state — defined
Consistent (same bytes everywhere) AND reflects the writer's mutation in its entirety (no mingling).
Consistency state — consistent but undefined
All replicas see the same bytes, but the bytes don't reflect any single writer's mutation in full — mingled fragments from concurrent writers.
Consistency state — inconsistent
Replicas see DIFFERENT bytes. Result of a failed mutation or a stale replica.
Pipelined data flow
Bytes flow linearly through replica chain along closest path; control flow separately in star (client → primary → secondaries). Maximises bandwidth utilisation.
Copy-on-write snapshot
Master duplicates metadata + increments chunk refcounts. Chunks copied locally only on first subsequent write — snapshot is O(1) until something changes.
Stale replica detection
Each chunk has a version number; master increments on lease grant; replicas with old version detected via heartbeat and garbage-collected.
Garbage collection (GFS)
Deleted files renamed hidden + kept 3 days; periodic scan removes them. Orphaned chunks + stale replicas reclaimed via heartbeat-driven master commands.