
Distributed Systems

CS3.401
Prof. Kishore Kothapalli · Monsoon 2025-26 · 4 credits
Revision Notes · Unit 2 — Models, Events & Logical Time

Logical Clocks (Scalar / Vector / Matrix) + Physical Time Sync


Intuition

Without a global clock, we need a discipline for talking about 'before' and 'after'. Lamport's happened-before relation captures *causal* ordering; logical clocks (scalar, vector, matrix) implement that relation in code. The trade-off is storage vs information: scalar = 1 int (cheap, but loses converse); vector = n ints (strongly consistent); matrix = n × n ints (also tracks what others know about each other — enables obsolete-message GC).

Explanation

The DS model. A collection of processes $p_1, p_2, \ldots, p_n$ connected by communication channels $C_{ij}$. Processes may fail; share no global memory; have no global clock. Channels may not respect FIFO ordering, may not deliver in causal order, may lose messages.

State of a process / channel. State of $p_i$ = local memory + history of activity on it. State of $C_{ij}$ = set of messages sent on it but not yet received.

Three event types in a distributed program. (1) Local action (internal) — doesn't involve another process. (2) Message send. (3) Message receive. Each event changes the state of the process at which it occurs; a send or receive also changes the state of the channel.

**Lamport's happened-before relation $(\to)$.** Defined by three rules: (a) same-process order — if $a$ and $b$ are in the same process and $a$ occurs before $b$, then $a \to b$; (b) send precedes receive — if $a = \mathrm{send}(m)$ and $b = \mathrm{recv}(m)$, then $a \to b$; (c) transitive — $a \to b$ and $b \to c$ imply $a \to c$. Two events $a, b$ are concurrent ($a \parallel b$) iff neither $a \to b$ nor $b \to a$.

Logical clock — definition + consistency. A function $C$ assigning a timestamp $C(e)$ to each event $e$ such that if $a \to b$ then $C(a) < C(b)$ (clock consistency / monotonicity). Strong consistency: $a \to b \iff C(a) < C(b)$ — both directions. Scalar clocks satisfy only $\Rightarrow$; vector and matrix clocks satisfy $\iff$.

Scalar (Lamport) clock rules. Each $p_i$ keeps an integer $C_i$, initially 0. R1 (before any event): $C_i := C_i + d$, $d > 0$ (typically $d = 1$). R2 (on receipt of message $m$ with timestamp $C_m$): $C_i := \max(C_i, C_m)$, then apply R1, then deliver. Total order tie-breaker: $(t, i)$ — timestamp first, then process ID. 'Height' of an event = its scalar timestamp.
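
A minimal Python sketch of R1/R2 (class and method names like `LamportClock.recv` are illustrative, not from the course; $d = 1$):

```python
class LamportClock:
    """Scalar (Lamport) clock sketch: one integer per process, d = 1."""

    def __init__(self, pid: int):
        self.pid = pid  # process ID, used only for the (t, i) tie-break
        self.c = 0      # local scalar clock, initially 0

    def tick(self) -> int:
        """R1: increment before any event (internal or send)."""
        self.c += 1
        return self.c

    def send(self) -> int:
        """A send is an event: apply R1 and attach the timestamp."""
        return self.tick()

    def recv(self, msg_ts: int) -> int:
        """R2: max(local, message timestamp), then R1, then deliver."""
        self.c = max(self.c, msg_ts)
        return self.tick()

    def order_key(self) -> tuple:
        """(t, i) tie-break: timestamp first, then process ID."""
        return (self.c, self.pid)


p1, p2 = LamportClock(1), LamportClock(2)
ts = p1.send()      # p1: c = 1
print(p2.recv(ts))  # p2: max(0, 1) + 1 = 2
```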

Scalar clock limitation — why not strongly consistent. $C(a) < C(b)$ does NOT imply $a \to b$. Two events at different processes can have ordered scalar timestamps without any causal relation. Example: $p_1$'s third local event has $C = 3$; $p_2$ runs independently to $C = 5$. $3 < 5$ but the events are concurrent.

Vector clock rules. Each $p_i$ keeps a vector $V_i[1..n]$, initially all zeros. R1 (before any event): $V_i[i] := V_i[i] + 1$. R2 (on receipt of message $m$ with vector $V_m$): $V_i[k] := \max(V_i[k], V_m[k])$ for all $k$, then apply R1, then deliver. Comparison: $V_h < V_k$ iff $V_h[x] \le V_k[x]$ for all $x$ and $V_h \ne V_k$. Concurrent: $V_h \parallel V_k$ iff neither $V_h < V_k$ nor $V_k < V_h$.
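
A sketch of the vector rules under the same illustrative conventions (0-indexed processes; `happened_before` implements the comparison above; the usage lines reproduce the 3-process trace in Examples below):

```python
class VectorClock:
    """Vector clock sketch for a fixed set of n processes (0-indexed)."""

    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.v = [0] * n  # V_i, initially all zeros

    def tick(self) -> list:
        """R1: increment own component before any event."""
        self.v[self.pid] += 1
        return self.v.copy()

    def send(self) -> list:
        """A send is an event: R1, then ship a copy of the vector."""
        return self.tick()

    def recv(self, vm: list) -> list:
        """R2: componentwise max with the message's vector, then R1."""
        self.v = [max(a, b) for a, b in zip(self.v, vm)]
        return self.tick()


def happened_before(vh: list, vk: list) -> bool:
    """vh < vk iff componentwise <= and the vectors differ."""
    return all(a <= b for a, b in zip(vh, vk)) and vh != vk


def concurrent(vh: list, vk: list) -> bool:
    return not happened_before(vh, vk) and not happened_before(vk, vh)


p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
m1 = p1.send()                    # (1, 0, 0)
p2.recv(m1)                       # (1, 1, 0)
m2 = p2.send()                    # (1, 2, 0)
print(p3.recv(m2))                # (1, 2, 1)
print(happened_before(m1, p3.v))  # True: p1's send -> p3's recv
```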

Vector clock strong consistency. $a \to b \iff V(a) < V(b)$ — both directions. The $k$-th component of $p_i$'s vector clock = number of events at $p_k$ that causally precede $p_i$'s current event.

Vector clock limitations. Message overhead of $n$ integers per message → grows linearly with the number of processes. In large clusters this is significant overhead.

Singhal-Kshemkalyani optimisation. Send only entries that have *changed* since the last send to that recipient. Each $p_i$ maintains $LS_i[j]$ = value of $V_i[i]$ when $p_i$ last sent to $p_j$, and $LU_i[k]$ = value of $V_i[i]$ when $V_i[k]$ was last updated. On a send to $p_j$, send only the pairs $(k, V_i[k])$ where $LU_i[k] > LS_i[j]$. Requires FIFO channels because the recipient must reconstruct the full vector incrementally.
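
A sketch of the diff-based send/receive (field names `ls`/`lu` mirror $LS$/$LU$; this assumes FIFO delivery, as the notes require):

```python
class SKVectorClock:
    """Singhal-Kshemkalyani sketch: ship only vector entries that changed
    since the last send to each destination. Valid only over FIFO channels."""

    def __init__(self, pid: int, n: int):
        self.pid = pid
        self.v = [0] * n
        self.ls = [0] * n  # LS[j]: value of v[pid] at last send to j
        self.lu = [0] * n  # LU[k]: value of v[pid] when v[k] last changed

    def _tick(self):
        self.v[self.pid] += 1
        self.lu[self.pid] = self.v[self.pid]

    def send(self, dest: int) -> list:
        """R1, then send only the (index, value) pairs updated since LS[dest]."""
        self._tick()
        diff = [(k, self.v[k]) for k in range(len(self.v))
                if self.lu[k] > self.ls[dest]]
        self.ls[dest] = self.v[self.pid]
        return diff

    def recv(self, diff: list):
        """Merge the partial vector (FIFO guarantees no missed intermediate
        updates), then R1; record which components changed."""
        changed = [k for k, val in diff if val > self.v[k]]
        for k, val in diff:
            self.v[k] = max(self.v[k], val)
        self._tick()
        for k in changed:
            self.lu[k] = self.v[self.pid]
```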

**Matrix clock — what's in $mt_i$.** $p_i$'s $n \times n$ matrix $mt_i$: row $i$ ($mt_i[i, \cdot]$) is its own vector clock, and $mt_i[j, k]$ is $p_i$'s knowledge of $p_j$'s knowledge of $p_k$'s clock — second-order knowledge. Killer use case: garbage collection of obsolete messages. Once $\min_j mt_i[j, i] \ge t$, $p_i$ knows that every process has seen $p_i$'s clock pass $t$ — so messages with timestamps $\le t$ can be safely discarded.
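
A matrix-clock sketch following the standard update rules (fold the sender's own row into our row, take componentwise max over the whole matrix, then R1); the GC helper computes the $\min_j mt_i[j, i]$ bound:

```python
class MatrixClock:
    """Matrix clock sketch: row pid is this process's own vector clock."""

    def __init__(self, pid: int, n: int):
        self.pid, self.n = pid, n
        self.mt = [[0] * n for _ in range(n)]

    def tick(self):
        """R1: a local event increments mt[pid][pid]."""
        self.mt[self.pid][self.pid] += 1

    def send(self) -> list:
        """A send is an event: R1, then ship a deep copy of the matrix."""
        self.tick()
        return [row.copy() for row in self.mt]

    def recv(self, w: list, sender: int):
        """Fold the sender's own row (their vector clock) into our row,
        take componentwise max over the whole matrix, then R1."""
        for k in range(self.n):
            self.mt[self.pid][k] = max(self.mt[self.pid][k], w[sender][k])
        for j in range(self.n):
            for k in range(self.n):
                self.mt[j][k] = max(self.mt[j][k], w[j][k])
        self.tick()

    def gc_bound(self) -> int:
        """min_j mt[j][pid]: every process has seen our clock pass this
        value, so our sent messages timestamped <= it are obsolete."""
        return min(self.mt[j][self.pid] for j in range(self.n))
```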

Comparison table.

| Clock  | Storage          | Consistency                                | Application               |
|--------|------------------|--------------------------------------------|---------------------------|
| Scalar | 1 int            | consistent only ($\Rightarrow$)            | total ordering            |
| Vector | $n$ ints         | strongly consistent ($\iff$)               | detecting concurrency     |
| Matrix | $n \times n$ ints| strongly consistent + knows what others know | GC / replicated databases |

Physical time — why needed. Each node's quartz crystal drifts (on the order of seconds per day). Many applications need globally meaningful time: file timestamps for make, transaction timestamps, log ordering, distributed debugging, security tokens, time-based protocol timeouts.

Cristian's algorithm. Single master server. Client sends a request at $T_0$; server replies with UTC time $T_s$; client receives at $T_1$. $RTT = T_1 - T_0$. Knowing the minimum one-way transit time $T_{min}$, the actual server time lies in $[T_s + T_{min},\ T_s + RTT - T_{min}]$. Client sets its clock to $T_s + RTT/2$ (best estimate). Drawback: the server is a single point of failure.
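
A sketch of the estimate and interval arithmetic (the function name and the $T_{min} = 0$ default are illustrative; the numbers match the worked example in Examples below):

```python
def cristian_estimate(t0: float, t1: float, ts: float, t_min: float = 0.0):
    """t0/t1: client send/receive on its local clock; ts: server's UTC reply.
    Returns (best estimate, interval lo, interval hi), all in seconds."""
    rtt = t1 - t0
    best = ts + rtt / 2                    # assumes symmetric one-way delays
    lo, hi = ts + t_min, ts + rtt - t_min  # true server time lies in [lo, hi]
    return best, lo, hi


# T0 = 100 ms, T1 = 110 ms, Ts = 50.000 s, so RTT = 10 ms:
print(cristian_estimate(t0=0.100, t1=0.110, ts=50.000))  # (50.005, 50.0, 50.01)
```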

Berkeley algorithm. Master daemon polls all slaves for their local times. Computes the average (after correcting for transit delay, discarding outliers). Sends each slave the delta it should adjust. No UTC source needed — used in internal LANs where the cluster only needs to agree among itself, not with the outside world.
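
A sketch of one polling round (the `max_skew` outlier cutoff is a simplification, and transit-delay correction is omitted for brevity):

```python
def berkeley_round(master_time: float, slave_times: list, max_skew: float):
    """One Berkeley round: average master + slave clocks (minutes here),
    reject outliers beyond max_skew, return the delta each node should apply."""
    times = [master_time] + slave_times
    avg = sum(times) / len(times)
    kept = [t for t in times if abs(t - avg) <= max_skew]  # outlier rejection
    target = sum(kept) / len(kept)
    return [target - t for t in times]


# Master at 3:00, slaves at 3:05 and 2:50 (minutes past midnight):
# average 2:58:20 -> deltas of -1:40, -6:40, +8:20
print(berkeley_round(180.0, [185.0, 170.0], max_skew=60.0))
```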

Decentralised averaging. At fixed agreed times, every machine broadcasts its local time. Each computes the average of all received timestamps and sets its clock to that. Fully decentralised; no master, no UTC.

NTP (Network Time Protocol). Most widely used Internet protocol for clock sync (~10–20 M servers). Hierarchical strata: Stratum 0 (atomic clocks, GPS); Stratum 1 (primary servers directly connected to stratum 0); Stratum 2+ (secondary servers synchronising from above). Clients exchange four timestamps with servers and compute offset + delay. Accuracy: ms on WANs, sub-ms on LANs. Cannot achieve perfect sync — different nodes may use different NTP servers; stratum delay accumulates.
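
The standard offset/delay computation from the four exchanged timestamps, as a sketch (the example numbers are hypothetical):

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float):
    """t1 = client send, t2 = server receive, t3 = server send,
    t4 = client receive. Returns (clock offset, round-trip wire delay)."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Client 5 ms behind the server, 2 ms each way on the wire, 1 ms server hold:
print(ntp_offset_delay(t1=0.000, t2=0.007, t3=0.008, t4=0.005))  # (0.005, 0.004)
```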

Why still need logical clocks if NTP exists? NTP guarantees only approximate sync (offsets in milliseconds). Many DS algorithms (mutex, snapshots, deadlock detection) need *causal correctness*, not numeric closeness — even tiny clock skews break safety properties. Logical clocks capture causality exactly, regardless of physical drift.

Definitions

  • Happened-before $(\to)$: Lamport's partial order on events: (a) same-process order, (b) send → recv, (c) transitive. Events not related by $\to$ are concurrent.
  • Logical clock consistency: $a \to b \Rightarrow C(a) < C(b)$. Required minimum; satisfied by scalar, vector, matrix.
  • Strong consistency of logical clocks: $a \to b \iff C(a) < C(b)$ — both directions. Vector and matrix satisfy this; scalar does not.
  • Scalar (Lamport) clock: single integer per process. R1: increment before event. R2: $\max$(local, msg ts) + $d$ on receive. Total-order tie-break: $(t, i)$.
  • Vector clock: $n$-dim vector per process. R1: $V_i[i]{+}{+}$. R2: componentwise max, then R1. Comparison: $V_h < V_k$ iff componentwise $\le$ and at least one component strict.
  • Matrix clock: $n \times n$ matrix per process. $mt_i[i, k]$ = own knowledge of $p_k$'s clock; $mt_i[j, k]$ = $p_i$'s knowledge of $p_j$'s knowledge of $p_k$'s clock. Enables obsolete-message GC.
  • Singhal-Kshemkalyani optimisation: vector-clock optimisation — send only entries changed since the last send to that recipient. Uses $LS[j]$ (value of $V[i]$ at last send to $p_j$) and $LU[k]$ (value of $V[i]$ at last update of $V[k]$). Requires FIFO.
  • Cristian's algorithm: client polls a UTC-aware server; sets local clock to $T_s + RTT/2$. Single master; UTC reference.
  • Berkeley algorithm: master polls all slaves, computes the average local time, sends each its delta. No UTC; for internal LAN agreement only.
  • NTP: hierarchical Internet protocol. Stratum 0 (atomic / GPS), 1 (primary servers), 2+ (secondary). Computes offset + delay from four timestamps. Accuracy: ms WAN, sub-ms LAN.

Formulas

  • Scalar: R1: $C_i := C_i + d$ ($d > 0$). R2: $C_i := \max(C_i, C_m)$, then R1.
  • Vector: R1: $V_i[i] := V_i[i] + 1$. R2: $V_i[k] := \max(V_i[k], V_m[k])$ for all $k$, then R1.
  • Vector order: $V_h < V_k \iff (\forall x:\ V_h[x] \le V_k[x]) \wedge V_h \ne V_k$.
  • Matrix GC bound: $p_i$ may discard sent messages with timestamp $\le \min_j mt_i[j, i]$.
  • Cristian's: $RTT = T_1 - T_0$; clock estimate $= T_s + RTT/2$; true server time $\in [T_s + T_{min},\ T_s + RTT - T_{min}]$.
  • NTP (timestamps $T_1..T_4$): offset $= \frac{(T_2 - T_1) + (T_3 - T_4)}{2}$, delay $= (T_4 - T_1) - (T_3 - T_2)$.

Derivations

Scalar clock is consistent but not strongly consistent. Forward direction: if $a \to b$, the rules ensure $C(a) < C(b)$ by induction on the causal path. Counter-example for the converse: $p_1$ and $p_2$ run independently. $p_1$'s event $a$ has $C(a) = 3$; $p_2$'s event $b$ has $C(b) = 5$. $C(a) < C(b)$ but neither $a \to b$ nor $b \to a$ — concurrent.

Vector clock captures causality exactly. Forward: if $a \to b$ via a chain, every send/recv preserves componentwise $\le$, so $V(a) \le V(b)$ componentwise; at least the originating process's component strictly increments. Reverse: if $V(a) < V(b)$ componentwise, then for the originating process $p_i$ of $a$, its component strictly increased by $b$ — which means either $b$ is at $p_i$ after $a$, or $b$ received a message chain from $p_i$ sent after $a$.

Why Singhal-Kshemkalyani needs FIFO. The recipient updates $V[k]$ only for the components included in the message — relying on every previous update from that sender having already arrived. If a later message's partial vector arrives before an earlier message's update, the recipient misses an intermediate value of $V[k]$ and corrupts causality. FIFO guarantees previous updates from the same sender arrive first.

Matrix clock obsolete-message GC argument. $mt_i[j, k]$ = $p_i$'s knowledge of $p_j$'s knowledge of $p_k$'s clock. If $\min_j mt_i[j, i] \ge t$, then every $p_j$ has been observed (by $p_i$) to have seen $p_i$'s clock pass $t$. By transitivity, every $p_j$ already received all of $p_i$'s messages with timestamp $\le t$ — so $p_i$ can safely discard them from its outgoing buffer.

Examples

  • 3-process vector clock trace. $p_1$ sends $m_1$ to $p_2$ at $V_1 = (1, 0, 0)$; $p_2$ sends $m_2$ to $p_3$ at $V_2 = (1, 2, 0)$ (after recv $m_1$); $p_3$ receives $m_2$. After $p_1$'s send: $V_1 = (1, 0, 0)$. $p_2$'s recv: max + inc → $V_2 = (1, 1, 0)$. $p_2$'s send: $V_2 = (1, 2, 0)$. $p_3$'s recv: max + inc → $V_3 = (1, 2, 1)$. Conclusion: $p_1$'s send $\to$ $p_3$'s recv, since $(1, 0, 0) < (1, 2, 1)$ componentwise.
  • Scalar clock counter-example to converse. Event $a$ at $p_1$ with $C(a) = 3$ (3 local events). Event $b$ at $p_2$ with $C(b) = 5$ (5 local events, no comms with $p_1$). $C(a) < C(b)$ but they are concurrent — scalar fails the $\Leftarrow$ direction.
  • Cristian's worked example. Client sends at $T_0 = 100$ ms (local clock). Server responds with UTC $T_s = 50.000$ s. Client receives at $T_1 = 110$ ms. $RTT = 10$ ms. Client sets clock $= T_s + RTT/2 = 50.005$ s. Uncertainty: $\pm(RTT/2 - T_{min})$ for minimum one-way delay $T_{min}$.
  • Matrix GC scenario. Replicated database with 4 sites. Each update is propagated and ack'd. Once $\min_j mt_1[j, 1] \ge 100$, site 1 knows all sites have seen its updates up to logical time 100 — safe to discard those updates from its outbound buffer.

Diagrams

  • Three-process happened-before chain: events as nodes, same-process arrows + send/recv arrows; concurrent events are not connected.
  • Vector-clock lattice for 2 processes: 2D grid of (V[1], V[2]) values; partial order ≤ defines causality; concurrent points are non-comparable.
  • Comparison table: rows = scalar / vector / matrix; cols = storage, consistency (forward only / both), application (total order / causality / GC).
  • Cristian's RTT measurement diagram: client request at T0 → server reply at Ts → client receive at T1; estimated time = Ts + RTT/2.
  • NTP stratum hierarchy: stratum 0 (atomic / GPS) at top, fanning out to stratum 1 (primary), stratum 2 (secondary), etc.

Edge cases

  • Concurrent scalar timestamps after tie-break. When two events have equal scalar timestamps, the tie-break $(t, i)$ falls through to the process ID. This gives a total order but is arbitrary across processes.
  • Vector-clock storage explosion in systems with thousands of processes. Use Singhal-Kshemkalyani (send only changed entries) or matrix-clock GC.
  • Singhal-Kshemkalyani without FIFO corrupts vector reconstruction at the recipient.
  • Cristian's with asymmetric routes breaks the $RTT/2$ assumption — without symmetry, the best guarantee is only the interval $[T_s + T_{min},\ T_s + RTT - T_{min}]$.
  • Berkeley with a malicious slave reporting a wildly wrong time can poison the average. Use outlier rejection.

Common mistakes

  • 'Scalar clocks are strongly consistent.' Wrong — only forward direction holds. Concurrent events can have ordered scalar timestamps.
  • 'Vector R2 is just max(V, Vm).' Wrong — must also increment own component after taking max.
  • 'NTP gives perfect sync.' Wrong — only approximate (ms on WAN, sub-ms on LAN). Not sufficient for causality-sensitive DS algorithms.
  • **Confusing tie-break with global ordering of events.** The tie-break gives a *total* order, but it does not reflect causality between concurrent events.
  • 'Singhal-Kshemkalyani works on non-FIFO channels.' Wrong — it requires FIFO.
  • **Treating matrix-clock entries as 'i's clock from j's view'** — backwards. $mt_i[j, k]$ = $p_i$'s knowledge of $p_j$'s knowledge of $p_k$'s clock.

Shortcuts

  • Scalar R2: $C_i := \max(C_i, C_m) + d$.
  • Vector R2: componentwise max, then ++own.
  • Strong consistency: vector ✓, matrix ✓, scalar ✗.
  • Singhal-Kshemkalyani trio: LS, LU, FIFO required.
  • Matrix GC: $\min_j mt_i[j, i] \ge t$ → discard messages timestamped $\le t$.
  • Cristian's: clock $:= T_s + RTT/2$, single master, UTC.
  • Berkeley: poll + average, no UTC, LAN-only.
  • NTP strata: 0 atomic → 1 primary → 2+ secondary. ms WAN, sub-ms LAN.

Proofs / Algorithms

Vector clock is strongly consistent. Forward: if $a \to b$ via a chain of events, each step preserves componentwise $\le$ (R1 only increments the process's own component, R2 only takes maxima). The originating process's component strictly increments at $a$, and this strict increase propagates along the chain, so $V(a) < V(b)$. Backward: if $V(a) < V(b)$ componentwise, consider component $i$ where $p_i$ is $a$'s process: $V(b)[i] \ge V(a)[i]$ means either $b$ is at $p_i$ after $a$ or $b$'s process received a message chain originating from $p_i$ at or after $a$. By construction $a \to b$.

Matrix-clock GC correctness. When $\min_j mt_i[j, i] \ge t$, every process $p_j$ has $mt_i[j, i] \ge t$, meaning $p_i$ has received evidence that $p_j$'s vector clock shows $V_j[i] \ge t$. Since vector clocks are strongly consistent, this means every message $p_i$ sent with timestamp $\le t$ has been acknowledged transitively by every $p_j$. So $p_i$ can discard its sent buffer up to time $t$.