
Chandy-Lamport, Lai-Yang, Acharya-Badrinath + Consistent Cuts

Unit 3 — Global Snapshots

You Can't Stop The World

To debug a distributed program — or replicate it, or recover from a crash — you'd love to take a photograph: "here's the global state at noon Tuesday." But there's no shared "noon Tuesday" in a distributed system, and stopping all processes to synchronise would defeat the purpose.

The Chandy-Lamport result (1985) is that you don't need to. You can take a snapshot of a running distributed system whose result is a consistent global state — one the system *could have been in* during an equivalent execution, even if it wasn't actually in that exact state at any wall-clock instant. The recorded state is indistinguishable from a real instantaneous state for any purpose that respects causal order.

What "Consistent" Means — C1 And C2

A snapshot is consistent when no message arrow goes future → past. Formally:

C1 (conservation): if a send is recorded, the corresponding message is either still in the channel state OR has been recorded as received. Not both (you'd be double-counting), not neither (you'd have lost it). Every sent message is somewhere.
C2 (cause-before-effect): if a send is NOT recorded, the message is NEITHER in the channel state NOR recorded as received. You can't have a receive whose send is in the future.

Together C1 and C2 guarantee the snapshot reflects a state the system could have legitimately been in.
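As a minimal sketch, C1 and C2 can be checked mechanically once each message is reduced to three facts: whether its send was recorded, whether its receive was recorded, and whether it sits in a recorded channel state. The `Cut` class, the `violations` helper, and the message ids below are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Cut:
    """A recorded global state, described per message id (hypothetical model)."""
    sends_recorded: set = field(default_factory=set)     # sends inside a local state
    receives_recorded: set = field(default_factory=set)  # receives inside a local state
    in_channel: set = field(default_factory=set)         # messages recorded as channel state


def violations(cut, all_messages):
    """Return (message, condition) pairs for every message that breaks C1 or C2."""
    problems = []
    for m in sorted(all_messages):
        in_chan = m in cut.in_channel
        received = m in cut.receives_recorded
        if m in cut.sends_recorded:
            # C1: exactly one of "in channel" / "received" must hold.
            if in_chan == received:
                problems.append((m, "C1"))
        elif in_chan or received:
            # C2: an unrecorded send may appear in neither place.
            problems.append((m, "C2"))
    return problems


messages = {"m1", "m2", "m3"}
good = Cut(sends_recorded={"m1", "m2"}, receives_recorded={"m1"}, in_channel={"m2"})
bad = Cut(sends_recorded={"m1"}, receives_recorded={"m1", "m3"})

print(violations(good, messages))  # []
print(violations(bad, messages))   # [('m3', 'C2')]
```

In the `bad` cut, `m3` is recorded as received although its send lies outside the cut, which is exactly the future-to-past arrow that C2 forbids.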

The Banking Trap

Three banks, total \$2500. $P_{1}$ records at $t_{0}$, balance \$1000. Between $t_{0}$ and $P_{2}$'s snapshot at $t_{2}$, $P_{1}$ transfers \$400 to $P_{2}$. $P_{2}$ records at $t_{2}$; its balance is now \$700. Suppose $P_{1}$ had \$150 still in flight to $P_{2}$ on channel $C_{12}$ (recorded as channel state) and $C_{21}$ has \$100 in flight.

Total: $1000 + 700 + 200 + 150 + 100 = 2150 \ne 2500$.

This is an inconsistent cut. The \$400 transfer's *send* happened after $P_{1}$'s snapshot but its *receive* happened before $P_{2}$'s snapshot, so the receive was counted (the money exists in $P_{2}$'s recorded \$700) while the send was not (the money is still in $P_{1}$'s recorded \$1000). The same \$400 appears twice, which is a C2 violation. A correct snapshot would either have both send and receive on the same side of the cut, or have the money on the channel, but not both.
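A small worked trace makes the conservation argument concrete. The sketch below uses its own hypothetical balances (1000, 700 and 800, chosen only so they sum to 2500) and a single \$400 transfer from $P_{1}$ to $P_{2}$; the function `snapshot_total` is an illustrative helper, not part of any algorithm in these notes.

```python
# Hypothetical starting balances; they sum to the true total of 2500.
balances = {"P1": 1000, "P2": 700, "P3": 800}
TRUE_TOTAL = sum(balances.values())
TRANSFER = 400  # P1 sends this amount to P2 on channel C12


def snapshot_total(p1_after_send, p2_after_receive):
    """Sum recorded balances plus recorded channel state for one cut."""
    p1 = balances["P1"] - (TRANSFER if p1_after_send else 0)
    p2 = balances["P2"] + (TRANSFER if p2_after_receive else 0)
    # The transfer is channel state only when its send is inside the cut
    # and its receive is outside it.
    c12 = TRANSFER if (p1_after_send and not p2_after_receive) else 0
    return p1 + p2 + balances["P3"] + c12


in_flight = snapshot_total(p1_after_send=True, p2_after_receive=False)
delivered = snapshot_total(p1_after_send=True, p2_after_receive=True)
inconsistent = snapshot_total(p1_after_send=False, p2_after_receive=True)

print(in_flight, delivered, inconsistent)  # 2500 2500 2900
```

Both consistent cuts preserve the total; the inconsistent one records the receive without the send and overshoots by exactly the transferred amount.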

Chandy-Lamport — FIFO Channels

The classical algorithm, by Mani Chandy and Leslie Lamport. Requires FIFO channels (a non-trivial assumption — but TCP gives FIFO).

The mechanism is a special marker message that separates pre-snapshot from post-snapshot traffic.

Initiator: record its own state. Send a MARKER on every outgoing channel BEFORE any other message.

**On receiving a MARKER on channel $C$ from $P_{j}$ :**

First marker this process has received → record own state. Set $C$ 's recorded state to $\emptyset$ (nothing came on $C$ between this process's recording and the marker — they're simultaneous). Send MARKER on all outgoing channels. Start recording incoming messages on all *other* incoming channels.
Already recorded → stop recording $C$ . $C$ 's state = whatever was received on $C$ between when this process recorded its state and now.

Termination: every process has received a MARKER on every incoming channel.
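The marker rules can be exercised in a tiny single-threaded simulation. This is a sketch under stated assumptions: three fully connected processes, one FIFO queue per ordered pair, balances as the only local state, and hypothetical names (`Process`, `Network`, `record`, `deliver`) invented for the example.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.recorded_state = None  # local snapshot; None until recorded
        self.channel_state = {}     # sender name -> messages recorded on that channel
        self.recording = set()      # incoming channels still being recorded


class Network:
    def __init__(self, balances):
        self.procs = {n: Process(n, b) for n, b in balances.items()}
        # One FIFO queue per ordered pair (sender, receiver).
        self.chan = {(a, b): deque() for a in balances for b in balances if a != b}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def record(self, name, marker_from=None):
        """Record local state, emit markers, start recording other incoming channels."""
        p = self.procs[name]
        p.recorded_state = p.balance
        for (a, b), q in self.chan.items():
            if a == name:
                q.append(MARKER)        # marker precedes any later message
            elif b == name:
                p.channel_state[a] = []
                if a != marker_from:
                    p.recording.add(a)  # record until this channel's marker arrives

    def deliver(self, src, dst):
        """Deliver the message at the head of channel src -> dst."""
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self.record(dst, marker_from=src)  # first marker: this channel is empty
            else:
                p.recording.discard(src)           # later marker: stop recording it
        else:
            p.balance += msg
            if src in p.recording:
                p.channel_state[src].append(msg)   # arrived between recording and marker

    def done(self):
        return all(p.recorded_state is not None and not p.recording
                   for p in self.procs.values())


net = Network({"P1": 1000, "P2": 700, "P3": 800})
net.send("P1", "P2", 400)  # sent before P1 records
net.record("P1")           # P1 initiates the snapshot
net.send("P2", "P1", 100)  # sent before P2 has seen any marker
while not net.done():      # drain channels in FIFO order until all markers arrive
    for (src, dst), q in net.chan.items():
        if q:
            net.deliver(src, dst)

total = sum(p.recorded_state + sum(sum(v) for v in p.channel_state.values())
            for p in net.procs.values())
print(total)  # 2500
```

In this run the \$100 from `P2` reaches `P1` after `P1` recorded but before `P2`'s marker, so it lands in the recorded state of the channel from `P2` to `P1`, and the recorded total matches the money actually in the system.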

Complexity: $O(\lvert E \rvert)$ messages (one marker per channel direction), $O(d)$ time where $d$ is network diameter.

Why FIFO?

FIFO is the algorithm's key ordering assumption (channels are also taken to be reliable, with no loss or duplication), and it is essential. Here's why.

Messages sent BEFORE the marker on a channel should arrive before the marker (and be recorded as channel state). Messages sent AFTER the marker should arrive after (and be excluded from the channel state). FIFO guarantees this ordering: anything arriving after the marker on $C$ was sent after the marker on $C$ , hence is logically post-snapshot.

Without FIFO, a post-snapshot message could overtake the marker on $C$ and be wrongly counted as channel state — corrupting the snapshot.
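The overtaking problem can be seen in a few lines. The sketch below models only the receiver's side of one channel, assuming the receiver has already recorded its own state; `recorded_channel_state` and the message labels are hypothetical.

```python
MARKER = "MARKER"


def recorded_channel_state(arrivals):
    """Messages a receiver that has already recorded logs on C until the marker arrives."""
    state = []
    for msg in arrivals:
        if msg == MARKER:
            break
        state.append(msg)
    return state


# The sender emits "pre", records its state and sends the marker, then emits "post".
fifo_arrivals = ["pre", MARKER, "post"]
reordered_arrivals = ["pre", "post", MARKER]  # "post" overtakes the marker

print(recorded_channel_state(fifo_arrivals))       # ['pre']
print(recorded_channel_state(reordered_arrivals))  # ['pre', 'post']
```

In the reordered case `post` ends up in the channel state even though its send is not in the sender's recorded state, so the snapshot violates C2.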

Lai-Yang — Non-FIFO Channels

If channels are non-FIFO (unordered, like UDP), Chandy-Lamport breaks. Lai-Yang uses message colouring instead:

Every process starts white.
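A process turns red when it records its state.
Every message carries the colour its sender had when sending it: white processes send white messages, red processes send red ones.
Every white process records its snapshot at its convenience, but no later than the instant it receives its first red message.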
Every white process records the history of all white messages it sends or receives.

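Channel state $C_{ij}$ is computed from message histories: the white messages sent by $P_{i}$ on $C_{ij}$ (all of them sent before $P_{i}$ recorded), minus the white messages received by $P_{j}$ on $C_{ij}$ before $P_{j}$ recorded. What remains is the set of white messages that had not yet reached $P_{j}$ when it turned red.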

Trade-off: works on non-FIFO channels, but needs heavy storage — each process keeps a complete history of white messages until the snapshot completes.
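As a rough illustration of the history-based computation, the sketch below takes the channel state to be the white messages $P_{i}$ sent on $C_{ij}$ minus the white messages $P_{j}$ received on it before recording. The function name `channel_state` and the message labels are hypothetical; `collections.Counter` is used so that repeated payloads are subtracted one for one.

```python
from collections import Counter


def channel_state(white_sent_by_i, white_received_by_j):
    """White messages sent on C_ij minus those P_j received while still white."""
    remaining = Counter(white_sent_by_i) - Counter(white_received_by_j)
    return sorted(remaining.elements())


sent_history = ["a", "b", "c", "d"]  # logged by P_i before it turned red
received_history = ["c", "a"]        # logged by P_j while white; order need not be FIFO

print(channel_state(sent_history, received_history))  # ['b', 'd']
```

The arrival order in `received_history` is irrelevant, which is why the approach tolerates non-FIFO channels; the cost is that both histories must be kept until the snapshot completes.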

Acharya-Badrinath — Causal Channels

If channels deliver in causal order (stronger than FIFO), there's a beautifully lightweight algorithm. Each $P_{i}$ maintains two arrays: $\mathit{SENT}_{i}[1..N]$ counts messages sent to each peer, $\mathit{RECD}_{i}[1..N]$ counts messages received from each peer.

Protocol: initiator broadcasts a token to all processes (including itself). On receiving the token, each $P_{i}$ records its local snapshot plus $\mathit{SENT}_{i}$ and $\mathit{RECD}_{i}$ and sends them back to the initiator. The initiator computes the channel state from $P_{i}$ to $P_{j}$ as messages indexed $\{\mathit{RECD}_{j}[i] + 1, \dots, \mathit{SENT}_{i}[j]\}$, that is, the messages $P_{i}$ has sent but $P_{j}$ hasn't yet received.

Complexity: $2N$ messages (token + reply per process). Much simpler than Chandy-Lamport in causal-channel systems.
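The initiator's computation is just index arithmetic on the reported counters. The sketch below assumes 0-indexed processes and made-up counter values; `channel_states` is a hypothetical helper, with `sent[i][j]` standing for $\mathit{SENT}_{i}[j]$ and `recd[j][i]` for $\mathit{RECD}_{j}[i]$.

```python
def channel_states(sent, recd):
    """Return {(i, j): sequence numbers still in transit from P_i to P_j}.

    sent[i][j] is the number of messages P_i has sent to P_j;
    recd[j][i] is the number of messages P_j has received from P_i.
    """
    n = len(sent)
    states = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                # Messages RECD_j[i] + 1 .. SENT_i[j] are sent but not yet received.
                states[(i, j)] = list(range(recd[j][i] + 1, sent[i][j] + 1))
    return states


# Hypothetical counters reported by three processes.
SENT = [[0, 5, 2],
        [1, 0, 0],
        [4, 3, 0]]
RECD = [[0, 1, 2],
        [3, 0, 3],
        [2, 0, 0]]

states = channel_states(SENT, RECD)
print(states[(0, 1)])  # [4, 5]
print(states[(2, 0)])  # [3, 4]
print(states[(0, 2)])  # []
```

No message bodies are stored anywhere: each process reports two integer arrays, and the in-transit messages are identified purely by sequence number.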

Choosing The Right Algorithm

| Algorithm | Channel | Messages | Storage |
|---|---|---|---|
| Chandy-Lamport | FIFO | $O(\lvert E \rvert)$ | None extra |
| Lai-Yang | Non-FIFO (any) | $O(\lvert E \rvert)$ | Heavy history |
| Acharya-Badrinath | Causal | $2N$ | Just counters |

In production: FIFO is common (TCP), so Chandy-Lamport dominates. Lai-Yang is mostly academic. Acharya-Badrinath wins when you already have causal delivery (e.g., via vector clocks).

What You Walk In Carrying

Why snapshots are hard: no global clock + can't stop the system + messages in transit.
Definition of a consistent cut + C1 and C2 conditions.
Banking trap example with conservation failure.
Chandy-Lamport marker rules + FIFO requirement + complexity.
Lai-Yang white/red colouring + non-FIFO support + history overhead.
Acharya-Badrinath SENT/RECD arrays + causal requirement + $2N$ complexity.
Comparison table.
Worked banking trace showing total preserved when snapshot is consistent.

Distributed Systems