
Distributed Systems

CS3.401
Prof. Kishore Kothapalli, Monsoon 2025-26, 4 credits
Revision Notes / Unit 11 — Google File System (GFS)

GFS — Architecture, Reads, Writes, Consistency, Recovery


Intuition

GFS (2003) is the design Google used at the time to store data for search and Gmail. The radical choices: a single master for all metadata (many believed a single master couldn't scale); 64 MB chunks (vs typical 4 KB block sizes); treating failure as the norm (commodity hardware in racks); atomic record append as the primary write operation (vs random in-place writes). The 'inconsistent' / 'undefined' regions in the consistency model are unusual — applications are expected to cope.

Explanation

Why a distributed file system? Files too large for one disk; need horizontal scaling of storage. With many commodity machines, failures are the norm, not the exception — 1M servers × 1 failure/1000 days ⇒ 1000 failures/day. Persistence and availability impossible on a single failing node. Network bandwidth is precious — minimise data movement.

GFS assumptions about its environment. Commodity hardware (cheap, failure-prone). Component failure is the norm — must self-heal. TBs of storage, multi-GB files. Workload: large streaming reads and large sequential appends are common; small random reads exist but rare. Many concurrent appends to the same file. High sustained bandwidth > low latency.

GFS architecture — three components. Single Master — stores all metadata, coordinates the system. Multiple Chunkservers — store actual file data; grouped into racks; connected through switches. Multiple Clients — applications using a GFS client library. Clients contact master for metadata, then talk directly to chunkservers for read/write. Data never flows through the master.

How files are stored. Each file split into fixed-size 64 MB chunks. Each chunk has a globally unique 64-bit chunk handle. Each chunk is replicated (default 3 times) across chunkservers on different racks for fault tolerance. Chunks stored on chunkservers as regular Linux files.

What does the master store? Three types of metadata, all kept in memory for speed: (1) File and chunk namespaces (logged for persistence). (2) Mapping from files to chunk handles (logged). (3) Locations of chunk replicas — NOT logged (master asks chunkservers via heartbeat on restart, because chunkservers' disks may fail). Also tracks: version number of each chunk (logged), current primary, lease expiry time.
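A minimal sketch of what these in-memory tables could look like; all class and field names here are illustrative, not the real (C++) master's data structures:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ChunkInfo:
    version: int                                        # logged; bumped on every new lease grant
    replicas: List[str] = field(default_factory=list)   # NOT logged; rebuilt from heartbeats
    primary: Optional[str] = None                       # current lease holder, if any
    lease_expiry: float = 0.0                           # when the current lease runs out

@dataclass
class MasterMetadata:
    # file namespace: path -> ordered list of 64-bit chunk handles (logged)
    file_to_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # per-chunk state, keyed by chunk handle
    chunks: Dict[int, ChunkInfo] = field(default_factory=dict)

    def handle_heartbeat(self, chunkserver: str, reported: List[int]) -> List[int]:
        """Rebuild replica locations from a chunkserver's report; return the handles
        the master does not recognise so the chunkserver can delete them."""
        orphans = []
        for handle in reported:
            info = self.chunks.get(handle)
            if info is None:
                orphans.append(handle)                  # orphaned chunk: garbage-collect
            elif chunkserver not in info.replicas:
                info.replicas.append(chunkserver)       # learn this replica's location
        return orphans
```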

Master responsibilities. Replica placement, new chunk + replica creation, load balancing, garbage collection / unused storage reclaim, lease management.

Why a single master? Drawbacks? Simplicity — single source of truth for metadata; easy to make globally optimal placement/replication decisions. Drawbacks: single point of failure (mitigated by replicating operation log and checkpointing master state). Potential bottleneck (mitigated because clients cache locations and bypass master for data).

Read operation — step by step. (1) Client computes chunk index from (filename, byte offset) using fixed 64 MB chunk size. (2) Client sends (filename, chunk_index) to master. (3) Master replies with (chunk handle, list of replica locations). (4) Client caches this info using (filename, chunk_index) as key. (5) Client contacts the nearest replica with (chunk handle, byte range) and gets the data.
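A hedged sketch of that client-side read path; master.lookup, pick_nearest, and fetch_from_replica are hypothetical stubs standing in for the client library's RPCs:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

location_cache = {}  # (filename, chunk_index) -> (chunk_handle, replica_list)

def gfs_read(filename, offset, length, master, pick_nearest, fetch_from_replica):
    chunk_index = offset // CHUNK_SIZE                 # (1) which chunk holds this byte
    key = (filename, chunk_index)
    if key not in location_cache:                      # (2)-(4) ask the master once, then cache
        handle, replicas = master.lookup(filename, chunk_index)
        location_cache[key] = (handle, replicas)
    handle, replicas = location_cache[key]
    start = offset % CHUNK_SIZE                        # byte range within the chunk
    replica = pick_nearest(replicas)                   # (5) read from the closest replica
    return fetch_from_replica(replica, handle, start, length)
```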

Write operation — step by step. (1) Client asks master: who is the primary for this chunk? If no primary exists, master picks one up-to-date replica, grants it a lease, increments chunk version. (2) Master returns primary + secondary locations; client caches. (3) Client pushes data to all replicas in a pipelined chain along the closest path (each replica forwards to next nearest). Data sits in each replica's LRU cache. (4) Once all replicas ACK data receipt, client sends a write request to the primary. (5) Primary serialises all concurrent write requests into an order, applies them locally, then forwards the serial order to all secondaries. (6) Secondaries apply in that order and ACK back. (7) All ACK → primary returns SUCCESS. Any fail → primary returns failure; client retries (region inconsistent in the meantime).
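A sketch of how the primary might serialise concurrent mutations during its lease (steps 5-7); apply_locally and forward_to_secondary are hypothetical callbacks, not real GFS interfaces:

```python
import threading

class Primary:
    def __init__(self, secondaries, apply_locally, forward_to_secondary):
        self.lock = threading.Lock()
        self.next_serial = 0                      # single mutation order for this chunk
        self.secondaries = secondaries
        self.apply_locally = apply_locally
        self.forward = forward_to_secondary

    def write(self, chunk_handle, mutation):
        with self.lock:                           # (5) pick one serial order for all writers
            serial = self.next_serial
            self.next_serial += 1
            self.apply_locally(chunk_handle, serial, mutation)
        acks = [self.forward(s, chunk_handle, serial, mutation)   # (6) secondaries apply
                for s in self.secondaries]                        #     in that order and ACK
        return all(acks)                          # (7) any failure -> client retries
```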

Leases — purpose and mechanism. The master grants a lease on a chunk to one replica, designating it the primary for some duration (default 60 s). The primary serialises all mutations for that chunk during the lease, ensuring a single consistent mutation order across replicas WITHOUT the master being involved in every write. Leases renewed via heartbeat.
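A toy illustration of the lease bookkeeping this implies at the master, assuming a chunk record with version, primary, and lease_expiry fields (as in the ChunkInfo sketch above); the 60 s default comes from the notes, the function names are made up:

```python
import time

LEASE_SECS = 60  # default lease duration per the notes

def grant_lease(chunk, up_to_date_replicas, now=None):
    """Pick one up-to-date replica as primary and start a 60 s lease."""
    now = time.time() if now is None else now
    chunk.version += 1                         # new lease -> new chunk version (logged)
    chunk.primary = up_to_date_replicas[0]
    chunk.lease_expiry = now + LEASE_SECS
    return chunk.primary

def renew_lease(chunk, now=None):
    """Renewal piggybacked on a heartbeat while the lease is still live."""
    now = time.time() if now is None else now
    if chunk.primary is not None and now < chunk.lease_expiry:
        chunk.lease_expiry = now + LEASE_SECS
        return True
    return False                               # expired: master may grant a fresh lease
```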

Four consistency states (after mutation). Defined — consistent AND clients see what the mutation wrote in its entirety. Consistent — all clients see the same data regardless of which replica they read. Undefined — consistent BUT may not reflect any single mutation (e.g., mingled fragments from concurrent writes). Inconsistent — clients see DIFFERENT data at different replicas.

Mutation outcomes table. Serial successful write → defined. Concurrent successful writes → consistent but undefined (interleaved fragments from different clients). Serial or concurrent successful record append → defined interspersed with inconsistent (retries leave padding/duplicates at failed replicas). Failed mutation → inconsistent.

Atomic record append. Client supplies only the data; GFS picks the offset and atomically appends the record at least once. Avoids the need for distributed locking when many clients append concurrently (e.g., logging, producer-consumer queues). The chunk is padded if the record would cross a 64 MB boundary, and the client retries on the next chunk. Outcome region: defined-interspersed-with-inconsistent (because of retry padding).
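A sketch of how an application might cope with at-least-once append: each record carries an ID and checksum, and readers skip padding and duplicates. The record_append callable is a hypothetical stand-in for the client library's append call:

```python
import struct
import uuid
import zlib

def encode_record(payload: bytes) -> bytes:
    rec_id = uuid.uuid4().bytes                      # lets readers drop duplicated retries
    body = rec_id + payload
    return struct.pack(">II", len(body), zlib.crc32(body)) + body

def append_with_retry(record_append, path: str, payload: bytes, max_tries: int = 5) -> int:
    record = encode_record(payload)
    for _ in range(max_tries):
        ok, offset = record_append(path, record)     # GFS picks the offset, not the client
        if ok:
            return offset
    raise IOError("record append failed")            # earlier tries may have left padding

def decode_records(data: bytes):
    """Reader side: verify checksums, skip garbage/padding, dedupe by record ID."""
    seen, pos = set(), 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack(">II", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if len(body) == length and zlib.crc32(body) == crc:
            rec_id, payload = body[:16], body[16:]
            if rec_id not in seen:                   # duplicate from a retried append
                seen.add(rec_id)
                yield payload
            pos += 8 + length
        else:
            pos += 1                                 # bad bytes: resync one byte at a time
```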

Data flow vs control flow — why pipelined? Decoupled: Control flow (write request) goes client → primary → secondaries (small message, star). Data flow (bytes) is pipelined linearly along the closest chain of replicas over TCP, NOT broadcast in star form. Fully utilises each machine's outbound bandwidth; avoids bottlenecks; minimises latency.

Garbage collection. Deleted files renamed to a hidden name, kept for 3 days for accidental-delete recovery. Regular scan removes hidden files older than 3 days. Orphaned chunks (no file points to them) and stale replicas (out-of-date version number) reclaimed via heartbeat: master tells the chunkserver which chunks it doesn't recognise, and the chunkserver deletes them.

Stale replica detection. Each chunk has a version number. When the master grants a new lease, it increments the version. Replicas that miss the update (e.g., were down) carry an old version → master detects this from heartbeat and treats them as stale → garbage-collected.
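A small illustration of that check during heartbeat processing, assuming ChunkInfo-style records as in the earlier master sketch:

```python
def find_stale_replicas(reported, chunks):
    """reported: (chunk_handle, version) pairs from one chunkserver's heartbeat."""
    stale = []
    for handle, version in reported:
        info = chunks.get(handle)
        if info is not None and version < info.version:
            stale.append(handle)      # missed a lease grant while down -> garbage-collect
    return stale
```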

Fault tolerance mechanisms. Fast recovery — both master and chunkservers restart in seconds. Chunk replication — default 3 replicas on different racks. Master replication — operation log + periodic checkpoints replicated to multiple machines; a 'shadow' master serves read-only metadata if the master is down. Data integrity — every chunkserver maintains a checksum per 64 KB block; verified on every read. Bad blocks reported to master, which re-replicates from a good copy. Diagnostic logs — chunkservers log all RPC requests/replies for post-mortem.
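A sketch of the per-64 KB-block checksum check on a read, with CRC32 standing in for whatever checksum the real chunkserver uses:

```python
import zlib

BLOCK = 64 * 1024  # checksums kept per 64 KB block

def build_checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK]) for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data: bytes, checksums, offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before returning any bytes."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}: report to master, re-replicate from a good copy")
    return chunk_data[offset:offset + length]
```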

GFS snapshots — copy-on-write. On a snapshot request: master duplicates the metadata for the source file/directory tree and increments reference counts on the chunks. Chunks are NOT copied immediately. On the first subsequent write to a chunk with refcount > 1, the chunkserver makes a local copy (no network transfer) and the master updates the metadata. Snapshots are effectively O(1) until something is written.
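A toy model of the copy-on-write bookkeeping, using plain dicts for metadata and refcounts; clone_chunk_locally is a hypothetical stand-in for the chunkserver-local copy:

```python
def snapshot(metadata, src_path, snap_path, refcounts):
    """O(1) snapshot: the new path shares the same chunk handles; just bump refcounts."""
    metadata[snap_path] = list(metadata[src_path])
    for h in metadata[snap_path]:
        refcounts[h] = refcounts.get(h, 1) + 1

def write_to_chunk(metadata, path, chunk_idx, refcounts, clone_chunk_locally):
    """First write after a snapshot: clone the shared chunk, then write the private copy."""
    handle = metadata[path][chunk_idx]
    if refcounts.get(handle, 1) > 1:                 # chunk still shared with a snapshot
        new_handle = clone_chunk_locally(handle)     # chunkserver-local copy, no network transfer
        refcounts[handle] -= 1
        refcounts[new_handle] = 1
        metadata[path][chunk_idx] = new_handle       # only this file's mapping changes
        handle = new_handle
    return handle
```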

Strengths and trade-offs of GFS. Strengths: very high sustained bandwidth, scales to PBs, tolerates frequent failures, simple design. Trade-offs: relaxed consistency model (defined / undefined / inconsistent regions) — applications must handle duplicates, padding, and stale reads. Single master can be a metadata bottleneck for very small files. Designed for append-mostly, large-file workloads — bad for random writes or many tiny files.

Definitions

  • GFS architecture: Single master (metadata + log) + chunkservers (64 MB chunks, 3× replicas across racks) + clients (library). Data never flows through the master.
  • Chunk: Fixed-size 64 MB unit of storage. Each has a 64-bit unique handle. Replicated 3× by default across racks.
  • Master metadata: File and chunk namespaces (logged); file → chunk handle mapping (logged); chunk replica locations (NOT logged, rebuilt from heartbeats); version number (logged), current primary, lease expiry.
  • Lease: Master grants to one replica = primary, default 60 s. Primary serialises mutations during the lease. Decouples master from per-write activity. Renewable via heartbeat.
  • Atomic record append: GFS picks the offset; appends the record at least once. Avoids distributed locking for many-writer logs. May produce duplicates + padding (defined interspersed with inconsistent).
  • Consistency state — defined: Consistent (same bytes everywhere) AND reflects the writer's mutation in its entirety (no mingling).
  • Consistency state — consistent but undefined: All replicas see the same bytes, but the bytes don't reflect any single writer's mutation in full — mingled fragments from concurrent writers.
  • Consistency state — inconsistent: Replicas see DIFFERENT bytes. Result of a failed mutation or a stale replica.
  • Pipelined data flow: Bytes flow linearly through the replica chain along the closest path; control flows separately in a star (client → primary → secondaries). Maximises bandwidth utilisation.
  • Copy-on-write snapshot: Master duplicates metadata + increments chunk refcounts. Chunks copied locally only on first subsequent write — snapshot is O(1) until something changes.
  • Stale replica detection: Each chunk has a version number; master increments it on lease grant; replicas with old versions detected via heartbeat and garbage-collected.
  • Garbage collection (GFS): Deleted files renamed hidden + kept 3 days; periodic scan removes them. Orphaned chunks + stale replicas reclaimed via heartbeat-driven master commands.

Formulas

Derivations

Why 64 MB chunks? Trade-off analysis. Pro: fewer metadata entries at the master (smaller memory footprint: at 64 MB per chunk, a 1 PB file is about 16 M chunks, versus roughly 275 billion entries a 4 KB block size would need); fewer client-master interactions; suits sustained large reads/writes; reduces TCP connection overhead. Con: small files become single chunks → hotspots if many clients hit the same small file (mitigated by a higher replication factor for hot files).
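The arithmetic behind the metadata-footprint claim, using binary units:

```python
PB = 2 ** 50
print(PB // (64 * 2 ** 20))   # 16_777_216       -> ~16 M chunks at 64 MB each
print(PB // (4 * 2 ** 10))    # 274_877_906_944  -> ~275 billion blocks at 4 KB each
```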

Why does the master NOT persistently log chunk locations? Chunkservers are the AUTHORITATIVE source — they may add/remove/lose chunks independently of the master. Logging locations would be STALE on master restart. Instead, chunkservers report their chunks via heartbeat at startup and continuously; the master rebuilds the location map dynamically.

Atomic record append at-least-once. Client appends record R. The primary picks an offset o1, applies R there, and forwards the request to the secondaries. If a secondary fails, the primary reports failure to the client and the client retries. The retry appends at a NEW offset o2 > o1 (the primary has already advanced past o1). So R appears at both o1 and o2 on the replicas where both attempts succeeded — a duplicate — while the replica that failed has garbage or padding at o1. Hence 'defined interspersed with inconsistent'.

Why pipelined data flow is faster than star. Star: the client uploads B bytes to each of the 3 replicas — 3B of the client's upload bandwidth. Pipelined: the client uploads B bytes to the nearest replica; that replica forwards to the next while it is still receiving (over TCP); the third replica receives from the second. Each machine uses both its inbound and outbound bandwidth fully. Time to fully replicate ≈ B/T plus a small per-hop latency, where T is the per-link throughput — much faster than the star's 3B/T on the client's link.
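A worked comparison under assumed numbers (64 MB pushed to 3 replicas, 100 MB/s links, 1 ms per hop; all values illustrative):

```python
B = 64      # MB to replicate
T = 100     # MB/s per link
L = 0.001   # s of latency per network hop
R = 3       # replicas

star_time = (R * B) / T          # the client's link carries the data three times
pipelined_time = B / T + R * L   # each link carries it once; hops only add latency

print(f"star: {star_time:.3f} s   pipelined: {pipelined_time:.3f} s")
# star: 1.920 s   pipelined: 0.643 s
```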

Examples

  • Read flow trace. Client wants byte 1.5 GB of /foo. Chunk index = ⌊1.5 GB / 64 MB⌋ = ⌊23.4⌋ = 23. Sends (/foo, 23) to master. Master replies: chunk handle = 0xABC, replicas on CS-A, CS-B, CS-C. Client caches. Client contacts nearest (CS-A) with (0xABC, byte range). CS-A returns the requested bytes. Master not contacted again for this chunk.
  • Write flow trace. Client writes to /foo at chunk 5. Client → master: 'who's primary for /foo:5?' Master: 'no primary, granting lease to CS-A; primary = CS-A, secondaries = CS-B, CS-C; version = 17'. Client pushes data to CS-B (nearest) → CS-B forwards to CS-A → CS-A forwards to CS-C, pipelined. All cache the data. Client → CS-A (primary): 'do the write'. CS-A picks serial order (orders concurrent writes from other clients too), applies, forwards order to CS-B, CS-C with serial #. CS-B, CS-C apply in that order, ACK. CS-A → client: SUCCESS.
  • Concurrent write consistency outcome. Two clients concurrently write overlapping regions of the same chunk of /foo. Primary picks an order. Result: all replicas see the same bytes (consistent) but the bytes are interleaved fragments from both clients (undefined — no single write's full bytes preserved). Application must detect this with checksums and retry if needed.
  • Atomic record append outcome. Client A appends 100 B to /log. Primary CS-A writes it at offset 1000; secondary CS-B fails partway. Primary returns failure. Client A retries: primary writes at offset 1200 (the offset has advanced); all succeed. /log now has client A's 100 B at offset 1200 on every replica; at offset 1000 the record is present on CS-A but missing or partial on CS-B, so that region is inconsistent across replicas. Application detects the bad record via an embedded checksum.
  • Stale replica detected. CS-B was down during a lease cycle. Chunk version 17 on CS-B but master and CS-A, CS-C at version 18. CS-B reports its chunks on heartbeat; master sees version 17 ≠ 18 → marks stale → tells CS-B to delete that replica.

Diagrams

  • GFS three-component architecture: master (metadata, log) at top; chunkservers (64 MB chunks, 3× racks) in middle; clients (with cached metadata) at bottom. Arrows: client→master for metadata only; client↔chunkserver for data.
  • Write flow with data + control split: control flow client → primary → secondaries (star). Data flow pipelined linearly along closest chain.
  • Read flow: client → master (filename, chunk_idx) → reply (handle, replicas) → client caches → client → nearest replica → data.
  • Consistency state lattice: defined ⊂ consistent; undefined ⊂ consistent but not defined; inconsistent disjoint from consistent.
  • Lease lifecycle: master grants → primary serialises → primary forwards order → secondaries apply → renew via heartbeat or expire (60s).

Edge cases

  • Master crash — operation log + checkpoints replicated; shadow master serves read-only metadata until primary restored.
  • Concurrent writes produce consistent-but-undefined regions — applications must handle (e.g., checksums in records).
  • Append retry leaves inconsistent regions on failed replicas — application detects via embedded checksums.
  • Hot small file — single chunk, many clients hammer one chunkserver. Mitigation: raise replication factor for hot files.
  • Stale replica returning after long downtime — version mismatch detected via heartbeat → garbage-collected.
  • Snapshot during active write — copy-on-write means first write after snapshot triggers chunk copy.
  • Network partition isolating master — clients can use cached metadata for a while; lease expiry forces re-contact.

Common mistakes

  • Saying 'GFS chunkservers store metadata'. No — master stores all metadata. Chunkservers store data + their own chunk listings (reported via heartbeat).
  • Saying 'GFS uses 4 KB blocks like Unix'. No — 64 MB chunks. Trade-off favours sustained large reads/writes.
  • Saying 'GFS guarantees linearisable writes'. No — atomic record append is at-least-once; concurrent writes produce undefined regions; failed mutations leave inconsistent regions. Applications must cope.
  • Confusing 'defined' and 'consistent'. Consistent = same bytes everywhere. Defined = consistent + reflects mutation entirely. Concurrent writes are consistent but undefined.
  • Saying 'data flows through master'. No — only control flow. Data flows client ↔ chunkservers directly.
  • Saying 'snapshots immediately copy all chunks'. No — copy-on-write; chunks copied only on first subsequent write.

Shortcuts

  • 64 MB chunks; 3× replicas across racks; 60 s lease; 3-day hidden retention.
  • Master metadata in MEM (logged) + chunk locations from heartbeat (not logged).
  • Read: client → master → cache → nearest replica.
  • Write: lease → push data pipelined → primary serialises → secondaries follow.
  • 4 consistency states: defined / consistent / undefined / inconsistent.
  • Atomic record append at-least-once → 'defined interspersed with inconsistent'.
  • Snapshot: copy-on-write; chunk copy on first subsequent write.
  • Stale replica: version mismatch via heartbeat → GC.

Proofs / Algorithms

Pipelined data flow optimum bandwidth. Three replicas, client uploads B bytes over links of throughput T. Star: client's outbound = 3B; the client's link is the bottleneck. Pipelined: each link carries the data once; client's outbound = B; each replica's outbound = B (forwarded to the next); each replica's inbound = B (from the previous). Total time ≈ B/T plus per-hop latency if links are equal — vs 3B/T for the star. Pipelined wins by roughly a factor of 3.

Consistency outcomes table (informal). Serial write at one primary, all replicas apply same write → defined. Concurrent writes at one primary, primary serialises but mingled bytes survive at offset boundaries → consistent but undefined. Record append at-least-once: success at all → defined at last offset; partial success → padding/duplicate at earlier offsets → inconsistent fragments. Failed mutation: replicas diverge → inconsistent.