Distributed Systems

CS3.401
Prof. Kishore Kothapalli · Monsoon 2025-26 · 4 credits

GFS — Architecture, Reads, Writes, Consistency, Recovery

Unit 11 — Google File System (GFS)

A File System For A Datacenter

In 2003, Google needed to store everything: the web crawl (multi-petabyte), search indices, Gmail messages. Traditional file systems were built around the assumption that disks rarely fail and that file accesses are mostly small random IO. Google's workload was the opposite: thousands of commodity machines that fail constantly, multi-GB files, mostly streaming reads + large sequential appends.

GFS (Ghemawat, Gobioff, Leung 2003) is the file system they built. The design choices look weird at first glance but are exactly right for the workload.

The Workload Assumptions

  • Commodity hardware → failure is the norm. 1M servers × 1 failure / 1000 days = 1000 failures/day.
  • Files are multi-GB; total data multi-PB.
  • Workload: large streaming reads + large sequential appends. Small random IO is rare.
  • Many concurrent appends to the same file (logs, producer-consumer queues).
  • High sustained bandwidth matters more than low latency.

These assumptions justify everything that follows.

Three Components

Single Master: stores all metadata (file namespaces, file→chunk mapping, chunk locations, version numbers). Coordinates the system.

Multiple Chunkservers: store actual file data, grouped into racks. Default 3 replicas per chunk on different racks.

Multiple Clients: applications using a GFS client library. Contact master for metadata, then talk DIRECTLY to chunkservers for data. Data never flows through the master.

64 MB Chunks — Why So Big?

Most file systems use 4 KB blocks. GFS uses 64 MB chunks. The math:

  • Pro: a 1 PB file is ~16 million chunks at 64 MB — manageable for the master. At 4 KB blocks it would be ~256 billion blocks — the master would drown in metadata (worked through in the sketch after this list).
  • Pro: fewer client-master interactions; large streaming reads/writes more efficient; reduces TCP overhead.
  • Con: small files become single chunks → hotspots if many clients hammer one. Mitigation: raise replication factor for hot small files.
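To make the arithmetic concrete, a quick back-of-the-envelope sketch in Python. The ~64 bytes of metadata per chunk is an assumed ballpark (the paper quotes "less than 64 bytes"), and 1 PB is taken as 2^50 bytes:

```python
# Back-of-the-envelope: how much metadata must the master hold for 1 PB of data?
FILE_BYTES = 2**50           # 1 PB, illustrative
GFS_CHUNK = 64 * 2**20       # 64 MB chunks
FS_BLOCK = 4 * 2**10         # 4 KB blocks, for comparison

gfs_chunks = FILE_BYTES // GFS_CHUNK   # ~16.8 million
fs_blocks = FILE_BYTES // FS_BLOCK     # ~275 billion

META_PER_CHUNK = 64  # assumed bytes of master metadata per chunk/block
print(f"64 MB chunks: {gfs_chunks:,} -> ~{gfs_chunks * META_PER_CHUNK / 2**30:.1f} GiB of metadata")
print(f"4 KB blocks : {fs_blocks:,} -> ~{fs_blocks * META_PER_CHUNK / 2**40:.1f} TiB of metadata")
```

About 1 GiB of metadata fits comfortably in one machine's memory; tens of TiB do not — hence 64 MB chunks.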

What The Master Stores

Three things, all in MEMORY for speed:

1. File and chunk namespaces — logged for persistence.
2. File → chunk handle mapping — logged.
3. Chunk replica locations — NOT logged; master rebuilds them from heartbeats on restart.

Why not log locations? Chunkservers are the *authoritative* source — they may add/remove/lose chunks independently. A logged location would be stale on restart. Heartbeat-rebuilt locations are always current.

Also tracked: chunk version numbers (logged), current primary per chunk, lease expiry times.
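A minimal sketch of how the master's in-memory state might be organised. All class and field names here are invented for illustration; the real structures are internal to Google:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int                 # logged; bumped on every new lease
    replicas: list = field(default_factory=list)  # chunkserver stubs; NOT logged, rebuilt from heartbeats
    primary: object = None       # current lease holder, if any
    lease_expiry: float = 0.0    # wall-clock time the lease expires

class MasterState:
    def __init__(self):
        # 1. namespace and 2. file -> chunk handles (both persisted via the operation log)
        self.namespace: dict[str, list[int]] = {}   # e.g. "/logs/crawl-00" -> [handle, handle, ...]
        # 3. per-chunk state; replica locations live only in memory
        self.chunks: dict[int, ChunkInfo] = {}

    def on_heartbeat(self, chunkserver, reported: dict[int, int]):
        """Chunkserver reports {chunk_handle: version}; master refreshes replica locations."""
        for handle, version in reported.items():
            info = self.chunks.get(handle)
            if info and version == info.version and chunkserver not in info.replicas:
                info.replicas.append(chunkserver)
```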

Read Flow

1. Client computes chunk index = floor(byte offset / 64 MB).
2. Client → master: (filename, chunk_index).
3. Master → client: (chunk_handle, list of replica locations).
4. Client caches the location info.
5. Client → nearest replica with (chunk_handle, byte range). Replica returns data.

Master not contacted again for the same chunk until cache invalidated.
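A client-side sketch of the read path, assuming hypothetical master and chunkserver stubs (lookup and read_chunk are invented names), and assuming the requested range stays within one chunk:

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB

class GFSClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk_index) -> (chunk_handle, replica list)

    def read(self, filename: str, offset: int, length: int) -> bytes:
        # 1. Translate the byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        # 2-4. Ask the master only on a cache miss; cache its answer.
        if key not in self.cache:
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        # 5. Fetch the byte range directly from the nearest replica; no data flows via the master.
        nearest = replicas[0]  # assume the list is sorted by network distance
        return nearest.read_chunk(handle, offset % CHUNK_SIZE, length)
```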

Write Flow — Leases And Pipelining

1. Client → master: who's the primary for this chunk? If no primary, master picks one up-to-date replica, grants a lease (default 60 s), increments the chunk version.
2. Master → client: primary + secondary locations.
3. Client pushes data along a pipelined chain of closest replicas — each forwards to the next-nearest. All cache the data.
4. Once all replicas have the data, client sends a write request to the primary.
5. Primary serialises concurrent writes into an order, applies locally, forwards the order to secondaries with serial numbers.
6. Secondaries apply in that order and ACK.
7. All ACK → primary returns SUCCESS to client.

Why this split? Data flow (bytes) is heavy and benefits from pipelining (each replica forwards while still receiving — full bandwidth utilisation). Control flow (the actual write command) is small and star-shaped (client → primary → secondaries). The separation is GFS's key bandwidth optimisation.
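A sketch of the primary's half of the control flow (steps 5-7), with invented class and method names. By the time apply_write runs, the data has already been pushed to every replica over the pipeline:

```python
import itertools

class PrimaryReplica:
    """Holds the lease for one chunk and serialises every mutation on it."""
    def __init__(self, chunk_handle, secondaries):
        self.chunk_handle = chunk_handle
        self.secondaries = secondaries     # stubs for the other replicas of this chunk
        self.serial = itertools.count(1)   # mutation order, valid for this lease only
        self.applied = []                  # (serial, data_id) pairs applied locally

    def apply_write(self, data_id):
        # Step 5: assign the next serial number and apply the cached data locally.
        seq = next(self.serial)
        self.applied.append((seq, data_id))
        # Steps 5-6: forward only the small control message; the bytes were already pipelined.
        acks = [s.apply(self.chunk_handle, data_id, seq) for s in self.secondaries]
        # Step 7: the client sees SUCCESS only if every replica applied the mutation.
        return "SUCCESS" if all(acks) else "FAILURE"
```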

Leases — A Lightweight Coordination Trick

The master grants a lease to one replica = the primary for that chunk. During the lease (60s by default), the primary serialises all mutations for that chunk — without the master being involved in every write. Leases renew via heartbeat.

This is how GFS achieves high write throughput without a master bottleneck.
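A minimal sketch of lease granting on the master, reusing the MasterState/ChunkInfo sketch above; the 60-second duration comes from the notes, everything else is illustrative:

```python
import time

LEASE_SECONDS = 60  # default lease length

def grant_lease(master, handle):
    """Pick an up-to-date replica as primary and bump the chunk version (the version is logged)."""
    info = master.chunks[handle]
    now = time.time()
    if info.primary is not None and info.lease_expiry > now:
        return info.primary, info.replicas        # an existing lease is still valid
    info.version += 1                             # new lease => new version number
    info.primary = info.replicas[0]               # any replica holding the current version will do
    info.lease_expiry = now + LEASE_SECONDS       # renewed later via heartbeat
    return info.primary, info.replicas
```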

Atomic Record Append

The killer GFS primitive. Client supplies only the *data*; GFS picks the offset and atomically appends the record at least once.

Why this matters: many clients can concurrently append to the same file (a log, a queue) without distributed locking. GFS handles the offset arbitration internally via the primary.

Retry semantics: if the append fails at any replica, the primary returns failure and the client retries. The retry lands at a NEW offset (the primary's end-of-region offset has advanced), so the record may appear at two offsets (the failed attempt with partial data plus the successful one). This is why record-append regions are described as defined interspersed with inconsistent.

Applications cope by embedding record-level checksums and unique IDs.
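Because record append is at-least-once, readers must tolerate padding, partial records, and duplicates. A sketch of the application-side convention the notes describe; the framing format (length + 16-byte record ID + payload + MD5) is an assumption, not mandated by GFS, and a production reader would also resync more carefully after garbage:

```python
import hashlib
import struct

def pack_record(record_id: bytes, payload: bytes) -> bytes:
    """Frame a record as [len][id][payload][md5] so readers can detect junk."""
    body = record_id + payload
    return struct.pack(">I", len(body)) + body + hashlib.md5(body).digest()

def iter_valid_records(region: bytes, id_len: int = 16):
    """Yield each distinct valid payload once, skipping padding, partial data, and retry duplicates."""
    seen, pos = set(), 0
    while pos + 4 <= len(region):
        (length,) = struct.unpack_from(">I", region, pos)
        body = region[pos + 4 : pos + 4 + length]
        digest = region[pos + 4 + length : pos + 4 + length + 16]
        pos += 4 + length + 16
        if len(body) < length or hashlib.md5(body).digest() != digest:
            continue  # padding or a partially written record from a failed append
        record_id = body[:id_len]
        if record_id in seen:
            continue  # duplicate left behind by a client retry
        seen.add(record_id)
        yield body[id_len:]
```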

Four Consistency States

After a mutation, a file region is in one of four states:

| State | Meaning |
|---|---|
| Defined | Consistent + reflects the mutation entirely (a single writer's data is preserved) |
| Consistent | All replicas see the same bytes (but the region may not be one clean write) |
| Undefined | Consistent BUT the bytes are mingled fragments from concurrent writes |
| Inconsistent | Replicas see DIFFERENT data (failed mutation or stale replica) |

Outcomes per scenario:

  • Serial successful write → defined.
  • Concurrent successful writes → consistent but undefined (mingled).
  • Atomic record append (success) → defined.
  • Atomic record append with retry → defined interspersed with inconsistent (duplicates + padding).
  • Failed mutation → inconsistent.

This relaxed model is the trade-off for GFS's performance. Applications are expected to handle it.

Garbage Collection

Deleted files: renamed to a hidden name + kept for 3 days (for accidental-delete recovery). Periodic scan removes after 3 days.

Orphaned chunks: master tells the chunkserver via heartbeat which chunks it doesn't recognise; chunkserver deletes them.

Stale replicas: detected via chunk version number. Each chunk has a version; master increments on every new lease. A replica that missed an update carries an old version → master detects from heartbeat → garbage-collected.
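A sketch of stale-replica and orphaned-chunk detection during a heartbeat, reusing the MasterState sketch above (names are illustrative):

```python
def check_heartbeat(master, chunkserver, reported: dict[int, int]):
    """Compare each reported chunk version against the master's logged version."""
    to_delete = []
    for handle, version in reported.items():
        info = master.chunks.get(handle)
        if info is None:
            to_delete.append(handle)       # orphaned chunk: the master no longer knows it
        elif version < info.version:
            to_delete.append(handle)       # stale replica: it missed a mutation while down
            if chunkserver in info.replicas:
                info.replicas.remove(chunkserver)
    # The list rides back on the heartbeat reply; the chunkserver deletes those chunks.
    return to_delete
```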

Snapshots — Copy-On-Write

GFS snapshots are cheap. On snapshot request:

1. Master duplicates the metadata for the source file/directory tree.
2. Increments reference counts on the chunks.
3. Chunks are NOT copied immediately.

On the first subsequent write to a chunk with refcount > 1: chunkserver makes a LOCAL copy (no network transfer); master updates metadata to point the snapshot at the new copy.

The snapshot itself touches only metadata; no chunk data is copied until something actually changes.
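A sketch of the copy-on-write bookkeeping, building on the MasterState sketch above. The refcount table, new_handle, and local_copy are invented names standing in for whatever the real system does:

```python
def snapshot(master, src_prefix: str, dst_prefix: str):
    """Metadata-only: duplicate the namespace entries and bump chunk refcounts."""
    for path, handles in list(master.namespace.items()):
        if path.startswith(src_prefix):
            master.namespace[dst_prefix + path[len(src_prefix):]] = list(handles)
            for h in handles:
                master.refcount[h] = master.refcount.get(h, 1) + 1

def before_write(master, path: str, chunk_index: int):
    """First write after a snapshot: give the writer a private copy of the shared chunk."""
    handle = master.namespace[path][chunk_index]
    if master.refcount.get(handle, 1) > 1:
        new_handle = master.new_handle()                  # fresh chunk handle
        # Each replica copies the chunk on its own disk; no network transfer needed.
        for server in master.chunks[handle].replicas:
            server.local_copy(handle, new_handle)
        master.refcount[handle] -= 1
        master.namespace[path][chunk_index] = new_handle  # the writer now mutates the private copy
```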

Fault Tolerance Mechanisms

  • Fast recovery: master and chunkservers restart in seconds.
  • Chunk replication: 3× by default, across racks.
  • Master replication: operation log + periodic checkpoints replicated to multiple machines; a shadow master serves read-only metadata if the master is down.
  • Data integrity: every chunkserver maintains a checksum per 64 KB block, verified on every read. Bad blocks are reported to the master → re-replicated from a good copy (a small sketch follows this list).
  • Diagnostic logs: chunkservers log all RPC requests/replies for post-mortem analysis.
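A minimal sketch of per-block checksumming on a chunkserver. CRC-32 and the in-memory layout are assumptions; the point is only that every 64 KB block carries its own checksum and every read verifies the blocks it touches:

```python
import zlib

BLOCK = 64 * 2**10  # 64 KB checksum granularity

def checksum_chunk(data: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data: bytes, checksums: list[int], offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before returning bytes."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}: report to master and re-replicate this chunk")
    return data[offset:offset + length]
```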

Where GFS Shows Up Today

  • HDFS (Hadoop Distributed File System) is essentially GFS reimplemented. 128 MB chunks instead of 64 MB. NameNode = master; DataNode = chunkserver.
  • Colossus is Google's successor to GFS — addresses GFS's single-master bottleneck with a sharded metadata layer.

What You Walk In Carrying

  • GFS's workload assumptions (large files, mostly append, failure as the norm).
  • Three-component architecture (single master + chunkservers + clients; data never flows through the master).
  • 64 MB chunks + 3× replication across racks + the trade-off.
  • What the master stores (in-memory; locations NOT logged, rebuilt from heartbeats).
  • Read flow + write flow with the lease + pipelined data + star control.
  • Atomic record append semantics + at-least-once + the resulting regions.
  • Four consistency states + the table of outcomes.
  • Pipelined data flow vs star control flow + why.
  • Garbage collection (3-day hidden retention + stale replica detection via version number).
  • Copy-on-write snapshots.
  • Fault tolerance mechanisms.