Foundations — Definition, Motivation, Challenges, CAP
What's A Distributed System Really?
A distributed system, in Tanenbaum's words, is a group of computers working together to appear as a single computer to the end user. The whole field flows from the gap between "appear as a single computer" (what users want) and "a group of computers" (what actually exists). Closing that gap — making many fail-prone, clock-skewed, network-separated machines behave like one — is the entire job.
Five features every distributed system has whether it wants them or not: many computers, running concurrently, that fail independently, with no global clock, and no shared memory. Each of these is a problem before it's a feature. No shared memory means no semaphores → we need distributed mutual exclusion. No global clock means no trustworthy answer to "did X happen before Y?" → we invent logical clocks to recover a causal ordering. Independent failure means a single message might never arrive → we need consensus protocols that survive that.
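Logical clocks are worth seeing in ten lines. Below is a minimal sketch of a Lamport clock, the simplest fix for "no global clock"; the Process class and method names are invented for illustration, not taken from any library.

```python
# Minimal Lamport logical clock: plain counters plus one max() rule
# recover a causal "happened-before" order without a shared wall clock.
class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0  # local logical time, not wall time

    def local_event(self):
        self.clock += 1            # every local event ticks the clock
        return self.clock

    def send(self):
        self.clock += 1            # sending is itself an event...
        return self.clock          # ...and the timestamp rides along

    def recv(self, msg_ts):
        # Receive rule: jump past the sender's timestamp, then tick.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

a, b = Process("A"), Process("B")
ts = a.send()        # a.clock == 1
print(b.recv(ts))    # 2: the receive is ordered after the send
```

The point: causality ("send before receive") becomes visible in plain integers, no synchronised clocks required.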
Why Bother?
Horizontal scaling is the headline reason. You can buy bigger CPUs and more memory for a while (vertical scaling); eventually price and physics get in the way. Adding more cheap machines doesn't have that ceiling.
Some applications are inherently distributed — banking across cities, video conferencing, satellite imagery — and can't be made centralised. Some need fault tolerance: the system must survive any single node going down. Some want low latency by placing data near users. Some want to share data across organisational boundaries.
What Makes It Hard
Six concrete challenges. (1) Unreliable communication — messages can drop, get delayed, arrive out of order, even be intercepted. (2) Lack of global knowledge — each node sees only its local memory. (3) Lack of synchronisation — local clocks drift; "when did X happen?" is ambiguous. (4) Concurrency control — no shared memory means no semaphores; you build mutual exclusion from message passing alone. (5) Failure and recovery — nodes and channels fail; bringing a recovered node back in sync is its own problem. (6) Deadlocks, termination detection, file systems — every OS classic gets harder distributed.
These six are what every unit of this course is about. Logical clocks fix (3). Snapshots fix (2). Distributed mutex fixes (4). Consensus fixes (5). And so on.
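To make challenge (4) concrete, here's a toy sketch of the classic token-ring answer: one token circulates, and only its holder may enter the critical section. The queues stand in for network channels and every name is illustrative; real algorithms are more involved.

```python
# Toy token-ring mutual exclusion: no shared memory, no semaphores --
# exclusivity comes purely from there being exactly one token message.
from queue import Queue

N = 3
channels = [Queue() for _ in range(N)]   # channel i delivers to node i

def node(i, wants_cs):
    token = channels[i].get()            # wait for the token to arrive
    if wants_cs:
        print(f"node {i} in critical section")  # safe: only one token exists
    channels[(i + 1) % N].put(token)     # pass it on around the ring

channels[0].put("TOKEN")                 # exactly one token in the system
for i in range(N):                       # simulate one lap of the ring
    node(i, wants_cs=(i == 1))
```

Safety is free (one token), but note what this sketch hides: a lost token or a crashed node stalls the whole ring, which is challenge (5) again.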
Three Architecture Tiers
Centralised — single CPU, single memory, single clock. Everything in one box; one point of failure; doesn't scale; simple model.
Parallel — multiple CPUs sharing memory and a clock. Multicore desktops, multi-socket servers, single GPUs. Tightly coupled; hardware coherence guarantees you can reason about memory like a single machine.
Distributed — multiple CPUs, no shared memory, no global clock, communicate by message passing. Loosely coupled, autonomous nodes. This is the model the rest of the course assumes.
CAP — The Headline Trade-off
Brewer's CAP theorem says a networked shared-data system can guarantee only TWO of three properties: Consistency (every read sees the most recent write), Availability (every request to a non-failing node gets a timely response), Partition tolerance (the system keeps operating despite arbitrary network partitions).
The exam trap: students say "pick any two of three." That's technically the statement, but it hides the real-world punchline: partitions are unavoidable in real networks, so P is non-negotiable. Every production distributed system must choose between C and A *during a partition*. The textbook 'CA' point is only achievable when partitions can't happen (effectively a single node, or a network you're willing to pretend never fails).
Real-world placements: CP systems — Google Spanner, HBase, MongoDB with majority writes, traditional RDBMS clusters. They block during partitions to preserve consistency. AP systems — DNS, Amazon Dynamo, Cassandra, CouchDB, web caches. They keep serving (possibly stale) data and reconcile later. CA — single-site systems; can't survive partitions at all.
When picking sides, think about WHICH property your application can't live without. Selling airline seat 6B: must be consistent (no double-booking). Showing a Facebook profile: must be available (slightly stale is fine). The choice is product, not technology.
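A toy sketch of what that choice looks like in code. The Replica class and its policy flag are invented for illustration; real CP and AP systems are vastly more involved.

```python
# During a partition: a CP replica refuses to answer rather than risk
# staleness; an AP replica answers anyway and reconciles later.
class Replica:
    def __init__(self, policy):
        self.policy = policy             # "CP" or "AP"
        self.data = {"seat_6B": "free"}  # last value this replica saw
        self.partitioned = False         # cut off from the other replicas?

    def read(self, key):
        if self.partitioned and self.policy == "CP":
            raise TimeoutError("no quorum: refusing possibly-stale read")
        return self.data[key]            # AP: serve it, possibly stale

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.read("seat_6B"))                # AP: answers (Facebook profile)
try:
    cp.read("seat_6B")                   # CP: blocks/errors (airline seat)
except TimeoutError as err:
    print(err)
```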
The Two-Generals Problem
Two armies on opposite sides of a valley plan to attack a city in between. They can only communicate through couriers who must cross the valley — and might be captured. They need to agree on an attack time.
General A sends "attack at dawn." General B receives it but doesn't know if A knows it arrived, so B sends an ack. A receives the ack but doesn't know if B knows the ack arrived, so A sends an ack-of-ack. And so on, forever. No finite protocol can make both generals certain of a coordinated attack: whoever sent the last message can't know it arrived.
Take-away: deterministic consensus over an unreliable channel is impossible. Distributed systems work with bounded uncertainty — timeouts, retries, probabilistic guarantees, eventual consistency. Even TCP's three-way handshake doesn't guarantee perfect mutual agreement; the final ACK might be lost. We accept 'good enough' because perfect is provably impossible.
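The regress is easy to see numerically. A back-of-envelope sketch, assuming each courier independently gets through with probability 1 - p (the 0.1 is a made-up number):

```python
# No stack of acks ever reaches certainty: whoever sent the last message
# cannot know it arrived, and P(all k messages arrive) only shrinks with k.
p = 0.1  # per-courier capture probability (illustrative)

for k in range(1, 6):
    print(f"{k} message(s): P(all arrive) = {(1 - p) ** k:.3f}; "
          f"sender of message {k} still can't confirm delivery")
```

More acks buy confidence about earlier messages, never certainty about the last one; that's the whole theorem in one loop.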
The Two-Generals result foreshadows the FLP impossibility theorem (no deterministic consensus in an asynchronous system with even one crash failure; FLP holds even over reliable links) and explains every "blocking" failure mode in 2PC, Raft, and friends.
What You Walk In Carrying
Tanenbaum's definition + five features. Six challenges (unreliable comm, no global knowledge, no sync, concurrency, failure-recovery, OS-classics-but-harder). Centralised/Parallel/Distributed distinction with the three-property comparison. CAP: three letters, two-at-a-time, P-unavoidable-in-practice. CP vs AP examples. Two-Generals impossibility and what it means for real protocols.