Foundations of Computer Vision — Marr, Three Rs, Gestalt, Why CV is Hard
Intuition
Computer vision is the science of extracting meaning from images — answering the questions What? Where? Who? When? Why? How? and How many? from a 2D array of pixel intensities. The dirty secret of the field is that pixels are just numbers; everything else (objects, edges, intentions) must be COMPUTED. More than 50% of the human brain is devoted to vision because the problem is computationally vast, not because eyes are complicated.
Explanation
Marr's definition (1982). David Marr's one-line goal for vision: 'to know what is where, by looking.' That's the entire mission statement. 'What' is recognition (class, identity, material). 'Where' is localisation (2D/3D position). 'By looking' is the constraint: we get only images, no other sensors. Modern CV adds When (action recognition), Why (intent, scene understanding), and How many (counting).
Why >50% of the brain is devoted to vision. A frequently cited neuroanatomy figure and a strong hint that vision is harder than it feels. Evolution spent ~540 million years (since the Cambrian explosion) building this system, and in humans vision claims more cortical territory than any other sense. The 'subjective ease' of seeing fools us into underestimating the problem, which is exactly what happened in 1966.
The 1966 MIT Summer Vision Project. Seymour Papert wrote a memo proposing a small team of summer students would build 'a significant part of a visual system' between June and August. Sixty years later we are still working on these problems. The lesson the lecture wants you to draw: computer vision is not a summer project. It requires deep mathematical foundations — linear algebra, signal processing, optimisation, geometry, statistics — and was vastly underestimated by the AI pioneers.
The Three Rs (Jitendra Malik). Malik decomposes vision into three coupled sub-problems. Reorganisation — grouping pixels into meaningful regions (segmentation, edge detection, perceptual grouping; the Gestalt heritage). Recognition — connecting visual input to memory; assigning categories or identities (ImageNet classification, face recognition, scene classification). Reconstruction — measuring or recreating quantitative aspects of the scene (3D shape, depth, geometry; Structure-from-Motion, Hawk-Eye trajectory, stereo). The three are not disjoint — autonomous driving uses all three together (recognise pedestrians + segment road + reconstruct 3D distances).
Why CV is HARD: seven concrete reasons. (1) Pixels are just numbers. A computer sees only a grid of intensity values, with no notion of 'edge' or 'object' until we compute one. The semantic gap between raw intensity arrays and human concepts is enormous. (2) Intra-class variation. The 'chair' class spans hundreds of shapes (office chair, bean bag, throne, rocking chair). They share *function* (sittability; Gibson's term is affordance) far more than appearance. No fixed visual template captures all chairs. (3) Viewpoint variation. The same 3D object projects to very different 2D images at different angles. (4) Illumination. Pixel intensities are unreliable as identity cues because colour, shadow, and contrast vary drastically with lighting. (5) Occlusion and clutter. Real objects are partially hidden; backgrounds are messy (Indian roads vs KITTI is a real distribution-shift problem). (6) Scale variation. The same object may occupy 10 pixels or 10,000 pixels. (7) Ambiguity / inverse problem. Many 3D scenes can produce the same 2D image; vision is fundamentally underdetermined and requires priors.
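Reasons (1) and (4) can be made concrete with a minimal NumPy sketch. The 6x6 image, the threshold of 50, and the helper name `edge_columns` are all hypothetical choices for illustration, and horizontal finite differences are only the simplest possible edge operator.

```python
import numpy as np

def edge_columns(img, threshold=50):
    """Return column indices where the horizontal intensity jump exceeds threshold."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    grad_x = np.diff(img.astype(np.int32), axis=1)
    return np.unique(np.nonzero(np.abs(grad_x) > threshold)[1]).tolist()

# Hypothetical grayscale image: dark left half, bright right half.
image = np.zeros((6, 6), dtype=np.uint8)
image[:, 3:] = 200

# The raw data holds only intensities; nothing in it is labelled 'edge'.
print(image)

# The edge exists only once we compute it (here: between columns 2 and 3).
print(edge_columns(image))

# Halving the illumination changes every bright pixel value,
# yet the computed edge location stays the same.
darker = image // 2
print(edge_columns(darker))
```

The design point is that the raw values change under dimming while the computed structure does not, which is why features are computed from pixels rather than read off them.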
Should computer vision mimic human vision? Argue both sides — a common essay question. *YES (inspiration):* Human vision is the most sophisticated system known, evolved over ~540 Myr. Many CV targets — recognition, segmentation, 3D understanding — are exactly the things humans do well. Concepts like receptive fields, hierarchical features, and attention came from neuroscience. *NO (not a constraint):* Humans have systematic failures (Adelson's checkerboard illusion, Kitaoka's rotating snakes, the 'invisible gorilla' inattentional-blindness experiment — Simons & Chabris). Machines can use sensors humans don't have (LiDAR, infrared, hyperspectral) and aren't bound by biological constraints. Draw inspiration when useful; don't be limited.
Gestalt principles — perceptual grouping. Pre-deep-learning, Gestalt psychologists (1920s) catalogued how humans group elements into wholes ("the whole is other than the sum of its parts"). Memorise these for any 'how do humans see' question — they motivate classical segmentation algorithms (mean-shift, normalized cuts) and inform attention mechanisms in modern networks. Proximity — nearby elements group (a grid of dots clusters into rows). Similarity — like-coloured/shaped elements group (red dots among gray form a perceived line). Closure — the mind completes incomplete shapes (WWF panda logo). Continuation — the eye follows smooth paths (two crossing dotted lines are seen as two continuous lines, not four segments). Common Fate — elements moving together group (motion segmentation). Figure-Ground — separation of focal object from background (Rubin's vase). Symmetry — symmetric arrangements are perceived as a unit.
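The proximity principle can be sketched in a few lines. This is a toy single-linkage grouping (connected components under a distance threshold), not mean-shift or normalized cuts; the function name `group_by_proximity` and the `max_gap` value are assumptions made for the illustration.

```python
import numpy as np

def group_by_proximity(points, max_gap):
    """Label points so that neighbours within max_gap end up in the same group."""
    n = len(points)
    labels = [-1] * n          # -1 means 'not yet assigned'
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:           # flood-fill through chains of close neighbours
            i = stack.pop()
            dists = np.linalg.norm(points - points[i], axis=1)
            for j in np.nonzero(dists <= max_gap)[0]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# A grid of dots: horizontal spacing 1.0, vertical spacing 3.0.
points = np.array([[x, y] for y in (0.0, 3.0) for x in (0.0, 1.0, 2.0, 3.0)])
labels = group_by_proximity(points, max_gap=1.5)
print(labels)  # one label per dot; dots in the same row share a label
```

Because the dots are closer horizontally than vertically, the grouping recovers the two rows, mirroring the 'grid of dots clusters into rows' example.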
Affordance over appearance. The 'chair' example again: a beanbag and a throne look almost nothing alike but share the property of *being sittable*. J. J. Gibson called this an affordance — a possibility for action the object offers an agent. Modern CV has begun to address affordances directly (action-recognition, embodied AI). The exam point: a class can be defined functionally, not visually — which is why classification by pure appearance has a ceiling.
Selective attention — the invisible gorilla. Simons & Chabris (1999): viewers asked to count basketball passes routinely miss a person in a gorilla suit walking through the scene. Demonstrates that human vision is *not* a faithful camera — attention determines what we see. Relevant for CV because models that try to mimic human attention (saliency, hard-attention transformers) inherit some of the strengths *and* the failures.
The Bitter Lesson (Sutton). A meta-rule that has shaped the last decade of CV. In paraphrase: general methods that leverage computation are ultimately the most effective, and hand-engineered priors are eventually overtaken by scaled-up general methods. Concretely: hand-crafted SIFT/HOG features lost to learned CNN features; learned CNNs are losing to even-more-general transformers. The lesson: bet on data + compute + general architectures, not on baking in domain knowledge.
Course prerequisites. The course assumes four pillars. (1) Linear algebra — vectors, matrices, eigenvalues, SVD, 2D/3D geometry; needed for transformations, PCA, attention. (2) Image / signal processing — filtering, edge detection, Fourier/DCT transforms; the foundation of convolutions and frequency-domain reasoning. (3) Machine learning / pattern recognition — features, classifiers, train/val/test, loss functions; backbone of all modern CV. (4) Programming — Python, NumPy, PyTorch; implementation is non-optional. Two of these (DIP + ML) are explicitly recapped in this course (Units 2 and 3); the other two you carry forward.
Applications to name-drop. Self-driving (recognise + segment + reconstruct), Hawk-Eye (ball tracking + 3D reconstruction), medical imaging (tumour segmentation, diabetic retinopathy screening), satellite imagery (deforestation monitoring, crop yield), AR/VR (SLAM + pose), Face ID (recognition + anti-spoofing), document analysis (OCR + layout). Each is a concrete instance of the Three Rs in combination.
Definitions
- Computer vision — The science of extracting all possible information about a visual scene from images — answering What/Where/Who/When/Why/How/How-many.
- Marr's definition of vision — (1982) 'To know what is where, by looking.' The one-line mission statement of the field.
- Three Rs (Malik) — Reorganisation (group pixels), Recognition (label them), Reconstruction (measure geometry). Modern CV systems use all three.
- Semantic gap — The conceptual distance between low-level pixel intensities and high-level semantic concepts (objects, intent, action). Bridging it is the central technical challenge.
- Intra-class variation — The amount of visual variability within a single semantic class (e.g., 'chair' includes throne, beanbag, office chair). Often larger than inter-class variation — which is why pure appearance-matching fails.
- Affordance (Gibson) — A possibility for action that an object offers an agent (a chair affords sitting, a handle affords grasping). Provides a functional definition of object category that is more robust than appearance.
- Gestalt principles — Pre-attentive grouping rules formulated by 1920s German psychologists: Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground, Symmetry. Motivate classical segmentation and modern attention.
- Inverse problem of vision — Image formation is many-to-one (3D scene → 2D image), so inversion (2D → 3D + identity) is one-to-many and underdetermined. Vision must use priors to choose a plausible interpretation.
- The Bitter Lesson (Sutton) — Empirical observation that, across decades of AI research, methods that leverage raw computation eventually beat methods that bake in domain knowledge. Drives the 'just scale it up' approach in modern CV.
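- Cambrian explosion connection — One hypothesis holds that the evolution of vision helped drive the rapid divergence of animal life ~540 Myr ago. Suggestive (not conclusive) evidence that visual perception is both valuable and computationally expensive: evolution would not have sustained so large an investment in it otherwise.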
- Summer Vision Project (1966) — MIT memo by Seymour Papert proposing CV could be largely solved in one summer. The iconic underestimation of the field.
- Inattentional blindness — Failure to notice obvious stimuli when attention is occupied elsewhere (Simons & Chabris's 'invisible gorilla'). A reminder that biological vision is selective, not a faithful camera.
Formulas
Derivations
Why >50% of the brain ⇒ CV is hard, not easy. Counter-intuitive argument worth memorising. We *feel* vision is effortless because it is fully automated by dedicated hardware (~30% of neocortex by some counts, with ~50% involved in any vision-mediated task). The size of the brain budget is evidence the *computation* is enormous, not that the *problem* is trivial. A computer with a 1000-CPU cluster doing facial recognition feels slow because the CPU is general-purpose; the brain dedicates an entire pipeline to vision, which is why it feels instant.
Inverse problem of vision. Image formation is many-to-one: a 3D scene projects to a 2D image by losing one dimension of geometry plus mixing illumination, material, and shape into a single intensity per pixel. Inverting this is one-to-many (many 3D scenes explain the same image), so vision is necessarily ambiguous — every vision system must use PRIORS (e.g., 'objects are usually convex', 'lighting comes from above') to pick a plausible interpretation.
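The many-to-one nature of image formation can be checked numerically. The sketch below assumes an ideal pinhole camera with focal length 1 and ignores illumination and material entirely; the two squares and the `project` helper are hypothetical.

```python
import numpy as np

def project(points_3d, focal_length=1.0):
    """Pinhole projection (X, Y, Z) -> (f*X/Z, f*Y/Z). Assumes Z > 0."""
    points_3d = np.asarray(points_3d, dtype=float)
    return focal_length * points_3d[:, :2] / points_3d[:, 2:3]

# A small square facing the camera at depth 2 ...
near_small = np.array([[-1, -1, 2], [1, -1, 2], [1, 1, 2], [-1, 1, 2]], dtype=float)
# ... and a square five times larger, five times farther away.
far_large = 5.0 * near_small

print(project(near_small))
print(project(far_large))

# Different 3D scenes, identical 2D image coordinates.
same_image = np.allclose(project(near_small), project(far_large))
print(same_image)
```

Nothing in the projected coordinates distinguishes the two scenes; only a prior (for example, a known object size) can choose between them.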
Examples
- Hawk-Eye uses 4–10 high-speed cameras around the court; per frame, detect the ball (recognition), localise its 2D coordinates in each view (reorganisation, via detection/segmentation), then triangulate the 3D trajectory (reconstruction). The 'just inside the line' announcement is the output of a Three-R pipeline.
- Indian roads vs KITTI. KITTI dataset (Karlsruhe) → sparse cars, well-marked lanes. Indian roads → dense mixed traffic (cows, autos, pedestrians, motorbikes), no markings, frequent occlusion. A detector trained on KITTI generalises poorly to Indian footage — classic distribution-shift problem and a reminder that benchmarks are not the real world.
- Chair affordance. Drawing of an office chair, a beanbag, a tree stump, and a piano stool: visually disjoint, but all share the property 'human can sit'. Recognising 'chair' from appearance alone has a ceiling; defining the class by sittability is more robust but requires reasoning about human poses.
- Adelson's checkerboard. Two squares with identical pixel intensity look DIFFERENT because one appears to lie in shadow. Human vision applies a lightness-constancy prior that discounts the shadow; a CNN that 'sees the pixel value' would say they're identical and the human would say they're not. Both are correct: the CNN reports the input, the human reports the inferred scene.
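The reconstruction step in the Hawk-Eye example can be sketched with textbook linear (DLT) triangulation. This is not Hawk-Eye's actual method, which is not described here; the two camera matrices, the ball position, and the helper names are all made up for illustration, and the observations are noise-free.

```python
import numpy as np

def project(P, point_3d):
    """Project a 3D point with a 3x4 camera matrix P to 2D pixel coordinates."""
    x = P @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def triangulate(proj_matrices, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(proj_matrices, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two hypothetical cameras with identity intrinsics: one at the origin,
# one shifted 1 unit along x (camera matrix [I | -C]).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

ball = np.array([0.3, 0.2, 5.0])                  # made-up ground truth
pixels = [project(P1, ball), project(P2, ball)]   # per-view 2D detections
estimate = triangulate([P1, P2], pixels)
print(estimate)
```

With real detections the pixel coordinates are noisy, so a practical system would typically use more views and refine the linear estimate; the sketch only shows why two calibrated views suffice in principle.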
Diagrams
- Three Rs Venn / triangle. Three overlapping circles: Reorganisation, Recognition, Reconstruction. Autonomous driving sits in the centre (uses all three); pure ImageNet classification is in Recognition only; SfM/3DGS is in Reconstruction only; mean-shift segmentation is in Reorganisation only.
- Semantic-gap stack. Bottom layer: raw pixels (intensity array). Middle: edges → texture → parts. Top: 'cat'. Annotate each layer with the kind of operation that produces it (filter, descriptor, classifier, attention).
- Gestalt cheat-card. Six small diagrams: dots clustered (proximity), red/grey mix (similarity), broken circle perceived whole (closure), two crossing dotted lines (continuation), motion arrows (common fate), Rubin's vase (figure-ground).
- Inverse problem. Single 2D image with three different 3D reconstructions producing it (a flat painted shape, a 3D object, a tilted plane) — visualises why priors are mandatory.
Edge cases
- Optical illusions (Adelson, Kitaoka, Ponzo) are NOT bugs — they're consequences of strong perceptual priors that are usually correct. A system that 'fixes' the illusion may break on real-world inputs.
- Selective attention failures. 'Did you see the gorilla?' — people miss the obvious when attention is consumed elsewhere. Models that mimic foveal attention can inherit this trade-off.
- Affordance ≠ appearance. A class like 'sittable' has no single appearance signature, so an appearance-only classifier will struggle. CV models need shape, pose, and context cues.
- Photo vs painting. Detectors trained on natural photos can fail badly on paintings, sketches, or screenshots. The original YOLO paper reported better generalisation to artwork than R-CNN, which its authors attributed to YOLO learning more general representations of object size, shape, and layout.
- 'Computer sees' the wrong thing. Adversarial examples (small, often imperceptible pixel perturbations) can flip a CNN's prediction. This is strong evidence that the model is NOT seeing what humans see, even when test accuracy is high.
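The adversarial-example point can be illustrated on a toy model. The sketch below uses a hypothetical linear classifier instead of a CNN, with a hand-built weight vector and input, and applies a single gradient-sign step; it shows the mechanism only, not how any particular network behaves.

```python
import numpy as np

n_pixels = 32 * 32
# Hypothetical linear 'classifier': class 1 if w . x > 0, else class 0.
w = np.where(np.arange(n_pixels) % 2 == 0, 1.0, -1.0)

def predict(x):
    return int(w @ x > 0)

# A mid-grey image with a faint pattern aligned with w (values 0.49 / 0.51).
x = 0.5 + 0.01 * w

# Gradient-sign step: the gradient of the score w . x with respect to x is w,
# so shifting every pixel by epsilon against sign(w) lowers the score as much
# as possible for a given maximum per-pixel change.
epsilon = 0.02
x_adv = x - epsilon * np.sign(w)

print(predict(x), predict(x_adv))       # prediction before and after
print(np.max(np.abs(x_adv - x)))        # largest change to any single pixel
```

No pixel moves by more than 2% of the [0, 1] range, yet the many tiny changes add up across 1,024 pixels and the decision flips. High-dimensional inputs are what make such small perturbations effective.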
Common mistakes
- 'Vision is easy because we do it effortlessly.' Wrong — the brain spends >50% of its budget on it; you don't notice because the hardware is dedicated and pre-built.
- Confusing the Three Rs. Reorganisation = grouping pixels. Recognition = labelling. Reconstruction = geometry. Semantic segmentation is Reorganisation + Recognition. Detection is Recognition + (sparse) Reorganisation. Don't lump them.
- Naming only 1–2 Gestalt principles when 4+ are asked. Memorise six (Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground; Symmetry as a 7th).
- Saying 'CV mimics humans, period'. Argue both sides. Inspiration ≠ constraint. CV uses sensors and tricks humans cannot.
- Forgetting affordance. When asked 'why is chair hard?' don't only say 'many shapes' — the deep answer is functional class definition (affordance) trumps visual class definition.
Shortcuts
- Marr in one line: 'to know what is where, by looking'.
- Three Rs: Reorganisation, Recognition, Reconstruction (Jitendra Malik). One example each: segmentation, ImageNet, SfM.
- Why CV hard ⇒ seven reasons: pixels-are-numbers, intra-class variation, viewpoint, illumination, occlusion/clutter, scale, ambiguity.
- Gestalt six: Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground (+ Symmetry).
- Bitter Lesson (Sutton): scale + general beats hand-engineering.
- 1966 Summer Vision Project (Papert) — vision is *not* a summer task; mention this for any 'history of CV' question.
- Affordance (Gibson): a class can be defined by what it lets you DO (sit, grip, throw), not by what it looks like.
Proofs / Algorithms
Vision is necessarily ambiguous (informal proof). Image formation collapses many degrees of freedom into a 2D intensity grid $I$. The inverse has infinitely many solutions for the same $I$ (a flat painting of a 3D scene produces the same image as the actual 3D scene). Therefore vision requires PRIORS to resolve ambiguity; without them no recovery is possible.
Affordance vs appearance dimensionality. Define the appearance space $A$ of a class (all visual instances) and the affordance space $F$ (the actions enabled). Intra-class variation says $|A|$ for 'chair' is much larger than $|F|$ (one action: sit). A classifier in $F$-space therefore generalises better than one in $A$-space, at the cost of needing to *infer* affordances from appearance.