Foundations of Computer Vision — Marr, Three Rs, Gestalt, Why CV is Hard
What Are We Even Doing Here?
It's the start of a Computer Vision course, and the question to settle before anything else is: what is computer vision actually trying to do?
David Marr gave the canonical one-line answer in 1982: *"to know what is where, by looking."* That's the whole field, distilled. "What" — what objects, what categories, what materials. "Where" — what location, 2D and 3D. "By looking" — from images alone, no cheating with other sensors.
Modern CV adds a few more questions: *when* something happened (action recognition), *why* it happened (intent inference), *how many* of something there are (counting). But the spine is Marr's two question words: *what* and *where*.
The Brain Hint
Here's something that makes the rest of the course make sense: by a commonly cited estimate, more than 50% of the human cortex is involved in processing visual information. That is half the cortical real estate on the most precious computational device known. Evolution spent ~540 million years (since the Cambrian explosion) building this system.
Two reactions to this fact, only one of which is correct.
The *wrong* reaction: "Vision must be easy because humans do it without thinking." That's backwards. You do it without thinking because the hardware is dedicated and prebuilt. The CPU in your laptop is general-purpose; the visual cortex is a custom ASIC for one task, running 24/7 since you were a baby. The size of the budget is evidence the *computation* is enormous, not that the *problem* is simple.
The *right* reaction: "If half the brain is needed, software trying to do this on a CPU is going to need a serious amount of help — algorithms, data, structure, scale." That's why this course covers what it does.
The 1966 Memo
Want a clean illustration of "vision is harder than it looks"? In July 1966, MIT's Seymour Papert wrote a memo proposing the Summer Vision Project. The goal: build "a significant part of a visual system." The plan: a few undergrad summer workers, working until August, would solve figure-ground separation and object recognition.
Sixty years later we are still working on those problems. The lesson the lecture wants you to draw is unambiguous: computer vision is not a summer project. It requires deep mathematical foundations and was vastly underestimated by the AI pioneers. Quote this in the exam — it's the kind of fact graders love.
Three Tasks That Aren't Quite The Same
Jitendra Malik's decomposition of vision into Three Rs is one of the field's most widely cited organising frames. Memorise it.
Reorganisation — group proximate and similar pixels into meaningful regions. Semantic segmentation is the canonical example: every pixel labelled "person", "sky", or "road". Edge detection is reorganisation too. The Gestalt principles (Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground, Symmetry) are the perceptual heritage here.
Recognition — connect what we see to memory of categories. ImageNet classification, face identification, scene tagging. The neural-net workhorse of the last fifteen years lives here.
Reconstruction — measure or recreate quantitative aspects of the scene. 3D shape from images, depth maps, full 3D reconstruction. Structure-from-Motion (the Colosseum-from-internet-photos paper) is the iconic example; so is Hawk-Eye for cricket and tennis ball tracking.
The three are not disjoint — a real CV system uses them in combination. Autonomous driving recognises pedestrians AND segments the road AND reconstructs 3D distances to other cars. Hawk-Eye detects (recognition) the ball, localises (reorganisation) its 2D coordinates, and triangulates (reconstruction) the 3D trajectory.
When the exam asks "give one example of each R", give *different* applications; when it asks "what does autonomous driving need?", say "all three".
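The triangulation step in the Hawk-Eye example can be made concrete. The following is a minimal sketch of two-view linear (DLT) triangulation in NumPy, not Hawk-Eye's actual pipeline. The camera matrices, the ball position, and the helper names `project` and `triangulate` are illustrative assumptions.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X (shape (3,)) to 2D pixels with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)          # homogeneous image point
    return x[:2] / x[2]

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two calibrated views."""
    # Each view contributes two linear constraints on the homogeneous 3D point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

# Hypothetical setup: two identical cameras, the second shifted 1 unit along x.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

ball = np.array([0.3, -0.2, 5.0])          # made-up 3D ball position
x1, x2 = project(P1, ball), project(P2, ball)
ball_hat = triangulate(P1, P2, x1, x2)     # recovered 3D position
```

With noise-free 2D detections, `ball_hat` matches `ball` up to numerical precision. Real systems face noisy detections and use more cameras, so this is only the geometric core of the reconstruction step.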
Why The Pixels Aren't Helping
Computer vision is hard. Memorise these seven reasons; they show up in essays and short answers.
Pixels are just numbers. A computer reads an array of intensity values; there's no concept of "edge" or "cat" until something computes one. The semantic gap between raw intensity arrays and human-level concepts is enormous.
Intra-class variation. The "chair" class spans office chair, throne, beanbag, rocking chair, piano stool. They share *function* (you can sit on them) far more than appearance. There is no fixed visual template that catches all chairs — which is exactly the kind of class CV has to handle.
Viewpoint variation. The same object photographed from different angles produces wildly different 2D images.
Illumination. Pixel intensities depend on lighting, shadow, and reflectance — none of which are reliable identity cues.
Occlusion and clutter. Real objects are partially hidden; backgrounds are messy. The lecture flags this with the "Indian roads vs KITTI" comparison — a detector trained on the tidy Karlsruhe footage fails on the chaotic, mixed-traffic, occlusion-heavy Indian streets.
Scale variation. The same object might occupy 10 pixels (a distant pedestrian) or 10,000 pixels (a face filling the frame).
Ambiguity. Many different 3D scenes can produce the same 2D image. A flat painting of a cube and an actual cube can look identical from a single viewpoint. Vision is fundamentally an inverse problem (one image, many possible scenes), so it requires priors (e.g., "lighting comes from above", "objects are convex") to pick a plausible interpretation. Without priors, the scene cannot be uniquely recovered.
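Three of the reasons above (pixels are just numbers, illumination, ambiguity) can be illustrated with a few lines of NumPy. This is a toy sketch: the 4x6 image, the focal length `f=500.0`, and the helper `project` are made up for illustration and do not come from the lecture.

```python
import numpy as np

# 1. Pixels are just numbers: a tiny 4x6 grayscale "image" with a dark left half
#    and a bright right half. Nothing in the array itself says "edge".
img = np.array([[10, 10, 10, 200, 200, 200]] * 4, dtype=np.float64)

# An edge only exists once something computes it, e.g. a horizontal difference.
grad_x = np.abs(np.diff(img, axis=1))      # shape (4, 5)
edge_col = int(np.argmax(grad_x[0]))       # index of the largest jump in row 0

# 2. Illumination: halving the light changes every pixel value,
#    yet the computed edge location stays the same.
dim = 0.5 * img
dim_edge_col = int(np.argmax(np.abs(np.diff(dim, axis=1))[0]))

# 3. Ambiguity: under pinhole projection (u, v) = f * (X / Z, Y / Z), a point and
#    a second point twice as far along the same viewing ray hit the same pixel.
def project(point, f=500.0):
    X, Y, Z = point
    return np.array([f * X / Z, f * Y / Z])

near = np.array([0.2, 0.1, 2.0])
far = 2.0 * near                           # different 3D point, same ray
same_pixel = bool(np.allclose(project(near), project(far)))
```

Here `edge_col` and `dim_edge_col` agree even though `img` and `dim` differ at every pixel, and `same_pixel` is true even though `near` and `far` are different 3D points. A single image cannot tell the two apart without a prior.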
Should We Copy The Human?
A favourite essay question. The lecture wants both sides.
*Yes (inspiration):* Human vision is the most sophisticated visual system known. Many CV targets — recognition, segmentation, 3D understanding — are precisely what humans excel at. Concepts like receptive fields, hierarchical features, and attention came from neuroscience and underpin CNNs and Transformers.
*No (not a constraint):* Humans are not infallible. They have systematic failures: optical illusions (Adelson's checker-shadow illusion, Kitaoka's rotating snakes, the Ponzo illusion) and inattentional blindness (Simons & Chabris's "invisible gorilla" experiment). Computers can use sensors humans don't have (LiDAR, infrared, hyperspectral) and need not be limited by biological constraints. Draw inspiration when useful; don't be limited by biology.
A line worth using verbatim in the exam: "We borrow from human vision as often as is convenient, but the goal is useful machine perception, not biological replication."
Gestalt — The Old Lessons
Before deep learning, Gestalt psychologists (1920s, Germany) catalogued how humans group elements into wholes. The principle: *"the whole is other than the sum of its parts"* — perception is constructive, not just pixel-matching.
Six (or seven) to know:
Proximity. Nearby elements group together.
Similarity. Like-coloured or like-shaped elements group.
Closure. The mind completes incomplete shapes (the WWF panda logo is missing many contours; you see a panda).
Continuation. The eye follows smooth paths (two crossing dotted lines are perceived as two continuous lines, not four segments meeting at a point).
Common Fate. Elements moving together group (motion segmentation).
Figure-Ground. The focal object is separated from the background (Rubin's vase).
Symmetry. Symmetric arrangements are perceived as a unit.
These motivated classical segmentation algorithms (mean-shift, normalized cuts) and continue to inform attention mechanisms today.
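Proximity and Similarity translate fairly directly into code. The sketch below is not mean-shift or normalized cuts; it is a much simpler stand-in that links any two elements whose positions and intensities are both close, then reads off the connected groups. The thresholds, the sample data, and the function name `group_elements` are assumptions chosen for illustration.

```python
import numpy as np

def group_elements(positions, intensities, max_dist=1.5, max_diff=20.0):
    """Return one integer label per element; linked elements share a label.

    Two elements are linked when they are close in space (Proximity) and
    close in intensity (Similarity). Groups are the connected components
    of that link graph, found with a small union-find.
    """
    n = len(positions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            near = np.linalg.norm(positions[i] - positions[j]) <= max_dist
            alike = abs(intensities[i] - intensities[j]) <= max_diff
            if near and alike:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    relabel = {}                            # root -> 0, 1, 2, ... by first appearance
    return [relabel.setdefault(r, len(relabel)) for r in roots]

# Made-up data: a tight dark row, a distant dark pair, and one bright dot
# sitting right next to the first row.
positions = np.array([[0, 0], [1, 0], [2, 0],
                      [10, 0], [11, 0],
                      [1, 1]], dtype=float)
intensities = np.array([30, 35, 32, 31, 29, 220], dtype=float)
labels = group_elements(positions, intensities)
```

The bright dot is spatially close to the first row but ends up in its own group because it fails the similarity test, while the distant pair fails the proximity test. The pairwise loop is quadratic in the number of elements, which is fine for a toy example but not for full images.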
Affordance — Why "Chair" Is Hard
J. J. Gibson introduced the term affordance: a possibility for action that an object offers an agent. A chair affords *sitting*; a door handle affords *grasping*; a cup affords *holding liquid*. This is a functional definition of class, not a visual one.
It explains why "chair" is hard. Beanbag, throne, office chair — visually disjoint, but functionally identical. An appearance-only classifier has a ceiling on this kind of class; a model that can reason about poses and actions can climb further. Modern CV (especially embodied AI and action recognition) increasingly addresses affordances directly.
The Invisible Gorilla
Simons & Chabris (1999) ran an experiment now famous in cognitive science: viewers were asked to count basketball passes between players. Halfway through the video, a person in a gorilla suit walks across the court. About half the viewers do not notice.
The lesson for CV: human vision is not a faithful camera. Attention determines what we see; everything else is suppressed. Models that mimic human attention (saliency, hard-attention transformers) inherit both the strengths *and* the failures.
The Bitter Lesson
Rich Sutton's 2019 essay states a meta-rule that the last decade of CV has borne out: methods that leverage computation eventually beat methods that bake in human knowledge. SIFT and HOG (hand-engineered features) lost to learned CNN features. CNNs (architecturally specialised) are losing ground to Transformers (more general). The pattern repeats.
The Bitter Lesson tells you what to bet on: data + compute + general architecture. Not domain hacks. The course is in many ways an arc through that lesson — early units (DIP, classical features) show the hand-engineered approach; later units (ViT, SSL, VLMs) show its replacement.
What This Course Assumes You Know
Four prerequisites the course is built on:
1. Linear algebra — vectors, matrices, eigenvalues, singular value decomposition, 2D/3D geometry. You'll need this for transformations, PCA, attention.
2. Image / signal processing — filtering, edge detection, Fourier/DCT transforms. Foundation of convolutions, foundation of frequency-domain reasoning. Unit 2 is a fast recap.
3. Machine learning / pattern recognition — features, classifiers, train/val/test, loss functions. The backbone of every modern CV system. Unit 3 is a fast recap.
4. Programming — Python, NumPy, PyTorch. Implementation is non-optional; assignments and quizzes assume comfort with code.
Two of the four (DIP + ML) are recapped in the next two units. The other two — linear algebra and programming — you carry forward.
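As a quick self-check on the linear algebra and NumPy prerequisites, here is a minimal sketch of PCA computed through the SVD. The data are synthetic and the variable names are illustrative assumptions; nothing here is taken from the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D points stretched along the x-axis, then rotated by 30 degrees.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
points = rng.normal(size=(500, 2)) * np.array([5.0, 0.5])   # std 5 along x, 0.5 along y
points = points @ R.T                                        # rotate each row vector

# PCA: centre the data, then take the SVD. Rows of Vt are the principal directions.
centred = points - points.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)
first_axis = Vt[0]

# The first principal direction should align with the rotated x-axis (up to sign).
expected_axis = R @ np.array([1.0, 0.0])
alignment = abs(first_axis @ expected_axis)                  # near 1.0 when aligned
explained = singular_values**2 / np.sum(singular_values**2)  # variance share per axis
```

If each line reads naturally (why the data are centred, why the rows of `Vt` are directions, why the sign of `first_axis` is arbitrary), the linear algebra prerequisite is probably in good shape.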
What You Should Walk Into The Exam Carrying
Marr's definition: "to know what is where, by looking."
Three Rs (Reorganisation, Recognition, Reconstruction) with one example each.
Seven reasons CV is hard (pixels-are-numbers, intra-class variation, viewpoint, illumination, occlusion, scale, ambiguity).
Six (or seven) Gestalt principles.
Affordance (Gibson): a functional class definition.
1966 Summer Vision Project: why CV is *not* a summer task.
More than 50% of the brain involved in vision means CV is hard, not easy.
Both sides of "should we mimic humans".
The Bitter Lesson.
That's the whole introduction.