Object Detection — R-CNN family, YOLO, NMS, mAP
The Hunt
It's 2013, and computer vision is in a strange place. Networks like AlexNet have just learned to classify images — show them a photo, they tell you "cat". Beautiful. But useless if your photo has a cat and a dog and a duck and you need to circle each one. Classification answers "what is in this image?". The new question — the harder one — is "what is in this image, and where exactly is each thing?"
That's object detection. And the next decade of computer vision is essentially the story of researchers chasing this question and getting faster, smarter, and more end-to-end with every paper.
Three Tasks, One Staircase
Before we hunt, get the vocabulary clean. The lecture explicitly walks you up this staircase and exam questions love to test the boundaries.
Classification takes an image and emits one label ("Cat"). Classification + Localisation is one step harder — same input but the image is known to contain ONE object, and the network outputs a label plus a bounding box (x, y, w, h). Object Detection drops the "one object" assumption — variable-sized output, a set of (label, box) pairs. Instance Segmentation asks for per-instance pixel masks.
Detection is harder than localisation for three concrete reasons the lecture lists: variable-length output (one image has 2 objects, another has 47; networks like to produce fixed-size outputs); mixed output types (you predict what — discrete class — AND where — continuous coordinates — simultaneously); and big images (classification works at roughly 224×224, detection needs something like 800×600, because tiny objects vanish at low resolution).
Bounding Boxes — The Unit Of The Hunt
A bounding box is four numbers — (x, y, w, h) — usually top-left corner plus size, sometimes centre plus size. Two flavours: axis-aligned, with sides parallel to the image axes (the default), and oriented (rotated) boxes (rare in standard datasets).
One subtle distinction worth memorising — the lecture flags it specifically: a modal box covers only the visible portion of the object (the dog's box stops at the chair occluding it) — this is the standard. An amodal box covers the full extent of the object, even occluded parts (the dog's box extends behind the chair). If a question asks "what kind of bounding box does PASCAL VOC use?" — modal.
IoU — Measuring Agreement Between Boxes
Suppose your network predicts a box and the ground truth is another box. How "good" is the prediction? You can't just compare corners — you need a single number. That number is Intersection over Union: IoU(A, B) = |A ∩ B| / |A ∪ B|, the area of the intersection divided by the area of the union. Also called the Jaccard index — synonyms, and exam questions occasionally use the second name to see if you flinch.
The intuition: 0 means the boxes don't overlap at all, 1 means perfect coincidence. Your professor's calibration table: IoU > 0.5 is "decent", > 0.7 is "pretty good", > 0.9 is "almost perfect". Two warnings the lecture explicitly raises. IoU = 0.5 is NOT "boxes overlap by 50%" — two squares of equal size must each share two-thirds of their area just to reach IoU = 0.5, so don't think of it as an overlap fraction. And IoU is symmetric: IoU(A, B) = IoU(B, A).
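As a quick sanity check, here is a minimal sketch of the computation, assuming axis-aligned boxes stored as (x1, y1, x2, y2) corners — the corner convention is my choice here, not necessarily the lecture's:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle: overlap of the coordinate ranges.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Symmetric, and 0.5 does not mean "50% overlap":
print(iou((0, 0, 3, 3), (1, 0, 4, 3)))  # two equal squares sharing 2/3 of each -> 0.5
```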
NMS — When Your Detector Gets Enthusiastic
Run a detector on a single dog. You don't get one box. You get five — all overlapping, all saying "dog", with confidence scores 0.9, 0.8, 0.75, 0.7, 0.7. The detector is doing its job (every plausible window fires), but the output is ugly.
The fix is Non-Maximum Suppression — a simple greedy algorithm. Pick the highest-scoring box. Delete every other box whose IoU with it exceeds a threshold (e.g., 0.5). Repeat. Survivors are your final detections. The threshold is a hyperparameter — too high and you keep duplicates, too low and you delete real distinct objects standing close together.
That last failure case is worth knowing: NMS struggles with dense crowds. If two pedestrians are walking shoulder-to-shoulder, their boxes overlap heavily, and NMS will delete one of them as a "duplicate". The lecture says it bluntly: "no good solution =(". Soft-NMS exists — instead of zeroing the score, multiply it by a decaying function of IoU (linear: s_i ← s_i·(1 − IoU(M, b_i)); or Gaussian: s_i ← s_i·exp(−IoU(M, b_i)²/σ), where M is the currently selected box). Nearby legitimate objects keep some confidence rather than vanishing entirely.
NMS is also applied per class. Running it globally across all classes would wrongly suppress, say, a dog box just because a higher-scoring person box overlaps it in the same region.
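Here is a sketch of the greedy procedure itself, reusing the iou() helper above and the same box convention; in practice you would call it once per class:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: list of (x1, y1, x2, y2); scores: list of floats.
    Returns indices of the surviving boxes (run separately per class)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score always survives
        keep.append(best)
        # drop everything that overlaps the survivor too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```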
Scoring A Whole Detector — Precision, Recall, AP, mAP
This is the bit students always lose marks on. Walk through it slowly.
For one fixed class (say "dog") at one fixed IoU threshold (say 0.5), each detection is either TP (matches a ground-truth dog box with IoU ≥ 0.5), FP (doesn't match any GT, or matches one already claimed), or contributes to FN (a GT dog that no detection matched). From these: Precision = TP / (TP + FP) — the fraction of your detections that are correct. Recall = TP / (TP + FN) — the fraction of true objects you found.
A detector with high precision is cautious (rarely wrong, but misses things). High recall is eager (catches everything, but with garbage). You want both. The trade-off is controlled by your confidence threshold — sweep it and you get a curve.
Average Precision is the area under the PR curve. The lecture walks through a worked example: five dog detections sorted by score [0.99, 0.95, 0.90, 0.5, 0.10] and three ground-truth dogs. Trace it: each TP consumes one unmatched GT, and after every detection you record a (Recall, Precision) point; AP is the area under the stepped curve those points trace out (replayed in the sketch below). Mean Average Precision is the average of AP across classes.
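Replaying that worked example in code makes the bookkeeping concrete. The match/no-match pattern below is an assumption about the exact slide (the usual version of this example, where detections 1, 2 and 5 match a ground-truth dog), and AP is computed with a plain rectangle rule under the stepped curve:

```python
# Worked example: 5 dog detections sorted by score, 3 GT dogs.
# True/False marks whether each detection matches an unclaimed GT box (IoU >= 0.5).
matches = [True, True, False, False, True]   # scores 0.99, 0.95, 0.90, 0.5, 0.10
num_gt = 3

tp = fp = 0
points = []                                  # (recall, precision) after each detection
for m in matches:
    tp, fp = (tp + 1, fp) if m else (tp, fp + 1)
    points.append((tp / num_gt, tp / (tp + fp)))
print(points)   # ≈ [(0.33, 1.0), (0.67, 1.0), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]

# AP = area under the PR curve: sweep recall, add a rectangle each time
# a new recall level is first reached, at the precision achieved there.
ap, prev_r = 0.0, 0.0
for r, p in points:
    if r > prev_r:
        ap += (r - prev_r) * p
        prev_r = r
print(ap)       # ≈ 0.87
```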
COCO mAP is stricter. Pascal VOC's mAP@0.5 was felt to be too lenient: a sloppy box with IoU 0.55 still counted as success. COCO says: be honest about it. Compute mAP at every IoU threshold from 0.5 to 0.95 in steps of 0.05 (so ten thresholds), then average. That's why COCO numbers look much smaller than VOC numbers — same detector, harsher metric.
Datasets You Must Name-Drop
Pascal VOC 2010: 20 classes, ~20k images, 2.4 mean objects/image. ImageNet Detection (ILSVRC 2014): 200 classes, ~470k images, 1.1 mean objects/image — almost every image has just one centred object, so the lecture explicitly says: ImageNet is great for classification, bad for detection benchmarking. MS-COCO 2014: 80 classes, ~120k images, 7.2 obj/img — the modern crowded-scene standard.
Now We Hunt — The Journey From Overfeat To Faster R-CNN
OK. Now we have the vocabulary. The detector itself is the interesting question. The lecture walks through history because every method exists to fix the previous one's failure.
Idea #1 — Localisation as regression. The simplest case: image has exactly one object. Slap a regression head onto a CNN that outputs four numbers (x, y, w, h), train with L2 loss against the ground-truth box. You now have two heads on the same network: a classification head (softmax over classes) and a regression head (4 numbers). Two flavours of regression head — your slides distinguish them and exam questions test it: class-agnostic outputs 4 numbers total (one box, no matter the class); class-specific outputs 4·K numbers for K classes (one box per possible class). Generalises to K objects only when K is fixed — not real detection.
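A minimal PyTorch sketch of the two-head idea — the feature dimension, class count, and layer names are illustrative rather than the lecture's exact architecture:

```python
import torch
import torch.nn as nn

class ClassifyAndLocalise(nn.Module):
    """One backbone feature vector, two heads: softmax label + one box."""
    def __init__(self, feat_dim=4096, num_classes=20, class_specific=True):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)            # "what"
        box_outputs = 4 * num_classes if class_specific else 4      # "where"
        self.box_head = nn.Linear(feat_dim, box_outputs)

    def forward(self, feat):
        scores = self.cls_head(feat)    # trained with cross-entropy
        boxes = self.box_head(feat)     # trained with L2 against the GT box
        return scores, boxes

feat = torch.randn(1, 4096)                      # pretend backbone feature
scores, boxes = ClassifyAndLocalise()(feat)
print(scores.shape, boxes.shape)                 # [1, 20] and [1, 80] (class-specific)
```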
Idea #2 — Sliding window (Overfeat). If detection is "classification + regression at every location", do exactly that: slide the network across the image at many positions and scales, classify each window, regress a box, then merge. Overfeat (Sermanet et al., ICLR 2014, ILSVRC 2013 winner) made this practical with one trick: convert fully-connected layers into convolutional layers so you don't actually re-run the network at every position — the convolution structure handles all positions in one forward pass naturally. Then merge boxes greedily. Cost: still expensive across multiple scales; can't gracefully handle variable numbers of objects.
Idea #3 — Detection is classification of candidate windows. Treating "detect every object" as regression fails because the output size varies. Instead, think of detection as classification applied to many candidate windows. Problem: if you classify every window at every scale, the cost is astronomical. Solution: only classify a few promising windows. Don't look everywhere — look only at proposals.
You need a fast, class-agnostic algorithm that says "these ~2000 boxes are probably worth checking." Enter Selective Search (Uijlings et al., IJCV 2013): a bottom-up segmentation algorithm that starts from oversegmentation and greedily merges similar regions at multiple scales using colour, texture, size and fill similarity measures, converting each merged region into a bounding box. It's not learned — it's classical computer vision, fast, and gives ~2000 proposals per image. Good enough. This sets up the entire R-CNN family.
R-CNN — The Original
Girshick et al., CVPR 2014. The recipe, in five training steps drawn straight from the lecture: pretrain a classification CNN on ImageNet (AlexNet); fine-tune for detection by replacing the final 1000-way FC layer with a 21-way one (20 PASCAL classes + 1 background) and training on positive/negative regions from detection data; extract features — for every image, run Selective Search for ~2000 proposals, crop and warp each to 227×227 (forced to be square because AlexNet expects a fixed square input), forward-pass each through the CNN, save pool5 features to disk (the lecture notes ~200 GB of features for the PASCAL dataset alone — yes, to disk!); train one binary SVM per class on those cached features; train a class-specific bounding-box regressor that predicts offsets to refine the proposal toward the true box.
So at test time: extract proposals → warp → CNN → SVM scores + box refinement → NMS → done. R-CNN was a huge jump in mAP over pre-CNN methods. But it had three deep flaws the lecture states explicitly. Memorise these — there's an exam question hiding here. Slow at test time — ~2000 proposals × one full CNN forward pass each = ~47 seconds per image. Post-hoc training — the SVMs and bbox regressors are trained AFTER the CNN is frozen; the CNN features can't adapt to what the SVMs and regressors find useful. Complex multi-stage pipeline — pretrain → fine-tune → extract features → train SVMs → train regressors; five separate stages, none of them end-to-end.
Fast R-CNN — Share The Conv Computation
Girshick, ICCV 2015. The key insight: 2000 forward passes are wasteful because the proposals are all from the same image. The early conv layers are doing the same work 2000 times. Compute the convolutions on the full image once, then crop the feature map per proposal.
The mechanism that makes this work is Region of Interest (RoI) Pooling. Run the full image through the conv backbone — output is a feature map of shape C × H′ × W′ (spatially shrunk by the network's stride). Project each proposal box onto the feature map (scale down by that stride). Divide the projected region into a fixed grid (e.g. 7×7), regardless of the proposal's actual size. Max-pool within each grid cell. Output is a fixed C × 7 × 7 tensor for every proposal — uniform input for the downstream FC layers.
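A naive NumPy sketch of the mechanism. The stride, feature-map size, and 7×7 grid here are assumptions for illustration; real implementations (and the later RoI Align) handle the rounding far more carefully:

```python
import numpy as np

def roi_pool(feat, box, stride=16, out=7):
    """Naive RoI pooling sketch. feat: (C, H, W) conv feature map,
    box: (x1, y1, x2, y2) proposal in image pixels, out: output grid size."""
    C, H, W = feat.shape
    # 1) project the proposal onto the feature map (divide by backbone stride)
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)            # keep at least one cell
    pooled = np.zeros((C, out, out), dtype=feat.dtype)
    # 2) split the projected region into an out x out grid, max-pool each cell
    xs = np.linspace(x1, x2, out + 1).astype(int)
    ys = np.linspace(y1, y2, out + 1).astype(int)
    for i in range(out):
        for j in range(out):
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = feat[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled   # fixed (C, out, out), whatever the proposal's size

feat = np.random.randn(512, 38, 50)                  # e.g. 600x800 image, stride 16
print(roi_pool(feat, (120, 80, 430, 370)).shape)     # (512, 7, 7)
```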
Crucially, RoI pooling is differentiable, so the whole network now trains end-to-end. One loss, one optimiser. Problems #2 (post-hoc) and #3 (multi-stage) of R-CNN — gone. Numbers your slides give: training time 84 hr → 9.5 hr (~9× faster), test time per image 47 s → 0.32 s (~146× faster), mAP 66.0 → 66.9 on VOC 2007 (also better).
But look at that 0.32-second test time more carefully. The lecture asks: does that include Selective Search? No. Selective Search itself takes about 2 seconds per image (running on CPU outside the network). So real wall-clock test time is ~2.3 sec — and now the bottleneck is the proposals themselves.
Faster R-CNN — Make The CNN Do The Proposing
Ren et al., NIPS 2015. The fix is bold: make the CNN do region proposals too. The new component is the Region Proposal Network.
After the last conv layer of the backbone, attach a tiny network that slides over the feature map. A 3×3 conv over the feature map produces an intermediate feature. Two 1×1 conv heads then branch off — one for "object vs not-object" classification, one for box regression.
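A PyTorch sketch of that tiny network — channel counts are illustrative, and k is the per-location anchor count explained in the next paragraph:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Tiny RPN head: 3x3 conv, then two sibling 1x1 conv heads.
    k anchors per location -> 2k objectness scores, 4k box offsets."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.obj = nn.Conv2d(512, 2 * k, kernel_size=1)   # object vs not, per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # (tx, ty, tw, th), per anchor

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.obj(h), self.reg(h)

feat = torch.randn(1, 512, 38, 50)          # backbone feature map
obj, reg = RPNHead()(feat)
print(obj.shape, reg.shape)                 # [1, 18, 38, 50] and [1, 36, 38, 50]
```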
The clever part is anchor boxes: at each spatial location, the RPN predicts not one box but k anchor boxes (typically k = 9 = 3 scales × 3 aspect ratios). The regression head outputs offsets from each anchor, not absolute coordinates. The classification head outputs the probability that each (regressed) anchor contains some object. Two properties matter for the exam: translation invariance (the same k anchors are used at every spatial location — the RPN treats positions identically, learns what objects look like, not where they live); and the regression learns anchor-relative offsets — much easier to learn because the anchor gives a rough size/shape prior, and the log-ratio encoding of size (t_w = log(w/w_a), t_h = log(h/h_a)) makes it scale-invariant.
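A small sketch of the anchor-relative encoding (the (t_x, t_y, t_w, t_h) parameterisation from the Faster R-CNN paper), with a numeric check of the scale-invariance claim:

```python
import math

def encode(anchor, gt):
    """Anchor-relative regression targets (t_x, t_y, t_w, t_h).
    Boxes are (cx, cy, w, h), Faster R-CNN parameterisation."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw,            # centre shift, normalised by anchor size
            (gy - ay) / ah,
            math.log(gw / aw),         # log size ratio -> scale-invariant
            math.log(gh / ah))

# The same relative error produces the same targets at any scale:
print(encode((100, 100, 50, 50),   (110, 105, 60, 40)))
print(encode((400, 400, 200, 200), (440, 420, 240, 160)))   # 4x bigger, same targets
```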
Once the RPN gives proposals, the rest is exactly Fast R-CNN: RoI pool → FC layers → classify + refine box. The whole thing — backbone, RPN, classifier — trains end-to-end with a single multi-task loss. R-CNN at ~50 s/img, Fast R-CNN at ~2 s/img, **Faster R-CNN at ~0.2 s/img — a ~250× speedup over R-CNN** at the same mAP.
One Look Is Enough — YOLO
In 2015 somebody named Joseph Redmon looks at Faster R-CNN and says: "Why do we need two stages?" Think about what Faster R-CNN actually does. Stage 1: the RPN proposes ~300 candidate boxes. Stage 2: a classifier looks at each box. Two networks. Two loss functions. A human doesn't do this. A human glances at a photo and says "horse on the left, person on the right" — one look. Object detection should be the same.
That's the YOLO insight: You Only Look Once. Reframe detection not as "propose then classify" but as a single regression problem — one CNN forward pass, output everything (boxes and classes) in one tensor, done. No proposals. No second stage. No region cropping.
The grid trick. YOLO takes the input image, resizes it to 448×448, and divides it into an S×S grid — in the v1 paper, S = 7. So the image is cut into 49 cells, each cell roughly 64×64 pixels of the original. The rule that organises everything: each grid cell is responsible for predicting at most one object — specifically, the object whose CENTRE falls inside that cell. Not "any object overlapping the cell" — the centre. So in the lecture's example with a person and a horse, the grid cell containing the horse's centre is the one cell responsible for the horse. The horse can spread across many cells; only one cell predicts it. This is how YOLO turns "variable number of objects" into "fixed-size output" — by dividing responsibility geometrically.
What each cell predicts. B bounding boxes (in YOLO v1, B = 2), each described by 5 numbers: x, y — centre of the box, encoded as offsets relative to the grid cell (so between 0 and 1 within the cell); w, h — width and height, encoded relative to the whole image; c — objectness confidence. Plus 20 class probabilities (Pascal VOC has 20 classes), shared across the cell's two boxes. So each cell's prediction vector is B·5 + 20 = 30 numbers. Across the whole grid, the output tensor is 7 × 7 × 30.
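A sketch of how one ground-truth box would be written into that target tensor. The exact ordering of the 30 channels is an assumption, and which of the B predictors ends up "responsible" is decided dynamically during training (next paragraph):

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
target = np.zeros((S, S, B * 5 + C))    # 7 x 7 x 30

def encode_yolo(target, box, cls, img_size=448):
    """box = (cx, cy, w, h) in pixels; writes into the responsible cell."""
    cx, cy, w, h = box
    cell = img_size / S                               # ~64 px per cell
    col, row = int(cx // cell), int(cy // cell)       # cell containing the CENTRE
    x_off, y_off = cx / cell - col, cy / cell - row   # centre offset within the cell
    target[row, col, 0:5] = [x_off, y_off, w / img_size, h / img_size, 1.0]
    target[row, col, B * 5 + cls] = 1.0               # one-hot class, shared by both boxes
    return target

encode_yolo(target, box=(224, 150, 180, 300), cls=12)   # e.g. a horse
print(target.shape, target[2, 3, :5])                   # (7, 7, 30) and the filled slot
```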
Why two boxes per cell? A quiz favourite. The lecture answer: multiple options during training — at training time one cell predicts B boxes, and we pick whichever predicted box has the highest IoU with the ground-truth box as the "responsible" predictor; faster convergence — each predictor specialises (one tends to learn tall boxes, the other wide); and the chosen ("responsible") box is supervised toward the ground truth while the other is trained to predict 'no object'.
YOLO's Loss — Five Terms, One Philosophy
This is where the lecture spends its time and where the exam will too. YOLO's loss is the sum of squared errors over all cells. But the weighting is the key.
There's a fundamental imbalance: most cells contain no object. Across 49 cells, perhaps only 2 contain objects and 47 are background. If you weight everything equally, the "no-object" cells dominate the loss and the network gives up on detecting things. The fix is two scaling hyperparameters: λ_coord = 5 multiplies the box coordinate loss (so localisation matters more); λ_noobj = 0.5 multiplies the no-object confidence loss (so background suppression matters less).
The five components: box centre loss — squared error on the (x, y) coordinates, scaled by λ_coord, only on cells with objects; box size loss — squared error on √w and √h (yes, square roots — see below), scaled by λ_coord, only on cells with objects; objectness loss for object cells — squared error between the predicted confidence and (typically) the IoU of the predicted box with ground truth; objectness loss for no-object cells — squared error driving confidence toward 0, scaled by λ_noobj; classification loss — squared error over all 20 class probabilities, only on cells with objects.
The square-root trick on w and h. This is the one detail that always shows up. Without it, a 10-pixel error on a huge box (say, 400 pixels wide) and a 10-pixel error on a tiny box (say, 20 pixels wide) contribute equally to the loss. But the small-box error matters much more in human terms — that 10 pixels is half the object! By predicting √w and √h instead of w and h, errors on small boxes are penalised more than equivalent errors on large boxes. Memorise this — exam gold.
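A quick numeric check of the claim, using the paragraph's own numbers:

```python
import math

# 10-pixel width error on a 400-px box vs on a 20-px box
big_plain   = (410 - 400) ** 2                          # 100
small_plain = (30 - 20) ** 2                            # 100 -> identical penalty
big_sqrt    = (math.sqrt(410) - math.sqrt(400)) ** 2    # ~0.06
small_sqrt  = (math.sqrt(30) - math.sqrt(20)) ** 2      # ~1.01 -> ~16x larger penalty
print(big_plain, small_plain, big_sqrt, small_sqrt)
```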
YOLO's Three Limitations
The lecture states these on a single slide. Memorise verbatim — "list YOLOv1's limitations" is the most likely exam question on this lecture. Maximum 49 objects can be detected. The grid is 7×7 = 49 cells, each cell handles one object. If your image has 50 objects, you literally can't represent them all. Difficulty with small objects in groups. Birds in a flock, faces in a crowd — multiple object centres can fall in one cell, and only one can be predicted. Poor localisation. Direct regression of box coordinates is harder than refining anchors (which is what Faster R-CNN does). YOLOv1's boxes are looser. v2 fixed this by introducing anchors.
Two-Stage vs Single-Stage — The Intuition
Two-stage thinks like a careful detective: propose, examine. Single-stage thinks like a reflex: look, decide. Two-stage (R-CNN family) has historically higher mAP, slower (~7 FPS Faster R-CNN), variable per-image output; best for accuracy-critical applications (medical, surveillance). Single-stage (YOLO, SSD, RetinaNet) is real-time (~45 FPS YOLOv1), fixed-size tensor; best for real-time applications (driving, robotics, video). YOLO had slightly lower mAP than Faster R-CNN at launch; the gap closed in v2, v3, v4… and modern YOLOs (v8/v11 era) are near state-of-the-art. YOLO also showed surprisingly good generalisation — trained on natural Pascal VOC images, it transferred reasonably well to artwork where Faster R-CNN's region proposals struggled.
What Comes Next
Mask R-CNN adds an instance-segmentation mask head on top of Faster R-CNN (next chapter). Feature Pyramid Network (FPN) uses multi-scale features for detecting objects of any size — standard backbone today. DETR (2020) reformulates detection as set prediction with attention — no anchors, no NMS.
What You Must Walk Into The Exam Carrying
The four-task staircase with their I/O and the three reasons detection is harder. IoU formula = intersection / union, also called Jaccard index; three calibration thresholds (0.5, 0.7, 0.9); NOT equal to fraction of overlap. NMS algorithm in three steps; failure mode in dense crowds. Modal vs amodal — visible only vs full extent. TP / FP / FN definitions, Precision and Recall. AP as area under the PR curve. mAP = mean of AP across classes; mAP@0.5 vs COCO mAP. Three datasets (Pascal VOC 20 classes, COCO 80 classes / 7.2 obj/img, ImageNet 200 classes / 1.1 obj/img — why ImageNet is bad for detection). R-CNN's five training steps and three flaws (slow, post-hoc, multi-stage). Fast R-CNN = share conv + RoI pooling, end-to-end trainable, still uses Selective Search externally. Faster R-CNN = RPN with anchor boxes replacing Selective Search; 3×3 conv → two heads (objectness + box regression); k = 9 anchors per location, translation-invariant; boxes are anchor-relative offsets. YOLO's 7×7 grid mechanic, output tensor 7 × 7 × 30, five-term loss with λ_coord and λ_noobj, the √w/√h trick, three limitations, trade-off vs Faster R-CNN.
That's a complete answer to any exam question on object detection.