Object Detection — R-CNN family, YOLO, NMS, mAP
Intuition
Detection asks two questions per object: 'where' (a bounding box) and 'what' (a class). The history of the field is a story of moving the expensive parts of that pipeline — region proposals, per-region computation, ranking — onto the GPU and into a single shared backbone.
Explanation
R-CNN (2014) used Selective Search to generate ~2000 region proposals per image and ran a full CNN forward pass on each cropped region. SVM classification and bounding-box regression were post-hoc. The pipeline was correct but glacial (~47 s/image) because the CNN was being recomputed on overlapping crops.
Fast R-CNN shared the backbone: one CNN forward over the whole image produces a feature map, and each proposal is cropped from that feature map via RoI Pooling (which quantises the floating-point RoI to integer feature-map cells before pooling). The full model — backbone + RoI head — is trained end-to-end with a multi-task loss combining classification cross-entropy and a smooth-L1 box-regression term.
Faster R-CNN replaced Selective Search (a CPU bottleneck) with a Region Proposal Network: a small conv head that slides over the shared feature map and, at each spatial location, predicts objectness and box refinements for k anchors (typically 9 = 3 scales × 3 aspect ratios). The detector and RPN share the backbone, making proposals nearly free.
YOLO is the canonical single-shot alternative: divide the image into an S × S grid (S = 7 for VOC), and let each cell predict B bounding boxes (each with x, y, w, h, confidence) and C class probabilities shared across the cell's boxes. The output is a single S × S × (B·5 + C) tensor — one forward pass for the whole image. Trade-off: each cell can only describe a small number of objects, so YOLO struggles with crowded scenes and small objects.
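The v1 output layout can be made concrete with a minimal sketch under the VOC settings quoted above (S = 7, B = 2, C = 20); the per-cell channel ordering shown in the comments is illustrative, not normative:

```python
import numpy as np

# YOLO v1 output for PASCAL VOC: S=7 grid, B=2 boxes/cell, C=20 classes.
S, B, C = 7, 2, 20
out = np.zeros((S, S, B * 5 + C))   # 7 x 7 x 30

# One cell's 30 channels split (in this illustrative ordering) as:
#   [x1, y1, w1, h1, conf1,  x2, y2, w2, h2, conf2,  p(class_0..19)]
cell = out[3, 4]
boxes = cell[:B * 5].reshape(B, 5)  # B boxes, each (x, y, w, h, conf)
class_probs = cell[B * 5:]          # ONE class distribution shared by both boxes
print(out.shape, boxes.shape, class_probs.shape)
```

Note that `class_probs` has no per-box index — this is exactly the single-shared-distribution point flagged under Common mistakes.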
Two losses worth memorising: GIoU adds a term penalising the portion of the smallest enclosing box C that lies outside (A ∪ B), giving a non-zero gradient even when predicted and ground-truth boxes don't overlap; Focal Loss multiplies the cross-entropy term by (1 − p_t)^γ to down-weight easy negatives — essential for single-stage detectors, where background anchors vastly outnumber foreground ones.
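Both losses are short enough to sketch directly. This is a minimal version for axis-aligned boxes in (x1, y1, x2, y2) form — a sketch of the definitions, not a reference implementation:

```python
import numpy as np

def giou(a, b):
    """Generalised IoU for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    return iou - (area_c - union) / area_c

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy down-weighted by (1 - p_t)^gamma."""
    return -((1 - p_t) ** gamma) * np.log(p_t)

# Non-overlapping boxes: IoU = 0 but GIoU is negative, so a gradient exists.
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))       # ~ -0.78
# An easy example (p_t = 0.99) contributes far less than a hard one (p_t = 0.1).
print(focal_loss(0.99), focal_loss(0.1))
```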
After raw predictions you almost always need Non-Maximum Suppression: sort by confidence, pop the top box, suppress all remaining boxes with IoU above a threshold τ, and repeat. NMS is applied per class. mAP is then computed by sorting all detections by score, walking the list to compute cumulative precision and recall (using IoU ≥ 0.5 to decide TP/FP), and averaging the area under the per-class PR curves.
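The greedy NMS loop above fits in a dozen lines (boxes as (x1, y1, x2, y2); in practice this would run once per class):

```python
def iou(a, b):
    """IoU for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, tau=0.5):
    """Greedy NMS: keep the top-scoring box, drop overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [i for i in order if iou(boxes[top], boxes[i]) <= tau]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # box 1 overlaps box 0 heavily and is suppressed
```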
Definitions
- Anchor — A predefined bounding-box prior of fixed scale and aspect ratio; predictions are regression offsets from this prior.
- RoI Pool — Quantises a floating-point RoI to integer feature-map cells, then pools each sub-region. Loses sub-pixel alignment.
- RPN (Region Proposal Network) — Small conv head shared with the detector backbone; predicts objectness + box offsets for k anchors at every spatial location.
- GIoU — Generalised IoU. Adds a penalty for the area inside the smallest enclosing box that doesn't belong to A ∪ B; non-zero gradient for non-overlapping boxes.
- Focal Loss — CE multiplied by (1 − p_t)^γ with γ ≈ 2; suppresses the gradient contribution of easy-classified examples (typically background).
- Non-Maximum Suppression — Per-class procedure: keep the highest-score box, suppress all boxes with IoU > τ, repeat. Soft-NMS multiplies scores by a decay factor instead of zeroing them.
- mAP — Mean of per-class AP, where AP is the area under the precision-recall curve (11-point in VOC, 101-point in COCO). COCO uses AP averaged across IoU thresholds 0.5:0.05:0.95.
Formulas
\text{IoU}(A,B) = |A \cap B| / |A \cup B|
\text{GIoU} = \text{IoU} - |C \setminus (A \cup B)| / |C|
\text{FL}(p_t) = -(1-p_t)^\gamma \log p_t,\quad \gamma = 2
\text{Output tensor (YOLO):}\ S \times S \times (B \cdot 5 + C)
\text{Anchor count per location (RPN):}\ k = 9\ (3\ \text{scales} \times 3\ \text{aspect ratios})
Derivations
GIoU rationale: pure IoU is zero whenever A and B don't overlap, so ∂IoU/∂box = 0 — there is no gradient pulling a far-away predicted box toward the ground truth. GIoU adds the term −|C \ (A∪B)|/|C|; when A and B don't overlap, that term is non-zero and depends on the relative position of A and B, restoring a useful gradient direction.
Why √w, √h in YOLO size loss: identical absolute error matters more on small boxes than on large ones (a 5-pixel error on a 20-pixel box is catastrophic; on a 200-pixel box, negligible). Predicting √w and √h compresses large values so equal absolute error becomes a smaller relative error for big boxes.
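A quick numeric check of the compression argument — the same 5-pixel error shrinks in √-space as the box grows:

```python
import math

# Same 5-pixel width error on a small (20 px) vs large (200 px) box,
# measured in sqrt space as YOLO's size loss does.
for w in (20, 200):
    err = abs(math.sqrt(w + 5) - math.sqrt(w))
    print(w, round(err, 3))   # the small box incurs the larger penalty
```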
Examples
- On a PASCAL VOC image, YOLO v1 outputs a 7 × 7 × 30 tensor: 7×7 grid, 2 boxes per cell (10 values for box coords + conf), and 20 class scores per cell.
- RPN at one location with k = 9 anchors produces 9·2 objectness logits + 9·4 bbox-refinement values = 54 outputs per spatial location.
- mAP trace: imagine 5 detections with scores [0.95, 0.85, 0.7, 0.6, 0.5] and TPs [T, F, T, T, F]. Walking the list: P = 1, 0.5, 0.67, 0.75, 0.6; R rises with each TP. AP = area under the resulting (R, P) curve.
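The mAP trace above can be replayed in a few lines (assuming, for concreteness, 3 ground-truth objects, so that recall reaches 1.0 at the third TP):

```python
# Five detections already sorted by score; flags mark true positives.
tps = [True, False, True, True, False]
n_gt = 3   # assumed number of ground-truth objects

precisions, recalls = [], []
tp = 0
for i, is_tp in enumerate(tps, start=1):
    tp += is_tp
    precisions.append(tp / i)      # cumulative precision at rank i
    recalls.append(tp / n_gt)      # cumulative recall at rank i
print([round(p, 2) for p in precisions])   # [1.0, 0.5, 0.67, 0.75, 0.6]
print([round(r, 2) for r in recalls])      # [0.33, 0.33, 0.67, 1.0, 1.0]
```

AP is then the area under the (recall, precision) curve these lists trace out.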
Diagrams
- R-CNN → Fast → Faster evolution: per-region CNN forwards → shared backbone + RoI Pool → shared backbone + RPN. Annotate the bottleneck removed at each step.
- RPN anchor diagram: at one cell, 9 anchors of varying scale and aspect ratio overlaid on the receptive field; each anchor gets a binary objectness score + 4 box offsets.
- NMS algorithm trace as a small table: rows = detections, columns = (score, kept?, suppressed by).
Edge cases
- Per-class NMS: applying NMS globally (across classes) wrongly suppresses one of two overlapping objects of different classes — e.g., a dog and a person standing in the same region.
- Crowded small objects (flock of birds): YOLO v1 fails — each cell predicts at most B objects sharing one class distribution.
- Pure IoU loss has zero gradient for non-overlapping boxes → training stalls early. Use GIoU/DIoU/CIoU instead.
- Anchor scale mismatch: if the smallest anchor at the final feature level is still larger than typical small objects, small-object recall collapses. FPN / multi-scale anchors fix this.
Common mistakes
- Writing 'YOLO predicts B class distributions per cell' — it predicts ONE class distribution shared by the B boxes.
- Confusing λ_coord (= 5, upweights box-coordinate loss) with λ_noobj (= 0.5, downweights no-object confidence loss).
- Computing mAP with the natural detection order instead of sorting by score first.
- Applying NMS before splitting by class — yields wrong suppressions.
Shortcuts
- Remember R-CNN-family speed in one chain: 47 s → 0.3 s → 0.2 s → 22 ms (R-CNN, Fast, Faster, YOLO).
- Anchors at one RPN location = 9 (3 × 3). Outputs per location = 9·(2 + 4) = 54.
- Soft-NMS one-liner: instead of zeroing the score of a suppressed box, multiply it by a decaying function of IoU (linear or Gaussian).
Proofs / Algorithms
Why GIoU is bounded in [−1, 1]: when A and B are identical, IoU = 1 and the enclosing-box penalty is 0, so GIoU = 1. When A and B are vanishingly small relative to C (far apart, tiny boxes), IoU → 0 and the penalty → 1, so GIoU → −1.
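A numeric check of the lower bound: shrink two far-apart boxes and GIoU approaches −1. Since IoU = 0 throughout, GIoU here is just minus the enclosing-box penalty, so no full GIoU routine is needed:

```python
# Two tiny eps x eps boxes at opposite corners of a ~100 x 100 enclosing box.
for eps in (1.0, 0.1, 0.01):
    union = 2 * eps * eps            # boxes don't overlap
    area_c = (100 + eps) ** 2        # smallest enclosing box spans both
    giou = 0.0 - (area_c - union) / area_c   # IoU = 0 here
    print(eps, round(giou, 6))       # tends to -1 as eps -> 0
```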