
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Pose Estimation — Heatmaps, CPM, OpenPose, SMPL

Unit 3 — Pose Estimation

Reading the Body

Object detection asks "is there a person, and where?" — and gives you a bounding box. But a box around a person tells you almost nothing about what they're doing. Is the person dancing? Fighting? Lifting a coffee mug? Falling?

The answer lives in the body itself — in the geometry of arms, legs, torso, head. Human pose estimation asks: given an image of a person, predict the locations of their joints and limbs. Then everything downstream — activity recognition, motion capture, gesture interfaces, avatar animation — becomes tractable.

This was, in fact, one of the first commercially deployed computer vision applications: the Kinect for Xbox 360 in 2010 used real-time pose estimation to turn living rooms into game controllers. The seminal paper — *Real-Time Human Pose Recognition in Parts from Single Depth Images*, Shotton et al., CVPR 2011 — used random forests on Kinect depth maps. Today the descendants of that paper power motion capture in film, AR fitness apps, sign-language recognition, and human-AI avatars like MimicMotion (Tencent 2024) where a single image plus a pose sequence generates a fully animated video.

Four Pose Representations

Memorise these four — they progressively encode more information about the human.

(1) Skeleton (joint keypoints). The simplest representation: a fixed list of body keypoints (e.g., 17 joints in COCO: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles), each represented as a 2D coordinate $(x_k, y_k)$. Output is $2K$ numbers per person (or $3K$ with confidence). Datasets use various sets: COCO 17, MPII 16, Halpe 26+. DWPose is a recent dense-skeleton variant with face + hand keypoints. In code this is just a small array per person — see the sketch after this list.

(2) DensePose. Instead of points, map every visible body pixel to a $(u, v)$ coordinate in a CANONICAL 2D surface parametrisation of the body. Think of it as UV-mapping a 3D body model to image pixels. The output is a dense per-pixel mapping rather than sparse joints — much more information. Useful for clothing transfer, virtual try-on, dense cross-view correspondences.

(3) Body mesh (SMPL). Go all the way to 3D. SMPL — Skinned Multi-Person Linear Model — parameterises the human body as a mesh of 6,890 vertices with two parameter sets: SHAPE $\beta$ (principal components of body shape — tall/short, slim/wide, etc.) and POSE $\theta$ (24 joints, each with 3 axis-angle rotation parameters, so $\theta \in \mathbb{R}^{72}$). Given $(\beta, \theta)$, SMPL computes a posed 3D mesh: start with a template body, apply shape deformations, then pose deformations via linear blend skinning. Human Mesh Recovery = predict $(\beta, \theta)$ from a single monocular image.

(4) Foundation-model representations (Sapiens). Modern approach: train one large vision foundation model on massive human-centric data (Meta's Sapiens, ECCV 2024) that can simultaneously predict pose, segmentation, depth, and surface normals from a single image. The same backbone serves multiple human-understanding tasks. The bridge to multi-task models.
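
To make representation (1) concrete, here is a minimal sketch. The keypoint names follow COCO's standard order; the (K, 3) array layout is one common convention, not the only one.

```python
import numpy as np

# COCO's 17 keypoints, in the dataset's standard order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# One person = one (17, 3) array of (x, y, confidence) rows;
# confidence 0 conventionally marks an unlabelled/invisible joint.
person = np.zeros((len(COCO_KEYPOINTS), 3), dtype=np.float32)
person[COCO_KEYPOINTS.index("nose")] = [412.0, 130.5, 0.98]
```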

The Naive Baseline — And Why It Fails

Your lecture's "Baseline" slide describes the obvious approach: image → CNN backbone → 2048-dim feature → linear → $2K$ numbers (the keypoint coordinates). Just regress the keypoint coordinates directly. Use L2 loss on the predicted $(x_k, y_k)$ versus ground truth.
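
A minimal PyTorch sketch of that baseline, assuming torchvision's ResNet-50 as the backbone (the choice is illustrative; any global-pooled feature extractor fits the slide's description):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

K = 17  # number of keypoints

# Naive baseline: global 2048-dim feature -> linear -> 2K coordinates.
backbone = resnet50(weights=None)
backbone.fc = nn.Linear(2048, 2 * K)  # replace the classifier head

def l2_keypoint_loss(images, gt_xy):
    """images: (B, 3, H, W); gt_xy: (B, K, 2) pixel coordinates."""
    pred_xy = backbone(images).view(-1, K, 2)
    return ((pred_xy - gt_xy) ** 2).mean()  # the brittle pixel-space L2
```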

Why it fails — the lecture asks this explicitly. Four problems. (1) L2 in pixel space is brutal. A 5-pixel error is visually negligible on a high-resolution close-up, but the same 5 pixels can span a whole forearm on a small, distant person — yet L2 penalises both identically. Poorly conditioned loss landscape. (2) No spatial reasoning. The CNN compresses the image to a global vector before predicting coordinates. The 2D structure of the image is destroyed in the bottleneck — but pose estimation is fundamentally a spatial task. (3) No uncertainty. Direct regression outputs a single point. If two arm positions are plausible (behind vs in front), you can only predict one — the model averages, producing nonsense midpoints. (4) No occlusion handling. Joints that are out of view force the model to hallucinate values.

The fix is one of the most important ideas in pose estimation: predict heatmaps, not coordinates.

Pose As Dense Prediction — The Heatmap Revolution

Instead of predicting $(x_k, y_k)$ directly, predict a 2D heatmap $H_k$ for each keypoint $k$. The heatmap is the same spatial size as the input image (or downsampled), and $H_k(i, j)$ is the probability that keypoint $k$ is at pixel $(i, j)$.

Output: $K$ heatmaps $H_1, \dots, H_K$ — one per keypoint. Ground truth: a Gaussian centred at each true keypoint location, $H_k^{\text{gt}}(i, j) = \exp\!\big(-\|(i, j) - (x_k, y_k)\|^2 / 2\sigma^2\big)$. Loss: per-pixel MSE between predicted heatmap and Gaussian ground truth. Inference: $\arg\max$ (or weighted soft-argmax) of each heatmap, optionally with a parabola fit through the max and its neighbours for sub-pixel accuracy.

This is pose estimation as dense prediction. Now you keep the full spatial resolution. The model can express multiple peaks for ambiguous joints (uncertainty). The loss has strong gradient everywhere, not just at one point. Spatial structure is preserved. This reframing — from regression to dense prediction — is what made modern pose estimation work.
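
A minimal NumPy sketch of the two endpoints of this pipeline, Gaussian ground-truth generation and argmax decoding; heatmap size and $\sigma$ are illustrative choices:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth heatmap: an unnormalised Gaussian at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_keypoint(heatmap):
    """Inference: argmax of the heatmap -> (x, y)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(x), float(y)

gt = gaussian_heatmap(64, 64, cx=40.2, cy=21.7)
print(decode_keypoint(gt))  # -> (40.0, 22.0): plain argmax is integer-precision
```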

Convolutional Pose Machines

Wei et al., CVPR 2016 (arXiv:1602.00134), introduced the architecture that locked this approach in. CPMs operate in stages. Each stage takes the original image features AND the belief maps (heatmaps) from the previous stage, and outputs a new, refined set of belief maps. Stage 1: $b^1 = g_1(x)$, where $x$ are the image features. Stage 2: $b^2 = g_2(x, b^1)$. And so on through $T$ stages.

The crucial idea: each subsequent stage has a larger receptive field, so it can use long-range spatial dependencies between joints — "the head is here; therefore the neck is probably just below; therefore the shoulders are at these likely locations…"

Each stage's belief map is supervised with the ground-truth Gaussian heatmap — intermediate supervision — which combats vanishing gradients in the deep cascade. The final output is the last stage's belief map. This is structurally similar to iterative refinement ideas you'll see again in image generation and recent vision-language models.
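
A toy PyTorch sketch of the CPM pattern. Single conv layers stand in for the real multi-layer stages; the wiring is the point: image features plus previous beliefs in, refined beliefs out, loss summed over every stage.

```python
import torch
import torch.nn as nn

class CPMSketch(nn.Module):
    """Stage 1 sees image features; later stages also see previous beliefs."""
    def __init__(self, feat_ch=128, n_keypoints=17, n_stages=3):
        super().__init__()
        self.features = nn.Sequential(  # stand-in for the shared CNN trunk
            nn.Conv2d(3, feat_ch, 9, padding=4), nn.ReLU())
        self.stage1 = nn.Conv2d(feat_ch, n_keypoints, 1)
        self.refine = nn.ModuleList([
            nn.Conv2d(feat_ch + n_keypoints, n_keypoints, 11, padding=5)
            for _ in range(n_stages - 1)])  # big kernels: larger receptive field

    def forward(self, img):
        f = self.features(img)
        beliefs = [self.stage1(f)]
        for stage in self.refine:
            beliefs.append(stage(torch.cat([f, beliefs[-1]], dim=1)))
        return beliefs  # supervise EVERY element, not just the last

def cpm_loss(beliefs, gt_heatmaps):
    # Intermediate supervision: sum the MSE over all stages.
    return sum(((b - gt_heatmaps) ** 2).mean() for b in beliefs)
```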

Evaluation — PCK (Percentage of Correct Keypoints)

Your lecture states this metric explicitly. Definition: a predicted keypoint is "correct" if its distance to the ground-truth keypoint is below a threshold. PCK is the percentage of correct keypoints averaged over the dataset.

Two common normalisations. PCKh@0.5 uses threshold = 50% of the head segment length (the "head bone"). Conservative, normalised by head size to be person-scale-invariant. Used on MPII. PCK@0.2 uses threshold = 20% of torso diameter. Used on FLIC.

Why normalise? Without normalisation, a 5-pixel error on a giant person filling the frame and a 5-pixel error on a distant person are wildly different in real-body terms. Normalising by body size gives a scale-invariant metric. Higher PCK is better, closer to 100%. The head bone is preferred over the torso because the torso shortens dramatically under pose articulation (sitting, bending) while the head is more stable.
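
A NumPy sketch of the metric; `norm_len` is whichever reference length the variant prescribes (head-segment length for PCKh, torso diameter for FLIC-style PCK):

```python
import numpy as np

def pck(pred, gt, norm_len, alpha=0.5):
    """Percentage of Correct Keypoints.
    pred, gt: (N, K, 2) keypoints; norm_len: (N,) per-person reference length.
    A keypoint is correct if its error is below alpha * norm_len."""
    dists = np.linalg.norm(pred - gt, axis=-1)     # (N, K) pixel errors
    correct = dists < alpha * norm_len[:, None]    # broadcast per person
    return 100.0 * correct.mean()

# PCKh@0.5: alpha=0.5, norm_len = head segment length (MPII)
# PCK@0.2:  alpha=0.2, norm_len = torso diameter (FLIC)
```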

Multi-Person Pose — Two Paradigms

Once you can find keypoints on a single isolated person, the hard problem becomes: what if there are 5 people in the photo, some overlapping? You can't just predict $K$ keypoints — you don't know how many people there are.

The lecture lists the three main challenges of multi-person pose: unknown number of people (could be 1, could be 50); interactions and occlusions between people mess up predictions; runtime should ideally not grow with the number of people.

Two paradigms emerged. OpenPose (Cao et al., CVPR 2017) is the canonical example of one — the bottom-up approach.

OpenPose — Bottom-Up Multi-Person

OpenPose's recipe: image → CNN → two branches. Branch 1 outputs $K$ keypoint heatmaps — predicting all keypoints in the image (across all people). This gives you a soup of candidate joints — say, 7 shoulders, 6 elbows, 9 wrists — scattered across the image. Branch 2 outputs Part Affinity Fields: a 2D vector field for every limb type (e.g., one PAF for the right-shoulder-to-right-elbow limb). At each pixel, the PAF stores a unit vector pointing along the direction of that limb if a limb is there.

The assembly step: associate keypoints into individuals by scoring possible pairs. For each candidate pair $d_1, d_2$, integrate the PAF along the line connecting them: $E = \int_0^1 \mathrm{PAF}\big(p(u)\big) \cdot \frac{d_2 - d_1}{\|d_2 - d_1\|} \, du$, where $p(u) = (1-u)\,d_1 + u\,d_2$ traces the line. High integral = the keypoints are connected by a real limb. Run a graph matching (Hungarian algorithm) to find consistent skeleton assignments.

Why PAFs are the clever bit. Without them, you'd have to test every shoulder-elbow pair against some geometric heuristic. PAFs let the network LEARN the limb-association cue itself — encoded as a direction vector field across the entire image. The Hungarian matching becomes a clean integer-programming step that operates on these learned scores. Output channels: $K + 2L$ — $K$ heatmaps plus 2 channels per limb ($x$ and $y$ of the PAF). With 18 keypoints and 19 limbs: $18 + 2 \times 19 = 56$ channels.

Bottom-up means: detect all keypoints in one pass, then assemble individuals. Runtime is roughly constant in the number of people (the keypoint detection is fixed-cost; only the matching scales).
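
A NumPy sketch of the line-integral score above, discretised into a fixed number of samples; the nearest-pixel lookup (and the assumption that both endpoints lie in bounds) are simplifications:

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Score a candidate limb (p1 -> p2) against one limb type's PAF.
    paf_x, paf_y: (H, W) vector-field components for this limb type.
    Returns the average dot product of the PAF with the limb direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    d = d / (np.linalg.norm(d) + 1e-8)        # unit limb direction
    total = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        x, y = (1 - u) * p1 + u * p2          # point p(u) on the line
        ix, iy = int(round(x)), int(round(y)) # nearest-pixel lookup
        total += paf_x[iy, ix] * d[0] + paf_y[iy, ix] * d[1]
    return total / n_samples                  # high = a real limb lies here
```

These pairwise scores then feed the bipartite matching step (scipy's `linear_sum_assignment` is one off-the-shelf Hungarian implementation).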

Mask R-CNN For Pose — The Top-Down Alternative

The alternative paradigm: first detect each person, then run single-person pose estimation on each detection. Mask R-CNN naturally extends to this — add a fourth head: a keypoint head that predicts a $56 \times 56$ (or larger) heatmap for each of the $K$ body joints per RoI. So for each detected person, you get $K$ heatmaps localising their joints within their bounding box.

Top-down means: localise each person first, then estimate poses individually. Runtime scales linearly with the number of people (you run the pose head once per detected person).
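
The paradigm in sketch form; every callable here (`detect_people`, `crop`, `estimate_pose`) is a hypothetical stand-in, but the per-person loop is exactly what makes the cost linear in the number of people:

```python
def top_down_pose(image, detect_people, crop, estimate_pose):
    """Top-down multi-person pose, with all components as assumptions:
    detect_people(image) -> list of (x1, y1, x2, y2) boxes;
    crop(image, box) -> sub-image;
    estimate_pose(sub_image) -> (K, 2) numpy array in crop coordinates."""
    poses = []
    for box in detect_people(image):
        x1, y1, _, _ = box
        kps = estimate_pose(crop(image, box))  # one pose pass PER person
        kps[:, 0] += x1                        # shift crop coordinates
        kps[:, 1] += y1                        # back into image coordinates
        poses.append(kps)
    return poses
```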

Top-Down vs Bottom-Up — The Canonical Trade-Off

This is the most likely exam comparison in this lecture.

Top-down (Mask R-CNN style): pipeline is detect people → estimate pose per person. Higher accuracy per individual (sees one isolated person). Runtime scales with number of people. Sensitive to detection errors (missed detection = missed pose). Crowded scenes are worse. Best when few people and accuracy is critical.

Bottom-up (OpenPose style): pipeline is detect all keypoints → assemble people. Lower accuracy per individual (has to disambiguate associations). Runtime is ~constant in number of people. Robust to detection errors. Better in crowded scenes. Best when many people and real-time is needed.

If a question asks "which would you use for a crowd-counting application running at 30 fps on a smartphone?" → bottom-up (OpenPose). If "high-precision pose for a single golfer on a launch monitor?" → top-down (Mask R-CNN keypoint).

From 2D To 3D — SMPL And Human Mesh Recovery

2D pose is useful, but the world is 3D. SMPL — Skinned Multi-Person Linear Model: a template mesh of 6,890 vertices in a canonical pose, shape parameters $\beta$ (PCA components across body shapes), and pose parameters $\theta$ (24 joints × 3 axis-angle each, $\theta \in \mathbb{R}^{72}$). To compute the actual mesh, start from the template, apply shape deformations from $\beta$, then apply pose deformations via linear blend skinning — each vertex is influenced by nearby joints according to learned skinning weights. The output is a full 3D body mesh. SMPL is deterministic and differentiable in $(\beta, \theta)$.
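
A minimal sketch of linear blend skinning itself, the step that poses the mesh. Real SMPL also applies shape and pose blendshapes to the template before skinning; this sketch omits them.

```python
import numpy as np

def linear_blend_skinning(verts, weights, joint_transforms):
    """verts: (V, 3) rest-pose vertices; weights: (V, J) skinning weights
    (rows sum to 1); joint_transforms: (J, 4, 4) per-joint world transforms.
    Each vertex moves by a weight-blended mix of its joints' transforms."""
    V = verts.shape[0]
    homo = np.concatenate([verts, np.ones((V, 1))], axis=1)      # (V, 4)
    # Blend the 4x4 transforms per vertex: (V, J) @ (J, 16) -> (V, 4, 4)
    blended = (weights @ joint_transforms.reshape(-1, 16)).reshape(V, 4, 4)
    posed = np.einsum("vij,vj->vi", blended, homo)               # apply
    return posed[:, :3]
```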

Human Mesh Recovery (HMR) predicts $(\beta, \theta)$ from a monocular image. Standard architecture: CNN backbone extracts image features; regress $(\beta, \theta)$ and camera parameters directly; use SMPL to compute mesh vertices and 3D joints. Loss: project the 3D joints back to 2D using the predicted camera, compare to 2D ground-truth joints — $L_{\text{reproj}} = \sum_k \|\Pi(X_k) - x_k\|$, where $X_k$ are the model's 3D joints and $x_k$ the 2D labels. Optionally: 3D supervision when available, adversarial loss to keep poses plausible. This bridges 2D pose estimation and 3D body modelling. The output is a full 3D mesh you can render from any view, animate, fit clothing to.
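
A PyTorch sketch of the reprojection loss, assuming HMR's weak-perspective camera (scale plus 2D translation); the tensor shapes and visibility masking are assumptions about data layout, not the paper's exact code:

```python
import torch

def reprojection_loss(joints3d, cam, gt_2d, visible):
    """joints3d: (B, K, 3) SMPL joints; cam: (B, 3) weak-perspective
    (scale, tx, ty); gt_2d: (B, K, 2); visible: (B, K) 0/1 float mask.
    Project with the predicted camera, penalise distance to 2D labels."""
    s = cam[:, :1].unsqueeze(-1)               # (B, 1, 1) scale
    t = cam[:, 1:].unsqueeze(1)                # (B, 1, 2) translation
    proj = s * joints3d[..., :2] + t           # weak-perspective projection
    err = (proj - gt_2d).norm(dim=-1)          # (B, K) per-joint error
    return (visible * err).sum() / visible.sum().clamp(min=1)
```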

Sapiens — Foundation Models For Humans

Meta's Sapiens (ECCV 2024) — a ViT-based foundation model trained on 300M+ human-centric images that, with task-specific heads, simultaneously handles pose, segmentation, depth, and surface normals. Same recipe as everywhere else in modern vision: pretrain a big backbone on massive data, attach lightweight task-specific heads. Pose estimation, as a field, is increasingly subsumed by general human-understanding models.

What You Must Walk Into The Exam Carrying

The four pose representations in increasing detail: skeleton ($K$ keypoints), DensePose (per-pixel canonical surface mapping), SMPL (parametric 3D mesh with shape + pose), foundation-model multi-task features. Why direct keypoint regression fails (L2 ill-conditioning in pixel space, no spatial reasoning past the CNN bottleneck, no uncertainty, no occlusion handling) — and the fix: predict heatmaps. Pose as dense prediction: heatmaps, Gaussian-centred GT, MSE loss, argmax inference. CPM: iterative belief-map refinement over stages, intermediate supervision combats vanishing gradients, receptive field grows with depth. PCK metric with PCKh@0.5 and PCK@0.2 forms — normalised by head bone or torso diameter for scale invariance. OpenPose (bottom-up): keypoint heatmaps + PAFs + Hungarian matching; $K + 2L$ output channels; constant runtime in number of people. Mask R-CNN keypoints (top-down): detect first, pose per RoI; per-person accurate; scales linearly with people. Top-down vs bottom-up trade-off with the four-dimensional comparison. SMPL: template + $\beta$ + $\theta$ → 6,890-vertex mesh via linear blend skinning. HMR: regress $(\beta, \theta)$ from a monocular image with a 2D reprojection loss. Sapiens: foundation-model approach with multi-task heads.

That's pose estimation: from $K$ joints to a parametric 3D body, via the heatmap revolution that made modern pose work, and the bottom-up/top-down trade-off that defines every multi-person system in the wild.