Pose Estimation — Heatmaps, CPM, OpenPose, SMPL

Intuition

Predicting K joint locations as a single K×2 coordinate vector (17×2 for a 17-keypoint skeleton such as COCO) throws away spatial structure. Most modern pose methods instead predict one heatmap per joint and read the joint location off the argmax, refined by parabola-fit interpolation for sub-pixel accuracy. This is the same dense-prediction idea used in segmentation.

Explanation

Direct regression of (x, y) per joint from a global feature vector loses pixel-level precision and can't express uncertainty. Heatmap regression sidesteps both issues: output K 2D heatmaps, train with per-pixel MSE against a 2D Gaussian centered on the ground-truth joint, and recover the joint at inference by taking the argmax plus a parabolic fit for sub-pixel accuracy.
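
A minimal sketch of the encode/decode steps described above, in plain Python with nested lists rather than a tensor library. The function names, the grid size, and sigma = 2 are illustrative assumptions, not taken from any particular codebase.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Training target: a height x width grid with a Gaussian bump centred on (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def parabolic_offset(left, centre, right):
    """Vertex of the parabola through three samples, relative to the centre sample."""
    denom = left - 2.0 * centre + right
    if denom == 0.0:
        return 0.0  # flat neighbourhood: no refinement possible
    return 0.5 * (left - right) / denom

def decode_keypoint(heatmap):
    """Integer argmax followed by a 1D parabola fit along each axis."""
    height, width = len(heatmap), len(heatmap[0])
    # Integer-resolution peak.
    py, px = max(((y, x) for y in range(height) for x in range(width)),
                 key=lambda yx: heatmap[yx[0]][yx[1]])
    x, y = float(px), float(py)
    # Refine only when both neighbours exist (peak not on the border).
    if 0 < px < width - 1:
        x += parabolic_offset(heatmap[py][px - 1], heatmap[py][px], heatmap[py][px + 1])
    if 0 < py < height - 1:
        y += parabolic_offset(heatmap[py - 1][px], heatmap[py][px], heatmap[py + 1][px])
    return x, y

# A joint at a non-integer location is recovered to within a small fraction of a pixel.
target = gaussian_heatmap(32, 24, cx=10.3, cy=7.6)
x_hat, y_hat = decode_keypoint(target)
```

The parabola is only an approximation to a Gaussian peak, so the recovered location is close to, but not exactly, the true centre.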

Convolutional Pose Machines (CPM) treat pose as iterative refinement. Stage 1 outputs an initial belief map per keypoint; each subsequent stage takes (image features + previous belief maps) as input and outputs a refined belief map. Belief maps encode spatial context — the location of a wrist informs likely elbow positions. Intermediate supervision (loss at every stage) is essential to combat vanishing gradients in the deep cascade.
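
Intermediate supervision can be written as a sum of per-stage heatmap losses. The notation here is an assumption introduced for illustration: T is the number of stages, b_t^k is the belief map predicted for keypoint k at stage t, b_*^k is the Gaussian target, and p ranges over pixels. Any background channel is ignored in this sketch.

```latex
\mathcal{L}_{\text{CPM}} = \sum_{t=1}^{T} \sum_{k=1}^{K} \sum_{p} \bigl\| b_t^{k}(p) - b_*^{k}(p) \bigr\|_2^2
```

Because every stage contributes its own term, gradient reaches the early stages directly instead of only through the later ones.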

Multi-person pose comes in two flavours. Top-down (Mask R-CNN keypoints, AlphaPose): first detect person bounding boxes, then run a single-person pose net on each crop. Accurate, but runtime scales linearly with the number of people, and any person the detector misses is lost. Bottom-up (OpenPose): detect all keypoints in one shot, then group them into individuals via a learned association mechanism. Network runtime is roughly constant in the number of people; grouping is the hard part.

OpenPose solves the grouping problem with Part Affinity Fields. For each limb type (e.g., left elbow → left wrist), the network predicts a 2D vector field over the image: along the limb the field is a unit vector pointing from one keypoint to the other; elsewhere it's zero. To score a candidate pairing of (elbow_i, wrist_j), integrate the dot product of the PAF along the line between them. High score means the limb's direction field flows from one keypoint to the other. Bipartite matching (Hungarian algorithm) finds the highest-scoring set of consistent pairings per limb type.
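
The scoring and matching steps above can be sketched as follows. This is a toy illustration under stated assumptions: `paf` is a hypothetical callable returning the field value at a point, the integral is approximated by uniform sampling, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same optimum but is only practical for a handful of candidates).

```python
import math
from itertools import permutations

def paf_score(paf, a, b, num_samples=10):
    """Approximate the line integral of PAF . unit(a -> b) by uniform sampling.

    paf(x, y) is a hypothetical callable returning (vx, vy); num_samples must be >= 2.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector from a to b
    total = 0.0
    for s in range(num_samples):
        u = s / (num_samples - 1)  # u runs from 0 to 1 inclusive
        vx, vy = paf(a[0] + u * dx, a[1] + u * dy)
        total += vx * ux + vy * uy
    return total / num_samples

def best_assignment(scores):
    """One-to-one pairing with the highest total score (exhaustive search).

    Assumes len(scores) <= len(scores[0]).
    """
    n_rows, n_cols = len(scores), len(scores[0])
    best, best_total = None, -math.inf
    for cols in permutations(range(n_cols), n_rows):
        total = sum(scores[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best, best_total = cols, total
    return list(enumerate(best))

def toy_paf(x, y):
    """Toy field: unit vector (1, 0) inside two horizontal bands, zero elsewhere."""
    in_band = 0.0 <= x <= 20.0 and (abs(y - 10.0) <= 1.0 or abs(y - 30.0) <= 1.0)
    return (1.0, 0.0) if in_band else (0.0, 0.0)

elbows = [(0.0, 10.0), (0.0, 30.0)]
wrists = [(20.0, 30.0), (20.0, 10.0)]
scores = [[paf_score(toy_paf, e, w) for w in wrists] for e in elbows]
pairs = best_assignment(scores)  # list of (elbow index, wrist index) tuples
```

In the toy setup each elbow shares a band with exactly one wrist, so the diagonal pairings only pick up field at their endpoints and score much lower than the horizontal ones.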

OpenPose's output has K + 2L channels: K keypoint heatmaps + 2 channels per limb (x and y components). With 18 keypoints and 19 limbs you get 18 + 38 = 56 channels.
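
A small sketch of the channel bookkeeping. The ordering used here (all heatmaps first, then x/y pairs per limb) is an assumption for illustration; a real implementation may lay the channels out differently.

```python
def split_channels(num_keypoints, num_limbs):
    """Index layout for a K + 2L channel stack.

    Assumed ordering: K heatmaps, then (x, y) PAF channels for each limb in turn.
    """
    heatmap_idx = list(range(num_keypoints))
    paf_idx = [(num_keypoints + 2 * limb, num_keypoints + 2 * limb + 1)
               for limb in range(num_limbs)]
    total = num_keypoints + 2 * num_limbs
    return heatmap_idx, paf_idx, total

heatmap_idx, paf_idx, total = split_channels(num_keypoints=18, num_limbs=19)
```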

SMPL (Skinned Multi-Person Linear) is a parametric body model. Shape parameters β ∈ ℝ¹⁰ are PCA components over body shapes; pose parameters θ ∈ ℝ⁷² are axis-angle rotations for 24 joints. Forward: template mesh → shape blendshapes → pose blendshapes → linear blend skinning → posed 6890-vertex mesh. Human Mesh Recovery (HMR) predicts (β, θ) from a monocular image, giving full 3D body shape rather than just 2D keypoints.
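
To make the 72-d pose vector concrete, the sketch below splits θ into 24 axis-angle triples and converts each to a rotation matrix with Rodrigues' formula. It covers only this unpacking step, not the blendshapes or skinning, and the function names are illustrative assumptions. In the usual SMPL convention the first triple is the global root orientation.

```python
import math

def axis_angle_to_matrix(rx, ry, rz):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix (nested lists)."""
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-12:
        # Zero rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = rx / angle, ry / angle, rz / angle  # unit rotation axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t,     x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t,     y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def unpack_pose(theta):
    """Split a 72-d pose vector into 24 per-joint rotation matrices."""
    if len(theta) != 72:
        raise ValueError("expected 72 pose parameters (24 joints x 3)")
    return [axis_angle_to_matrix(*theta[3 * j: 3 * j + 3]) for j in range(24)]

# Rest pose everywhere except a 90-degree rotation about z on the first triple.
theta = [0.0] * 72
theta[2] = math.pi / 2
rotations = unpack_pose(theta)
```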

Definitions

Heatmap regression — Predict a 2D Gaussian belief map per keypoint; argmax + parabola fit at inference.
CPM — Convolutional Pose Machine — multi-stage refinement with intermediate supervision.
Part Affinity Field (PAF) — 2D vector field per limb type; encodes limb direction; used to group keypoints via line-integral score.
Top-down pose — Detect person bboxes first, then run single-person pose per crop.
Bottom-up pose — Detect all keypoints jointly, then group into individuals.
SMPL — Skinned Multi-Person Linear body model; β ∈ ℝ¹⁰ shape + θ ∈ ℝ⁷² pose → 6890-vertex mesh.

Formulas

\text{PAF score}(A,B) = \int_{0}^{1} L_c(p(u)) \cdot \hat v_{AB}\, du,\ \ p(u) = (1-u) A + u B,\ \ \hat v_{AB} = \frac{B - A}{\|B - A\|}
\text{PCK@}\alpha = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\bigl[\|\hat p_i - p_i\| \le \alpha \cdot d_{\text{ref}}\bigr]
\text{OpenPose output channels} = K + 2L
\text{SMPL}\ :\ M(\beta, \theta) = W\bigl(T_P(\beta, \theta), J(\beta), \theta, \mathcal{W}\bigr)
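
The PCK criterion above can be sketched in a few lines. The function name and the sample coordinates are made up for illustration; `d_ref` is whatever reference length the benchmark prescribes (head bone length for PCKh).

```python
import math

def pck(predictions, ground_truth, d_ref, alpha=0.5):
    """Fraction of keypoints whose Euclidean error is at most alpha * d_ref."""
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("need equally sized, non-empty keypoint lists")
    threshold = alpha * d_ref
    correct = sum(
        1 for (px, py), (gx, gy) in zip(predictions, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(predictions)

# Head bone length 50 px at alpha = 0.5 gives a 25 px threshold.
gt = [(100.0, 100.0), (200.0, 200.0), (300.0, 300.0), (400.0, 400.0)]
pred = [(110.0, 100.0), (200.0, 225.0), (330.0, 300.0), (400.0, 400.0)]
score = pck(pred, gt, d_ref=50.0, alpha=0.5)  # errors: 10, 25, 30, 0 px
```

Three of the four errors fall within the 25 px threshold (the 25 px error counts because the comparison is inclusive), so the score is 0.75.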

Derivations

Why heatmap regression beats coordinate regression: coordinate regression produces a single point with no uncertainty; heatmap output can express ambiguity (two peaks). MSE on a Gaussian-shaped target is also more stable than direct L2 on coordinates because the loss surface is smoother.

Examples

PCKh@0.5 with head bone length 50 px: a predicted keypoint within 25 px of ground truth is correct. The head bone is used because it varies less than the torso under pose articulation.
OpenPose at inference: encoder → K heatmaps + 38 PAF channels → keypoint candidates from the local maxima of each heatmap (non-maximum suppression; a single argmax would return only one person's joint) → bipartite matching per limb type using PAF integrals → assembled skeletons.
HMR pipeline: ResNet-50 encoder → (β, θ, camera) regression → SMPL forward → projected 2D keypoints for a reprojection loss → adversarial discriminator on (β, θ) realism.

Diagrams

OpenPose two-branch architecture: shared backbone → branch 1 outputs keypoint heatmaps, branch 2 outputs PAFs, both refined over T stages.
PAF vector field overlay on a person image: arrows along each limb pointing from one keypoint to the other.
SMPL hierarchy: template → shape blendshapes (β) → pose blendshapes (θ) → linear blend skinning → posed mesh.

Edge cases

Top-down breaks when the upstream detector misses a person.
Bottom-up grouping is hard when people heavily overlap; PAF scores can prefer wrong pairings in dense crowds.
Heatmap argmax has integer resolution; fit a parabola through the maximum and its two neighbours along each axis (3 samples) for sub-pixel accuracy.

Common mistakes

Writing 'OpenPose output is 2K channels' — it's K + 2L (one heatmap per keypoint, two PAF channels per limb).
Forgetting that PCK normalises by body size (head bone or torso diameter), not absolute pixel distance.
Conflating SMPL pose θ (72-d axis-angle for 24 joints) with shape β (10-d PCA).
Treating top-down as 'always more accurate' — true only when the detector is reliable.

Shortcuts

Numbers to memorise: K + 2L channels; β = 10, θ = 72; SMPL mesh = 6890 vertices.
PAF score = line integral of (PAF · limb direction). Bipartite matching via Hungarian.
PCKh@0.5 = standard reporting metric for human pose on MPII; COCO reports OKS-based AP instead.
