Computer Vision

CSE471

Prof. Makarand Tapaswi + Prof. Charu Sharma•Spring 2025-26•4 credits

Definitions

Every term, every chapter.

Unit 1 — Introduction & Foundations

Foundations of Computer Vision — Marr, Three Rs, Gestalt, Why CV is Hard

Computer vision: The science of extracting all possible information about a visual scene from images — answering What/Where/Who/When/Why/How/How-many.
Marr's definition of vision: (1982) 'To know what is where, by looking.' The one-line mission statement of the field.
Three Rs (Malik): Reorganisation (group pixels), Recognition (label them), Reconstruction (measure geometry). Modern CV systems use all three.
Semantic gap: The conceptual distance between low-level pixel intensities and high-level semantic concepts (objects, intent, action). Bridging it is the central technical challenge.
Intra-class variation: The amount of visual variability within a single semantic class (e.g., 'chair' includes throne, beanbag, office chair). Often larger than inter-class variation — which is why pure appearance-matching fails.
Affordance (Gibson): A possibility for action that an object offers an agent (a chair affords sitting, a handle affords grasping). Provides a functional definition of object category that is more robust than appearance.
Gestalt principles: Pre-attentive grouping rules formulated by 1920s German psychologists: Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground, Symmetry. Motivate classical segmentation and modern attention.
Inverse problem of vision: Image formation is many-to-one (3D scene → 2D image), so inversion (2D → 3D + identity) is one-to-many and underdetermined. Vision must use priors to choose a plausible interpretation.
The Bitter Lesson (Sutton): Empirical observation that, across decades of AI research, methods that leverage raw computation eventually beat methods that bake in domain knowledge. Drives the 'just scale it up' approach in modern CV.
Cambrian explosion connection: One hypothesis holds that the onset of vision helped drive the rapid evolutionary diversification ~540 Myr ago. Cited as evidence that visual perception is valuable enough for evolution to pay its high computational cost.
Summer Vision Project (1966): MIT memo by Seymour Papert proposing CV could be largely solved in one summer. The iconic underestimation of the field.
Inattentional blindness: Failure to notice obvious stimuli when attention is occupied elsewhere (Simons & Chabris's 'invisible gorilla'). A reminder that biological vision is selective, not a faithful camera.

Unit 2 — Digital Image Processing Recap

DIP — Filters, Histograms, Fourier/DCT, Morphology, Geometric Ops, Hough, Templates

Sampling vs quantisation: Sampling = discretising spatial position (which pixels exist). Quantisation = discretising intensity (how many gray levels per pixel).
Spatial domain: Operations on pixel values directly. Three flavours: Point→Point, Neighbourhood→Point, Global→Point.
Transform domain: Operations after transforming to another basis (Fourier/DCT/wavelet); useful for convolution, compression, periodic-noise removal.
Convolution theorem: Convolution in space ↔ multiplication in frequency. Justifies FFT-based filtering for large kernels.
Separable filter: A 2D filter $H (x, y) = H_{x} (x) H_{y} (y)$ — can be applied as two 1D passes. Gaussian is separable; Laplacian is not.
Bilateral filter: Edge-preserving smoothing. Weights = spatial Gaussian × range (intensity) Gaussian. Across an edge, intensity differs → range weight collapses → no blur across edge.
Otsu's method: Automatic threshold for bimodal histograms. Picks T that maximises between-class variance (equivalently minimises within-class variance). Exhaustive $O (L)$ sweep over all $L$ intensity levels using running sums.
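
The sweep described for Otsu's method can be sketched in a few lines. This is a minimal illustration, assuming the histogram is given as a plain list of counts; the function name `otsu_threshold` and the convention that pixels with intensity `<= T` form the first class are choices made here, not taken from the course material.

```python
def otsu_threshold(hist):
    """Return the threshold T that maximises between-class variance.

    hist[i] is the pixel count at intensity i. Pixels with intensity <= T
    form class 0, the rest form class 1 (a convention assumed here).
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0     # running pixel count of class 0
    sum0 = 0   # running intensity sum of class 0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Because the class counts and sums are updated incrementally, each candidate threshold costs constant time, which is where the linear cost in the number of levels comes from.
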
Morphological erosion / dilation: Erosion $A \ominus B$ : SE fits inside A → MIN filter, shrinks. Dilation $A \oplus B$ : SE touches A → MAX filter, grows. Duals.
Opening / Closing: Opening = erode then dilate (kills noise dots, preserves shape). Closing = dilate then erode (fills small holes). Both idempotent.
Hit-or-Miss transform (HAM): Match a foreground pattern AND its background simultaneously. Used to locate isolated points, line endpoints, T-junctions, corners.
Affine vs projective: Affine (6 DoF) preserves parallel lines. Projective/homography (8 DoF) preserves only straight lines — allows perspective.
Forward vs inverse warping: Forward: iterate over source pixels, push to T(x,y) — produces holes + overlaps. Inverse: iterate over destination, pull from T⁻¹(x', y') — every dest filled exactly once + needs interpolation.
Hough transform: Voting in parameter space. Each image edge pixel votes for all lines through it; peaks in $(ρ, θ)$ accumulator = detected lines. Polar form avoids the vertical-line infinity problem.
Template matching: Slide a template over the image; score similarity (SSD, NCC, correlation coefficient). Fast but NOT rotation/scale invariant — motivates feature descriptors.
Histogram equalisation: Map intensity via CDF: $s = (L - 1) CDF (r)$ . Output histogram is approximately uniform; contrast maximised; can amplify noise in flat regions.
DCT vs DFT: DCT is real-valued (cosines only) and mirror-extends → no boundary discontinuity → better energy compaction. JPEG uses DCT; spectrum analysis uses DFT.

Unit 3 — Machine Learning Recap

ML — Logistic, NN+Backprop, Ensembles, Density, RNN, Metrics, kNN, Regression, PCA/SVD, Clustering

Sigmoid / softmax: Squash $\mathbb{R}$ to $(0, 1)$ / probability simplex. Sigmoid for binary, softmax for $K$ classes.
Cross-entropy loss: Information-theoretic distance between distributions. Categorical: $- \sum_{k} y_{k} \log \hat{y}_{k}$ . Gradient on logits = $\hat{y} - y$ ; clean and stable.
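
The claim that the gradient on the logits is the predicted distribution minus the one-hot target can be checked numerically. The sketch below assumes a softmax output and a one-hot target given as a class index; all function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Categorical cross-entropy for a one-hot target at index `target`."""
    return -math.log(softmax(logits)[target])

def grad_logits(logits, target):
    """Analytic gradient: softmax(logits) minus one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[k] += eps
        dn[k] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(dn, target)) / (2 * eps))
    return grads
```

Comparing `grad_logits([2.0, -1.0, 0.5], 0)` with `numeric_grad` on the same inputs should agree to several decimal places.
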
Backpropagation: Chain-rule traversal of a computation graph from loss back to parameters. Each node multiplies the incoming gradient by its local Jacobian.
Vanishing / exploding gradients: Gradient norm decays to 0 (or blows up) as it propagates back through many layers / time steps. Mitigations: careful init, ReLU / skip connections (feed-forward nets), LSTM gating / gradient clipping (recurrent nets).
Batch Normalisation: Normalise activations per mini-batch to zero-mean unit-variance; then learnable scale and shift. Faster training, less init-sensitive, slight regularisation. At inference: use running mean/var, NOT batch stats.
Dropout: Stochastically zero $p$ fraction of activations during training. Acts as an ensemble + regulariser. Disable (or rescale) at inference.
Bagging vs Boosting: Bagging: parallel, bootstrap samples, reduces variance (Random Forest). Boosting: sequential, reweight errors, reduces bias (AdaBoost; Viola-Jones face detection).
Gaussian Mixture Model: $p (x) = \sum_{k} π_{k} \mathcal{N} (x \mid μ_{k}, Σ_{k})$ . Trained by EM. Soft clustering with covariance-shaped components.
EM algorithm: E-step: compute responsibilities given current parameters. M-step: re-estimate parameters by weighted MLE. Monotonically improves log-likelihood; converges to a local maximum.
LSTM: Gated recurrent cell with three gates (forget, input, output) and an additive cell-state path that mitigates vanishing gradients (constant error carousel).
Precision / Recall / F1: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) (harmonic mean). Use precision when FP costly; recall when FN costly.
AP / mAP: Average Precision = area under the PR curve for one class. mAP = mean over classes (or queries).
ROC vs PR curve: ROC: TPR vs FPR. PR: Precision vs Recall. Prefer PR for imbalanced data with rare positives.
kNN: Lazy learner; predict by majority vote among k nearest training points. Small k → variance; large k → bias. k usually odd.
Normal equations: Closed-form least squares: $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} y$ . Fails when $X^{⊤} X$ is singular (collinearity) or too large to invert.
L1 vs L2 regularisation: L1 = sum of |w|; drives weights to exactly 0 → feature selection. L2 = sum of w²; shrinks smoothly, no exact zeros.
PCA: Find orthogonal directions of max variance. Steps: centre → covariance → eigendecompose → keep top-k. $λ_{i}$ = variance along $v_{i}$ .
SVD: $X = U Σ V^{⊤}$ . Right singular vectors $V$ are PCA components. Numerically stable. Best rank- $r$ approximation in Frobenius norm (Eckart–Young).
LoRA: Low-Rank Adaptation: $Δ W = A B$ with $A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times d}$ , $r ≪ d$ . Cheap fine-tuning of huge models. Same spirit as SVD.
k-means: Hard-assignment clustering. Init → assign → update → repeat. Issues: spherical bias, init-sensitive, outlier-sensitive. Fixes: k-means++, GMM, k-medoids.
Hierarchical clustering: Agglomerative (bottom-up) or divisive (top-down). Produces a dendrogram; cut at different heights for different cluster counts. Linkage: single / complete / average / Ward.

Unit 4 — Convolutional Neural Networks (CNNs)

CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field

Convolution layer: Stack of learnable filters slid across input. Weight-shared across space → translation equivariance. Params $F (C_{in} K^{2} + 1)$ , independent of $H, W$ .
Receptive field: Set of input pixels that influence one output unit. Grows with depth (and stride). $L$ stacked $3 \times 3$ stride 1 → $RF = 2 L + 1$ .
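
The growth of the receptive field with depth, stride and dilation follows a simple recurrence. The helper below is a sketch assuming a plain sequential stack, with each layer described by a hypothetical `(kernel, stride, dilation)` tuple.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel, stride, dilation) tuples, ordered input to output.
    """
    rf = 1    # receptive field so far
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride, dilation in layers:
        k_eff = kernel + (kernel - 1) * (dilation - 1)  # effective kernel size
        rf += (k_eff - 1) * jump
        jump *= stride
    return rf
```

For five stacked 3×3 stride-1 layers, `receptive_field([(3, 1, 1)] * 5)` gives 11, matching $2 L + 1$ with $L = 5$.
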
Pooling (max / avg / GAP): Downsample with NO parameters. Max forwards the largest value and routes the gradient to its argmax (translation-robust). Avg distributes the gradient equally. GAP collapses $H \times W \to 1 \times 1$ , replaces giant FC.
Same / Valid padding: Same: $P = (K - 1) /2$ (odd $K$ ) → output size equals input. Valid: $P = 0$ → output shrinks by $K - 1$ .
1×1 convolution: Per-pixel MLP across channels. Three uses: channel reduction (bottleneck), non-linearity injection, cross-channel mixing; all at low compute cost. Used in Inception, ResNet, MobileNet, SENet.
Dilated (atrous) convolution: Inserts $d - 1$ zeros between kernel taps → effective kernel $K + (K - 1) (d - 1)$ . Enlarges RF without extra parameters or stride. Used in DeepLab, WaveNet. Gridding artefact at large $d$ .
Depthwise-separable convolution: Depthwise ( $K \times K$ per channel) + pointwise ( $1 \times 1$ ). $\sim 8$ – $9 \times$ cheaper than standard conv at $K = 3$ . Foundation of MobileNet, Xception.
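
The quoted saving at $K = 3$ can be reproduced by counting multiply-accumulates. This is a rough sketch that ignores biases and assumes stride 1 with 'same' padding; the function names and the 256-channel example are illustrative.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of a standard K x K convolution on an H x W output."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(c_in, c_out, k, h, w):
    """Depthwise K x K (one filter per input channel) plus pointwise 1 x 1."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

ratio = standard_conv_macs(256, 256, 3, 56, 56) / depthwise_separable_macs(256, 256, 3, 56, 56)
```

The ratio equals $1 / (1 / C_{out} + 1 / K^{2})$ , so it approaches $K^{2} = 9$ as the number of output channels grows.
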
Batch Normalisation (CNN): Normalise per channel across $(N, H, W)$ ; learnable $γ, β$ per channel. Inference uses running mean/var (not batch). Faster training, less init-sensitive.
Translation equivariance vs invariance: $f (T_{v} (x)) = T_{v} (f (x))$ vs $f (T_{v} (x)) = f (x)$ . Conv layers are equivariant; after global pool + FC, network is invariant. CNNs are NOT rotation equivariant.
LeNet: (LeCun, 1989/1998) First successful CNN. ~60k params. Two conv-pool blocks then FC. Template every later CNN follows.
AlexNet: (Krizhevsky et al., 2012) 60M params, ReLU + Dropout + augmentation, trained on 2 GPUs. Won ILSVRC-2012 by roughly 10 percentage points of top-5 error and started the deep-learning era.
VGG: (Simonyan & Zisserman, 2014) Only $3 \times 3$ convs, deep + simple. VGG-19: 143.7M params, mostly in FC layers. Top-1 74.2%.
Inception / GoogLeNet: (Szegedy et al., 2014) Parallel branches at multiple kernel sizes with 1×1 bottlenecks placed BEFORE expensive 3×3/5×5. Inception-v3: 27.2M params, top-1 77.3%.
ResNet: (He et al., 2015) $y = F (x) + x$ residual connection. Gradient $\partial F / \partial x + I$ ⇒ no vanishing. Enables 100+ layers. ResNet-50: 25.6M params, 76.1% top-1 — the workhorse backbone.
DenseNet: (Huang et al., 2016) $x_{l} = H_{l} ([x_{0}, \dots, x_{l - 1}])$ — concatenation (not addition). Strong gradient flow, feature reuse.
SENet (Squeeze-Excitation): (Hu et al., 2017) Channel attention: GAP → 2-layer MLP → sigmoid scale → multiply. +1% top-1 at near-zero cost.
MobileNet: (Howard et al., 2017) Depthwise-separable convs throughout. ~4.2M params, ~70% top-1. Designed for phones / edge.
EfficientNet: (Tan & Le, 2019) Compound scaling: $d = α^{ϕ}, w = β^{ϕ}, r = γ^{ϕ}$ , $α β^{2} γ^{2} \approx 2$ . B0: 5.3M params, top-1 77.7%.
C3D / I3D / SlowFast: Video CNN family. C3D: 3D conv from scratch. I3D: inflate 2D pretrained weights along time. SlowFast: slow (semantics, low fps high channel) + fast (motion, high fps low channel) pathways.

Unit 5 — Object Detection

Object Detection — R-CNN family, YOLO, NMS, mAP

Bounding box (modal vs amodal): Modal: covers only the visible portion of the object; standard in PASCAL VOC, COCO, KITTI. Amodal: covers the full extent including occluded parts; used in specialised benchmarks.
Anchor: A predefined box prior at fixed scale and aspect ratio; predictions are regression offsets from it. Faster R-CNN uses $k = 9$ anchors per location (3 scales × 3 ratios).
Selective Search: Classical bottom-up segmentation (Uijlings et al., IJCV 2013). Starts from oversegmentation, greedily merges similar regions using colour + texture + size + fill similarity. Produces ~2000 class-agnostic proposals per image. Workhorse of R-CNN and Fast R-CNN; not learned.
RoI Pool: Project the proposal to the feature map (quantising to integer cells), divide into a fixed $h \times w$ grid (typically $7 \times 7$ ), max-pool per cell. Differentiable, but the two roundings misalign the pooled features by a fraction of a feature-map cell, which is several pixels in the input image at typical strides; fatal for masks.
RPN (Region Proposal Network): Faster R-CNN's learned, backbone-sharing proposal generator. $3 \times 3$ conv → two $1 \times 1$ heads (objectness + box regression). Translation-invariant: same anchor set at every spatial location.
FPN (Feature Pyramid Network): Multi-scale feature representation with a top-down pathway and lateral connections (Lin et al., CVPR 2017). High-resolution shallow features + low-resolution deep features fused via $1 \times 1$ lateral convs. Lets one detector head handle small + large objects together. Standard in most modern detectors.
GIoU: Generalised IoU: $\mathrm{GIoU} = \mathrm{IoU} - |C \setminus (A \cup B)| / |C|$ where $C$ is the smallest enclosing box. Non-zero gradient even when boxes don't overlap. Bounded in $[- 1, 1]$ .
Focal Loss: Cross-entropy multiplied by $(1 - p_{t})^{γ}$ with $γ \approx 2$ (Lin et al., ICCV 2017). Crushes the gradient contribution of easily classified examples; essential for single-stage detectors where background anchors vastly outnumber foreground.
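
A small worked example of GIoU for axis-aligned boxes may help. It is a sketch assuming boxes are `(x1, y1, x2, y2)` tuples with positive area; the function name is illustrative.

```python
def giou(a, b):
    """Generalised IoU of two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection (zero if the boxes do not overlap)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box C enclosing both
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    iou = inter / union
    return iou - (c_area - union) / c_area
```

For two disjoint unit boxes one unit apart, `giou((0, 0, 1, 1), (2, 0, 3, 1))` is about -1/3: IoU is zero, but the enclosing-box term still varies with the gap, which is what supplies a gradient.
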
Non-Maximum Suppression: Per-class procedure: sort detections by score; keep the top one; suppress all with IoU > $τ$ ; repeat. $τ \approx 0.5$ typical. Fails in dense crowds where legitimate overlapping objects get suppressed.
Soft-NMS: Variant that DECAYS suppressed scores instead of zeroing them. Linear: $s_{i} \leftarrow s_{i} \cdot (1 - IoU)$ . Gaussian: $s_{i} \leftarrow s_{i} \cdot exp (- IoU^{2} / σ)$ . Lets nearby legitimate objects coexist.
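
Greedy NMS and the Gaussian form of Soft-NMS can be sketched side by side for a single class. The code assumes boxes are `(x1, y1, x2, y2)` tuples; the default `sigma` and `score_thresh` values are illustrative choices, not values from the course material.

```python
import math

def iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of removing boxes."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break
        keep.append(best)
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
    return keep, scores
```

With two heavily overlapping boxes and one distant box, `nms` drops the lower-scored overlapping box, while `soft_nms_gaussian` keeps it with a reduced score.
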
mAP: Mean of per-class Average Precision (area under PR curve). VOC (2007 protocol) uses 11-point interpolation at IoU = 0.5. COCO averages AP over the IoU thresholds $\{0.5, 0.55, \dots, 0.95\}$ (101-point interpolation per threshold) and over classes. COCO numbers are smaller because the metric is stricter.
Smooth-L1 loss: Quadratic near zero (smooth gradient at the minimum), linear elsewhere (robust to outliers). Used for box regression in Fast/Faster R-CNN: $ℓ (z) = 0.5 z^{2}$ if $∣ z ∣ < 1$ else $∣ z ∣ - 0.5$ .
OverFeat: Sermanet et al., ICLR 2014 (ILSVRC 2013 detection winner). Sliding-window classification + box regression, made practical by converting FC layers into $1 \times 1$ convs so the whole image is processed in one forward pass.

Unit 6 — Dense Prediction: Segmentation + Depth

Dense Prediction — Segmentation & Monocular Depth

Semantic / Instance / Panoptic segmentation: Per-pixel class label / class + instance ID for things only / class + instance ID for everything (things and stuff). Panoptic is the most complete.
Things vs Stuff: Things: countable objects with distinct instances (person, car, animal). Stuff: amorphous, uncountable regions identified by texture/material (sky, road, water). Panoptic mixes both correctly.
FCN: Fully Convolutional Network (Long et al., CVPR 2015). Replace the classifier's FC with $1 \times 1$ conv → per-pixel class map. Encoder downsamples; decoder upsamples; arbitrary input size.
Transposed convolution: Learnable upsampling: input pixel scales the filter at the output position; overlapping contributions sum. Equivalent matrix view: $y = W^{⊤} x$ . NOT 'deconvolution'.
U-Net: Symmetric encoder-decoder with CONCAT skip connections at every resolution. Decoder gets deep semantic + shallow local features. Originally MICCAI 2015 medical imaging; now ubiquitous (Stable Diffusion's denoiser is a U-Net).
Dilated/Atrous convolution: Conv with gaps of size $(rate - 1)$ between kernel taps; multiplicatively expands receptive field without parameter growth or resolution loss. DeepLab's foundation. Risk: gridding artifacts at high rates.
ASPP (Atrous Spatial Pyramid Pooling): Parallel branches of atrous convs at multiple rates (e.g., 6, 12, 18) for multi-scale context. Used in DeepLab v3.
Mask R-CNN: Faster R-CNN + a third FCN head producing $28 \times 28$ binary mask per RoI per class. ICCV 2017, 41 k+ citations. Loss = $L_{cls} + L_{box} + L_{mask}$ .
RoI Align: Bilinear interpolation at exact float coordinates with 4 sample points per bin — NO rounding. Replaces RoI Pool's quantisation; critical for pixel-precise masks.
PointRend: Adaptive boundary refinement (Kirillov et al., CVPR 2020). Coarse mask → identify uncertain pixels (prob ≈ 0.5) → point-MLP refinement on high-res features → iterate. Sharper boundaries at modest extra cost.
Dice coefficient: $Dice (A, B) = 2∣ A \cap B ∣/ (∣ A ∣ + ∣ B ∣)$ . Same ranking as IoU. Denominator is SUM, not union. Dice slightly favours small objects.
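
Dice and IoU give the same ranking because each is a monotone function of the other: $Dice = 2 \, IoU / (1 + IoU)$ . The sketch below checks this on masks represented as sets of pixel indices, a representation chosen here only for brevity.

```python
def dice(a, b):
    """Dice coefficient of two masks given as sets of pixel indices."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def mask_iou(a, b):
    """Intersection over union of the same two masks."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

pred = {1, 2, 3, 4}
gt = {3, 4, 5}
d = dice(pred, gt)        # 2 * 2 / (4 + 3)
j = mask_iou(pred, gt)    # 2 / 5
```

Here `d` equals `2 * j / (1 + j)`, so sorting predictions by either score gives the same order.
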
mIoU: Mean IoU across classes. The standard segmentation metric — robust to class imbalance, unlike pixel accuracy.
MiDaS: Relative monocular depth (Ranftl et al., 2019+). Scale-and-shift-invariant L1 loss lets training combine heterogeneous depth sources (stereo, SfM, synthetic, web video).
ZoeDepth: MiDaS-style relative depth pretraining + metric depth fine-tuning on KITTI/NYU. Zero-shot transfer to metric depth (Bhat et al., 2023).
Focal Loss: $FL (p_{t}) = - α (1 - p_{t})^{γ} \log p_{t}$ with $γ = 2, α = 0.25$ . Modulating factor crushes easily classified examples; rebalances dense detection / segmentation. RetinaNet, ICCV 2017.

Unit 7 — Pose Estimation

Pose Estimation — Heatmaps, CPM, OpenPose, SMPL

Skeleton / keypoint representation: Fixed list of K body keypoints, each $(x, y)$ or $(x, y, c)$ . COCO uses 17. Simplest and most common pose representation.
DensePose: Per-pixel mapping from image to canonical 2D body surface — UV-mapping a 3D body to image pixels. Dense correspondence.
SMPL: Skinned Multi-Person Linear Model (Loper et al., 2015). Template + $β \in \mathbb{R}^{10}$ shape PCA + $θ \in \mathbb{R}^{72}$ pose (24 joints × 3 axis-angle) → 6,890-vertex 3D mesh via linear blend skinning.
Human Mesh Recovery (HMR): Regress SMPL parameters $(β, θ, camera)$ from a monocular image (Kanazawa et al., CVPR 2018). Trained with 2D reprojection loss to GT keypoints + optional 3D supervision + adversarial pose-plausibility loss.
Heatmap regression: Predict a 2D Gaussian belief map per joint instead of regressing $(x, y)$ . Per-pixel MSE against a Gaussian-centred target. argmax (or softargmax) at inference.
Convolutional Pose Machine (CPM): Multi-stage pose architecture (Wei et al., CVPR 2016). Each stage refines belief maps using image features + previous stage's belief maps. Intermediate supervision at every stage combats vanishing gradients.
Part Affinity Field (PAF): 2D vector field per limb type. At each pixel along a limb, stores a unit vector along the limb direction; zero elsewhere. Used to group keypoints via line-integral scoring.
PCK / PCKh: Percentage of Correct Keypoints. A keypoint is correct if its distance to GT is below a threshold normalised by body size. PCKh@0.5 uses 0.5 × head bone length; PCK@0.2 uses 0.2 × torso diameter.
Bottom-up pose: Detect all keypoints in the image first (one pass), then group into individuals via association cues (PAFs + Hungarian matching). OpenPose paradigm. Runtime $\sim$ constant in number of people.
Top-down pose: Detect person bounding boxes first, then run single-person pose estimation per box. Mask R-CNN keypoints / AlphaPose. Higher per-person accuracy; runtime $\propto P$ ; brittle on detection misses.
Sapiens: Meta's ECCV 2024 foundation model for humans. ViT backbone pretrained on 300M+ human-centric images; lightweight task heads for pose, segmentation, depth, normals. The 'foundation model + task heads' paradigm applied to human understanding.

Unit 8 — 3D Data (PointNet, DGCNN, MeshCNN)

3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN

Symmetric function: A function $f$ satisfying $f (x_{1}, \dots, x_{N}) = f (x_{π (1)}, \dots, x_{π (N)})$ for every permutation $π$ . Permutation-invariant by definition.
Voxelization / occupancy grid: Discretising a point cloud onto a regular 3D grid; each voxel marks occupied/empty (or stores density). Enables 3D convolution but at cubic memory cost.
PointNet: Shared per-point MLP $h$ + symmetric max-pool over points + final MLP $γ$ . Universal approximator for continuous symmetric set functions (Qi et al., 2017).
Critical points: The subset of input points whose per-point features actually survive PointNet's max-pool. They determine the global feature; perturbing other points has no effect.
T-Net: A mini-PointNet inside PointNet that predicts a learned alignment matrix ( $3 \times 3$ for input, $64 \times 64$ for features) and applies it before subsequent layers.
PointNet++: Hierarchical PointNet — sample anchors via Farthest Point Sampling, group neighbours via ball query, apply PointNet locally, then stack. Adds local context that vanilla PointNet lacks.
Farthest Point Sampling (FPS): Greedy subsampling: pick first point arbitrarily, then iteratively add the point with maximum minimum-distance to the chosen set. Yields evenly-spaced anchors.
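
The greedy rule of Farthest Point Sampling can be sketched as follows, assuming points are coordinate tuples and that at most `len(points)` samples are requested. The function name and the `start` argument are illustrative.

```python
def farthest_point_sampling(points, m, start=0):
    """Return indices of m points chosen by greedy farthest-point sampling."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # min_d[i] = squared distance from point i to its nearest chosen point
    min_d = [sqdist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = sqdist(p, points[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return chosen
```

Keeping the running minimum distance per point makes each iteration linear in the number of points, so selecting `m` anchors costs $O (N m)$ .
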
EdgeConv (DGCNN): Per-edge feature $h (x_{i}, x_{j} - x_{i})$ over a kNN graph, max-aggregated. The graph is dynamic — rebuilt in feature space at each layer.
Dynamic graph: The kNN edges of DGCNN, recomputed in the current feature space at every layer (early layers: geometric neighbours; late layers: semantic neighbours).
MeshCNN: Edge-centric mesh operator with a 5-D intrinsic edge feature and a 4-edge fixed neighbourhood; conv uses symmetric input combinations; pooling = task-driven edge collapse.
Dihedral angle: The angle between the two triangular faces meeting at a mesh edge. One of MeshCNN's five intrinsic features.

Unit 9 — NeRF & 3D Gaussian Splatting

NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering

Novel view synthesis: Given $N$ photos of a scene, render the scene from a new camera pose not in the original set.
Explicit vs implicit 3D representation: Explicit = enumerate primitives directly (points, mesh, voxels, Gaussians). Implicit = encode as a function (SDF, NeRF MLP). 3DGS is explicit-but-fuzzy: explicit primitives that are 3D Gaussians, not hard points.
Signed Distance Function (SDF): Implicit representation $f (x, y, z)$ returning the signed distance to the nearest surface; surface = zero-level set.
NeRF: Neural Radiance Field: MLP $F_{Θ} (x, y, z, θ, ϕ) \to (RGB, σ)$ with weights $Θ$ , 3D position $(x, y, z)$ and viewing direction $(θ, ϕ)$ ; render by volumetric integration along camera rays.
COLMAP: Open-source Structure-from-Motion + multi-view-stereo pipeline. For 3DGS: provides camera intrinsics, extrinsics (poses), and a sparse point cloud for initialisation. Pose estimation in 3DGS is classical, not learned.
Spherical harmonics (SH): Orthonormal angular basis on the unit sphere. Degrees $0$ to $L$ together have $(L + 1)^{2}$ functions. 3DGS uses $L = 3$ → 16 coefficients per channel → 48 for RGB. Captures view-dependent appearance (specular highlights).
Adaptive Density Control (ADC): Three operations during optimisation: Clone (small Gaussian, high pos gradient), Split (large Gaussian, high pos gradient → two smaller, scale / 1.6), Prune (low-opacity Gaussian).
Differentiable rasterisation: Tile-based GPU rasteriser whose projection and compositing steps are differentiable (the depth sort only fixes the blending order), so pixel gradients flow back to $μ, Σ, α, SH$ .
Transmittance: $T_{i} = \prod_{j = 1}^{i - 1} (1 - α_{j})$ — probability that light has passed through everything in front of Gaussian $i$ without being absorbed.
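
The transmittance product is what weights each primitive in front-to-back compositing. The sketch below assumes the samples along one pixel's ray are already depth-sorted, nearest first; the function name and return layout are illustrative.

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing of depth-sorted samples.

    colors: list of (r, g, b) tuples; alphas: list of opacities in [0, 1].
    Returns the pixel colour, the per-sample weights T_i * alpha_i,
    and the transmittance left over behind the last sample.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_1 = 1: nothing lies in front of the first sample
    weights = []
    for color, alpha in zip(colors, alphas):
        w = transmittance * alpha
        weights.append(w)
        for ch in range(3):
            pixel[ch] += w * color[ch]
        transmittance *= (1.0 - alpha)  # becomes T_{i+1}
    return tuple(pixel), weights, transmittance
```

With a half-opaque red sample in front of a half-opaque green one, the weights are 0.5 and 0.25, and a quarter of the light passes through both.
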
PSNR: Peak Signal-to-Noise Ratio = $10 \log_{10} (R^{2} / MSE)$ where $R = 255$ for 8-bit. Higher better; unbounded; ~30 dB good for 8-bit.
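
A direct transcription of the PSNR formula, assuming images are passed as flat sequences of pixel values of equal length. The function name and the choice to return infinity for identical images are illustrative.

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((x - y) ** 2 for x, y in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

A uniform error of 10 grey levels gives an MSE of 100 and a PSNR of about 28.1 dB, a little below the ~30 dB rule of thumb.
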
SSIM: Structural Similarity Index, $[- 1, 1]$ , higher better. Sliding-kernel comparison capturing luminance, contrast, structure.
LPIPS: Learned Perceptual Image Patch Similarity. Distance in the feature space of a pretrained AlexNet/VGG; lower better; captures human perception.

Unit 10 — Attention & Transformers

Attention Mechanism & Transformer Architecture

Seq2Seq bottleneck: Pre-attention encoder-decoder RNNs compressed the entire source into a single fixed hidden vector; performance collapsed on long inputs.
Bahdanau attention: Per-decoder-step weighted sum over encoder hidden states; weights computed by an additive MLP score; learns alignment as a byproduct.
Q / K / V: Query / Key / Value — three learned projections of the input. Self-attention: all from same sequence. Cross-attention: Q from decoder, K, V from encoder.
Scaled dot-product attention: $\mathrm{softmax} (Q K^{⊤} / \sqrt{d_{k}}) V$ . Dividing by $\sqrt{d_{k}}$ keeps softmax in its useful (non-saturated) regime.
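
A dependency-free sketch of scaled dot-product attention, with the scores divided by $\sqrt{d_{k}}$ as in Vaswani et al. It assumes $Q$ , $K$ and $V$ are lists of row vectors with no batching or masking; the function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out
```

A zero query scores every key equally, so the output is the plain average of the value rows; a query aligned with one key pulls the output towards that key's value.
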
Multi-head attention: $h$ parallel attention heads, each with $d_{k} = d / h$ ; concatenate outputs and project with $W_{O}$ . Same total params as single-head; heads can specialise.
Causal mask (look-ahead mask): Mask with $- \infty$ strictly above the diagonal; added to attention scores pre-softmax; prevents the decoder from attending to future tokens.
Cross-attention: Q from decoder's current state, K, V from encoder's output. Equivalent to Bahdanau attention in Q-K-V form.
Sinusoidal positional encoding: Vaswani's $sin / cos$ at exponentially decreasing frequencies; allows linear expression of relative position; generalises to unseen lengths.
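
For reference, the encoding of a single position can be sketched as below, using the base of 10000 from Vaswani et al. and assuming an even model dimension. The function name is illustrative.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding of one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(0, d_model, 2):
        # Frequency falls geometrically with the dimension index
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe
```

Each sin/cos pair at a given frequency rotates by a fixed angle when the position shifts by a fixed offset, which is why relative position can be expressed linearly.
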
Pre-Norm vs Post-Norm: Post-Norm (Vaswani 2017): $LN (x + sublayer (x))$ ; needs warmup. Pre-Norm (modern): $x + sublayer (LN (x))$ ; stable for deep stacks.
Teacher forcing / student forcing: Training the decoder with ground-truth previous tokens (teacher) vs predicted previous tokens (student). Inference is always student-forcing.
Beam search: Maintain top- $B$ partial sequences at each step; expand and score; keep top- $B$ again. Trades compute for quality; $B = 4$ to $5$ typical.
Soft vs hard attention: Soft: continuous weighted average, differentiable. Hard: discrete sampling of one position, requires REINFORCE.

Unit 11 — Vision Transformers (ViT)

ViT Pipeline, Scaling, and Swin

Patch embedding: Linear projection $E \in \mathbb{R}^{P^{2} C \times D}$ of flattened $P \times P \times C$ patches to $D$ -dim token embeddings; equivalently a Conv2d with kernel $P$ and stride $P$ .
[CLS] token: Learnable summary token prepended to the patch sequence; its final hidden state is passed through a linear head for classification. Alternative: global-average-pool patch tokens (~equivalent accuracy).
Inductive bias: Architectural priors. CNNs have locality + translation equivariance baked in. ViTs have neither — they must learn them from data. Small data: bias helps. Large data: bias limits.
Pre-Norm: LayerNorm placed *before* each sublayer (MSA or MLP), with the residual added *after*. $z' = MSA (LN (z)) + z$ . Stable for deep stacks; used by ViT.
ViT-B/16: Vision Transformer Base with $16 \times 16$ patches: $L = 12$ , $D = 768$ , MLP $= 3072$ , $H = 12$ , ~86M params.
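
The ~86M figure can be reproduced by adding up the layers. The sketch assumes a 224×224 RGB input, a 1000-class linear head, biases on every linear layer, and a learned position embedding that includes the [CLS] slot; none of these are stated in the entry above, and the function name is illustrative.

```python
def vit_param_count(image=224, patch=16, channels=3, depth=12,
                    dim=768, mlp_dim=3072, num_classes=1000):
    """Parameter count of a plain ViT under the assumptions stated above."""
    n_patches = (image // patch) ** 2                    # 14 * 14 = 196
    patch_embed = patch * patch * channels * dim + dim   # linear projection + bias
    cls_token = dim
    pos_embed = (n_patches + 1) * dim                    # patches + [CLS]
    attn = 4 * (dim * dim + dim)                         # W_Q, W_K, W_V, W_O with biases
    mlp = dim * mlp_dim + mlp_dim + mlp_dim * dim + dim  # two linear layers
    norms = 2 * 2 * dim                                  # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + final_norm + head

total = vit_param_count()
```

Almost all of the total sits in the 12 encoder blocks (about 7.1M each); the patch embedding, position embedding and head together add under 2M.
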
Attention distance: Average spatial distance over which a head attends. CNNs: small in early layers, grows with depth. ViTs: spans both small and large distances even in layer 1.
CKA (Centered Kernel Alignment): Similarity metric for comparing representations across layers/models. Used by Raghu et al. to show ViT and CNN representations differ qualitatively.
JFT-300M: Google's internal 300M-image dataset; the pretraining dataset on which ViT overtakes ResNet by a wide margin.
Swin Transformer: Shifted-Window Transformer (Liu et al., ICCV 2021). Window self-attention ( $O (N)$ per layer) + shifted windows in alternate blocks (cross-window communication) + hierarchical patch merging (4-stage pyramid).
W-MSA / SW-MSA: Window multi-head self-attention with aligned windows / with shifted windows. Swin alternates these between blocks.
Patch merging (Swin): $2 \times 2$ group of patches concatenated and projected $4 C \to 2 C$ ; halves spatial dims and doubles channels — like strided conv in CNNs.
PE interpolation: Reshaping the learned position embeddings to their 2D patch grid and interpolating them (e.g., bilinearly) to the new grid size when fine-tuning at higher resolution; no new parameters.

Unit 12 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP

Self-supervised learning (SSL): Supervision derived from structure within the data itself; no human labels. Four families: old-school pretext, contrastive, language-image contrastive, generative.
Positive / negative pair: Positive: two augmented views of the same image. Negative: views from different images. Contrastive methods pull positives together and push negatives apart.
InfoNCE / NT-Xent: Contrastive loss — softmax cross-entropy with a positive logit and many negative logits, scaled by temperature $τ$ . SimCLR's specific form is NT-Xent.
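
For one anchor, the InfoNCE loss is a softmax cross-entropy whose correct class is the positive. The sketch below assumes the similarities (for example cosine similarities) are already computed; the function name and the default temperature of 0.1 are illustrative.

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.1):
    """InfoNCE loss for one anchor given its positive and negative similarities."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)
    # log-sum-exp over all logits, computed stably
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]  # equals -log softmax(logits)[0]
```

When the positive is no more similar than the $N$ negatives, the loss equals $\log (N + 1)$ ; it falls towards zero as the positive similarity pulls ahead.
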
Projection head $g_\phi$: Small MLP between the encoder $f$ and the contrastive loss. Discarded for downstream tasks. Lets $f$ preserve broad features while $g$ enforces invariances.
Momentum encoder: Slowly-updated copy of the online encoder via EMA $ξ \leftarrow m ξ + (1 - m) θ$ , $m \approx 0.999$ . MoCo's key encoder; BYOL/DINO's target encoder.
Memory queue (MoCo): FIFO buffer of $K \approx 65$k past key embeddings. Provides many negatives without large batch size. Updated each step by enqueuing the current batch's keys and dequeuing the oldest.
Predictor (BYOL): Extra MLP on the online branch only — asymmetry that prevents collapse. Online = $f_{θ} \to g_{θ} \to q_{θ}$ ; target = $f_{ξ} \to g_{ξ}$ (no predictor).
Stop-gradient: Operator that blocks gradients during backprop. BYOL/DINO/JEPA all stop-grad on the target branch — the target is updated via EMA, not gradients.
Sinkhorn-Knopp algorithm (SwAV): Iterative row/column normalisation that produces an equipartitioned soft cluster assignment. Prevents collapse to one prototype.
WIT (WebImageText): CLIP's 400M-pair dataset; built from 500k queries with up to 20k (image, text) pairs per query; scraped from the public internet.
Zero-shot classification (CLIP): Embed class names as text prompts, embed the image, take the argmax cosine similarity. No labelled examples of target classes are seen.
DeViSE: Deep Visual-Semantic Embedding (Frome et al., NeurIPS 2013) — the 2013 precursor of CLIP; introduced image-text joint embedding 8 years earlier, at smaller scale.
Winoground: CVPR 2022 benchmark probing CLIP's compositional reasoning; pairs differ only in word order with matched swapped images. CLIP performs near chance.

Unit 13 — SSL: DINO, MAE, JEPA

DINO, MAE, JEPA — Modern SSL Beyond Contrastive

Self-distillation: Student trained to match teacher's output distribution; teacher and student share architecture; teacher updated via EMA of student. Cross-entropy loss. No negatives.
EMA teacher update: $θ_{t} \leftarrow λ θ_{t} + (1 - λ) θ_{s}$ with cosine schedule $λ : 0.996 \to 1$ . Teacher is a slowly-updated, smoothed version of the student.
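
The update and its cosine momentum schedule can be sketched as below. Parameters are treated as flat lists of floats for simplicity; the function names and the exact form of the cosine ramp are assumptions made for illustration.

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine ramp of the EMA coefficient from `base` to `final`."""
    return final - (final - base) * (math.cos(math.pi * step / total_steps) + 1) / 2

def ema_update(teacher, student, lam):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, elementwise."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]
```

As the coefficient approaches 1 the teacher moves less and less, so late in training it is effectively frozen.
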
Centering (DINO): Subtract a running-mean bias from teacher logits before softmax. Prevents collapse to a single-dimension-dominated output. "Bias term added to logits."
Sharpening (DINO): Apply a very low temperature ( $τ_{t} \approx 0.04$ ) to teacher logits before softmax. Produces a peaky, confident target. Prevents collapse to uniform output.
Multi-crop: DINO's augmentation strategy: 2 global views (>50% area, 224 px) fed to both teacher and student + 6–10 local views (<50%, 96 px) fed to student only. Forces local-to-global consistency.
$[\text{CLS}]$ attention as emergent segmentation: In a DINO-pretrained ViT, the attention of $[\text{CLS}]$ over patches concentrates on the salient object, producing usable segmentation-like maps with zero segmentation supervision.
Registers (DINOv2): Extra learnable tokens prepended to the sequence with no positional embedding; absorb global scratchpad activity so real patch tokens have cleaner attention maps.
Masked Autoencoder (MAE): Patchify image → randomly mask 75% → deep encoder on visible 25% only → light decoder reconstructs masked-patch pixels via MSE. Asymmetric architecture; encoder kept, decoder discarded.
Mask ratio (MAE): Fraction of patches masked. 75% in MAE vs 15% in BERT: images have more spatial redundancy, so a higher ratio is needed to block the shortcut of copying texture from neighbouring patches.
JEPA (Joint-Embedding Predictive Architecture): LeCun's program: predict target *representations* (from a separate EMA-updated target encoder) rather than pixels. Context encoder + target encoder + predictor; L2 in feature space with stop-gradient on target.
I-JEPA / V-JEPA / VL-JEPA: Image JEPA (Assran et al., CVPR 2023) / Video JEPA (2024) / Vision-Language JEPA (2025). Same recipe, different modalities.
Stop-gradient: Operator that blocks gradients during backprop. DINO/JEPA/BYOL all stop-grad the target branch — the target is updated via EMA, not gradients.
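
The EMA teacher update above, with its cosine schedule for $λ$, can be sketched as follows. This is a minimal illustration under stated assumptions: parameters are stored as a plain dict of floats, and the helper names `ema_momentum` and `ema_update` are hypothetical, not taken from any DINO or BYOL codebase.

```python
import math

def ema_momentum(step, total_steps, base=0.996, final=1.0):
    """Cosine schedule that moves lambda from `base` (step 0) to `final`."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final - (final - base) * cos_term

def ema_update(teacher, student, lam):
    """In place: theta_t <- lam * theta_t + (1 - lam) * theta_s.
    No gradient is involved; the teacher only ever moves through this rule."""
    for name, theta_s in student.items():
        teacher[name] = lam * teacher[name] + (1.0 - lam) * theta_s

# Toy usage with one scalar "parameter".
teacher = {"w": 1.0}
student = {"w": 0.0}
ema_update(teacher, student, lam=0.9)   # teacher["w"] moves 10% toward the student
lam_start = ema_momentum(0, 100)
lam_end = ema_momentum(100, 100)
```

As $λ$ approaches 1 late in training, the teacher changes less and less per step, which is why it behaves like a smoothed average of past students.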

Unit 14 — Transformer Advances (ViT-5 era)

Modern Transformer Upgrades

Residual stream: The unbroken identity path in a Pre-Norm Transformer; gradients flow directly through it from output back to input.
Pre-Norm: $x_{\text{out}} = x + \text{Sublayer}(\text{LN}(x))$ . LayerNorm placed before the sublayer; residual stream is never normalised. Modern default.
Post-Norm: $x_{\text{out}} = \text{LN}(x + \text{Sublayer}(x))$ . 2017 original; needs careful warmup at depth.
RMSNorm: Drop-the-mean variant of LayerNorm: $y = γ \cdot x / \sqrt{\frac{1}{D} \sum_{i} x_{i}^{2}}$ . No mean subtraction, no bias $β$ . Cheaper and, in practice, about as effective.
LayerScale: Per-channel learnable diagonal $λ$ on each sublayer's residual contribution, initialised $\sim 1 0^{- 4}$ . Makes deep nets train as near-identity at init.
QK-Norm: Apply LayerNorm (or RMSNorm) to $Q$ and $K$ separately before the attention dot product. Prevents softmax saturation at long sequences / large head dims.
Registers: Extra learnable tokens prepended to the input with no positional encoding; act as global scratchpad. Without them, the model repurposes low-information patch tokens as scratch space, which shows up as artifacts in attention maps. Discarded at output.
Flash Attention: Exact, IO-aware attention algorithm. Tiles $Q, K, V$ into SRAM-sized blocks; computes streaming online softmax; never materialises the $N \times N$ matrix in HBM. Memory $O (N^{2}) \to O (N)$ . 2–4× faster.
Online softmax: Stream the softmax: keep running max $m$ , denominator $ℓ$ , output $O$ ; rescale $O, ℓ$ when $m$ grows. Mathematically equivalent to standard softmax, computable tile-by-tile.
RoPE (Rotary Position Embedding): Rotate pairs of dimensions of $Q$ and $K$ by angle $m θ_{i}$ proportional to position. Attention dot product depends only on relative position $m - n$ .
KV-cache: Cache of $K, V$ for past tokens during autoregressive generation. Per-step compute $O (t^{2}) \to O (t)$ ; total generation $O (N^{3}) \to O (N^{2})$ . Memory grows linearly.
MHA / GQA / MQA: Multi-Head Attention: $H$ separate $K, V$ heads. Grouped-Query: $G < H$ shared $K, V$ groups. Multi-Query: $G = 1$ (one $K, V$ for all heads). KV cache shrinks by a factor of $H / G$ relative to MHA.
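
The online softmax entry above claims that the streamed computation matches standard softmax exactly. A minimal sketch for a single query with scalar values follows; the scalar simplification, the function names, and the tile size are assumptions made to keep the example short, and real Flash Attention applies the same rescaling to vector-valued outputs on-chip.

```python
import math

def reference_attention(scores, values):
    """Standard softmax-weighted sum, computed with the full score list."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return sum(pi * vi for pi, vi in zip(p, values)) / sum(p)

def online_attention(scores, values, tile=2):
    """Same quantity, but visiting the scores one tile at a time."""
    m = float("-inf")   # running max of the scores seen so far
    l = 0.0             # running denominator, expressed relative to m
    o = 0.0             # running un-normalised output, relative to m
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)   # rescales old statistics; 0.0 on the first tile
        p = [math.exp(s - m_new) for s in s_tile]
        l = l * scale + sum(p)
        o = o * scale + sum(pi * vi for pi, vi in zip(p, v_tile))
        m = m_new
    return o / l

scores = [1.0, 3.0, -2.0, 5.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
streamed = online_attention(scores, values, tile=2)
full = reference_attention(scores, values)
```

Only `m`, `l`, and `o` persist between tiles, so nothing proportional to the full score row has to be stored, which is the property that lets the $N \times N$ matrix stay out of HBM.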

Unit 15 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

VLM Architecture — Encoders, Connectors, Positional Encoding

Modality gap: Text is discrete (vocab → lookup); images are continuous (pixels → encoder). A VLM aligns them into a shared latent space.
Three-pillar blueprint: Vision Encoder → Connector → LLM Backbone. Visual + text tokens concatenated, vanilla self-attention handles cross-modal reasoning.
Connector / adapter: Small (linear or MLP) projection from vision-encoder output dim to LLM token dim. In stitched VLMs, the only randomly initialised component.
SigLIP: Sigmoid CLIP. Pairwise binary cross-entropy with logits $z_{ij} = τ f_{I} (x_{i})^{⊤} f_{T} (t_{j}) + b$ . Scales without batch-wide softmax sync.
Prefix-LM mask: Bidirectional attention over image + prompt; causal over answer. Loss computed only on answer tokens.
Location tokens: 1024 extended-vocab tokens $⟨ loc0000 ⟩ \dots ⟨ loc1023 ⟩$ encoding normalised bounding-box coordinates. Detection output is pure text.
Segmentation codewords: 128 extended-vocab tokens $⟨ seg000 ⟩ \dots ⟨ seg127 ⟩$ from a learned VQ-VAE codebook. Segmentation mask = bbox + codeword sequence decoded to pixels.
Dynamic resolution (Qwen2-VL): Process at native aspect ratio + resolution; visual-token count clamped by $N_{min} \leq N \leq N_{max}$ . Fixes resolution bottleneck + aspect distortion.
2D-RoPE: Split head dim in halves; rotate first half by row, second half by column. Attention depends on 2D relative displacement.
M-RoPE: Multimodal RoPE (Qwen2-VL): head dim split in thirds for $(t, r, c)$ rotations. Static images: $m_{t} = 0$ . Text tokens: all three = token index.
Native multimodal (Gemma 4): Vision and language share Transformer blocks from early layers; no connector. Modalities meet inside the Transformer with shared weights.
VLA (Vision-Language-Action): VLM + action de-tokeniser. Continuous actions discretised into 256 bins per dimension; treated as just another token type.
OpenVLA: DINOv2 + SigLIP image encoders + MLP connector + Llama 2 7B + action de-tokeniser. Trained on 970k robot trajectories. ~7B params.
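
The SigLIP entry above defines pairwise logits $z_{ij}$ and a binary cross-entropy over every image-text pair. A minimal sketch of that loss follows, assuming unit-norm embeddings given as plain lists, labels of $+1$ on the diagonal and $-1$ elsewhere, and illustrative values for the scale `tau` and bias `b`. The function names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def siglip_loss(img, txt, tau=10.0, b=-10.0):
    """img, txt: lists of N unit-norm embedding vectors; pair i matches pair i.
    Each (i, j) pair is an independent binary classification problem."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = tau * sum(a * c for a, c in zip(img[i], txt[j])) + b
            y = 1.0 if i == j else -1.0      # +1 for the matching pair, -1 otherwise
            total += softplus(-y * z)        # equals -log sigmoid(y * z)
    return total / n

img = [[1.0, 0.0], [0.0, 1.0]]
aligned = siglip_loss(img, [[1.0, 0.0], [0.0, 1.0]])   # captions match their images
swapped = siglip_loss(img, [[0.0, 1.0], [1.0, 0.0]])   # captions exchanged
```

Because each term depends on one pair only, no normalisation across the whole batch is needed, which is the reason the entry notes that the loss scales without a batch-wide softmax.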

Unit 16 — Video Understanding

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer

Action classification: One label per trimmed clip, e.g. "is this dancing?" Datasets: UCF101, HMDB51, Kinetics.
Temporal action localisation: Return start/end intervals of actions in an untrimmed video. "Dance: 12.4s–18.7s".
Spatio-temporal action localisation: Bounding box AND time interval per action. AVA is canonical.
Kinetics: Carreira & Zisserman, CVPR 2017. 400/600/700 action classes. The ImageNet for videos — pretraining target for nearly every modern video model.
AVA: Atomic Visual Actions (Gu et al., CVPR 2018). Spatio-temporally localised: box + time interval + label across pose, person-object, person-person categories.
Optical flow: Per-pixel 2D vector $(u, v)$ describing motion between consecutive frames. Explicit motion signal that RGB-only models miss.
3D convolution (Conv3D): Convolution with a $K_{T} \times K_{H} \times K_{W}$ kernel sliding over a $T \times H \times W$ video volume. Captures spatio-temporal neighbours.
I3D (Inflated 3D ConvNet): Carreira & Zisserman, CVPR 2017. Take a 2D ImageNet CNN, inflate each $K \times K$ filter to $K_{T} \times K \times K$ by replicating along time and dividing by $K_{T}$ . Fine-tune on Kinetics.
Two-Stream networks: Simonyan & Zisserman, NeurIPS 2014. Parallel spatial (RGB) + temporal (optical flow, $2 L$ channels) CNNs, late fusion.
LRCN: Long-term Recurrent Convolutional Networks (Donahue et al., CVPR 2015). Per-frame CNN encoder → LSTM. Good for variable-length outputs (captions).
SlowFast: Feichtenhofer et al., CVPR 2019. Slow pathway (low fps, high channels — semantics) + Fast pathway (high fps, low channels — motion) with lateral fusion.
ViViT: Video Vision Transformer (Arnab et al., ICCV 2021). Two token-extraction strategies: per-frame patches or 3D *tubelets* ( $t \times h \times w$ ). Explores attention factorisations.
Tubelet embedding: ViViT's strategy of linearly projecting 3D spatio-temporal cubes (e.g., $2 \times 16 \times 16$ ) into tokens — encodes spatio-temporal info from the start.
Divided space-time attention: TimeSformer's winning factorisation: each block does temporal attention (per spatial location across time) then spatial attention (per frame). Each token attends to roughly $N + F$ others instead of $N \cdot F$ ( $N$ patches per frame, $F$ frames).
Dense Event Captioning: Krishna et al., ICCV 2017. Given a long video, output a list of events with time intervals AND captions; multiple overlapping events allowed.
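
The I3D entry above inflates a 2D filter by replicating it along time and dividing by $K_{T}$. One way to see why the division matters: on a static clip (the same frame repeated), the inflated filter gives the same response as the original 2D filter, so ImageNet-pretrained behaviour is preserved at initialisation. The sketch below checks this at a single spatial location; the helper names, the filter, and the frame values are illustrative assumptions.

```python
def inflate_2d_filter(kernel_2d, k_t):
    """Repeat a K x K filter k_t times along time and divide each copy by k_t."""
    return [[[w / k_t for w in row] for row in kernel_2d] for _ in range(k_t)]

def response_2d(patch, kernel_2d):
    """Filter response at one location: elementwise product, then sum."""
    return sum(p * w
               for p_row, w_row in zip(patch, kernel_2d)
               for p, w in zip(p_row, w_row))

def response_3d(clip, kernel_3d):
    """3D response at one location: sum of per-frame 2D responses."""
    return sum(response_2d(frame, k_slice)
               for frame, k_slice in zip(clip, kernel_3d))

kernel_2d = [[1.0, 0.0, -1.0],
             [2.0, 0.0, -2.0],
             [1.0, 0.0, -1.0]]      # a Sobel-like filter, for illustration only
frame = [[3.0, 1.0, 4.0],
         [1.0, 5.0, 9.0],
         [2.0, 6.0, 5.0]]
k_t = 3
static_clip = [frame] * k_t         # a "video" in which nothing moves
kernel_3d = inflate_2d_filter(kernel_2d, k_t)

out_2d = response_2d(frame, kernel_2d)
out_3d = response_3d(static_clip, kernel_3d)
```

Without the division by $K_{T}$, `out_3d` would be $K_{T}$ times larger than `out_2d`, and the activations of the inflated network would no longer match those of the pretrained 2D network.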