Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Cheatsheet
Ultra-condensed. Revise a chapter in minutes.
Unit 1 — Object Detection
Object Detection — R-CNN family, YOLO, NMS, mAP
One-liners
- R-CNN slow → Fast (RoI Pool) → Faster (RPN) → YOLO (single shot).
- RPN: 9 anchors per location (3 scales × 3 ratios).
- YOLO loss: 5 terms (center, size with √, obj-conf, noobj-conf, class). λ_coord = 5, λ_noobj = 0.5.
- NMS is per class. Soft-NMS multiplies by IoU-decay instead of zeroing.
Formulas
- IoU = |A∩B| / |A∪B|
- GIoU = IoU − |C\(A∪B)| / |C|
- FL = −(1 − p_t)^γ · log p_t
- YOLO output: S × S × (B·5 + C)
Definitions
- Anchor = predefined box prior; predictions are offsets.
- RPN = shared-backbone proposal network (Faster R-CNN).
- mAP = mean of per-class AP across classes.
Algorithms
- NMS: sort by score; keep top; suppress IoU > τ; repeat. PER CLASS.
- mAP: sort detections; mark TP/FP at IoU ≥ 0.5; compute cumulative P, R; AP = area under PR curve; mean over classes.
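The greedy NMS loop above can be sketched in a few lines of pure Python. A minimal sketch: boxes as (x1, y1, x2, y2) tuples and τ = 0.5 are illustrative choices, and this is one class only (run it per class).

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, tau=0.5):
    """Greedy NMS for ONE class: sort by score, keep top, drop IoU > tau."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [i for i in order if iou(boxes[top], boxes[i]) <= tau]
    return keep
```

Soft-NMS would replace the hard filter with a score decay proportional to IoU.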
Comparisons
- RoI Pool vs RoI Align: Pool quantises twice (RoI to cells, cell to sub-cell); Align uses bilinear interpolation at exact float coords. Mask-AP gap on small objects.
- Two-stage (Faster R-CNN) vs One-stage (YOLO/RetinaNet): Two-stage accurate but slower (proposals + classification); one-stage real-time, historically lower AP — fixed by Focal Loss + FPN.
Keywords
IoU · anchor · RPN · NMS · mAP · Focal Loss · GIoU · Soft-NMS · Selective Search
Unit 2 — Dense Prediction: Segmentation + Depth
Dense Prediction — Segmentation & Monocular Depth
One-liners
- Semantic = class only. Instance = things w/ ids. Panoptic = both.
- U-Net skips CONCAT, ResNet skips ADD.
- RoI Align: bilinear interp, no quantization. Critical for Mask R-CNN.
- MiDaS depth is RELATIVE (scale-shift invariant loss).
Formulas
- Dice = 2|A∩B| / (|A|+|B|) = 2·IoU / (1+IoU)
- mIoU = mean over classes of per-class IoU
- Atrous conv: expands RF by (rate)× at fixed params
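A toy check of the Dice–IoU identity. Representing masks as sets of foreground pixel coordinates is an illustrative simplification, not a real mask format:

```python
def dice_and_iou(a, b):
    """a, b: sets of foreground pixel coordinates (toy stand-in for binary masks)."""
    inter = len(a & b)
    dice = 2 * inter / (len(a) + len(b))
    iou = inter / len(a | b)
    return dice, iou

# Dice = 2*IoU / (1 + IoU) holds for any pair of masks
a = {(0, 0), (0, 1), (1, 0)}
b = {(0, 1), (1, 0), (1, 1)}
d, i = dice_and_iou(a, b)
assert abs(d - 2 * i / (1 + i)) < 1e-12
```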
Definitions
- Transposed conv = learnable upsampling.
- RoI Align = bilinear interpolation; sub-pixel accurate.
- Dilated conv = gaps between kernel taps.
Algorithms
- Mask R-CNN head: per-RoI FCN → 28×28 mask per class; BCE on correct class only.
- FCN-8s decoder: upsample deep + add pool3, pool4 skips, upsample → output.
Comparisons
- Dice loss vs Cross-entropy: Dice handles class imbalance natively; CE saturates when one class dominates.
- RoI Pool vs RoI Align: Pool quantises twice; Align bilinearly samples → sub-pixel accuracy → big mask AP gain.
Keywords
FCN · U-Net · Mask R-CNN · RoI Align · atrous · Dice · mIoU · panoptic · MiDaS
Unit 3 — Pose Estimation
Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
One-liners
- Heatmap regression > coordinate regression (preserves spatial uncertainty).
- CPM: multi-stage refinement + intermediate supervision (kills vanishing grads).
- Top-down accurate, scales O(P). Bottom-up scales O(1), grouping is hard.
- OpenPose channels = K + 2L. e.g., 18 + 38 = 56.
- SMPL: β = 10 shape, θ = 72 pose. Mesh = 6890 vertices.
Formulas
- PAF score = ∫ PAF · limb_dir du, integrated from A to B
- PCK@α: correct ⟺ dist ≤ α · d_ref
Definitions
- Heatmap regression = dense per-pixel Gaussian target.
- PAF = vector field encoding limb direction.
- SMPL = parametric body model (β shape + θ pose).
Algorithms
- OpenPose grouping: candidate keypoints from heatmap argmax → score every pair via PAF line integral → Hungarian matching per limb.
- Heatmap argmax + parabola fit: fit y = ax² + bx + c to (h_{x-1}, h_x, h_{x+1}); peak at x − b/(2a).
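The three-point parabola fit above reduces to a closed-form sub-pixel offset. A minimal sketch (function name is ours; assumes the argmax is not at the array border):

```python
def subpixel_peak(h, x):
    """Refine integer argmax x of a 1D heatmap slice h with a 3-point parabola fit.
    Fitting y = ax^2 + bx + c through (x-1, x, x+1) puts the vertex at x - b/(2a),
    which simplifies to the closed form below."""
    hl, hc, hr = h[x - 1], h[x], h[x + 1]
    denom = hl - 2 * hc + hr          # equals 2a of the fitted parabola
    if denom == 0:                    # flat neighbourhood: keep integer location
        return float(x)
    return x + (hl - hr) / (2 * denom)
```

Near a true peak the offset lands in (−0.5, 0.5); apply it independently per axis for 2D heatmaps.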
Comparisons
- Top-down pose vs Bottom-up pose: Top-down: O(P) runtime, more accurate, fails when detector misses. Bottom-up: O(1) in P, robust to misses, grouping ambiguous in crowds.
- Coordinate regression vs Heatmap regression: Coordinate: no uncertainty, loses spatial structure. Heatmap: pixel-dense, expresses ambiguity, sub-pixel via parabola fit.
Keywords
heatmap · CPM · PAF · OpenPose · Hungarian · SMPL · HMR · PCKh · top-down · bottom-up
Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)
3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
One-liners
- Point clouds: unstructured, irregular density, unordered. CNNs break on all three.
- VoxNet: occupancy grid → 3D CNN. Memory O(N³); resolution cap ~32³.
- PointNet = shared MLP + MAX pool. Symmetric ⇒ permutation invariant.
- PointNet++ = hierarchical PointNet (FPS + ball query).
- DGCNN = EdgeConv over dynamic kNN in feature space.
- MeshCNN = operate on edges; pool = edge collapse.
Formulas
- PointNet(P) = γ(max_p h(p))
- EdgeConv: e_ij = h(x_i, x_j − x_i); x_i' = max_j e_ij
- VoxNet memory: O(N³) voxels
Definitions
- Symmetric function = permutation-invariant.
- Critical points = inputs that survive max-pool.
- Dynamic graph = kNN in feature space, rebuilt per layer.
Algorithms
- FPS (farthest-point sampling): pick first point; iteratively add the point farthest from the chosen set.
- PointNet++ set-abstraction: FPS centroids → ball query neighbourhoods → PointNet locally → upsample for segmentation.
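The FPS step above, as a minimal pure-Python sketch. Starting from point 0 is an arbitrary choice here; real implementations usually seed from a random point:

```python
import math

def fps(points, k):
    """Farthest-point sampling: greedily add the point whose distance to the
    nearest already-chosen point is largest."""
    chosen = [0]
    # dist[i] = distance from point i to its nearest chosen point so far
    dist = [math.dist(p, points[0]) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], math.dist(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen
```

Each chosen centroid would then gather a ball-query neighbourhood for the local PointNet.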
Comparisons
- VoxNet vs PointNet: VoxNet: regularises via voxelization, 3D conv, O(N³) memory, low resolution. PointNet: raw points, shared MLP + max pool, no quantization, no local context.
- PointNet vs PointNet++: Flat global pool vs hierarchical local pools (FPS + ball query).
- DGCNN vs PointNet++: DGCNN uses kNN in feature space (semantic neighbours); PointNet++ uses ball query in xyz space (geometric neighbours).
Keywords
voxel · PointNet · PointNet++ · DGCNN · EdgeConv · MeshCNN · symmetric function · critical points
Unit 5 — NeRF & 3D Gaussian Splatting
NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
One-liners
- 3DGS = per-scene optimisation, NOT a neural network.
- Three pillars: scene modelling / image formation / optimisation.
- Per-Gaussian params: 3 + 7 + 1 + 48 = 59.
- Σ = R·S·Sᵀ·Rᵀ guarantees PSD.
- Metrics: PSNR↑, SSIM↑, LPIPS↓.
Formulas
- C = Σ c_i α_i ∏_{j<i}(1 − α_j)
- Σ = R·S·Sᵀ·Rᵀ
- L = (1 − λ)·L₁ + λ·L_D-SSIM, λ ≈ 0.2
Definitions
- SH = orthonormal angular basis (view-dependent colour).
- ADC = clone/split/prune of Gaussians.
- COLMAP = SfM pre-processing for poses + sparse cloud.
Algorithms
- 3DGS render: project μ, Σ to 2D → sort Gaussians front-to-back by depth → alpha-composite with accumulated transmittance.
- ADC step: for each Gaussian, look at position gradient. Small + high grad → clone. Large + high grad → split. Low α → prune.
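The compositing formula C = Σ c_i α_i ∏_{j<i}(1 − α_j), as a scalar sketch: one colour channel, Gaussians assumed already sorted front-to-back:

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing: each Gaussian contributes c_i * a_i
    weighted by the transmittance (light not yet absorbed) in front of it."""
    c_out, transmittance = 0.0, 1.0
    for c, a in zip(colors, alphas):
        c_out += c * a * transmittance
        transmittance *= (1.0 - a)
    return c_out
```

Once transmittance hits zero (a fully opaque Gaussian), everything behind it is invisible.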
Comparisons
- NeRF vs 3DGS: NeRF: implicit MLP, dense ray marching, hours to train, seconds/frame to render. 3DGS: explicit Gaussians, rasteriser, ~30 min training, real-time rendering.
- Direct Σ optim vs R·S·Sᵀ·Rᵀ: Direct can produce invalid (non-PSD) covariances; decomposition is PSD by construction.
Keywords
NeRF · 3DGS · spherical harmonics · ADC · COLMAP · PSNR · SSIM · LPIPS · alpha compositing
Unit 6 — Attention & Transformers
Attention Mechanism & Transformer Architecture
One-liners
- Attn = softmax(QKᵀ/√dₖ) V. /√dₖ rescales variance.
- MHA: h parallel attentions in dₖ = d_model/h, concat, project.
- Encoder = self-attn + FFN. Decoder = masked self-attn + cross-attn + FFN.
- Positional encoding required: self-attn is permutation equivariant.
Formulas
- Attn(Q,K,V) = softmax(QKᵀ/√dₖ) V
- PE(pos, 2i) = sin(pos / 10000^{2i/d})
- y = x + Sublayer(LN(x)) (PreNorm)
Definitions
- Self-attn = Q,K,V from same seq.
- Cross-attn = Q from decoder, K,V from encoder.
- Masked self-attn = look-ahead mask sets future logits to −∞.
Algorithms
- Decoder step: feed prefix → masked self-attn → cross-attn over encoder → FFN → logits → argmax/sample → append.
- KV-cache: store K, V of past tokens; per step compute Q for new token only.
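Scaled dot-product attention with the look-ahead mask, as a minimal single-head NumPy sketch (no batching or multi-head split; those are left out deliberately):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V; the causal mask sets future logits to -inf
    so their softmax weight is exactly zero."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    if causal:
        n = logits.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        logits = np.where(future, -np.inf, logits)
    # numerically stable softmax over keys
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With `causal=True` this is the decoder's masked self-attention; without it, encoder self-attention (or cross-attention if K, V come from the encoder).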
Comparisons
- RNN / LSTM vs Transformer: Sequential vs parallel; O(n) memory vs O(n²); vanishing grads vs constant path length.
- Sinusoidal PE vs Learned PE: Sinusoidal generalises to unseen lengths; learned is data-dependent and breaks beyond training length.
Keywords
attention · multi-head · scaled dot product · self-attention · cross-attention · masked · PE · Add & Norm · FFN
Unit 7 — Vision Transformers (ViT)
ViT Pipeline, Scaling, and Swin
One-liners
- ViT: 224² image → 16×16 patches → 196 tokens (+CLS) → encoder → classify.
- ViT-B/16 ≈ 86 M params. Per layer ≈ 7 M.
- Position embedding ESSENTIAL — without it, attention is order-blind.
- Swin: window self-attn + shift every other layer ⇒ O(n) per layer, global RF over depth.
Formulas
- N = HW/P²; tokens = N + 1 (with CLS)
- Per-block params ≈ 4d² + 2d·d_ff
- ViT cost: O(N²d). Swin: O(M²N)
Definitions
- Patch embedding = conv(P, stride P).
- [CLS] = learnable global token.
- Swin = local windows + shifted grid.
Algorithms
- ViT forward: patchify → project → +CLS → +PE → L encoder layers → CLS head.
- Swin block: window self-attn → shift → window self-attn → reverse-shift → MLP.
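The patchify step is a pure reshape, matching N = HW/P²: a 224² RGB image with P = 16 gives 196 tokens of dim 768 (the projection to d_model is omitted here):

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = HW/P^2 flattened patches of dim P*P*C."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C): group rows/cols into tiles
    patches = img.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)
```

In ViT the same operation is implemented as a conv with kernel P and stride P, which fuses patchify and projection.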
Comparisons
- CNN vs ViT: CNN: local early, growing RF, win on small data. ViT: mixed local/global early, needs massive data, scales better.
- ViT vs Swin: ViT: global attention, O(N²). Swin: local windows + shift, O(N) per layer, hierarchical downsampling for dense tasks.
Keywords
patch · CLS · ViT-B/16 · PE · Swin · shifted window · inductive bias · JFT
Unit 8 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)
Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
One-liners
- SimCLR: 2 augmentations, big batch, NT-Xent.
- MoCo: queue of past keys + momentum encoder (small batch OK).
- BYOL: no negatives — predictor head + stop-gradient + momentum target.
- CLIP: image-text contrastive on 400 M pairs; zero-shot via 'a photo of a {class}'.
Formulas
- L = −log [ e^{s⁺/τ} / Σ_j e^{s_j/τ} ]
- θ_k ← m θ_k + (1−m) θ_q (MoCo, m ≈ 0.999)
- CLIP loss = ½(L_{i→t} + L_{t→i}) (symmetric CE)
Definitions
- Positive pair = two augmented views of same image.
- Projection head g = MLP between encoder and loss; discarded downstream.
- Zero-shot = no labelled examples of target classes seen.
Algorithms
- SimCLR step: augment ×2 → encoder → projection → NT-Xent over 2N − 2 negatives.
- CLIP step: encode images → encode texts → N×N cosine → symmetric softmax CE.
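The InfoNCE loss for a single anchor, as a sketch (τ = 0.1 is an illustrative value; NT-Xent applies this per anchor over all 2N views):

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.1):
    """L = -log( e^{s+/tau} / (e^{s+/tau} + sum_j e^{s_j/tau}) ).
    Low loss <=> positive similarity dominates all negatives."""
    num = math.exp(pos_sim / tau)
    den = num + sum(math.exp(s / tau) for s in neg_sims)
    return -math.log(num / den)
```

Shrinking τ sharpens the softmax, so hard negatives (similarities close to the positive's) dominate the gradient.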
Comparisons
- SimCLR vs MoCo: SimCLR: big batch, in-batch negatives. MoCo: queue + momentum encoder, decouples batch size from #negatives.
- MoCo / SimCLR vs BYOL: MoCo/SimCLR need negatives. BYOL doesn't — avoids collapse via predictor + stop-gradient + momentum target.
Keywords
SimCLR · InfoNCE · NT-Xent · MoCo · momentum encoder · BYOL · stop-gradient · CLIP · zero-shot
Unit 9 — SSL: DINO, MAE, JEPA
DINO, MAE, JEPA — Modern SSL Beyond Contrastive
One-liners
- DINO = self-DIstillation NO labels. EMA teacher; centering + sharpening; multi-crop.
- DINO output dim = 65,536 (large to prevent collapse-to-one-dim).
- MAE: 75% mask, encoder sees only visible, small decoder reconstructs pixels.
- JEPA: predict TARGET REPRESENTATIONS (not pixels). Latent-space L2.
Formulas
- DINO: L = −Σ p_t · log p_s, with p_t = softmax((g_t − c)/τ_t)
- EMA: θ_t ← λ θ_t + (1−λ) θ_s, λ ≈ 0.996 → 1
- MAE: L = MSE on masked patches only
Definitions
- Self-distillation = student matches teacher's distribution.
- Centering = subtract running mean (anti-collapse #1).
- Sharpening = low teacher τ (anti-collapse #2).
- Registers = scratchpad tokens, no position, clean attention maps.
Algorithms
- DINO step: augment image into multi-crop set → student over all, teacher over global → softmax with centering+sharpening → CE → backprop student only → EMA update teacher.
- MAE step: patchify → mask 75% → encoder on visible → insert mask tokens → decoder → MSE on masked patches.
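The EMA teacher update θ_t ← λ θ_t + (1−λ) θ_s, sketched parameter-wise (flat lists of floats are a simplification of real parameter tensors):

```python
def ema_update(teacher, student, lam=0.996):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise.
    The teacher receives no gradients; it is a slow-moving average of the
    student, which stabilises the targets and helps prevent collapse."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]
```

In DINO, λ is additionally annealed from ≈0.996 toward 1 over training, freezing the teacher late on.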
Comparisons
- DINO vs BYOL: BYOL outputs feature vectors, MSE between predictor and target. DINO outputs distributions, cross-entropy with centering+sharpening.
- MAE vs JEPA: MAE predicts pixels (wastes capacity on texture). JEPA predicts target features (semantic level).
Keywords
DINO · EMA teacher · centering · sharpening · multi-crop · MAE · 75% mask · JEPA · registers · I-JEPA
Unit 10 — Transformer Advances (ViT-5 era)
Modern Transformer Upgrades
One-liners
- 7 modern upgrades: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash.
- PreNorm: x + Sublayer(LN(x)). Unbroken residual.
- RMSNorm = LayerNorm without mean subtraction. Cheaper.
- RoPE: rotation matrix on Q, K. Encodes RELATIVE position.
- Flash Attention: tile + online softmax → never materialise N×N in HBM.
- GQA: heads share K, V per group → KV-cache shrinks h/G×.
Formulas
- PreNorm: y = x + Sublayer(LN(x))
- RMSNorm: y = γ · x / RMS(x)
- RoPE: rotate (q_{2i}, q_{2i+1}) by m · θᵢ
Definitions
- Registers = global scratchpad tokens, no position.
- LayerScale γ_l ≈ 1e-4 at init.
- KV-cache: store past K, V for autoregressive.
Algorithms
- Flash Attention forward: tile (Q_i, K_j, V_j) blocks → compute partial softmax → accumulate via online softmax stats → next block.
- Decoder with KV-cache: per step, compute new Q only; append new K, V; attend new Q against cached K, V.
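The RoPE rotation above, as a NumPy sketch for one head vector. The pairing of dims and θᵢ = base^{−2i/d} follow the formula line; no particular library's API is implied:

```python
import numpy as np

def rope(q, pos, base=10000.0):
    """Rotate each pair (q_{2i}, q_{2i+1}) by angle pos * theta_i, with
    theta_i = base^{-2i/d}. A rotation preserves norms, and dot products of
    rotated q and k depend only on their RELATIVE position."""
    d = q.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = pos * theta                      # one angle per dim-pair
    q2 = q.reshape(-1, 2)                  # (d//2, 2) pairs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.stack([q2[:, 0] * cos - q2[:, 1] * sin,
                    q2[:, 0] * sin + q2[:, 1] * cos], axis=-1)
    return out.reshape(d)
```

The relative property is what the test checks: shifting both positions by the same amount leaves q·k unchanged.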
Comparisons
- PostNorm vs PreNorm: PostNorm needs careful warmup; PreNorm has unbroken residual stream and trains stably deep.
- LayerNorm vs RMSNorm: LayerNorm centers + scales; RMSNorm only scales. RMSNorm cheaper; mean subtraction empirically negligible.
- MHA vs GQA / MQA: MHA: K, V per head. MQA: K, V shared across all heads (smallest cache, quality loss). GQA: per-group share — sweet spot.
Keywords
PreNorm · RMSNorm · LayerScale · QK-Norm · Registers · RoPE · Flash Attention · GQA · KV-cache
Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)
VLM Architecture — Encoders, Connectors, Positional Encoding
One-liners
- VLM three pillars: Vision Encoder + Connector + LLM.
- Connector = ONLY random-init piece in stitched VLMs.
- Prefix-LM: bidirectional on image+prompt, causal on answer, loss on answer.
- SigLIP: pairwise sigmoid BCE (no batch-wide softmax).
- Qwen2-VL: dynamic resolution + 2-layer MLP connector + M-RoPE.
- Gemma 4: native multimodal (no connector).
Formulas
- SigLIP: L = −1/n² Σᵢⱼ [y log σ(z) + (1−y) log(1−σ(z))]
- M-RoPE position index: (t, r, c)
Definitions
- Modality gap = discrete vocab vs continuous pixels.
- Prefix-LM = mixed bidirectional/causal mask.
- Dynamic resolution = native AR + clamped token count.
Algorithms
- PaliGemma forward: SigLIP(image) → linear → concat with text tokens → Gemma decoder with prefix-LM mask → next-token loss on suffix.
- M-RoPE: split head dim into 3; rotate Q, K by (m_t·θ, m_r·θ, m_c·θ) respectively.
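The SigLIP objective, looped naively over the n × n similarity grid. The 1/n² normalisation follows this cheatsheet's formula; each pair is an independent binary problem, which is exactly why no batch-wide softmax (or cross-device sync) is needed:

```python
import math

def siglip_loss(sims, labels):
    """Pairwise sigmoid BCE over an n x n (image, text) similarity grid.
    labels[i][j] = 1 for matched pairs (the diagonal), else 0."""
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = 1.0 / (1.0 + math.exp(-sims[i][j]))   # sigmoid
            y = labels[i][j]
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / (n * n)
```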
Comparisons
- CLIP vs SigLIP: CLIP: softmax over batch → needs sync. SigLIP: pairwise sigmoid BCE → independent per pair → scales arbitrarily large batches.
- PaliGemma 224² fixed vs Qwen2-VL native AR: PaliGemma loses fine detail. Qwen2-VL preserves AR + resolution; tile count clamped by user.
- Stitched (PaliGemma) vs Native multimodal (Gemma 4): Stitched: connector bottleneck reconciles two pretrained latent spaces. Native: joint training from scratch, no connector.
Keywords
VLM · SigLIP · PaliGemma · Qwen2-VL · M-RoPE · Prefix-LM · connector · dynamic resolution · Gemma 4
Unit 12 — Video Understanding
Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
One-liners
- Video ≠ images × T (temporal pattern matters).
- I3D: inflate 2D filters to 3D, /K_T for activation scale.
- Two-Stream: spatial (RGB) + temporal (stacked optical flow, 2L ch).
- SlowFast: low-fps semantic + high-fps motion + lateral fusion.
- ViViT: uniform frame sample OR tubelet embedding.
- TimeSformer winner: divided space-time attention.
Formulas
- 3D conv filter: C_out × C_in × K_T × K_H × K_W
- I3D inflate: W_3D = W_2D / K_T
- Joint attn cost: O((TN)²·d). Divided ≈ O((T²N + TN²)·d)
Definitions
- Optical flow = 2D motion field per pixel.
- Tubelet = 3D patch (t × h × w).
- Divided attn = factorise temporal then spatial.
Algorithms
- Two-Stream forward: spatial CNN on RGB + temporal CNN on L flow frames → late fusion of softmax.
- TimeSformer divided block: temporal MSA → spatial MSA → MLP, all inside every block.
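The joint vs divided cost formulas, as plain arithmetic (counting pairwise attention scores × d and dropping constants):

```python
def attn_costs(T, N, d):
    """Score-computation cost for T frames of N tokens each.
    Joint: every token attends to all T*N tokens.
    Divided: a temporal pass (each token vs T tokens at its location)
    plus a spatial pass (each token vs N tokens in its frame)."""
    joint = (T * N) ** 2 * d
    divided = (T * T * N + T * N * N) * d
    return joint, divided
```

For T = 8 frames of N = 196 patch tokens, joint attention is already ~7–8× more expensive per layer, and the gap widens with longer clips.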
Comparisons
- 2D CNN + LSTM vs I3D: 2D+LSTM: per-frame spatial features then RNN aggregator. I3D: native spatio-temporal kernels, end-to-end on space and time.
- Joint space-time vs Divided space-time: Joint: O((TN)²) — prohibitive. Divided: separate temporal then spatial; near-linear; best accuracy/efficiency.
- Slow path vs Fast path (SlowFast): Slow: low fps, high channels, expensive per frame (semantics). Fast: high fps, low channels, cheap (motion).
Keywords
KineticsI3DC3DTwo-Streamoptical flowSlowFastViViTTimeSformertubeletdivided attention