Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Cheatsheet
Ultra-condensed. Revise a chapter in minutes.
Unit 1 — Object Detection
Object Detection — R-CNN family, YOLO, NMS, mAP
One-liners
- R-CNN slow → Fast (RoI Pool) → Faster (RPN) → YOLO (single shot).
- RPN: 9 anchors per location (3 scales × 3 ratios).
- YOLO loss: 5 terms (center, size with √, obj-conf, noobj-conf, class). λ_coord = 5, λ_noobj = 0.5.
- NMS is per class. Soft-NMS multiplies by IoU-decay instead of zeroing.
Formulas
- IoU = |A∩B| / |A∪B|
- GIoU = IoU − |C\(A∪B)| / |C|
- FL = −(1 − p_t)^γ · log p_t
- YOLO output: S × S × (B·5 + C)
Definitions
- Anchor = predefined box prior; predictions are offsets.
- RPN = shared-backbone proposal network (Faster R-CNN).
- mAP = mean of per-class AP across classes.
Algorithms
- NMS: sort by score; keep top; suppress IoU > τ; repeat. PER CLASS.
- mAP: sort detections; mark TP/FP at IoU ≥ 0.5; compute cumulative P, R; AP = area under PR curve; mean over classes.
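The greedy NMS loop above can be sketched in a few lines of pure Python. A minimal sketch: boxes as (x1, y1, x2, y2) tuples and τ = 0.5 are illustrative choices, and this is one class only (run it per class).

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, tau=0.5):
    """Greedy NMS for ONE class: sort by score, keep top, drop IoU > tau."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        top = order.pop(0)
        keep.append(top)
        order = [i for i in order if iou(boxes[top], boxes[i]) <= tau]
    return keep
```

Soft-NMS would replace the hard filter with a score decay proportional to IoU.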
Comparisons
- RoI Pool vs RoI Align: Pool quantises twice (RoI to cells, cell to sub-cell); Align uses bilinear interpolation at exact float coords. Mask-AP gap on small objects.
- Two-stage (Faster R-CNN) vs One-stage (YOLO/RetinaNet): Two-stage accurate but slower (proposals + classification); one-stage real-time, historically lower AP — fixed by Focal Loss + FPN.
Keywords
IoU · anchor · RPN · NMS · mAP · Focal Loss · GIoU · Soft-NMS · Selective Search
Unit 2 — Dense Prediction: Segmentation + Depth
Dense Prediction — Segmentation & Monocular Depth
One-liners
- Semantic = class only. Instance = things w/ ids. Panoptic = both.
- U-Net skips CONCAT, ResNet skips ADD.
- RoI Align: bilinear interp, no quantization. Critical for Mask R-CNN.
- MiDaS depth is RELATIVE (scale-shift invariant loss).
Formulas
- Dice = 2|A∩B| / (|A|+|B|) = 2·IoU / (1+IoU)
- mIoU = mean over classes of per-class IoU
- Atrous conv: expands RF by (rate)× at fixed params
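A toy check of the Dice–IoU identity. Representing masks as sets of foreground pixel coordinates is an illustrative simplification, not a real mask format:

```python
def dice_and_iou(a, b):
    """a, b: sets of foreground pixel coordinates (toy stand-in for binary masks)."""
    inter = len(a & b)
    dice = 2 * inter / (len(a) + len(b))
    iou = inter / len(a | b)
    return dice, iou

# Dice = 2*IoU / (1 + IoU) holds for any pair of masks
a = {(0, 0), (0, 1), (1, 0)}
b = {(0, 1), (1, 0), (1, 1)}
d, i = dice_and_iou(a, b)
assert abs(d - 2 * i / (1 + i)) < 1e-12
```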
Definitions
- Transposed conv = learnable upsampling.
- RoI Align = bilinear interpolation; sub-pixel accurate.
- Dilated conv = gaps between kernel taps.
Algorithms
- Mask R-CNN head: per-RoI FCN → 28×28 mask per class; BCE on correct class only.
- FCN-8s decoder: upsample deep + add pool3, pool4 skips, upsample → output.
Comparisons
- Dice loss vs Cross-entropy: Dice handles class imbalance natively; CE saturates when one class dominates.
- RoI Pool vs RoI Align: Pool quantises twice; Align bilinearly samples → sub-pixel accuracy → big mask AP gain.
Keywords
FCN · U-Net · Mask R-CNN · RoI Align · atrous · Dice · mIoU · panoptic · MiDaS
Unit 3 — Pose Estimation
Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
One-liners
- Heatmap regression > coordinate regression (preserves spatial uncertainty).
- CPM: multi-stage refinement + intermediate supervision (kills vanishing grads).
- Top-down accurate, scales O(P). Bottom-up scales O(1), grouping is hard.
- OpenPose channels = K + 2L. e.g., 18 + 38 = 56.
- SMPL: β = 10 shape, θ = 72 pose. Mesh = 6890 vertices.
Formulas
- PAF score = ∫ PAF · limb_dir du, integrated from A to B
- PCK@α: correct ⟺ dist ≤ α · d_ref
Definitions
- Heatmap regression = dense per-pixel Gaussian target.
- PAF = vector field encoding limb direction.
- SMPL = parametric body model (β shape + θ pose).
Algorithms
- OpenPose grouping: candidate keypoints from heatmap argmax → score every pair via PAF line integral → Hungarian matching per limb.
- Heatmap argmax + parabola fit: fit y = ax² + bx + c to (h_{x-1}, h_x, h_{x+1}); peak at x − b/(2a).
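The three-point parabola fit above reduces to a closed-form sub-pixel offset. A minimal sketch (function name is ours; assumes the argmax is not at the array border):

```python
def subpixel_peak(h, x):
    """Refine integer argmax x of a 1D heatmap slice h with a 3-point parabola fit.
    Fitting y = ax^2 + bx + c through (x-1, x, x+1) puts the vertex at x - b/(2a),
    which simplifies to the closed form below."""
    hl, hc, hr = h[x - 1], h[x], h[x + 1]
    denom = hl - 2 * hc + hr          # equals 2a of the fitted parabola
    if denom == 0:                    # flat neighbourhood: keep integer location
        return float(x)
    return x + (hl - hr) / (2 * denom)
```

Near a true peak the offset lands in (−0.5, 0.5); apply it independently per axis for 2D heatmaps.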
Comparisons
- Top-down pose vs Bottom-up pose: Top-down: O(P) runtime, more accurate, fails when detector misses. Bottom-up: O(1) in P, robust to misses, grouping ambiguous in crowds.
- Coordinate regression vs Heatmap regression: Coordinate: no uncertainty, loses spatial structure. Heatmap: pixel-dense, expresses ambiguity, sub-pixel via parabola fit.
Keywords
heatmap · CPM · PAF · OpenPose · Hungarian · SMPL · HMR · PCKh · top-down · bottom-up
Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)
3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
One-liners
- Point clouds: unstructured, irregular density, unordered. CNNs break on all three.
- VoxNet: occupancy grid → 3D CNN. Memory O(N³); resolution cap ~32³.
- PointNet = shared MLP + MAX pool. Symmetric ⇒ permutation invariant.
- PointNet++ = hierarchical PointNet (FPS + ball query).
- DGCNN = EdgeConv over dynamic kNN in feature space.
- MeshCNN = operate on edges; pool = edge collapse.
Formulas
- PointNet(P) = γ(max_p h(p))
- EdgeConv: e_ij = h(x_i, x_j − x_i); x_i' = max_j e_ij
- VoxNet memory: O(N³) voxels
Definitions
- Symmetric function = permutation-invariant.
- Critical points = inputs that survive max-pool.
- Dynamic graph = kNN in feature space, rebuilt per layer.
Algorithms
- FPS (farthest-point sampling): pick first point; iteratively add the point farthest from the chosen set.
- PointNet++ set-abstraction: FPS centroids → ball query neighbourhoods → PointNet locally → upsample for segmentation.
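The FPS step above, as a minimal pure-Python sketch. Starting from point 0 is an arbitrary choice here; real implementations usually seed from a random point:

```python
import math

def fps(points, k):
    """Farthest-point sampling: greedily add the point whose distance to the
    nearest already-chosen point is largest."""
    chosen = [0]
    # dist[i] = distance from point i to its nearest chosen point so far
    dist = [math.dist(p, points[0]) for p in points]
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(dist[i], math.dist(points[i], points[nxt]))
                for i in range(len(points))]
    return chosen
```

Each chosen centroid would then gather a ball-query neighbourhood for the local PointNet.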
Comparisons
- VoxNet vs PointNet: VoxNet: regularises via voxelization, 3D conv, O(N³) memory, low resolution. PointNet: raw points, shared MLP + max pool, no quantization, no local context.
- PointNet vs PointNet++: Flat global pool vs hierarchical local pools (FPS + ball query).
- DGCNN vs PointNet++: DGCNN uses kNN in feature space (semantic neighbours); PointNet++ uses ball query in xyz space (geometric neighbours).
Keywords
voxel · PointNet · PointNet++ · DGCNN · EdgeConv · MeshCNN · symmetric function · critical points
Unit 5 — NeRF & 3D Gaussian Splatting
NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
One-liners
- 3DGS = per-scene optimisation, NOT a neural network.
- Three pillars: scene modelling / image formation / optimisation.
- Per-Gaussian params: 3 + 7 + 1 + 48 = 59.
- Σ = R·S·Sᵀ·Rᵀ guarantees PSD.
- Metrics: PSNR↑, SSIM↑, LPIPS↓.
Formulas
- C = Σ c_i α_i ∏_{j<i}(1 − α_j)
- Σ = R·S·Sᵀ·Rᵀ
- L = (1 − λ)·L₁ + λ·L_D-SSIM, λ ≈ 0.2
Definitions
- SH = orthonormal angular basis (view-dependent colour).
- ADC = clone/split/prune of Gaussians.
- COLMAP = SfM pre-processing for poses + sparse cloud.
Algorithms
- 3DGS render: project μ, Σ to 2D → sort Gaussians front-to-back by depth → alpha-composite with accumulated transmittance.
- ADC step: for each Gaussian, look at position gradient. Small + high grad → clone. Large + high grad → split. Low α → prune.
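The compositing formula C = Σ c_i α_i ∏_{j<i}(1 − α_j), as a scalar sketch: one colour channel, Gaussians assumed already sorted front-to-back:

```python
def composite(colors, alphas):
    """Front-to-back alpha compositing: each Gaussian contributes c_i * a_i
    weighted by the transmittance (light not yet absorbed) in front of it."""
    c_out, transmittance = 0.0, 1.0
    for c, a in zip(colors, alphas):
        c_out += c * a * transmittance
        transmittance *= (1.0 - a)
    return c_out
```

Once transmittance hits zero (a fully opaque Gaussian), everything behind it is invisible.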
Comparisons
- NeRF vs 3DGS: NeRF: implicit MLP, dense ray marching, hours to train, seconds/frame to render. 3DGS: explicit Gaussians, rasteriser, ~30 min training, real-time rendering.
- Direct Σ optim vs R·S·Sᵀ·Rᵀ: Direct can produce invalid (non-PSD) covariances; decomposition is PSD by construction.
Keywords
NeRF · 3DGS · spherical harmonics · ADC · COLMAP · PSNR · SSIM · LPIPS · alpha compositing
Unit 6 — Attention & Transformers
Attention Mechanism & Transformer Architecture
One-liners
- Attn = softmax(QKᵀ/√dₖ) V. /√dₖ rescales variance.
- MHA: h parallel attentions in dₖ = d_model/h, concat, project.
- Encoder = self-attn + FFN. Decoder = masked self-attn + cross-attn + FFN.
- Positional encoding required: self-attn is permutation equivariant.
Formulas
- Attn(Q,K,V) = softmax(QKᵀ/√dₖ) V
- PE(pos, 2i) = sin(pos / 10000^{2i/d})
- y = x + Sublayer(LN(x)) (PreNorm)
Definitions
- Self-attn = Q,K,V from same seq.
- Cross-attn = Q from decoder, K,V from encoder.
- Masked self-attn = look-ahead mask sets future logits to −∞.
Algorithms
- Decoder step: feed prefix → masked self-attn → cross-attn over encoder → FFN → logits → argmax/sample → append.
- KV-cache: store K, V of past tokens; per step compute Q for new token only.
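Scaled dot-product attention with the look-ahead mask, as a minimal single-head NumPy sketch (no batching or multi-head split; those are left out deliberately):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V; the causal mask sets future logits to -inf
    so their softmax weight is exactly zero."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    if causal:
        n = logits.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        logits = np.where(future, -np.inf, logits)
    # numerically stable softmax over keys
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With `causal=True` this is the decoder's masked self-attention; without it, encoder self-attention (or cross-attention if K, V come from the encoder).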
Comparisons
- RNN / LSTM vs Transformer: Sequential vs parallel; O(n) memory vs O(n²); vanishing grads vs constant path length.
- Sinusoidal PE vs Learned PE: Sinusoidal generalises to unseen lengths; learned is data-dependent and breaks beyond training length.
Keywords
attention · multi-head · scaled dot product · self-attention · cross-attention · masked · PE · Add & Norm · FFN
Unit 7 — Vision Transformers (ViT)
ViT Pipeline, Scaling, and Swin
One-liners
- ViT: 224² image → 16×16 patches → 196 tokens (+CLS) → encoder → classify.
- ViT-B/16 ≈ 86 M params. Per layer ≈ 7 M.
- Position embedding ESSENTIAL — without it, attention is order-blind.
- Swin: window self-attn + shift every other layer ⇒ O(n) per layer, global RF over depth.
Formulas
- N = HW/P²; tokens = N + 1 (with CLS)
- Per-block params ≈ 4d² + 2d·d_ff
- ViT cost: O(N²d). Swin: O(M²N)
Definitions
- Patch embedding = conv(P, stride P).
- [CLS] = learnable global token.
- Swin = local windows + shifted grid.
Algorithms
- ViT forward: patchify → project → +CLS → +PE → L encoder layers → CLS head.
- Swin block: window self-attn → shift → window self-attn → reverse-shift → MLP.
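The patchify step is a pure reshape, matching N = HW/P²: a 224² RGB image with P = 16 gives 196 tokens of dim 768 (the projection to d_model is omitted here):

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = HW/P^2 flattened patches of dim P*P*C."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C): group rows/cols into tiles
    patches = img.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)
```

In ViT the same operation is implemented as a conv with kernel P and stride P, which fuses patchify and projection.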
Comparisons
- CNN vs ViT: CNN: local early, growing RF, win on small data. ViT: mixed local/global early, needs massive data, scales better.
- ViT vs Swin: ViT: global attention, O(N²). Swin: local windows + shift, O(N) per layer, hierarchical downsampling for dense tasks.
Keywords
patch · CLS · ViT-B/16 · PE · Swin · shifted window · inductive bias · JFT
Unit 8 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)
Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
One-liners
- SimCLR: 2 augmentations, big batch, NT-Xent.
- MoCo: queue of past keys + momentum encoder (small batch OK).
- BYOL: no negatives — predictor head + stop-gradient + momentum target.
- CLIP: image-text contrastive on 400 M pairs; zero-shot via 'a photo of a {class}'.
Formulas
- L = −log [ e^{s⁺/τ} / Σ_j e^{s_j/τ} ]
- θ_k ← m θ_k + (1−m) θ_q (MoCo, m ≈ 0.999)
- CLIP loss = ½(L_{i→t} + L_{t→i}) (symmetric CE)
Definitions
- Positive pair = two augmented views of same image.
- Projection head g = MLP between encoder and loss; discarded downstream.
- Zero-shot = no labelled examples of target classes seen.
Algorithms
- SimCLR step: augment ×2 → encoder → projection → NT-Xent over 2N − 2 negatives.
- CLIP step: encode images → encode texts → N×N cosine → symmetric softmax CE.
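The InfoNCE loss for a single anchor, as a sketch (τ = 0.1 is an illustrative value; NT-Xent applies this per anchor over all 2N views):

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.1):
    """L = -log( e^{s+/tau} / (e^{s+/tau} + sum_j e^{s_j/tau}) ).
    Low loss <=> positive similarity dominates all negatives."""
    num = math.exp(pos_sim / tau)
    den = num + sum(math.exp(s / tau) for s in neg_sims)
    return -math.log(num / den)
```

Shrinking τ sharpens the softmax, so hard negatives (similarities close to the positive's) dominate the gradient.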
Comparisons
- SimCLR vs MoCo: SimCLR: big batch, in-batch negatives. MoCo: queue + momentum encoder, decouples batch size from #negatives.
- MoCo / SimCLR vs BYOL: MoCo/SimCLR need negatives. BYOL doesn't — avoids collapse via predictor + stop-gradient + momentum target.
Keywords
SimCLR · InfoNCE · NT-Xent · MoCo · momentum encoder · BYOL · stop-gradient · CLIP · zero-shot
Unit 9 — SSL: DINO, MAE, JEPA
DINO, MAE, JEPA — Modern SSL Beyond Contrastive
One-liners
- DINO = self-DIstillation NO labels. EMA teacher; centering + sharpening; multi-crop.
- DINO output dim = 65,536 (large to prevent collapse-to-one-dim).
- MAE: 75% mask, encoder sees only visible, small decoder reconstructs pixels.
- JEPA: predict TARGET REPRESENTATIONS (not pixels). Latent-space L2.
Formulas
- DINO: L = −Σ p_t · log p_s, with p_t = softmax((g_t − c)/τ_t)
- EMA: θ_t ← λ θ_t + (1−λ) θ_s, λ ≈ 0.996 → 1
- MAE: L = MSE on masked patches only
Definitions
- Self-distillation = student matches teacher's distribution.
- Centering = subtract running mean (anti-collapse #1).
- Sharpening = low teacher τ (anti-collapse #2).
- Registers = scratchpad tokens, no position, clean attention maps.
Algorithms
- DINO step: augment image into multi-crop set → student over all, teacher over global → softmax with centering+sharpening → CE → backprop student only → EMA update teacher.
- MAE step: patchify → mask 75% → encoder on visible → insert mask tokens → decoder → MSE on masked patches.
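The EMA teacher update θ_t ← λ θ_t + (1−λ) θ_s, sketched parameter-wise (flat lists of floats are a simplification of real parameter tensors):

```python
def ema_update(teacher, student, lam=0.996):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise.
    The teacher receives no gradients; it is a slow-moving average of the
    student, which stabilises the targets and helps prevent collapse."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]
```

In DINO, λ is additionally annealed from ≈0.996 toward 1 over training, freezing the teacher late on.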
Comparisons
- DINO vs BYOL: BYOL outputs feature vectors, MSE between predictor and target. DINO outputs distributions, cross-entropy with centering+sharpening.
- MAE vs JEPA: MAE predicts pixels (wastes capacity on texture). JEPA predicts target features (semantic level).
Keywords
DINO · EMA teacher · centering · sharpening · multi-crop · MAE · 75% mask · JEPA · registers · I-JEPA
Unit 10 — Transformer Advances (ViT-5 era)
Modern Transformer Upgrades
One-liners
- 7 modern upgrades: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash.
- PreNorm: x + Sublayer(LN(x)). Unbroken residual.
- RMSNorm = LayerNorm without mean subtraction. Cheaper.
- RoPE: rotation matrix on Q, K. Encodes RELATIVE position.
- Flash Attention: tile + online softmax → never materialise N×N in HBM.
- GQA: heads share K, V per group → KV-cache shrinks h/G×.
Formulas
- PreNorm: y = x + Sublayer(LN(x))
- RMSNorm: y = γ · x / RMS(x)
- RoPE: rotate (q_{2i}, q_{2i+1}) by m · θᵢ
Definitions
- Registers = global scratchpad tokens, no position.
- LayerScale γ_l ≈ 1e-4 at init.
- KV-cache: store past K, V for autoregressive.
Algorithms
- Flash Attention forward: tile (Q_i, K_j, V_j) blocks → compute partial softmax → accumulate via online softmax stats → next block.
- Decoder with KV-cache: per step, compute new Q only; append new K, V; attend new Q against cached K, V.
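The RoPE rotation above, as a NumPy sketch for one head vector. The pairing of dims and θᵢ = base^{−2i/d} follow the formula line; no particular library's API is implied:

```python
import numpy as np

def rope(q, pos, base=10000.0):
    """Rotate each pair (q_{2i}, q_{2i+1}) by angle pos * theta_i, with
    theta_i = base^{-2i/d}. A rotation preserves norms, and dot products of
    rotated q and k depend only on their RELATIVE position."""
    d = q.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = pos * theta                      # one angle per dim-pair
    q2 = q.reshape(-1, 2)                  # (d//2, 2) pairs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.stack([q2[:, 0] * cos - q2[:, 1] * sin,
                    q2[:, 0] * sin + q2[:, 1] * cos], axis=-1)
    return out.reshape(d)
```

The relative property is what the test checks: shifting both positions by the same amount leaves q·k unchanged.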
Comparisons
- PostNorm vs PreNorm: PostNorm needs careful warmup; PreNorm has unbroken residual stream and trains stably deep.
- LayerNorm vs RMSNorm: LayerNorm centers + scales; RMSNorm only scales. RMSNorm cheaper; mean subtraction empirically negligible.
- MHA vs GQA / MQA: MHA: K, V per head. MQA: K, V shared across all heads (smallest cache, quality loss). GQA: per-group share — sweet spot.
Keywords
PreNorm · RMSNorm · LayerScale · QK-Norm · Registers · RoPE · Flash Attention · GQA · KV-cache
Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)
VLM Architecture — Encoders, Connectors, Positional Encoding
One-liners
- VLM three pillars: Vision Encoder + Connector + LLM.
- Connector = ONLY random-init piece in stitched VLMs.
- Prefix-LM: bidirectional on image+prompt, causal on answer, loss on answer.
- SigLIP: pairwise sigmoid BCE (no batch-wide softmax).
- Qwen2-VL: dynamic resolution + 2-layer MLP connector + M-RoPE.
- Gemma 4: native multimodal (no connector).
Formulas
- SigLIP: L = −1/n² Σᵢⱼ [y log σ(z) + (1−y) log(1−σ(z))]
- M-RoPE position index: (t, r, c)
Definitions
- Modality gap = discrete vocab vs continuous pixels.
- Prefix-LM = mixed bidirectional/causal mask.
- Dynamic resolution = native AR + clamped token count.
Algorithms
- PaliGemma forward: SigLIP(image) → linear → concat with text tokens → Gemma decoder with prefix-LM mask → next-token loss on suffix.
- M-RoPE: split head dim into 3; rotate Q, K by (m_t·θ, m_r·θ, m_c·θ) respectively.
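The SigLIP objective, looped naively over the n × n similarity grid. The 1/n² normalisation follows this cheatsheet's formula; each pair is an independent binary problem, which is exactly why no batch-wide softmax (or cross-device sync) is needed:

```python
import math

def siglip_loss(sims, labels):
    """Pairwise sigmoid BCE over an n x n (image, text) similarity grid.
    labels[i][j] = 1 for matched pairs (the diagonal), else 0."""
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = 1.0 / (1.0 + math.exp(-sims[i][j]))   # sigmoid
            y = labels[i][j]
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / (n * n)
```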
Comparisons
- CLIP vs SigLIP: CLIP: softmax over batch → needs sync. SigLIP: pairwise sigmoid BCE → independent per pair → scales arbitrarily large batches.
- PaliGemma 224² fixed vs Qwen2-VL native AR: PaliGemma loses fine detail. Qwen2-VL preserves AR + resolution; tile count clamped by user.
- Stitched (PaliGemma) vs Native multimodal (Gemma 4): Stitched: connector bottleneck reconciles two pretrained latent spaces. Native: joint training from scratch, no connector.
Keywords
VLM · SigLIP · PaliGemma · Qwen2-VL · M-RoPE · Prefix-LM · connector · dynamic resolution · Gemma 4
Unit 12 — Video Understanding
Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
One-liners
- Video ≠ images × T (temporal pattern matters).
- I3D: inflate 2D filters to 3D, /K_T for activation scale.
- Two-Stream: spatial (RGB) + temporal (stacked optical flow, 2L ch).
- SlowFast: low-fps semantic + high-fps motion + lateral fusion.
- ViViT: uniform frame sample OR tubelet embedding.
- TimeSformer winner: divided space-time attention.
Formulas
- 3D conv filter: C_out × C_in × K_T × K_H × K_W
- I3D inflate: W_3D = W_2D / K_T
- Joint attn cost: O((TN)²·d). Divided ≈ O((T²N + TN²)·d)
Definitions
- Optical flow = 2D motion field per pixel.
- Tubelet = 3D patch (t × h × w).
- Divided attn = factorise temporal then spatial.
Algorithms
- Two-Stream forward: spatial CNN on RGB + temporal CNN on L flow frames → late fusion of softmax.
- TimeSformer divided block: temporal MSA → spatial MSA → MLP, all inside every block.
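The joint vs divided cost formulas, as plain arithmetic (counting pairwise attention scores × d and dropping constants):

```python
def attn_costs(T, N, d):
    """Score-computation cost for T frames of N tokens each.
    Joint: every token attends to all T*N tokens.
    Divided: a temporal pass (each token vs T tokens at its location)
    plus a spatial pass (each token vs N tokens in its frame)."""
    joint = (T * N) ** 2 * d
    divided = (T * T * N + T * N * N) * d
    return joint, divided
```

For T = 8 frames of N = 196 patch tokens, joint attention is already ~7–8× more expensive per layer, and the gap widens with longer clips.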
Comparisons
- 2D CNN + LSTM vs I3D: 2D+LSTM: per-frame spatial features then RNN aggregator. I3D: native spatio-temporal kernels, end-to-end on space and time.
- Joint space-time vs Divided space-time: Joint: O((TN)²) — prohibitive. Divided: separate temporal then spatial; near-linear; best accuracy/efficiency.
- Slow path vs Fast path (SlowFast): Slow: low fps, high channels, expensive per frame (semantics). Fast: high fps, low channels, cheap (motion).
Keywords
KineticsI3DC3DTwo-Streamoptical flowSlowFastViViTTimeSformertubeletdivided attention