
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Cheatsheet

Ultra-condensed. Revise a chapter in minutes.

Unit 1 — Object Detection

Object Detection — R-CNN family, YOLO, NMS, mAP
One-liners
  • R-CNN slow → Fast (RoI Pool) → Faster (RPN) → YOLO (single shot).
  • RPN: 9 anchors per location (3 scales × 3 ratios).
  • YOLO loss: 5 terms (center, size with √, obj-conf, noobj-conf, class). λ_coord = 5, λ_noobj = 0.5.
  • NMS is per class. Soft-NMS multiplies by IoU-decay instead of zeroing.
Formulas
  • IoU = |A∩B|/|A∪B|
  • GIoU = IoU − |C\(A∪B)|/|C|
  • FL = −(1 − p_t)^γ · log p_t
  • YOLO output: S × S × (B·5 + C)
Definitions
  • Anchor = predefined box prior; predictions are offsets.
  • RPN = shared-backbone proposal network (Faster R-CNN).
  • mAP = mean of per-class AP across classes.
Algorithms
  • NMS: sort by score; keep top; suppress IoU > τ; repeat. PER CLASS (see Code sketch below).
  • mAP: sort detections; mark TP/FP at IoU ≥ 0.5; compute cumulative P, R; AP = area under PR curve; mean over classes.
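Code sketch
  A minimal NumPy sketch of IoU and the greedy per-class NMS loop above. The (x1, y1, x2, y2) box format, threshold, and names are illustrative, not tied to any detection library; call nms once per class for class-wise suppression.

  import numpy as np

  def iou(box, boxes):
      # IoU of one box against an array of boxes, all (x1, y1, x2, y2).
      x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
      x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
      inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
      area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
      return inter / (area(box) + area(boxes) - inter)

  def nms(boxes, scores, iou_thresh=0.5):
      # Greedy NMS for ONE class: keep the top-scoring box, drop overlaps, repeat.
      order = np.argsort(scores)[::-1]
      keep = []
      while order.size > 0:
          i = order[0]
          keep.append(int(i))
          overlaps = iou(boxes[i], boxes[order[1:]])
          order = order[1:][overlaps <= iou_thresh]
      return keep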
Comparisons
  • RoI Pool vs RoI Align: Pool quantises twice (RoI to cells, cell to sub-cell); Align uses bilinear interpolation at exact float coords. Mask-AP gap on small objects.
  • Two-stage (Faster R-CNN) vs One-stage (YOLO/RetinaNet): Two-stage accurate but slower (proposals + classification); one-stage real-time, historically lower AP — fixed by Focal Loss + FPN.
Keywords
IoU · anchor · RPN · NMS · mAP · Focal Loss · GIoU · Soft-NMS · Selective Search

Unit 2 — Dense Prediction: Segmentation + Depth

Dense Prediction — Segmentation & Monocular Depth
One-liners
  • Semantic = class only. Instance = things w/ ids. Panoptic = both.
  • U-Net skips CONCAT, ResNet skips ADD.
  • RoI Align: bilinear interp, no quantization. Critical for Mask R-CNN.
  • MiDaS depth is RELATIVE (scale-shift invariant loss).
Formulas
  • Dice = 2|A∩B|/(|A|+|B|) = 2·IoU/(1+IoU)
  • mIoU = mean over classes of IoU per class
  • Atrous conv: effective kernel k_eff = k + (k−1)(r−1) → receptive field grows with rate r at fixed param count
Definitions
  • Transposed conv = learnable upsampling.
  • RoI Align = bilinear interpolation; sub-pixel accurate.
  • Dilated conv = gaps between kernel taps.
Algorithms
  • Mask R-CNN head: per-RoI FCN → 28×28 mask per class; BCE on correct class only.
  • FCN-8s decoder: upsample deep + add pool3, pool4 skips, upsample → output.
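Code sketch
  A minimal PyTorch sketch of the FCN-8s skip fusion above, assuming a VGG-style backbone that exposes pool3 (stride 8, 256 ch), pool4 (stride 16, 512 ch) and conv7 (stride 32, 4096 ch); channel counts are assumptions, and bilinear upsampling stands in for the learned transposed convolutions of the original paper.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class FCN8sDecoder(nn.Module):
      def __init__(self, num_classes: int):
          super().__init__()
          # 1x1 "score" heads map each feature map to per-class logits.
          self.score7 = nn.Conv2d(4096, num_classes, 1)
          self.score4 = nn.Conv2d(512, num_classes, 1)
          self.score3 = nn.Conv2d(256, num_classes, 1)

      def forward(self, conv7, pool4, pool3):
          x = self.score7(conv7)                                      # 1/32
          x = F.interpolate(x, scale_factor=2, mode="bilinear",
                            align_corners=False) + self.score4(pool4)  # 1/16
          x = F.interpolate(x, scale_factor=2, mode="bilinear",
                            align_corners=False) + self.score3(pool3)  # 1/8
          # Final 8x upsample back to input resolution.
          return F.interpolate(x, scale_factor=8, mode="bilinear",
                               align_corners=False)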
Comparisons
  • Dice loss vs Cross-entropy: Dice handles class imbalance natively; CE saturates when one class dominates.
  • RoI Pool vs RoI Align: Pool quantises twice; Align bilinearly samples → sub-pixel accuracy → big mask AP gain.
Keywords
FCN · U-Net · Mask R-CNN · RoI Align · atrous · Dice · mIoU · panoptic · MiDaS

Unit 3 — Pose Estimation

Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
One-liners
  • Heatmap regression > coordinate regression (preserves spatial uncertainty).
  • CPM: multi-stage refinement + intermediate supervision (kills vanishing grads).
  • Top-down accurate, scales O(P). Bottom-up scales O(1), grouping is hard.
  • OpenPose channels = K + 2L. e.g., 18 + 38 = 56.
  • SMPL: β = 10 shape, θ = 72 pose. Mesh = 6890 vertices.
Formulas
  • PAF score E = ∫₀¹ PAF(p(u)) · d̂_AB du, where p(u) interpolates from joint A to joint B
  • PCK@α correct ⟺ dist ≤ α · d_ref
Definitions
  • Heatmap regression = dense per-pixel Gaussian target.
  • PAF = vector field encoding limb direction.
  • SMPL = parametric body model (β shape + θ pose).
Algorithms
  • OpenPose grouping: candidate keypoints from heatmap argmax → score every pair via PAF line integral → Hungarian matching per limb.
  • Heatmap argmax + parabola fit: fit y = ax² + bx + c to (h_{x-1}, h_x, h_{x+1}); peak at x − b/(2a).
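Code sketch
  A minimal NumPy sketch of the sub-pixel refinement above along one axis of a heatmap; the Gaussian bump in the example is illustrative.

  import numpy as np

  def refine_peak_1d(heat):
      # Sub-pixel peak via a parabola through the argmax and its two neighbours.
      x = int(np.argmax(heat))
      if x == 0 or x == len(heat) - 1:          # no neighbour on one side
          return float(x)
      h_m, h_0, h_p = heat[x - 1], heat[x], heat[x + 1]
      denom = h_m - 2.0 * h_0 + h_p             # equals 2a of the fitted parabola
      if denom == 0:
          return float(x)
      offset = 0.5 * (h_m - h_p) / denom        # -b / (2a)
      return x + offset

  # Example: a bump peaking at 10.3 is recovered to roughly sub-pixel accuracy.
  xs = np.arange(32)
  print(refine_peak_1d(np.exp(-(xs - 10.3) ** 2 / 4.0)))  # ~10.3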
Comparisons
  • Top-down pose vs Bottom-up pose: Top-down: O(P) runtime, more accurate, fails when detector misses. Bottom-up: O(1) in P, robust to misses, grouping ambiguous in crowds.
  • Coordinate regression vs Heatmap regression: Coordinate: no uncertainty, loses spatial structure. Heatmap: pixel-dense, expresses ambiguity, sub-pixel via parabola fit.
Keywords
heatmap · CPM · PAF · OpenPose · Hungarian · SMPL · HMR · PCKh · top-down · bottom-up

Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)

3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
One-liners
  • Point clouds: unstructured, irregular density, unordered. CNNs break on all three.
  • VoxNet: occupancy grid → 3D CNN. Memory O(N³); resolution cap ~32³.
  • PointNet = shared MLP + MAX pool. Symmetric ⇒ permutation invariant.
  • PointNet++ = hierarchical PointNet (FPS + ball query).
  • DGCNN = EdgeConv over dynamic kNN in feature space.
  • MeshCNN = operate on edges; pool = edge collapse.
Formulas
  • PointNet(P) = γ(max_p h(p))
  • EdgeConv: e_ij = h(x_i, x_j − x_i); x_i' = max_j e_ij
  • VoxNet memory: O(N³) voxels
Definitions
  • Symmetric function = permutation-invariant.
  • Critical points = inputs that survive max-pool.
  • Dynamic graph = kNN in feature space, rebuilt per layer.
Algorithms
  • FPS (farthest-point sampling): pick first point; iteratively add the point farthest from the chosen set.
  • PointNet++ set-abstraction: FPS centroids → ball query neighbourhoods → PointNet locally → upsample for segmentation.
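Code sketch
  A minimal NumPy sketch of FPS and ball-query grouping from the set-abstraction step above; the radius, k, and edge-padding convention are illustrative choices, not the reference implementation.

  import numpy as np

  def farthest_point_sampling(points, m):
      # Pick m centroids: start anywhere, then repeatedly add the point
      # farthest from everything chosen so far.
      n = points.shape[0]
      chosen = [0]
      dist = np.full(n, np.inf)
      for _ in range(m - 1):
          dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
          chosen.append(int(np.argmax(dist)))
      return np.array(chosen)

  def ball_query(points, centroid_ids, radius, k):
      # For each centroid, indices of up to k points within `radius`,
      # padded by repetition when fewer than k neighbours are found.
      groups = []
      for c in centroid_ids:
          idx = np.where(np.linalg.norm(points - points[c], axis=1) < radius)[0]
          idx = idx[:k] if len(idx) >= k else np.pad(idx, (0, k - len(idx)), mode="edge")
          groups.append(idx)
      return np.stack(groups)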
Comparisons
  • VoxNet vs PointNet: VoxNet: regularises via voxelization, 3D conv, O(N³) memory, low resolution. PointNet: raw points, shared MLP + max pool, no quantization, no local context.
  • PointNet vs PointNet++: Flat global pool vs hierarchical local pools (FPS + ball query).
  • DGCNN vs PointNet++: DGCNN uses kNN in feature space (semantic neighbours); PointNet++ uses ball query in xyz space (geometric neighbours).
Keywords
voxel · PointNet · PointNet++ · DGCNN · EdgeConv · MeshCNN · symmetric function · critical points

Unit 5 — NeRF & 3D Gaussian Splatting

NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
One-liners
  • 3DGS = per-scene optimisation, NOT a neural network.
  • Three pillars: scene modelling / image formation / optimisation.
  • Per-Gaussian params: 3 (position) + 7 (rotation quaternion 4 + scale 3) + 1 (opacity) + 48 (SH colour coeffs, 3 × 16) = 59.
  • Σ = R·S·Sᵀ·Rᵀ guarantees PSD.
  • Metrics: PSNR↑, SSIM↑, LPIPS↓.
Formulas
  • C = Σ c_i α_i ∏_{j<i}(1 − α_j)
  • Σ = R·S·Sᵀ·Rᵀ
  • L = (1 − λ)·L₁ + λ·L_D-SSIM, λ ≈ 0.2
Definitions
  • SH = orthonormal angular basis (view-dependent colour).
  • ADC = clone/split/prune of Gaussians.
  • COLMAP = SfM pre-processing for poses + sparse cloud.
Algorithms
  • 3DGS render: sort Gaussians front-to-back by depth → project μ, Σ to 2D → alpha-composite with accumulated transmittance (see Code sketch below).
  • ADC step: for each Gaussian, look at position gradient. Small + high grad → clone. Large + high grad → split. Low α → prune.
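Code sketch
  A minimal NumPy sketch of the compositing formula C = Σ c_i α_i ∏_{j<i}(1 − α_j) for one pixel, assuming the per-Gaussian colours and opacities along the ray are already depth-sorted nearest-first; the early-stop threshold is an illustrative choice.

  import numpy as np

  def composite(colors, alphas):
      # Front-to-back alpha compositing: colors (N, 3), alphas (N,),
      # both sorted nearest-first for one pixel / ray.
      out = np.zeros(3)
      transmittance = 1.0                      # prod_{j<i} (1 - alpha_j)
      for c, a in zip(colors, alphas):
          out += transmittance * a * c
          transmittance *= (1.0 - a)
          if transmittance < 1e-4:             # early stop once nearly opaque
              break
      return out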
Comparisons
  • NeRF vs 3DGS: NeRF: implicit MLP, dense ray marching, hours to train, seconds/frame to render. 3DGS: explicit Gaussians, rasteriser, ~30 min training, real-time rendering.
  • Direct Σ optim vs R·S·Sᵀ·Rᵀ: Direct can produce invalid (non-PSD) covariances; decomposition is PSD by construction.
Keywords
NeRF · 3DGS · spherical harmonics · ADC · COLMAP · PSNR · SSIM · LPIPS · alpha compositing

Unit 6 — Attention & Transformers

Attention Mechanism & Transformer Architecture
One-liners
  • Attn = softmax(QKᵀ/√dₖ) V. /√dₖ rescales variance.
  • MHA: h parallel attentions in dₖ = d_model/h, concat, project.
  • Encoder = self-attn + FFN. Decoder = masked self-attn + cross-attn + FFN.
  • Positional encoding required: self-attn is permutation equivariant.
Formulas
  • Attn(Q,K,V) = softmax(QKᵀ/√dₖ) V
  • PE(pos, 2i) = sin(pos/10000^{2i/d})
  • y = x + Sublayer(LN(x)) (PreNorm)
Definitions
  • Self-attn = Q,K,V from same seq.
  • Cross-attn = Q from decoder, K,V from encoder.
  • Masked self-attn = look-ahead mask sets future logits to −∞.
Algorithms
  • Decoder step: feed prefix → masked self-attn → cross-attn over encoder → FFN → logits → argmax/sample → append.
  • KV-cache: store K, V of past tokens; per step compute Q for new token only.
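Code sketch
  A minimal NumPy sketch of scaled dot-product attention with an optional look-ahead mask, plus the KV-cache decode step from the list above; single head, no projections, names are illustrative.

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V, causal=False):
      # Scaled dot-product attention; Q (n_q, d), K/V (n_k, d).
      d = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d)
      if causal:                               # set future logits to -inf
          mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
          scores = np.where(mask, -np.inf, scores)
      return softmax(scores) @ V

  def decode_step(q_new, k_new, v_new, K_cache, V_cache):
      # KV-cache step: compute Q only for the new token, append its K/V,
      # and attend against the whole cache.
      K_cache = np.vstack([K_cache, k_new])
      V_cache = np.vstack([V_cache, v_new])
      out = attention(q_new[None, :], K_cache, V_cache)  # (1, d)
      return out[0], K_cache, V_cache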
Comparisons
  • RNN / LSTM vs Transformer: Sequential vs parallel; O(n) memory vs O(n²); vanishing grads vs constant path length.
  • Sinusoidal PE vs Learned PE: Sinusoidal generalises to unseen lengths; learned is data-dependent and breaks beyond training length.
Keywords
attention · multi-head · scaled dot product · self-attention · cross-attention · masked · PE · Add & Norm · FFN

Unit 7 — Vision Transformers (ViT)

ViT Pipeline, Scaling, and Swin
One-liners
  • ViT: 16×16 patches → 196 tokens (+CLS) → encoder → classify.
  • ViT-B/16 ≈ 86 M params. Per layer ≈ 7 M.
  • Position embedding ESSENTIAL — without it, attention is order-blind.
  • Swin: window self-attn + shift every other layer ⇒ O(n) per layer, global RF over depth.
Formulas
  • N = HW/P²; tokens = N + 1 (with CLS)
  • Per-block params = 4d² + 2d·d_ff
  • ViT cost: O(N²·d). Swin: O(M²·N·d)
Definitions
  • Patch embedding = conv(P, stride P).
  • [CLS] = learnable global token.
  • Swin = local windows + shifted grid.
Algorithms
  • ViT forward: patchify → project → +CLS → +PE → L encoder layers → CLS head.
  • Swin block: window self-attn → shift → window self-attn → reverse-shift → MLP.
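Code sketch
  A minimal NumPy sketch of window partitioning and the cyclic shift used in the Swin block above; window size M = 7 and the 56×56×96 feature map are illustrative, and the attention mask that blocks cross-window mixing after the shift is omitted.

  import numpy as np

  def window_partition(x, M):
      # Split a (H, W, C) feature map into non-overlapping (M*M, C) windows.
      H, W, C = x.shape
      x = x.reshape(H // M, M, W // M, M, C)
      return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

  def cyclic_shift(x, M):
      # Roll the map by M//2 so the next block's windows straddle old borders.
      return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

  # One shifted-window round: attention runs inside each (M*M, C) window,
  # so the per-layer cost scales with M^2 * N tokens instead of N^2.
  feat = np.random.randn(56, 56, 96)
  windows = window_partition(cyclic_shift(feat, M=7), M=7)   # (64, 49, 96)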
Comparisons
  • CNN vs ViT: CNN: local early, growing RF, win on small data. ViT: mixed local/global early, needs massive data, scales better.
  • ViT vs Swin: ViT: global attention, O(N²). Swin: local windows + shift, O(N) per layer, hierarchical downsampling for dense tasks.
Keywords
patch · CLS · ViT-B/16 · PE · Swin · shifted window · inductive bias · JFT

Unit 8 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
One-liners
  • SimCLR: 2 augmentations, big batch, NT-Xent.
  • MoCo: queue of past keys + momentum encoder (small batch OK).
  • BYOL: no negatives — predictor head + stop-gradient + momentum target.
  • CLIP: image-text contrastive on 400 M pairs; zero-shot via 'a photo of a {class}'.
Formulas
  • L = −log( e^{s⁺/τ} / Σ_j e^{s_j/τ} )
  • θ_k ← m θ_k + (1−m) θ_q (MoCo, m ≈ 0.999)
  • CLIP loss = ½(L_{i→t} + L_{t→i}) (symmetric CE)
Definitions
  • Positive pair = two augmented views of same image.
  • Projection head g = MLP between encoder and loss; discarded downstream.
  • Zero-shot = no labelled examples of target classes seen.
Algorithms
  • SimCLR step: augment ×2 → encoder → projection → NT-Xent over 2N − 2 negatives.
  • CLIP step: encode images → encode texts → N×N cosine → symmetric softmax CE.
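Code sketch
  A minimal NumPy sketch of the symmetric contrastive loss above; the temperature value and embedding shapes are illustrative.

  import numpy as np

  def clip_loss(img_emb, txt_emb, temperature=0.07):
      # Symmetric InfoNCE over an N x N cosine-similarity matrix;
      # the matched pair for row i is column i.
      img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
      txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
      logits = img @ txt.T / temperature                    # (N, N)

      def ce_diag(l):                                       # CE with labels = arange(N)
          l = l - l.max(axis=1, keepdims=True)
          log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
          return -np.mean(np.diag(log_p))

      return 0.5 * (ce_diag(logits) + ce_diag(logits.T))    # image->text + text->image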
Comparisons
  • SimCLR vs MoCo: SimCLR: big batch, in-batch negatives. MoCo: queue + momentum encoder, decouples batch size from #negatives.
  • MoCo / SimCLR vs BYOL: MoCo/SimCLR need negatives. BYOL doesn't — avoids collapse via predictor + stop-gradient + momentum target.
Keywords
SimCLR · InfoNCE · NT-Xent · MoCo · momentum encoder · BYOL · stop-gradient · CLIP · zero-shot

Unit 9 — SSL: DINO, MAE, JEPA

DINO, MAE, JEPA — Modern SSL Beyond Contrastive
One-liners
  • DINO = self-DIstillation NO labels. EMA teacher; centering + sharpening; multi-crop.
  • DINO output dim = 65,536 (large to prevent collapse-to-one-dim).
  • MAE: 75% mask, encoder sees only visible, small decoder reconstructs pixels.
  • JEPA: predict TARGET REPRESENTATIONS (not pixels). Latent-space L2.
Formulas
  • DINO: L = −Σ p_t · log p_s. p_t = softmax((g_t − c)/τ_t).
  • EMA: θ_t ← λ θ_t + (1−λ) θ_s, λ ≈ 0.996 → 1.
  • MAE: L = mean squared error on masked patches only.
Definitions
  • Self-distillation = student matches teacher's distribution.
  • Centering = subtract running mean (anti-collapse #1).
  • Sharpening = low teacher τ (anti-collapse #2).
  • Registers = scratchpad tokens, no position, clean attention maps.
Algorithms
  • DINO step: augment image into multi-crop set → student over all, teacher over global → softmax with centering+sharpening → CE → backprop student only → EMA update teacher.
  • MAE step: patchify → mask 75% → encoder on visible → insert mask tokens → decoder → MSE on masked patches.
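Code sketch
  A minimal NumPy sketch of the MAE random-masking step above for one image's patch tokens; the helper name and the (visible, mask, ids_restore) return convention are assumptions mirroring common implementations.

  import numpy as np

  def random_masking(tokens, mask_ratio=0.75, seed=0):
      # Keep a random 25% of patch tokens; return the visible subset plus the
      # permutation needed to scatter decoder outputs back into place.
      rng = np.random.default_rng(seed)
      N, D = tokens.shape
      n_keep = int(N * (1 - mask_ratio))
      ids_shuffle = np.argsort(rng.random(N))    # random permutation of patches
      ids_restore = np.argsort(ids_shuffle)      # inverse permutation
      visible = tokens[ids_shuffle[:n_keep]]     # the encoder only sees these
      mask = np.ones(N)                          # 1 = masked (reconstruction loss here)
      mask[ids_shuffle[:n_keep]] = 0
      return visible, mask, ids_restore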
Comparisons
  • DINO vs BYOL: BYOL outputs feature vectors, MSE between predictor and target. DINO outputs distributions, cross-entropy with centering+sharpening.
  • MAE vs JEPA: MAE predicts pixels (wastes capacity on texture). JEPA predicts target features (semantic level).
Keywords
DINO · EMA teacher · centering · sharpening · multi-crop · MAE · 75% mask · JEPA · registers · I-JEPA

Unit 10 — Transformer Advances (ViT-5 era)

Modern Transformer Upgrades
One-liners
  • 7 modern upgrades: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash.
  • PreNorm: x + Sublayer(LN(x)). Unbroken residual.
  • RMSNorm = LayerNorm without mean subtraction. Cheaper.
  • RoPE: rotation matrix on Q, K. Encodes RELATIVE position.
  • Flash Attention: tile + online softmax → never materialise N×N in HBM.
  • GQA: heads share K, V per group → KV-cache shrinks h/G×.
Formulas
  • PreNorm: y = x + Sublayer(LN(x))
  • RMSNorm: y = γ · x / RMS(x)
  • RoPE: rotate (q_{2i}, q_{2i+1}) by m · θᵢ
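Code sketch
  A minimal NumPy sketch of the RMSNorm and RoPE formulas above; the base 10000 and the (even, odd) channel pairing follow the usual convention, other names are illustrative.

  import numpy as np

  def rms_norm(x, gamma, eps=1e-6):
      # Scale only, no mean subtraction.
      return gamma * x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

  def rope(x, pos, base=10000.0):
      # Rotate consecutive (even, odd) channel pairs of x (seq, d) by pos * theta_i,
      # so q_m . k_n ends up depending only on the relative offset m - n.
      d = x.shape[-1]
      theta = base ** (-np.arange(0, d, 2) / d)             # (d/2,)
      ang = pos[:, None] * theta[None, :]                   # (seq, d/2)
      cos, sin = np.cos(ang), np.sin(ang)
      x_even, x_odd = x[..., 0::2], x[..., 1::2]
      out = np.empty_like(x)
      out[..., 0::2] = x_even * cos - x_odd * sin
      out[..., 1::2] = x_even * sin + x_odd * cos
      return out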
Definitions
  • Registers = global scratchpad tokens, no position.
  • LayerScale γ_l ≈ 1e-4 at init.
  • KV-cache: store past K, V for autoregressive.
Algorithms
  • Flash Attention forward: tile (Q_i, K_j, V_j) blocks → compute partial softmax → accumulate via online-softmax stats → next block (see Code sketch below).
  • Decoder with KV-cache: per step, compute new Q only; append new K, V; attend new Q against cached K, V.
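Code sketch
  A minimal NumPy sketch of the online-softmax accumulation behind the tiling above, shown for a single query row; block size and names are illustrative, and the real kernel fuses this with the K/V tiling on-chip instead of looping in Python.

  import numpy as np

  def online_softmax_update(state, scores_blk, values_blk):
      # Fold one score chunk and its values into running softmax statistics
      # (running max, running normaliser, unnormalised output).
      m, l, acc = state
      m_new = max(m, scores_blk.max())
      scale = np.exp(m - m_new)                 # rescale what was accumulated so far
      p = np.exp(scores_blk - m_new)
      return m_new, l * scale + p.sum(), acc * scale + p @ values_blk

  def attention_one_query(scores, values, block=4):
      # Exact softmax(scores) @ values computed block by block, never holding
      # all the exponentials at once.
      state = (-np.inf, 0.0, np.zeros(values.shape[-1]))
      for i in range(0, len(scores), block):
          state = online_softmax_update(state, scores[i:i + block], values[i:i + block])
      m, l, acc = state
      return acc / l                            # matches softmax(scores) @ values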
Comparisons
  • PostNorm vs PreNorm: PostNorm needs careful warmup; PreNorm has unbroken residual stream and trains stably deep.
  • LayerNorm vs RMSNorm: LayerNorm centers + scales; RMSNorm only scales. RMSNorm cheaper; mean subtraction empirically negligible.
  • MHA vs GQA / MQA: MHA: K, V per head. MQA: K, V shared across all heads (smallest cache, quality loss). GQA: per-group share — sweet spot.
Keywords
PreNorm · RMSNorm · LayerScale · QK-Norm · Registers · RoPE · Flash Attention · GQA · KV-cache

Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

VLM Architecture — Encoders, Connectors, Positional Encoding
One-liners
  • VLM three pillars: Vision Encoder + Connector + LLM.
  • Connector = ONLY random-init piece in stitched VLMs.
  • Prefix-LM: bidirectional on image+prompt, causal on answer, loss on answer.
  • SigLIP: pairwise sigmoid BCE (no batch-wide softmax).
  • Qwen2-VL: dynamic resolution + 2-layer MLP connector + M-RoPE.
  • Gemma 4: native multimodal (no connector).
Formulas
  • SigLIP: −1/n² Σᵢⱼ [y log σ(z) + (1−y) log(1−σ(z))]
  • M-RoPE position: (t, r, c)
Definitions
  • Modality gap = discrete vocab vs continuous pixels.
  • Prefix-LM = mixed bidirectional/causal mask.
  • Dynamic resolution = native AR + clamped token count.
Algorithms
  • PaliGemma forward: SigLIP(image) → linear → concat with text tokens → Gemma decoder with prefix-LM mask (see Code sketch below) → next-token loss on suffix.
  • M-RoPE: split head dim into 3; rotate Q, K by (m_t·θ, m_r·θ, m_c·θ) respectively.
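Code sketch
  A minimal NumPy sketch of the prefix-LM attention mask used in the forward pass above (True = may attend); token counts are illustrative.

  import numpy as np

  def prefix_lm_mask(n_prefix, n_suffix):
      # Prefix tokens (image + prompt) attend bidirectionally over the prefix;
      # suffix (answer) tokens see the full prefix and are causal among themselves.
      n = n_prefix + n_suffix
      mask = np.zeros((n, n), dtype=bool)
      mask[:, :n_prefix] = True                 # everyone sees the whole prefix
      suffix = np.arange(n_prefix, n)
      mask[suffix[:, None], suffix[None, :]] = np.tril(
          np.ones((n_suffix, n_suffix), dtype=bool))
      return mask

  print(prefix_lm_mask(3, 3).astype(int))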
Comparisons
  • CLIP vs SigLIP: CLIP: softmax over batch → needs sync. SigLIP: pairwise sigmoid BCE → independent per pair → scales arbitrarily large batches.
  • PaliGemma 224² fixed vs Qwen2-VL native AR: PaliGemma loses fine detail. Qwen2-VL preserves AR + resolution; tile count clamped by user.
  • Stitched (PaliGemma) vs Native multimodal (Gemma 4): Stitched: connector bottleneck reconciles two pretrained latent spaces. Native: joint training from scratch, no connector.
Keywords
VLM · SigLIP · PaliGemma · Qwen2-VL · M-RoPE · Prefix-LM · connector · dynamic resolution · Gemma 4

Unit 12 — Video Understanding

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
One-liners
  • Video ≠ images × T (temporal pattern matters).
  • I3D: inflate 2D filters to 3D, /K_T for activation scale.
  • Two-Stream: spatial (RGB) + temporal (stacked optical flow, 2L ch).
  • SlowFast: low-fps semantic + high-fps motion + lateral fusion.
  • ViViT: uniform frame sample OR tubelet embedding.
  • TimeSformer winner: divided space-time attention.
Formulas
  • 3D conv filter: C_out × C_in × K_T × K_H × K_W
  • I3D inflate: W_3D = repeat(W_2D along time, K_T) / K_T (a static video then gives the same activations as the 2D net)
  • Joint attn cost: O((TN)²·d). Divided ≈ O(T²N + TN²)·d.
Definitions
  • Optical flow = 2D motion field per pixel.
  • Tubelet = 3D patch (t × h × w).
  • Divided attn = factorise temporal then spatial.
Algorithms
  • Two-Stream forward: spatial CNN on RGB + temporal CNN on L flow frames → late fusion of softmax.
  • TimeSformer divided block: temporal MSA → spatial MSA → MLP; each per block.
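Code sketch
  A toy NumPy sketch of divided space-time attention from the block above: attend over time for each patch position, then over space within each frame. Q/K/V projections, multi-head, and residuals are omitted; shapes are illustrative.

  import numpy as np

  def softmax(x):
      x = x - x.max(axis=-1, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=-1, keepdims=True)

  def self_attn(x):
      # Toy single-head self-attention over the second-to-last axis of (..., n, d).
      d = x.shape[-1]
      scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
      return softmax(scores) @ x

  def divided_space_time(tokens):
      # tokens: (T, N, d). Temporal attention mixes the same patch across frames,
      # then spatial attention mixes patches within each frame:
      # ~O(T^2 N + T N^2) instead of O((TN)^2) for joint attention.
      x = tokens.transpose(1, 0, 2)          # (N, T, d): attend over time per patch
      x = self_attn(x).transpose(1, 0, 2)    # back to (T, N, d)
      return self_attn(x)                    # attend over space per frame

  out = divided_space_time(np.random.randn(8, 196, 64))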
Comparisons
  • 2D CNN + LSTM vs I3D: 2D+LSTM: per-frame spatial features then RNN aggregator. I3D: native spatio-temporal kernels, end-to-end on space and time.
  • Joint space-time vs Divided space-time: Joint: O((TN)²) — prohibitive. Divided: separate temporal then spatial; near-linear; best accuracy/efficiency.
  • Slow path vs Fast path (SlowFast): Slow: low fps, high channels, expensive per frame (semantics). Fast: high fps, low channels, cheap (motion).
Keywords
Kinetics · I3D · C3D · Two-Stream · optical flow · SlowFast · ViViT · TimeSformer · tubelet · divided attention