Saral Shiksha Yojna

Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Cheatsheet

Ultra-condensed. Revise a chapter in minutes.

Unit 1 — Introduction & Foundations

Foundations of Computer Vision — Marr, Three Rs, Gestalt, Why CV is Hard
One-liners
  • Marr: 'to know what is where, by looking.'
  • Three Rs = Reorganisation + Recognition + Reconstruction (Malik).
  • >50% of the brain is devoted to vision → vision is hard (the computation is huge), not easy.
  • Gestalt 6: Proximity, Similarity, Closure, Continuation, Common Fate, Figure-Ground.
  • Bitter Lesson: scale + general methods beat hand-engineering.
  • 1966 Summer Vision Project — CV is NOT a summer task.
Formulas
  • Vision = WHAT × WHERE (+ when, why, how many)
  • Three Rs ⇒ Reorganisation ∪ Recognition ∪ Reconstruction
Definitions
  • Marr's definition: 'to know what is where, by looking.'
  • Affordance (Gibson): action possibility an object offers an agent.
  • Inverse problem: 2D image → 3D scene is one-to-many → needs priors.
  • Semantic gap: distance from pixels to concepts.
Algorithms
  • Three-R decomposition: identify what each task needs (label, group, geometry).
  • Gestalt grouping recipe: proximity → similarity → closure → continuation.
Comparisons
  • Appearance class vs Affordance class: Appearance: defined by what it looks like (brittle to intra-class variation). Affordance: defined by what it lets you do (robust but requires action reasoning).
  • Yes — mimic humans vs No — don't be limited: Inspiration (receptive fields, attention) yes. Constraint (human illusions, biological limits) no.
Keywords
Marr, Three Rs, Gestalt, affordance, Bitter Lesson, semantic gap, inverse problem, Cambrian, Summer Vision Project, invisible gorilla

Unit 2 — Digital Image Processing Recap

DIP — Filters, Histograms, Fourier/DCT, Morphology, Geometric Ops, Hough, Templates
One-liners
  • Smoothing kernels sum to 1; derivative kernels sum to 0.
  • Edge ⊥ gradient.
  • Gaussian is SEPARABLE → 2K multiplies per pixel, not K².
  • Median kills salt-and-pepper; mean smears it.
  • Bilateral = spatial Gaussian × intensity Gaussian → edge-preserving.
  • JPEG uses DCT (NOT DFT) for energy compaction + no boundary discontinuity.
  • Erosion = MIN (shrink); Dilation = MAX (grow). Opening kills noise; closing fills holes.
  • Boundary = A − (A ⊖ B).
  • Otsu MAXIMISES between-class variance.
  • Inverse warping ✓; forward has holes.
  • Hough polar form avoids vertical-line infinity.
Formulas
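  • Negative: s = (L − 1) − r
  • Log: s = c · log(1 + r)
  • Gamma: s = c · r^γ
  • Conv: (f ∗ h)(x, y) = Σ_i Σ_j f(x − i, y − j) · h(i, j)
  • Gaussian: G(x, y) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))
  • Hist eq: s_k = (L − 1) · Σ_{j≤k} p(r_j)
  • Otsu: σ_B²(t) = ω₀(t) · ω₁(t) · (μ₀(t) − μ₁(t))²
  • Hough: ρ = x cos θ + y sin θ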
Definitions
  • Sampling = spatial discretisation; Quantisation = intensity discretisation.
  • Convolution theorem: spatial conv ↔ frequency multiplication.
  • Bilateral filter = spatial × range Gaussian — edge-preserving.
  • Otsu = max between-class variance threshold.
  • Erosion ⊖ / Dilation ⊕ duals; Opening / Closing idempotent.
  • DoF ladder: translation 2, rigid 3, similarity 4, affine 6, projective 8.
Algorithms
  • Histogram equalisation: histogram → PDF → CDF → map s = round((L − 1) · CDF(r)).
  • Otsu: sweep threshold t, compute between-class variance σ_B²(t), pick the argmax.
  • JPEG: RGB → YCbCr → 4:2:0 chroma sub → 8×8 DCT → quantise → zigzag → RLE+Huffman.
  • Hough lines: 2D accumulator over (ρ, θ); vote per edge pixel; peaks = lines.
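The Otsu recipe above (sweep the threshold, score each split by between-class variance, keep the argmax) can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `otsu_threshold` and the toy 8-level histogram are assumptions made for the example.

```python
def otsu_threshold(hist):
    """Return the threshold t that maximises between-class variance.

    hist[i] is the pixel count at grey level i; pixels with level <= t
    form class 0 and the rest form class 1.
    """
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # pixel count in class 0
    sum0 = 0    # intensity sum in class 0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, nothing to score
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var_between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Toy bimodal histogram over 8 grey levels (illustrative data only).
hist = [10, 12, 8, 0, 0, 9, 11, 10]
threshold = otsu_threshold(hist)
```

On this toy histogram the chosen threshold falls in the empty valley between the two modes; ties are resolved in favour of the lowest level.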
Comparisons
  • DFT vs DCT: DFT: complex, periodic assumption → boundary discontinuity → high-freq energy. DCT: real, mirror-extend → no discontinuity → better energy compaction → used in JPEG.
  • Forward warping vs Inverse warping: Forward: iterate over source → holes + overlaps. Inverse: iterate over destination → every pixel filled, needs interpolation. Always inverse.
  • Mean filter vs Gaussian filter: Mean: uniform weights → boxy. Gaussian: distance-weighted, separable, no ringing, controls spread.
  • Mean filter vs Median filter: Mean: linear, smears outliers. Median: rank statistic, kills salt-and-pepper.
  • Affine vs Homography: Affine 6 DoF preserves parallel lines. Homography 8 DoF preserves only straight lines (perspective).
Keywords
sampling, quantisation, Gaussian, Sobel, Laplacian, LoG, median, bilateral, histogram, equalisation, DFT, DCT, JPEG, Otsu, erosion, dilation, opening, closing, Hough, template matching, affine, homography, warping, interpolation

Unit 3 — Machine Learning Recap

ML — Logistic, NN+Backprop, Ensembles, Density, RNN, Metrics, kNN, Regression, PCA/SVD, Clustering
One-liners
  • Why not MSE for classification? Non-convex + vanishing grad with sigmoid.
  • Logistic SGD: w ← w − η(ŷ − y)x, with ŷ = σ(wᵀx).
  • Why not zero init? Symmetry — all neurons identical forever.
  • Kaiming init: factor of 2 to compensate for ReLU half-zeroing.
  • BatchNorm at inference uses RUNNING mean/var (not batch).
  • Bagging reduces variance (RF). Boosting reduces bias (AdaBoost, Viola-Jones).
  • GMM trained via EM — soft responsibilities + weighted MLE updates.
  • LSTM additive cell state = constant error carousel; clip for explosion.
  • Precision when FP costly; Recall when FN costly. F1 is harmonic mean.
  • ROC for balanced, PR for imbalanced data.
  • PCA = eigenvectors of covariance; SVD is numerically stable; LoRA = low-rank update.
  • k-means issues: hard assignment, spherical bias, init-sensitive, outlier-sensitive. Use k-means++ / GMM / k-medoids.
Formulas
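  • Sigmoid σ(z) = 1/(1 + e^{−z}); softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
  • CE: L = −Σ_i y_i log ŷ_i
  • Normal eqs: w = (XᵀX)⁻¹Xᵀy
  • BN: x̂ = (x − μ_B)/√(σ_B² + ε); y = γ·x̂ + β
  • LSTM: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t; h_t = o_t ⊙ tanh(c_t)
  • P/R/F1: P = TP/(TP + FP), R = TP/(TP + FN), F1 = 2PR/(P + R)
  • PCA: top-k eigenvectors of C = XᵀX/n (X centred); SVD X = UΣVᵀ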
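A short worked example for the precision / recall / F1 definitions, P = TP/(TP + FP), R = TP/(TP + FN), F1 = 2PR/(P + R). The helper name `precision_recall_f1` and the counts are made up for illustration; returning 0 for an empty denominator is a convention chosen here, not something the notes specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 10 predicted positives (8 correct), 16 true positives in the data.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here P = 0.8 and R = 0.5, and the harmonic mean (about 0.615) sits closer to the smaller of the two, which is why F1 punishes a lopsided trade-off.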
Definitions
  • Cross-entropy = −Σ_i y_i log ŷ_i; gradient on logits (with softmax) = ŷ − y.
  • BatchNorm: normalise per-batch + learnable γ, β. Inference uses running stats.
  • Bagging vs Boosting: variance reduction (parallel) vs bias reduction (sequential).
  • EM: E-step responsibilities, M-step weighted MLE; monotone log-likelihood improvement.
  • PCA: orthogonal directions of max variance; via eigendecomp of covariance.
  • k-means++: sample each new centre with probability proportional to its squared distance from the nearest existing centre; a more robust init.
Algorithms
  • EM for GMM: γ_ik = π_k N(x|μ_k,Σ_k)/Σ_j π_j N(x|μ_j,Σ_j); update μ, Σ, π via weighted MLE; repeat.
  • k-means: init → assign nearest → recompute centres → repeat.
  • Backprop: forward pass stores activations; backward pass walks chain rule from loss.
  • PCA: centre → covariance → eigendecompose → top-k → project.
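The k-means loop (init → assign nearest → recompute centres → repeat) can be sketched as follows. This is a minimal version for 2D points with hand-picked initial centres; the function name, the toy data and the choice to leave an empty cluster's centre unchanged are assumptions for the example.

```python
def kmeans(points, centres, iters=10):
    """Plain k-means (Lloyd's algorithm) on 2D points with given initial centres."""
    for _ in range(iters):
        # Assignment step: index of the nearest centre for each point.
        labels = [
            min(range(len(centres)),
                key=lambda k: (p[0] - centres[k][0]) ** 2 + (p[1] - centres[k][1]) ** 2)
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centres.append(centres[k])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged: centres stopped moving
        centres = new_centres
    return centres, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centres, labels = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

The hard `min` in the assignment step is exactly the "hard assignment" weakness listed among the k-means issues; a GMM would replace it with soft responsibilities.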
Comparisons
  • Bagging vs Boosting: Bagging: parallel, independent, reduces VARIANCE (RF). Boosting: sequential, reweight errors, reduces BIAS (AdaBoost, Viola-Jones).
  • L1 regularisation vs L2 regularisation: L1 (λ‖w‖₁): exact zeros, feature selection. L2 (λ‖w‖₂²): smooth shrinkage, no exact zeros.
  • ROC curve vs PR curve: ROC: TPR vs FPR, balanced data. PR: Precision vs Recall, preferred for imbalanced / rare positives.
  • LSTM vs GRU: LSTM: 3 gates + cell state. GRU: 2 gates, no separate cell state. GRU often comparable with fewer params.
  • PCA vs SVD: PCA = eigendecomp of the covariance XᵀX/n. SVD = direct decomposition of X. Same answer; SVD is more numerically stable.
  • k-means vs GMM: k-means: hard assignment, spherical clusters. GMM: soft (responsibilities), covariance-shaped clusters; EM-trained.
Keywords
sigmoid, softmax, cross-entropy, MLP, backprop, BatchNorm, dropout, Kaiming, bagging, boosting, AdaBoost, GMM, EM, LSTM, GRU, BPTT, Precision, Recall, F1, mAP, ROC, PR curve, kNN, normal equations, L1, L2, PCA, SVD, LoRA, k-means, k-means++, hierarchical clustering, dendrogram

Unit 4 — Convolutional Neural Networks (CNNs)

CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field
One-liners
  • Conv params = (K²·C_in + 1)·C_out, independent of input H×W.
  • Same pad (odd K): p = (K − 1)/2.
  • RF of L stacked stride-1 K×K convs: 1 + L(K − 1).
  • Two 3×3 < one 5×5 in params (18C² vs 25C²); extra ReLU; same RF.
  • 1×1 conv: bottleneck, mixer, cheap, plus non-linearity.
  • Pool has NO parameters.
  • BatchNorm: per-channel (γ, β of length C, not per-spatial). Inference uses RUNNING stats.
  • CNN feature maps are translation EQUIvariant; after pool+FC, INvariant. NOT rotation equivariant.
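  • ResNet: y = x + F(x); gradient ∂y/∂x = I + ∂F/∂x ⇒ no vanishing.
  • Depthwise-separable: ~8–9× cheaper (3×3 kernels).
  • Compound scaling (EfficientNet): depth d = α^φ, width w = β^φ, resolution r = γ^φ, with α·β²·γ² ≈ 2.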
  • I3D inflates 2D weights, divides by K_T.
Formulas
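  • Output: O = ⌊(W − K + 2P)/S⌋ + 1
  • Params: (K²·C_in + 1)·C_out
  • RF: 1 + L(K − 1) for L stacked s1 K×K layers
  • Dilated effective K: K_eff = K + (K − 1)(d − 1)
  • Depthwise-sep ratio: 1/C_out + 1/K²
  • BN: x̂ = (x − μ)/√(σ² + ε); y = γ·x̂ + β
  • Residual: y = F(x) + x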
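The shape and parameter formulas are easy to sanity-check with a few lines of arithmetic. The helpers below are a sketch using the standard forms (output size ⌊(W − K + 2P)/S⌋ + 1, params (K²·C_in + 1)·C_out, stacked RF 1 + L(K − 1), dilated K + (K − 1)(d − 1), depthwise-separable ratio 1/C_out + 1/K²); all function names and the example layer sizes are illustrative assumptions.

```python
def conv_output_size(w, k, p, s):
    """Spatial output size: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

def conv_params(k, c_in, c_out, bias=True):
    """Standard conv parameters: (K*K*C_in + 1) * C_out when a bias is used."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def stacked_rf(k, layers):
    """Receptive field of `layers` stacked stride-1 KxK convs."""
    return 1 + layers * (k - 1)

def dilated_kernel(k, d):
    """Effective kernel size of a KxK conv with dilation d."""
    return k + (k - 1) * (d - 1)

def depthwise_sep_ratio(k, c_out):
    """Cost of depthwise-separable conv relative to a standard conv."""
    return 1 / c_out + 1 / (k * k)

out = conv_output_size(224, 7, 3, 2)              # 7x7, stride 2, pad 3 stem
two_3x3 = 2 * conv_params(3, 64, 64, bias=False)  # 18 * C^2 with C = 64
one_5x5 = conv_params(5, 64, 64, bias=False)      # 25 * C^2 with C = 64
same_rf = stacked_rf(3, 2) == stacked_rf(5, 1)    # both cover 5x5 pixels
ratio = depthwise_sep_ratio(3, 256)               # roughly 1/9
```

With C = 64 the two stacked 3×3 layers need 73,728 weights against 102,400 for one 5×5, while covering the same 5×5 receptive field.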
Definitions
  • Translation equivariance: f(T(x)) = T(f(x)). Invariance: f(T(x)) = f(x).
  • Receptive field: input pixels that influence an output.
  • 1×1 conv: per-pixel cross-channel MLP; bottleneck role in Inception/ResNet.
  • Dilated conv: gaps in kernel ⇒ larger RF without more params.
  • Depthwise-separable: depthwise + pointwise; MobileNet building block.
  • Compound scaling: scale depth, width, resolution together with one coefficient.
Algorithms
  • Stack 3×3 convs: (stride 1, pad 1) preserves H×W; insert stride-2 pool/conv to halve.
  • Inception module: 1×1 / 3×3 / 5×5 / 3×3-pool branches concatenated with 1×1 bottlenecks before 3×3/5×5.
  • ResNet block: y = ReLU(F(x) + x), F = conv → BN → ReLU → conv → BN, with skip-add.
  • SE block: GAP → FC → ReLU → FC → sigmoid → channel-wise multiply.
  • I3D inflation: W_3D = W_2D / K_T, repeated K_T times along the time axis.
Comparisons
  • Conv layer vs FC layer: Conv: weight-shared, translation equivariant, fewer params, preserves spatial structure. FC: per-pixel weights, breaks spatial structure.
  • Max pool vs Avg pool / GAP: Max: argmax routing, translation-robust, dominant in classification. Avg: distribute equally, captures global context; GAP replaces FC heads.
  • ResNet (add) vs DenseNet (concat): ResNet skip-ADDS to input — same channel count. DenseNet CONCATENATES — channel count grows. Concat preserves more information; add is cheaper.
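  • Standard conv vs Depthwise-separable: Standard: K²·C_in·C_out. Depthwise-sep: K²·C_in + C_in·C_out. Ratio 1/C_out + 1/K² (≈ 8–9× cheaper for 3×3).
  • Xavier init vs Kaiming init: Xavier: Var(w) = 1/n_in (or 2/(n_in + n_out)) for sigmoid/tanh. Kaiming: Var(w) = 2/n_in for ReLU; factor 2 compensates for half-zeroing.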
  • C3D vs I3D: C3D: 3D conv from scratch, expensive. I3D: inflate 2D pretrained kernels along time, divide by K_T, so ImageNet pretraining comes for free.
Keywords
convolution, stride, padding, receptive field, 1×1 conv, pooling, GAP, dilated, depthwise-separable, BatchNorm, equivariance, invariance, LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, SENet, MobileNet, EfficientNet, compound scaling, C3D, I3D, SlowFast, WaveNet

Unit 5 — Object Detection

Object Detection — R-CNN family, YOLO, NMS, mAP
One-liners
  • R-CNN slow → Fast (RoI Pool) → Faster (RPN) → YOLO (single shot).
  • RPN: 9 anchors per location (3 scales × 3 ratios).
  • YOLO loss: 5 terms (center, size with √, obj-conf, noobj-conf, class). λ_coord = 5, λ_noobj = 0.5.
  • NMS is per class. Soft-NMS multiplies by IoU-decay instead of zeroing.
Formulas
Definitions
  • Anchor = predefined box prior; predictions are offsets.
  • RPN = shared-backbone proposal network (Faster R-CNN).
  • mAP = mean of per-class AP across classes.
Algorithms
  • NMS: sort by score; keep top; suppress IoU > τ; repeat. PER CLASS.
  • mAP: sort detections; mark TP/FP at IoU ≥ 0.5; compute cumulative P, R; AP = area under PR curve; mean over classes.
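The NMS loop (sort by score, keep the top box, suppress everything with IoU above τ, repeat) can be sketched in plain Python. The corner format (x1, y1, x2, y2), the function names and the default τ = 0.5 are assumptions for this example; in a detector the function would be called once per class.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for ONE class; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The first two boxes overlap with IoU 81/119 ≈ 0.68, so the lower-scoring one is suppressed while the distant third box survives.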
Comparisons
  • RoI Pool vs RoI Align: Pool quantises twice (RoI boundaries to the feature grid, then bin boundaries inside the RoI); Align uses bilinear interpolation at exact float coords. Mask-AP gap on small objects.
  • Two-stage (Faster R-CNN) vs One-stage (YOLO/RetinaNet): Two-stage accurate but slower (proposals + classification); one-stage real-time, historically lower AP — fixed by Focal Loss + FPN.
Keywords
IoU, anchor, RPN, NMS, mAP, Focal Loss, GIoU, Soft-NMS, Selective Search

Unit 6 — Dense Prediction: Segmentation + Depth

Dense Prediction — Segmentation & Monocular Depth
One-liners
  • Semantic = class only. Instance = things w/ ids. Panoptic = both.
  • U-Net skips CONCAT, ResNet skips ADD.
  • RoI Align: bilinear interp, no quantization. Critical for Mask R-CNN.
  • MiDaS depth is RELATIVE (scale-shift invariant loss).
Formulas
  • Dice = 2|A∩B|/(|A|+|B|) = 2·IoU/(1+IoU)
  • mIoU = mean over classes of IoU per class
  • Atrous conv: effective kernel K + (K − 1)(r − 1) at rate r, so RF grows ≈ r× at fixed params
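The identity Dice = 2·IoU/(1 + IoU) can be checked numerically. The sketch below represents masks as sets of pixel coordinates, which is a convenience chosen for the example; it assumes both masks are non-empty.

```python
def dice_and_iou(a, b):
    """Dice and IoU for two non-empty masks given as sets of pixel coordinates."""
    inter = len(a & b)
    union = len(a | b)
    dice = 2 * inter / (len(a) + len(b))
    iou = inter / union
    return dice, iou

pred = {(0, 0), (0, 1), (1, 0), (1, 1)}   # 4 predicted pixels
gt = {(0, 1), (1, 1), (2, 1)}             # 3 ground-truth pixels
dice, iou = dice_and_iou(pred, gt)
```

Here the intersection is 2 pixels and the union 5, so IoU = 0.4 and Dice = 4/7; Dice is never lower than IoU for the same pair of masks.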
Definitions
  • Transposed conv = learnable upsampling.
  • RoI Align = bilinear interpolation; sub-pixel accurate.
  • Dilated conv = gaps between kernel taps.
Algorithms
  • Mask R-CNN head: per-RoI FCN → 28×28 mask per class; BCE on correct class only.
  • FCN-8s decoder: upsample deep + add pool3, pool4 skips, upsample → output.
Comparisons
  • Dice loss vs Cross-entropy: Dice handles class imbalance natively; CE saturates when one class dominates.
  • RoI Pool vs RoI Align: Pool quantises twice; Align bilinearly samples → sub-pixel accuracy → big mask AP gain.
Keywords
FCN, U-Net, Mask R-CNN, RoI Align, atrous, Dice, mIoU, panoptic, MiDaS

Unit 7 — Pose Estimation

Pose Estimation — Heatmaps, CPM, OpenPose, SMPL
One-liners
  • Heatmap regression > coordinate regression (preserves spatial uncertainty).
  • CPM: multi-stage refinement + intermediate supervision (kills vanishing grads).
  • Top-down accurate, scales O(P). Bottom-up scales O(1), grouping is hard.
  • OpenPose channels = K + 2L. e.g., 18 + 38 = 56.
  • SMPL: β = 10 shape, θ = 72 pose. Mesh = 6890 vertices.
Formulas
  • PAF score = ∫ PAF · limb_dir du from A to B
  • PCK@α correct ⟺ dist ≤ α · d_ref
Definitions
  • Heatmap regression = dense per-pixel Gaussian target.
  • PAF = vector field encoding limb direction.
  • SMPL = parametric body model (β shape + θ pose).
Algorithms
  • OpenPose grouping: candidate keypoints from heatmap argmax → score every pair via PAF line integral → Hungarian matching per limb.
  • Heatmap argmax + parabola fit: fit y = ax² + bx + c to (h_{x-1}, h_x, h_{x+1}); peak at x − b/(2a).
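The parabola-fit refinement can be sketched in 1D. The parabola is fitted in offset coordinates (−1, 0, +1 around the integer argmax), which is why the refined peak is x − b/(2a). The function name `refine_peak` and the synthetic heatmap slice are assumptions for the example.

```python
def refine_peak(h, x):
    """Sub-pixel peak near integer index x of a 1D heatmap slice h."""
    left, mid, right = h[x - 1], h[x], h[x + 1]
    # Parabola a*u^2 + b*u + c through offsets u = -1, 0, +1.
    a = (left + right) / 2 - mid
    b = (right - left) / 2
    if a == 0:
        return float(x)  # flat neighbourhood: keep the integer peak
    return x - b / (2 * a)

# Samples of -(u - 2.3)^2 at u = 0..4; the true peak is at 2.3.
h = [-(u - 2.3) ** 2 for u in range(5)]
x = max(range(1, len(h) - 1), key=lambda i: h[i])  # integer argmax (interior)
peak = refine_peak(h, x)
```

For a 2D heatmap the same fit is usually applied independently along each axis; that extension is not shown here.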
Comparisons
  • Top-down pose vs Bottom-up pose: Top-down: O(P) runtime, more accurate, fails when detector misses. Bottom-up: O(1) in P, robust to misses, grouping ambiguous in crowds.
  • Coordinate regression vs Heatmap regression: Coordinate: no uncertainty, loses spatial structure. Heatmap: pixel-dense, expresses ambiguity, sub-pixel via parabola fit.
Keywords
heatmap, CPM, PAF, OpenPose, Hungarian, SMPL, HMR, PCKh, top-down, bottom-up

Unit 8 — 3D Data (PointNet, DGCNN, MeshCNN)

3D Representations — VoxNet, PointNet, PointNet++, DGCNN, MeshCNN
One-liners
  • Point clouds: unstructured, irregular density, unordered. CNNs break on all three.
  • VoxNet: occupancy grid → 3D CNN. Memory O(N³); resolution cap ~32³.
  • PointNet = shared MLP + MAX pool. Symmetric ⇒ permutation invariant.
  • PointNet++ = hierarchical PointNet (FPS + ball query).
  • DGCNN = EdgeConv over dynamic kNN in feature space.
  • MeshCNN = operate on edges; pool = edge collapse.
Formulas
  • PointNet(P) = γ(max_p h(p))
  • EdgeConv: e_ij = h(x_i, x_j − x_i); x_i' = max_j e_ij
  • VoxNet memory: O(N³) voxels
Definitions
  • Symmetric function = permutation-invariant.
  • Critical points = inputs that survive max-pool.
  • Dynamic graph = kNN in feature space, rebuilt per layer.
Algorithms
  • FPS (farthest-point sampling): pick first point; iteratively add the point farthest from the chosen set.
  • PointNet++ set-abstraction: FPS centroids → ball query neighbourhoods → PointNet locally → upsample for segmentation.
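Farthest-point sampling, as described above, can be sketched in a few lines. This version keeps a running "distance to the nearest chosen point" per input point, so each new pick costs one pass over the cloud. The function name, the fixed start index and the toy cloud are assumptions for the example.

```python
def farthest_point_sampling(points, m, start=0):
    """Pick m indices so that each new point is farthest from the chosen set."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # min_dist[i] = squared distance from point i to its nearest chosen point.
    min_dist = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, sq_dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen

cloud = [(0, 0, 0), (1, 0, 0), (0.1, 0, 0), (0, 5, 0), (0, 0, 2)]
idx = farthest_point_sampling(cloud, 3)
```

Starting from point 0, the sampler skips the two nearby points and picks the two outliers, which is the coverage property PointNet++ relies on when choosing centroids.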
Comparisons
  • VoxNet vs PointNet: VoxNet: regularises via voxelization, 3D conv, O(N³) memory, low resolution. PointNet: raw points, shared MLP + max pool, no quantization, no local context.
  • PointNet vs PointNet++: Flat global pool vs hierarchical local pools (FPS + ball query).
  • DGCNN vs PointNet++: DGCNN uses kNN in feature space (semantic neighbours); PointNet++ uses ball query in xyz space (geometric neighbours).
Keywords
voxel, PointNet, PointNet++, DGCNN, EdgeConv, MeshCNN, symmetric function, critical points

Unit 9 — NeRF & 3D Gaussian Splatting

NeRF & 3DGS — Per-Scene Optimisation and Differentiable Rendering
One-liners
  • 3DGS = per-scene optimisation, NOT a neural network.
  • Three pillars: scene modelling / image formation / optimisation.
  • Per-Gaussian params: 3 (position) + 7 (scale 3 + rotation quaternion 4) + 1 (opacity) + 48 (SH colour, 16 × 3) = 59.
  • Σ = R·S·Sᵀ·Rᵀ guarantees PSD.
  • Metrics: PSNR↑, SSIM↑, LPIPS↓.
Formulas
  • C = Σ c_i α_i ∏_{j<i}(1 − α_j)
  • Σ = R·S·Sᵀ·Rᵀ
  • L = (1 − λ)·L₁ + λ·L_D-SSIM, λ ≈ 0.2
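The compositing formula C = Σ c_i α_i ∏_{j<i}(1 − α_j) is a running sum with a running transmittance, where index i = 0 is the sample nearest the camera. A minimal sketch for one pixel follows; the function name and the two-sample scene are assumptions for the example.

```python
def composite(colours, alphas):
    """Front-to-back alpha compositing of per-sample RGB colours for one pixel."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) for j < i
    for c, a in zip(colours, alphas):
        weight = a * transmittance
        out = [o + weight * ch for o, ch in zip(out, c)]
        transmittance *= 1.0 - a
    return out, transmittance

# Nearest sample first: half-transparent red in front of opaque blue.
pixel, remaining = composite([(1, 0, 0), (0, 0, 1)], [0.5, 1.0])
```

Once the transmittance reaches (near) zero, later samples cannot contribute, which is what allows a renderer to stop early for a saturated pixel.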
Definitions
  • SH = orthonormal angular basis (view-dependent colour).
  • ADC = clone/split/prune of Gaussians.
  • COLMAP = SfM pre-processing for poses + sparse cloud.
Algorithms
  • 3DGS render: project μ, Σ to 2D → sort Gaussians front-to-back by depth (per tile) → alpha-composite with C = Σ c_i α_i ∏_{j<i}(1 − α_j).
  • ADC step: for each Gaussian, look at position gradient. Small + high grad → clone. Large + high grad → split. Low α → prune.
Comparisons
  • NeRF vs 3DGS: NeRF: implicit MLP, dense ray marching, hours to train, seconds/frame to render. 3DGS: explicit Gaussians, rasteriser, ~30 min training, real-time rendering.
  • Direct Σ optim vs R·S·Sᵀ·Rᵀ: Direct can produce invalid (non-PSD) covariances; decomposition is PSD by construction.
Keywords
NeRF, 3DGS, spherical harmonics, ADC, COLMAP, PSNR, SSIM, LPIPS, alpha compositing

Unit 10 — Attention & Transformers

Attention Mechanism & Transformer Architecture
One-liners
  • Attn = softmax(QKᵀ/√dₖ) V. /√dₖ rescales variance.
  • MHA: h parallel attentions in dₖ = d_model/h, concat, project.
  • Encoder = self-attn + FFN. Decoder = masked self-attn + cross-attn + FFN.
  • Positional encoding required: self-attn is permutation equivariant.
Formulas
  • Attn(Q,K,V) = softmax(QKᵀ/√dₖ) V
  • PE(pos, 2i) = sin(pos/10000^{2i/d}); PE(pos, 2i+1) = cos(pos/10000^{2i/d})
  • y = x + Sublayer(LN(x)) (PreNorm)
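Scaled dot-product attention, Attn(Q, K, V) = softmax(QKᵀ/√dₖ) V, written out with plain lists so every step is visible. This is a single-head sketch without masking or batching; the function names and the tiny Q, K, V are assumptions for the example.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # One logit per key: scaled dot product between the query and each key.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(logits)
        weights.append(w)
        # Output row = attention-weighted average of the value rows.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
```

The query matches the first key more closely, so the first value row gets the larger weight; the weights in each row always sum to 1.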
Definitions
  • Self-attn = Q,K,V from same seq.
  • Cross-attn = Q from decoder, K,V from encoder.
  • Masked self-attn = look-ahead mask sets future logits to −∞.
Algorithms
  • Decoder step: feed prefix → masked self-attn → cross-attn over encoder → FFN → logits → argmax/sample → append.
  • KV-cache: store K, V of past tokens; per step compute Q for new token only.
Comparisons
  • RNN / LSTM vs Transformer: Sequential vs parallel; O(n) memory vs O(n²); vanishing grads vs constant path length.
  • Sinusoidal PE vs Learned PE: Sinusoidal generalises to unseen lengths; learned is data-dependent and breaks beyond training length.
Keywords
attention, multi-head, scaled dot product, self-attention, cross-attention, masked, PE, Add & Norm, FFN

Unit 11 — Vision Transformers (ViT)

ViT Pipeline, Scaling, and Swin
One-liners
  • ViT: 16×16 patches → 196 tokens (+CLS) → encoder → classify.
  • ViT-B/16 ≈ 86 M params. Per layer ≈ 7 M.
  • Position embedding ESSENTIAL — without it, attention is order-blind.
  • Swin: window self-attn + shift every other layer ⇒ O(n) per layer, global RF over depth.
Formulas
  • N = HW/P²; tokens = N + 1 (with CLS)
  • Per-block params = 4d² + 2d·d_ff
  • ViT cost: O(N² d). Swin: O(M²N)
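The token and parameter counts can be reproduced with the formulas N = HW/P² and 4d² + 2d·d_ff. The sketch below plugs in the ViT-B/16 configuration (d = 768, d_ff = 3072, 12 layers, 224×224 input); biases, LayerNorm, embeddings and the head are deliberately ignored, so the total lands slightly under the quoted ≈ 86 M. Function names are illustrative.

```python
def vit_tokens(h, w, p, cls=True):
    """Number of tokens for an HxW image with PxP patches (plus optional CLS)."""
    return (h // p) * (w // p) + (1 if cls else 0)

def block_params(d, d_ff):
    """Weights per encoder block, ignoring biases and LayerNorm."""
    return 4 * d * d + 2 * d * d_ff   # Q, K, V, O projections + two FFN matrices

tokens = vit_tokens(224, 224, 16)     # 14 * 14 patches + CLS
per_block = block_params(768, 3072)   # ViT-B: d = 768, d_ff = 4d
encoder_total = 12 * per_block        # 12 encoder layers in ViT-B
```

This gives 197 tokens, about 7.08 M weights per block and about 84.9 M for the 12-layer encoder, consistent with the "≈ 7 M per layer" one-liner.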
Definitions
  • Patch embedding = conv(P, stride P).
  • [CLS] = learnable global token.
  • Swin = local windows + shifted grid.
Algorithms
  • ViT forward: patchify → project → +CLS → +PE → L encoder layers → CLS head.
  • Swin block pair: window self-attn → MLP → cyclic shift → window self-attn (masked) → reverse-shift → MLP.
Comparisons
  • CNN vs ViT: CNN: local early, growing RF, win on small data. ViT: mixed local/global early, needs massive data, scales better.
  • ViT vs Swin: ViT: global attention, O(N²). Swin: local windows + shift, O(N) per layer, hierarchical downsampling for dense tasks.
Keywords
patch, CLS, ViT-B/16, PE, Swin, shifted window, inductive bias, JFT

Unit 12 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)

Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
One-liners
  • SimCLR: 2 augmentations, big batch, NT-Xent.
  • MoCo: queue of past keys + momentum encoder (small batch OK).
  • BYOL: no negatives — predictor head + stop-gradient + momentum target.
  • CLIP: image-text contrastive on 400 M pairs; zero-shot via 'a photo of a {class}'.
Formulas
  • L = −log e^{s⁺/τ} / Σ_j e^{s_j/τ}
  • θ_k ← m θ_k + (1−m) θ_q (MoCo, m ≈ 0.999)
  • CLIP loss = ½(L_{i→t} + L_{t→i}) (symmetric CE)
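The contrastive loss L = −log e^{s⁺/τ} / Σ_j e^{s_j/τ} for a single anchor can be sketched as below, with the positive included in the denominator. The similarity values and τ = 0.1 are made-up inputs; the function name `info_nce` is an assumption for the example.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """-log( exp(s+/tau) / sum_j exp(s_j/tau) ), the sum including the positive."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - sim_pos / tau

easy = info_nce(0.9, [0.1, 0.0, -0.2])   # positive clearly the most similar
hard = info_nce(0.2, [0.1, 0.3, 0.25])   # negatives about as similar as the positive
```

The loss is small when the positive dominates and grows as negatives become competitive; lowering τ sharpens that contrast.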
Definitions
  • Positive pair = two augmented views of same image.
  • Projection head g = MLP between encoder and loss; discarded downstream.
  • Zero-shot = no labelled examples of target classes seen.
Algorithms
  • SimCLR step: augment ×2 → encoder → projection → NT-Xent over 2N − 2 negatives.
  • CLIP step: encode images → encode texts → N×N cosine → symmetric softmax CE.
Comparisons
  • SimCLR vs MoCo: SimCLR: big batch, in-batch negatives. MoCo: queue + momentum encoder, decouples batch size from #negatives.
  • MoCo / SimCLR vs BYOL: MoCo/SimCLR need negatives. BYOL doesn't — avoids collapse via predictor + stop-gradient + momentum target.
Keywords
SimCLR, InfoNCE, NT-Xent, MoCo, momentum encoder, BYOL, stop-gradient, CLIP, zero-shot

Unit 13 — SSL: DINO, MAE, JEPA

DINO, MAE, JEPA — Modern SSL Beyond Contrastive
One-liners
  • DINO = self-DIstillation NO labels. EMA teacher; centering + sharpening; multi-crop.
  • DINO output dim = 65,536 (large to prevent collapse-to-one-dim).
  • MAE: 75% mask, encoder sees only visible, small decoder reconstructs pixels.
  • JEPA: predict TARGET REPRESENTATIONS (not pixels). Latent-space L2.
Formulas
  • DINO: L = −Σ p_t · log p_s. p_t = softmax((g_t − c)/τ_t).
  • EMA: θ_t ← λ θ_t + (1−λ) θ_s, λ ≈ 0.996 → 1.
  • MAE: L = mean squared error on masked patches only.
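The DINO loss and the EMA teacher update can be sketched on toy vectors: the teacher output is centred and sharpened, p_t = softmax((g_t − c)/τ_t), and the student is trained with L = −Σ p_t · log p_s. The temperatures τ_s = 0.1 and τ_t = 0.04 and all function names are assumptions for this example, and the three-dimensional outputs stand in for the real 65,536-dimensional head.

```python
import math

def softmax(logits, tau):
    scaled = [v / tau for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def dino_loss(student_out, teacher_out, centre, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centred, sharpened teacher and the student."""
    p_t = softmax([g - c for g, c in zip(teacher_out, centre)], tau_t)
    p_s = softmax(student_out, tau_s)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

def ema_update(teacher, student, lam=0.996):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise."""
    return [lam * t + (1 - lam) * s for t, s in zip(teacher, student)]

centre = [0.0, 0.0, 0.0]
agree = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], centre)
disagree = dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], centre)
new_teacher = ema_update([1.0, 1.0], [0.0, 2.0])
```

Only the student would receive gradients from this loss; the teacher moves solely through `ema_update`, and the centre `c` would itself be a running mean of teacher outputs (not shown).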
Definitions
  • Self-distillation = student matches teacher's distribution.
  • Centering = subtract running mean (anti-collapse #1).
  • Sharpening = low teacher τ (anti-collapse #2).
  • Registers = scratchpad tokens, no position, clean attention maps.
Algorithms
  • DINO step: augment image into multi-crop set → student over all, teacher over global → softmax with centering+sharpening → CE → backprop student only → EMA update teacher.
  • MAE step: patchify → mask 75% → encoder on visible → insert mask tokens → decoder → MSE on masked patches.
Comparisons
  • DINO vs BYOL: BYOL outputs feature vectors, MSE between predictor and target. DINO outputs distributions, cross-entropy with centering+sharpening.
  • MAE vs JEPA: MAE predicts pixels (wastes capacity on texture). JEPA predicts target features (semantic level).
Keywords
DINO, EMA teacher, centering, sharpening, multi-crop, MAE, 75% mask, JEPA, registers, I-JEPA

Unit 14 — Transformer Advances (ViT-5 era)

Modern Transformer Upgrades
One-liners
  • 7 modern upgrades: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash.
  • PreNorm: x + Sublayer(LN(x)). Unbroken residual.
  • RMSNorm = LayerNorm without mean subtraction. Cheaper.
  • RoPE: rotation matrix on Q, K. Encodes RELATIVE position.
  • Flash Attention: tile + online softmax → never materialise N×N in HBM.
  • GQA: heads share K, V per group → KV-cache shrinks h/G×.
Formulas
  • PreNorm: y = x + Sublayer(LN(x))
  • RMSNorm: y = γ · x / RMS(x)
  • RoPE: rotate (q_{2i}, q_{2i+1}) by m · θᵢ
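RMSNorm and RoPE are both short enough to write out. The sketch below assumes the common frequency choice θᵢ = base^(−2i/d) with base 10000 and an ε inside the RMS for stability; those choices and the function names are assumptions for the example. The last two lines illustrate why RoPE encodes RELATIVE position: the q·k score depends only on the position offset.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """y = gamma * x / RMS(x); no mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def rope(vec, pos, base=10000.0):
    """Rotate each pair (v[2i], v[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
normed = rms_norm([3.0, 4.0], [1.0, 1.0])
score_a = dot(rope(q, 5), rope(k, 2))     # positions 5 and 2
score_b = dot(rope(q, 13), rope(k, 10))   # different positions, same offset of 3
```

Because each pair is rotated by an angle linear in position, the two rotations cancel down to the offset inside the dot product, so `score_a` and `score_b` agree up to floating-point error.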
Definitions
  • Registers = global scratchpad tokens, no position.
  • LayerScale γ_l ≈ 1e-4 at init.
  • KV-cache: store past K, V for autoregressive.
Algorithms
  • Flash Attention forward: tile (Q_i, K_j, V_j) blocks → compute partial softmax → accumulate via online softmax stats → next block.
  • Decoder with KV-cache: per step, compute new Q only; append new K, V; attend new Q against cached K, V.
Comparisons
  • PostNorm vs PreNorm: PostNorm needs careful warmup; PreNorm has unbroken residual stream and trains stably deep.
  • LayerNorm vs RMSNorm: LayerNorm centers + scales; RMSNorm only scales. RMSNorm cheaper; mean subtraction empirically negligible.
  • MHA vs GQA / MQA: MHA: K, V per head. MQA: K, V shared across all heads (smallest cache, quality loss). GQA: per-group share — sweet spot.
Keywords
PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, Flash Attention, GQA, KV-cache

Unit 15 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

VLM Architecture — Encoders, Connectors, Positional Encoding
One-liners
  • VLM three pillars: Vision Encoder + Connector + LLM.
  • Connector = ONLY random-init piece in stitched VLMs.
  • Prefix-LM: bidirectional on image+prompt, causal on answer, loss on answer.
  • SigLIP: pairwise sigmoid BCE (no batch-wide softmax).
  • Qwen2-VL: dynamic resolution + 2-layer MLP connector + M-RoPE.
  • Gemma 4: native multimodal (no connector).
Formulas
  • SigLIP: −1/n² Σᵢⱼ [y log σ(z) + (1−y) log(1−σ(z))]
  • M-RoPE position: (t, r, c)
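The pairwise sigmoid loss can be sketched directly from the formula −1/n² Σᵢⱼ [y log σ(z) + (1−y) log(1−σ(z))], with y = 1 on the diagonal (matching image-text pairs) and y = 0 elsewhere. The logits here are given directly; the real SigLIP also applies a learnable temperature and bias to the similarities, which this sketch leaves out. Function names are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def siglip_loss(logits):
    """Pairwise sigmoid BCE over an n x n image-text logit matrix.

    Entry (i, j) is the logit for image i and text j; the diagonal holds
    the matching pairs (y = 1), every other entry is a negative (y = 0).
    """
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = logits[i][j]
            # log(1 - sigmoid(z)) equals log(sigmoid(-z)).
            total += log_sigmoid(z) if i == j else log_sigmoid(-z)
    return -total / (n * n)

aligned = siglip_loss([[5.0, -5.0], [-5.0, 5.0]])
confused = siglip_loss([[-5.0, 5.0], [5.0, -5.0]])
```

Each (i, j) term is independent of the others, so no batch-wide softmax normalisation is needed; that independence is the property the CLIP vs SigLIP comparison refers to.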
Definitions
  • Modality gap = discrete vocab vs continuous pixels.
  • Prefix-LM = mixed bidirectional/causal mask.
  • Dynamic resolution = native aspect ratio (AR) + clamped token count.
Algorithms
  • PaliGemma forward: SigLIP(image) → linear → concat with text tokens → Gemma decoder with prefix-LM mask → next-token loss on suffix.
  • M-RoPE: split head dim into 3; rotate Q, K by (m_t·θ, m_r·θ, m_c·θ) respectively.
Comparisons
  • CLIP vs SigLIP: CLIP: softmax over batch → needs sync. SigLIP: pairwise sigmoid BCE → independent per pair → scales arbitrarily large batches.
  • PaliGemma 224² fixed vs Qwen2-VL native AR: PaliGemma loses fine detail. Qwen2-VL preserves AR + resolution; tile count clamped by user.
  • Stitched (PaliGemma) vs Native multimodal (Gemma 4): Stitched: connector bottleneck reconciles two pretrained latent spaces. Native: joint training from scratch, no connector.
Keywords
VLM, SigLIP, PaliGemma, Qwen2-VL, M-RoPE, Prefix-LM, connector, dynamic resolution, Gemma 4

Unit 16 — Video Understanding

Video Architectures — 3D CNNs, Two-Stream, SlowFast, ViViT, TimeSformer
One-liners
  • Video ≠ images × T (temporal pattern matters).
  • I3D: inflate 2D filters to 3D, /K_T for activation scale.
  • Two-Stream: spatial (RGB) + temporal (stacked optical flow, 2L ch).
  • SlowFast: low-fps semantic + high-fps motion + lateral fusion.
  • ViViT: uniform frame sample OR tubelet embedding.
  • TimeSformer winner: divided space-time attention.
Formulas
  • 3D conv filter: C_out × C_in × K_T × K_H × K_W
  • I3D inflate: W_3D = W_2D / K_T
  • Joint attn cost: O((TN)²·d). Divided ≈ O((T²N + TN²)·d).
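The I3D inflation rule W_3D = W_2D / K_T can be checked on a toy kernel: repeat the 2D kernel K_T times along time, divide by K_T, and a clip made of one repeated frame gives the same response as the 2D kernel on that frame. Single-channel, single-position filtering and all names below are simplifying assumptions for the example.

```python
def inflate_kernel(w2d, k_t):
    """Repeat a 2D kernel K_T times along time and divide every weight by K_T."""
    return [[[v / k_t for v in row] for row in w2d] for _ in range(k_t)]

def response_2d(patch, w2d):
    """Response of a 2D kernel at one position (element-wise product, summed)."""
    return sum(p * w for prow, wrow in zip(patch, w2d) for p, w in zip(prow, wrow))

def response_3d(clip, w3d):
    """Response of a 3D kernel on a clip with one frame per temporal tap."""
    return sum(response_2d(frame, w) for frame, w in zip(clip, w3d))

w2d = [[1.0, 0.0], [0.0, -1.0]]
frame = [[3.0, 1.0], [4.0, 2.0]]
w3d = inflate_kernel(w2d, k_t=4)
static_clip = [frame] * 4            # the same frame repeated 4 times
r2d = response_2d(frame, w2d)
r3d = response_3d(static_clip, w3d)
```

Matching responses on a static clip is what keeps the activation scale of the pretrained 2D network intact after inflation; without the division the response would be K_T times too large.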
Definitions
  • Optical flow = 2D motion field per pixel.
  • Tubelet = 3D patch (t × h × w).
  • Divided attn = factorise temporal then spatial.
Algorithms
  • Two-Stream forward: spatial CNN on RGB + temporal CNN on L flow frames → late fusion of softmax.
  • TimeSformer divided block: temporal MSA → spatial MSA → MLP, each applied once per block with its own residual connection.
Comparisons
  • 2D CNN + LSTM vs I3D: 2D+LSTM: per-frame spatial features then RNN aggregator. I3D: native spatio-temporal kernels, end-to-end on space and time.
  • Joint space-time vs Divided space-time: Joint: O((TN)²) — prohibitive. Divided: separate temporal then spatial; near-linear; best accuracy/efficiency.
  • Slow path vs Fast path (SlowFast): Slow: low fps, high channels, expensive per frame (semantics). Fast: high fps, low channels, cheap (motion).
Keywords
Kinetics, I3D, C3D, Two-Stream, optical flow, SlowFast, ViViT, TimeSformer, tubelet, divided attention