
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Memory Triggers

Tiny cues. They reconstruct big topics.

R-CNN family speedup order
47s (R-CNN) → 0.3s (Fast, RoI Pool) → 0.2s (Faster, RPN) → 22ms (YOLO single shot).
YOLO output shape
S × S × (B·5 + C). Default 7×7×30 on PASCAL VOC (B=2, C=20).
YOLO loss λ-coefficients
λ_coord = 5 (box terms upweighted), λ_noobj = 0.5 (no-object cells downweighted to prevent gradient drowning).
NMS one-liner
Sort by confidence; pop top; suppress IoU > τ; repeat; PER CLASS.
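That one-liner, expanded into a minimal NumPy sketch (run it once per class; function and variable names here are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for one class. boxes: (N, 4) as [x1, y1, x2, y2]."""
    order = np.argsort(scores)[::-1]          # sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                          # pop the top-scoring box
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]       # suppress high-overlap survivors
    return keep
```

With two heavily overlapping boxes and one far away, only the top-scoring of the overlapping pair plus the distant box survive.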
Conv output size
(W − F + 2P)/S + 1. Same-padding for odd kernel: P = (F−1)/2.
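The formula as a two-line sanity-check helper (names are ours):

```python
def conv_out(W, F, P=0, S=1):
    """Output spatial size of a conv/pool layer: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

assert conv_out(32, 5, P=2) == 32            # 'same' padding: P = (F-1)/2 at S=1
assert conv_out(224, 7, P=3, S=2) == 112     # e.g. a 7x7 stride-2 stem
```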
Dice ↔ IoU
Dice = 2·IoU / (1 + IoU). Both monotonic; Dice is sum-not-union in denominator.
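A quick numeric check of the identity on a toy pair of binary masks:

```python
import numpy as np

a = np.zeros(20, bool); a[2:8] = True      # |A| = 6
b = np.zeros(20, bool); b[5:11] = True     # |B| = 6, overlap of 3
inter = (a & b).sum()
union = (a | b).sum()
iou = inter / union                        # 3/9
dice = 2 * inter / (a.sum() + b.sum())     # 6/12 -- sum, not union, below
assert np.isclose(dice, 2 * iou / (1 + iou))
```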
Opening vs Closing
Open = erode→dilate (kills noise). Close = dilate→erode (fills holes). Both idempotent.
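A NumPy-only sketch of binary opening with a 3x3 square structuring element (loops kept for clarity, not speed), showing both the noise-killing and the idempotence claims:

```python
import numpy as np

def dilate(img, k=3):
    p = k // 2
    pad = np.pad(img, p)
    return np.array([[pad[i:i + k, j:j + k].max() for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def erode(img, k=3):
    p = k // 2
    pad = np.pad(img, p)
    return np.array([[pad[i:i + k, j:j + k].min() for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def opening(img):                 # erode -> dilate
    return dilate(erode(img))

img = np.zeros((7, 7), int)
img[1:4, 1:4] = 1                 # 3x3 object: survives opening
img[5, 5] = 1                     # isolated noise pixel: removed
opened = opening(img)
assert opened[5, 5] == 0 and (opened[1:4, 1:4] == 1).all()
assert (opening(opened) == opened).all()    # idempotent
```

(Zero-padding is fine here because nothing touches the border; a production version would use OpenCV/scipy morphology.)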
Smoothing vs derivative kernel sum rule
Smoothing kernels sum to 1 (mean/Gaussian). Derivative kernels sum to 0 (Sobel/Laplacian).
Why JPEG uses DCT not DFT
DCT is real-valued + no edge discontinuity assumption → better energy compaction → fewer artifacts at block boundaries.
Heatmap regression mantra
Output K 2D heatmaps; GT = 2D Gaussian at keypoint; argmax + parabola fit for sub-pixel.
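The mantra end-to-end for a single keypoint, assuming a separable Gaussian GT and a 3-point parabola fit per axis (function names are ours):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=1.5):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode(hm):
    """argmax, then a 1D parabola fit per axis for the sub-pixel offset."""
    y, x = np.unravel_index(hm.argmax(), hm.shape)
    def offset(l, c, r):                   # vertex of parabola through 3 samples
        denom = l - 2 * c + r
        return 0.0 if denom == 0 else 0.5 * (l - r) / denom
    dx = offset(hm[y, x - 1], hm[y, x], hm[y, x + 1]) if 0 < x < hm.shape[1] - 1 else 0.0
    dy = offset(hm[y - 1, x], hm[y, x], hm[y + 1, x]) if 0 < y < hm.shape[0] - 1 else 0.0
    return x + dx, y + dy

px, py = decode(gaussian_heatmap(16, 16, 7.3, 4.6))
assert abs(px - 7.3) < 0.1 and abs(py - 4.6) < 0.1   # sub-pixel recovery
```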
PAF score equation
Score(A→B) = integral over line from A to B of (PAF unit vector) · (line direction).
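The integral approximated by sampling points along the candidate limb, as a sketch (uniform sampling and nearest-pixel lookup are simplifications):

```python
import numpy as np

def paf_score(paf, a, b, n_samples=10):
    """Approximate the line integral of PAF(p) . d_hat along segment a->b.
    paf: (H, W, 2) vector field; a, b: (x, y) candidate keypoints."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = b - a
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    d_hat = d / norm                              # limb direction
    score = 0.0
    for t in np.linspace(0, 1, n_samples):
        x, y = (a + t * d).round().astype(int)    # sample point on the segment
        score += paf[y, x] @ d_hat                # dot with the field vector
    return score / n_samples
```

On a field pointing uniformly along +x, a horizontal candidate limb scores 1 and a vertical one scores 0.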
OpenPose output channels
K heatmaps + 2L PAFs. e.g., 18 keypoints + 19 limbs ⇒ 18 + 38 = 56 channels.
PCKh@0.5 normalization
Distance threshold = 0.5 × head bone length. Head used because torso varies more under pose.
SMPL parameter counts
β = 10 (shape PCA), θ = 72 (24 joints × axis-angle), mesh = 6890 vertices.
PointNet universal approximator
Any continuous symmetric set function ≈ γ(MAX_p h(p)). Shared MLP h, symmetric max-pool, MLP γ.
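The key structural point, permutation invariance from the symmetric max-pool, checked with a toy one-layer h (a random linear + ReLU standing in for the shared MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))              # shared per-point weights

def encode(points):                       # points: (N, 3)
    feats = np.maximum(points @ W, 0.0)   # h applied to every point identically
    return feats.max(axis=0)              # MAX over points: order-independent

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(128)]
assert np.allclose(encode(cloud), encode(shuffled))   # same global feature
```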
3DGS parameters per Gaussian
3 (μ) + 7 (R·S = 4 quat + 3 scale) + 1 (α) + 48 (SH deg-3 × 3 channels) = 59.
Σ = R·S·Sᵀ·Rᵀ guarantee
Decomposition forces positive semi-definite; direct optimization of a 6-DoF symmetric matrix can produce invalid Σ.
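Why the decomposition is safe, verified numerically: any quaternion + positive scales yield a valid covariance (the quaternion-to-matrix formula below is standard; exp keeps scales positive):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=4)
q /= np.linalg.norm(q)                    # unit quaternion
w, x, y, z = q
R = np.array([                            # quaternion -> rotation matrix
    [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
    [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
    [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
])
S = np.diag(np.exp(rng.normal(size=3)))   # positive scales via exp
M = R @ S
Sigma = M @ M.T                           # Sigma = R S S^T R^T
assert (np.linalg.eigvalsh(Sigma) > 0).all()   # PSD by construction
```

A freely optimized symmetric 3x3 has no such guarantee: a gradient step can push an eigenvalue negative.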
Attention scaling argument
QKᵀ variance grows with dₖ → softmax saturates → vanishing grads. /√dₖ rescales variance to 1.
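The variance argument, measured empirically on unit-variance random q, k:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
logits = (q * k).sum(axis=1)              # q . k : variance grows like d_k
scaled = logits / np.sqrt(d_k)            # rescaled back to ~unit variance
print(logits.var(), scaled.var())         # ~64 vs ~1
```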
ViT-B/16 quick numbers
224² image, 16² patch ⇒ 14×14 = 196 patches; +CLS = 197. d=768, L=12, h=12, ≈ 86 M params.
ViT param formula per block
4·d² (QKVO) + 2·d·d_ff (FFN). With d=768, d_ff=3072 ⇒ ≈ 7 M / layer × 12 ≈ 85 M.
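The formula plugged in, as arithmetic (biases, embeddings, and LayerNorms deliberately excluded, which is why this lands slightly under the 86 M total):

```python
d, d_ff, L = 768, 3072, 12

attn = 4 * d * d                 # W_Q, W_K, W_V, W_O
ffn = 2 * d * d_ff               # the two FFN matrices
per_block = attn + ffn
total = L * per_block

assert per_block == 7_077_888    # ~7 M per layer
assert total == 84_934_656       # ~85 M
```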
DINO output dim
65,536 — a deliberately over-complete projection-head output; empirically, large K helps (collapse itself is countered by centering + sharpening).
DINO anti-collapse pair
Centering (running mean subtraction) — spreads distribution; Sharpening (low τ teacher) — peaky target. Both balance.
MAE vs BERT mask ratio
MAE 75%, BERT 15%. Images have higher spatial redundancy → must mask more to force semantic learning.
RoPE pair-rotation
Rotate (q_{2i}, q_{2i+1}) by m·θᵢ. Dot product depends only on (m − n) → relative position.
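The relative-position property checked on a single rotated pair (one frequency θ only, for clarity):

```python
import numpy as np

def rope2(x, m, theta=0.1):
    """Rotate one (x_{2i}, x_{2i+1}) pair by angle m*theta."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
d1 = rope2(q, 7) @ rope2(k, 3)            # positions (7, 3): offset 4
d2 = rope2(q, 104) @ rope2(k, 100)        # positions (104, 100): same offset
assert np.isclose(d1, d2)                 # dot depends only on m - n
```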
GQA reduction
Group h heads into G clusters, share K,V per group. KV-cache shrinks by h/G with minimal quality loss.
PaliGemma three pillars
SigLIP encoder → linear connector (only random-init component) → Gemma decoder; notably, nothing is frozen during multimodal pretraining.
Prefix-LM mask
Image + prompt: bidirectional. Suffix (answer): causal. Loss only on suffix.
M-RoPE triple
(t, r, c). Static images: t=0. Text: t=r=c=token index. Video: full T×H×W.
I3D inflation
Copy 2D K×K filter K times along time axis, divide by K. 'Boring video' of static image gives identical activations.
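The 'boring video' identity verified at a single filter location (T here stands in for the temporal kernel extent):

```python
import numpy as np

rng = np.random.default_rng(0)
k2d = rng.normal(size=(3, 3))             # pretrained 2D filter
T = 5
k3d = np.stack([k2d / T] * T)             # inflate: T copies, each divided by T

frame = rng.normal(size=(3, 3))
video = np.stack([frame] * T)             # 'boring video': frame repeated T times

resp2d = (k2d * frame).sum()              # 2D response
resp3d = (k3d * video).sum()              # 3D response on the boring video
assert np.isclose(resp2d, resp3d)         # identical activations
```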
Two-Stream split
Spatial stream (RGB frame, 'what'); Temporal stream (stack of optical flow, 'how'). Late fusion.
Focal loss formula
FL = −(1−pₜ)^γ · log(pₜ). γ=2 typical. Down-weights easy negatives → single-stage matches two-stage.
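The down-weighting made concrete (binary form without the α balancing term, matching the cue above):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """FL = -(1 - p_t)^gamma * log(p_t), with p = P(class 1), y in {0, 1}."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Easy negative (p = 0.1, y = 0): modulating factor (1 - 0.9)^2 = 0.01,
# i.e. a 100x down-weight relative to plain cross-entropy.
ce = -np.log(0.9)
fl = focal_loss(np.array([0.1]), np.array([0]))[0]
assert np.isclose(fl, 0.01 * ce)
```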
Modern Transformer upgrade list (7)
PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash Attention.