
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Memory Triggers

Tiny cues. They reconstruct big topics.

R-CNN family speedup order
47s (R-CNN) → 0.3s (Fast, RoI Pool) → 0.2s (Faster, RPN) → 22ms (YOLO single shot).
YOLO output shape
S × S × (B·5 + C). Default 7×7×30 on PASCAL VOC (B=2, C=20).
YOLO loss λ-coefficients
λ_coord = 5 (box terms upweighted), λ_noobj = 0.5 (no-object cells downweighted to prevent gradient drowning).
NMS one-liner
Sort by confidence; pop top; suppress IoU > τ; repeat; PER CLASS.
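That one-liner, expanded into a minimal NumPy sketch (run it once per class; function and variable names here are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for one class. boxes: (N, 4) as [x1, y1, x2, y2]."""
    order = np.argsort(scores)[::-1]          # sort by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                          # pop the top-scoring box
        keep.append(int(i))
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]       # suppress high-overlap survivors
    return keep
```

With two heavily overlapping boxes and one far away, only the top-scoring of the overlapping pair plus the distant box survive.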
Conv output size
(W − F + 2P)/S + 1. Same-padding for odd kernel: P = (F−1)/2.
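The formula as a two-line sanity-check helper (names are ours):

```python
def conv_out(W, F, P=0, S=1):
    """Output spatial size of a conv/pool layer: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

assert conv_out(32, 5, P=2) == 32            # 'same' padding: P = (F-1)/2 at S=1
assert conv_out(224, 7, P=3, S=2) == 112     # e.g. a 7x7 stride-2 stem
```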
Dice ↔ IoU
Dice = 2·IoU / (1 + IoU). Both monotonic; Dice is sum-not-union in denominator.
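A quick numeric check of the identity on a toy pair of binary masks:

```python
import numpy as np

a = np.zeros(20, bool); a[2:8] = True      # |A| = 6
b = np.zeros(20, bool); b[5:11] = True     # |B| = 6, overlap of 3
inter = (a & b).sum()
union = (a | b).sum()
iou = inter / union                        # 3/9
dice = 2 * inter / (a.sum() + b.sum())     # 6/12 -- sum, not union, below
assert np.isclose(dice, 2 * iou / (1 + iou))
```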
Opening vs Closing
Open = erode→dilate (kills noise). Close = dilate→erode (fills holes). Both idempotent.
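A NumPy-only sketch of binary opening with a 3x3 square structuring element (loops kept for clarity, not speed), showing both the noise-killing and the idempotence claims:

```python
import numpy as np

def dilate(img, k=3):
    p = k // 2
    pad = np.pad(img, p)
    return np.array([[pad[i:i + k, j:j + k].max() for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def erode(img, k=3):
    p = k // 2
    pad = np.pad(img, p)
    return np.array([[pad[i:i + k, j:j + k].min() for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def opening(img):                 # erode -> dilate
    return dilate(erode(img))

img = np.zeros((7, 7), int)
img[1:4, 1:4] = 1                 # 3x3 object: survives opening
img[5, 5] = 1                     # isolated noise pixel: removed
opened = opening(img)
assert opened[5, 5] == 0 and (opened[1:4, 1:4] == 1).all()
assert (opening(opened) == opened).all()    # idempotent
```

(Zero-padding is fine here because nothing touches the border; a production version would use OpenCV/scipy morphology.)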
Smoothing vs derivative kernel sum rule
Smoothing kernels sum to 1 (mean/Gaussian). Derivative kernels sum to 0 (Sobel/Laplacian).
Why JPEG uses DCT not DFT
DCT is real-valued + no edge discontinuity assumption → better energy compaction → fewer artifacts at block boundaries.
Heatmap regression mantra
Output K 2D heatmaps; GT = 2D Gaussian at keypoint; argmax + parabola fit for sub-pixel.
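The mantra end-to-end for a single keypoint, assuming a separable Gaussian GT and a 3-point parabola fit per axis (function names are ours):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=1.5):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode(hm):
    """argmax, then a 1D parabola fit per axis for the sub-pixel offset."""
    y, x = np.unravel_index(hm.argmax(), hm.shape)
    def offset(l, c, r):                   # vertex of parabola through 3 samples
        denom = l - 2 * c + r
        return 0.0 if denom == 0 else 0.5 * (l - r) / denom
    dx = offset(hm[y, x - 1], hm[y, x], hm[y, x + 1]) if 0 < x < hm.shape[1] - 1 else 0.0
    dy = offset(hm[y - 1, x], hm[y, x], hm[y + 1, x]) if 0 < y < hm.shape[0] - 1 else 0.0
    return x + dx, y + dy

px, py = decode(gaussian_heatmap(16, 16, 7.3, 4.6))
assert abs(px - 7.3) < 0.1 and abs(py - 4.6) < 0.1   # sub-pixel recovery
```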
PAF score equation
Score(A→B) = integral over line from A to B of (PAF unit vector) · (line direction).
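The integral approximated by sampling points along the candidate limb, as a sketch (uniform sampling and nearest-pixel lookup are simplifications):

```python
import numpy as np

def paf_score(paf, a, b, n_samples=10):
    """Approximate the line integral of PAF(p) . d_hat along segment a->b.
    paf: (H, W, 2) vector field; a, b: (x, y) candidate keypoints."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = b - a
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    d_hat = d / norm                              # limb direction
    score = 0.0
    for t in np.linspace(0, 1, n_samples):
        x, y = (a + t * d).round().astype(int)    # sample point on the segment
        score += paf[y, x] @ d_hat                # dot with the field vector
    return score / n_samples
```

On a field pointing uniformly along +x, a horizontal candidate limb scores 1 and a vertical one scores 0.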
OpenPose output channels
K heatmaps + 2L PAFs. e.g., 18 keypoints + 19 limbs ⇒ 18 + 38 = 56 channels.
PCKh@0.5 normalization
Distance threshold = 0.5 × head bone length. Head used because torso varies more under pose.
SMPL parameter counts
β = 10 (shape PCA), θ = 72 (24 joints × axis-angle), mesh = 6890 vertices.
PointNet universal approximator
Any continuous symmetric set function ≈ γ(MAX_p h(p)). Shared MLP h, symmetric max-pool, MLP γ.
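The key structural point, permutation invariance from the symmetric max-pool, checked with a toy one-layer h (a random linear + ReLU standing in for the shared MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 64))              # shared per-point weights

def encode(points):                       # points: (N, 3)
    feats = np.maximum(points @ W, 0.0)   # h applied to every point identically
    return feats.max(axis=0)              # MAX over points: order-independent

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(128)]
assert np.allclose(encode(cloud), encode(shuffled))   # same global feature
```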
3DGS parameters per Gaussian
3 (μ) + 7 (R·S = 4 quat + 3 scale) + 1 (α) + 48 (SH deg-3 × 3 channels) = 59.
Σ = R·S·Sᵀ·Rᵀ guarantee
Decomposition forces positive semi-definite; direct optimization of a 6-DoF symmetric matrix can produce invalid Σ.
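Why the decomposition is safe, verified numerically: any quaternion + positive scales yield a valid covariance (the quaternion-to-matrix formula below is standard; exp keeps scales positive):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=4)
q /= np.linalg.norm(q)                    # unit quaternion
w, x, y, z = q
R = np.array([                            # quaternion -> rotation matrix
    [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
    [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
    [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
])
S = np.diag(np.exp(rng.normal(size=3)))   # positive scales via exp
M = R @ S
Sigma = M @ M.T                           # Sigma = R S S^T R^T
assert (np.linalg.eigvalsh(Sigma) > 0).all()   # PSD by construction
```

A freely optimized symmetric 3x3 has no such guarantee: a gradient step can push an eigenvalue negative.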
Attention scaling argument
QKᵀ variance grows with dₖ → softmax saturates → vanishing grads. /√dₖ rescales variance to 1.
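The variance argument, measured empirically on unit-variance random q, k:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
logits = (q * k).sum(axis=1)              # q . k : variance grows like d_k
scaled = logits / np.sqrt(d_k)            # rescaled back to ~unit variance
print(logits.var(), scaled.var())         # ~64 vs ~1
```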
ViT-B/16 quick numbers
224² image, 16² patch ⇒ 14×14 = 196 patches; +CLS = 197. d=768, L=12, h=12, ≈ 86 M params.
ViT param formula per block
4·d² (QKVO) + 2·d·d_ff (FFN). With d=768, d_ff=3072 ⇒ ≈ 7 M / layer × 12 ≈ 85 M.
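The formula plugged in, as arithmetic (biases, embeddings, and LayerNorms deliberately excluded, which is why this lands slightly under the 86 M total):

```python
d, d_ff, L = 768, 3072, 12

attn = 4 * d * d                 # W_Q, W_K, W_V, W_O
ffn = 2 * d * d_ff               # the two FFN matrices
per_block = attn + ffn
total = L * per_block

assert per_block == 7_077_888    # ~7 M per layer
assert total == 84_934_656       # ~85 M
```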
DINO output dim
65,536 — a deliberately over-complete projection-head output; empirically, large K helps (collapse itself is countered by centering + sharpening).
DINO anti-collapse pair
Centering (running mean subtraction) — spreads distribution; Sharpening (low τ teacher) — peaky target. Both balance.
MAE vs BERT mask ratio
MAE 75%, BERT 15%. Images have higher spatial redundancy → must mask more to force semantic learning.
RoPE pair-rotation
Rotate (q_{2i}, q_{2i+1}) by m·θᵢ. Dot product depends only on (m − n) → relative position.
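The relative-position property checked on a single rotated pair (one frequency θ only, for clarity):

```python
import numpy as np

def rope2(x, m, theta=0.1):
    """Rotate one (x_{2i}, x_{2i+1}) pair by angle m*theta."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
d1 = rope2(q, 7) @ rope2(k, 3)            # positions (7, 3): offset 4
d2 = rope2(q, 104) @ rope2(k, 100)        # positions (104, 100): same offset
assert np.isclose(d1, d2)                 # dot depends only on m - n
```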
GQA reduction
Group h heads into G clusters, share K,V per group. KV-cache shrinks by h/G with minimal quality loss.
PaliGemma three pillars
SigLIP encoder → linear connector (only random-init component) → Gemma decoder; notably, nothing is frozen during multimodal pretraining.
Prefix-LM mask
Image + prompt: bidirectional. Suffix (answer): causal. Loss only on suffix.
M-RoPE triple
(t, r, c). Static images: t=0. Text: t=r=c=token index. Video: full T×H×W.
I3D inflation
Copy 2D K×K filter K times along time axis, divide by K. 'Boring video' of static image gives identical activations.
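The 'boring video' identity verified at a single filter location (T here stands in for the temporal kernel extent):

```python
import numpy as np

rng = np.random.default_rng(0)
k2d = rng.normal(size=(3, 3))             # pretrained 2D filter
T = 5
k3d = np.stack([k2d / T] * T)             # inflate: T copies, each divided by T

frame = rng.normal(size=(3, 3))
video = np.stack([frame] * T)             # 'boring video': frame repeated T times

resp2d = (k2d * frame).sum()              # 2D response
resp3d = (k3d * video).sum()              # 3D response on the boring video
assert np.isclose(resp2d, resp3d)         # identical activations
```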
Two-Stream split
Spatial stream (RGB frame, 'what'); Temporal stream (stack of optical flow, 'how'). Late fusion.
Focal loss formula
FL = −(1−pₜ)^γ · log(pₜ). γ=2 typical. Down-weights easy negatives → single-stage matches two-stage.
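The down-weighting made concrete (binary form without the α balancing term, matching the cue above):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """FL = -(1 - p_t)^gamma * log(p_t), with p = P(class 1), y in {0, 1}."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# Easy negative (p = 0.1, y = 0): modulating factor (1 - 0.9)^2 = 0.01,
# i.e. a 100x down-weight relative to plain cross-entropy.
ce = -np.log(0.9)
fl = focal_loss(np.array([0.1]), np.array([0]))[0]
assert np.isclose(fl, 0.01 * ce)
```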
Modern Transformer upgrade list (7)
PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash Attention.