Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Formulas & Diagrams
High-ROI section — formulas improve marks, diagrams improve recall.
Formulas
Convolution output size
—
O = \lfloor (W - F + 2P)/S \rfloor + 1
Spatial output dim for a conv with kernel F, padding P, stride S on an input of width W.
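A two-line sketch of this formula (illustrative, not from the course materials):

```python
def conv_out(W, F, P=0, S=1):
    """Spatial output size of a conv layer: floor((W - F + 2P)/S) + 1."""
    return (W - F + 2 * P) // S + 1

# e.g. a 224-wide input through a 7x7 kernel, padding 3, stride 2 gives 112
```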
Intersection-over-Union (IoU)
—
\text{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|}
Box overlap metric. Detection true-positive threshold is typically 0.5.
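A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```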
GIoU loss
—
\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}
Bounded in (−1, 1]; gives a non-zero gradient even when boxes don't overlap (C = smallest enclosing box). The loss is 1 − GIoU.
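Extending the IoU computation with the enclosing box C (illustrative sketch):

```python
def giou(a, b):
    """GIoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box C
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area
```

Note how disjoint boxes now get a negative score that grows with their separation, which is what restores the gradient.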
Dice coefficient
—
\text{Dice} = \frac{2|A \cap B|}{|A| + |B|} = \frac{2 \,\text{IoU}}{1 + \text{IoU}}
Segmentation overlap. Note the denominator is the SUM of set sizes, not the union.
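On binary masks this is a one-liner; a NumPy sketch (illustrative):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

# e.g. masks [1,1,0,0] vs [1,0,1,0]: IoU = 1/3, Dice = 2*(1/3)/(1 + 1/3) = 0.5
```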
Focal loss
—
FL(p_t) = -(1-p_t)^{\gamma} \, \log p_t
γ ≈ 2. Down-weights well-classified examples to combat foreground/background imbalance.
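A minimal sketch (illustrative; real implementations work on logits for stability):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss per example; p = predicted prob of class 1, y in {0, 1}."""
    pt = np.where(y == 1, p, 1 - p)       # prob assigned to the true class
    return -((1 - pt) ** gamma) * np.log(pt)
```

With γ = 0 this reduces to plain cross-entropy; with γ = 2 a confidently correct example contributes almost nothing.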
Scaled dot-product attention
—
\text{Attn}(Q,K,V) = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right) V
Core of every Transformer. Dividing by √dₖ keeps the logits at unit scale so the softmax doesn't saturate.
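A minimal single-sequence NumPy sketch (illustrative; no batching or masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # each row: weights over the n_k keys
    return A @ V
```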
Multi-head attention
—
\text{MHA}(Q,K,V) = \text{Concat}(h_1,\dots,h_H)\,W^O,\quad h_i = \text{Attn}(QW_i^Q, KW_i^K, VW_i^V)
H parallel attentions in lower-dimensional subspaces; heads learn diverse relationships.
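The head-splitting can be sketched by slicing the projected features (illustrative; single sequence, no batching):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def mha(Q, K, V, Wq, Wk, Wv, Wo, H):
    """Project, split into H heads along the feature dim, attend, concat, project."""
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    dh = q.shape[-1] // H
    heads = [attention(q[:, i*dh:(i+1)*dh], k[:, i*dh:(i+1)*dh], v[:, i*dh:(i+1)*dh])
             for i in range(H)]
    return np.concatenate(heads, axis=-1) @ Wo
```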
InfoNCE / NT-Xent
—
L_i = -\log \frac{\exp(\text{sim}(z_i,z_i^+)/\tau)}{\sum_j \exp(\text{sim}(z_i,z_j)/\tau)}
Contrastive loss: cross-entropy with the positive pair's logit in the numerator.
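Viewed as cross-entropy over cosine-similarity logits, a batch version is short (illustrative sketch; positives on the diagonal):

```python
import numpy as np

def info_nce(z, zp, tau=0.1):
    """z, zp: (N, d); row i of zp is the positive for row i of z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    zp = zp / np.linalg.norm(zp, axis=1, keepdims=True)
    logits = z @ zp.T / tau                              # (N, N), diag = positives
    logits -= logits.max(axis=1, keepdims=True)          # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                       # CE with diagonal labels
```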
SigLIP pairwise sigmoid loss
—
L = -\tfrac{1}{n^2}\sum_{i,j}\Big[y_{ij}\log \sigma(z_{ij}) + (1-y_{ij})\log(1-\sigma(z_{ij}))\Big]
Independent binary classification per pair; scales without cross-device batch synchronization.
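Since every pair is an independent binary decision, this is just BCE over the logit matrix (illustrative sketch; the real SigLIP also learns a temperature and bias on z):

```python
import numpy as np

def siglip_loss(z, y):
    """z: (n, n) pairwise logits; y: (n, n) labels, 1 for matched image-text pairs."""
    p = 1.0 / (1.0 + np.exp(-z))          # per-pair sigmoid, no row-wise softmax
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```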
PSNR
—
\text{PSNR} = 10 \log_{10}\!\left(\tfrac{R^2}{\text{MSE}}\right)
R = maximum pixel value. Higher is better; ≥ 30 dB indicates good reconstruction.
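A direct transcription of the formula (illustrative):

```python
import numpy as np

def psnr(x, y, R=1.0):
    """PSNR in dB between images x and y with peak value R."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10 * np.log10(R * R / mse)
```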
RoPE pair rotation
—
\begin{bmatrix} q'_{2i}\\ q'_{2i+1}\end{bmatrix} = \begin{bmatrix}\cos m\theta_i & -\sin m\theta_i\\ \sin m\theta_i & \cos m\theta_i\end{bmatrix}\begin{bmatrix} q_{2i}\\ q_{2i+1}\end{bmatrix}
Rotates each (Q, K) dimension pair by angle m·θᵢ. The resulting dot product depends only on the relative position m − n.
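The pairwise rotation, and the relative-position property it buys, can be checked numerically (illustrative sketch; θᵢ follows the usual 10000^(−2i/d) schedule):

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate consecutive dimension pairs of x by m * theta_i (x has even dim)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    ang = m * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

Shifting both positions by the same offset leaves q·k unchanged, which is exactly the "relative position only" claim.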
RMSNorm
—
y = \gamma \cdot \frac{x}{\sqrt{\tfrac{1}{d}\sum_i x_i^2}}
No mean subtraction; cheaper than LayerNorm; rescales onto a sphere.
LayerNorm
—
y = \gamma \cdot \frac{x - \mu}{\sigma} + \beta
Normalizes per token across the feature dimension; independent of batch size.
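Both norms side by side make the difference concrete: on zero-mean inputs they coincide (illustrative sketch; eps added for numerical safety):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by root-mean-square, no mean subtraction, no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm: subtract mean, divide by std, then affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```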
3DGS alpha compositing
—
C = \sum_i c_i\,\alpha_i \prod_{j<i}(1 - \alpha_j)
Front-to-back blending of depth-sorted, 2D-projected Gaussians.
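Accumulating transmittance front-to-back gives the same sum without recomputing the product (illustrative sketch for one pixel):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing; colors sorted nearest first."""
    C = np.zeros(3)
    T = 1.0                                  # transmittance = prod of (1 - alpha_j)
    for c, a in zip(colors, alphas):
        C = C + T * a * np.asarray(c, float)
        T *= 1.0 - a                         # light remaining after this splat
    return C
```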
3DGS covariance parameterization
—
\Sigma = R\,S\,S^\top R^\top
Decomposition guarantees PSD; R from unit quaternion, S from log-space scales.
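The PSD guarantee is easy to verify: Σ = (RS)(RS)ᵀ is a Gram matrix, so it is symmetric with non-negative eigenvalues (illustrative NumPy sketch; quaternion-to-rotation via the standard formula):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a (w, x, y, z) quaternion (normalized internally)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, log_s):
    """Sigma = R S S^T R^T with S = diag(exp(log_s)): PSD by construction."""
    R = quat_to_rot(np.asarray(q, float))
    M = R @ np.diag(np.exp(log_s))
    return M @ M.T
```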
SMPL forward model
—
M(\beta,\theta) = W(T_P(\beta,\theta),\, J(\beta),\, \theta,\, \mathcal{W})
Template + shape blendshapes + pose blendshapes + linear blend skinning. β ∈ ℝ¹⁰, θ ∈ ℝ⁷².
PCK@α
—
\text{PCK@}\alpha = \tfrac{1}{N}\sum_i \mathbb{1}\!\left[\|\hat p_i - p_i\| \le \alpha \cdot d_{\text{ref}}\right]
Pose correctness. PCKh@0.5: d_ref = head bone length.
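A direct vectorized transcription (illustrative):

```python
import numpy as np

def pck(pred, gt, d_ref, alpha=0.5):
    """Fraction of keypoints within alpha * d_ref of ground truth.
    pred, gt: (N, 2) keypoint coordinates."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * d_ref))
```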
YOLO loss (sum of 5)
—
L = \lambda_{\text{coord}}\sum_{ij}^{\text{obj}}\big[(x-\hat x)^2 + (y-\hat y)^2\big] + \lambda_{\text{coord}}\sum_{ij}^{\text{obj}}\big[(\sqrt w - \sqrt{\hat w})^2 + (\sqrt h - \sqrt{\hat h})^2\big] + \sum_{ij}^{\text{obj}}(C-\hat C)^2 + \lambda_{\text{noobj}}\sum_{ij}^{\text{noobj}}(C-\hat C)^2 + \sum_i^{\text{obj}}\sum_c (p(c)-\hat p(c))^2
Five terms: center, size (square roots balance small vs. large boxes), object confidence, no-object confidence (λ_noobj = 0.5), and class probabilities. λ_coord = 5.
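A heavily simplified sketch over a flat list of cells, assuming box-to-cell assignment is already done (illustrative only; the real loss runs over an S×S×B grid with responsible-predictor selection):

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, l_coord=5.0, l_noobj=0.5):
    """pred, target: (N, 5 + C) rows of (x, y, w, h, conf, class probs...).
    obj_mask: (N,) bool, True where a cell is responsible for an object."""
    obj = np.asarray(obj_mask, bool)
    xy = np.sum((pred[obj, :2] - target[obj, :2]) ** 2)
    wh = np.sum((np.sqrt(pred[obj, 2:4]) - np.sqrt(target[obj, 2:4])) ** 2)
    conf_obj = np.sum((pred[obj, 4] - target[obj, 4]) ** 2)
    conf_noobj = np.sum((pred[~obj, 4] - target[~obj, 4]) ** 2)
    cls = np.sum((pred[obj, 5:] - target[obj, 5:]) ** 2)
    return l_coord * (xy + wh) + conf_obj + l_noobj * conf_noobj + cls
```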
Triplet loss
—
L = \max(0,\ \|A-P\|^2 - \|A-N\|^2 + \alpha)
Anchor / Positive / Negative with margin α. Used in metric learning.
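A minimal sketch with squared Euclidean distances (illustrative):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """Hinge on squared distances: pull anchor-positive together,
    push anchor-negative apart by at least `margin`."""
    d_ap = np.sum((a - p) ** 2, axis=-1)
    d_an = np.sum((a - n) ** 2, axis=-1)
    return np.maximum(0.0, d_ap - d_an + margin)
```

Easy triplets (negative already far) contribute exactly zero, which is why hard-negative mining matters in practice.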
Diagrams
R-CNN → Fast → Faster R-CNN evolution
Side-by-side block diagrams: per-region CNN forward (R-CNN); single CNN forward + RoI pooling (Fast); shared backbone + RPN + RoI head (Faster). Annotate the bottleneck in each.
[ diagram placeholder ]
YOLO grid output tensor
S × S grid overlaid on an image; each cell predicts B boxes (x,y,w,h,conf) + C class probs → S × S × (B·5 + C) tensor.
[ diagram placeholder ]
FCN-32s vs FCN-8s skip fusion
Encoder downsamples 32×; decoder upsamples. FCN-8s adds pool3 + pool4 skip connections fused with deep features.
[ diagram placeholder ]
U-Net symmetric encoder/decoder
Contracting path on left, expanding path on right; horizontal concat skip connections at every resolution.
[ diagram placeholder ]
OpenPose: heatmaps + PAFs
Network outputs K keypoint heatmaps and 2L PAF channels (x,y components per limb). Bipartite matching groups keypoints.
[ diagram placeholder ]
SMPL pipeline
Template mesh → shape blendshapes (β) → pose blendshapes + skinning (θ) → posed 3D mesh (6890 vertices).
[ diagram placeholder ]
PointNet architecture
Shared MLP per point (N × D → N × F) → symmetric max-pool over N → global feature → final MLP. T-Net for input + feature transform.
[ diagram placeholder ]
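The key PointNet property, permutation invariance from the symmetric max-pool, can be demonstrated in a few lines (toy sketch: a single random linear+ReLU layer stands in for the shared MLP; all names illustrative):

```python
import numpy as np

def pointnet_feature(points, W):
    """points: (N, 3); W: (3, F) shared weights applied per point."""
    h = np.maximum(points @ W, 0.0)  # shared per-point layer (stand-in for the MLP)
    return h.max(axis=0)             # symmetric max-pool over N -> global feature
```

Reordering the input points leaves the global feature unchanged, since max over a set ignores order.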
3D Gaussian Splatting pipeline
Images → COLMAP → sparse point cloud + camera poses → init Gaussians → render → image-space loss → backprop → adaptive density control (clone/split/prune).
[ diagram placeholder ]
Transformer block (PreNorm)
x → LN → MHA → +residual → LN → FFN(D → 4D → D) → +residual. Modern variant; unbroken residual stream.
[ diagram placeholder ]
ViT pipeline
Image → P×P patches → linear projection → +[CLS] → +position embeddings → L encoder layers → CLS output → MLP head → class logits.
[ diagram placeholder ]
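The patchify step of the pipeline above is a pure reshape/transpose (illustrative sketch for a single image):

```python
import numpy as np

def patchify(img, P):
    """(H, W, C) image -> (num_patches, P*P*C) flattened patch tokens."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)  # row-major over the patch grid
```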
Swin shifted-window attention
M×M local windows; alternating layers shift window grid by M/2 so tokens at window edges land in window interior next layer.
[ diagram placeholder ]
DINO self-distillation
Student + EMA teacher with identical architecture; multi-crop (2 global → teacher; 6-10 local → student). Centering + sharpening on teacher output.
[ diagram placeholder ]
MAE asymmetric encoder/decoder
75% patches masked; encoder sees only visible 25%; small decoder reconstructs masked pixels using mask tokens at the right positions.
[ diagram placeholder ]
PaliGemma three pillars
SigLIP (frozen) → linear connector (random init) → Gemma decoder. Prefix-LM mask: image+prompt bidirectional, suffix causal.
[ diagram placeholder ]
SlowFast dual pathway
Slow pathway: low fps, high channels (semantics). Fast pathway: high fps, low channels (motion). Lateral connections fuse.
[ diagram placeholder ]