Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Formulas & Diagrams
High-ROI section — formulas improve marks, diagrams improve recall.
Formulas
Convolution output size
—
O = \lfloor (W - F + 2P)/S \rfloor + 1
Spatial output dim for a conv with kernel F, padding P, stride S on an input of width W.
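A two-line sketch of this formula (illustrative, not from the course materials):

```python
def conv_out(W, F, P=0, S=1):
    """Spatial output size of a conv layer: floor((W - F + 2P)/S) + 1."""
    return (W - F + 2 * P) // S + 1

# e.g. a 224-wide input through a 7x7 kernel, padding 3, stride 2 gives 112
```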
Intersection-over-Union (IoU)
—
\text{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|}
Box overlap metric. Detection true-positive threshold is typically 0.5.
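A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (illustrative):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```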
GIoU loss
—
\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}
Bounded in (−1, 1]; gives a non-zero gradient even when boxes don't overlap (C = smallest enclosing box). The loss is 1 − GIoU.
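Extending the IoU computation with the enclosing box C (illustrative sketch):

```python
def giou(a, b):
    """GIoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box C
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area
```

Note how disjoint boxes now get a negative score that grows with their separation, which is what restores the gradient.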
Dice coefficient
—
\text{Dice} = \frac{2|A \cap B|}{|A| + |B|} = \frac{2 \,\text{IoU}}{1 + \text{IoU}}
Segmentation overlap. Note the denominator is the SUM of set sizes, not the union.
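On binary masks this is a one-liner; a NumPy sketch (illustrative):

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two binary masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

# e.g. masks [1,1,0,0] vs [1,0,1,0]: IoU = 1/3, Dice = 2*(1/3)/(1 + 1/3) = 0.5
```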
Focal loss
—
FL(p_t) = -(1-p_t)^{\gamma} \, \log p_t
γ ≈ 2. Down-weights well-classified examples to combat foreground/background imbalance.
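A minimal sketch (illustrative; real implementations work on logits for stability):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss per example; p = predicted prob of class 1, y in {0, 1}."""
    pt = np.where(y == 1, p, 1 - p)       # prob assigned to the true class
    return -((1 - pt) ** gamma) * np.log(pt)
```

With γ = 0 this reduces to plain cross-entropy; with γ = 2 a confidently correct example contributes almost nothing.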
Scaled dot-product attention
—
\text{Attn}(Q,K,V) = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right) V
Core of every Transformer. Dividing by √dₖ keeps the logits at unit scale so the softmax doesn't saturate.
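A minimal single-sequence NumPy sketch (illustrative; no batching or masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # each row: weights over the n_k keys
    return A @ V
```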
Multi-head attention
—
\text{MHA}(Q,K,V) = \text{Concat}(h_1,\dots,h_H)\,W^O,\quad h_i = \text{Attn}(QW_i^Q, KW_i^K, VW_i^V)
H parallel attentions in lower-dimensional subspaces; heads learn diverse relationships.
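The head-splitting can be sketched by slicing the projected features (illustrative; single sequence, no batching):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def mha(Q, K, V, Wq, Wk, Wv, Wo, H):
    """Project, split into H heads along the feature dim, attend, concat, project."""
    q, k, v = Q @ Wq, K @ Wk, V @ Wv
    dh = q.shape[-1] // H
    heads = [attention(q[:, i*dh:(i+1)*dh], k[:, i*dh:(i+1)*dh], v[:, i*dh:(i+1)*dh])
             for i in range(H)]
    return np.concatenate(heads, axis=-1) @ Wo
```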
InfoNCE / NT-Xent
—
L_i = -\log \frac{\exp(\text{sim}(z_i,z_i^+)/\tau)}{\sum_j \exp(\text{sim}(z_i,z_j)/\tau)}
Contrastive loss: cross-entropy with the positive pair's logit in the numerator.
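Viewed as cross-entropy over cosine-similarity logits, a batch version is short (illustrative sketch; positives on the diagonal):

```python
import numpy as np

def info_nce(z, zp, tau=0.1):
    """z, zp: (N, d); row i of zp is the positive for row i of z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    zp = zp / np.linalg.norm(zp, axis=1, keepdims=True)
    logits = z @ zp.T / tau                              # (N, N), diag = positives
    logits -= logits.max(axis=1, keepdims=True)          # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                       # CE with diagonal labels
```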
SigLIP pairwise sigmoid loss
—
L = -\tfrac{1}{n^2}\sum_{i,j}\Big[y_{ij}\log \sigma(z_{ij}) + (1-y_{ij})\log(1-\sigma(z_{ij}))\Big]
Independent binary classification per pair; scales without cross-device batch synchronization.
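Since every pair is an independent binary decision, this is just BCE over the logit matrix (illustrative sketch; the real SigLIP also learns a temperature and bias on z):

```python
import numpy as np

def siglip_loss(z, y):
    """z: (n, n) pairwise logits; y: (n, n) labels, 1 for matched image-text pairs."""
    p = 1.0 / (1.0 + np.exp(-z))          # per-pair sigmoid, no row-wise softmax
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```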
PSNR
—
\text{PSNR} = 10 \log_{10}\!\left(\tfrac{R^2}{\text{MSE}}\right)
R = maximum pixel value. Higher is better; ≥ 30 dB indicates good reconstruction.
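A direct transcription of the formula (illustrative):

```python
import numpy as np

def psnr(x, y, R=1.0):
    """PSNR in dB between images x and y with peak value R."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10 * np.log10(R * R / mse)
```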
RoPE pair rotation
—
\begin{bmatrix} q'_{2i}\\ q'_{2i+1}\end{bmatrix} = \begin{bmatrix}\cos m\theta_i & -\sin m\theta_i\\ \sin m\theta_i & \cos m\theta_i\end{bmatrix}\begin{bmatrix} q_{2i}\\ q_{2i+1}\end{bmatrix}
Rotates each (Q, K) dimension pair by angle m·θᵢ. The resulting dot product depends only on the relative position m − n.
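The pairwise rotation, and the relative-position property it buys, can be checked numerically (illustrative sketch; θᵢ follows the usual 10000^(−2i/d) schedule):

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate consecutive dimension pairs of x by m * theta_i (x has even dim)."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    ang = m * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out
```

Shifting both positions by the same offset leaves q·k unchanged, which is exactly the "relative position only" claim.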
RMSNorm
—
y = \gamma \cdot \frac{x}{\sqrt{\tfrac{1}{d}\sum_i x_i^2}}
No mean subtraction; cheaper than LayerNorm; rescales onto a sphere.
LayerNorm
—
y = \gamma \cdot \frac{x - \mu}{\sigma} + \beta
Normalizes per token across the feature dimension; independent of batch size.
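Both norms side by side make the difference concrete: on zero-mean inputs they coincide (illustrative sketch; eps added for numerical safety):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: rescale by root-mean-square, no mean subtraction, no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm: subtract mean, divide by std, then affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```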
3DGS alpha compositing
—
C = \sum_i c_i\,\alpha_i \prod_{j<i}(1 - \alpha_j)
Front-to-back blending of depth-sorted, 2D-projected Gaussians.
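Accumulating transmittance front-to-back gives the same sum without recomputing the product (illustrative sketch for one pixel):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing; colors sorted nearest first."""
    C = np.zeros(3)
    T = 1.0                                  # transmittance = prod of (1 - alpha_j)
    for c, a in zip(colors, alphas):
        C = C + T * a * np.asarray(c, float)
        T *= 1.0 - a                         # light remaining after this splat
    return C
```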
3DGS covariance parameterization
—
\Sigma = R\,S\,S^\top R^\top
Decomposition guarantees PSD; R from unit quaternion, S from log-space scales.
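The PSD guarantee is easy to verify: Σ = (RS)(RS)ᵀ is a Gram matrix, so it is symmetric with non-negative eigenvalues (illustrative NumPy sketch; quaternion-to-rotation via the standard formula):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a (w, x, y, z) quaternion (normalized internally)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, log_s):
    """Sigma = R S S^T R^T with S = diag(exp(log_s)): PSD by construction."""
    R = quat_to_rot(np.asarray(q, float))
    M = R @ np.diag(np.exp(log_s))
    return M @ M.T
```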
SMPL forward model
—
M(\beta,\theta) = W(T_P(\beta,\theta),\, J(\beta),\, \theta,\, \mathcal{W})
Template + shape blendshapes + pose blendshapes + linear blend skinning. β ∈ ℝ¹⁰, θ ∈ ℝ⁷².
PCK@α
—
\text{PCK@}\alpha = \tfrac{1}{N}\sum_i \mathbb{1}\!\left[\|\hat p_i - p_i\| \le \alpha \cdot d_{\text{ref}}\right]
Pose correctness. PCKh@0.5: d_ref = head bone length.
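A direct vectorized transcription (illustrative):

```python
import numpy as np

def pck(pred, gt, d_ref, alpha=0.5):
    """Fraction of keypoints within alpha * d_ref of ground truth.
    pred, gt: (N, 2) keypoint coordinates."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * d_ref))
```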
YOLO loss (sum of 5)
—
L = \lambda_{\text{coord}}\sum_{ij}^{\text{obj}}\big[(x-\hat x)^2 + (y-\hat y)^2\big] + \lambda_{\text{coord}}\sum_{ij}^{\text{obj}}\big[(\sqrt w - \sqrt{\hat w})^2 + (\sqrt h - \sqrt{\hat h})^2\big] + \sum_{ij}^{\text{obj}}(C-\hat C)^2 + \lambda_{\text{noobj}}\sum_{ij}^{\text{noobj}}(C-\hat C)^2 + \sum_i^{\text{obj}}\sum_c (p(c)-\hat p(c))^2
Five terms: center, size (square roots balance small vs. large boxes), object confidence, no-object confidence (λ_noobj = 0.5), and class probabilities. λ_coord = 5.
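A heavily simplified sketch over a flat list of cells, assuming box-to-cell assignment is already done (illustrative only; the real loss runs over an S×S×B grid with responsible-predictor selection):

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, l_coord=5.0, l_noobj=0.5):
    """pred, target: (N, 5 + C) rows of (x, y, w, h, conf, class probs...).
    obj_mask: (N,) bool, True where a cell is responsible for an object."""
    obj = np.asarray(obj_mask, bool)
    xy = np.sum((pred[obj, :2] - target[obj, :2]) ** 2)
    wh = np.sum((np.sqrt(pred[obj, 2:4]) - np.sqrt(target[obj, 2:4])) ** 2)
    conf_obj = np.sum((pred[obj, 4] - target[obj, 4]) ** 2)
    conf_noobj = np.sum((pred[~obj, 4] - target[~obj, 4]) ** 2)
    cls = np.sum((pred[obj, 5:] - target[obj, 5:]) ** 2)
    return l_coord * (xy + wh) + conf_obj + l_noobj * conf_noobj + cls
```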
Triplet loss
—
L = \max(0,\ \|A-P\|^2 - \|A-N\|^2 + \alpha)
Anchor / Positive / Negative with margin α. Used in metric learning.
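A minimal sketch with squared Euclidean distances (illustrative):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    """Hinge on squared distances: pull anchor-positive together,
    push anchor-negative apart by at least `margin`."""
    d_ap = np.sum((a - p) ** 2, axis=-1)
    d_an = np.sum((a - n) ** 2, axis=-1)
    return np.maximum(0.0, d_ap - d_an + margin)
```

Easy triplets (negative already far) contribute exactly zero, which is why hard-negative mining matters in practice.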
Diagrams
R-CNN → Fast → Faster R-CNN evolution
Side-by-side block diagrams: per-region CNN forward (R-CNN); single CNN forward + RoI pooling (Fast); shared backbone + RPN + RoI head (Faster). Annotate the bottleneck in each.
[ diagram placeholder ]
YOLO grid output tensor
S × S grid overlaid on an image; each cell predicts B boxes (x,y,w,h,conf) + C class probs → S × S × (B·5 + C) tensor.
[ diagram placeholder ]
FCN-32s vs FCN-8s skip fusion
Encoder downsamples 32×; decoder upsamples. FCN-8s adds pool3 + pool4 skip connections fused with deep features.
[ diagram placeholder ]
U-Net symmetric encoder/decoder
Contracting path on left, expanding path on right; horizontal concat skip connections at every resolution.
[ diagram placeholder ]
OpenPose: heatmaps + PAFs
Network outputs K keypoint heatmaps and 2L PAF channels (x,y components per limb). Bipartite matching groups keypoints.
[ diagram placeholder ]
SMPL pipeline
Template mesh → shape blendshapes (β) → pose blendshapes + skinning (θ) → posed 3D mesh (6890 vertices).
[ diagram placeholder ]
PointNet architecture
Shared MLP per point (N × D → N × F) → symmetric max-pool over N → global feature → final MLP. T-Net for input + feature transform.
[ diagram placeholder ]
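The key PointNet property, permutation invariance from the symmetric max-pool, can be demonstrated in a few lines (toy sketch: a single random linear+ReLU layer stands in for the shared MLP; all names illustrative):

```python
import numpy as np

def pointnet_feature(points, W):
    """points: (N, 3); W: (3, F) shared weights applied per point."""
    h = np.maximum(points @ W, 0.0)  # shared per-point layer (stand-in for the MLP)
    return h.max(axis=0)             # symmetric max-pool over N -> global feature
```

Reordering the input points leaves the global feature unchanged, since max over a set ignores order.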
3D Gaussian Splatting pipeline
Images → COLMAP → sparse point cloud + camera poses → init Gaussians → render → image-space loss → backprop → adaptive density control (clone/split/prune).
[ diagram placeholder ]
Transformer block (PreNorm)
x → LN → MHA → +residual → LN → FFN(D → 4D → D) → +residual. Modern variant; unbroken residual stream.
[ diagram placeholder ]
ViT pipeline
Image → P×P patches → linear projection → +[CLS] → +position embeddings → L encoder layers → CLS output → MLP head → class logits.
[ diagram placeholder ]
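The patchify step of the pipeline above is a pure reshape/transpose (illustrative sketch for a single image):

```python
import numpy as np

def patchify(img, P):
    """(H, W, C) image -> (num_patches, P*P*C) flattened patch tokens."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)  # row-major over the patch grid
```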
Swin shifted-window attention
M×M local windows; alternating layers shift window grid by M/2 so tokens at window edges land in window interior next layer.
[ diagram placeholder ]
DINO self-distillation
Student + EMA teacher with identical architecture; multi-crop (2 global → teacher; 6-10 local → student). Centering + sharpening on teacher output.
[ diagram placeholder ]
MAE asymmetric encoder/decoder
75% patches masked; encoder sees only visible 25%; small decoder reconstructs masked pixels using mask tokens at the right positions.
[ diagram placeholder ]
PaliGemma three pillars
SigLIP (frozen) → linear connector (random init) → Gemma decoder. Prefix-LM mask: image+prompt bidirectional, suffix causal.
[ diagram placeholder ]
SlowFast dual pathway
Slow pathway: low fps, high channels (semantics). Fast pathway: high fps, low channels (motion). Lateral connections fuse.
[ diagram placeholder ]