
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Computer Vision (Spring 2025-26) covers modern deep-learning approaches to visual understanding: object detection, dense prediction, pose estimation, 3D representations, NeRF & 3D Gaussian Splatting, Transformers and Vision Transformers, self-supervised learning (SimCLR, DINO, MAE, JEPA), modern transformer engineering, vision-language models, and video understanding. This revision hub distils the entire syllabus into chapter-wise notes, cheatsheets, high-yield topics and practice questions — designed so you can revise the whole course in an evening, not a week.

Syllabus

Unit 1 — Object Detection

1 chapter

From R-CNN's slow per-region forwards to YOLO's single-shot grid prediction: detection is the bedrock of modern vision. Master the R-CNN family, anchors, NMS, mAP, and the losses that make single-stage detectors competitive.
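
A minimal greedy NMS sketch in PyTorch (the function name and the 0.5 IoU threshold are illustrative, not from the course materials):

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat.

    boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,). Returns kept indices.
    """
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # IoU of the top box against all remaining boxes
        top, rest = boxes[i], boxes[order[1:]]
        lt = torch.maximum(top[:2], rest[:, :2])
        rb = torch.minimum(top[2:], rest[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_top = (top[2] - top[0]) * (top[3] - top[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_top + area_rest - inter)
        order = order[1:][iou <= iou_thresh]  # keep only low-overlap boxes
    return torch.tensor(keep)
```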

Unit 2 — Dense Prediction: Segmentation + Depth

1 chapter

From sparse boxes to per-pixel labels. FCN, U-Net, dilated convolutions, Mask R-CNN and RoI Align for segmentation; MiDaS and scale-invariant losses for depth.
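
The scale-invariant log loss is short enough to sketch directly; this follows the Eigen et al. (2014) formulation, with `lam` and `eps` as assumed hyperparameters:

```python
import torch

def scale_invariant_loss(pred, target, lam: float = 0.5, eps: float = 1e-6):
    """Scale-invariant log loss for monocular depth.

    Scaling pred globally adds a constant to every log-difference d_i;
    with lam = 1 the loss becomes Var(d), which cancels that constant,
    so only relative depth is penalised.
    """
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d.numel()
    return (d ** 2).mean() - lam * d.sum() ** 2 / (n * n)
```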

Unit 3 — Pose Estimation

1 chapter

Single-person and multi-person human pose. Heatmap regression, CPM, OpenPose with Part Affinity Fields, top-down vs bottom-up, and SMPL for 3D body recovery.
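
Heatmap regression replaces direct coordinate regression with dense Gaussian targets, one map per joint; a sketch of target generation (the sigma value is an assumed default):

```python
import torch

def gaussian_heatmap(h: int, w: int, cx: float, cy: float, sigma: float = 2.0):
    """Per-joint heatmap target: a 2D Gaussian peaked at the keypoint.

    The network predicts these maps instead of raw (x, y) coordinates,
    turning pose into a dense, spatially smooth prediction problem.
    """
    ys = torch.arange(h).float().unsqueeze(1)  # (h, 1) row coordinates
    xs = torch.arange(w).float().unsqueeze(0)  # (1, w) column coordinates
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```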

Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)

1 chapter

Why CNNs break on point clouds, voxelization's curse of dimensionality, PointNet's permutation-invariance theorem, hierarchical and graph-based variants, and operating on meshes directly via MeshCNN.
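
The permutation-invariance argument fits in a few lines: a shared per-point MLP followed by a symmetric pooling function. A toy sketch (layer widths are illustrative, far smaller than the real PointNet):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet sketch: shared per-point MLP + max-pool.

    Max over the point axis is a symmetric function, so the output is
    identical for any reordering of the N input points.
    """
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, pts):                    # pts: (B, N, 3)
        feats = self.mlp(pts)                  # (B, N, 256), same MLP per point
        global_feat = feats.max(dim=1).values  # (B, 256), order-independent
        return self.head(global_feat)
```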

Unit 5 — NeRF & 3D Gaussian Splatting

1 chapter

From implicit volumetric rendering (NeRF) to explicit-primitive splatting (3DGS): per-scene optimisation, differentiable rasterisation, spherical harmonics for view-dependence, and adaptive density control.
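
NeRF's rendering equation, C = Σᵢ Tᵢ(1 − e^(−σᵢδᵢ))cᵢ with transmittance Tᵢ = e^(−Σ_{j<i} σⱼδⱼ), composites samples along a ray; a single-ray sketch (the helper name is illustrative):

```python
import torch

def volume_render(sigmas, colors, deltas):
    """NeRF-style compositing along one ray.

    sigmas: (S,) densities, colors: (S, 3), deltas: (S,) sample spacings.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)           # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)   # light surviving so far
    trans = torch.cat([torch.ones(1), trans[:-1]])       # shift so T_1 = 1
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)   # (3,) pixel colour
```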

Unit 6 — Attention & Transformers

1 chapter

Why attention beats RNN bottlenecks, scaled dot-product attention with the √dₖ rationale, multi-head attention, encoder/decoder masking, positional encodings, and the Show-Attend-and-Tell precursor.
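
The core equation, softmax(QKᵀ/√dₖ)V, in a few lines (a generic sketch, not the course's reference code):

```python
import math
import torch

def attention(q, k, v, mask=None):
    """Scaled dot-product attention.

    The 1/sqrt(d_k) factor keeps the logits' variance near 1 so the
    softmax does not saturate as the head dimension grows.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., Lq, Lk)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v
```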

Unit 7 — Vision Transformers (ViT)

1 chapter

Image-as-tokens: patchify, project, prepend a [CLS], add positional embeddings, run a Transformer encoder, classify. ViT scales beautifully but needs massive data; Swin localises attention for efficiency.
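
The ViT front end in miniature (ViT-Base-style sizes assumed; learnable positional embeddings as in the original paper, zero-initialised here for brevity):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT front end: patchify with a strided conv, flatten, prepend [CLS].

    A Conv2d with kernel = stride = patch size is exactly 'split into
    patches and apply one shared linear projection'.
    """
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n = (img // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                            # x: (B, 3, H, W)
        t = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, t], dim=1) + self.pos  # ready for the encoder
```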

Unit 8 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)

1 chapter

Self-supervised learning derives labels from the data itself. Contrastive methods pull positive pairs together and push negatives apart; CLIP scales this to image-text. Includes BYOL's 'no negatives' trick.
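
SimCLR's NT-Xent loss is the pull/push idea made concrete; a sketch assuming z1[i] and z2[i] are embeddings of two views of the same image, with a temperature of 0.5:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau: float = 0.5):
    """NT-Xent: each embedding's positive is its other view; every other
    embedding in the 2B batch serves as a negative."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, D) on the unit sphere
    sim = z @ z.t() / tau                         # cosine similarities
    sim.fill_diagonal_(float('-inf'))             # exclude self-pairs
    B = z1.size(0)
    # row i's positive sits at i + B (and vice versa)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)
```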

Unit 9 — SSL: DINO, MAE, JEPA

1 chapter

Self-distillation without labels (DINO), masked reconstruction in pixel space (MAE), and prediction in representation space (JEPA). DINO's anti-collapse tricks are heavily tested.
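
A sketch of the DINO objective showing both anti-collapse tricks, centering and teacher sharpening (the temperatures are typical values, assumed here):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              t_s: float = 0.1, t_t: float = 0.04):
    """Cross-entropy from sharpened teacher targets to the student.

    Centering (subtracting a running mean) stops one dimension from
    dominating; the low teacher temperature (sharpening) stops collapse
    to the uniform distribution. Both are needed together.
    """
    t = F.softmax((teacher_logits - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_logits / t_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()
```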

Unit 10 — Transformer Advances (ViT-5 era)

1 chapter

The seven upgrades that take you from the 2017 Transformer to the modern LLM/VLM stack: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, and GQA + KV-cache + Flash Attention.
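
One of the listed upgrades, RMSNorm, as a sketch (the eps value is an assumed default):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: LayerNorm without mean-subtraction or bias.

    Normalises by the root-mean-square of the features only; cheaper
    than LayerNorm and the default in modern LLM/VLM stacks.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight
```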

Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)

1 chapter

Vision-Language Models: the 3-pillar blueprint, Prefix-LM masking, SigLIP vs CLIP loss, dynamic resolution + M-RoPE for video, and the move toward native multimodal models.
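
Prefix-LM masking in one function: bidirectional over the image-plus-prompt prefix, causal over the generated suffix (a sketch; actual PaliGemma implementation details may differ):

```python
import torch

def prefix_lm_mask(prefix_len: int, total_len: int):
    """Build a (total_len, total_len) bool mask; True = may attend.

    Prefix tokens (image + prompt) attend to each other bidirectionally;
    suffix tokens (generated text) attend causally, plus the full prefix.
    """
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[:, :prefix_len] = True  # every token sees the entire prefix
    return mask
```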

Unit 12 — Video Understanding

1 chapter

Beyond treating video as a stack of T images: action recognition, temporal localisation, 3D CNNs (I3D), Two-Stream, SlowFast, ViViT, and TimeSformer's divided space-time attention.
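
A shape-level sketch of divided space-time attention (assumes `time_attn` and `space_attn` are `nn.MultiheadAttention(D, num_heads, batch_first=True)` modules; residuals and norms are omitted):

```python
import torch

def divided_space_time(x, time_attn, space_attn):
    """TimeSformer-style divided attention.

    x: (B, T, N, D) video tokens (T frames, N patches per frame).
    Attention is quadratic in T and in N separately, instead of in
    T*N jointly as in full space-time attention.
    """
    B, T, N, D = x.shape
    # time: group by spatial position, attend across the T frames
    xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
    xt, _ = time_attn(xt, xt, xt)
    x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
    # space: group by frame, attend across the N patches
    xs = x.reshape(B * T, N, D)
    xs, _ = space_attn(xs, xs, xs)
    return xs.reshape(B, T, N, D)
```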

Weightage

Unit 1 — Object Detection · 12%
Unit 2 — Dense Prediction (Segmentation + Depth) · 10%
Unit 3 — Pose Estimation · 8%
Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN) · 8%
Unit 5 — NeRF & 3D Gaussian Splatting · 8%
Unit 6 — Attention & Transformers · 10%
Unit 7 — Vision Transformers (ViT) · 8%
Unit 8 — SSL: Contrastive (SimCLR/MoCo/BYOL/CLIP) · 8%
Unit 9 — SSL: DINO / MAE / JEPA · 8%
Unit 10 — Transformer Advances · 5%
Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4) · 7%
Unit 12 — Video Understanding · 8%

Exam pattern

Quizzes (~15%) · Assignments / programming (~30%) · Mid-Sem (~20%) · End-Sem (~35%). Confirm the exact split with the instructors.

Important dates

  • Quiz 1 (tentative): 2026-02-10
  • Mid-Sem Exam: 2026-03-05
  • Quiz 2 (tentative): 2026-04-02
  • Assignment 4 deadline: 2026-04-19
  • End-Sem Exam: 2026-04-25

Professor notes

  • Tapaswi-Sharma quizzes lean conceptual: 'why does X work?' rather than 'recite X'. Have a one-line rationale ready for every named architecture.
  • Assignments emphasise implementation depth — Assignment 3 extends Faster R-CNN to oriented boxes and adds a multi-task U-Net; Assignment 4 builds ViT and CvT from scratch. Expect case-study questions adjacent to those tasks.
  • Watch out for 'list the upgrades' enumeration questions on modern transformers — write them as a numbered list, not prose.
  • Multimodal / vision-language case studies are increasingly common; rehearse the 3-pillar VLM blueprint and Prefix-LM masking pattern.
  • Diagrams earn marks. Spend 1 minute on a clean block diagram before writing answer prose.