Computer Vision
CSE471 Computer Vision (Spring 2025-26) covers modern deep-learning approaches to visual understanding: object detection, dense prediction, pose estimation, 3D representations, NeRF & 3D Gaussian Splatting, Transformers and Vision Transformers, self-supervised learning (SimCLR, DINO, MAE, JEPA), modern transformer engineering, vision-language models, and video understanding. This revision hub distils the entire syllabus into chapter-wise notes, cheatsheets, high-yield topics and practice questions — designed so you can revise the whole course in an evening, not a week.
Syllabus
Unit 1 — Object Detection
1 chapter · From R-CNN's slow per-region forward passes to YOLO's single-shot grid prediction: detection is the bedrock of modern vision. Master the R-CNN family, anchors, NMS, mAP, and the losses that make single-stage detectors competitive.
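For intuition, a minimal greedy NMS sketch in PyTorch (assuming boxes in (x1, y1, x2, y2) format; the 0.5 IoU threshold is illustrative):

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,)."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        # IoU of the top-scoring box against the remaining candidates
        top, rest = boxes[i], boxes[order[1:]]
        x1 = torch.maximum(top[0], rest[:, 0])
        y1 = torch.maximum(top[1], rest[:, 1])
        x2 = torch.minimum(top[2], rest[:, 2])
        y2 = torch.minimum(top[3], rest[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_top = (top[2] - top[0]) * (top[3] - top[1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_top + area_rest - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping the kept one
    return torch.tensor(keep)
```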
Unit 2 — Dense Prediction: Segmentation + Depth
1 chapter · From sparse boxes to per-pixel labels. FCN, U-Net, dilated convolutions, Mask R-CNN and RoI Align for segmentation; MiDaS and scale-invariant losses for depth.
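A sketch of the Eigen-style scale-invariant log loss often cited for monocular depth (λ = 0.5 follows Eigen et al.; MiDaS itself uses a scale-and-shift-invariant variant):

```python
import torch

def scale_invariant_log_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant depth loss in log space.
    pred, target: positive depth maps of the same shape."""
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d.numel()
    # Second term cancels any global scale error shared by all pixels
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```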
Unit 3 — Pose Estimation
1 chapter · Single-person and multi-person human pose. Heatmap regression, CPM, OpenPose with Part Affinity Fields, top-down vs bottom-up, and SMPL for 3D body recovery.
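A minimal sketch of the Gaussian heatmap target behind heatmap regression (the sigma value is illustrative):

```python
import torch

def keypoint_heatmap(h, w, cx, cy, sigma=2.0):
    """Render a 2D Gaussian centred on keypoint (cx, cy) — the
    regression target used by heatmap-based pose estimators."""
    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```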
Unit 4 — 3D Data (PointNet, DGCNN, MeshCNN)
1 chapter · Why CNNs break on point clouds, voxelization's curse of dimensionality, PointNet's permutation-invariance theorem, hierarchical and graph-based variants, and operating on meshes directly via MeshCNN.
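A toy sketch of the PointNet idea: a shared per-point MLP followed by a symmetric max-pool, which makes the output order-invariant (dimensions are illustrative and the T-Net alignment modules are omitted):

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + symmetric max-pool: permuting the
    input points leaves the pooled global feature unchanged."""
    def __init__(self, in_dim=3, feat_dim=128, n_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, pts):            # pts: (B, N, 3)
        f = self.mlp(pts)              # per-point features, shared weights
        g = f.max(dim=1).values        # symmetric function -> order-invariant
        return self.head(g)
```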
Unit 5 — NeRF & 3D Gaussian Splatting
1 chapter · From implicit volumetric rendering (NeRF) to explicit-primitive splatting (3DGS): per-scene optimisation, differentiable rasterisation, spherical harmonics for view-dependence, and adaptive density control.
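A sketch of NeRF's discrete volume-rendering quadrature for a single ray (sample colours, densities, and segment lengths are assumed precomputed by the ray sampler):

```python
import torch

def composite(rgb, sigma, deltas):
    """Discrete volume rendering along one ray, NeRF-style.
    rgb: (S, 3) sample colours; sigma: (S,) densities; deltas: (S,) spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)   # accumulated transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])      # shift so T_1 = 1
    weights = trans * alpha                             # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)     # expected colour of the ray
```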
Unit 6 — Attention & Transformers
1 chapter · Why attention beats RNN bottlenecks, scaled dot-product attention with the √dₖ rationale, multi-head attention, encoder/decoder masking, positional encodings, and the Show-Attend-and-Tell precursor.
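A minimal scaled dot-product attention sketch; dividing by √dₖ keeps the logits at roughly unit variance so the softmax doesn't saturate:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    logits = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (mask == 0) get -inf, i.e. zero attention weight
        logits = logits.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(logits, dim=-1) @ v
```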
Unit 7 — Vision Transformers (ViT)
1 chapter · Image-as-tokens: patchify, project, prepend a [CLS] token, add positional embeddings, run a Transformer encoder, classify. ViT scales beautifully but needs massive data; Swin localises attention for efficiency.
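A sketch of that ViT input pipeline (224/16/768 are the ViT-Base defaults; initialisation and dropout details are omitted):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify + linear projection via a strided conv, then prepend
    a [CLS] token and add learned positional embeddings."""
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        n = (img // patch) ** 2                          # number of patches
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                                # x: (B, 3, H, W)
        t = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim) tokens
        cls = self.cls.expand(t.size(0), -1, -1)
        return torch.cat([cls, t], dim=1) + self.pos
```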
Unit 8 — SSL: Contrastive (SimCLR, MoCo, BYOL, CLIP)
1 chapter · Self-supervised learning derives labels from the data itself. Contrastive methods pull positive pairs together and push negatives apart; CLIP scales this to image-text. Includes BYOL's 'no negatives' trick.
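A sketch of SimCLR's NT-Xent (InfoNCE) loss, assuming z1[i] and z2[i] are embeddings of two augmented views of the same image (τ = 0.5 is SimCLR's default temperature):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent: each view's positive is its counterpart in the other
    batch; all other samples in the 2B batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2B, d), unit norm
    sim = z @ z.t() / tau                                # cosine similarities
    sim.fill_diagonal_(float('-inf'))                    # exclude self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)
```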
Unit 9 — SSL: DINO, MAE, JEPA
1 chapter · Self-distillation without labels (DINO), masked reconstruction in pixel space (MAE), and prediction in representation space (JEPA). DINO's anti-collapse tricks are heavily tested.
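A sketch of the DINO loss with its two anti-collapse tricks: teacher sharpening (low temperature) and centering (an EMA of teacher batch statistics). Temperatures follow the paper's typical values; the EMA schedule here is illustrative:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              t_s=0.1, t_t=0.04, momentum=0.9):
    """One DINO step: cross-entropy from sharpened, centred teacher
    targets to student predictions, plus the centre's EMA update."""
    t = F.softmax((teacher_logits - center) / t_t, dim=-1).detach()
    s = F.log_softmax(student_logits / t_s, dim=-1)
    loss = -(t * s).sum(dim=-1).mean()
    # Centering counteracts collapse to a single dominant dimension
    center = momentum * center + (1 - momentum) * teacher_logits.mean(dim=0)
    return loss, center
```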
Unit 10 — Transformer Advances (ViT-5 era)
1 chapter · The seven upgrades that take you from the 2017 Transformer to the modern LLM/VLM stack: PreNorm, RMSNorm, LayerScale, QK-Norm, Registers, RoPE, GQA + KV-cache + Flash Attention.
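As a representative of these upgrades, a sketch of RMSNorm (eps is illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm rescales by the root-mean-square only: no mean
    subtraction and no bias, cheaper than LayerNorm yet stable."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))   # learned per-channel gain
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.g
```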
Unit 11 — Multimodal LLMs (PaliGemma / Qwen2-VL / Gemma 4)
1 chapter · Vision-Language Models: the 3-pillar blueprint, Prefix-LM masking, SigLIP vs CLIP loss, dynamic resolution + M-RoPE for video, and the move toward native multimodal models.
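A sketch of a Prefix-LM attention mask of the kind PaliGemma uses: the multimodal prefix attends bidirectionally, while generated tokens attend causally (the helper name is ours):

```python
import torch

def prefix_lm_mask(prefix_len, total_len):
    """Boolean (total_len, total_len) mask, True = may attend.
    Prefix tokens (image + prompt) see the whole prefix; suffix
    tokens see the full prefix plus earlier suffix tokens."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[:, :prefix_len] = True   # everyone attends to the full prefix
    return mask
```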
Unit 12 — Video Understanding
1 chapter · Beyond images × T: action recognition, temporal localisation, 3D CNNs (I3D), Two-Stream, SlowFast, ViViT, and TimeSformer's divided space-time attention.
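A sketch of TimeSformer-style divided space-time attention, factorising joint attention into a temporal pass then a spatial pass (shapes and widths are illustrative; layer norms, MLPs, and CLS-token handling are omitted):

```python
import torch
import torch.nn as nn

class DividedSpaceTime(nn.Module):
    """Attend over time at each spatial location, then over space
    within each frame, each with a residual connection."""
    def __init__(self, dim=192, heads=4):
        super().__init__()
        self.time = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                 # x: (B, T, N, D)
        B, T, N, D = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)    # time axis as sequence
        t, _ = self.time(t, t, t)
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        s = x.reshape(B * T, N, D)                        # space axis as sequence
        s, _ = self.space(s, s, s)
        return x + s.reshape(B, T, N, D)
```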
Weightage
Exam pattern
Quizzes (~15%) · Assignments / programming (~30%) · Mid-Sem (~20%) · End-Sem (~35%). Confirm exact split with the instructor.
Important dates
- Quiz 1 (tentative): 2026-02-10
- Mid-Sem Exam: 2026-03-05
- Quiz 2 (tentative): 2026-04-02
- Assignment 4 deadline: 2026-04-19
- End-Sem Exam: 2026-04-25
Professor notes
- Tapaswi-Sharma quizzes lean conceptual: 'why does X work?' rather than 'recite X'. Have a one-line rationale ready for every named architecture.
- Assignments emphasise implementation depth: Assignment 3 extends Faster R-CNN to oriented boxes and builds a multi-task U-Net; Assignment 4 builds ViT and CvT from scratch. Expect case-study questions adjacent to those tasks.
- Watch out for 'list the upgrades' enumeration questions on modern transformers — write them as a numbered list, not prose.
- Multimodal / vision-language case studies are increasingly common; rehearse the 3-pillar VLM blueprint and Prefix-LM masking pattern.
- Diagrams earn marks. Spend 1 minute on a clean block diagram before writing answer prose.