Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Original — End-Sem Practice (Full Syllabus, modern emphasis)
Duration: 180 min • Max marks: 100
Section A — Short Answers (each 3 marks) • 18 marks
1. Differentiate semantic, instance, and panoptic segmentation in one line each. [3 m]
2. What four properties of point clouds break standard CNNs? [3 m]
3. State DINO's two anti-collapse tricks and why both are needed. [3 m]
4. Why does MAE mask 75% while BERT masks 15%? [3 m]
5. What is GQA, and how does it compare to MHA and MQA in KV-cache size? [3 m]
6. Write SigLIP's loss in one line and state how it differs from CLIP's loss. [3 m]
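Hint for Q5 (an illustrative back-of-the-envelope sketch, not part of the question): the KV-cache comparison can be checked numerically. The model sizes below (32 layers, 32 query heads, head dim 128, 8 KV heads for GQA) and the fp16 assumption are example values chosen for illustration.

```python
# Sketch: KV-cache size for MHA vs GQA vs MQA, assuming fp16 (2 bytes/element)
# and a single sequence. Only the number of KV heads differs between the three.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # 2 tensors (K and V) per layer, each of shape [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_el

n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 4096
mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)  # each head has its own K/V
gqa = kv_cache_bytes(n_layers, 8, head_dim, seq_len)        # heads share K/V in groups of 4
mqa = kv_cache_bytes(n_layers, 1, head_dim, seq_len)        # one K/V head shared by all

print(mha // 2**20, gqa // 2**20, mqa // 2**20, "MiB")  # 2048 512 64 MiB
```

The cache shrinks linearly with the number of KV heads: GQA sits between MHA and MQA, trading a small quality loss for a large memory saving.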
Section B — Long Answers (each 10 marks) • 40 marks
1. Explain Swin Transformer's local-window and shifted-window attention. Why does it achieve a global receptive field at O(n) cost per layer? [10 m]
2. Describe the three pillars of 3D Gaussian Splatting. Why is the covariance parameterised as R·S·Sᵀ·Rᵀ rather than as a free 3×3 symmetric matrix? [10 m]
3. Enumerate seven modern transformer upgrades over the original 2017 Transformer. Give a one-line rationale per upgrade. [10 m]
4. Compare Two-Stream and SlowFast video networks. What is each architecture trying to separate, and how do SlowFast's lateral connections help? [10 m]
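Hint for Q2 (an illustrative sketch, not part of the question): the point of the R·S·Sᵀ·Rᵀ factorisation is that it yields a valid covariance for any parameter values, whereas gradient updates to a free symmetric 3×3 matrix can leave the positive-semi-definite cone. The quaternion and scale values below are arbitrary examples.

```python
# Sketch: building a covariance as Sigma = R S S^T R^T from a quaternion and
# log-scales. Symmetry and PSD-ness hold by construction for ANY inputs.
import numpy as np

def covariance(quat, log_scales):
    w, x, y, z = quat / np.linalg.norm(quat)   # normalise -> valid rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(np.exp(log_scales))            # exp keeps axis scales positive
    M = R @ S
    return M @ M.T                             # Sigma = R S S^T R^T

Sigma = covariance(np.array([0.9, 0.1, 0.3, 0.2]), np.array([0.0, -1.0, 2.0]))
assert np.allclose(Sigma, Sigma.T)             # symmetric by construction
assert np.all(np.linalg.eigvalsh(Sigma) >= 0)  # PSD by construction
```

The eigenvalues of Σ are exactly the squared scales, so the parameterisation also keeps the ellipsoid's axis lengths and orientation directly interpretable.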
Section C — Case Studies (each 12 marks) • 24 marks
1. Watch-clock VLM case study. Given an image of a digital alarm clock plus a text input listing the two displayed numbers, e.g., ["05", "23"], design a Vision-Language Model that predicts the time the clock shows. Cover: (a) architecture (three-pillar VLM); (b) attention masking; (c) handling ambiguous (blurred / glared) digits using the text prior; (d) zero-shot behaviour when the text input is dropped, and the relationship to CLIP-style classification. [12 m]
2. Oriented Faster R-CNN. Extend Faster R-CNN to predict oriented (rotated) bounding boxes. Discuss: (a) what changes in annotations; (b) direct angle regression vs multi-bin classification — pros, cons, and the angle-wraparound issue; (c) modifications to mAP for oriented boxes; (d) which loss-weight hyperparameter you would tune, and how. [12 m]
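Hint for Q2(b) (an illustrative sketch, not part of the question): the wraparound issue is that two nearly identical oriented boxes can have numerically distant angles. The sin/cos encoding below is one common workaround, shown here with an assumed 2π angle period; it is not the exact scheme of any particular paper.

```python
# Sketch: naive L1 angle regression vs a sin/cos encoding near the wraparound point.
import math

def l1_angle_loss(pred, target):
    return abs(pred - target)  # treats angles as plain scalars: breaks at the seam

def sincos_loss(pred, target):
    # represent each angle as (sin, cos); the loss is continuous across the seam
    return math.hypot(math.sin(pred) - math.sin(target),
                      math.cos(pred) - math.cos(target))

# Two almost-coincident oriented boxes: -179.5 deg vs +179.5 deg
a, b = math.radians(-179.5), math.radians(179.5)
print(l1_angle_loss(a, b))  # near 2*pi, despite the boxes almost coinciding
print(sincos_loss(a, b))    # tiny, as the geometry suggests it should be
```

Multi-bin classification sidesteps the seam differently, by predicting a discrete bin plus a small in-bin residual.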
Section D — Calculation / Proof (each 9 marks) • 18 marks
1. Compute the parameter count of ViT-L/16 (L = 24, d_model = 1024, d_ff = 4096, patch = 16, image = 224). Show a per-layer breakdown and the total. [9 m]
2. Prove that PointNet is permutation invariant. State the universal approximation property (informally). [9 m]
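Sanity-check sketch for Q1 (not part of the question): the count below assumes bias terms are included, learned position embeddings with a CLS token, and no classification head; published ViT-L/16 figures (~307 M) may count additional components.

```python
# Sketch: parameter count of ViT-L/16, layer by layer.
L, d, d_ff, patch, image, chans = 24, 1024, 4096, 16, 224, 3

n_patches = (image // patch) ** 2  # 14 * 14 = 196
seq_len = n_patches + 1            # +1 for the CLS token

patch_embed = patch * patch * chans * d + d       # linear projection of flat patches
pos_embed = seq_len * d                           # learned position embeddings
cls_token = d

attn_per_layer = 4 * (d * d + d)                  # Q, K, V, output projections (+bias)
mlp_per_layer = d * d_ff + d_ff + d_ff * d + d    # two linear layers (+bias)
ln_per_layer = 2 * 2 * d                          # two LayerNorms (scale + shift)
per_layer = attn_per_layer + mlp_per_layer + ln_per_layer

total = L * per_layer + patch_embed + pos_embed + cls_token + 2 * d  # + final LN
print(f"per layer: {per_layer:,}  total: {total:,}")  # ~12.6 M per layer, ~303 M total
```

Under these assumptions each encoder layer contributes 12,596,224 parameters and the embeddings add under 1 M, so the transformer body dominates the total.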