Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Original — End-Sem Practice (Full Syllabus, modern emphasis)
Duration: 180 min • Max marks: 100
Section A — Short Answers (each 3 marks) • 18 marks
1. Differentiate semantic, instance, and panoptic segmentation in one line each. [3 m]
2. What four properties of point clouds break standard CNNs? [3 m]
3. State DINO's two anti-collapse tricks and why both are needed. [3 m]
4. Why does MAE mask 75% while BERT masks 15%? [3 m]
5. What is GQA, and how does it compare to MHA and MQA in KV-cache size? [3 m]
6. Write SigLIP's loss in one line and state how it differs from CLIP's loss. [3 m]
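Hint for Q5 (an illustrative back-of-the-envelope sketch, not part of the question): the KV-cache comparison can be checked numerically. The model sizes below (32 layers, 32 query heads, head dim 128, 8 KV heads for GQA) and the fp16 assumption are example values chosen for illustration.

```python
# Sketch: KV-cache size for MHA vs GQA vs MQA, assuming fp16 (2 bytes/element)
# and a single sequence. Only the number of KV heads differs between the three.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # 2 tensors (K and V) per layer, each of shape [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_el

n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 4096
mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)  # each head has its own K/V
gqa = kv_cache_bytes(n_layers, 8, head_dim, seq_len)        # heads share K/V in groups of 4
mqa = kv_cache_bytes(n_layers, 1, head_dim, seq_len)        # one K/V head shared by all

print(mha // 2**20, gqa // 2**20, mqa // 2**20, "MiB")  # 2048 512 64 MiB
```

The cache shrinks linearly with the number of KV heads: GQA sits between MHA and MQA, trading a small quality loss for a large memory saving.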
Section B — Long Answers (each 10 marks) • 40 marks
1. Explain Swin Transformer's local-window and shifted-window attention. Why does it achieve a global receptive field at O(n) cost per layer? [10 m]
2. Describe the three pillars of 3D Gaussian Splatting. Why is the covariance parameterised as R·S·Sᵀ·Rᵀ rather than as a free 3×3 symmetric matrix? [10 m]
3. Enumerate seven modern transformer upgrades over the original 2017 Transformer. Give a one-line rationale per upgrade. [10 m]
4. Compare Two-Stream and SlowFast video networks. What is each architecture trying to separate, and how do SlowFast's lateral connections help? [10 m]
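Hint for Q2 (an illustrative sketch, not part of the question): the point of the R·S·Sᵀ·Rᵀ factorisation is that it yields a valid covariance for any parameter values, whereas gradient updates to a free symmetric 3×3 matrix can leave the positive-semi-definite cone. The quaternion and scale values below are arbitrary examples.

```python
# Sketch: building a covariance as Sigma = R S S^T R^T from a quaternion and
# log-scales. Symmetry and PSD-ness hold by construction for ANY inputs.
import numpy as np

def covariance(quat, log_scales):
    w, x, y, z = quat / np.linalg.norm(quat)   # normalise -> valid rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(np.exp(log_scales))            # exp keeps axis scales positive
    M = R @ S
    return M @ M.T                             # Sigma = R S S^T R^T

Sigma = covariance(np.array([0.9, 0.1, 0.3, 0.2]), np.array([0.0, -1.0, 2.0]))
assert np.allclose(Sigma, Sigma.T)             # symmetric by construction
assert np.all(np.linalg.eigvalsh(Sigma) >= 0)  # PSD by construction
```

The eigenvalues of Σ are exactly the squared scales, so the parameterisation also keeps the ellipsoid's axis lengths and orientation directly interpretable.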
Section C — Case Studies (each 12 marks) • 24 marks
1. Watch-clock VLM case study. Given an image of a digital alarm clock plus a text input listing the two displayed numbers, e.g., ["05", "23"], design a Vision-Language Model that predicts the time the clock shows. Cover: (a) architecture (three-pillar VLM); (b) attention masking; (c) handling ambiguous (blurred / glared) digits using the text prior; (d) zero-shot behaviour when the text input is dropped, and the relationship to CLIP-style classification. [12 m]
2. Oriented Faster R-CNN. Extend Faster R-CNN to predict oriented (rotated) bounding boxes. Discuss: (a) what changes in annotations; (b) direct angle regression vs multi-bin classification — pros, cons, and the angle-wraparound issue; (c) modifications to mAP for oriented boxes; (d) which loss-weight hyperparameter you would tune, and how. [12 m]
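Hint for Q2(b) (an illustrative sketch, not part of the question): the wraparound issue is that two nearly identical oriented boxes can have numerically distant angles. The sin/cos encoding below is one common workaround, shown here with an assumed 2π angle period; it is not the exact scheme of any particular paper.

```python
# Sketch: naive L1 angle regression vs a sin/cos encoding near the wraparound point.
import math

def l1_angle_loss(pred, target):
    return abs(pred - target)  # treats angles as plain scalars: breaks at the seam

def sincos_loss(pred, target):
    # represent each angle as (sin, cos); the loss is continuous across the seam
    return math.hypot(math.sin(pred) - math.sin(target),
                      math.cos(pred) - math.cos(target))

# Two almost-coincident oriented boxes: -179.5 deg vs +179.5 deg
a, b = math.radians(-179.5), math.radians(179.5)
print(l1_angle_loss(a, b))  # near 2*pi, despite the boxes almost coinciding
print(sincos_loss(a, b))    # tiny, as the geometry suggests it should be
```

Multi-bin classification sidesteps the seam differently, by predicting a discrete bin plus a small in-bin residual.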
Section D — Calculation / Proof (each 9 marks) • 18 marks
1. Compute the parameter count of ViT-L/16 (L = 24, d_model = 1024, d_ff = 4096, patch = 16, image = 224). Show a per-layer breakdown and the total. [9 m]
2. Prove that PointNet is permutation invariant. State the universal approximation property (informally). [9 m]
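Sanity-check sketch for Q1 (not part of the question): the count below assumes bias terms are included, learned position embeddings with a CLS token, and no classification head; published ViT-L/16 figures (~307 M) may count additional components.

```python
# Sketch: parameter count of ViT-L/16, layer by layer.
L, d, d_ff, patch, image, chans = 24, 1024, 4096, 16, 224, 3

n_patches = (image // patch) ** 2  # 14 * 14 = 196
seq_len = n_patches + 1            # +1 for the CLS token

patch_embed = patch * patch * chans * d + d       # linear projection of flat patches
pos_embed = seq_len * d                           # learned position embeddings
cls_token = d

attn_per_layer = 4 * (d * d + d)                  # Q, K, V, output projections (+bias)
mlp_per_layer = d * d_ff + d_ff + d_ff * d + d    # two linear layers (+bias)
ln_per_layer = 2 * 2 * d                          # two LayerNorms (scale + shift)
per_layer = attn_per_layer + mlp_per_layer + ln_per_layer

total = L * per_layer + patch_embed + pos_embed + cls_token + 2 * d  # + final LN
print(f"per layer: {per_layer:,}  total: {total:,}")  # ~12.6 M per layer, ~303 M total
```

Under these assumptions each encoder layer contributes 12,596,224 parameters and the embeddings add under 1 M, so the transformer body dominates the total.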