Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Mock Paper 11 — Architectural Reasoning ('Why this design choice?')

Duration: 180 min • Max marks: 100

Section A — Quick 'Why?' Questions (2 marks each, 20 marks)

  1. Why does AlexNet use two parallel GPU paths in the original implementation? (2 m)
  2. Why does VGG use stacked 3×3 convs instead of a single 5×5 or 7×7 conv? (2 m)
  3. Why does ResNet use the identity shortcut rather than a learned transformation? (2 m)
  4. Why does BatchNorm have learnable γ and β? Why re-scale and re-shift after normalising? (2 m)
  5. Why does YOLO use a single network with a grid output instead of two stages? (2 m)
  6. Why does Faster R-CNN's RPN use anchors of multiple scales and aspect ratios? (2 m)
  7. Why does U-Net use CONCAT skips instead of ADD skips like ResNet? (2 m)
  8. Why does MAE mask 75% of patches while BERT masks only 15% of tokens? (2 m)
  9. Why does DINO use multi-crop while SimCLR uses only two views? (2 m)
  10. Why is stochastic depth used in deep ViTs but not in shallow CNNs? (2 m)
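
As a rough illustration of the arithmetic behind question 2 of this section, the sketch below compares stacked 3×3 convolutions with a single larger kernel covering the same receptive field. The channel width C = 64 and the helper names are assumptions chosen only for illustration. Biases are ignored, and every layer is assumed to use stride 1 with C input and C output channels.

```python
def conv_weights(kernel: int, channels: int) -> int:
    """Weights in one kernel x kernel conv with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels


def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convs with the same kernel size."""
    return 1 + layers * (kernel - 1)


C = 64  # assumed channel width, for illustration only

# (number of stacked 3x3 layers, single kernel size with the same receptive field)
for layers, big_kernel in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(3, layers)
    stacked = layers * conv_weights(3, C)
    single = conv_weights(big_kernel, C)
    print(f"{layers} x 3x3: RF={rf}, weights={stacked} | "
          f"one {big_kernel}x{big_kernel}: weights={single}")
```

Under these assumptions the stacked version reaches the same receptive field with fewer weights (18C² vs 25C², and 27C² vs 49C²) and also inserts an extra non-linearity between layers.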

Section B — Architectural Trade-offs (4-6 marks each, 40 marks)

  1. Why use depthwise separable convolutions (MobileNet) instead of regular convolutions? Show the savings with C_in=128, C_out=256, and a 3×3 kernel. (5 m)
  2. Why did Transformers switch from PostNorm to PreNorm? (5 m)
  3. Why does CLIP use contrastive learning rather than generative captioning? (5 m)
  4. Why does Mask R-CNN predict per-class masks rather than a single class-agnostic mask? (5 m)
  5. Why does DINO use an output dimension of 65,536? Why not 1000 (matching the number of ImageNet classes)? (4 m)
  6. Why does 3D Gaussian Splatting (3DGS) use Spherical Harmonics for colour rather than just storing one RGB triplet per Gaussian? (5 m)
  7. Why does PointNet use MAX pooling rather than SUM or MEAN? (5 m)
  8. Why does the original ViT underperform CNNs on small datasets despite having more parameters? (4 m)
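
The numbers given in question 1 of this section can be checked with a short calculation. The sketch below is a minimal parameter count, assuming no bias terms and a depthwise stage with one 3×3 filter per input channel followed by a 1×1 pointwise convolution. The function names are hypothetical helpers, not part of any library.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a regular k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv plus a 1x1 pointwise conv (no bias)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


C_IN, C_OUT, K = 128, 256, 3  # values taken from the question

standard = standard_conv_params(C_IN, C_OUT, K)          # 294912
separable = depthwise_separable_params(C_IN, C_OUT, K)   # 33920
ratio = standard / separable                             # about 8.69

print(f"standard={standard}, separable={separable}, reduction={ratio:.2f}x")
```

For a fixed output resolution the multiply-add count shrinks by the same factor, which matches the closed form 1 / (1/C_out + 1/k²).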

Section C — Deep Architectural Comparisons (10 marks each, 40 marks)

  1. Compare the design philosophies of CLIP, DINO, and MAE: training objective, inductive biases, why each works at scale, and the downstream tasks each suits best. (10 m)
  2. Why is the U-Net architecture so widely applicable (medical segmentation, Stable Diffusion, pix2pix, optical flow, inpainting)? What is the unifying property? (10 m)
  3. List 8 modern improvements to the original Transformer and state what problem each one solves. (10 m)
  4. Why does 3D Gaussian Splatting train so much faster than NeRF (30 min vs 8-12 hr)? (10 m)