Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Mock Paper 3 — Assignment-aligned (Faster R-CNN ext, Multi-task U-Net, ViT, CvT, ClipCap)
Duration: 180 min • Max marks: 100
Section A — Short Answer (1-2 marks each, 20 marks)
- 1. Write one Transformer block in (a) PostNorm and (b) PreNorm form. Which appears in the original 2017 paper, and which in ViT? [2 m]
- 2. ViT on CIFAR-10 (32×32) with patch size 4 — what is the sequence length including the CLS token? [1 m]
- 3. Why are positional embeddings unnecessary in CvT in principle? Why might you still want them on CIFAR-10? [2 m]
- 4. To extend Faster R-CNN to oriented bounding boxes, where in the architecture do you need changes — the RPN, the RoI heads, or both? [1 m]
- 5. Why do Transformers use LayerNorm while CNNs use BatchNorm? Give two reasons. [2 m]
- 6. ClipCap variants: (a) MLP mapper + fine-tuned GPT-2, (b) Transformer mapper + frozen GPT-2. Which has more trainable parameters, and by roughly how much? [1 m]
- 7. Direct angle regression vs multi-bin classification for oriented bounding boxes — state one advantage of each. [2 m]
- 8. What is a synset in the ImageNet hierarchy? [1 m]
- 9. What are PointNet's "critical points"? What does robustness to random point dropout demonstrate? [2 m]
- 10. What does the RPN's objectness loss trade off, and what is the standard loss formulation? [1 m]
- 11. In ClipCap with frozen GPT-2 and a Transformer mapper, why are the prefix tokens largely unreadable when projected to GPT-2's vocabulary via nearest neighbour? [2 m]
- 12. What is the purpose of U-Net's skip connections in a multi-task variant predicting both segmentation and depth? [1 m]
- 13. For a ViT trained with NO positional embeddings: (a) what happens to test accuracy, and (b) can it distinguish "cat at top-left" from "cat at bottom-right"? [2 m]
- 14. In CvT, what is a depth-wise separable convolution, and why is it used for the Q/K/V projection? [1 m]
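For Q1, the two block wirings differ only in where normalisation sits relative to the residual add. A minimal sketch — `attn`, `mlp`, `norm1`, `norm2` are placeholder callables standing in for real sublayers, not actual PyTorch modules:

```python
def post_norm_block(x, attn, mlp, norm1, norm2):
    # Original 2017 Transformer: normalise AFTER each residual add.
    x = norm1(x + attn(x))
    return norm2(x + mlp(x))

def pre_norm_block(x, attn, mlp, norm1, norm2):
    # ViT: normalise BEFORE each sublayer; the residual path itself
    # stays un-normalised, which stabilises deep training.
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))
```

With identity norms the two coincide; the distinction only matters once the norms are real LayerNorms inside a deep stack.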
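The arithmetic behind Q2 follows directly from the patch grid. A quick sketch, assuming square images and non-overlapping patches (standard ViT; the helper name is ours):

```python
def vit_seq_len(image_size: int, patch_size: int, use_cls: bool = True) -> int:
    """Number of tokens entering the Transformer encoder."""
    assert image_size % patch_size == 0, "patches must tile the image exactly"
    num_patches = (image_size // patch_size) ** 2
    return num_patches + (1 if use_cls else 0)

print(vit_seq_len(32, 4))  # CIFAR-10 with patch 4: 8x8 = 64 patches + CLS = 65
```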
Section B — Conceptual / Explanation (4-6 marks each, 40 marks)
- 1. A 6-layer ViT on CIFAR-10 after 30 epochs: train loss 1.5 → 0.4; val loss 1.6 → 0.9 → 1.1 (rising over the last 10 epochs); val accuracy plateaued at 68%. (a) Diagnose the problem. (b) Propose four specific changes to push val accuracy above 80%. (c) State an early-stopping criterion. [6 m]
- 2. CLIP zero-shot on plant diseases: high accuracy on the common class "apple scab"; poor on the rare "orchid mosaic virus"; +3% from adding the prefix "a photo of". Explain each observation and suggest one engineering technique beyond basic prompting. [5 m]
- 3. A multi-task U-Net trained with L_total = CE(seg) + MSE(depth). At convergence: mIoU 0.45 (target > 0.65), depth RMSE 0.05 m (target < 0.1 m). Segmentation under-performs. Diagnose and propose remedies. [5 m]
- 4. CvT makes three architectural changes relative to ViT: (1) convolutional token embedding, (2) convolutional projection for Q/K/V, (3) a multi-stage hierarchy. For each: what it adds and which inductive bias it introduces. Which contributes most on CIFAR-10? [5 m]
- 5. PointNet observations: 88% baseline accuracy; 88% under random permutation; 86% under 50% random point dropout; 52% when the critical points are removed. Explain each result via the architecture. [5 m]
- 6. ClipCap trained with prefix length k = 10. At inference, CIDEr drops if you truncate to k = 5 (0.62) or k = 1 (0.18), AND drops if you zero-pad to k = 20 (0.95) or k = 40 (0.78). Why does performance peak exactly at k = 10? [4 m]
- 7. A self-driving stack needs 3D detection at 30 FPS, lane segmentation at 10 FPS, and depth for path planning. Design a shared-backbone multi-task system. [5 m]
- 8. Critique the claim: "task-specific architectures are dead — just train task-specific heads on a frozen foundation-model backbone." When is this correct? When is it wrong? [5 m]
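One standard remedy for the loss-scale imbalance in Q3 is homoscedastic-uncertainty weighting (Kendall et al., 2018). A minimal numeric sketch — `s_seg` and `s_depth` are the learned log-variances, shown here as plain floats rather than trainable parameters:

```python
import math

def weighted_total_loss(ce_seg, mse_depth, s_seg=0.0, s_depth=0.0):
    """L = exp(-s1)*CE + s1 + 0.5*exp(-s2)*MSE + 0.5*s2.
    Learning s_depth upward automatically down-weights a regression
    loss whose raw scale would otherwise dominate the gradients."""
    seg_term = math.exp(-s_seg) * ce_seg + s_seg              # classification task
    depth_term = 0.5 * math.exp(-s_depth) * mse_depth + 0.5 * s_depth  # regression task
    return seg_term + depth_term
```

With both log-variances at 0 this reduces to CE + 0.5·MSE; the additive `s` terms keep the model from trivially inflating both variances.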
Section C — Long-Form / Calculation (10 marks each, 40 marks)
- 1. Extend Faster R-CNN to oriented bounding boxes. (a) List the complete modifications: dataset/loss, RPN, RoI heads, evaluation. (b) Compare two angle-prediction methods — (i) direct regression and (ii) multi-bin classification + residual — stating the loss and one weakness of each. (c) Angle wraparound: propose a modified loss for direct regression that handles it correctly. [10 m]
- 2. U-Net multi-task variants: vanilla (with skips) mIoU 0.62 / depth RMSE 0.08; without skips 0.31 / 0.18; with residual blocks 0.66 / 0.07. (a) Why is removing the skips catastrophic for both tasks? (b) Residual blocks help only marginally — why less than in ResNet? (c) Propose two further architectural changes to push both metrics significantly higher. [10 m]
- 3. ViT hyperparameter analysis on CIFAR-10. (a) For each hyperparameter (patch size, embedding dim D, depth L, heads H, MLP hidden size, augmentation, LR schedule), predict major (> 2%), moderate (1-2%), or minor (< 1%) impact and justify. (b) State your best configuration for crossing 80% accuracy within 50 epochs. (c) "ViTs need more inductive bias on small datasets" — discuss the operational implications. [10 m]
- 4. Design a production e-commerce product-captioning system: 50K daily images; captions must be accurate (brand/colour/size/material), concise, and brand-safe; inference under 200 ms. (a) Give the full architecture. (b) With only 5K labeled (image, caption) pairs but 500K unlabeled images, how would you leverage the unlabeled data? (c) After deployment, the system fails on a new category ("vintage cameras"). How would you adapt efficiently without full retraining? [10 m]
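For Q1(c), one common wraparound-aware choice is a periodic loss built from a cosine of the angle difference, so that predictions differing by a full symmetry period cost nothing. A sketch under the assumption of an orientation-symmetric box parameterisation with period π (use 2π if the parameterisation distinguishes head from tail):

```python
import math

def periodic_angle_loss(pred: float, target: float, period: float = math.pi) -> float:
    """Smooth, differentiable, and zero whenever pred == target (mod period).
    Maximal (value 2) when the angles are maximally misaligned, i.e. off
    by half a period, so the gradient never pushes the long way around."""
    delta = pred - target
    return 1.0 - math.cos(2.0 * math.pi * delta / period)
```

Unlike plain L1/L2 on raw angles, this loss treats θ and θ + π as the identical box, which is exactly the failure mode direct regression hits at the wraparound boundary.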