
Computer Vision

CSE471
Prof. Makarand Tapaswi & Prof. Charu Sharma • Spring 2025-26 • 4 credits

Mock Paper 1 — Comprehensive (full syllabus, 3 hr / 100 marks)

Duration: 180 min • Max marks: 100

Section A — Short Answer / MCQ (1-2 marks each, 20 marks)

  1. Compute the output size of a convolution on a 32×32 input with a 5×5 kernel, stride 2, padding 1. (1 m)
  2. Between a 5×5 Gaussian kernel and a 5×5 mean filter, which produces less ringing in the frequency domain, and why? (1 m)
  3. Why does YOLO use (√w − √ŵ)² + (√h − √ĥ)² rather than (w − ŵ)² + (h − ĥ)² for the size loss? (2 m)
  4. How many anchor boxes does Faster R-CNN's RPN use per spatial location, and how is the number derived? (1 m)
  5. A small binary image has two 2×2 'on' blocks at (rows 1-2, cols 1-2) and (rows 1-2, cols 4-5), and a 2×4 'on' block at (rows 4-5, cols 1-4), all on a 7×7 black background. How many connected components are there under (a) 4-connectivity and (b) 8-connectivity? (2 m)
  6. What is the typical output dimensionality of DINO's projection head, and why is it so large? (1 m)
  7. For ViT at 384×384 resolution with patch size 16, how many patch tokens are there (excluding CLS)? (2 m)
  8. What masking ratio does MAE use, and why is it much higher than BERT's 15%? (1 m)
  9. Why does RoI Align outperform RoI Pool for mask prediction even with the same backbone? (2 m)
  10. State the IoU formula between two bounding boxes. (1 m)
  11. A neural network has all weights initialised to zero. Will training succeed? Justify. (2 m)
  12. OpenPose with 18 keypoint types and 19 limb types — how many output channels (excluding background)? (1 m)
  13. Name SMPL's shape and pose parameters and their dimensionalities. (1 m)
  14. Why is JEPA said to predict 'representations instead of pixels', and how does this differ from MAE? (2 m)
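Several of the counting questions above reduce to one-line formulas. A minimal Python self-check (the function name is ours; the OpenPose line assumes the standard convention of 2 PAF channels per limb):

```python
def conv_out(n, k, s, p):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

print(conv_out(32, 5, 2, 1))   # -> 15  (Q1: 32x32 input, 5x5 kernel, stride 2, pad 1)
print((384 // 16) ** 2)        # -> 576 (Q7: ViT patch tokens at 384x384, patch 16)
print(18 + 2 * 19)             # -> 56  (Q12: 18 keypoint heatmaps + 2 PAF channels per limb)
```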

Section B — Conceptual / Explanation (4-6 marks each, 40 marks)

  1. Why is a Multi-Layer Perceptron unsuitable for image classification on real photographs? Quantify with parameter counts: an MLP first layer with 1000 hidden units vs an equivalent conv layer with 10 filters of size 5×5, both on a 200×200×3 input. (5 m)
  2. Describe scaled dot-product attention and explain the √dₖ rescaling. (5 m)
  3. Compare top-down vs bottom-up multi-person pose estimation: pipeline (2-3 lines), two pros, and two cons for each. Mention representative methods. (6 m)
  4. How does PointNet achieve permutation invariance? Identify the symmetric function. (4 m)
  5. Explain DINO's centering and sharpening. What collapse does each prevent, and why are both needed? Illustrate centering with a small numerical example. (6 m)
  6. Describe the JPEG compression pipeline in 7 steps. Which step is lossy? Why is the DCT preferred over the DFT? (5 m)
  7. Given a 5×5 binary image containing a hollow 3×3 ring (rows 2-4, cols 2-4, with the centre off), compute the erosion and dilation with a 3×3 cross structuring element. Then compute the opening and closing. (5 m)
  8. State the Convolution Theorem formally. Give one practical implication for image processing. (4 m)
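The parameter-count comparison in question 1 can be verified directly. A quick sketch under the stated dimensions (variable names are ours):

```python
# MLP first layer on a flattened 200x200x3 image with 1000 hidden units:
mlp_params = (200 * 200 * 3) * 1000 + 1000   # weights + biases
# Equivalent conv layer: 10 filters of size 5x5 over 3 input channels:
conv_params = 10 * (5 * 5 * 3 + 1)           # (weights + bias) per filter
print(f"{mlp_params:,} vs {conv_params:,}")  # -> 120,001,000 vs 760
```

The roughly five-orders-of-magnitude gap comes from weight sharing: the conv filters reuse the same 5×5×3 weights at every spatial position.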

Section C — Long-Form / Calculation (10 marks each, 40 marks)

  1. Object detection / mAP. Eight detections, sorted by score, have the following IoU with their closest GT: 0.87, 0.42, 0.71, 0.55, 0.81, 0.30, 0.62, 0.51. Total GT dogs = 6; IoU threshold for a TP is 0.5; each GT matches at most one detection (the highest-scoring one above threshold). (a) Mark each detection TP or FP. (b) Compute cumulative precision and recall after each detection. (c) Compute AP via 11-point interpolation. (10 m)
  2. CNN architecture calculation. Input 224×224×3. L1: conv 64@7×7, s=2, p=3; L2: max-pool 3×3, s=2, p=1; L3: conv 128@3×3, s=1, p=1; L4: conv 128@3×3, s=1, p=1; L5: max-pool 2×2, s=2, p=0; L6: conv 256@3×3, s=1, p=1; L7: global average pooling; L8: FC 1000. (a) Spatial dimensions after each layer. (b) Parameter counts of L1, L3, L6, L8 (with biases). (c) Receptive field at L6. (10 m)
  3. SSL comparison. For CLIP, DINO, MAE, and JEPA: (a) What is the supervision signal (one sentence each)? (b) Build a table: negatives required? augmentations? pixel reconstruction? (c) For a robotics task needing temporal dynamics in video, which method is most natural to extend, and why? (10 m)
  4. ViT pipeline. 224×224 RGB input, patch size 16, D = 768, 12 heads (d_k = 64), FFN hidden dim = 4D. (a) Sequence length entering the Transformer (with CLS). (b) Order of operations in one encoder block (PreNorm), with LayerNorm placement. (c) Total parameter count of one block. (d) Why are positional embeddings needed? (10 m)