
Computer Vision

CSE471
Prof. Makarand Tapaswi & Prof. Charu Sharma • Spring 2025-26 • 4 credits

Mock Paper 1 — Comprehensive (full syllabus, 3 hr / 100 marks)

Duration: 180 min • Max marks: 100

Section A — Short Answer / MCQ (1-2 marks each, 20 marks)

  1. Compute the output size of a convolution on a 32×32 input with a 5×5 kernel, stride 2, padding 1. (1 m)
  2. Between a 5×5 Gaussian kernel and a 5×5 mean filter, which produces less ringing in the frequency domain, and why? (1 m)
  3. Why does YOLO use (√w − √ŵ)² + (√h − √ĥ)² rather than (w − ŵ)² + (h − ĥ)² for the size loss? (2 m)
  4. How many anchor boxes does Faster R-CNN's RPN use per spatial location, and how is the number derived? (1 m)
  5. A small binary image has two 2×2 'on' blocks at (rows 1-2, cols 1-2) and (rows 1-2, cols 4-5), and a 2×4 'on' block at (rows 4-5, cols 1-4), all on a 7×7 black background. How many connected components are there under (a) 4-connectivity and (b) 8-connectivity? (2 m)
  6. What is the typical output dimensionality of DINO's projection head, and why is it so large? (1 m)
  7. For ViT at 384×384 resolution with patch size 16, how many patch tokens are there (excluding CLS)? (2 m)
  8. What masking ratio does MAE use, and why is it much higher than BERT's 15%? (1 m)
  9. Why does RoI Align outperform RoI Pool for mask prediction even with the same backbone? (2 m)
  10. State the IoU formula between two bounding boxes. (1 m)
  11. A neural network has all weights initialised to zero. Will training succeed? Justify. (2 m)
  12. OpenPose with 18 keypoint types and 19 limb types — how many output channels (excluding background)? (1 m)
  13. Name SMPL's shape and pose parameters and their dimensionalities. (1 m)
  14. Why is JEPA said to predict 'representations instead of pixels', and how does this differ from MAE? (2 m)
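Several of the counting questions above reduce to one-line formulas. A minimal Python self-check (the function name is ours; the OpenPose line assumes the standard convention of 2 PAF channels per limb):

```python
def conv_out(n, k, s, p):
    """Spatial output size of a convolution: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

print(conv_out(32, 5, 2, 1))   # -> 15  (Q1: 32x32 input, 5x5 kernel, stride 2, pad 1)
print((384 // 16) ** 2)        # -> 576 (Q7: ViT patch tokens at 384x384, patch 16)
print(18 + 2 * 19)             # -> 56  (Q12: 18 keypoint heatmaps + 2 PAF channels per limb)
```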

Section B — Conceptual / Explanation (4-6 marks each, 40 marks)

  1. Why is a Multi-Layer Perceptron unsuitable for image classification on real photographs? Quantify with parameter counts: an MLP first layer with 1000 hidden units vs an equivalent conv layer with 10 filters of size 5×5, both on a 200×200×3 input. (5 m)
  2. Describe scaled dot-product attention and explain the √dₖ rescaling. (5 m)
  3. Compare top-down vs bottom-up multi-person pose estimation: pipeline (2-3 lines), two pros, and two cons for each. Mention representative methods. (6 m)
  4. How does PointNet achieve permutation invariance? Identify the symmetric function. (4 m)
  5. Explain DINO's centering and sharpening. What collapse does each prevent, and why are both needed? Illustrate centering with a small numerical example. (6 m)
  6. Describe the JPEG compression pipeline in 7 steps. Which step is lossy? Why is the DCT preferred over the DFT? (5 m)
  7. Given a 5×5 binary image containing a hollow 3×3 ring (rows 2-4, cols 2-4, with the centre off), compute the erosion and dilation with a 3×3 cross structuring element. Then compute the opening and closing. (5 m)
  8. State the Convolution Theorem formally. Give one practical implication for image processing. (4 m)
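The parameter-count comparison in question 1 can be verified directly. A quick sketch under the stated dimensions (variable names are ours):

```python
# MLP first layer on a flattened 200x200x3 image with 1000 hidden units:
mlp_params = (200 * 200 * 3) * 1000 + 1000   # weights + biases
# Equivalent conv layer: 10 filters of size 5x5 over 3 input channels:
conv_params = 10 * (5 * 5 * 3 + 1)           # (weights + bias) per filter
print(f"{mlp_params:,} vs {conv_params:,}")  # -> 120,001,000 vs 760
```

The roughly five-orders-of-magnitude gap comes from weight sharing: the conv filters reuse the same 5×5×3 weights at every spatial position.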

Section C — Long-Form / Calculation (10 marks each, 40 marks)

  1. Object detection / mAP. Eight detections, sorted by score, have the following IoU with their closest GT: 0.87, 0.42, 0.71, 0.55, 0.81, 0.30, 0.62, 0.51. Total GT dogs = 6; IoU threshold for a TP is 0.5; each GT matches at most one detection (the highest-scoring one above threshold). (a) Mark each detection TP or FP. (b) Compute cumulative precision and recall after each detection. (c) Compute AP via 11-point interpolation. (10 m)
  2. CNN architecture calculation. Input 224×224×3. L1: conv 64@7×7, s=2, p=3; L2: max-pool 3×3, s=2, p=1; L3: conv 128@3×3, s=1, p=1; L4: conv 128@3×3, s=1, p=1; L5: max-pool 2×2, s=2, p=0; L6: conv 256@3×3, s=1, p=1; L7: global average pooling; L8: FC 1000. (a) Spatial dimensions after each layer. (b) Parameter counts of L1, L3, L6, L8 (with biases). (c) Receptive field at L6. (10 m)
  3. SSL comparison. For CLIP, DINO, MAE, and JEPA: (a) What is the supervision signal (one sentence each)? (b) Build a table: negatives required? augmentations? pixel reconstruction? (c) For a robotics task needing temporal dynamics in video, which method is most natural to extend, and why? (10 m)
  4. ViT pipeline. 224×224 RGB input, patch size 16, D = 768, 12 heads (d_k = 64), FFN hidden dim = 4D. (a) Sequence length entering the Transformer (with CLS). (b) Order of operations in one encoder block (PreNorm), with LayerNorm placement. (c) Total parameter count of one block. (d) Why are positional embeddings needed? (10 m)