
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits

Mock Paper 12 — True/False with Justification + Multiple Choice

Duration: 150 min • Max marks: 100

Section A — True/False with Justification (2 marks each, 50 marks)

  1. A 3×3 conv with stride 2 always reduces spatial dimensions by exactly half.
  2. Max pooling propagates gradient only to the max position during backprop.
  3. ReLU(x) = max(0, x) has gradient 0 everywhere except at x > 0.
  4. Two stacked 3×3 convs have the same receptive field as one 5×5 conv but with MORE parameters.
  5. Erosion followed by dilation (opening) restores the original image exactly.
  6. Gaussian smoothing in 2D is separable into two 1D convolutions.
  7. DCT is used in JPEG because it produces real-valued output.
  8. BatchNorm normalises with batch statistics during BOTH training and inference.
  9. ResNet skip y = F(x) + x requires F(x) and x to have the same dimensions.
  10. YOLO uses √w and √h to penalise small-box errors more than large-box errors.
  11. NMS should be applied across all classes simultaneously.
  12. Dice loss is equivalent to IoU loss up to a monotonic transformation.
  13. Mask R-CNN's RoI Align uses bilinear interpolation to avoid RoI Pool's quantisation errors.
  14. Heatmap pose regression beats coordinate regression because heatmaps preserve spatial structure.
  15. OpenPose's Part Affinity Fields are 1D scalar maps along each limb.
  16. Self-attention complexity is O(N²·D) for sequence length N and model dim D.
  17. Adding a constant c to softmax inputs changes the output: softmax(x + c) ≠ softmax(x).
  18. ViT's CLS token is initialised randomly and learned during training.
  19. Pre-norm Transformers apply LayerNorm AFTER the residual addition: y = LN(x + F(x)).
  20. CLIP uses contrastive learning with softmax over N candidates in a batch of N pairs.
  21. DINO's teacher is updated by gradient descent.
  22. MAE's encoder processes only the visible (unmasked) patches.
  23. JEPA predicts pixel values in the masked regions.
  24. 3D Gaussian Splatting trains a neural network to represent the scene.
  25. Spherical Harmonics in 3DGS enable view-dependent colour (specularity, reflections).
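Some of these statements can be checked empirically rather than from memory. The sketch below (a minimal NumPy check, with an arbitrary 8×8 test image and kernel size chosen for illustration) verifies the claims behind Q6 (2D Gaussian separability) and Q17 (softmax under a constant shift):

```python
import numpy as np

def gaussian_1d(size=5, sigma=1.0):
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

g1 = gaussian_1d()
g2 = np.outer(g1, g1)         # the 2D Gaussian kernel is an outer product

# Q6: filter rows with g1, then columns with g1 (two cheap 1D passes)...
img = np.random.rand(8, 8)
rows = np.apply_along_axis(lambda r: np.convolve(r, g1, mode="same"), 1, img)
sep = np.apply_along_axis(lambda c: np.convolve(c, g1, mode="same"), 0, rows)

# ...and compare against one direct 2D convolution with the full kernel
# (zero-padded, same-size output; the symmetric kernel needs no flipping).
k = len(g1) // 2
padded = np.pad(img, k)
direct = np.array([[np.sum(padded[i:i + 2*k + 1, j:j + 2*k + 1] * g2)
                    for j in range(img.shape[1])]
                   for i in range(img.shape[0])])
print(np.allclose(sep, direct))                     # True

# Q17: does adding a constant to every logit change the softmax output?
x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(x), softmax(x + 5.0)))    # True
```

Running it shows the separable and direct results agree, and that the softmax outputs coincide after the shift, which is what a justification for Q6 and Q17 should argue analytically.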

Section B — Multiple Choice (2 marks each, 30 marks)

  1. Output size of a 224×224 image through ViT-Base/16?
  2. In Faster R-CNN, how many anchors are generated per spatial location?
  3. Which is NOT a property of bilateral filtering: edge-preserving / separable into 1D / non-linear / spatial+range kernels?
  4. In SMPL, what are β and θ?
  5. What does YOLO's grid output S×S×(B·5 + C) encode (for PASCAL VOC: 7×7×30)?
  6. Most appropriate loss for segmenting tiny objects on a large background?
  7. DINO teacher temperature τ_t is typically:
  8. MAE achieves its best representations at what mask ratio?
  9. A 3D Gaussian in 3DGS uses how many numbers (SH degree 3)?
  10. Which method does NOT belong to the contrastive family: SimCLR / MoCo / CLIP / MAE?
  11. In SlowFast networks, the Fast pathway has:
  12. PointNet's permutation invariance comes from:
  13. PaliGemma's image-to-language connector is:
  14. Which is NOT a benefit of skip connections (ResNet-style)?
  15. For class imbalance, which is LEAST effective: focal loss / class-weighted CE / oversampling minority / increasing model capacity?
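Two of these (Q1 and Q5) reduce to small arithmetic on standard architecture constants; a minimal sketch of that arithmetic (plain Python, using the widely documented ViT-Base/16 and YOLOv1/VOC values):

```python
# Q1: sequence shape for a 224x224 image through ViT-Base/16
# (16x16 patches, hidden dim 768, plus one learned CLS token).
image_size, patch_size, hidden_dim = 224, 16, 768
patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196
seq_len = num_patches + 1                     # +1 CLS token -> 197
print(seq_len, hidden_dim)                    # 197 768

# Q5: YOLOv1 grid tensor for PASCAL VOC.
# Each of the S*S cells predicts B boxes of (x, y, w, h, confidence)
# plus C shared class scores.
S, B, C = 7, 2, 20
depth = B * 5 + C                             # 2*5 + 20 = 30
print((S, S, depth))                          # (7, 7, 30)
```

So a correct MCQ answer for Q1 should describe a 197×768 token sequence, and Q5's 7×7×30 tensor decomposes into two boxes of five numbers plus twenty class scores per cell.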

Section C — Long-Form (10 marks each, 20 marks)

  1. T/F with justification: (a) DINOv2 uses Sinkhorn-Knopp centering instead of vanilla DINO's running-mean. (b) Stable Diffusion's U-Net operates in pixel space. (c) CLIP's similarity matrix is N×N where N is the dataset size. (d) YOLOv1 can detect at most one object per grid cell. (e) OpenPose performs top-down pose estimation.
  2. For each statement, write CORRECT or provide a corrected rewrite: (a) Adam combines momentum with adaptive per-parameter learning rates. (b) L1 regularisation encourages weight decay; L2 encourages sparsity. (c) CE with softmax gives gradient = (target − softmax_output). (d) Convolution is commutative and associative. (e) Otsu's method finds the threshold minimising between-class variance.
