Saral Shiksha Yojna

Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

Subjective Questions

Short, long, derivation, numerical, proof, compare, architecture.

long · 10 marks · ch-1-1

Trace the R-CNN → Fast R-CNN → Faster R-CNN evolution. For each step, identify the specific bottleneck removed and the mechanism that replaces it.

derivation · 6 marks · ch-1-1

Define IoU and GIoU. Argue why pure IoU is unsuitable as a regression loss, and how GIoU fixes the problem.
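A minimal sketch for reference, assuming boxes in (x1, y1, x2, y2) corner format: for disjoint boxes IoU is exactly 0 no matter how far apart they are (zero gradient), while GIoU stays negative and shrinks toward 0 as the boxes approach.

```python
def iou_giou(a, b):
    """IoU and GIoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty intersection clamps to zero area)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C penalises empty space between the boxes
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return iou, giou

# Disjoint unit boxes one unit apart: IoU = 0, GIoU = -1/3.
iou, giou = iou_giou((0, 0, 1, 1), (2, 0, 3, 1))
```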

short · 4 marks · ch-1-1

Write the YOLO output tensor shape and explain each dimension.
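For reference, the YOLOv1 numbers on PASCAL VOC (S = 7 grid, B = 2 boxes per cell, C = 20 classes):

```python
S, B, C = 7, 2, 20          # grid size, boxes per cell, PASCAL VOC classes
# Each cell predicts B boxes of (x, y, w, h, confidence) plus C class scores,
# so the output tensor is S x S x (B*5 + C).
output_shape = (S, S, B * 5 + C)
```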

long · 8 marks · ch-1-1

Why does Focal Loss help single-stage detectors? Write its formula and explain the role of each component.
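A sketch of the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t), with the RetinaNet defaults α = 0.25, γ = 2: the modulating factor (1 - p_t)^γ nearly zeroes out the loss of easy, well-classified backgrounds, which would otherwise dominate the gradient in a single-stage detector.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss. p = predicted foreground probability, y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Easy background (confidently correct) vs hard background (confidently wrong):
easy = focal_loss(0.01, 0)   # p_t = 0.99 -> (1 - p_t)^2 nearly zeroes the loss
hard = focal_loss(0.99, 0)   # p_t = 0.01 -> loss stays large
```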

short · 5 marks · ch-1-1

Describe the NMS algorithm step-by-step. Why is it applied per class, and what does Soft-NMS change?
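The greedy per-class algorithm can be sketched as follows (hard NMS; Soft-NMS would decay the overlapping scores instead of deleting the boxes):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy hard NMS for one class. boxes: list of (x1, y1, x2, y2)."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring surviving box
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)            # the 0.8 box duplicates the 0.9 box
```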

derivation · 6 marks · ch-1-1

Explain why YOLO regresses on √w, √h instead of w, h directly.
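A numeric illustration of the core argument: a squared loss on raw w, h charges the same penalty for a 5-px error on a 10-px box and a 200-px box, while on the √ scale the small-box error (which hurts IoU far more) is penalised more heavily.

```python
import math

# The same 5-px width error on a small box vs a large box.
small_true, small_pred = 10.0, 15.0
large_true, large_pred = 200.0, 205.0

raw_small = (small_pred - small_true) ** 2
raw_large = (large_pred - large_true) ** 2            # identical penalty: 25

sqrt_small = (math.sqrt(small_pred) - math.sqrt(small_true)) ** 2
sqrt_large = (math.sqrt(large_pred) - math.sqrt(large_true)) ** 2
```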

compare · 6 marks · ch-2-1

Differentiate semantic, instance, and panoptic segmentation with one-line examples each.

long · 8 marks · ch-2-1

Why does FCN need transposed convolutions, and what insight improves FCN-32s → FCN-8s?

short · 5 marks · ch-2-1

Compare RoI Pool and RoI Align. Why is the difference critical for Mask R-CNN?

long · 8 marks · ch-2-1

What is dilated/atrous convolution and why use it for segmentation? Identify one drawback.

short · 4 marks · ch-2-1

Why does MiDaS use a scale-and-shift-invariant loss? What does the output mean?

long · 8 marks · ch-3-1

Why does coordinate regression fail for pose estimation, and what does heatmap regression do differently?

compare · 6 marks · ch-3-1

Top-down vs bottom-up multi-person pose estimation: give one trade-off in each direction.

long · 10 marks · ch-3-1

What are Part Affinity Fields in OpenPose? Explain the line-integral score and how PAFs convert grouping into bipartite matching.

short · 4 marks · ch-3-1

State the SMPL parameterisation. How many learnable values describe a posed body?
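A quick count under the standard SMPL convention (24 joints with 3-D axis-angle rotations, the first being the global orientation, plus 10 shape betas; global translation excluded):

```python
joints = 24
pose = joints * 3        # axis-angle rotation per joint, incl. global orientation
shape = 10               # PCA shape coefficients (betas)
smpl_params = pose + shape   # values describing a posed body
```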

long · 8 marks · ch-4-1

List the four properties of point clouds that make standard CNNs unsuitable, then describe VoxNet's solution and its drawbacks.

proof · 8 marks · ch-4-1

State and informally justify the PointNet universal-approximation property.

short · 4 marks · ch-4-1

Why use MAX rather than SUM or AVERAGE in PointNet?
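A small demonstration of one part of the answer: all three poolings are permutation-invariant, but MAX additionally ignores redundant points (only a "critical point set" determines the output), whereas SUM and AVERAGE shift whenever points are duplicated or densified.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.random((5, 8))                 # per-point features: 5 points, 8 dims

g_max = feats.max(axis=0)                  # PointNet-style global feature
g_sum = feats.sum(axis=0)

# Duplicate one point: max is unchanged, sum is not.
dup = np.vstack([feats, feats[0]])
max_stable = np.allclose(dup.max(axis=0), g_max)
sum_stable = np.allclose(dup.sum(axis=0), g_sum)
```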

short · 4 marks · ch-4-1

What does DGCNN's 'dynamic graph' change versus PointNet++?

long · 10 marks · ch-5-1

Explain the three pillars of 3D Gaussian Splatting and what makes it 'optimisation, not learning'.

numerical · 5 marks · ch-5-1

Count the learnable parameters per Gaussian in 3DGS.
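The count, assuming the original 3DGS setup with degree-3 spherical harmonics for view-dependent colour:

```python
mean = 3                       # 3D position
scale = 3                      # per-axis scale (diagonal of S)
rotation = 4                   # unit quaternion (R)
opacity = 1
sh_color = 3 * (3 + 1) ** 2    # degree-3 SH: 16 coeffs per RGB channel = 48
params_per_gaussian = mean + scale + rotation + opacity + sh_color
```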

derivation · 6 marks · ch-5-1

Why parameterise Σ as R·S·Sᵀ·Rᵀ rather than as a 3×3 symmetric matrix?

short · 5 marks · ch-5-1

Describe Adaptive Density Control's three operations.

derivation · 6 marks · ch-6-1

Write scaled dot-product attention and explain the √dₖ factor with a variance argument.
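A NumPy sketch of softmax(QKᵀ/√d_k)V, plus a Monte-Carlo check of the variance argument: for q, k with i.i.d. zero-mean unit-variance entries, Var(q·k) = d_k, so dividing by √d_k restores unit variance and keeps the softmax out of its saturated, vanishing-gradient regime.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((20000, d_k))
k = rng.standard_normal((20000, d_k))
dots = (q * k).sum(axis=1)                  # 20000 raw dot products
var_raw = dots.var()                        # close to d_k = 512
var_scaled = (dots / np.sqrt(d_k)).var()    # close to 1
```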

short · 4 marks · ch-6-1

Distinguish self-attention, masked self-attention, and cross-attention.

short · 4 marks · ch-6-1

Why are positional encodings essential, and what is the sinusoidal formula?
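The sinusoidal formula, PE[pos, 2i] = sin(pos/10000^(2i/d)) and PE[pos, 2i+1] = cos(pos/10000^(2i/d)), sketched in NumPy (even d_model assumed):

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per dim pair
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 128)
```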

compare · 5 marks · ch-6-1

Why use LayerNorm rather than BatchNorm in Transformers?

architecture · 8 marks · ch-7-1

Walk through the ViT input pipeline end-to-end.

numerical · 6 marks · ch-7-1

Compute the parameter count of ViT-B/16 (L=12, d=768, d_ff=3072, patch 16, image 224).
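One way to organise the count, assuming a learnable [CLS] token, 1-D positional embeddings for 197 tokens, two LayerNorms per pre-norm block, a final LayerNorm, a 1000-class linear head, and all biases included (the exact total depends on these conventions; it lands at the commonly quoted ≈86M):

```python
L, d, d_ff, p, img, n_cls = 12, 768, 3072, 16, 224, 1000
n_patches = (img // p) ** 2                 # 14 x 14 = 196 patches
patch_embed = p * p * 3 * d + d             # linear patch projection + bias
cls_token = d
pos_embed = (n_patches + 1) * d             # 197 token positions
attn = 4 * (d * d + d)                      # Q, K, V, and output projections
mlp = (d * d_ff + d_ff) + (d_ff * d + d)    # two-layer MLP
ln = 2 * (2 * d)                            # two LayerNorms (gain + bias each)
block = attn + mlp + ln
final_ln = 2 * d
head = d * n_cls + n_cls                    # classification head
total = patch_embed + cls_token + pos_embed + L * block + final_ln + head
```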

short · 4 marks · ch-7-1

When ViT resolution increases 224 → 336, what changes? When patch size 32 → 16, what changes?

long · 8 marks · ch-7-1

What is Swin Transformer's key idea, and how does it fix ViT's O(n²)?
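A back-of-envelope comparison, counting only attention-score entries and assuming Swin-T-like stage-1 numbers (224-px input, 4-px patch embedding, 7×7 windows, d = 96): window attention replaces the n² term with one that is linear in n, here a 64× saving.

```python
def attn_cost(n, d):
    """Score-matrix entries for full self-attention over n tokens."""
    return n * n * d

n = 56 * 56                          # tokens from 224 px / 4 px patches
full = attn_cost(n, 96)              # global ViT-style attention: O(n^2)
windows = n // (7 * 7)               # non-overlapping 7x7 windows
swin = windows * attn_cost(7 * 7, 96)   # attention inside each window: O(n)
ratio = full / swin                  # = n / 49
```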

long · 8 marks · ch-8-1

Walk through one SimCLR training step.

short · 4 marks · ch-8-1

Write InfoNCE / NT-Xent. Why is it 'just cross-entropy'?
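A NumPy sketch showing the "just cross-entropy" reading: for each of the 2N views, the target "class" is the index of its positive partner, and the logits are temperature-scaled cosine similarities with self-pairs masked out.

```python
import numpy as np

def nt_xent(z_i, z_j, tau=0.5):
    """NT-Xent over N positive pairs (2N views), written as cross-entropy."""
    z = np.vstack([z_i, z_j])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau                                # logits
    np.fill_diagonal(sim, -np.inf)                     # mask self-pairs
    n = len(z_i)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy: -log softmax(logits)[target], averaged over 2N anchors.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logp[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(0)
z_i = rng.standard_normal((4, 16))
z_j = z_i + 0.01 * rng.standard_normal((4, 16))        # two "augmented views"
loss = nt_xent(z_i, z_j)
```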

short · 5 marks · ch-8-1

Why does SimCLR use a projection head g(·) and discard it for downstream tasks?

compare · 6 marks · ch-8-1

Differentiate SimCLR, MoCo, and BYOL in one line each.

short · 4 marks · ch-8-1

How does CLIP enable zero-shot classification?

long · 10 marks · ch-9-1

Explain DINO's anti-collapse mechanisms. Why must centering and sharpening work together?

short · 4 marks · ch-9-1

What is DINO's multi-crop strategy and why is it the key to emergent properties?

compare · 6 marks · ch-9-1

Why does MAE mask 75% while BERT only masks 15%?

short · 5 marks · ch-9-1

How does JEPA differ from MAE?

long · 10 marks · ch-10-1

Enumerate the seven modern transformer upgrades from the original 2017 Transformer to today's ViT/LLM stack. Give a one-line rationale per upgrade.

derivation · 5 marks · ch-10-1

Why does RoPE's dot product depend only on relative position?
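A numerical check of the property: RoPE rotates each 2-D feature pair of a vector at position m by m·θ^(-2i/d), so ⟨R_m q, R_n k⟩ is a function of m − n only. Shifting both positions by the same offset leaves the dot product unchanged.

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    """Apply RoPE: rotate each consecutive 2D pair of x by pos * theta^(-i/d)."""
    d = len(x)
    out = x.astype(float)
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)
        c, s = np.cos(angle), np.sin(angle)
        out[i] = c * x[i] - s * x[i + 1]
        out[i + 1] = s * x[i] + c * x[i + 1]
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Both pairs have relative offset 4, so the dot products match.
dot_a = rope(q, 3) @ rope(k, 7)
dot_b = rope(q, 103) @ rope(k, 107)
```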

short · 4 marks · ch-10-1

What problem does Flash Attention solve and how?

short · 4 marks · ch-10-1

What is GQA, and how does it differ from MQA and MHA?

long · 10 marks · ch-11-1

Describe the three-pillar VLM blueprint used by PaliGemma and the Prefix-LM masking pattern.

compare · 5 marks · ch-11-1

Differentiate CLIP and SigLIP losses. Why does SigLIP scale better?

long · 8 marks · ch-11-1

Why does 1D RoPE break for images, and how do 2D-RoPE and M-RoPE fix it?

long · 10 marks · ch-11-1

**Case study (PYQ-style):** You are given a digital alarm-clock image plus a text input listing the digits shown, e.g., ["05", "23"]. Design a Vision-Language Model that predicts the time the clock displays. (a) Architecture choice; (b) attention masking; (c) handling ambiguous digits; (d) zero-shot behaviour when the text input is dropped.

short · 4 marks · ch-12-1

Why is video ≠ images × T? Give one concrete example.

long · 8 marks · ch-12-1

Explain I3D's 'inflation' trick and why it matters.

compare · 6 marks · ch-12-1

Compare Two-Stream and SlowFast networks. What's the conceptual link?

long · 8 marks · ch-12-1

What are the four attention factorisations in TimeSformer, and why did divided space-time win?

architecture · 12 marks · ch-1-1

**Assignment 3 case study:** Extend Faster R-CNN to predict oriented bounding boxes (axis-aligned + rotation angle). (a) What changes in the dataset/annotations? (b) Compare direct angle regression vs multi-bin classification for angle prediction. (c) How would you modify mAP computation for oriented boxes? (d) Loss weighting between angle and existing terms — what would you tune?

architecture · 10 marks · ch-2-1

**Assignment 3 case study:** Modify a Vanilla U-Net to perform multi-task learning: semantic segmentation + monocular depth from a single RGB image. (a) Where in the U-Net do you branch? (b) What loss combination would you use? (c) Why is U-Net a good backbone for both tasks simultaneously? (d) How would you evaluate?

architecture · 12 marks · ch-7-1

**Assignment 4 case study:** Implement and train ViT from scratch on CIFAR-10. (a) Why is small-image classification a challenging benchmark for ViT? (b) Compare patch sizes 2, 4, 8 — what changes? (c) PreNorm vs PostNorm: what would the variance trajectory look like through depth? (d) Four positional embedding variants — what comparison would you expect?

architecture · 10 marks · ch-7-1

**Assignment 4 case study:** Implement Convolutional Vision Transformer (CvT) on CIFAR-10. (a) What CNN inductive biases does CvT inject into ViT? (b) Role of the Convolutional Token Embedding vs ViT's linear patch projection. (c) Why use depth-wise separable convolutions for Q, K, V projection — savings and impact? (d) Why does CvT often drop positional embeddings?