Computer Vision
CSE471 Subjective Questions
Question types: short answer, long answer, derivation, numerical, proof, comparison, architecture.
Trace the R-CNN → Fast R-CNN → Faster R-CNN evolution. For each step, identify the specific bottleneck removed and the mechanism that replaces it.
Define IoU and GIoU. Argue why pure IoU is unsuitable as a regression loss, and explain how GIoU fixes the problem.
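A minimal pure-Python sketch one could use in an answer (corner format `(x1, y1, x2, y2)` assumed); it shows the key failure mode: for disjoint boxes IoU is identically 0 (no gradient), while GIoU still distinguishes "near miss" from "far miss":

```python
def iou_giou(a, b):
    """IoU and GIoU for axis-aligned boxes (x1, y1, x2, y2).
    GIoU = IoU - |C \\ (A ∪ B)| / |C|, with C the smallest enclosing box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c - union) / c
    return iou, giou

# Disjoint boxes: IoU saturates at 0, GIoU still varies with distance.
print(iou_giou((0, 0, 2, 2), (3, 0, 5, 2)))    # IoU = 0, GIoU < 0
print(iou_giou((0, 0, 2, 2), (10, 0, 12, 2)))  # IoU = 0, GIoU more negative
```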
Write the YOLO output tensor shape and explain each dimension.
Why does Focal Loss help single-stage detectors? Write its formula and explain the role of each component.
Describe the NMS algorithm step-by-step. Why is it applied per class, and what does Soft-NMS change?
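A greedy NMS sketch to anchor the step-by-step answer (`nms` is an illustrative helper, not a library API; Soft-NMS would replace the hard drop with a score decay):

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes whose IoU with
    it exceeds iou_thr, repeat. Run per class in a detector so that
    overlapping detections of different classes survive."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        # Soft-NMS would instead decay scores of overlapping boxes here.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the near-duplicate of box 0 is suppressed
```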
Explain why YOLO regresses on √w, √h instead of w, h directly.
Differentiate semantic, instance, and panoptic segmentation with one-line examples each.
Why does FCN need transposed convolutions, and what insight improves FCN-32s → FCN-8s?
Compare RoI Pool and RoI Align. Why is the difference critical for Mask R-CNN?
What is dilated/atrous convolution and why use it for segmentation? Identify one drawback.
Why does MiDaS use a scale-and-shift-invariant loss? What does the output mean?
Why does coordinate regression fail for pose estimation, and what does heatmap regression do differently?
Top-down vs bottom-up multi-person pose estimation: give one trade-off in each direction.
What are Part Affinity Fields in OpenPose? Explain the line-integral score and how PAFs convert grouping into bipartite matching.
State the SMPL parameterisation. How many learnable values describe a posed body?
List the four properties of point clouds that make standard CNNs unsuitable, then describe VoxNet's solution and its drawbacks.
State and informally justify the PointNet universal-approximation property.
Why use MAX rather than SUM or AVERAGE in PointNet?
What does DGCNN's 'dynamic graph' change versus PointNet++?
Explain the three pillars of 3D Gaussian Splatting and what makes it 'optimisation, not learning'.
Count the learnable parameters per Gaussian in 3DGS.
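Assuming the original paper's setting of degree-3 spherical harmonics for colour, the tally works out as follows (a sketch of the expected counting, not the only defensible convention):

```python
# Learnable parameters per Gaussian in 3DGS, assuming degree-3 SH colour.
position = 3          # mean mu
rotation = 4          # unit quaternion defining R
scale = 3             # diagonal of S
opacity = 1           # alpha
sh_color = 3 * 16     # 3 channels x (3+1)^2 SH coefficients = 48
total = position + rotation + scale + opacity + sh_color
print(total)  # 59
```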
Why parameterise Σ as R·S·Sᵀ·Rᵀ rather than as a 3×3 symmetric matrix?
Describe Adaptive Density Control's three operations.
Write scaled dot-product attention and explain the √dₖ factor with a variance argument.
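A NumPy sketch of the formula plus a numerical check of the variance argument: for unit-variance components, q·k over d_k dimensions has variance d_k, so dividing by √d_k restores unit variance and keeps the softmax out of its saturated, vanishing-gradient regime:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V  --  scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Variance check: unscaled dot products have variance ~ d_k.
rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
dots = (q * k).sum(axis=1)
print(dots.var())                   # ~ d_k = 64
print((dots / np.sqrt(d_k)).var())  # ~ 1
```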
Distinguish self-attention, masked self-attention, and cross-attention.
Why are positional encodings essential, and what is the sinusoidal formula?
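A NumPy sketch of the sinusoidal formula, with even dimensions taking sin and odd dimensions cos at geometrically spaced wavelengths:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(64, 128)
print(pe.shape)  # (64, 128)
```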
Why use LayerNorm rather than BatchNorm in Transformers?
Walk through the ViT input pipeline end-to-end.
Compute the parameter count of ViT-B/16 (L=12, d=768, d_ff=3072, patch 16, image 224).
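A back-of-envelope tally, assuming learned positional embeddings, a CLS token, biases throughout, and excluding the classification head (conventions vary slightly, but the total should land near the quoted ~86M for ViT-B/16):

```python
# ViT-B/16 parameter count, back of the envelope.
L, d, d_ff, p, img, C = 12, 768, 3072, 16, 224, 3
n_patches = (img // p) ** 2                      # 14 * 14 = 196

patch_embed = C * p * p * d + d                  # linear patch projection
pos_embed = (n_patches + 1) * d + d              # 197 positions + CLS token
attn = 4 * (d * d + d)                           # Wq, Wk, Wv, Wo with biases
mlp = d * d_ff + d_ff + d_ff * d + d             # two linear layers
ln = 2 * (2 * d)                                 # two LayerNorms per block
per_block = attn + mlp + ln
total = patch_embed + pos_embed + L * per_block + 2 * d  # + final LayerNorm
print(per_block)  # ~7.1M per encoder block
print(total)      # ~86M
```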
When ViT resolution increases 224 → 336, what changes? When patch size 32 → 16, what changes?
What is Swin Transformer's key idea, and how does it fix ViT's O(n²)?
Walk through one SimCLR training step.
Write InfoNCE / NT-Xent. Why is it 'just cross-entropy'?
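A NumPy sketch of NT-Xent that makes the "just cross-entropy" reading explicit: each anchor faces a (2N−1)-way classification over the other embeddings, and the "correct class" is its partner view:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N L2-normalised embeddings, where z[2k] and z[2k+1]
    are two augmented views of example k. Per anchor this is plain
    cross-entropy over 2N-1 candidates with the partner view as target."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)        # exclude self-similarity
    pos = np.arange(len(z)) ^ 1           # partner index: 0<->1, 2<->3, ...
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pos].mean()

rng = np.random.default_rng(0)
views = rng.standard_normal((4, 8))       # 2 examples x 2 views
print(nt_xent(views))
```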
Why does SimCLR use a projection head g(·) and discard it for downstream tasks?
Differentiate SimCLR, MoCo, and BYOL in one line each.
How does CLIP enable zero-shot classification?
Explain DINO's anti-collapse mechanisms. Why must centering and sharpening work together?
What is DINO's multi-crop strategy and why is it the key to emergent properties?
Why does MAE mask 75% while BERT only masks 15%?
How does JEPA differ from MAE?
Enumerate the seven modern transformer upgrades from the original 2017 Transformer to today's ViT/LLM stack. Give a one-line rationale per upgrade.
Why does RoPE's dot product depend only on relative position?
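A single-frequency numerical check (real RoPE uses a different θ per 2-D pair, but the relative-position property is identical per pair): since ⟨R(mθ)q, R(nθ)k⟩ = ⟨q, R((n−m)θ)k⟩, the dot product depends only on the offset n−m:

```python
import numpy as np

def rope_rotate(x, pos, theta=0.1):
    """Rotate each 2D pair of x by pos*theta (single-frequency RoPE sketch)."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x = x.reshape(-1, 2)
    return np.concatenate([x[:, :1] * c - x[:, 1:] * s,
                           x[:, :1] * s + x[:, 1:] * c], axis=1).ravel()

rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)

d1 = rope_rotate(q, 3) @ rope_rotate(k, 7)    # positions (3, 7), offset 4
d2 = rope_rotate(q, 10) @ rope_rotate(k, 14)  # positions (10, 14), offset 4
print(np.isclose(d1, d2))  # True: same offset, same score
```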
What problem does Flash Attention solve and how?
What is GQA, and how does it differ from MQA and MHA?
Describe the three-pillar VLM blueprint used by PaliGemma and the Prefix-LM masking pattern.
Differentiate CLIP and SigLIP losses. Why does SigLIP scale better?
Why does 1D RoPE break for images, and how do 2D-RoPE and M-RoPE fix it?
**Case study (PYQ-style):** You are given a digital alarm-clock image plus a text input listing the digits shown, e.g., ["05", "23"]. Design a Vision-Language Model that predicts the time the clock displays. (a) Architecture choice; (b) attention masking; (c) handling ambiguous digits; (d) zero-shot behaviour when the text input is dropped.
Why is video ≠ images × T? Give one concrete example.
Explain I3D's 'inflation' trick and why it matters.
Compare Two-Stream and SlowFast networks. What's the conceptual link?
What are the four attention factorisations in TimeSformer, and why did divided space-time win?
**Assignment 3 case study:** Extend Faster R-CNN to predict oriented bounding boxes (axis-aligned + rotation angle). (a) What changes in the dataset/annotations? (b) Compare direct angle regression vs multi-bin classification for angle prediction. (c) How would you modify mAP computation for oriented boxes? (d) Loss weighting between angle and existing terms — what would you tune?
**Assignment 3 case study:** Modify a Vanilla U-Net to perform multi-task learning: semantic segmentation + monocular depth from a single RGB image. (a) Where in the U-Net do you branch? (b) What loss combination would you use? (c) Why is U-Net a good backbone for both tasks simultaneously? (d) How would you evaluate?
**Assignment 4 case study:** Implement and train ViT from scratch on CIFAR-10. (a) Why is small-image classification a challenging benchmark for ViT? (b) Compare patch sizes 2, 4, 8 — what changes? (c) PreNorm vs PostNorm: what would the variance trajectory look like through depth? (d) Four positional embedding variants — what comparison would you expect?
**Assignment 4 case study:** Implement Convolutional Vision Transformer (CvT) on CIFAR-10. (a) What CNN inductive biases does CvT inject into ViT? (b) Role of the Convolutional Token Embedding vs ViT's linear patch projection. (c) Why use depth-wise separable convolutions for Q, K, V projection — savings and impact? (d) Why does CvT often drop positional embeddings?