
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits
Revision Notes: Unit 2

Dense Prediction — Segmentation & Monocular Depth

Intuition

Detection labels regions; dense prediction labels pixels. The architectural challenge is restoring spatial resolution lost during downsampling — usually with transposed convolutions, skip connections, or dilated convolutions.

Explanation

Semantic segmentation assigns a class label to every pixel but does not distinguish between two instances of the same class. Instance segmentation instead produces a separate mask per instance of 'things' (countable: cars, people); 'stuff' (sky, road) is left unlabeled. Panoptic segmentation unifies the two: every pixel gets a class label, and pixels belonging to 'things' additionally get instance IDs.

Fully Convolutional Networks (FCN) replace the fully connected classifier head with 1×1 convolutions, so the network emits per-pixel class scores. A backbone (VGG/ResNet) downsamples by 32×; the decoder upsamples via transposed convolution. FCN-32s upsamples directly from the deepest layer — coarse boundaries. FCN-8s adds skip connections from pool3 and pool4 features, fusing them with the deep upsampled output to recover fine detail.
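
A minimal shape-level sketch of the FCN-8s fusion in PyTorch; channel counts assume a VGG-style backbone and num_classes = 21 is illustrative (PASCAL VOC). The real network also crops feature maps to compensate for padding offsets, omitted here:

```python
import torch
import torch.nn as nn

# Shape-level sketch of FCN-8s fusion. Channel counts (256/512) assume a
# VGG backbone's pool3/pool4; num_classes = 21 is illustrative.
num_classes = 21
score_pool3 = nn.Conv2d(256, num_classes, 1)  # 1x1 conv: pool3 features -> class scores
score_pool4 = nn.Conv2d(512, num_classes, 1)  # 1x1 conv: pool4 features -> class scores
up2_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)  # 2x upsample
up2_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)   # final 8x

# Fake features for a 256x256 input: pool3 at 1/8, pool4 at 1/16, deepest scores at 1/32.
pool3 = torch.randn(1, 256, 32, 32)
pool4 = torch.randn(1, 512, 16, 16)
deep = torch.randn(1, num_classes, 8, 8)

fused16 = up2_a(deep) + score_pool4(pool4)     # 1/32 -> 1/16, fuse pool4 skip
fused8 = up2_b(fused16) + score_pool3(pool3)   # 1/16 -> 1/8, fuse pool3 skip
out = up8(fused8)                              # 1/8 -> full resolution
print(out.shape)  # torch.Size([1, 21, 256, 256])
```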

U-Net is a symmetric encoder–decoder with concat skip connections (not residual add). At each decoder stage the corresponding encoder feature map is concatenated along the channel dimension before further convolution. The same U-shape powers Stable Diffusion's denoising network — evidence of how general the architecture is.
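
One decoder stage as a sketch, with illustrative channel sizes, to make the concat-then-convolve pattern concrete:

```python
import torch
import torch.nn as nn

# One U-Net decoder stage: concat skip (channel counts add up), not a residual add.
dec_in = torch.randn(1, 128, 32, 32)    # decoder feature coming up from below
enc_skip = torch.randn(1, 64, 64, 64)   # matching encoder feature at the higher resolution

up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # 2x learnable upsampling
conv = nn.Sequential(                    # fuse after concatenation
    nn.Conv2d(64 + 64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)

x = up(dec_in)                           # (1, 64, 64, 64)
x = torch.cat([enc_skip, x], dim=1)      # concat along channels -> (1, 128, 64, 64)
x = conv(x)
print(x.shape)  # torch.Size([1, 64, 64, 64])
```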

Dilated (atrous) convolutions insert gaps between kernel taps, enlarging the receptive field without losing resolution or adding parameters. DeepLab stacks dilated convs to maintain high-res feature maps while still capturing global context. Drawback: gridding artifacts at large dilation rates because the kernel only samples a sparse, regular grid — mitigated with co-prime rates.
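
A quick check of that claim, with illustrative shapes: across an ASPP-style range of rates, output resolution and parameter count stay fixed while only the receptive field grows:

```python
import torch
import torch.nn as nn

# Dilated conv sketch: the same 3x3 kernel (9 weights per channel pair), but the
# effective kernel extent for dilation r is 2r + 1. padding=rate keeps the size.
x = torch.randn(1, 64, 32, 32)
for rate in (1, 6, 12, 18):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=rate, padding=rate)
    y = conv(x)
    print(rate, y.shape, sum(p.numel() for p in conv.parameters()))
# Every output is (1, 64, 32, 32) and the parameter count is identical for all rates.
```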

Mask R-CNN extends Faster R-CNN with a third head: a small FCN producing a per-class binary mask (e.g., 28 × 28) for each RoI. The key engineering change is replacing RoI Pool with RoI Align, which uses bilinear interpolation at exact float coordinates instead of quantising. The mask AP gain on small objects is substantial.
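
A usage sketch of torchvision's roi_align next to roi_pool on a float-coordinate box; the 14 × 14 output size is illustrative (Mask R-CNN pools 14 × 14 features that the mask head then upsamples to its 28 × 28 masks):

```python
import torch
from torchvision.ops import roi_align, roi_pool

# RoI Align vs RoI Pool on a box with float coordinates.
feat = torch.randn(1, 256, 50, 50)                    # backbone feature map
boxes = [torch.tensor([[12.3, 7.8, 30.1, 25.6]])]     # float box in feature-map coords

aligned = roi_align(feat, boxes, output_size=(14, 14), spatial_scale=1.0, sampling_ratio=2)
pooled = roi_pool(feat, boxes, output_size=(14, 14), spatial_scale=1.0)
print(aligned.shape, pooled.shape)  # both (1, 256, 14, 14), but roi_pool quantized the box
```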

Monocular depth is heterogeneous: ground truth comes in metric units (LiDAR/Kinect), relative units (stereo disparity), or up-to-unknown-scale (SfM). MiDaS trains across all of these by computing the loss after aligning prediction and target by an optimal scale and shift, so the network learns relative depth shape rather than absolute distance.
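
A sketch of the alignment step only, using a per-image closed-form least-squares fit for scale and shift; MiDaS's full loss additionally trims outliers and adds multi-scale gradient-matching terms:

```python
import torch

# Align prediction to target by an optimal scale s and shift t before the loss,
# so only the relative depth shape is penalised.
def align_scale_shift(pred, target):
    # Solve min_{s,t} sum (s*pred + t - target)^2 in closed form.
    p = pred.flatten()
    g = target.flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)          # [N, 2] design matrix
    st = torch.linalg.lstsq(A, g.unsqueeze(1)).solution.squeeze(1)
    s, t = st[0], st[1]
    return s * pred + t

pred = torch.rand(1, 64, 64)          # network output, arbitrary scale/shift
target = 3.0 * pred + 0.5             # ground truth differing only by scale and shift
aligned = align_scale_shift(pred, target)
print(torch.abs(aligned - target).mean())  # ~0: the loss sees only the depth shape
```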

Definitions

  • Semantic / Instance / Panoptic: pixel-level class only / class + instance id (things only) / class + instance id (everything).
  • Transposed convolution: learnable upsampling implemented as a strided conv with the operation reversed; produces a larger output.
  • RoI Align: bilinear interpolation of the feature map at exact float coordinates within an RoI; replaces RoI Pool's quantization.
  • Dilated/Atrous convolution: conv with gaps of size (rate − 1) between kernel taps; expands receptive field without parameter growth.
  • Dice loss: 1 − Dice; less sensitive to class imbalance than CE.
  • mIoU: mean IoU across classes; standard segmentation metric.

Formulas

  • \text{Dice} = 2|A \cap B| / (|A| + |B|) = 2\,\text{IoU} / (1 + \text{IoU})
  • \text{mIoU} = \tfrac{1}{C} \sum_c |A_c \cap B_c| / |A_c \cup B_c|
  • \text{Focal loss with class weight } \alpha_t :\ \ -\alpha_t (1-p_t)^{\gamma} \log p_t
  • \text{Scale-shift-invariant L1:}\ \min_{s,t} \sum_i |s \cdot d_i + t - d_i^{*}|
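
A toy numeric check of the first two formulas on binary masks, including the Dice = 2·IoU/(1 + IoU) identity:

```python
import torch

# Verify Dice and IoU on overlapping toy masks.
def iou(a, b):
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / union

def dice(a, b):
    inter = (a & b).sum().item()
    return 2 * inter / (a.sum().item() + b.sum().item())  # sum, not union, in the denominator

a = torch.zeros(8, 8, dtype=torch.bool); a[:4, :4] = True   # 16 pixels
b = torch.zeros(8, 8, dtype=torch.bool); b[2:6, :4] = True  # 16 pixels, 8 overlapping

x = iou(a, b)                       # 8 / 24 = 1/3
print(dice(a, b), 2 * x / (1 + x))  # 0.5 and 0.5: the identity holds
```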

Derivations

Dice = 2·IoU/(1+IoU): let x = IoU = |A∩B| / |A∪B|, so |A∩B| = x·|A∪B|. Since |A| + |B| = |A∪B| + |A∩B| = (1+x)·|A∪B|, substituting into Dice = 2|A∩B| / (|A| + |B|) gives Dice = 2x·|A∪B| / ((1+x)·|A∪B|) = 2x/(1+x).

Examples

  • Pixel accuracy fails on imbalanced scenes: 90% sky → 'predict sky everywhere' scores 0.9 pixel accuracy, but every class except sky has zero IoU, so mIoU collapses to 0.9/C.
  • On medical 3D segmentation Dice loss is preferred because the foreground (tumor) typically occupies < 5% of voxels; CE alone is dominated by easy background gradients (a loss sketch follows this list).
  • DeepLab v3 with dilation rates 6, 12, 18 — ASPP module — captures multi-scale context without resolution loss.
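
A minimal soft Dice loss sketch matching the medical example above (function name and shapes are illustrative); the eps guards the empty-mask case, and in practice Dice is often combined with CE/BCE:

```python
import torch

# Soft Dice loss for imbalanced binary segmentation: probabilities in, 1 - Dice out.
def soft_dice_loss(probs, target, eps=1e-6):
    # probs, target: (B, H, W) foreground probabilities and binary ground truth
    probs = probs.flatten(1)
    target = target.flatten(1).float()
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)          # |A| + |B|, not the union
    dice = (2 * inter + eps) / (denom + eps)
    return (1 - dice).mean()

probs = torch.rand(2, 64, 64)
target = (torch.rand(2, 64, 64) > 0.97).long()            # ~3% foreground, as in the example
print(soft_dice_loss(probs, target))
```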

Diagrams

  • FCN-32s vs FCN-8s side-by-side: 32× upsample directly vs adding pool3/pool4 skips and upsampling in stages.
  • U-Net U-shape with horizontal concat skips at each resolution.
  • RoI Align bilinear sampling diagram: a floating-point RoI overlaid on the feature grid, with 4 sample points per bin interpolated bilinearly.

Edge cases

  • Dilation gridding artifacts at high rates — use co-prime dilation rates in series.
  • RoI Pool fails on small objects (sub-pixel quantization error dominates) — fix with RoI Align.
  • Class imbalance with CE: use Dice loss or weighted CE; pure CE saturates on background.

Common mistakes

  • Conflating semantic and instance segmentation in panoptic.
  • Using pixel accuracy alone for evaluation on imbalanced scenes.
  • Computing Dice with union in the denominator (it's a sum: |A| + |B|).
  • Forgetting that monocular depth from MiDaS is RELATIVE, not metric.

Shortcuts

  • U-Net skip = CONCAT. ResNet skip = ADD. Don't confuse.
  • Transposed conv is learnable upsampling, NOT 'deconvolution' (doesn't invert anything).
  • Mask R-CNN mask head loss = per-pixel BCE on the mask of the correct class only.

Proofs / Algorithms