Saral Shiksha Yojna

Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field

Intuition

A convolution layer is a stack of learned filters slid across an image — a CNN is a hierarchy of these, with each layer composing simpler features (edges → texture → parts → objects). The reason CNNs work in vision: WEIGHT SHARING (one filter, all spatial locations) gives translation equivariance (a shifted input produces a shifted output) and slashes parameters by ~1000× vs a dense network. The decade of CNN architecture research (2012–2020) was about how to go deeper, wider, or more efficient without breaking trainability.

Explanation

Convolution layer parameters. A conv layer with $C_{out}$ filters, kernel K×K, input channels $C_{in}$: parameters = $(K^2 C_{in} + 1)\,C_{out}$ (the +1 is bias). **Independent of $H, W$**: the same filter is applied at every position. Compare to a fully-connected layer between $H W C_{in}$ inputs and $H W C_{out}$ outputs: $(H W C_{in})(H W C_{out})$ parameters, typically 1000× more. This is the *parameter efficiency* of CNNs.
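A minimal sketch of the two counts above, useful for sanity-checking hand calculations. The function names and the example layer (3×3 conv, 64 → 128 channels on a 32×32 map) are illustrative assumptions, not values from the notes.

```python
def conv_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Learnable parameters of a k x k conv layer: (k*k*c_in + 1) * c_out with bias."""
    per_filter = k * k * c_in + (1 if bias else 0)
    return per_filter * c_out


def dense_params(h: int, w: int, c_in: int, c_out: int) -> int:
    """Weights of a fully-connected layer from an h*w*c_in input to an h*w*c_out output."""
    return (h * w * c_in) * (h * w * c_out)


# Hypothetical layer: the conv count never looks at h or w, the dense count does.
conv = conv_params(3, 64, 128)
dense = dense_params(32, 32, 64, 128)
print(f"conv: {conv:,}  dense: {dense:,}  ratio: {dense // conv:,}x")
```

Changing the spatial size in `dense_params` changes the dense count, while `conv_params` has no spatial argument at all, which is the weight-sharing point made above.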

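Output spatial size. $H_{out} = \lfloor (H_{in} + 2P - K)/S \rfloor + 1$. Same padding with odd $K$ uses $P = (K-1)/2$ → output size equals input (at stride 1). Valid padding ($P = 0$) shrinks by $K - 1$. Two stacked 3×3 convs (stride 1) have the same RECEPTIVE FIELD as one 5×5 conv but use FEWER parameters ($18C^2$ vs $25C^2$) AND add an extra non-linearity. Three 3×3 = 7×7 RF, $27C^2$ vs $49C^2$.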
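The output-size rule can be sketched as a one-line helper. The names `conv_out_size` and `same_pad` are assumptions made for this example.

```python
def conv_out_size(n: int, k: int, stride: int = 1, pad: int = 0) -> int:
    """Output length along one axis: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1


def same_pad(k: int) -> int:
    """Symmetric padding that preserves size at stride 1 (odd k only)."""
    if k % 2 == 0:
        raise ValueError("symmetric same padding needs an odd kernel size")
    return (k - 1) // 2


# 7x7 conv, stride 2, pad 3 on a 224-pixel axis (the ResNet-style stem).
print(conv_out_size(224, 7, stride=2, pad=3))
# Same padding keeps the size; valid padding shrinks it by k - 1.
print(conv_out_size(32, 3, pad=same_pad(3)), conv_out_size(32, 5))
```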

Receptive field. Set of input pixels that influence a given output. For $L$ stacked K×K convs at stride 1: RF = $1 + L(K-1)$. Each stride-2 step doubles the RF growth contributed by every later layer. Architectures balance RF (need large to see objects), spatial resolution (need fine for dense prediction), and compute.
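A short sketch of how the receptive field accumulates through a stack, including strided layers. It uses the standard recurrence (RF grows by $(K-1)$ times the current "jump", and the jump is multiplied by each stride); the function name and the layer lists are illustrative assumptions.

```python
def receptive_field(layers):
    """layers: iterable of (kernel, stride) pairs, ordered from input to output.

    Returns the receptive-field size along one axis.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


five_convs = [(3, 1)] * 5
with_pool = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]  # 2x2 stride-2 pool in the middle
print(receptive_field(five_convs), receptive_field(with_pool))
```

With all strides equal to 1 the recurrence reduces to $1 + L(K-1)$; after the stride-2 pool, each later 3×3 adds 4 pixels instead of 2.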

1×1 convolution: four uses. (1) Channel reduction (bottleneck): cheap way to drop $C$ before an expensive 3×3 or 5×5. (2) Non-linearity injection: per-pixel MLP between conv layers. (3) Cross-channel mixing without spatial mixing. (4) Cheap: only $C_{in} C_{out}$ params. Used in: Inception (before expensive convs), ResNet bottleneck, MobileNet pointwise, SENet, feature-pyramid laterals.

Pooling. Max-pool routes argmax → ROBUST to small translations, dominant in classification. Avg-pool takes an equal-weight average; Global Average Pooling (GAP) before the classifier replaces giant FC layers (ResNet, EfficientNet). No parameters: pooling only reduces downstream FLOPs, not weights. Backward: max-pool routes gradient to argmax only; avg-pool distributes equally ($1/K^2$ to each input in a K×K window).

Dilated (atrous) convolution. Insert zeros between kernel taps. Effective kernel size = $K + (K-1)(d-1)$ for dilation rate $d$. A 3×3 with dilation 2 has a 5×5 RF using 9 parameters. Use: semantic segmentation (DeepLab: keep resolution while enlarging RF), WaveNet (1D causal). Drawback: gridding artefact at large dilation rates.
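To make the "zeros between taps" picture concrete, here is a minimal 1D sketch (valid padding, stride 1). It is an illustration in plain Python, not any library's implementation; the function names are assumptions.

```python
def effective_kernel(k: int, d: int) -> int:
    """Span covered by a k-tap kernel with dilation d: k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)


def dilated_conv1d(x, w, d=1):
    """Valid 1D cross-correlation whose taps are spaced d samples apart."""
    span = effective_kernel(len(w), d)
    return [
        sum(w[j] * x[i + j * d] for j in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6]
taps = [1, 1, 1]  # still only 3 parameters at any dilation
print(effective_kernel(3, 2))
print(dilated_conv1d(signal, taps, d=1), dilated_conv1d(signal, taps, d=2))
```

At `d=2` each output sums samples two apart, so the three taps cover five input positions while skipping the ones in between.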

Standard vs depthwise-separable convolution. Standard: $K^2 C_{in} C_{out}$ params. Depthwise-separable = depthwise (K×K per input channel) + pointwise (1×1). Params = $K^2 C_{in} + C_{in} C_{out}$. Ratio ≈ $1/C_{out} + 1/K^2$, about 8 to 9× cheaper for $K = 3$. Used in MobileNet, Xception.
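The ratio above can be checked numerically. This is a sketch with hypothetical widths ($C_{in} = C_{out} = 256$, chosen only for illustration) and bias terms ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise (k*k per input channel) plus pointwise 1x1 (c_in * c_out)."""
    return k * k * c_in + c_in * c_out


def cost_ratio(k: int, c_in: int, c_out: int) -> float:
    """Separable / standard; algebraically 1/c_out + 1/k^2."""
    return separable_conv_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)


ratio = cost_ratio(3, 256, 256)  # hypothetical layer widths
print(f"ratio = {ratio:.4f}, i.e. about {1 / ratio:.1f}x cheaper")
```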

BatchNorm in CNNs. Normalise per channel across the $(N, H, W)$ axes; learnable $\gamma, \beta$ per channel (length $C$). At inference: use running mean/var, NOT batch stats. Effect: faster training, higher LR, less init-sensitive, slight regularisation. Variants: LayerNorm (across $(C, H, W)$ per sample; ViT applies it over the feature dimension of each token), GroupNorm (groups of channels; works at batch=1), InstanceNorm (per-sample per-channel; style transfer).
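A plain-Python sketch of the training-time computation, to show which axes are pooled for the statistics. The nested-list layout `[N][C][H][W]`, the function name, and the returned `stats` list are assumptions made for this example; a real framework would also maintain running averages of these statistics for inference.

```python
def batchnorm2d_train(x, gamma, beta, eps=1e-5):
    """x: nested list [N][C][H][W]. Normalises each channel over (N, H, W).

    Returns (normalised tensor, per-channel (mean, biased variance)).
    """
    n_channels = len(x[0])
    out = [[[list(row) for row in ch] for ch in sample] for sample in x]
    stats = []
    for c in range(n_channels):
        # Pool every value of channel c across batch and spatial positions.
        vals = [v for sample in x for row in sample[c] for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var))
        scale = gamma[c] / (var + eps) ** 0.5  # one gamma/beta per channel
        for sample in out:
            sample[c] = [[(v - mean) * scale + beta[c] for v in row] for row in sample[c]]
    return out, stats


# Toy batch: N=2, C=2, H=1, W=2.
x = [[[[1.0, 2.0]], [[10.0, 10.0]]],
     [[[3.0, 4.0]], [[20.0, 20.0]]]]
y, stats = batchnorm2d_train(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(stats)
```

`gamma` and `beta` have length $C$, not $H \times W$, matching the per-channel convention described above.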

Equivariance vs invariance. Convolutions are TRANSLATION EQUIVARIANT: $f(T_\delta x) = T_\delta f(x)$, i.e. shift the input and the output shifts the same way. After global pooling + FC, the network becomes translation INVARIANT: $g(T_\delta x) = g(x)$. CNNs are NOT rotation equivariant: kernels are learned, not symmetric. Fix: data augmentation, or G-CNNs (group-equivariant CNNs).

LeNet-5 (LeCun, 1989/1998). First CNN to win at scale — handwritten digit recognition. Two conv-pool blocks then FC. ~60 k parameters. Trained on USPS / MNIST. The architecture template every later CNN follows: conv → pool → conv → pool → flatten → FC → softmax.

AlexNet (Krizhevsky, Sutskever, Hinton, 2012). Won ILSVRC-2012 by a huge margin (top-5 error 16% vs 26% next best) and started the deep-learning revolution. 60 M params, 8 layers (5 conv + 3 FC), trained on 2 GPUs. Three innovations: ReLU (much faster than tanh/sigmoid), dropout (in the FC layers, $p = 0.5$), data augmentation (random crops + horizontal flips + colour jitter). Top-1 accuracy = 56.5%.

VGG (Simonyan & Zisserman, 2014). Stack lots of 3×3 convs. VGG-16: 138 M params. VGG-19: 143.7 M params, 19.6 GFLOPs, top-1 74.2%. Heavy: most parameters live in the FC layers. Lesson: depth + tiny filters > shallow + huge filters.

Inception / GoogLeNet (Szegedy et al., 2014). Inception module = parallel branches at multiple kernel sizes (1×1, 3×3, 5×5, max-pool), concatenated. **1×1 bottleneck** placed BEFORE 3×3 / 5×5 to cut channels, which slashes parameters. Inception-v3: 27.2 M params, 5.71 GFLOPs, top-1 77.3%.

ResNet (He et al., 2015): residual connections. Beyond ~20 layers, plain CNNs DEGRADE (training loss goes up; this is an optimisation failure, not just overfitting). Fix: $y = F(x) + x$. The skip connection lets the network learn the IDENTITY by default (set $F = 0$) and only learn corrections. **Gradient: $\partial y/\partial x = \partial F/\partial x + I$**, where the '+I' term keeps a direct gradient path through the skip. Bottleneck block: 1×1 → 3×3 → 1×1 with skip. ResNet-50: 25.6 M params, 4.09 GFLOPs, top-1 76.1%. The workhorse backbone.

DenseNet (Huang et al., 2016). $x_l = H_l([x_0, x_1, \dots, x_{l-1}])$: each layer takes ALL previous layer outputs as input via CONCATENATION (not addition like ResNet). Strong gradient flow, feature reuse. DenseNet-161: 28.7 M params, 7.73 GFLOPs, top-1 77.1%.

SENet (Squeeze-Excitation, Hu et al., 2017). Channel attention module added to existing backbones: GAP across spatial → 2-layer MLP → sigmoid scale → multiply with original channels. Recalibrates which channels matter. +1% top-1 on ResNet, almost free in compute.

MobileNet (Howard et al., 2017). Depthwise-separable convs throughout → roughly 8 to 9× fewer params/FLOPs than standard 3×3 convs. MobileNet-v1: 4.2 M params, ~70% top-1. v2 adds inverted residuals and linear bottlenecks. Designed for phones / edge devices.

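EfficientNet (Tan & Le, 2019). Compound scaling: scale depth $d = \alpha^\phi$, width $w = \beta^\phi$, and input resolution $r = \gamma^\phi$ together by a single coefficient $\phi$, with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ where $\alpha, \beta, \gamma \ge 1$ (so FLOPs scale as $2^\phi$). EfficientNet-B0: 5.3 M params, 0.39 GFLOPs, top-1 77.7%, the best efficiency on the chart for years.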

Backprop through conv layer. Conv backward = conv of the upstream gradient with the FLIPPED kernel for $\partial L/\partial x$; cross-correlation of the input with the upstream gradient for $\partial L/\partial W$. MaxPool backward routes to argmax (others = 0). AvgPool backward distributes equally ($1/K^2$ each). ReLU backward is a gate (pass if forward $> 0$, else 0). Skip backward through $y = F(x) + x$ adds $I$ → no vanishing.
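The two conv-backward rules can be sketched in 1D, assuming the forward pass is a valid cross-correlation $y[i] = \sum_k w[k]\,x[i+k]$ (the usual deep-learning "convolution"). Function names are illustrative assumptions.

```python
def xcorr_valid(x, w):
    """Valid cross-correlation: y[i] = sum_k w[k] * x[i + k]."""
    return [
        sum(w[k] * x[i + k] for k in range(len(w)))
        for i in range(len(x) - len(w) + 1)
    ]


def conv_backward(x, w, g):
    """Given g = dL/dy for y = xcorr_valid(x, w), return (dL/dx, dL/dw)."""
    # dL/dw[k] = sum_i g[i] * x[i + k]: cross-correlate the input with g.
    dw = xcorr_valid(x, g)
    # dL/dx[j] = sum_i g[i] * w[j - i]: zero-pad g by len(w) - 1 on both
    # sides and cross-correlate with the flipped kernel (a "full" convolution).
    pad = [0.0] * (len(w) - 1)
    dx = xcorr_valid(pad + list(g) + pad, list(w)[::-1])
    return dx, dw


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]
g = [1.0, 1.0, 1.0]  # upstream gradient, e.g. from L = sum(y)
dx, dw = conv_backward(x, w, g)
print(dx, dw)
```

`dx` has the length of the input and `dw` the length of the kernel, which is a quick shape check on any hand derivation.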

CNN for video. C3D uses 3D convs over $(T, H, W)$: expensive (roughly an $N\times$ factor over 2D for temporal kernel size $N$), trained from scratch. I3D inflates 2D ImageNet kernels along time ($N$ copies) and divides by $N$, which borrows ImageNet pretraining for free. SlowFast has slow (low-fps, high-channel = semantics) and fast (high-fps, low-channel = motion) pathways with lateral connections. For long videos: sliding window + aggregation.

CNN for audio. 1D CNN on raw waveform — WaveNet uses DILATED 1D convs for long context. 2D CNN on spectrogram (STFT or mel) — treat audio as image; works because frequencies have spatial structure.

Definitions

  • Convolution layer: Stack of learnable filters slid across input. Weight-shared across space → translation equivariance. Params $(K^2 C_{in} + 1)\,C_{out}$, independent of $H, W$.
  • Receptive field: Set of input pixels that influence one output unit. Grows with depth (and stride). $L$ stacked K×K at stride 1 → $1 + L(K-1)$.
  • Pooling (max / avg / GAP): Downsample with NO parameters. Max routes argmax (translation-robust). Avg distributes equally. GAP collapses $H \times W \to 1 \times 1$, replaces giant FC.
  • Same / Valid padding: Same: $P = (K-1)/2$ (odd $K$) → output size equals input. Valid: $P = 0$ → output shrinks by $K - 1$.
  • 1×1 convolution: Per-pixel MLP across channels. Four uses: channel reduction (bottleneck), non-linearity injection, cross-channel mixing, cheap. Used in Inception, ResNet, MobileNet, SENet.
  • Dilated (atrous) convolution: Inserts zeros between kernel taps → effective kernel $K + (K-1)(d-1)$. Enlarges RF without extra parameters or stride. Used in DeepLab, WaveNet. Gridding artefact at large $d$.
  • Depthwise-separable convolution: Depthwise (K×K per channel) + pointwise (1×1). About 8 to 9× cheaper than standard conv at $K = 3$. Foundation of MobileNet, Xception.
  • Batch Normalisation (CNN): Normalise per channel across $(N, H, W)$; learnable $\gamma, \beta$ per channel. Inference uses running mean/var (not batch). Faster training, less init-sensitive.
  • Translation equivariance vs invariance: $f(T_\delta x) = T_\delta f(x)$ vs $g(T_\delta x) = g(x)$. Conv layers are equivariant; after global pool + FC, network is invariant. CNNs are NOT rotation equivariant.
  • LeNet (LeCun, 1989/1998): First successful CNN. ~60k params. Two conv-pool blocks then FC. Template every later CNN follows.
  • AlexNet (Krizhevsky et al., 2012): 60M params, ReLU + Dropout + augmentation, trained on 2 GPUs. Won ILSVRC-2012 by ~10 points of top-5 error and started the deep-learning era.
  • VGG (Simonyan & Zisserman, 2014): Only 3×3 convs, deep + simple. VGG-19: 143.7M params, mostly in FC layers. Top-1 74.2%.
  • Inception / GoogLeNet (Szegedy et al., 2014): Parallel branches at multiple kernel sizes with 1×1 bottlenecks placed BEFORE expensive 3×3/5×5. Inception-v3: 27.2M params, top-1 77.3%.
  • ResNet (He et al., 2015): $y = F(x) + x$ residual connection. Gradient $\partial y/\partial x = \partial F/\partial x + I$ ⇒ no vanishing. Enables 100+ layers. ResNet-50: 25.6M params, 76.1% top-1, the workhorse backbone.
  • DenseNet (Huang et al., 2016): $x_l = H_l([x_0, \dots, x_{l-1}])$, concatenation (not addition). Strong gradient flow, feature reuse.
  • SENet (Squeeze-Excitation) (Hu et al., 2017): Channel attention: GAP → 2-layer MLP → sigmoid scale → multiply. +1% top-1 at near-zero cost.
  • MobileNet (Howard et al., 2017): Depthwise-separable convs throughout. ~4.2M params, ~70% top-1. Designed for phones / edge.
  • EfficientNet (Tan & Le, 2019): Compound scaling: $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$, with $\alpha \beta^2 \gamma^2 \approx 2$. B0: 5.3M params, top-1 77.7%.
  • C3D / I3D / SlowFast: Video CNN family. C3D: 3D conv from scratch. I3D: inflate 2D pretrained weights along time. SlowFast: slow (semantics, low fps high channel) + fast (motion, high fps low channel) pathways.

Formulas

Derivations

**Two stacked 3×3 = one 5×5 RF, fewer params.** RF after one 3×3 = 3; after two stacked = $3 + (3 - 1) = 5$. Params: two 3×3 at $C$ channels = $2 \cdot 9C^2 = 18C^2$. One 5×5 = $25C^2$. Saving = $7/25$ (~28%) AND you get an extra non-linearity between the two. Same logic gives three 3×3 = 7×7 RF, $27C^2$ vs $49C^2$.

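Depthwise-separable param ratio. Standard $K^2 C_{in} C_{out}$; depthwise-separable $K^2 C_{in} + C_{in} C_{out}$. Ratio = $\dfrac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \dfrac{1}{C_{out}} + \dfrac{1}{K^2}$. For $K = 3$ and large $C_{out}$: $\approx 1/9$ → ~9× cheaper.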

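Residual gradient no longer vanishes. $\partial y/\partial x = \partial F/\partial x + I$. Even if $\partial F/\partial x$ is tiny, the identity term keeps the block Jacobian close to $I$, so the gradient passes through the skip with norm ≈ 1. Through $L$ stacked residual blocks the gradient through the skip path is exactly $I$, not $\prod_{l=1}^{L} \partial F_l/\partial x_l$. This is the formal reason 100-layer networks train at all.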
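A toy numeric illustration of the argument, under the simplifying assumption that each block's Jacobian $\partial F_l/\partial x_l$ is a scalar (real Jacobians are matrices). The depth of 50 and the local gradient of 0.01 are hypothetical values.

```python
from math import prod


def plain_gradient(local_grads):
    """Plain stack: the end-to-end gradient is the product of local gradients."""
    return prod(local_grads)


def residual_gradient(local_grads):
    """Residual stack: each block contributes (1 + dF/dx), the 1 being the skip."""
    return prod(1.0 + g for g in local_grads)


local = [0.01] * 50  # hypothetical: 50 blocks, each with a tiny local gradient
print(f"plain:    {plain_gradient(local):.3e}")
print(f"residual: {residual_gradient(local):.3f}")
```

Expanding the residual product gives a sum over paths; the term that takes the skip at every block is exactly 1, which is why the result cannot collapse the way the plain product does.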

Inflation in I3D. Take a 2D filter $W_{2D}$ pretrained on ImageNet. Inflate to 3D as $W_{3D}[t] = W_{2D}/N$ for all $t = 1, \dots, N$. Why divide by $N$? A 'boring video' (still image repeated $N$ times) should produce the SAME activation as the 2D filter on the image. Without the division, the 3D filter outputs $N\times$ the original.
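The 'boring video' check can be sketched directly. For brevity this evaluates the filter at a single position (patch the same size as the kernel); the helper names and the toy 2×2 values are assumptions for illustration only.

```python
def inflate(w2d, n):
    """Repeat a 2D kernel n times along time and rescale every copy by 1/n."""
    return [[[v / n for v in row] for row in w2d] for _ in range(n)]


def response_2d(patch, w2d):
    """Filter response at one position: elementwise product, summed."""
    return sum(w * p for w_row, p_row in zip(w2d, patch) for w, p in zip(w_row, p_row))


def response_3d(clip, w3d):
    """3D response at one position: sum of per-frame 2D responses."""
    return sum(response_2d(frame, w_t) for frame, w_t in zip(clip, w3d))


image = [[1.0, 2.0], [3.0, 4.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
n = 4
boring_clip = [image] * n  # the same still image repeated n times

print(response_2d(image, kernel))
print(response_3d(boring_clip, inflate(kernel, n)))  # matches the 2D response
print(response_3d(boring_clip, [kernel] * n))        # no rescaling: n times larger
```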

1×1 conv parameter count. $1 \cdot 1 \cdot C_{in} \cdot C_{out} = C_{in} C_{out}$ weights. A 3×3 at the same widths needs $9\,C_{in} C_{out}$, which is 9× more. That's the bottleneck saving.
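As a hedged worked example of the bottleneck saving, the sketch below compares a direct 3×3 at full width with a 1×1 → 3×3 → 1×1 bottleneck. The widths (256 outer, 64 inner) are illustrative choices and biases are ignored.

```python
def conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k convolution, ignoring bias."""
    return k * k * c_in * c_out


def bottleneck_weights(c: int, c_mid: int, k: int = 3) -> int:
    """1x1 reduce (c -> c_mid), k x k at the reduced width, 1x1 expand (c_mid -> c)."""
    return (
        conv_weights(1, c, c_mid)
        + conv_weights(k, c_mid, c_mid)
        + conv_weights(1, c_mid, c)
    )


direct = conv_weights(3, 256, 256)       # hypothetical full-width 3x3
squeezed = bottleneck_weights(256, 64)   # hypothetical bottleneck widths
print(direct, squeezed, round(direct / squeezed, 2))
```

The expensive 3×3 only ever sees the reduced width, so most of the cost moves into the two cheap 1×1 layers.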

Examples

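  • **Param count for a K×K conv with bias.** $(K^2 C_{in} + 1)\,C_{out}$: $K^2 C_{in}$ weights plus one bias per filter, times $C_{out}$ filters.
  • Output size. Input 224×224, conv 7×7, $S = 2$, $P = 3$: $\lfloor (224 + 6 - 7)/2 \rfloor + 1 = 112$. So 224 → 112, exactly the standard first-layer output of ResNet-50.
  • **ViT-B/16 patch count at 224×224.** $(224/16)^2 = 14^2 = 196$ patches; +1 CLS = 197 tokens. (CNN-adjacent but examined.)
  • **Receptive field of 5 stacked stride-1 3×3 convs.** $1 + 5 \cdot (3 - 1) = 11$ pixels. Add a stride-2 pool in the middle and every layer after it grows the RF twice as fast.
  • MobileNet param saving. Standard 3×3: $9\,C_{in} C_{out}$. Depthwise-separable: $9\,C_{in} + C_{in} C_{out}$. Saving ≈ 8 to 9× for large $C_{out}$. ✓
  • Two 3×3 vs one 5×5. Two 3×3 with $C_{in} = C_{out} = C$: $2 \cdot 9C^2 = 18C^2$. One 5×5: $25C^2$. Two-3×3 wins by 28% AND adds an extra ReLU.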

Diagrams

  • Receptive field growth as a function of layer depth — log-scale plot.
  • Inception module: 4 parallel branches with 1×1 bottlenecks before 3×3/5×5, concatenated output.
  • ResNet bottleneck block: 1×1 → 3×3 → 1×1 with skip-add to the input.
  • DenseNet dense block: each layer concatenates all previous layer outputs.
  • SE block: GAP → FC → ReLU → FC → sigmoid → multiply with input channels.
  • MobileNet depthwise-separable: depthwise (per-channel) + pointwise (1×1).
  • EfficientNet compound-scaling curve: top-1 accuracy vs FLOPs, EfficientNet on the Pareto front.
  • Equivariance vs invariance: shifted input + same shift in feature map (equivariant) vs same scalar output after pooling+FC (invariant).
  • I3D inflation: 2D filter replicated $N$ times along time, divided by $N$.

Edge cases

  • 'Dying ReLU'. Once a neuron's pre-activation is always negative, gradient is 0 forever → neuron stays dead. Fix: Leaky ReLU, ELU, GELU, or careful (Kaiming/He) initialisation.
  • BN at inference using batch stats is a deployment bug — use running mean/var.
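  • BN with batch size 1 is unreliable: the statistics come from a single sample (over $H \times W$ only, and the variance is zero for a 1×1 map), so they are noisy and do not match the running stats. Use GroupNorm or LayerNorm.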
  • **Dilated conv with high $d$ produces gridding artefacts**: neighbouring outputs sample disjoint input pixels. Mitigation: stack varying dilation rates (e.g. 1, 2, 5) rather than repeating one rate (e.g. 2, 2, 2), the HDC pattern.
  • CNN not rotation equivariant. Kernels are learned; nothing constrains them to be rotation-symmetric. Data augmentation or G-CNNs needed.
  • Very deep plain CNN (>20 layers) degrades. Training accuracy drops because gradient/optimisation fails — not overfitting. Residual connections fix.
  • Depthwise-separable convs are LESS expressive than standard convs at the same width — they trade expressivity for parameter efficiency. May need to widen to compensate.
  • Global Avg Pooling loses spatial structure — useful for classification, fatal for dense prediction (segmentation needs to keep spatial).

Common mistakes

  • **Conv params depend on $H, W$.** WRONG: they depend on $K, C_{in}, C_{out}$. Weight sharing across spatial locations.
  • Pooling has parameters. WRONG — max and avg pool have no learnable params.
  • 'BatchNorm normalises across all axes equally'. Wrong: in a CNN, BN normalises across $(N, H, W)$ per channel; $\gamma, \beta$ are PER CHANNEL (length $C$), not per spatial position.
  • Inception 1×1 convs are placed AFTER the expensive 3×3/5×5. Wrong — placed BEFORE to reduce channels first (bottleneck).
  • ResNet replaces all skip connections with identity. No: the skip is added to a residual function. $y = F(x) + x$, not $y = x$.
  • Depthwise-separable = depthwise + 3×3 conv. No — depthwise + POINTWISE (1×1).
  • 'CNNs are rotation equivariant'. No — only translation. Rotation needs data aug or G-CNNs.
  • Translation EQUIvariance and INvariance are the same. No — equivariance shifts the output; invariance doesn't. CNN feature maps are equivariant; the post-pool classification is invariant.
  • 'I3D trains a 3D CNN from scratch'. No — that's C3D. I3D INFLATES 2D pretrained weights.
  • MobileNet uses inverted residuals from v1. No — that's v2. v1 is straight depthwise-separable.

Shortcuts

  • Conv params: $(K^2 C_{in} + 1)\,C_{out}$, independent of $H, W$.
  • Output size: $\lfloor (H_{in} + 2P - K)/S \rfloor + 1$.
  • Same pad odd K: $P = (K-1)/2$.
  • Receptive field: $L$ stacked K×K at stride 1 → $1 + L(K-1)$.
  • **Two 3×3 < one 5×5:** fewer params + extra non-linearity.
  • 1×1 conv: bottleneck, mixer, cheap, plus a non-linearity.
  • Depthwise-separable: ~8 to 9× cheaper at $K = 3$.
  • ResNet skip: $y = F(x) + x$; gradient $\partial F/\partial x + I$ ⇒ no vanishing.
  • DenseNet: concatenate (not add).
  • EfficientNet: compound-scale $d, w, r$ together by one coefficient $\phi$.
  • Inflation (I3D): copy 2D filter $N$ times, divide by $N$.
  • Translation EQUIvariance (feature maps) ≠ INvariance (after pool + FC).

Proofs / Algorithms

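Translation equivariance of convolution. Define $(T_\delta x)[i] = x[i - \delta]$. For a convolution $f(x)[i] = \sum_k w[k]\, x[i - k]$: $f(T_\delta x)[i] = \sum_k w[k]\, x[i - \delta - k] = f(x)[i - \delta] = (T_\delta f(x))[i]$. Hence $f(T_\delta x) = T_\delta f(x)$, which is equivariance.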
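The identity can be checked numerically on a finite signal. To avoid boundary effects this sketch assumes circular (wrap-around) shifts and convolution, which is an assumption added here; the proof above is stated for unbounded signals.

```python
def shift(x, delta):
    """Circular shift: (T_delta x)[i] = x[(i - delta) mod n]."""
    n = len(x)
    return [x[(i - delta) % n] for i in range(n)]


def circ_conv(x, w):
    """Circular convolution: f(x)[i] = sum_k w[k] * x[(i - k) mod n]."""
    n = len(x)
    return [sum(w[k] * x[(i - k) % n] for k in range(len(w))) for i in range(n)]


x = [1, 2, 3, 4, 5, 6]
w = [1, -2, 1]
delta = 2

# Equivariance: convolving the shifted input equals shifting the convolved output.
print(circ_conv(shift(x, delta), w) == shift(circ_conv(x, w), delta))
# Invariance after global pooling: the sum over positions ignores the shift.
print(sum(circ_conv(shift(x, delta), w)) == sum(circ_conv(x, w)))
```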

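Residual gradient retains scale. $\partial y/\partial x = I + \partial F/\partial x$. Through $L$ stacked residuals, the contribution along the skip path is $I$ at every block, hence the product of identities along the skip path is still $I$. Therefore even if every $\partial F_l/\partial x_l$ is tiny, the total Jacobian stays close to $I$ and the gradient norm stays $\approx 1$ instead of decaying to 0.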

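Depthwise-separable param savings. Standard $K^2 C_{in} C_{out}$; depthwise-separable $K^2 C_{in} + C_{in} C_{out}$. Ratio $= 1/C_{out} + 1/K^2$. At $K = 3$, $C_{out} \to \infty$: ratio $\to 1/9$, i.e., 9× cheaper in the limit.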

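Compound scaling balances depth, width, resolution. For an isolated CNN with depth $d$ (layers), width $w$, resolution $r$: FLOPs $\propto d \cdot w^2 \cdot r^2$. Compound scaling with $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ gives FLOPs $\propto (\alpha \beta^2 \gamma^2)^\phi$. EfficientNet fixes $\alpha \beta^2 \gamma^2 \approx 2$ so increasing $\phi$ by 1 roughly doubles FLOPs.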
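A small sketch of the FLOPs bookkeeping. The constants $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are not given in these notes; they are the grid-search values commonly quoted from the EfficientNet paper and should be treated as an assumption here.

```python
# Assumed constants (commonly quoted for EfficientNet; not stated in these notes).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi: float) -> float:
    """Relative FLOPs, using FLOPs proportional to d * w^2 * r^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    print(phi, round(flops_factor(phi), 2), 2 ** phi)
```

Because $\alpha \beta^2 \gamma^2$ is only approximately 2, the printed factor tracks $2^\phi$ closely but not exactly.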
