CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field
Intuition
A convolution layer is a stack of learned filters slid across an image — a CNN is a hierarchy of these, with each layer composing simpler features (edges → texture → parts → objects). The reason CNNs work in vision: WEIGHT SHARING (one filter, all spatial locations) gives translation equivariance (a shifted input produces a shifted output) and slashes parameters by ~1000× vs a dense network. The decade of CNN architecture research (2012–2020) was about how to go deeper, wider, or more efficient without breaking trainability.
Explanation
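Convolution layer parameters. A conv layer with $C_{out}$ filters, kernel $K \times K$, input channels $C_{in}$: parameters = $(K \cdot K \cdot C_{in} + 1) \cdot C_{out}$ (the +1 is bias). **Independent of $H \times W$**: the same filter is applied at every position. Compare to a fully-connected layer between $H W C_{in}$ inputs and $H W C_{out}$ outputs: $(H W C_{in}) \cdot (H W C_{out})$ parameters, typically 1000× more. This is the *parameter efficiency* of CNNs.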
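As a quick check of the counts above, the sketch below compares a conv layer with a dense layer over the same tensor shapes. The function names and the example sizes (a 3×3 conv from 3 to 64 channels on a 224×224 input) are illustrative assumptions, not values taken from these notes.

```python
def conv_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameters of a k x k conv layer: (k*k*c_in + 1) * c_out with bias."""
    return (k * k * c_in + (1 if bias else 0)) * c_out


def dense_params(h: int, w: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameters of a fully-connected layer mapping an h*w*c_in input
    to an h*w*c_out output (the same tensor shapes the conv handles)."""
    n_in, n_out = h * w * c_in, h * w * c_out
    return (n_in + (1 if bias else 0)) * n_out


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    conv = conv_params(k=3, c_in=3, c_out=64)
    dense = dense_params(h=224, w=224, c_in=3, c_out=64)
    print(f"conv:  {conv:,} parameters (no dependence on H or W)")
    print(f"dense: {dense:,} parameters ({dense / conv:.1e}x more)")
```

Note that `conv_params` takes no height or width argument at all, which is the weight-sharing point in code form.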
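Output spatial size. $O = \lfloor (I - K + 2P)/S \rfloor + 1$. Same padding with odd $K$ uses $P = (K-1)/2$ → output size equals input (at stride 1). Valid padding ($P = 0$) shrinks by $K - 1$. Two stacked $3 \times 3$ convs (stride 1) have the same RECEPTIVE FIELD as one $5 \times 5$ conv but use FEWER parameters ($18C^2$ vs $25C^2$) AND add an extra non-linearity. Three $3 \times 3$ = $7 \times 7$ RF, $27C^2$ vs $49C^2$.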
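The output-size rule is easy to get off by one, so a small helper can be useful. This is a minimal sketch; the names `conv_output_size` and `same_padding` are hypothetical, and it assumes the input (plus padding) is at least as large as the kernel.

```python
def conv_output_size(i: int, k: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size: floor((i - k + 2p) / s) + 1."""
    return (i - k + 2 * p) // s + 1


def same_padding(k: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd k only)."""
    if k % 2 == 0:
        raise ValueError("same padding as (k - 1) / 2 needs an odd kernel size")
    return (k - 1) // 2


if __name__ == "__main__":
    print(conv_output_size(224, k=7, p=3, s=2))            # 7x7, stride 2, pad 3
    print(conv_output_size(32, k=3, p=same_padding(3)))    # same padding
    print(conv_output_size(32, k=5, p=0))                  # valid padding
```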
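Receptive field. Set of input pixels that influence a given output. For $L$ stacked $K \times K$ convs at stride 1: RF = $1 + L(K - 1)$. Every stride-2 step doubles the RF growth contributed by each later layer (the RF itself does not instantly double). Architectures balance RF (must be large enough to see whole objects), spatial resolution (must stay fine for dense prediction), and compute.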
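With strides in the mix, the receptive field is easiest to track with the standard recurrence: each layer adds (kernel − 1) times the product of all earlier strides. The sketch below assumes a simple sequential stack described as (kernel, stride) pairs; the function name and the example stacks are illustrative.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: iterable of (kernel_size, stride) pairs, in forward order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


if __name__ == "__main__":
    five_convs = [(3, 1)] * 5
    # Same five convs with a 2x2, stride-2 pool after the second one.
    with_pool = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (3, 1)]
    print(receptive_field(five_convs))
    print(receptive_field(with_pool))
```

After the pool, every later 3×3 conv adds 4 input pixels instead of 2, which is why downsampling early makes the receptive field grow so much faster.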
1×1 convolution: four uses. (1) Channel reduction (bottleneck): a cheap way to drop $C$ before an expensive $3 \times 3$ or $5 \times 5$. (2) Non-linearity injection: a per-pixel MLP between conv layers. (3) Cross-channel mixing without spatial mixing. (4) Cheap: only $C_{in} \cdot C_{out}$ params. Used in: Inception (before expensive convs), ResNet bottleneck, MobileNet pointwise, SENet, feature-pyramid laterals.
Pooling. Max-pool keeps the maximum of each window → ROBUST to small translations; dominant in classification. Avg-pool takes an equal-weight average; Global Average Pooling (GAP) before the classifier replaces giant FC layers (ResNet, EfficientNet). Pooling has no parameters: it shrinks the feature map (and so downstream FLOPs) without adding weights. Backward: max-pool routes the gradient to the argmax only; avg-pool distributes it equally ($1/k^2$ to each input of a $k \times k$ window).
Dilated (atrous) convolution. Insert zeros between kernel taps. Effective kernel size = $K + (K - 1)(d - 1)$ for dilation rate $d$. A $3 \times 3$ with dilation 2 has a $5 \times 5$ RF using 9 parameters. Use: semantic segmentation (DeepLab: keep resolution while enlarging RF), WaveNet (1D causal). Drawback: gridding artefact at large dilation rates.
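The effective-kernel formula can be checked by listing where the taps actually land. A minimal sketch in one dimension; the helper names are hypothetical.

```python
def effective_kernel(k: int, d: int) -> int:
    """Span covered by a k-tap kernel at dilation d: k + (k - 1)(d - 1)."""
    return k + (k - 1) * (d - 1)


def tap_offsets(k: int, d: int) -> list:
    """Input offsets touched by the k taps; the weight count stays at k."""
    return [i * d for i in range(k)]


if __name__ == "__main__":
    for d in (1, 2, 4):
        taps = tap_offsets(3, d)
        print(f"d={d}: taps at {taps}, span {effective_kernel(3, d)}")
```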
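Standard vs depthwise-separable convolution. Standard $K \times K$ conv: $K^2 C_{in} C_{out}$ params. Depthwise-separable = depthwise (one $K \times K$ filter per input channel) + pointwise ($1 \times 1$). Params = $K^2 C_{in} + C_{in} C_{out}$. Ratio ≈ $1/C_{out} + 1/K^2$, about 8–9× cheaper for $K = 3$. Used in MobileNet, Xception.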
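To see the ratio numerically, the sketch below counts weights (biases ignored) for both layer types. The channel widths in the demo are assumed values picked for illustration only.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k conv (no bias)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise (k*k per input channel) plus pointwise (1x1) weights."""
    return k * k * c_in + c_in * c_out


def param_ratio(k: int, c_in: int, c_out: int) -> float:
    """Separable / standard; equals 1/c_out + 1/k**2."""
    return separable_conv_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)


if __name__ == "__main__":
    k, c_in, c_out = 3, 128, 256  # hypothetical layer widths
    ratio = param_ratio(k, c_in, c_out)
    print(f"standard:  {standard_conv_params(k, c_in, c_out):,}")
    print(f"separable: {separable_conv_params(k, c_in, c_out):,}")
    print(f"ratio {ratio:.4f}, i.e. about {1 / ratio:.1f}x cheaper")
```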
BatchNorm in CNNs. Normalise per channel across the $(N, H, W)$ axes; learnable $\gamma, \beta$ per channel (length $C$). At inference: use running mean/var, NOT batch stats. Effect: faster training, higher LR, less init-sensitive, slight regularisation. Variants: LayerNorm (across $(C, H, W)$ per sample in a CNN; applied over the embedding dimension per token in ViT), GroupNorm (groups of channels; works at batch=1), InstanceNorm (per-sample per-channel; style transfer).
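The "per channel across the batch and spatial axes" rule is the part people most often get wrong, so here is a deliberately naive training-mode sketch on nested lists. It is not how any framework implements BatchNorm; the function name, the `eps` default and the returned `stats` list are assumptions made for illustration, and running statistics for inference are left out.

```python
from math import sqrt


def batchnorm2d(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm on an N x C x H x W nested list.

    Mean and (biased) variance are computed per channel over the
    batch and spatial axes; gamma and beta each have length C.
    Returns the normalised tensor and the per-channel (mean, var).
    """
    n, c = len(x), len(x[0])
    out = [[None] * c for _ in range(n)]
    stats = []
    for ch in range(c):
        # Pool every value of this channel across samples, rows and columns.
        values = [v for sample in x for row in sample[ch] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, var))
        scale = gamma[ch] / sqrt(var + eps)
        for i in range(n):
            out[i][ch] = [[(v - mean) * scale + beta[ch] for v in row]
                          for row in x[i][ch]]
    return out, stats


if __name__ == "__main__":
    # N=2, C=2, H=1, W=2 toy batch.
    x = [[[[1.0, 2.0]], [[10.0, 10.0]]],
         [[[3.0, 4.0]], [[20.0, 20.0]]]]
    y, stats = batchnorm2d(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
    print(stats)
    print(y)
```

At inference the same affine transform would be applied with stored running estimates in place of `mean` and `var`.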
Equivariance vs invariance. Convolutions are TRANSLATION EQUIVARIANT: $f(T_\delta x) = T_\delta f(x)$, i.e. shift the input and the output shifts the same way. After global pooling + FC, the network becomes translation INVARIANT: $g(T_\delta x) = g(x)$. CNNs are NOT rotation equivariant: kernels are learned, not symmetric. Fix: data augmentation, or G-CNNs (group-equivariant CNNs).
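Both properties can be checked on a toy 1D signal. The sketch below assumes circular (wrap-around) boundaries so that equivariance holds exactly; with zero padding it only holds away from the borders. The signal, kernel and function names are made up for the demonstration.

```python
def circ_conv(x, w):
    """1D circular convolution: y[i] = sum_k w[k] * x[(i - k) mod n]."""
    n = len(x)
    return [sum(w[k] * x[(i - k) % n] for k in range(len(w))) for i in range(n)]


def shift(x, delta):
    """Circular translation: (T_delta x)[i] = x[(i - delta) mod n]."""
    n = len(x)
    return [x[(i - delta) % n] for i in range(n)]


if __name__ == "__main__":
    x = [1, 4, 2, 0, 3, 5]   # toy signal
    w = [1, -2, 1]           # toy kernel
    delta = 2

    # Equivariance: convolving the shifted input equals shifting the output.
    assert circ_conv(shift(x, delta), w) == shift(circ_conv(x, w), delta)

    # Invariance appears only after global pooling over positions.
    assert max(circ_conv(shift(x, delta), w)) == max(circ_conv(x, w))
    print("equivariant feature map, invariant pooled output")
```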
LeNet-5 (LeCun, 1989/1998). First CNN to work at scale in practice: handwritten digit recognition. Two conv-pool blocks then FC. ~60 k parameters. Trained on USPS / MNIST. The architecture template later CNNs build on: conv → pool → conv → pool → flatten → FC → softmax.
AlexNet (Krizhevsky, Sutskever, Hinton, 2012). Won ILSVRC-2012 by a huge margin (top-5 error 16% vs 26% for the next best) and started the deep-learning revolution. 60 M params, 8 layers (5 conv + 3 FC), trained on 2 GPUs. Three innovations: ReLU (trains much faster than tanh/sigmoid), dropout (in the FC layers, $p = 0.5$), data augmentation (random crops + horizontal flips + colour jitter). Top-1 = 56.5%.
VGG (Simonyan & Zisserman, 2014). Stack lots of $3 \times 3$ convs. VGG-16: 138 M params. VGG-19: 143.7 M params, 19.6 GFLOPs, top-1 74.2%. Heavy: most parameters live in the FC layers. Lesson: depth + tiny filters > shallow + huge filters.
Inception / GoogLeNet (Szegedy et al., 2014). Inception module = parallel branches at multiple kernel sizes ($1 \times 1$, $3 \times 3$, $5 \times 5$, max-pool), concatenated. **$1 \times 1$ bottleneck** placed BEFORE the $3 \times 3$ / $5 \times 5$ to cut channels, which slashes parameters. Inception-v3: 27.2 M params, 5.71 GFLOPs, top-1 77.3%.
ResNet (He et al., 2015): residual connections. Beyond ~20 layers, plain CNNs DEGRADE (training loss goes up: not just overfitting, but optimisation failure). Fix: $y = x + F(x)$. The skip connection lets the network learn the IDENTITY by default (set $F(x) = 0$) and only learn corrections. **Gradient: $\partial y / \partial x = I + \partial F / \partial x$**: the '+I' term keeps the gradient from vanishing through the skip. Bottleneck block: $1 \times 1 \to 3 \times 3 \to 1 \times 1$ with skip. ResNet-50: 25.6 M params, 4.09 GFLOPs, top-1 76.1%. The workhorse backbone.
DenseNet (Huang et al., 2016). $x_l = H_l([x_0, x_1, \dots, x_{l-1}])$: each layer takes ALL previous layer outputs as input via CONCATENATION (not addition like ResNet). Strong gradient flow, feature reuse. DenseNet-161: 28.7 M params, 7.73 GFLOPs, top-1 77.1%.
SENet (Squeeze-Excitation, Hu et al., 2017). Channel attention module added to existing backbones: GAP across spatial → 2-layer MLP → sigmoid scale → multiply with original channels. Recalibrates which channels matter. +1% top-1 on ResNet, almost free in compute.
MobileNet (Howard et al., 2017). Depthwise-separable convs throughout → fewer params/FLOPs than standard convs. MobileNet-v1: 4.2 M params, ~70% top-1. v2 adds inverted residuals and linear bottlenecks. Designed for phones / edge devices.
EfficientNet (Tan & Le, 2019). Compound scaling: scale depth $d$, width $w$, and input resolution $r$ together by a single coefficient $\phi$, with $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ where $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (so FLOPs scale as $2^\phi$). EfficientNet-B0: 5.3 M params, 0.39 GFLOPs, top-1 77.7%, the best efficiency on the chart for years.
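The FLOPs claim follows from FLOPs ∝ depth · width² · resolution². The sketch below plugs in the base coefficients reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15), which are not stated in these notes and are treated here as an assumption; the function names are hypothetical.

```python
# Base coefficients as reported in the EfficientNet paper (an assumption here).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi: float):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def relative_flops(phi: float) -> float:
    """FLOPs relative to phi = 0, using FLOPs ~ depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{relative_flops(phi):.2f}")
```

With these coefficients the per-step factor is about 1.92 rather than exactly 2, which is why the constraint is written with ≈.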
Backprop through conv layer. Conv backward = conv with FLIPPED kernel for $\partial L / \partial x$; cross-correlation with input for $\partial L / \partial w$. MaxPool backward routes to argmax (others = 0). AvgPool backward distributes equally ($1/k^2$ each for a $k \times k$ window). ReLU backward is a gate (pass if forward $> 0$, else 0). Skip backward through $y = x + F(x)$ adds $I$ → no vanishing.
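The pooling and ReLU backward rules are simple enough to write out by hand. This is a 1D sketch with non-overlapping windows of size `k` (so the average-pool share is 1/k rather than 1/k²); the function names are hypothetical and ties in max-pool go to the first maximum.

```python
def maxpool1d_backward(x, grad_out, k):
    """Route each upstream gradient to the argmax of its window."""
    grad_in = [0.0] * len(x)
    for j, g in enumerate(grad_out):
        window = x[j * k:(j + 1) * k]
        arg = j * k + window.index(max(window))  # first max wins on ties
        grad_in[arg] += g
    return grad_in


def avgpool1d_backward(n, grad_out, k):
    """Spread each upstream gradient equally over its k inputs."""
    grad_in = [0.0] * n
    for j, g in enumerate(grad_out):
        for i in range(j * k, (j + 1) * k):
            grad_in[i] += g / k
    return grad_in


def relu_backward(x, grad_out):
    """Gate: pass the gradient where the forward input was positive."""
    return [g if v > 0 else 0.0 for v, g in zip(x, grad_out)]


if __name__ == "__main__":
    x = [1.0, 3.0, 2.0, 0.5]
    upstream = [10.0, 20.0]
    print(maxpool1d_backward(x, upstream, k=2))
    print(avgpool1d_backward(len(x), upstream, k=2))
    print(relu_backward([-1.0, 2.0], [3.0, 4.0]))
```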
CNN for video. C3D uses 3D convs over (time, height, width): expensive (a kernel spanning $T$ frames costs $T\times$ its 2D counterpart) and trained from scratch. I3D inflates 2D ImageNet kernels along time and divides by $T$, which borrows ImageNet pretraining for free. SlowFast has slow (low-fps, high-channel = semantics) and fast (high-fps, low-channel = motion) pathways with lateral connections. For long videos: sliding window + aggregation.
CNN for audio. 1D CNN on raw waveform — WaveNet uses DILATED 1D convs for long context. 2D CNN on spectrogram (STFT or mel) — treat audio as image; works because frequencies have spatial structure.
Definitions
- Convolution layer — Stack of learnable filters slid across input. Weight-shared across space → translation equivariance. Params $(K^2 C_{in} + 1) \cdot C_{out}$, independent of $H \times W$.
- Receptive field — Set of input pixels that influence one output unit. Grows with depth (and stride). $L$ stacked $K \times K$ at stride 1 → $1 + L(K - 1)$.
- Pooling (max / avg / GAP) — Downsample with NO parameters. Max routes argmax (translation-robust). Avg distributes equally. GAP collapses $H \times W \to 1 \times 1$, replaces giant FC.
- Same / Valid padding — Same: $P = (K - 1)/2$ (odd $K$) → output size equals input. Valid: $P = 0$ → output shrinks by $K - 1$.
- 1×1 convolution — Per-pixel MLP across channels. Four uses: channel reduction (bottleneck), non-linearity injection, cross-channel mixing, cheap. Used in Inception, ResNet, MobileNet, SENet.
- Dilated (atrous) convolution — Inserts zeros between kernel taps → effective kernel $K + (K - 1)(d - 1)$. Enlarges RF without extra parameters or stride. Used in DeepLab, WaveNet. Gridding artefact at large $d$.
- Depthwise-separable convolution — Depthwise ($K \times K$ per channel) + pointwise ($1 \times 1$). 8–9× cheaper than standard conv at $K = 3$. Foundation of MobileNet, Xception.
- Batch Normalisation (CNN) — Normalise per channel across $(N, H, W)$; learnable $\gamma, \beta$ per channel. Inference uses running mean/var (not batch). Faster training, less init-sensitive.
- Translation equivariance vs invariance — $f(T_\delta x) = T_\delta f(x)$ vs $f(T_\delta x) = f(x)$. Conv layers are equivariant; after global pool + FC, network is invariant. CNNs are NOT rotation equivariant.
- LeNet — (LeCun, 1989/1998) First successful CNN. ~60k params. Two conv-pool blocks then FC. Template every later CNN follows.
- AlexNet — (Krizhevsky et al., 2012) 60M params, ReLU + Dropout + augmentation, trained on 2 GPUs. Won ILSVRC-2012 by about 10 points of top-5 error; started the deep-learning era.
- VGG — (Simonyan & Zisserman, 2014) Only $3 \times 3$ convs, deep + simple. VGG-19: 143.7M params, mostly in FC layers. Top-1 74.2%.
- Inception / GoogLeNet — (Szegedy et al., 2014) Parallel branches at multiple kernel sizes with 1×1 bottlenecks placed BEFORE expensive 3×3/5×5. Inception-v3: 27.2M params, top-1 77.3%.
- ResNet — (He et al., 2015) $y = x + F(x)$ residual connection. Gradient $I + \partial F / \partial x$ ⇒ no vanishing. Enables 100+ layers. ResNet-50: 25.6M params, 76.1% top-1; the workhorse backbone.
- DenseNet — (Huang et al., 2016) $x_l = H_l([x_0, \dots, x_{l-1}])$: concatenation (not addition). Strong gradient flow, feature reuse.
- SENet (Squeeze-Excitation) — (Hu et al., 2017) Channel attention: GAP → 2-layer MLP → sigmoid scale → multiply. +1% top-1 at near-zero cost.
- MobileNet — (Howard et al., 2017) Depthwise-separable convs throughout. ~4.2M params, ~70% top-1. Designed for phones / edge.
- EfficientNet — (Tan & Le, 2019) Compound scaling: $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$, with $\alpha \beta^2 \gamma^2 \approx 2$. B0: 5.3M params, top-1 77.7%.
- C3D / I3D / SlowFast — Video CNN family. C3D: 3D conv from scratch. I3D: inflate 2D pretrained weights along time. SlowFast: slow (semantics, low fps high channel) + fast (motion, high fps low channel) pathways.
Formulas
Derivations
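**Two stacked $3 \times 3$ = one $5 \times 5$ RF, fewer params.** RF after one $3 \times 3$ = 3; after two stacked = $3 + 2 = 5$. Params: two $3 \times 3$ at $C$ channels = $2 \cdot 9C^2 = 18C^2$. One $5 \times 5$ = $25C^2$. Saving = $7/25$ (~28%) AND you get an extra non-linearity between the two. Same logic gives three $3 \times 3$ = $7 \times 7$ RF, $27C^2$ vs $49C^2$.
Depthwise-separable param ratio. Standard $K^2 C_{in} C_{out}$; depthwise-separable $K^2 C_{in} + C_{in} C_{out}$. Ratio = $\frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$. For $K = 3$ and large $C_{out}$: ratio $\approx 1/9$ → ~$9\times$ cheaper.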
Residual gradient no longer vanishes. $\partial y / \partial x = I + \partial F / \partial x$. Even if $\partial F / \partial x$ is tiny, the identity term passes the upstream gradient through the skip unattenuated. Through $L$ stacked residual blocks the gradient along the pure skip path is exactly $I$, not $\prod_l \partial F_l / \partial x_l$. This is the formal reason 100-layer networks train at all.
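A scalar toy model makes the contrast concrete: treat each block's local derivative as a single number and multiply along the chain. Real Jacobians are matrices, so this is only a sketch of the scaling behaviour; the depth of 50 and the local derivative of 0.01 are assumed values.

```python
def plain_chain_gradient(local_grads):
    """End-to-end derivative of a plain stack: product of F_l'."""
    g = 1.0
    for j in local_grads:
        g *= j
    return g


def residual_chain_gradient(local_grads):
    """End-to-end derivative of a residual stack: product of (1 + F_l')."""
    g = 1.0
    for j in local_grads:
        g *= 1.0 + j
    return g


if __name__ == "__main__":
    local = [0.01] * 50  # hypothetical: 50 blocks, each with a tiny derivative
    print(f"plain:    {plain_chain_gradient(local):.3e}")
    print(f"residual: {residual_chain_gradient(local):.3f}")
```

The plain product collapses towards zero, while the residual product stays of order one because every factor contains the identity term.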
Inflation in I3D. Take a 2D filter $W_{2D}$ pretrained on ImageNet. Inflate to 3D over $T$ frames as $W_{3D}[t] = W_{2D} / T$ for all $t = 1, \dots, T$. Why divide by $T$? A 'boring video' (a still image repeated $T$ times) should produce the SAME activation as the 2D filter on the image. Without the division, the 3D filter outputs $T$× the original.
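The 'boring video' argument can be verified directly on a tiny filter. The sketch below works on nested lists and a single spatial position; the helper names and the toy filter values are assumptions for illustration, not I3D code.

```python
def inflate_2d_filter(w2d, t):
    """Repeat a 2D filter t times along time and divide every weight by t."""
    return [[[v / t for v in row] for row in w2d] for _ in range(t)]


def response_2d(w2d, patch):
    """Filter response at one position: elementwise product, summed."""
    return sum(w * p for w_row, p_row in zip(w2d, patch)
               for w, p in zip(w_row, p_row))


def response_3d(w3d, clip):
    """3D response: sum of per-frame 2D responses over the temporal extent."""
    return sum(response_2d(w_frame, frame) for w_frame, frame in zip(w3d, clip))


if __name__ == "__main__":
    w2d = [[1.0, 2.0], [3.0, 4.0]]      # toy "pretrained" filter
    patch = [[0.5, 1.0], [1.5, 2.0]]    # toy image patch
    t = 4
    boring_clip = [patch] * t           # the same frame repeated t times
    print(response_2d(w2d, patch))
    print(response_3d(inflate_2d_filter(w2d, t), boring_clip))
```

Dropping the division in `inflate_2d_filter` would make the second number `t` times the first.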
1×1 conv parameter count. $C_{in} \cdot C_{out}$ weights (plus $C_{out}$ biases). A $3 \times 3$ at the same widths needs $9 \cdot C_{in} \cdot C_{out}$, which is 9× more. That's the bottleneck saving.
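A 1×1 conv is just one small matrix applied independently at every pixel, which the following sketch spells out on nested lists. The function name, the toy input and the weight values are assumptions for illustration.

```python
def conv1x1(x, weight, bias=None):
    """Apply a 1x1 convolution.

    x:      C_in x H x W nested list
    weight: C_out x C_in matrix (one row per output channel)
    bias:   optional list of length C_out
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    out = []
    for o, row in enumerate(weight):
        b = bias[o] if bias is not None else 0.0
        # Each output pixel mixes the C_in values at the same location only.
        out.append([[b + sum(row[c] * x[c][i][j] for c in range(c_in))
                     for j in range(w)] for i in range(h)])
    return out


if __name__ == "__main__":
    x = [[[1.0, 2.0]], [[3.0, 4.0]]]          # C_in=2, H=1, W=2
    weight = [[1.0, 1.0]]                     # C_out=1: sum the two channels
    print(conv1x1(x, weight))
    print("weights:", len(weight) * len(weight[0]))  # C_in * C_out
```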
Examples
- **Param count for a $K \times K$ conv with bias.** $(K \cdot K \cdot C_{in} + 1) \cdot C_{out}$.
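- Output size. Input $224 \times 224$, conv $7 \times 7$, $S = 2$, $P = 3$: $\lfloor (224 - 7 + 6)/2 \rfloor + 1 = 112$. So $112 \times 112$, exactly the standard first-layer output of ResNet-50.
- **ViT-B/16 patch count at $224 \times 224$.** $(224/16)^2 = 14^2 = 196$ patches; +1 CLS = 197 tokens. (CNN-adjacent but examined.)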
- **Receptive field of 5 stacked stride-1 $3 \times 3$ convs.** $1 + 5 \cdot 2 = 11$ pixels. Add a stride-2 pool in the middle and every later conv adds 4 input pixels instead of 2, so the RF grows twice as fast from that point on.
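- MobileNet param saving. Standard $3 \times 3$: $9 \cdot C_{in} \cdot C_{out}$. Depthwise-separable: $9 \cdot C_{in} + C_{in} \cdot C_{out}$. Saving $\approx 8$–$9\times$ for large $C_{out}$. ✓
- Two 3×3 vs one 5×5. Two $3 \times 3$ with $C$ channels in and out: $2 \cdot 9C^2 = 18C^2$. One $5 \times 5$: $25C^2$. Two-3×3 wins by 28% AND adds an extra ReLU.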
Diagrams
- Receptive field growth as a function of layer depth — log-scale plot.
- Inception module: 4 parallel branches with 1×1 bottlenecks before 3×3/5×5, concatenated output.
- ResNet bottleneck block: $1 \times 1 \to 3 \times 3 \to 1 \times 1$ with skip-add to the input.
- DenseNet dense block: each layer concatenates all previous layer outputs.
- SE block: GAP → FC → ReLU → FC → sigmoid → multiply with input channels.
- MobileNet depthwise-separable: depthwise (per-channel) + pointwise (1×1).
- EfficientNet compound-scaling curve: top-1 accuracy vs FLOPs, EfficientNet on the Pareto front.
- Equivariance vs invariance: shifted input + same shift in feature map (equivariant) vs same scalar output after pooling+FC (invariant).
- I3D inflation: 2D filter replicated $T$ times along time, divided by $T$.
Edge cases
- 'Dying ReLU'. Once a neuron's pre-activation is always negative, its gradient is 0 forever → the neuron stays dead. Fix: Leaky ReLU, ELU, GELU, or Kaiming (He) initialisation with a lower learning rate.
- BN at inference using batch stats is a deployment bug — use running mean/var.
- BN with batch size 1 is unreliable: the statistics come from a single sample (and on a $1 \times 1$ feature map the variance is exactly zero). Use GroupNorm or LayerNorm.
- **Dilated conv with high $d$ produces gridding artefacts**: neighbouring outputs sample disjoint input pixels. Mitigation: stack dilations $(1, 2, 5)$ rather than $(2, 2, 2)$ (HDC pattern).
- CNN not rotation equivariant. Kernels are learned; nothing constrains them to be rotation-symmetric. Data augmentation or G-CNNs needed.
- Very deep plain CNN (>20 layers) degrades. Training accuracy drops because gradient/optimisation fails — not overfitting. Residual connections fix.
- Depthwise-separable convs are LESS expressive than standard convs at the same width — they trade expressivity for parameter efficiency. May need to widen to compensate.
- Global Avg Pooling loses spatial structure — useful for classification, fatal for dense prediction (segmentation needs to keep spatial).
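The gridding edge case above can be made visible by listing which input positions a single output can see through a stack of dilated convs. This is a 1D sketch with centred 3-tap kernels; the two dilation schedules, (2, 2, 2) and (1, 2, 5), are example rates chosen for illustration, and the function name is hypothetical.

```python
def reachable_offsets(dilations, k=3):
    """Input offsets (relative to one output position) seen through a
    stack of 1D k-tap dilated convs with centred taps."""
    offsets = {0}
    half = k // 2
    for d in dilations:
        offsets = {o + t * d for o in offsets for t in range(-half, half + 1)}
    return sorted(offsets)


def has_holes(offsets):
    """True if some position inside the receptive field is never sampled."""
    return len(offsets) < offsets[-1] - offsets[0] + 1


if __name__ == "__main__":
    for schedule in [(2, 2, 2), (1, 2, 5)]:
        seen = reachable_offsets(schedule)
        print(schedule, "->", seen, "| holes:", has_holes(seen))
```

A repeated rate only ever reaches offsets that are multiples of that rate, whereas a mixed schedule fills in the gaps while still covering a wide span.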
Common mistakes
- **Conv params depend on $H \times W$.** WRONG: they depend on $K$, $C_{in}$, $C_{out}$. Weight sharing across spatial locations.
- Pooling has parameters. WRONG — max and avg pool have no learnable params.
- 'BatchNorm normalises across all axes equally'. Wrong: in a CNN, BN normalises across $(N, H, W)$ per channel; $\gamma, \beta$ are PER CHANNEL (length $C$), not per spatial position.
- Inception 1×1 convs are placed AFTER the expensive 3×3/5×5. Wrong — placed BEFORE to reduce channels first (bottleneck).
- ResNet replaces all skip connections with identity. No: the skip is added to a residual function. $y = x + F(x)$, not $y = x$.
- Depthwise-separable = depthwise + 3×3 conv. No — depthwise + POINTWISE (1×1).
- 'CNNs are rotation equivariant'. No — only translation. Rotation needs data aug or G-CNNs.
- Translation EQUIvariance and INvariance are the same. No — equivariance shifts the output; invariance doesn't. CNN feature maps are equivariant; the post-pool classification is invariant.
- 'I3D trains a 3D CNN from scratch'. No — that's C3D. I3D INFLATES 2D pretrained weights.
- MobileNet uses inverted residuals from v1. No — that's v2. v1 is straight depthwise-separable.
Shortcuts
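- Conv params: $(K^2 C_{in} + 1) \cdot C_{out}$, independent of $H \times W$.
- Output size: $\lfloor (I - K + 2P)/S \rfloor + 1$.
- Same pad odd K: $P = (K - 1)/2$.
- Receptive field: $L$ stacked $K \times K$ at stride 1 → $1 + L(K - 1)$.
- **Two $3 \times 3$ < one $5 \times 5$:** fewer params + extra non-linearity.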
- 1×1 conv: bottleneck, mixer, cheap, plus a non-linearity.
- Depthwise-separable: ~8–9× cheaper at $K = 3$.
- ResNet skip: $y = x + F(x)$; gradient $I + \partial F / \partial x$ ⇒ no vanishing.
- DenseNet: concatenate (not add).
- EfficientNet: compound-scale $d$, $w$, $r$ together by one coefficient $\phi$.
- Inflation (I3D): copy 2D filter $T$ times, divide by $T$.
- Translation EQUIvariance (feature maps) ≠ INvariance (after pool + FC).
Proofs / Algorithms
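Translation equivariance of convolution. Define $(T_\delta x)[i] = x[i - \delta]$. For a convolution $(w * x)[i] = \sum_k w[k]\, x[i - k]$: $(w * T_\delta x)[i] = \sum_k w[k]\, x[i - k - \delta] = (w * x)[i - \delta] = (T_\delta (w * x))[i]$. Hence $w * (T_\delta x) = T_\delta (w * x)$, which is equivariance.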
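Residual gradient retains scale. $\partial y_l / \partial x_l = I + \partial F_l / \partial x_l$. Through $L$ stacked residuals, the contribution along the skip path is $I$ at every block, hence the product of identities along the skip path is still $I$. Therefore even if every $\partial F_l / \partial x_l$ is tiny, the end-to-end Jacobian stays close to $I$ and the gradient reaches the early layers with roughly its upstream norm.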
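Depthwise-separable param savings. Standard $K^2 C_{in} C_{out}$; depthwise-separable $K^2 C_{in} + C_{in} C_{out}$. Ratio $= 1/C_{out} + 1/K^2$. At $K = 3$: ratio $\to 1/9$ as $C_{out} \to \infty$, i.e., $9\times$ cheaper in the limit.
Compound scaling balances depth, width, resolution. For an isolated CNN with $d$ layers, width $w$, resolution $r$: FLOPs $\propto d \cdot w^2 \cdot r^2$. Compound scaling with $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ gives FLOPs $\propto (\alpha \beta^2 \gamma^2)^\phi$. EfficientNet fixes $\alpha \beta^2 \gamma^2 \approx 2$ so each unit increase in $\phi$ roughly doubles FLOPs.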