Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

CNN Concepts + Architectures — LeNet to EfficientNet, BN, 1×1, Dilation, Receptive Field

Unit 4 — Convolutional Neural Networks (CNNs)

Why CNNs Win

Take an $H \times W$ RGB image and a dense layer that maps it to 1000 outputs. Parameter count: $3HW \times 1000$ weights, about 150 M at $224 \times 224$, for ONE LAYER. Plus, that layer treats every pixel position as independent: a cat in the top-left and a cat in the bottom-right share no evidence.

A convolutional layer fixes both problems with one idea: weight sharing. One small filter (say $3 \times 3$) is slid across all positions, producing a feature map. Number of parameters: $K \cdot K \cdot C_{in} + 1$ per filter, whatever the image size. Plus we get translation equivariance (shift the input, the output shifts the same way) because the same filter is applied everywhere.
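The gap between the two layer types is easiest to see with concrete numbers. The sketch below is only an illustration: the 224×224 input, the 1000 outputs, and the 64 filters are assumed sizes, and the helper names are hypothetical.

```python
def dense_params(h, w, c, n_out):
    """Weights plus biases of a fully connected layer on a flattened h*w*c image."""
    return h * w * c * n_out + n_out


def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k conv layer; the image size never appears."""
    return (k * k * c_in + 1) * c_out


# Assumed sizes for illustration: 224x224 RGB input, 1000 outputs, 64 filters.
dense = dense_params(224, 224, 3, 1000)
conv = conv_params(3, 3, 64)
print(dense)  # 150529000
print(conv)   # 1792
```

Doubling the image side quadruples `dense` but leaves `conv` unchanged, which is the weight-sharing argument in one line.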

That's the whole insight. The rest of CNN history (2012–2020) is about how to stack and shape these filters to go deeper, wider, and more efficient.

The Numbers Every Exam Wants

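Conv output size. $W_{out} = \lfloor (W - K + 2P)/S \rfloor + 1$ for input width $W$, kernel $K$, padding $P$, stride $S$. Floor, not ceiling.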

Conv params. $(K \cdot K \cdot C_{in} + 1) \cdot C_{out}$. The +1 is bias. **Independent of $H, W$**: the filter doesn't care how big the image is.

Same padding (output size = input size, at stride 1) for odd $K$: $P = (K-1)/2$. $3 \times 3 \to P = 1$. $5 \times 5 \to P = 2$. $7 \times 7 \to P = 3$.

Receptive field. A single $3 \times 3$ conv has RF 3. Stack $L$ of them at stride 1 and RF = $2L + 1$. Add a stride-2 pool and every later layer grows the RF twice as fast, because one step in the pooled map spans two input pixels. Modern backbones get RFs of hundreds of pixels by depth alone, no big kernels needed.
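The formulas in this section can be checked mechanically. The following is a minimal sketch with hypothetical helper names; `receptive_field` uses the usual recurrence in which each layer adds (kernel − 1) times the current "jump" (the distance between adjacent outputs measured in input pixels), and the layer stacks in the examples are made up.

```python
def conv_out_size(w, k, p=0, s=1):
    """Output width of a conv or pool layer: floor((w - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1


def same_padding(k):
    """Padding that preserves size at stride 1; defined here for odd k only."""
    if k % 2 == 0:
        raise ValueError("(k - 1) / 2 padding needs an odd kernel size")
    return (k - 1) // 2


def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # growth is scaled by the current jump
        jump *= s             # a stride multiplies the jump for later layers
    return rf


print(conv_out_size(32, 5))                      # 28
print(conv_out_size(224, 7, p=3, s=2))           # 112
print(conv_out_size(32, 3, p=same_padding(3)))   # 32
print(receptive_field([(3, 1)] * 3))             # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)])) # 8
```

In the last example the second 3×3 conv adds 4 pixels of RF instead of 2, because it sits after a stride-2 pool.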

Two 3×3 Beats One 5×5

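A neat result that justifies the whole VGG philosophy. RF of two stacked $3 \times 3$ convs = 5 (same as one $5 \times 5$). Parameters, with $C$ channels in and out: $2 \cdot 9C^2 = 18C^2$ vs $25C^2$, about 28% fewer. AND you get an extra ReLU in between for free. Same logic gives three $3 \times 3$ = $7 \times 7$ RF, $27C^2$ vs $49C^2$.

This is why VGG is just $3 \times 3$ everywhere.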
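The saving is simple arithmetic, sketched here under the assumption that every layer has the same number of input and output channels (64 is an arbitrary choice) and that biases are ignored. The function names are hypothetical.

```python
def stacked_3x3_weights(n_layers, c):
    """Weights in n stacked 3x3 convs with c channels in and out (no biases)."""
    return n_layers * 9 * c * c


def single_conv_weights(k, c):
    """Weights in one k x k conv with c channels in and out (no biases)."""
    return k * k * c * c


c = 64  # assumed channel count; the ratios below do not depend on it
saving_5 = 1 - stacked_3x3_weights(2, c) / single_conv_weights(5, c)
saving_7 = 1 - stacked_3x3_weights(3, c) / single_conv_weights(7, c)
print(round(saving_5, 3))  # 0.28
print(round(saving_7, 3))  # 0.449
```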

1×1 Convolution — Four Uses

A $1 \times 1$ filter doesn't mix spatially, but it mixes ACROSS CHANNELS. Four uses to memorise.

(1) Channel reduction (bottleneck). Drop the channel count before the expensive $3 \times 3$. Inception, ResNet bottleneck blocks. (2) Non-linearity injection. A per-pixel MLP between conv layers. (3) Cross-channel mixing. Without spatial mixing. MobileNet's pointwise piece. (4) Cheap. Only $C_{in} \cdot C_{out}$ params (plus biases).
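One way to see "mixes across channels only" is to write the $1 \times 1$ conv as the same small linear map applied independently at every pixel. The sketch below uses plain nested lists in an H × W × C layout; the layout, the toy values, and the function name are all assumptions made for illustration.

```python
def conv1x1(x, weight, bias):
    """x: H x W x C_in nested lists; weight: C_out x C_in; bias: length C_out."""
    return [[[sum(w_o[c] * px[c] for c in range(len(px))) + b_o
              for w_o, b_o in zip(weight, bias)]
             for px in row]
            for row in x]


x = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
     [[2.0, 2.0, 2.0], [1.0, 0.0, 1.0]]]      # 2 x 2 image, 3 channels
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # reduces 3 channels to 2
bias = [0.0, 1.0]

y = conv1x1(x, weight, bias)
print(y[0][0])  # [-2.0, 4.0]
print(y[1][1])  # [0.0, 2.0]
```

Each output pixel depends only on the input pixel at the same location, and the layer has $C_{in} \cdot C_{out} + C_{out} = 8$ parameters here.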

Pooling

Max-pool routes the argmax through (gradient flows only to the max). Avg-pool averages equally. Global Average Pooling (GAP) at the end of a network collapses $H \times W \times C$ to $1 \times 1 \times C$ and replaces the giant FC layer that used to sit there (compare VGG's heavy FC tail with ResNet's lean GAP). Pooling has NO parameters: it reduces FLOPs but not weights.
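A small forward-pass sketch of the three pooling operations, assuming a single-channel map with even height and width and a fixed 2×2 window at stride 2. The function names and the toy input are hypothetical.

```python
def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on an H x W nested list with even H and W."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def avg_pool_2x2(x):
    """2x2 average pooling, stride 2."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def global_avg_pool(channels):
    """channels: list of C maps (each H x W) -> list of C scalars."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]


x = [[1, 2, 0, 1],
     [3, 4, 1, 0],
     [0, 0, 5, 6],
     [1, 1, 7, 8]]
print(max_pool_2x2(x))       # [[4, 1], [1, 8]]
print(avg_pool_2x2(x))       # [[2.5, 0.5], [0.5, 6.5]]
print(global_avg_pool([x]))  # [2.5]
```

None of the three functions holds any state, which matches the "no parameters" point above.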

Dilated Convolutions

Insert zeros between kernel taps. A $3 \times 3$ with dilation 2 has the same RF as a $5 \times 5$ but uses only 9 parameters. Used in semantic segmentation (DeepLab needs big RF but can't downsample much) and WaveNet (long temporal context). Failure mode: gridding artefacts at large dilation $d$, where neighbouring outputs sample disjoint input pixels. Mitigation: stack with varying dilation rates.
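The effective kernel size of a dilated conv follows $K_{eff} = K + (K - 1)(d - 1)$. The 1D sketch below (hypothetical helper names) computes it and also lists which input positions each output touches, which makes the gridding problem visible.

```python
def effective_kernel(k, d):
    """Span covered by a k-tap kernel with dilation d."""
    return k + (k - 1) * (d - 1)


def taps(out_pos, k, d):
    """Input positions read by the output at out_pos (1D, stride 1, no padding)."""
    return {out_pos + i * d for i in range(k)}


print(effective_kernel(3, 1))          # 3
print(effective_kernel(3, 2))          # 5
print(sorted(taps(0, 3, 2)))           # [0, 2, 4]
print(sorted(taps(1, 3, 2)))           # [1, 3, 5]
print(taps(0, 3, 2) & taps(1, 3, 2))   # set()
```

Adjacent outputs at dilation 2 share no input positions, so stacking layers with the same dilation keeps the even and odd grids separate. Mixing dilation rates across layers reconnects them.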

Depthwise-Separable Convolution

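Standard conv mixes spatial AND channels at once. Depthwise-separable splits it: depthwise ($K \times K$ per channel, no channel mixing) + pointwise ($1 \times 1$, channel mixing only).

Param ratio: $\frac{K^2 C_{in} + C_{in} C_{out}}{K^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{K^2}$. At $K = 3$ that's roughly $1/9$, about $8$ to $9\times$ cheaper. This is the magic behind MobileNet.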
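To check the ratio numerically, the sketch below counts weights (biases ignored) for one layer. The channel counts 128 and 256 are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def standard_conv_weights(k, c_in, c_out):
    """One k x k conv that mixes space and channels together."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) plus pointwise 1x1."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 256  # assumed layer shape
std = standard_conv_weights(k, c_in, c_out)
sep = separable_conv_weights(k, c_in, c_out)
ratio = sep / std
print(std, sep)             # 294912 33920
print(round(ratio, 4))      # 0.115
print(round(std / sep, 2))  # 8.69
```

The ratio equals $1/C_{out} + 1/K^2$ exactly, so for wide layers it approaches $1/9$ at $K = 3$.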

Batch Normalisation In CNNs

Normalise per channel across $(N, H, W)$, i.e., across all spatial positions in all samples in the batch, separately for each channel. Then apply learnable $\gamma, \beta$ per channel (length $C$, not per spatial position; that's a common exam trap).

Effect: faster training, higher LR, less init-sensitive, slight regularisation. The deployment bug to remember: at inference, use the RUNNING mean/var accumulated during training, NOT the batch's own statistics.
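The axis convention and the train/inference split can be sketched in plain Python on an N × C × H × W nested list. This is an illustration under stated assumptions, not a framework's implementation: the momentum of 0.1, the running-statistic initial values, and the helper names are all hypothetical choices.

```python
def channel_stats(x):
    """x: N x C x H x W nested lists -> (means, variances), one entry per channel."""
    means, variances = [], []
    for c in range(len(x[0])):
        vals = [v for sample in x for row in sample[c] for v in row]
        m = sum(vals) / len(vals)
        means.append(m)
        variances.append(sum((v - m) ** 2 for v in vals) / len(vals))
    return means, variances


def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Normalise each channel with the given stats, then scale and shift."""
    return [[[[gamma[c] * (v - mean[c]) / (var[c] + eps) ** 0.5 + beta[c]
               for v in row]
              for row in sample[c]]
             for c in range(len(sample))]
            for sample in x]


# Toy batch: N=2 samples, C=2 channels, H=1, W=2.
x = [[[[1.0, 3.0]], [[10.0, 10.0]]],
     [[[5.0, 7.0]], [[20.0, 20.0]]]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]  # length C, not H x W

# Training step: use the batch's own statistics and update the running ones.
mean, var = channel_stats(x)
y_train = batch_norm(x, mean, var, gamma, beta)
momentum = 0.1  # assumed value
running_mean = [(1 - momentum) * 0.0 + momentum * m for m in mean]
running_var = [(1 - momentum) * 1.0 + momentum * v for v in var]

# Inference: a single sample, normalised with the RUNNING statistics.
y_eval = batch_norm(x[:1], running_mean, running_var, gamma, beta)
print(mean, var)  # [4.0, 15.0] [5.0, 25.0]
```

Calling `channel_stats(x[:1])` at inference instead of passing the running statistics would be exactly the deployment bug described above.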

Variants: LayerNorm (across $(C, H, W)$ per sample; used in ViT; works at batch size 1), GroupNorm (groups of channels; modern fallback for batch size 1 / large activations), InstanceNorm (per-sample per-channel; style transfer).

Equivariance vs Invariance — The Distinction Exams Love

Translation equivariant: $f(T(x)) = T(f(x))$. Shift input, output shifts the same way. Conv feature maps are translation equivariant.

Translation invariant: $f(T(x)) = f(x)$. Shift input, output unchanged. CNN classification (after global pool + FC + softmax) is translation invariant, at least approximately: strides and border padding break exact invariance in practice.

CNNs are NOT rotation equivariant. The kernels are LEARNED, not constrained to be rotation-symmetric. To get rotation equivariance, either augment with rotations during training (cheap fix) or use Group-equivariant CNNs (G-CNNs, principled fix).
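The two translation properties can be verified on a toy 1D signal. The sketch below assumes circular (wrap-around) boundaries so that both properties hold exactly. Real CNNs with zero padding and strides satisfy them only approximately. The function names and the toy data are hypothetical.

```python
def circ_corr(x, w):
    """Circular 1D cross-correlation: y[i] = sum_k w[k] * x[(i + k) mod n]."""
    n = len(x)
    return [sum(w[k] * x[(i + k) % n] for k in range(len(w))) for i in range(n)]


def shift(x, s):
    """Circularly shift a list to the right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s]


x = [0, 1, 3, 2, 0, 0, 0, 0]
w = [1, -1, 2]

# Equivariance: filtering a shifted input equals shifting the filtered output.
print(circ_corr(shift(x, 3), w) == shift(circ_corr(x, w), 3))  # True

# Invariance: a global max pool on top discards the position entirely.
print(max(circ_corr(shift(x, 3), w)) == max(circ_corr(x, w)))  # True
```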

The Architecture Parade — Memorise The Table

LeNet (1989/1998, LeCun): First CNN to work at scale. ~60 k params. MNIST digits.

AlexNet (2012, Krizhevsky/Sutskever/Hinton): Started deep learning. 60 M params, 8 layers, 2 GPUs. Three innovations: ReLU (much faster than tanh), Dropout in FC layers ($p = 0.5$), data augmentation (flips, crops, colour jitter). Top-1: 56.5%.

VGG (2014, Simonyan/Zisserman): Only $3 \times 3$ convs, depth. VGG-19: 143.7 M params (mostly in FC tail), 19.6 GFLOPs, top-1 74.2%. Heavy but conceptually clean.

Inception/GoogLeNet (2014, Szegedy): Parallel branches at multiple kernel sizes, $1 \times 1$ bottlenecks BEFORE the expensive $3 \times 3$/$5 \times 5$. Inception-v3: 27.2 M, 5.71 GFLOPs, top-1 77.3%.

ResNet (2015, He): $y = F(x) + x$. Solves the deep-network degradation problem (training accuracy stopped improving past ~20 layers). The skip path's gradient never vanishes. ResNet-50: 25.6 M params, 4.09 GFLOPs, top-1 76.1%, the workhorse.

DenseNet (2016, Huang): Each layer concatenates all previous outputs (vs ResNet's addition). DenseNet-161: 28.7 M, top-1 77.1%.

SENet (2017, Hu): Squeeze-Excitation channel attention bolted onto existing backbones. GAP → MLP → sigmoid → multiply with channels. +1% top-1, near-zero compute.

MobileNet (2017, Howard): Depthwise-separable convs throughout. 4.2 M, ~70% top-1. Phone-ready.

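EfficientNet (2019, Tan/Le): Compound scaling: depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$, with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$. B0: 5.3 M, 0.39 GFLOPs, top-1 77.7%, the best efficiency on the chart for years.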

Modern (post-CNN era): ViT (2020) — first transformer to beat CNN at scale. Swin (2021) — hierarchical ViT with windowed attention. ConvNeXt (2022) — modernised ResNet that closes the gap with ViT, proving CNNs aren't actually obsolete.
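EfficientNet's compound scaling is easy to tabulate. In the sketch below, the coefficient values 1.2, 1.1, and 1.15 are the ones commonly quoted from the EfficientNet paper; they do not appear in these notes, so treat them as an assumption. The function names are hypothetical.

```python
# Assumed base coefficients (as commonly quoted for EfficientNet).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width, and resolution multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs grow linearly in depth and quadratically in width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(flops_multiplier(phi), 2))
```

With these values $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 1.92$, so each unit increase in $\phi$ roughly doubles the FLOPs.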

Backprop Through The CNN

The exam might ask you to handle the backward pass for any of these layers. Memorise:

Conv backward. Two gradients. $\partial L/\partial X$ = convolution of the upstream gradient with the FLIPPED kernel (it's still a conv operation, just with the kernel reversed in both spatial dims). $\partial L/\partial W$ = cross-correlation of input with the upstream gradient. So conv backward = also a conv.

MaxPool backward. The gradient flows ONLY to the argmax of the forward pass; all other positions get 0.

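AvgPool backward. Distribute equally: each of the $k \times k$ pooled positions receives the upstream gradient divided by $k^2$.

ReLU backward. A gate: pass the gradient through if the forward pre-activation was $> 0$, else 0.

Skip backward. For $y = F(x) + x$: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)$. The identity term is what prevents vanishing.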
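The backward rules above can be sketched in 1D with plain lists. This is a minimal illustration under assumptions: "conv" means valid cross-correlation at stride 1 (as in deep learning frameworks), pooling is shown for a single window, and every function name is hypothetical. A finite-difference check on one input element is included.

```python
def corr1d(x, w):
    """Valid cross-correlation: y[i] = sum_k w[k] * x[i + k]."""
    return [sum(w[k] * x[i + k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]


def conv1d_backward(x, w, g):
    """g is dL/dy. Returns (dL/dx, dL/dw)."""
    pad = [0.0] * (len(w) - 1)
    dx = corr1d(pad + g + pad, w[::-1])  # upstream gradient vs FLIPPED kernel
    dw = corr1d(x, g)                    # input vs upstream gradient
    return dx, dw


def relu_backward(pre, g):
    """Pass the gradient only where the forward pre-activation was positive."""
    return [gi if p > 0 else 0.0 for p, gi in zip(pre, g)]


def max_pool_backward(window, g):
    """All of the gradient goes to the argmax of one pooling window."""
    j = max(range(len(window)), key=window.__getitem__)
    return [g if i == j else 0.0 for i in range(len(window))]


def avg_pool_backward(window, g):
    """The gradient is split equally over one pooling window."""
    return [g / len(window)] * len(window)


x = [1.0, 2.0, -1.0, 0.5, 3.0]
w = [0.5, -1.0, 2.0]
g = [1.0, -2.0, 0.5]  # pretend upstream gradient dL/dy
dx, dw = conv1d_backward(x, w, g)
print(dx)  # [0.5, -2.0, 4.25, -4.5, 1.0]
print(dw)  # [-3.5, 4.25, -0.5]

# Finite-difference check of dL/dx[2] for the loss L = sum_i g[i] * y[i].
eps = 1e-6
loss = lambda xs: sum(gi * yi for gi, yi in zip(g, corr1d(xs, w)))
x_hi = x[:2] + [x[2] + eps] + x[3:]
x_lo = x[:2] + [x[2] - eps] + x[3:]
numeric = (loss(x_hi) - loss(x_lo)) / (2 * eps)
```

`numeric` should agree with `dx[2]` up to floating-point error, which is a quick sanity check worth repeating for any hand-derived backward pass.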

CNNs For Video And Audio

Video. C3D uses 3D convs over $(T, H, W)$: expensive (a temporal kernel of size $K_t$ costs about $K_t \times$ the parameters and FLOPs of the 2D layer), trained from scratch. I3D is the clever trick: inflate 2D ImageNet kernels into 3D by copying along time and dividing by the temporal kernel size $K_t$. A "boring video" (still image repeated) then gives the same activations as the 2D network on that image, so you get ImageNet pretraining for free. SlowFast has two pathways: slow (low fps, high channels, captures semantics) + fast (high fps, low channels, captures motion), with lateral connections.

Audio. 1D CNN on raw waveform — WaveNet uses dilated 1D convs for long temporal context. 2D CNN on spectrogram (STFT or mel) — treat audio as image; works because frequency bands have spatial structure.
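The I3D inflation trick from the video paragraph can be checked at a single output location. In this sketch the 2D kernel, the image patch, and the temporal size of 4 are made-up values, and the helper names are hypothetical. The kernel is copied along time and divided by the temporal size, then applied to a clip made of one frame repeated.

```python
def inflate(kernel2d, t):
    """Copy a 2D kernel t times along time and divide by t."""
    return [[[v / t for v in row] for row in kernel2d] for _ in range(t)]


def response2d(patch, kernel2d):
    """Response of a 2D kernel at one location (sum of elementwise products)."""
    return sum(k * p
               for k_row, p_row in zip(kernel2d, patch)
               for k, p in zip(k_row, p_row))


def response3d(clip, kernel3d):
    """Response of a 3D kernel at one location: sum of per-frame 2D responses."""
    return sum(response2d(frame, k2d) for frame, k2d in zip(clip, kernel3d))


kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0],
         [4.0, 1.0, 1.0],
         [5.0, 2.0, 2.0]]
t = 4
boring_clip = [patch] * t  # the same frame repeated t times

print(response2d(patch, kernel))                   # 12.0
print(response3d(boring_clip, inflate(kernel, t))) # 12.0
```

Without the division the 3D response would be `t` times larger (48.0 here), so the pretrained activations downstream would no longer match.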

What You Walk In Carrying

The conv-output and conv-param formulas (independent of $H, W$). Same-padding recipe. Receptive-field arithmetic. Two-$3 \times 3$-beats-one-$5 \times 5$ argument. Four uses of $1 \times 1$. Max vs avg pool semantics and their backward behaviour. Dilated convolution and the effective kernel size. Depthwise-separable saving formula and the ~$1/9$ ratio. BatchNorm formula, axis convention in CNNs, inference vs training divergence. Translation equivariance vs invariance, "CNN not rotation equivariant" remark. The architecture table (LeNet → AlexNet → VGG → Inception → ResNet → DenseNet → SENet → MobileNet → EfficientNet), with the headline params/top-1/innovation for each. ResNet's gradient argument. Compound scaling formula. I3D inflation trick and the reason for dividing by the temporal kernel size.

That's the CNN unit. The next unit (Object Detection — formerly Unit 1) lives on top of everything in this one.
