
Computer Vision

CSE471
Prof. Makarand Tapaswi + Prof. Charu Sharma · Spring 2025-26 · 4 credits

ViT Pipeline, Scaling, and Swin

Unit 7 — Vision Transformers (ViT)

Eyes That Read Pixels Like Words

For 9 years — from AlexNet in 2012 to 2021 — every state-of-the-art vision model was a convolutional neural network. The local-receptive-field, weight-sharing, translation-equivariant inductive bias of convolutions was treated as a *truth* about how vision must work. There were occasional attempts to swap in self-attention, but they always lost to ResNets.

Then in late 2020, a paper from Google appeared with a startling claim: the convolutional inductive bias is not necessary. A plain Transformer — the same architecture used for translating English to French — could match or beat the best CNNs on ImageNet, *provided you fed it enough data*.

The paper was titled "An Image is Worth 16×16 Words" — and that's literally the recipe. Cut the image into patches. Treat each patch as a token. Feed the sequence to a standard Transformer encoder. That's it.

The model is called ViT — Vision Transformer (Dosovitskiy et al., ICLR 2021). And from 2021 onward, ViT replaced CNNs as the default vision backbone for nearly everything that operates at scale.

This unit covers two things: the ViT architecture itself, and Swin Transformer — a hierarchical variant that fixes ViT's main weakness.

Why "pixels as a sequence" doesn't work

The very first question the lecture asks: can we just unroll an image into a sequence of pixels and feed it to a Transformer?

A 224×224 RGB image has 50,176 pixels. A Transformer's self-attention is $O(N^2)$ in sequence length. That's ~2.5 *billion* attention scores per layer. Computationally impossible.

So you need to compress the sequence. The simplest possible compression: chunk the image into non-overlapping patches and treat each patch as one token.

The ViT architecture

Patch tokenisation

Given input $x \in \mathbb{R}^{H \times W \times 3}$ and patch size $P$ (typically $P = 16$), split the image into $N = HW/P^2$ non-overlapping patches. For $224 \times 224$: $N = (224/16)^2 = 196$.

Each patch is flattened into a $P^2 \cdot 3$-dim vector and linearly projected to $D$ dimensions:

$z_i = \text{flatten}(x_i)\, E, \quad E \in \mathbb{R}^{(P^2 \cdot 3) \times D}$

For $P = 16$: $P^2 \cdot 3 = 768$. So $E$ is $768 \times 768$ — the patch projection happens to be square (ViT-B was sized to make these match).
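A minimal sketch of this step in PyTorch. The `PatchEmbed` name and the stride-$P$ Conv2d trick are illustrative, not the lecture's code, but the convolution is mathematically identical to flatten-each-patch + shared linear projection:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and linearly project each to D dims."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # stride-P conv with P x P kernel == per-patch flatten + linear
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                  # x: (B, 3, 224, 224)
        x = self.proj(x)                   # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, 768) -- one token per patch
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                        # torch.Size([1, 196, 768])
```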

The [CLS] token

Following BERT, ViT prepends a special learnable [CLS] token at position 0:

Input sequence: $[\,\text{[CLS]};\ z_1;\ z_2;\ \dots;\ z_N\,]$, plus position embeddings
Total length: $N + 1$ (for $224^2$ with $P = 16$: $196 + 1 = 197$ tokens)

After the Transformer, the final hidden state of [CLS] is used for classification — pass it through a single linear layer to get class logits.

The lecture asks the natural question: *why not just average all patch tokens?* The answer is yes, you can — a global-average-pooled version of ViT works similarly. But the original paper used [CLS] because BERT did, and the convention stuck.
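A sketch of the [CLS]-prepend and position-embedding addition, assuming the (B, 196, 768) patch tokens from above (variable names are mine):

```python
import torch
import torch.nn as nn

B, N, D = 1, 196, 768
patch_tokens = torch.randn(B, N, D)                  # output of the patch embedding

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # learnable, shared across images
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # 1D learned PEs, one per position

x = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1)  # (B, 197, D)
x = x + pos_embed                                                   # add position info
# after L Transformer blocks, x[:, 0] (the [CLS] state) feeds a linear classifier
```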

Position embedding choices

Your lecture explicitly asks: 1D, 2D, or none?

  • None — drop position info entirely. Result: ViT becomes a bag of patches. Accuracy drops ~3% on ImageNet (Transformers can do a lot without explicit position info).
  • 1D learned — assign each patch a 1D position index (raster scan), learn an embedding for each. This is what ViT actually uses. Simple, works well.
  • 2D learned — separate row and column embeddings. Marginal gain over 1D in the original paper.

Memorise: ViT uses 1D learned position embeddings, despite images being 2D. The Transformer figures out the 2D structure on its own.

The Transformer block

Same as the 2017 original, with one tweak — Pre-Norm (LayerNorm applied *before* MSA and MLP rather than after):

$z' = z + \text{MSA}(\text{LN}(z))$
$z'' = z' + \text{MLP}(\text{LN}(z'))$

The "Simple notation!" slide flags this — you should be able to write the block in 2 lines from memory.

The MLP is $D \to 4D \to D$ with a GELU in between. The 4× expansion ratio is standard ($768 \to 3072 \to 768$ for ViT-B).
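A minimal Pre-Norm block sketch using PyTorch's built-in `nn.MultiheadAttention` — an illustration under ViT-B defaults, not the reference implementation:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One ViT encoder block: Pre-Norm MSA + Pre-Norm MLP, both with residuals."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),   # D -> 4D
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),   # 4D -> D
        )

    def forward(self, x):                      # x: (B, N+1, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # z'  = z  + MSA(LN(z))
        x = x + self.mlp(self.norm2(x))                    # z'' = z' + MLP(LN(z'))
        return x
```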

The ViT family

The "Herd of Vision Transformers" slide names the variants. Memorise these:

| Model | Layers | $D$ | MLP | Heads | Params |
| --- | --- | --- | --- | --- | --- |
| ViT-Tiny | 12 | 192 | 768 | 3 | ~5M |
| ViT-Small | 12 | 384 | 1536 | 6 | ~22M |
| ViT-Base | 12 | 768 | 3072 | 12 | ~86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | ~307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | ~632M |

Consistent pattern: MLP size $= 4D$, and $D/\text{heads}$ is the per-head dimension (64 for every variant except ViT-Huge's 80). *ViT-B/16* means ViT-Base with $16 \times 16$ patches.

Parameter calculation — exam favourite

For one Transformer block ($D = 768$, ViT-Base):

  • Attention: $W_Q, W_K, W_V, W_O$, each $D \times D$ → $4D^2 \approx 2.36$M.
  • MLP: up ($D \times 4D$) + down ($4D \times D$) → $8D^2 \approx 4.72$M.
  • **Total per block: $12D^2 \approx 7.08$M.**

For all 12 layers: $12 \times 12D^2 \approx 84.9$M.

Patch embedding (one-time): $(16^2 \cdot 3) \times 768 = 768 \times 768 \approx 0.59$M.

**Total: $\approx 85.5$M** — matches the slide's ~86M for ViT-Base (biases, LayerNorms, and position embeddings make up the small remainder).
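A few lines to verify the arithmetic (weight matrices only; biases, LayerNorms, [CLS] and position embeddings are ignored, which is why the total lands slightly under ~86M):

```python
D, L = 768, 12
attn_per_block = 4 * D * D          # W_Q, W_K, W_V, W_O
mlp_per_block  = 2 * 4 * D * D      # D -> 4D and 4D -> D
per_block = attn_per_block + mlp_per_block
patch_embed = (16 * 16 * 3) * D     # 768 x 768 projection, one-time

total = L * per_block + patch_embed
print(per_block, total)             # 7077888 85524480  (~7.08M per block, ~85.5M total)
```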

The Quiz — this WILL be on your exam

The lecture has an explicit quiz slide. What happens to parameters and sequence length when …

Q1 — Resolution $224 \to 336$ (patch fixed at 16)

  • Old: $N = (224/16)^2 = 196$. New: $N = (336/16)^2 = 441$.
  • **Sequence length: increases ($196 \to 441$, about $2.25\times$).**
  • Parameters: ~unchanged. Transformer weights don't depend on $N$. *Subtlety:* learned PEs are $(N+1) \times D$, sized to the old length. Fix: bilinear interpolation of the PEs to the new grid — no new parameters (see the sketch below).
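A sketch of that interpolation, assuming a learned `pos_embed` of shape (1, 197, D) whose entry 0 is the [CLS] embedding (the function name is mine):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=21):
    """pos_embed: (1, 1 + old_grid**2, D) -> (1, 1 + new_grid**2, D)."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]   # keep [CLS] PE as-is
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

print(interpolate_pos_embed(torch.randn(1, 197, 768)).shape)  # (1, 442, 768)
```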

Q2 — Patch size $32 \to 16$ (resolution fixed)

  • Old: $N = (224/32)^2 = 49$. New: $N = (224/16)^2 = 196$. **Sequence length: $4\times$.**
  • Parameters: slightly change. Patch embedding shrinks from $(32^2 \cdot 3) \times D$ to $(16^2 \cdot 3) \times D$ (a factor of 4). The rest is unchanged. Dominant effect: more attention compute, more attention memory.

Q3 — Layers $8 \to 12$

  • Sequence length: unchanged.
  • **Parameters: increase linearly in $L$.** Each block adds $12D^2 \approx 7.08$M; going $8 \to 12$ adds $\approx 28.3$M.

Q4 — Heads $1 \to 8$ (D constant) — the trickiest

In MHA, $D$ is partitioned across heads: each head has dimension $D/h$. The $W_Q, W_K, W_V, W_O$ projections are $D \times D$ regardless of $h$.

  • Sequence length: unchanged (heads don't affect tokens).
  • Parameters: unchanged. The surprising answer most students miss.

The exam line: *the number of heads doesn't change the parameter count of MSA — it only changes how the $D$-dim space is partitioned for parallel attention computation.*
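You can check Q4 directly with PyTorch's built-in attention module — the parameter count is identical for 1 and 8 heads:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

one_head   = nn.MultiheadAttention(embed_dim=768, num_heads=1)
eight_head = nn.MultiheadAttention(embed_dim=768, num_heads=8)
print(n_params(one_head), n_params(eight_head))   # 2362368 2362368 -- identical
```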

The data hunger — ViT's defining property

ViT's central finding: at small data scales (ImageNet-only), ResNets win. At large data scales (JFT-300M, 300 million images), ViT wins by a wide margin.

The reason: CNNs come with built-in inductive biases — *locality* (a feature depends mainly on its neighbours) and *translation equivariance* (moving an object doesn't change what it is). These priors help when data is small. When data is large, the priors *limit* the model — it can't learn task-specific structure that violates them.

ViT has no spatial inductive bias beyond the patch grid. Every patch can attend to every other patch from layer 1. With small data, this looks like overfitting. With huge data, it allows richer relationships than CNNs.

"ViTs transfer better than ResNets" comes from this scaling: pretrain on JFT-300M, transfer to ImageNet (+18 other tasks), ViTs win across the board.

Do ViTs see like CNNs?

Raghu et al. (NeurIPS 2021) asked: do ViTs end up learning CNN-like representations, or do they develop novel ones?

The analysis uses CKA — Centered Kernel Alignment — a similarity metric for representations. Two findings:

  • CNN representations evolve gradually. Early layers are texture-y, middle layers are part-y, late layers are object-y — a clear hierarchy.
  • ViT representations are remarkably uniform across layers. Even early layers attend globally; the representation doesn't have a CNN-like feature pyramid.

The attention distance slide makes this concrete. In CNNs, early-layer receptive fields are tiny (a few pixels); in ViTs, even layer-1 attention spans the full image in many heads.

This is both a strength (ViT captures global context immediately) and a weakness (ViT lacks a multi-scale feature pyramid, making vanilla ViT worse for dense prediction tasks like segmentation and detection).

What the [CLS] token learns

Through training, the [CLS] token learns to attend most strongly to semantically meaningful regions — the object, the salient parts. This emergent "object-localising" behaviour is what DINO amplified into actual segmentation maps without any segmentation labels.

Position embeddings are self-similar

A striking finding: take ViT's learned 1D position embeddings and compute cosine similarity between the embedding at patch position $(i, j)$ and all other positions. You get a 2D pattern centred at $(i, j)$ with high similarity decreasing radially. The position embeddings were 1D-learned with no 2D structure imposed — yet ViT figured out the 2D layout on its own from training.
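A sketch of how that similarity map is computed, assuming access to a trained ViT's `pos_embed` tensor of shape (1, 197, 768); the helper name is mine, and on random embeddings (as here) the map is structureless:

```python
import torch
import torch.nn.functional as F

def pe_similarity_map(pos_embed, grid=14, i=7, j=7):
    """Cosine similarity between the PE at patch (i, j) and every other patch PE."""
    patch_pe = pos_embed[0, 1:]                    # drop [CLS]; (196, D)
    patch_pe = F.normalize(patch_pe, dim=-1)
    query = patch_pe[i * grid + j]                 # raster-scan index of patch (i, j)
    return (patch_pe @ query).reshape(grid, grid)  # (14, 14) similarity map

sim = pe_similarity_map(torch.randn(1, 197, 768))
# on a *trained* ViT's pos_embed: a bright spot at (7, 7) fading radially
```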

Swin Transformer — fixing ViT's weaknesses

ViT has two acknowledged weaknesses:

1. Quadratic complexity in sequence length. At high resolution, $N$ explodes and $O(N^2)$ self-attention becomes prohibitive.
2. No hierarchical features. Detection and segmentation work best with multi-scale features (FPN). ViT keeps a single resolution throughout.

Swin Transformer (Liu et al., ICCV 2021 Best Paper) fixes both. The name is Shifted Window Transformer.

Idea 1 — Windowed self-attention

Instead of letting every patch attend to every other, divide the patches into non-overlapping windows (typically $7 \times 7$ patches per window) and compute self-attention only within each window.

  • Standard ViT: $O(N^2)$ over the whole image.
  • Window attention: $O(N \cdot M^2)$ for window size $M \times M$ — **linear in $N$** because $M$ is constant.

For a $56 \times 56$ patch grid ($N = 3136$): full attention costs $3136^2 \approx 9.8$M scores per layer; windowed at $7 \times 7$ costs $64 \times 49^2 \approx 154$k. **About $64\times$ cheaper.**
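The window partition itself is just a reshape. A sketch, with shapes following Swin-T's stage-1 grid (the function name is mine):

```python
import torch

def window_partition(x, M=7):
    """x: (B, H, W, C) grid of patch tokens -> (num_windows * B, M*M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return x

windows = window_partition(torch.randn(1, 56, 56, 96))
print(windows.shape)   # (64, 49, 96): 64 windows of 49 tokens; attention runs per window
```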

Idea 2 — Shifted windows

Pure window attention has an obvious flaw: tokens in different windows never talk. The fix is elegant. In *alternate* Transformer blocks, **shift the window grid by half a window ($\lfloor M/2 \rfloor$ patches)**.

  • Block $\ell$: W-MSA — windows aligned to the regular grid.
  • Block $\ell+1$: SW-MSA — windows shifted by $\lfloor M/2 \rfloor$.
  • Block $\ell+2$: W-MSA — aligned again.

After the shift, patches that were in different windows in the previous block now share a window. Information propagates across original window boundaries. The shift produces some edge windows of irregular size; Swin handles this with cyclic shift + masked self-attention — patches are cyclically wrapped to maintain regular window sizes, with attention masks blocking the wrap-around computations.
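The cyclic shift is a single `torch.roll`; a sketch that omits the attention masking (variable names are mine):

```python
import torch

M = 7                                    # window size, in patches
x = torch.randn(1, 56, 56, 96)           # (B, H, W, C) grid of patch tokens

# SW-MSA block: roll the feature map by -M//2 so the shifted windows become regular
# M x M windows again, attend within windows (with a mask blocking wrap-around pairs),
# then roll back.
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
# ... window_partition(shifted), masked self-attention, window reverse ...
restored = torch.roll(shifted, shifts=(M // 2, M // 2), dims=(1, 2))
```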

Idea 3 — Hierarchical patch merging

Standard ViT keeps the same number of tokens throughout. Swin progressively downsamples like a CNN feature pyramid:

Stage 1: $56 \times 56$ patches at $C$ channels ($C = 96$ for Swin-T)
Stage 2: $28 \times 28$ patches at $2C$ channels
Stage 3: $14 \times 14$ patches at $4C$ channels
Stage 4: $7 \times 7$ patches at $8C$ channels

Patch merging at the start of each new stage: take a $2 \times 2$ block of neighbouring patches, concatenate their features ($4C$-dim), linearly project $4C \to 2C$. Halves the spatial dims and doubles the channels — *exactly like a strided convolution in a CNN*.
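A minimal patch-merging sketch; the module name mirrors the paper's term, and details like the LayerNorm placement follow the common implementation rather than the lecture slides:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 block of neighbouring tokens (4C dims) and project to 2C."""
    def __init__(self, C):
        super().__init__()
        self.norm = nn.LayerNorm(4 * C)
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):                          # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

out = PatchMerging(96)(torch.randn(1, 56, 56, 96))
print(out.shape)   # (1, 28, 28, 192): spatial dims halved, channels doubled
```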

Result: Swin produces a multi-scale hierarchical feature pyramid like ResNet/FPN, making it a drop-in replacement for CNNs in detection (Mask R-CNN with Swin) and segmentation pipelines.

Swin vs ViT — one line

| | ViT | Swin |
| --- | --- | --- |
| Attention scope | Global (all patches) | Local windows + shifted alternation |
| Complexity | $O(N^2)$ | $O(N \cdot M^2)$ — linear in $N$ |
| Feature hierarchy | Single resolution throughout | 4-stage pyramid (like ResNet) |
| Best for | Classification, contrastive pretraining | Dense prediction (detection, segmentation) |
| When to use | Large-scale pretraining target | Backbone for downstream perception tasks |

What you carry into the exam

The "16×16 words" insight and the patch-token compression. The ViT pipeline in 3 lines: linear-project flattened patches → -dim tokens → prepend , add 1D learned PEs → stack Pre-Norm Transformer blocks. The MLP's shape. ViT-B specs: , MLP = 3072, 12 heads, , ~86M params. The per-block formula and the worked total of 85.5M. The four quiz answers — *especially Q4's counter-intuitive parameters-unchanged*. The data-scaling story (ImageNet small → ResNet wins; JFT-300M big → ViT wins). The Raghu CKA findings — ViT representations are uniform across layers; CNN's are hierarchical. The attention-distance plot. Why 's attention localises objects. Why 1D PEs end up self-similar in 2D. Swin's three innovations — window self-attention (linear in ), shifted windows (cross-window communication), hierarchical patch merging (4-stage pyramid). The trade-off table.

You now know exactly what was inside SigLIP, what DINO was pretrained on, and what the family of "Vision Transformer Advances" was modifying.