ViT Pipeline, Scaling, and Swin
Intuition
Feeding ~50K raw pixels into self-attention as individual tokens is infeasible (attention is O(n²)). ViT solves this by treating non-overlapping 16×16 patches as 'words' — for a 224² image that's 14×14 = 196 tokens, well within Transformer reach.
Explanation
ViT pipeline end-to-end: (1) split an H × W × 3 image into N patches of size P × P; (2) flatten each patch to a P²·3 vector; (3) project linearly to d_model (equivalent to a single conv with kernel and stride P); (4) prepend a learnable [CLS] token; (5) add learned 1D positional embeddings to the N + 1 token sequence; (6) pass through L Transformer encoder layers; (7) take the final [CLS] embedding through an MLP head for the class logits.
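A minimal PyTorch sketch of steps (1)–(7), assuming ViT-B/16 shapes; TinyViT and its argument names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Sketch of the ViT pipeline; shapes follow ViT-B/16."""
    def __init__(self, img=224, patch=16, d_model=768, n_layers=12,
                 n_heads=12, n_classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2                       # 14*14 = 196
        # (1)-(3) patchify + flatten + project == conv with kernel=stride=P
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # (4) learnable [CLS] token, (5) learned 1D positional embeddings
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        # (6) stack of PreNorm encoder layers
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           activation='gelu',
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers,
                                             norm=nn.LayerNorm(d_model))
        # (7) classification head on the final [CLS] embedding
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (B, 3, 224, 224)
        t = self.patchify(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1)
        t = self.encoder(t + self.pos)                     # (B, 197, 768)
        return self.head(t[:, 0])                  # logits from [CLS]
```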
The [CLS] token is a learnable global summary — not tied to any patch — that aggregates information from all patches via self-attention. Empirically, global average pooling of patch tokens works similarly, but [CLS] is the canonical choice and gives interpretable attention maps showing which regions drove classification.
Positional embeddings: the original paper compared 1D, 2D, and no PE. 1D works essentially as well as 2D for ViT — the embeddings are learned, attention is fully connected, and the model recovers any structure it needs. No PE drops accuracy significantly.
Scaling effects to memorise: increasing image resolution increases sequence length (N = HW/P², so proportional to image area) and attention compute (O(N²)), but transformer body parameters are unchanged (the body is a function of d_model and L, not N). Halving patch size (32 → 16) quadruples sequence length. Increasing layers scales parameters linearly. Increasing heads at fixed d_model is essentially parameter-free.
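These rules are easy to sanity-check with plain arithmetic (numbers below assume the 224²/16 setup from this section):

```python
def n_tokens(img, patch):
    """Patches plus the [CLS] token."""
    return (img // patch) ** 2 + 1

print(n_tokens(224, 16))  # 197
print(n_tokens(224, 32))  # 50  -> halving P (32 -> 16) ~quadruples length
print(n_tokens(448, 16))  # 785 -> 2x resolution ~quadruples length too;
                          #        body params unchanged, attention ~16x
```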
ViT-B/16 quick numbers: 224² image, 16² patch → 14×14 = 196 patches + [CLS] = 197 tokens. d_model = 768, L = 12, heads = 12. Per layer ≈ 7 M params (4·768² for QKVO + 2·768·3072 for FFN). Total ≈ 86 M.
ViTs differ from CNNs in two important ways (Raghu et al.): (1) early ViT layers mix LOCAL and GLOBAL information simultaneously (some heads attend across the entire image even in layer 1) — CNNs are strictly local in early layers; (2) skip connections matter much more in ViT — removing them collapses representations harder than in CNNs.
Swin Transformer fixes ViT's O(n²) for dense prediction. Self-attention is computed within M×M local windows: each window costs O((M²)²) and there are n/M² windows, so a layer costs O(M²·n), linear in n for fixed M. Alternating layers shift the window grid by M/2 so tokens at window boundaries land in the interior of a different window in the next layer — global receptive field across layers while each layer stays linear (see the sketch below).
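A minimal sketch of the window partition and the alternating cyclic shift (the paper implements the shift with torch.roll), assuming Swin-T stage-1 shapes; window_partition is illustrative and the attention call itself is elided:

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (B * H/M * W/M, M*M, C) local windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

x = torch.randn(2, 56, 56, 96)   # Swin-T stage 1: 56x56 tokens, C = 96
M = 7

# layer l: attention runs inside each 7x7 window -> O(M^2 * n) per layer
windows = window_partition(x, M)                    # (128, 49, 96)

# layer l+1: cyclically shift the grid by M/2 before partitioning, so
# tokens at window boundaries land in the interior of a new window
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
windows_shifted = window_partition(shifted, M)      # (128, 49, 96)
```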
Definitions
- Patch embedding — Linear projection of flattened P × P × 3 patches to d_model; equivalent to a single conv with kernel P and stride P.
- [CLS] token — Learnable summary token prepended to the patch sequence; its final embedding is used for classification.
- Inductive bias (CNN vs ViT) — CNNs have locality + translation equivariance baked in. ViTs have neither and must learn them from data.
- Swin shifted-window attention — Self-attention within M × M local windows; window grid shifts by M/2 every other layer for cross-window information flow.
Formulas
\text{ViT input length:}\ N + 1 = HW/P^2 + 1
\text{Per-block params:}\ 4 d_{\text{model}}^2 + 2 d_{\text{model}} d_{\text{ff}}
\text{Patch embedding params:}\ P^2 \cdot 3 \cdot d_{\text{model}}
\text{Attention complexity:}\ O(N^2 d)\ \text{(ViT)},\ O(M^2 N d)\ \text{(Swin)}
Derivations
ViT-B/16 parameter count: per layer ≈ 4·768² + 2·768·3072 ≈ 2.36 M + 4.72 M = 7.08 M. ×12 layers ≈ 85 M. Patch embedding: 768 · (16²·3) = 0.59 M. Final classifier: 768·K (negligible). Total ≈ 86 M.
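The same arithmetic as a few lines of Python; biases, LayerNorms, positional embeddings, and the head supply the remainder up to ≈86 M:

```python
d, d_ff, L, P = 768, 3072, 12, 16

attn = 4 * d * d            # W_Q, W_K, W_V, W_O (biases ignored)
ffn = 2 * d * d_ff          # the two FFN matrices
per_layer = attn + ffn      # 7,077,888  (~7.08 M)
body = L * per_layer        # 84,934,656 (~85 M)
embed = P * P * 3 * d       # 589,824    (~0.59 M)

print(per_layer, body, body + embed)   # ~85.5 M before biases/LNs/PE/head
```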
Why 16×16 not 32×32 for high-res tasks: smaller patches → more tokens → more detail captured at attention level. For dense prediction (segmentation) you want as many tokens as compute allows, hence small patches and windowed attention.
Examples
- 224×224 image at patch 16 → 196 patches; at patch 32 → 49 patches (4× fewer tokens, ≈16× cheaper attention, much less detail).
- Swin tiny on 224² uses windows of 7×7 patches (49 tokens) — far cheaper than ViT's 196² attention.
- Position embedding interpolation when fine-tuning at higher resolution: reshape the learned PE onto its 2D patch grid, interpolate to the new grid size, and keep the [CLS] PE unchanged (sketch below).
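A minimal interpolation sketch, assuming a (1, 197, 768) learned PE and 224 → 384 fine-tuning; resize_pos_embed is a hypothetical helper, and bicubic mode follows common practice rather than a fixed standard:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos, old_grid, new_grid):
    """pos: (1, 1 + old_grid**2, d) with the [CLS] PE first."""
    cls_pe, patch_pe = pos[:, :1], pos[:, 1:]
    d = pos.shape[-1]
    # reshape the 1D sequence back onto its 2D patch grid, interpolate,
    # then flatten back to a sequence; the [CLS] PE is kept unchanged
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid ** 2, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe_224 = torch.randn(1, 197, 768)          # 14x14 grid + [CLS]
pe_384 = resize_pos_embed(pe_224, 14, 24)  # 384/16 = 24 -> (1, 577, 768)
```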
Diagrams
- ViT block: patchify image → linear project → +[CLS] → +PE → L × (LN, MHA, +res, LN, FFN, +res) → CLS → MLP → class.
- Swin shifted-window: window grid in layer 1 vs shifted grid in layer 2; same token now sits in the interior of a different window.
- Attention distance per ViT head: early heads include both local (small distance) and global (large distance) heads, unlike CNNs.
Edge cases
- ViT on small datasets underperforms CNNs — no built-in inductive biases.
- Fine-tuning at higher resolution requires PE interpolation; otherwise the model sees an unfamiliar sequence length.
- Token redundancy is high — many patches contribute little; token pruning (DynamicViT) drops 30–50% of tokens with minimal accuracy loss.
Common mistakes
- Saying patch embedding is 'just flattening' — it's a learned linear projection.
- Forgetting the [CLS] token in parameter accounting.
- Stating ViT has 'no positional information' — it has learned 1D PEs.
- Treating Swin's 'cyclic shift' as a separate mechanism from shifted windows — the cyclic shift is just the efficient implementation of the shifted-window partition, applied every other layer.
Shortcuts
- ViT-B/16 ≈ 86 M params; ViT-L ≈ 307 M; ViT-H ≈ 632 M.
- PreNorm everywhere in modern ViT (and the original).
- Swin gives O(n) per layer + global receptive field across layers via shifting.