ViT Pipeline, Scaling, and Swin
Intuition
Feeding ~50K raw pixels into self-attention as individual tokens is infeasible (attention is O(n²)). ViT solves this by treating non-overlapping 16×16 patches as 'words' — for a 224² image that's 14×14 = 196 tokens, well within Transformer reach.
Explanation
ViT pipeline end-to-end: (1) split an H × W × 3 image into N patches of size P × P; (2) flatten each patch to a P²·3 vector; (3) project linearly to d_model (equivalent to a single conv with kernel and stride P); (4) prepend a learnable [CLS] token; (5) add learned 1D positional embeddings to the N + 1 token sequence; (6) pass through L Transformer encoder layers; (7) take the final [CLS] embedding through an MLP head for the class logits.
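A minimal PyTorch sketch of steps (1)–(7), assuming ViT-B/16 shapes; TinyViT and its argument names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Sketch of the ViT pipeline; shapes follow ViT-B/16."""
    def __init__(self, img=224, patch=16, d_model=768, n_layers=12,
                 n_heads=12, n_classes=1000):
        super().__init__()
        n_patches = (img // patch) ** 2                       # 14*14 = 196
        # (1)-(3) patchify + flatten + project == conv with kernel=stride=P
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # (4) learnable [CLS] token, (5) learned 1D positional embeddings
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        # (6) stack of PreNorm encoder layers
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           activation='gelu',
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers,
                                             norm=nn.LayerNorm(d_model))
        # (7) classification head on the final [CLS] embedding
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (B, 3, 224, 224)
        t = self.patchify(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], dim=1)
        t = self.encoder(t + self.pos)                     # (B, 197, 768)
        return self.head(t[:, 0])                  # logits from [CLS]
```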
The [CLS] token is a learnable global summary — not tied to any patch — that aggregates information from all patches via self-attention. Empirically, global average pooling of patch tokens works similarly, but [CLS] is the canonical choice and gives interpretable attention maps showing which regions drove classification.
Positional embeddings: the original paper compared 1D, 2D, and no PE. 1D works essentially as well as 2D for ViT — the embeddings are learned, attention is fully connected, and the model recovers any structure it needs. No PE drops accuracy significantly.
Scaling effects to memorise: increasing image resolution increases sequence length (N = HW/P², so proportional to image area) and attention compute (O(N²)), but transformer body parameters are unchanged (the body is a function of d_model and L, not N). Halving patch size (32 → 16) quadruples sequence length. Increasing layers scales parameters linearly. Increasing heads at fixed d_model is essentially parameter-free.
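These rules are easy to sanity-check with plain arithmetic (numbers below assume the 224²/16 setup from this section):

```python
def n_tokens(img, patch):
    """Patches plus the [CLS] token."""
    return (img // patch) ** 2 + 1

print(n_tokens(224, 16))  # 197
print(n_tokens(224, 32))  # 50  -> halving P (32 -> 16) ~quadruples length
print(n_tokens(448, 16))  # 785 -> 2x resolution ~quadruples length too;
                          #        body params unchanged, attention ~16x
```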
ViT-B/16 quick numbers: 224² image, 16² patch → 14×14 = 196 patches + [CLS] = 197 tokens. d_model = 768, L = 12, heads = 12. Per layer ≈ 7 M params (4·768² for QKVO + 2·768·3072 for FFN). Total ≈ 86 M.
ViTs differ from CNNs in two important ways (Raghu et al.): (1) early ViT layers mix LOCAL and GLOBAL information simultaneously (some heads attend across the entire image even in layer 1) — CNNs are strictly local in early layers; (2) skip connections matter much more in ViT — removing them collapses representations harder than in CNNs.
Swin Transformer fixes ViT's O(n²) for dense prediction. Self-attention is computed within M×M local windows: each window costs O((M²)²) and there are n/M² windows, so a layer costs O(M²·n), linear in n for fixed M. Alternating layers shift the window grid by M/2 so tokens at window boundaries land in the interior of a different window in the next layer — global receptive field across layers while each layer stays linear (see the sketch below).
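A minimal sketch of the window partition and the alternating cyclic shift (the paper implements the shift with torch.roll), assuming Swin-T stage-1 shapes; window_partition is illustrative and the attention call itself is elided:

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (B * H/M * W/M, M*M, C) local windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

x = torch.randn(2, 56, 56, 96)   # Swin-T stage 1: 56x56 tokens, C = 96
M = 7

# layer l: attention runs inside each 7x7 window -> O(M^2 * n) per layer
windows = window_partition(x, M)                    # (128, 49, 96)

# layer l+1: cyclically shift the grid by M/2 before partitioning, so
# tokens at window boundaries land in the interior of a new window
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
windows_shifted = window_partition(shifted, M)      # (128, 49, 96)
```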
Definitions
- Patch embedding — Linear projection of flattened P × P × 3 patches to d_model; equivalent to a single conv with kernel P and stride P.
- [CLS] token — Learnable summary token prepended to the patch sequence; its final embedding is used for classification.
- Inductive bias (CNN vs ViT) — CNNs have locality + translation equivariance baked in. ViTs have neither and must learn them from data.
- Swin shifted-window attention — Self-attention within M × M local windows; window grid shifts by M/2 every other layer for cross-window information flow.
Formulas
\text{ViT input length:}\ N + 1 = HW/P^2 + 1
\text{Per-block params:}\ 4 d_{\text{model}}^2 + 2 d_{\text{model}} d_{\text{ff}}
\text{Patch embedding params:}\ P^2 \cdot 3 \cdot d_{\text{model}}
\text{Attention complexity:}\ O(N^2 d)\ \text{(ViT)},\ O(M^2 N d)\ \text{(Swin)}
Derivations
ViT-B/16 parameter count: per layer ≈ 4·768² + 2·768·3072 ≈ 2.36 M + 4.72 M = 7.08 M. ×12 layers ≈ 85 M. Patch embedding: 768 · (16²·3) = 0.59 M. Final classifier: 768·K (negligible). Total ≈ 86 M.
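The same arithmetic as a few lines of Python; biases, LayerNorms, positional embeddings, and the head supply the remainder up to ≈86 M:

```python
d, d_ff, L, P = 768, 3072, 12, 16

attn = 4 * d * d            # W_Q, W_K, W_V, W_O (biases ignored)
ffn = 2 * d * d_ff          # the two FFN matrices
per_layer = attn + ffn      # 7,077,888  (~7.08 M)
body = L * per_layer        # 84,934,656 (~85 M)
embed = P * P * 3 * d       # 589,824    (~0.59 M)

print(per_layer, body, body + embed)   # ~85.5 M before biases/LNs/PE/head
```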
Why 16×16 not 32×32 for high-res tasks: smaller patches → more tokens → more detail captured at attention level. For dense prediction (segmentation) you want as many tokens as compute allows, hence small patches and windowed attention.
Examples
- 224×224 image at patch 16 → 196 patches; at patch 32 → 49 patches (4× fewer tokens, ≈16× cheaper attention, much less detail).
- Swin tiny on 224² uses windows of 7×7 patches (49 tokens) — far cheaper than ViT's 196² attention.
- Position embedding interpolation when fine-tuning at higher resolution: reshape the learned PE onto its 2D patch grid, interpolate to the new grid size, and keep the [CLS] PE unchanged (sketch below).
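A minimal interpolation sketch, assuming a (1, 197, 768) learned PE and 224 → 384 fine-tuning; resize_pos_embed is a hypothetical helper, and bicubic mode follows common practice rather than a fixed standard:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos, old_grid, new_grid):
    """pos: (1, 1 + old_grid**2, d) with the [CLS] PE first."""
    cls_pe, patch_pe = pos[:, :1], pos[:, 1:]
    d = pos.shape[-1]
    # reshape the 1D sequence back onto its 2D patch grid, interpolate,
    # then flatten back to a sequence; the [CLS] PE is kept unchanged
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid ** 2, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

pe_224 = torch.randn(1, 197, 768)          # 14x14 grid + [CLS]
pe_384 = resize_pos_embed(pe_224, 14, 24)  # 384/16 = 24 -> (1, 577, 768)
```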
Diagrams
- ViT block: patchify image → linear project → +[CLS] → +PE → L × (LN, MHA, +res, LN, FFN, +res) → CLS → MLP → class.
- Swin shifted-window: window grid in layer 1 vs shifted grid in layer 2; same token now sits in the interior of a different window.
- Attention distance per ViT head: early heads include both local (small distance) and global (large distance) heads, unlike CNNs.
Edge cases
- ViT on small datasets underperforms CNNs — no built-in inductive biases.
- Fine-tuning at higher resolution requires PE interpolation; otherwise the model sees an unfamiliar sequence length.
- Token redundancy is high — many patches contribute little; token pruning (DynamicViT) drops 30–50% of tokens with minimal accuracy loss.
Common mistakes
- Saying patch embedding is 'just flattening' — it's a learned linear projection.
- Forgetting the [CLS] token in parameter accounting.
- Stating ViT has 'no positional information' — it has learned 1D PEs.
- Treating Swin's 'cyclic shift' as a separate mechanism from shifted windows — the cyclic shift is just the efficient implementation of the shifted-window partition, applied every other layer.
Shortcuts
- ViT-B/16 ≈ 86 M params; ViT-L ≈ 307 M; ViT-H ≈ 632 M.
- PreNorm everywhere in modern ViT (and the original).
- Swin gives O(n) per layer + global receptive field across layers via shifting.