Computer Vision

CSE471

Prof. Makarand Tapaswi + Prof. Charu Sharma•Spring 2025-26•4 credits

Unit 11 — Vision Transformers (ViT)

Image-as-tokens: patchify, linearly project each patch, prepend a [CLS] token, add positional embeddings, run a Transformer encoder, and classify from the [CLS] output. ViT scales well with data and compute but needs very large pre-training sets; Swin restricts attention to local windows for efficiency.

ViT Pipeline, Scaling, and Swin

For nine years, from AlexNet (2012) to 2021, essentially every state-of-the-art vision model was a CNN. The inductive biases of convolutions, locality and translation equivariance, were treated as truths about how vision must work. Then Google's *"An Image is Worth 16×16 Words"* (Dosovitskiy et al., ICLR 2021) showed that a plain Transformer, the same architecture used to translate English to French, matched or beat the best CNNs on ImageNet, *provided it was pre-trained on enough data* (ImageNet-21k or JFT-300M rather than ImageNet-1k alone). The recipe is literally in the title: cut the image into patches of $16 \times 16$ pixels, treat each patch as a token, and feed the resulting sequence to a standard Transformer encoder.
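To make the "patches as tokens" recipe concrete, here is a minimal NumPy sketch of the input side only. It is an illustration under stated assumptions, not the paper's implementation: the helper name `patchify`, the variables `w_proj`, `cls`, and `pos`, the $224 \times 224$ RGB input, and the embedding width `d_model = 768` are all chosen for this example, and random values stand in for parameters that would be learned in a real model.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row starting from the top-left patch.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # stand-in for a real image

tokens = patchify(image)  # 14 x 14 = 196 patches, each 16 * 16 * 3 = 768 values

# Assumed sizes and random stand-ins for learned parameters.
d_model = 768
w_proj = (rng.standard_normal((tokens.shape[1], d_model)) * 0.02).astype(np.float32)
cls = np.zeros((1, d_model), dtype=np.float32)
pos = (rng.standard_normal((tokens.shape[0] + 1, d_model)) * 0.02).astype(np.float32)

embedded = tokens @ w_proj                                # linear patch projection
sequence = np.concatenate([cls, embedded], axis=0) + pos  # prepend [CLS], add positions

print(tokens.shape, sequence.shape)  # (196, 768) (197, 768)
```

The array `sequence` is what a standard Transformer encoder would consume: 196 patch tokens plus one [CLS] token. Halving the patch size to 8 would quadruple the token count to 784, which is why patch size has such a large effect on attention cost.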