Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Unit 7 — Vision Transformers (ViT)
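Images as tokens: patchify, linearly project each patch, prepend a [CLS] token, add positional embeddings, run a Transformer encoder, and classify from the [CLS] output. ViT scales well with model and data size but needs large-scale pretraining data to compete with CNNs; Swin restricts attention to local shifted windows for efficiency.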
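The tokenisation steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`patchify`, `linear`, `embed_tokens`), the toy 8 x 8 image, the patch size of 4, and the embedding width of 16 are all hypothetical, and the weights are random stand-ins for parameters that would normally be learned. The Transformer encoder and the classification head are left out.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the C channel values
            patches.append(flat)
    return patches

def linear(vec, weight):
    """Multiply a length-n vector by an n x d weight matrix (no bias)."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]

def embed_tokens(image, patch, dim, seed=0):
    """Patchify, project to `dim`, prepend [CLS], add positional embeddings."""
    rng = random.Random(seed)
    patches = patchify(image, patch)
    patch_len = len(patches[0])
    # Randomly initialised stand-ins for parameters that would be learned.
    weight = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(patch_len)]
    cls_token = [rng.gauss(0, 0.02) for _ in range(dim)]
    pos_embed = [[rng.gauss(0, 0.02) for _ in range(dim)]
                 for _ in range(len(patches) + 1)]
    # Sequence is [CLS] followed by one projected token per patch.
    tokens = [cls_token] + [linear(p, weight) for p in patches]
    # Positional embeddings are added element-wise, one per sequence position.
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

# Toy 8 x 8 RGB image with 4 x 4 patches: 4 patches, each of length 4*4*3 = 48.
image = [[[float(r + c + ch) for ch in range(3)] for c in range(8)] for r in range(8)]
tokens = embed_tokens(image, patch=4, dim=16)
print(len(tokens), len(tokens[0]))  # 5 16
```

The sequence length is the number of patches plus one for [CLS], so halving the patch size quadruples the token count. Global self-attention cost grows quadratically with that count, which is the pressure that windowed attention in Swin is meant to relieve.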