Computer Vision
CSE471 • Prof. Makarand Tapaswi + Prof. Charu Sharma • Spring 2025-26 • 4 credits
Unit 6 — Attention & Transformers
Why attention beats RNN bottlenecks, scaled dot-product attention with the √dₖ rationale, multi-head attention, encoder/decoder masking, positional encodings, and the Show-Attend-and-Tell precursor.
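As a concrete reference for the unit's core formula, Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V, here is a minimal NumPy sketch of scaled dot-product attention with an optional decoder-style causal mask. The function names, shapes, and toy data are illustrative assumptions, not course-provided code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional (n_q, n_k) boolean array; True = attention blocked.
    """
    d_k = Q.shape[-1]
    # The sqrt(d_k) rationale: if entries of Q and K are roughly unit-variance,
    # each dot product has variance d_k, so dividing by sqrt(d_k) keeps the
    # logits at unit variance and stops softmax from saturating for large d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # masked logits get ~0 weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy usage (hypothetical sizes): causal mask over a length-4 sequence,
# as a decoder would use to hide future positions.
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
out, w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

Multi-head attention repeats this computation h times on separately projected Q, K, V and concatenates the per-head outputs; the sketch above covers only a single head.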