Revision Notes/Unit 6 — Attention & Transformers/Attention Mechanism & Transformer Architecture/Story

Attention Mechanism & Transformer Architecture

Unit 6 — Attention & Transformers

The Whisper of Attention

Before 2014, translating English to French in a neural net worked like this: feed the English sentence to an RNN. At the end, take the final hidden state — *one vector* — and use it as the seed for another RNN that generates French token by token.

This was machine translation in 2014. It was bad. The reason was simple: one vector cannot hold an entire sentence's worth of information. "The cat that sat on the mat" and "The cat sat on the mat" must compress to the same 512-dimensional vector? Impossible.

Three papers fixed this, each more radical than the last:

1. Bahdanau et al., ICLR 2015 — *Neural Machine Translation by Jointly Learning to Align and Translate.* Invented attention as an addition to RNNs. 2. Xu et al., ICML 2015 — *Show, Attend and Tell.* Took Bahdanau's attention and applied it to images. 3. Vaswani et al., NeurIPS 2017 — *Attention Is All You Need.* Killed the RNN entirely. The whole model is just attention. This is the Transformer.

Today's vision models (ViT, Swin, SigLIP), today's LLMs (GPT, LLaMA, Gemma), today's multimodal systems (PaliGemma, Qwen2-VL) — all descendants of paper #3. So this lecture is the origin.

Part 1 — Seq2Seq and the bottleneck

An RNN processes a sequence one token at a time, maintaining a hidden state: $h_{t} = f (h_{t - 1}, x_{t})$ where $f$ is some learnable cell (vanilla RNN, LSTM, GRU). LSTMs and GRUs add gating so the hidden state can selectively remember or forget across long ranges.

You should know three Seq2Seq task types: *image captioning* (image → text — single input, sequence output); *sentiment classification* (text → label — sequence input, single output); *machine translation* (text → text — sequence to sequence).

The encoder reads the source sentence and produces a final hidden state. The decoder uses that hidden state as its initial state and generates tokens autoregressively — at each step, take the previously-generated token, produce a probability distribution over the vocabulary, sample/argmax, repeat.

This is a conditional language model $p (y_{1}, \dots, y_{M} ∣ x_{1}, \dots, x_{N})$ factored as $\prod_{t} p (y_{t} ∣ y_{< t}, x)$ .

Teacher vs student forcing

When training the decoder, what do you feed as the input at step $t$ ?

Teacher forcing — feed the *ground-truth* previous word $y_{t - 1}^{⋆}$ . Fast, stable, but creates *exposure bias* (at inference, the model only sees its own predictions, which may differ from training distribution).
Student forcing (scheduled sampling) — feed the decoder's own *predicted* previous word $\overset{y}{^}_{t - 1}$ . More realistic to test time, but training becomes harder because early mistakes propagate.
At inference: always student forcing. There's no ground truth.

Greedy vs beam search

Greedy — at each step, pick the argmax token. Fast, but locally myopic.
Beam search — keep the top $B$ partial sequences. Expand each, score each extension, keep the top $B$ again. After $T$ steps, return the highest-scoring complete sequence. Typical $B = 4$ or $5$ .

Softmax temperature

p_{i} = \frac{exp ( z _{i} / T )}{\sum _{j} exp ( z _{j} / T )}

$T = 1$ normal; $T ≫ 1$ flat (exploratory); $T ≪ 1$ peaky (deterministic). You met this idea twice already — DINO's teacher sharpening (small $τ_{t}$ ) and InfoNCE's temperature.

The bottleneck

The decoder's first hidden state is $h_{0} = h_{final}^{enc}$ — one fixed vector encoding the entire source sentence. As sentences get longer, performance collapses. Information is squeezed through a too-narrow pipe.

This is what attention fixed.

Part 2 — Bahdanau attention

The idea: the decoder shouldn't summarise the entire source into one vector. At each decoding step, it should *look back* over the encoder's entire hidden-state history and pull out just what it needs.

Concretely, at decoder step $t$ , take all encoder hidden states $(h_{1}^{enc}, \dots, h_{N}^{enc})$ . Use the decoder's current state $h_{t - 1}^{dec}$ to compute a *weighted average* over them — with weights focused on the encoder positions most relevant to the current decoder step.

The math, four steps:

1. Alignment scores — one scalar per encoder position:

e_{t, i} = v^{⊤} tanh (W_{dec} h_{t - 1}^{dec} + W_{enc} h_{i}^{enc})

(Bahdanau's additive MLP score; other choices — dot product, scaled dot product — came later.)

2. Softmax over alignment scores:

α_{t, i} = \frac{exp ( e _{t, i} )}{\sum _{j} exp ( e _{t, j} )}

These are the attention weights, summing to 1 over $i$ .

3. Context vector — weighted average of encoder states:

c_{t} = i \sum α_{t, i} h_{i}^{enc}

4. Combine and predict: $h_{t}^{dec} = RNN ([y_{t - 1}; c_{t}], h_{t - 1}^{dec})$ and $p (y_{t}) = softmax (W_{out} h_{t}^{dec})$ .

The decoder now has access to the entire encoder history, focused on whichever positions matter most for the current word.

Why it's called "attention"

Because $α_{t, i}$ literally *attends to* encoder position $i$ at decoder step $t$ . The model learns where to look.

Attention learns alignment as a byproduct

The beautiful empirical finding: when you visualise the $α_{t, i}$ matrix for a translated sentence, you see a near-diagonal alignment — when generating the French word for "cat", attention peaks on the encoder hidden state for "cat". Bahdanau's paper showed this with heatmaps that became iconic. No alignment supervision is given, yet the weights learn to spike on the source positions most relevant to each target position.

Soft vs hard attention

Soft attention — $α_{t, i}$ is a continuous distribution; $c_{t}$ is a weighted average. Differentiable.
Hard attention — sample one encoder position discretely (or argmax). Not differentiable; needs REINFORCE. Faster at inference but harder to train.

Image-captioning's *Show, Attend and Tell* compared both. Soft won and became standard.

Part 3 — Attention Is All You Need

By 2017, attention was a successful add-on to RNNs. Vaswani et al. asked the radical question: do we need the RNN at all?

The recurrence in RNNs has a real cost: it serialises computation. You cannot process step 5 until step 4 finishes. This makes RNNs slow to train and impossible to parallelise across sequence length.

If attention already lets the decoder look at any encoder position, why not let *every* encoder position look at every *other* encoder position, in parallel, with no recurrence at all? That's the Transformer.

Overall architecture

Encoder (stack of $N$ identical blocks): self-attention → feedforward MLP.

Decoder (stack of $N$ identical blocks): masked self-attention → cross-attention → feedforward MLP.

Output: softmax over vocabulary at each decoder position.

Original paper: $N = 6$ encoder blocks, $N = 6$ decoder blocks. Modern variants use 12 to 96+.

Self-attention — the engine

For input $X \in R^{n \times d}$ , project to Queries, Keys, Values:

Q = X W_{Q}, K = X W_{K}, V = X W_{V}

Three separate learnable matrices — each token gets a "query" role (what am I looking for?), a "key" role (what can I offer?), and a "value" role (what content do I carry?).

Then scaled dot-product attention:

Attn (Q, K, V) = softmax (\frac{Q K ^{⊤}}{d _{k}}) V

Read each piece. $Q K^{⊤}$ is $n \times n$ — every token's query dot-producted with every token's key (pairwise similarities). Divide by $d_{k}$ . Why? Without this, when $d_{k}$ is large, dot products grow large (sum of $d_{k}$ products of unit-variance terms scales as $d_{k}$ ). The softmax saturates — one logit dominates and gradients vanish. Scaling by $d_{k}$ keeps the variance of dot products approximately constant regardless of $d_{k}$ . Exam-question gold: "why scaled?" → "to prevent softmax saturation." Softmax along the last axis gives, for each query, a probability distribution over keys. Multiply by $V$ — each query gets a weighted average of values.

Why it replaces RNNs

In an RNN, position 100 depends on position 1 through 99 sequential applications of the cell. In self-attention, position 100 attends to position 1 in one step. Distance becomes $O (1)$ instead of $O (n)$ . The cost is computing every pair — $O (n^{2})$ per layer (the bill Flash Attention pays in a later unit).

Multi-head attention

Instead of one big attention with $d_{k} = d$ , run $h$ parallel heads, each with $d_{k} = d / h$ :

MultiHead (X) = Concat (head_{1}, \dots, head_{h}) W_{O}

$W_{O} \in R^{d \times d}$ is a final projection. Intuition: each head can specialise — one for syntactic dependencies, another for coreference, another for local patterns. Concatenation and projection recombine these views.

Critical detail (exam-favourite): *the number of heads doesn't change the total parameter count.* Whether you have 1 head with $d_{k} = 768$ or 12 heads with $d_{k} = 64$ , $W_{Q}, W_{K}, W_{V}$ together always use $3 d^{2}$ parameters. Heads just partition the per-head dimension.

Masked self-attention

The decoder is autoregressive — generating token $t$ cannot peek at $t + 1, t + 2, \dots$ . Enforce this with a causal mask:

M_{ij} = {0 - \infty j \leq i j > i, Attn = softmax (\frac{Q K ^{⊤} + M}{d _{k}}) V

After softmax, $- \infty$ entries become 0 — token $i$ can only attend to tokens $\leq i$ . Synonyms for the mask: *autoregressive, look-ahead, left-to-right.*

Cross-attention

The connection between encoder and decoder. Q from the decoder; K, V from the encoder output.

Attn (Q_{dec}, K_{enc}, V_{enc}) = softmax (\frac{Q _{dec} K _{enc}^{⊤}}{d _{k}}) V_{enc}

This is exactly Bahdanau attention generalised to Q-K-V form. Memorise the connection.

Positional encoding — restoring order

Self-attention is permutation-equivariant: shuffle the input tokens and the output tokens shuffle the same way. Without help, "the cat sat on the mat" and "the mat sat on the cat" produce equivalent outputs.

Vaswani's fix — sinusoidal positional encoding added to the input embeddings:

PE (pos, 2 i) = sin (\frac{pos}{1000 0 ^{2 i / d}}), PE (pos, 2 i + 1) = cos (\frac{pos}{1000 0 ^{2 i / d}})

Each position gets a unique $d$ -dim vector built from sines/cosines at exponentially decreasing frequencies. Why sinusoids? $PE (pos + k)$ is a linear function of $PE (pos)$ for any fixed offset $k$ — so a linear layer can recover relative position.

Variants you'll meet later: learned absolute PEs (BERT, ViT); RoPE (rotary, multiplicative on $Q$ and $K$ ); 2D RoPE / M-RoPE for images and video.

Practical Transformers

The "Practical Transformers" slide lists three points: batch (parallel sequences); padding (pad to longest with $[PAD]$ ); masking (set attention scores at padding positions to $- \infty$ ). For decoders, two masks combine: the padding mask + the causal mask.

Why the Transformer won

Parallelisation — self-attention computes all interactions simultaneously. On GPUs, Transformer training is dramatically faster than an equally-sized RNN.
Long-range modelling — every token sees every other token in one layer. No long-distance information decay.
Universality — the same architecture handles text, images (ViT), audio, video, even DNA. Just change the tokenisation.

Within four years (2017 → 2021), Transformers had taken over NLP, vision, speech, and multimodal everything.

What you carry into the exam

The three landmark papers and what each contributed. Three Seq2Seq task types. The encoder-decoder paradigm and its bottleneck. Bahdanau's four-step recipe and why it learns alignment as a byproduct. Soft vs hard attention. Teacher vs student forcing — and that inference is always student-forcing. Greedy vs beam search; softmax temperature directions. The Transformer's three flavours of attention (encoder self, decoder masked self, cross). Scaled dot-product math with the $d_{k}$ rationale. Multi-head's parameter invariance under $h$ . Causal mask synonyms. Sinusoidal PE and why it expresses relative position linearly. The three reasons the Transformer won.

Every Transformer-based model in every story we've covered or will cover — ViT, SigLIP, Qwen2-VL, CLIP's text encoder, DINO's vision encoder — is built on the math we just walked through.

Computer Vision