What Changes
Step 3 gave the model its biggest upgrade: attention. The model can now look back at previous tokens, decide which ones matter, and use that context to make predictions. That’s the core idea behind every GPT.
But our attention mechanism has two limitations:
- **One attention head.** The model can only attend in one way. It might learn to focus on recent characters, but then it can't simultaneously track vowel patterns or word boundaries.
- **One layer.** The attention + MLP block runs once. Deeper models stack multiple layers, each refining the representation further.
Step 4 fixes both. The changes are structural — the same building blocks, arranged more powerfully.
## What stays the same
- Dataset, tokenizer, vocabulary (27 tokens)
- Autograd (`Value` class, `loss.backward()`)
- SGD optimizer with linear learning rate decay
- All building blocks: `linear`, `softmax`, `rmsnorm`
- Attention mechanics (Q, K, V, scores, weighted sum)
- MLP block (linear → ReLU → linear)
- Residual connections around each block
- `lm_head` output projection
## What’s new
- Multi-head attention (`n_head = 4`) — four independent attention heads, each operating on a 4-dimensional slice
- Configurable layers (`n_layer`) — the attention + MLP block wraps in a loop with per-layer parameters
- Per-layer KV cache — each layer maintains its own key/value history
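To make the head split concrete, here is a minimal sketch in pure Python (variable names are illustrative, not taken from the source code) of how a 16-dimensional embedding divides into four independent 4-dimensional slices and is concatenated back afterward:

```python
# Hypothetical sketch of the multi-head split; names are illustrative.
n_embd, n_head = 16, 4
head_dim = n_embd // n_head  # each head works on a 4-dim slice

x = [float(i) for i in range(n_embd)]  # a single token's embedding

# split: head h sees dimensions [h*head_dim, (h+1)*head_dim)
heads = [x[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

# ... each head would compute its own Q, K, V and attention weights here ...

# merge: concatenating the per-head outputs restores a 16-dim vector
merged = [v for head in heads for v in head]
```

Because each head attends over its own slice, the four heads can learn four different attention patterns at no extra cost in embedding width.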
## The architecture
Step 3’s model ran the attention + MLP block exactly once. Step 4 wraps that block in a repeatable layer, so the same structure can be stacked `n_layer` times.
The function signature stays the same — `gpt(token_id, pos_id, keys, values)` — because the changes are inside the model, not at its interface.
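A rough sketch of how that unchanged signature now drives a loop over layers, each with its own KV cache. The `attention` and `mlp` bodies below are stand-ins, not the real implementations; the point is the per-layer loop and per-layer cache:

```python
# Illustrative sketch only: attention and the MLP are numeric stand-ins.
n_layer, n_embd = 2, 16

def attention(x, layer_keys, layer_values):
    # stand-in: append to this layer's cache, return mean of cached values
    layer_keys.append(x)
    layer_values.append(x)
    return [sum(col) / len(layer_values) for col in zip(*layer_values)]

def mlp(x):
    # stand-in for linear -> ReLU -> linear
    return [max(0.0, v) for v in x]

def gpt(token_id, pos_id, keys, values):
    x = [0.01 * token_id] * n_embd   # stand-in for the embedding lookup
    for layer in range(n_layer):     # the Step 4 change: loop over layers
        x = [a + b for a, b in zip(x, attention(x, keys[layer], values[layer]))]
        x = [a + b for a, b in zip(x, mlp(x))]          # residual connections
    return x                         # stand-in for the lm_head projection

# per-layer KV caches: each layer keeps its own key/value history
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]
logits = gpt(token_id=5, pos_id=0, keys=keys, values=values)
```

Note that the caller still just passes `keys` and `values`; only their shape changed, from one history to one history per layer.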
## New hyperparameters
With `n_layer = 1`, the model has the same depth as Step 3 — we’re not making it deeper yet. But the architecture now supports depth. And `n_head = 4` splits the 16-dimensional embedding into four 4-dimensional heads, each attending independently.
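Collected in one place, the configuration implied by the text (variable names assumed to match the earlier steps):

```python
# Assumed Step 4 hyperparameters, per the values stated above
n_embd  = 16   # embedding dimension (unchanged from Step 3)
n_head  = 4    # new: number of attention heads
n_layer = 1    # new: number of stacked attention + MLP layers

head_dim = n_embd // n_head  # 4 dimensions per head
```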
After this step, the model is structurally a complete GPT. The only remaining difference from the final version is the optimizer — Step 5 replaces SGD with Adam.