What Changes
Step 3 gave the model its biggest upgrade: attention. The model can now look back at previous tokens, decide which ones matter, and use that context to make predictions. That’s the core idea behind every GPT.
But our attention mechanism has two limitations:
- **One attention head.** The model can only attend in one way. It might learn to focus on recent characters, but then it can't simultaneously track vowel patterns or word boundaries.
- **One layer.** The attention + MLP block runs once. Deeper models stack multiple layers, each refining the representation further.
Step 4 fixes both. The changes are structural — the same building blocks, arranged more powerfully.
## What stays the same
- Dataset, tokenizer, vocabulary (27 tokens)
- Autograd (`Value` class, `loss.backward()`)
- SGD optimizer with linear learning rate decay
- All building blocks: `linear`, `softmax`, `rmsnorm`
- Attention mechanics (Q, K, V, scores, weighted sum)
- MLP block (linear → ReLU → linear)
- Residual connections around each block
- `lm_head` output projection
## What’s new
- Multi-head attention (`n_head = 4`) — four independent attention heads, each operating on a 4-dimensional slice
- Configurable layers (`n_layer`) — the attention + MLP block wraps in a loop with per-layer parameters
- Per-layer KV cache — each layer maintains its own key/value history
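To make the head split concrete, here is a minimal sketch in pure Python (variable names are illustrative, not taken from the source code) of how a 16-dimensional embedding divides into four independent 4-dimensional slices and is concatenated back afterward:

```python
# Hypothetical sketch of the multi-head split; names are illustrative.
n_embd, n_head = 16, 4
head_dim = n_embd // n_head  # each head works on a 4-dim slice

x = [float(i) for i in range(n_embd)]  # a single token's embedding

# split: head h sees dimensions [h*head_dim, (h+1)*head_dim)
heads = [x[h * head_dim:(h + 1) * head_dim] for h in range(n_head)]

# ... each head would compute its own Q, K, V and attention weights here ...

# merge: concatenating the per-head outputs restores a 16-dim vector
merged = [v for head in heads for v in head]
```

Because each head attends over its own slice, the four heads can learn four different attention patterns at no extra cost in embedding width.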
## The architecture
Step 3’s model ran the attention + MLP block exactly once. Step 4 wraps that block in a repeatable layer, so the same structure can be stacked `n_layer` times.
The function signature stays the same — `gpt(token_id, pos_id, keys, values)` — because the changes are inside the model, not at its interface.
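A rough sketch of how that unchanged signature now drives a loop over layers, each with its own KV cache. The `attention` and `mlp` bodies below are stand-ins, not the real implementations; the point is the per-layer loop and per-layer cache:

```python
# Illustrative sketch only: attention and the MLP are numeric stand-ins.
n_layer, n_embd = 2, 16

def attention(x, layer_keys, layer_values):
    # stand-in: append to this layer's cache, return mean of cached values
    layer_keys.append(x)
    layer_values.append(x)
    return [sum(col) / len(layer_values) for col in zip(*layer_values)]

def mlp(x):
    # stand-in for linear -> ReLU -> linear
    return [max(0.0, v) for v in x]

def gpt(token_id, pos_id, keys, values):
    x = [0.01 * token_id] * n_embd   # stand-in for the embedding lookup
    for layer in range(n_layer):     # the Step 4 change: loop over layers
        x = [a + b for a, b in zip(x, attention(x, keys[layer], values[layer]))]
        x = [a + b for a, b in zip(x, mlp(x))]          # residual connections
    return x                         # stand-in for the lm_head projection

# per-layer KV caches: each layer keeps its own key/value history
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]
logits = gpt(token_id=5, pos_id=0, keys=keys, values=values)
```

Note that the caller still just passes `keys` and `values`; only their shape changed, from one history to one history per layer.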
## New hyperparameters
With `n_layer = 1`, the model has the same depth as Step 3 — we’re not making it deeper yet. But the architecture now supports depth. And `n_head = 4` splits the 16-dimensional embedding into four 4-dimensional heads, each attending independently.
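Collected in one place, the configuration implied by the text (variable names assumed to match the earlier steps):

```python
# Assumed Step 4 hyperparameters, per the values stated above
n_embd  = 16   # embedding dimension (unchanged from Step 3)
n_head  = 4    # new: number of attention heads
n_layer = 1    # new: number of stacked attention + MLP layers

head_dim = n_embd // n_head  # 4 dimensions per head
```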
After this step, the model is structurally a complete GPT. The only remaining difference from the final version is the optimizer — Step 5 replaces SGD with Adam.