Training and Results
Previously Defined
- Multi-head attention (4 heads × 4 dims)
- Configurable layer loop (n_layer = 1)
- Per-layer KV cache
No training loop changes
The training loop is identical to Step 3. Same hyperparameters:
- num_steps = 1000
- learning_rate = 0.1
Same forward pass, backward pass, and SGD update. The only difference is the initialisation of the KV cache (as we saw in 4.5), which now holds one keys/values pair per layer.
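A minimal sketch of what that per-layer initialisation looks like, assuming the cache is kept as plain Python lists (the variable names here are assumptions, not necessarily the author's exact code):

```python
n_layer = 1  # same setting as in this step

# One (keys, values) pair per layer; each starts empty and is appended to
# as tokens are processed during generation.
keys = [[] for _ in range(n_layer)]    # keys[l] collects K vectors for layer l
values = [[] for _ in range(n_layer)]  # values[l] collects V vectors for layer l
```

With `n_layer = 1` this is just a single empty cache wrapped in a list, but the same loop scales to any layer count.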
Parameter count
With n_layer = 1 and n_head = 4, the parameter count stays at 4,192 — the same as Step 3. Multi-head attention doesn’t add parameters; it rearranges how existing parameters are used. And the layer loop with n_layer = 1 produces the same number of weight matrices, just with layer0. prefixed names.
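One way to sanity-check a claim like this is to count scalars directly. A hedged sketch, assuming the parameters live in a name-to-matrix dict of nested lists (the `layer0.wq` name and the 4×4 shape below are illustrative, not the model's real entries):

```python
def count_params(params):
    # params: dict mapping a name like "layer0.wq" to a matrix (list of rows).
    # Total parameter count = sum of rows x cols over every matrix.
    return sum(len(m) * len(m[0]) for m in params.values())

# Toy illustration with a single hypothetical 4x4 matrix:
params = {"layer0.wq": [[0.0] * 4 for _ in range(4)]}
count_params(params)  # 16
```

Splitting a 16-dim head into 4 heads of 4 dims leaves every matrix shape unchanged, which is why the total stays at 4,192.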
Try it
Hover to compare the loss curves across all steps. Toggle each series on or off:
With n_layer = 1, Step 4’s loss curve is very similar to Step 3’s — the multi-head split redistributes computation without adding capacity. The real benefit of multi-head attention shows up when the model is larger and the heads can specialise on genuinely different patterns.
The full comparison
| | Step 2 | Step 3 | Step 4 |
|---|---|---|---|
| Architecture | MLP only | Attention + MLP | Multi-head attention + MLP |
| Attention | None | Single-head (16-dim) | 4 heads (4-dim each) |
| Layers | 1 (implicit) | 1 (implicit) | Configurable (n_layer) |
| Parameters | 3,184 | 4,192 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache | Multi-head attention, layer loop |
| Model function | mlp(token_id) | gpt(token_id, pos_id, keys, values) | |
What we built
The model is now a complete GPT transformer. The same architecture — multi-head attention interleaved with MLPs on a residual stream, with RMSNorm and position embeddings — scales from our 4,192-parameter toy model to systems with billions of parameters. The difference is just the numbers: wider embeddings, more heads, more layers, bigger vocabulary.
One thing remains: the optimizer. We’re still using SGD with linear learning rate decay — a simple strategy that works, but doesn’t adapt to each parameter’s gradient history. Step 5 replaces SGD with Adam, which adds momentum and per-parameter learning rates.
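For reference, the current strategy is simple enough to write in a few lines. A sketch of SGD with linear decay, using the hyperparameters from this step (the scalar `w`/`grad` stand-in below elides the real autograd machinery):

```python
num_steps = 1000
learning_rate = 0.1

def lr_at(step):
    # Linear decay: full rate at step 0, down to zero at the final step.
    return learning_rate * (1 - step / num_steps)

# One SGD update for a single scalar parameter (illustrative values):
w, grad = 0.5, 0.2
w -= lr_at(0) * grad
```

Note that every parameter sees the same learning rate at a given step, regardless of its gradient history — that is exactly the limitation Adam addresses.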