# Training and Results
## Previously Defined

- Position embeddings + RMSNorm + single-head attention + residuals
- `gpt(token_id, pos_id, keys, values)` replaces `mlp(token_id)`
- KV cache accumulates context across positions
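The KV-cache behavior can be sketched as a loop. The `gpt` body below is a stub that only records what gets cached, not the tutorial's real model; the token ids are made up:

```python
# Sketch of how the KV cache accumulates context during generation.
# `gpt(token_id, pos_id, keys, values)` matches this step's signature, but the
# body is a stub: a real model would compute a key/value vector per position,
# append them, and attend over everything cached so far.
def gpt(token_id, pos_id, keys, values):
    keys.append(("k", token_id, pos_id))
    values.append(("v", token_id, pos_id))
    return token_id  # stand-in for next-token logits

keys, values = [], []  # start empty; each grows by one entry per position
for pos_id, token_id in enumerate([5, 13, 13, 1]):  # hypothetical ids for "emma"
    logits = gpt(token_id, pos_id, keys, values)

print(len(keys))  # one cached key per position seen so far
```

By the end of the loop the cache holds one key and one value per position, which is exactly what lets later predictions look back at every earlier token.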
## Training loop changes
Two small differences from Step 2:
```python
num_steps = 1000
learning_rate = 0.1
```
The learning rate drops from 1.0 to 0.1 — the attention mechanism introduces more complex interactions between parameters, so smaller updates are needed to keep training stable.
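The update rule itself is unchanged plain SGD; only the step size shrinks. A minimal sketch, with made-up parameter and gradient values:

```python
# Plain SGD step: each parameter moves against its gradient, scaled by the
# learning rate. The parameter and gradient values here are illustrative.
learning_rate = 0.1  # was 1.0 in Step 2

params = [0.5, -0.3]
grads  = [0.2,  0.4]

params = [p - learning_rate * g for p, g in zip(params, grads)]
print(params)  # ≈ [0.48, -0.34]; with lr = 1.0 the step would be 10x larger
```

With `learning_rate = 1.0` the first parameter would jump from 0.5 to 0.3 in one step; at 0.1 it nudges to 0.48, which is what keeps the attention parameters from oscillating.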
And the sequence length is now capped:
`block_size = 16` limits how far back the model can attend. For our names dataset (most names are under 16 characters), this rarely matters — but it prevents the model from trying to handle sequences longer than its position embedding table.
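The cap amounts to cropping the input to the most recent `block_size` tokens, so position ids never run past the embedding table. A sketch (the helper name `crop_context` is illustrative, not from the tutorial):

```python
# Keep only the most recent block_size tokens, so every position id the model
# sees is within the position-embedding table.
block_size = 16

def crop_context(token_ids):
    return token_ids[-block_size:]

seq = list(range(20))          # a 20-token sequence, longer than block_size
window = crop_context(seq)
print(len(window), window[0])  # 16 tokens, starting from token 4
```

For a name shorter than 16 characters the slice is a no-op, which is why the cap rarely matters on this dataset.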
## Try it
Hover to compare the loss curves. Toggle each series on or off:
## The full comparison
| | Step 2 | Step 3 |
|---|---|---|
| Architecture | MLP only | Attention + MLP |
| Context | Current token only | All previous tokens |
| Parameters | 3,184 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache |
| Learning rate | 1.0 | 0.1 |
| Model function | `mlp(token_id)` | `gpt(token_id, pos_id, keys, values)` |
## What we gained
The model can now use context. When predicting what follows “em” in “emma”, it doesn’t just see “m” — it sees that “e” came first, that “m” is at position 2, and it can weight that information through learned attention patterns.
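That weighting step can be sketched with made-up numbers (this is not the tutorial's actual code, just the softmax-and-mix pattern it describes):

```python
# Toy single-head attention over the cached context "e", "m" when predicting
# the next character. Scores and value vectors are invented for illustration.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

scores = [0.5, 2.0]                # query-key scores against "e" and "m"
weights = softmax(scores)          # attention weights, sum to 1
values = [[1.0, 0.0], [0.0, 1.0]]  # cached value vectors for "e" and "m"

# The output is the attention-weighted sum of the value vectors: information
# from every previous position, mixed according to learned scores.
out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
```

Here the higher score on "m" means its value vector dominates the mix, but the "e" vector still contributes — nothing in the context is thrown away.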
This is the biggest conceptual leap in the tutorial. Steps 0–2 built progressively better models, but they all shared a fundamental limitation: each prediction was based on a single token. Step 3 breaks that barrier — the model can now look back at everything it’s seen so far and decide what matters.
We’re getting close to the full architecture. The remaining differences: single attention head (vs multiple) and single layer (vs stacked) — Step 4 adds those. Then Step 5 upgrades the optimizer from SGD to Adam.