# Training and Results
## Previously Defined

- Position embeddings + RMSNorm + single-head attention + residuals
- `gpt(token_id, pos_id, keys, values)` replaces `mlp(token_id)`
- KV cache accumulates context across positions
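The KV-cache behavior can be sketched as a loop. The `gpt` body below is a stub that only records what gets cached, not the tutorial's real model; the token ids are made up:

```python
# Sketch of how the KV cache accumulates context during generation.
# `gpt(token_id, pos_id, keys, values)` matches this step's signature, but the
# body is a stub: a real model would compute a key/value vector per position,
# append them, and attend over everything cached so far.
def gpt(token_id, pos_id, keys, values):
    keys.append(("k", token_id, pos_id))
    values.append(("v", token_id, pos_id))
    return token_id  # stand-in for next-token logits

keys, values = [], []  # start empty; each grows by one entry per position
for pos_id, token_id in enumerate([5, 13, 13, 1]):  # hypothetical ids for "emma"
    logits = gpt(token_id, pos_id, keys, values)

print(len(keys))  # one cached key per position seen so far
```

By the end of the loop the cache holds one key and one value per position, which is exactly what lets later predictions look back at every earlier token.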
## Training loop changes
Two small differences from Step 2:
```python
num_steps = 1000
learning_rate = 0.1
```
The learning rate drops from 1.0 to 0.1 — the attention mechanism introduces more complex interactions between parameters, so smaller updates are needed to keep training stable.
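The update rule itself is unchanged plain SGD; only the step size shrinks. A minimal sketch, with made-up parameter and gradient values:

```python
# Plain SGD step: each parameter moves against its gradient, scaled by the
# learning rate. The parameter and gradient values here are illustrative.
learning_rate = 0.1  # was 1.0 in Step 2

params = [0.5, -0.3]
grads  = [0.2,  0.4]

params = [p - learning_rate * g for p, g in zip(params, grads)]
print(params)  # ≈ [0.48, -0.34]; with lr = 1.0 the step would be 10x larger
```

With `learning_rate = 1.0` the first parameter would jump from 0.5 to 0.3 in one step; at 0.1 it nudges to 0.48, which is what keeps the attention parameters from oscillating.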
And the sequence length is now capped:
`block_size = 16` limits how far back the model can attend. For our names dataset (most names are under 16 characters), this rarely matters — but it prevents the model from trying to handle sequences longer than its position embedding table.
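The cap amounts to cropping the input to the most recent `block_size` tokens, so position ids never run past the embedding table. A sketch (the helper name `crop_context` is illustrative, not from the tutorial):

```python
# Keep only the most recent block_size tokens, so every position id the model
# sees is within the position-embedding table.
block_size = 16

def crop_context(token_ids):
    return token_ids[-block_size:]

seq = list(range(20))          # a 20-token sequence, longer than block_size
window = crop_context(seq)
print(len(window), window[0])  # 16 tokens, starting from token 4
```

For a name shorter than 16 characters the slice is a no-op, which is why the cap rarely matters on this dataset.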
## Try it
Hover to compare the loss curves. Toggle each series on or off:
## The full comparison
| | Step 2 | Step 3 |
|---|---|---|
| Architecture | MLP only | Attention + MLP |
| Context | Current token only | All previous tokens |
| Parameters | 3,184 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache |
| Learning rate | 1.0 | 0.1 |
| Model function | `mlp(token_id)` | `gpt(token_id, pos_id, keys, values)` |
## What we gained
The model can now use context. When predicting what follows “em” in “emma”, it doesn’t just see “m” — it sees that “e” came first, that “m” is at position 2, and it can weight that information through learned attention patterns.
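That weighting step can be sketched with made-up numbers (this is not the tutorial's actual code, just the softmax-and-mix pattern it describes):

```python
# Toy single-head attention over the cached context "e", "m" when predicting
# the next character. Scores and value vectors are invented for illustration.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

scores = [0.5, 2.0]                # query-key scores against "e" and "m"
weights = softmax(scores)          # attention weights, sum to 1
values = [[1.0, 0.0], [0.0, 1.0]]  # cached value vectors for "e" and "m"

# The output is the attention-weighted sum of the value vectors: information
# from every previous position, mixed according to learned scores.
out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
```

Here the higher score on "m" means its value vector dominates the mix, but the "e" vector still contributes — nothing in the context is thrown away.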
This is the biggest conceptual leap in the tutorial. Steps 0–2 built progressively better models, but they all shared a fundamental limitation: each prediction was based on a single token. Step 3 breaks that barrier — the model can now look back at everything it’s seen so far and decide what matters.
We’re getting close to the full architecture. The remaining differences: single attention head (vs multiple) and single layer (vs stacked) — Step 4 adds those. Then Step 5 upgrades the optimizer from SGD to Adam.