Training and Results
Previously Defined
- Multi-head attention (4 heads × 4 dims)
- Configurable layer loop (n_layer = 1)
- Per-layer KV cache
No training loop changes
The training loop is identical to Step 3. Same hyperparameters:
- num_steps = 1000
- learning_rate = 0.1
Same forward pass, backward pass, and SGD update. The only difference is the initialisation of the KV cache (as we saw in 4.5), which now holds one keys/values pair per layer.
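A minimal sketch of what that per-layer initialisation looks like, assuming the cache is kept as plain Python lists (the variable names here are assumptions, not necessarily the author's exact code):

```python
n_layer = 1  # same setting as in this step

# One (keys, values) pair per layer; each starts empty and is appended to
# as tokens are processed during generation.
keys = [[] for _ in range(n_layer)]    # keys[l] collects K vectors for layer l
values = [[] for _ in range(n_layer)]  # values[l] collects V vectors for layer l
```

With `n_layer = 1` this is just a single empty cache wrapped in a list, but the same loop scales to any layer count.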
Parameter count
With n_layer = 1 and n_head = 4, the parameter count stays at 4,192 — the same as Step 3. Multi-head attention doesn’t add parameters; it rearranges how existing parameters are used. And the layer loop with n_layer = 1 produces the same number of weight matrices, just with layer0. prefixed names.
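One way to sanity-check a claim like this is to count scalars directly. A hedged sketch, assuming the parameters live in a name-to-matrix dict of nested lists (the `layer0.wq` name and the 4×4 shape below are illustrative, not the model's real entries):

```python
def count_params(params):
    # params: dict mapping a name like "layer0.wq" to a matrix (list of rows).
    # Total parameter count = sum of rows x cols over every matrix.
    return sum(len(m) * len(m[0]) for m in params.values())

# Toy illustration with a single hypothetical 4x4 matrix:
params = {"layer0.wq": [[0.0] * 4 for _ in range(4)]}
count_params(params)  # 16
```

Splitting a 16-dim head into 4 heads of 4 dims leaves every matrix shape unchanged, which is why the total stays at 4,192.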
Try it
Hover to compare the loss curves across all steps. Toggle each series on or off:
With n_layer = 1, Step 4’s loss curve is very similar to Step 3’s — the multi-head split redistributes computation without adding capacity. The real benefit of multi-head attention shows up when the model is larger and the heads can specialise on genuinely different patterns.
The full comparison
| | Step 2 | Step 3 | Step 4 |
|---|---|---|---|
| Architecture | MLP only | Attention + MLP | Multi-head attention + MLP |
| Attention | None | Single-head (16-dim) | 4 heads (4-dim each) |
| Layers | 1 (implicit) | 1 (implicit) | Configurable (n_layer) |
| Parameters | 3,184 | 4,192 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache | Multi-head attention, layer loop |
| Model function | mlp(token_id) | gpt(token_id, pos_id, keys, values) | |
What we built
The model is now a complete GPT transformer. The same architecture — multi-head attention interleaved with MLPs on a residual stream, with RMSNorm and position embeddings — scales from our 4,192-parameter toy model to systems with billions of parameters. The difference is just the numbers: wider embeddings, more heads, more layers, bigger vocabulary.
One thing remains: the optimizer. We’re still using SGD with linear learning rate decay — a simple strategy that works, but doesn’t adapt to each parameter’s gradient history. Step 5 replaces SGD with Adam, which adds momentum and per-parameter learning rates.
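For reference, the current strategy is simple enough to write in a few lines. A sketch of SGD with linear decay, using the hyperparameters from this step (the scalar `w`/`grad` stand-in below elides the real autograd machinery):

```python
num_steps = 1000
learning_rate = 0.1

def lr_at(step):
    # Linear decay: full rate at step 0, down to zero at the final step.
    return learning_rate * (1 - step / num_steps)

# One SGD update for a single scalar parameter (illustrative values):
w, grad = 0.5, 0.2
w -= lr_at(0) * grad
```

Note that every parameter sees the same learning rate at a given step, regardless of its gradient history — that is exactly the limitation Adam addresses.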