MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 4: Transformer › 4.6

Training and Results

Previously Defined

  • Multi-head attention (4 heads × 4 dims)
  • Configurable layer loop (n_layer = 1)
  • Per-layer KV cache
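As a shape-level recap of how those pieces fit together, here is a minimal numpy sketch (illustrative names, not the guide's actual code): a 16-dim query is reshaped into 4 heads of 4 dims, and each head attends over its own slice of the cached keys and values before the heads are concatenated back together.

```python
import numpy as np

n_embd, n_head = 16, 4
head_dim = n_embd // n_head            # 4 dims per head

T = 5                                  # positions cached so far
rng = np.random.default_rng(0)
q = rng.standard_normal(n_embd)            # query for the current token
keys = rng.standard_normal((T, n_embd))    # per-layer KV cache: keys
values = rng.standard_normal((T, n_embd))  # per-layer KV cache: values

# Split the 16-dim vectors into 4 heads of 4 dims each.
qh = q.reshape(n_head, head_dim)           # (4, 4)
kh = keys.reshape(T, n_head, head_dim)     # (5, 4, 4)
vh = values.reshape(T, n_head, head_dim)   # (5, 4, 4)

out = np.empty_like(qh)
for h in range(n_head):
    scores = kh[:, h] @ qh[h] / np.sqrt(head_dim)  # (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                   # softmax over cached positions
    out[h] = w @ vh[:, h]                          # (4,)

y = out.reshape(n_embd)  # concatenate the 4 heads back into 16 dims
```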

No training loop changes

The training loop is identical to Step 3. Same hyperparameters:

num_steps = 1000
learning_rate = 0.1

Same forward pass, backward pass, and SGD update. The only difference is the initialisation of the KV cache, as we saw in 4.5.
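Concretely, each training document now starts with one empty key list and one empty value list per layer (a minimal sketch; the cache is then threaded through `gpt(token_id, pos_id, keys, values)` exactly as before):

```python
n_layer = 1

# One empty key/value list per layer, reset per training document.
# Note: [[]] * n_layer would alias a single inner list, so a
# comprehension is used instead.
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]
```

With `n_layer = 1` this is the same amount of state as Step 3's flat cache; the structure just generalises when `n_layer` grows.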

Parameter count

With n_layer = 1 and n_head = 4, the parameter count stays at 4,192 — the same as Step 3. Multi-head attention doesn’t add parameters; it rearranges how existing parameters are used. And the layer loop with n_layer = 1 produces the same number of weight matrices, just with layer0.-prefixed names.
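The 4,192 figure can be checked by hand. The shape assumptions below are my reading of the earlier steps (27-character vocabulary, 16-dim embeddings, context length 16, 4x MLP expansion, and no learnable norm or bias parameters); they are not restated on this page, but they do reproduce the total:

```python
# Hypothetical shape assumptions reconstructing Steps 1-3; treat these
# as illustrative, not numbers stated on this page.
vocab_size = 27          # e.g. 26 letters + newline
n_embd     = 16
block_size = 16          # context length -> position embedding rows
mlp_hidden = 4 * n_embd

params = {
    "token embedding":    vocab_size * n_embd,   # 432
    "position embedding": block_size * n_embd,   # 256
    "attn wq/wk/wv/wo":   4 * n_embd * n_embd,   # 1024 total
    "mlp fc":             n_embd * mlp_hidden,   # 1024
    "mlp proj":           mlp_hidden * n_embd,   # 1024
    "lm head":            n_embd * vocab_size,   # 432
}

total = sum(params.values())
print(total)  # 4192
```

Splitting wq/wk/wv/wo across 4 heads reshapes these same matrices rather than adding new ones, which is why the total matches Step 3.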

Try it

Hover to compare the loss curves across all steps. Toggle each series on or off.

With n_layer = 1, Step 4’s loss curve is very similar to Step 3’s — the multi-head split redistributes computation without adding capacity. The real benefit of multi-head attention shows when the model is larger and the heads can specialise on genuinely different patterns.

The full comparison

                Step 2         Step 3                       Step 4
Architecture    MLP only       Attention + MLP              Multi-head attention + MLP
Attention       None           Single-head (16-dim)         4 heads (4-dim each)
Layers          1 (implicit)   1 (implicit)                 Configurable (n_layer)
Parameters      3,184          4,192                        4,192
New concepts    Autograd       Attention, position          Multi-head attention,
                               embeddings, RMSNorm,         layer loop
                               residuals, KV cache
Model function  mlp(token_id)  gpt(token_id, pos_id,        gpt(token_id, pos_id,
                               keys, values)                keys, values)

What we built

The model is now a complete GPT transformer. The same architecture — multi-head attention interleaved with MLPs on a residual stream, with RMSNorm and position embeddings — scales from our 4,192-parameter toy model to systems with billions of parameters. The difference is just the numbers: wider embeddings, more heads, more layers, bigger vocabulary.

One thing remains: the optimizer. We’re still using SGD with linear learning rate decay — a simple strategy that works, but doesn’t adapt to each parameter’s gradient history. Step 5 replaces SGD with Adam, which adds momentum and per-parameter learning rates.
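As a preview, the difference between the two update rules can be sketched in a few lines (a generic Adam update with the usual default hyperparameters, not Step 5's actual code):

```python
import math

def sgd_update(p, grad, lr):
    # The rule used through Step 4: move against the raw gradient.
    return p - lr * grad

def adam_update(p, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponential moving average of past gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Per-parameter scale: moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct the zero-initialised averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Because each step is divided by the root of that parameter's own squared-gradient average, parameters with consistently large gradients take smaller steps and vice versa — that is what "per-parameter learning rates" means here.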
