MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 3.7 The KV Cache · Step 4: Transformer →
Step 3: Attention › 3.8

Training and Results

Previously Defined

  • Position embeddings + RMSNorm + single-head attention + residuals
  • gpt(token_id, pos_id, keys, values) replaces mlp(token_id)
  • KV cache accumulates context across positions
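The pieces above can be sketched end to end. Here is a minimal, hypothetical inference loop, assuming `gpt()` returns next-token logits and appends the current position's key/value vectors to the cache (the exact return shape and helper details are assumptions, not the tutorial's code):

```python
def generate(gpt, start_id, num_tokens):
    keys, values = [], []   # KV cache: gpt() extends it by one entry per call
    token_id = start_id
    out = [token_id]
    for pos_id in range(num_tokens):
        logits = gpt(token_id, pos_id, keys, values)
        # Greedy decoding for simplicity; the tutorial may sample instead.
        token_id = max(range(len(logits)), key=lambda i: logits[i])
        out.append(token_id)
    return out
```

Note how the cache never resets inside the loop: each position's keys and values stay available to every later position.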

Training loop changes

Two small differences from Step 2:

num_steps = 1000
learning_rate = 0.1

The learning rate drops from 1.0 to 0.1 — the attention mechanism introduces more complex interactions between parameters, so smaller updates are needed to keep training stable.
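The instability a too-large step size causes can be seen on a toy problem (this is an illustration, not the tutorial's model): minimizing f(w) = w² with plain gradient descent.

```python
def sgd(lr, steps=20, w=1.0):
    # Plain gradient descent on f(w) = w**2, whose gradient is 2*w.
    for _ in range(steps):
        grad = 2 * w
        w -= lr * grad
    return w

# lr = 1.0: each step maps w -> w - 2w = -w, so the iterate just
# flips sign forever and never approaches the minimum.
print(abs(sgd(1.0)))   # stays at 1.0
# lr = 0.1: each step maps w -> 0.8*w, contracting toward 0.
print(abs(sgd(0.1)))   # ~0.012
```

The GPT's loss surface is far more complicated, but the same principle applies: when gradients interact more strongly, a smaller step keeps the updates from overshooting.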

And the sequence length is now capped:

block_size = 16 limits how far back the model can attend. For our names dataset (most names are under 16 characters), this rarely matters — but it prevents the model from trying to handle sequences longer than its position embedding table.
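One way the cap might be applied when building training examples is sketched below (the helper name and example format are assumptions, not the tutorial's code): positions simply never reach past `block_size`, so every `pos_id` has a row in the position embedding table.

```python
block_size = 16

def make_examples(token_ids):
    # (input token, target token, position) triples, truncated at block_size.
    n = min(len(token_ids) - 1, block_size)
    return [(token_ids[i], token_ids[i + 1], i) for i in range(n)]

# A 5-token name fits entirely (4 examples); a 21-token sequence is capped.
print(len(make_examples(list(range(5)))))    # 4
print(len(make_examples(list(range(21)))))   # 16
```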

Try it

[Interactive loss-curve comparison: hover to inspect values, and toggle the Step 2 and Step 3 series on or off.]

The full comparison

|                | Step 2             | Step 3 |
|----------------|--------------------|--------|
| Architecture   | MLP only           | Attention + MLP |
| Context        | Current token only | All previous tokens |
| Parameters     | 3,184              | 4,192 |
| New concepts   | Autograd           | Attention, position embeddings, RMSNorm, residuals, KV cache |
| Learning rate  | 1.0                | 0.1 |
| Model function | mlp(token_id)      | gpt(token_id, pos_id, keys, values) |

What we gained

The model can now use context. When predicting what follows “em” in “emma”, it doesn’t just see “m” — it sees that “e” came first, that “m” is at position 2, and it can weight that information through learned attention patterns.
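That weighting is just a softmax over query-key scores. A toy illustration with made-up numbers, assuming the query at the current position has already been dotted against the cached keys for "e" (position 0) and "m" (position 1):

```python
import math

def softmax(xs):
    m = max(xs)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

scores = [0.5, 2.0]                     # query . key for "e", "m" (invented values)
weights = softmax(scores)
print([round(w, 2) for w in weights])   # -> [0.18, 0.82]
```

The weights always sum to 1, so "m" dominating the mix never fully drowns out "e": every earlier position contributes something to the prediction.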

This is the biggest conceptual leap in the tutorial. Steps 0–2 built progressively better models, but they all shared a fundamental limitation: each prediction was based on a single token. Step 3 breaks that barrier — the model can now look back at everything it’s seen so far and decide what matters.

We’re getting close to the full architecture. The remaining differences: single attention head (vs multiple) and single layer (vs stacked) — Step 4 adds those. Then Step 5 upgrades the optimizer from SGD to Adam.
