MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 4: Transformer

Multi-head attention and a configurable layer loop. The model is now a full GPT; only the optimizer remains to be upgraded.

4.1 What Changes
4.2 Multi-Head Attention: The Idea
4.3 Multi-Head Attention: The Code
4.4 The Layer Loop
4.5 KV Cache: Per-Layer
4.6 Training and Results

The big idea: One attention head sees one pattern. Multiple heads let the model attend to different things simultaneously — position, character type, recent context — like a committee of specialists.
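The committee-of-specialists idea can be sketched in plain Python: split each embedding vector into `n_head` slices, run scaled dot-product attention independently on each slice, then concatenate the head outputs. This is a hypothetical stdlib-only sketch (all names are illustrative); a real GPT also applies learned Q/K/V and output projections and a causal mask, which are omitted here:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    # Scaled dot-product attention over lists of vectors (T x d).
    # No causal mask here: every position attends to every position.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wi * vj[c] for wi, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, n_head):
    # Split each C-dim vector into n_head slices of size C // n_head,
    # attend within each slice independently, then concatenate.
    # Each head can settle on a different pattern because it sees
    # (and mixes) only its own slice of the representation.
    T, C = len(x), len(x[0])
    hd = C // n_head
    heads = []
    for h in range(n_head):
        sl = [row[h * hd:(h + 1) * hd] for row in x]
        heads.append(attention(sl, sl, sl))
    return [sum((heads[h][t] for h in range(n_head)), []) for t in range(T)]

x = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
y = multi_head_attention(x, n_head=2)  # shape stays T x C (2 x 4)
```

Because each head only mixes its own slice, the heads can specialize while the concatenated output keeps the model dimension unchanged.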
← Step 3: Attention | Step 5: Adam →