Transformer
Multi-head attention and a configurable layer loop. The model is now a full GPT — only the optimizer remains to be upgraded.
- 4.1 What Changes
- 4.2 Multi-Head Attention: The Idea
- 4.3 Multi-Head Attention: The Code
- 4.4 The Layer Loop
- 4.5 KV Cache: Per-Layer
- 4.6 Training and Results
The big idea: One attention head sees one pattern. Multiple heads let the model attend to different things simultaneously — position, character type, recent context — like a committee of specialists.
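To make the "committee of specialists" concrete, here is a minimal NumPy sketch of causal multi-head attention: the model dimension is split into `n_head` slices, each head runs scaled dot-product attention over its own slice, and the head outputs are concatenated and mixed by an output projection. The function name and weight-matrix arguments are illustrative, not the chapter's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_head):
    """Causal multi-head self-attention for a single sequence.

    x: (T, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); d_model must divide by n_head.
    """
    T, d_model = x.shape
    d_head = d_model // n_head
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # each (T, d_model)
    # Split the model dimension into heads: (n_head, T, d_head)
    q = q.reshape(T, n_head, d_head).transpose(1, 0, 2)
    k = k.reshape(T, n_head, d_head).transpose(1, 0, 2)
    v = v.reshape(T, n_head, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_head, T, T)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)           # block attention to the future
    att = softmax(scores, axis=-1)
    out = att @ v                                          # (n_head, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)       # concatenate heads
    return out @ Wo                                        # mix heads back together
```

Each head sees only a `d_head`-sized slice of the queries, keys, and values, so different heads are free to learn different attention patterns at no extra parameter cost compared with one full-width head.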