Transformer
Multi-head attention and a configurable layer loop. The model is now a full GPT — only the optimizer remains to be upgraded.
- 4.1 What Changes
- 4.2 Multi-Head Attention: The Idea
- 4.3 Multi-Head Attention: The Code
- 4.4 The Layer Loop
- 4.5 KV Cache: Per-Layer
- 4.6 Training and Results
The big idea: One attention head sees one pattern. Multiple heads let the model attend to different things simultaneously — position, character type, recent context — like a committee of specialists.
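To make the "committee of specialists" concrete, here is a minimal NumPy sketch of causal multi-head attention: the model dimension is split into `n_head` slices, each head runs scaled dot-product attention over its own slice, and the head outputs are concatenated and mixed by an output projection. The function name and weight-matrix arguments are illustrative, not the chapter's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_head):
    """Causal multi-head self-attention for a single sequence.

    x: (T, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); d_model must divide by n_head.
    """
    T, d_model = x.shape
    d_head = d_model // n_head
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # each (T, d_model)
    # Split the model dimension into heads: (n_head, T, d_head)
    q = q.reshape(T, n_head, d_head).transpose(1, 0, 2)
    k = k.reshape(T, n_head, d_head).transpose(1, 0, 2)
    v = v.reshape(T, n_head, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_head, T, T)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)           # block attention to the future
    att = softmax(scores, axis=-1)
    out = att @ v                                          # (n_head, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)       # concatenate heads
    return out @ Wo                                        # mix heads back together
```

Each head sees only a `d_head`-sized slice of the queries, keys, and values, so different heads are free to learn different attention patterns at no extra parameter cost compared with one full-width head.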