Attention
The model can now look back at previous tokens: position embeddings, single-head attention, RMSNorm, and residual connections give the MLP the context it was missing.
- 3.1 What Changes
- 3.2 Position Embeddings
- 3.3 RMSNorm
- 3.4 Attention: Q, K, V
- 3.5 Attention: Scores and Weights
- 3.6 Attention Output and Residuals
- 3.7 The KV Cache
- 3.8 Training and Results
The big idea: a token processed in isolation can only memorize fixed statistics about what tends to follow it. Attention lets each token decide which previous tokens matter for predicting what comes next.
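To make the big idea concrete before diving into the subsections, here is a minimal sketch of single-head causal attention in plain NumPy. The function name, toy dimensions, and random weights are illustrative assumptions, not this chapter's actual code; the point is the shape of the computation (3.4 through 3.6 build it up piece by piece).

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence of token vectors.

    x:          (T, d) matrix, one row per token
    Wq, Wk, Wv: (d, d_head) projection matrices
    """
    Q = x @ Wq  # queries: what each token is looking for
    K = x @ Wk  # keys: what each token offers
    V = x @ Wv  # values: what each token contributes
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)  # (T, T): how relevant each key is to each query
    # Causal mask: token t may only attend to tokens 0..t, never the future
    T = x.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Softmax over each row turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted mix of earlier tokens' values

# Tiny demo: 4 tokens, 8-dim embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Row t of the output is a weighted average of the value vectors for tokens 0 through t, which is exactly what "deciding which previous tokens matter" means in practice.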