Attention
The model can now look back at previous tokens: position embeddings, single-head attention, RMSNorm, and residual connections give the MLP the context it was missing.
- 3.1 What Changes
- 3.2 Position Embeddings
- 3.3 RMSNorm
- 3.4 Attention: Q, K, V
- 3.5 Attention: Scores and Weights
- 3.6 Attention Output and Residuals
- 3.7 The KV Cache
- 3.8 Training and Results
The big idea: a token processed in isolation can only memorize fixed statistics about what tends to follow it. Attention lets each token decide which previous tokens matter for predicting what comes next.
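To make the big idea concrete before diving into the subsections, here is a minimal sketch of single-head causal attention in plain NumPy. The function name, toy dimensions, and random weights are illustrative assumptions, not this chapter's actual code; the point is the shape of the computation (3.4 through 3.6 build it up piece by piece).

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence of token vectors.

    x:          (T, d) matrix, one row per token
    Wq, Wk, Wv: (d, d_head) projection matrices
    """
    Q = x @ Wq  # queries: what each token is looking for
    K = x @ Wk  # keys: what each token offers
    V = x @ Wv  # values: what each token contributes
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)  # (T, T): how relevant each key is to each query
    # Causal mask: token t may only attend to tokens 0..t, never the future
    T = x.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    # Softmax over each row turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted mix of earlier tokens' values

# Tiny demo: 4 tokens, 8-dim embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Row t of the output is a weighted average of the value vectors for tokens 0 through t, which is exactly what "deciding which previous tokens matter" means in practice.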