MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 3

Attention

The model can now look back at previous tokens. Position embeddings, single-head attention, RMSNorm, and residual connections give the MLP context.

  - 3.1 What Changes
  - 3.2 Position Embeddings
  - 3.3 RMSNorm
  - 3.4 Attention: Q, K, V
  - 3.5 Attention: Scores and Weights
  - 3.6 Attention Output and Residuals
  - 3.7 The KV Cache
  - 3.8 Training and Results

The big idea: A token processed in isolation can only memorize statistics. Attention lets each token decide which previous tokens matter for predicting what comes next.
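
The core mechanism can be sketched in a few lines. This is a minimal, illustrative version with NumPy, not the guide's actual implementation: the function names, the absence of a learned RMSNorm gain, and the random weights are all assumptions for the sake of a runnable example. Each token's query is compared against every previous token's key, the scores become softmax weights, and the output is a weighted mix of value vectors:

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    # RMSNorm: scale by root-mean-square; unlike LayerNorm, no mean
    # subtraction (learned gain omitted here for brevity)
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def single_head_attention(x, Wq, Wk, Wv):
    # x: (T, d) token vectors; Wq/Wk/Wv: (d, d) projection matrices
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)            # (T, T): each query vs. each key
    mask = np.triu(np.ones((T, T)), k=1)     # causal mask: no peeking ahead
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over past tokens
    return weights @ v                       # weighted mix of value vectors

rng = np.random.default_rng(0)
T, d = 4, 8
x = rms_norm(rng.normal(size=(T, d)))
out = single_head_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # each token gets a (d,)-vector of gathered context
```

In the full block, the attention output is added back to the input (`x = x + out`, the residual connection) before it reaches the MLP, so the MLP sees the token plus its gathered context.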
← Step 2: Autograd | Step 4: Transformer →