What changes
In Steps 1 and 2, the model processed each token independently. Given the token “m”, it predicted what comes next — but it had no idea whether “m” appeared at the start of a name, after “e”, or after “em”. Every position got the same prediction for the same token.
That’s a big limitation. In the name “emma”, predicting what follows the second “m” should depend on the “e” and first “m” that came before. Language is sequential — context matters.
Step 3 fixes this. The model can now look back at previous tokens and decide which ones are relevant. This mechanism is called attention, and it’s the core idea behind GPT and all modern large language models.
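The core of attention is small enough to sketch directly. Here is a minimal single-query version in plain Python (a simplification: there are no learned projections here, whereas the real model derives queries, keys, and values from the embeddings via linear layers):

```python
import math

def attention(query, keys, values):
    # Score the query against every visible key (dot product),
    # scaled by sqrt(d) so the variance stays stable as d grows.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted mix of the values.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With only one key in view, the output is exactly that key's value; with more context, it becomes a weighted blend, where the weights are decided by how well each cached key matches the query.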
What stays the same
- Dataset, tokenizer, vocabulary (27 tokens)
- Autograd (`Value` class, `loss.backward()`)
- SGD optimizer
- The MLP block (linear → ReLU → linear)
What’s new
- Position embeddings — the model knows where each token is in the sequence
- RMSNorm — normalization that keeps activations stable
- Single-head attention — each token can attend to all previous tokens
- Residual connections — skip connections around each block
- Separate output projection (`lm_head`) — decouples the output from the MLP
- KV cache — stores keys and values so previous tokens don’t need recomputation
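Of these pieces, RMSNorm is the simplest to show in isolation. A minimal scalar sketch (omitting the learned gain parameter that the full model would typically multiply in):

```python
import math

def rmsnorm(x, eps=1e-5):
    # Divide by the root-mean-square of x so activations keep a stable
    # scale. Unlike LayerNorm, there is no mean subtraction and no bias.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```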
The architecture
Step 2’s model was a straight pipeline:
Step 3 adds context:
New blocks are highlighted in green.
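In code terms, the new context path can be read as a pair of residual wraps. This is a hypothetical sketch, where `attn`, `mlp`, and `norm` stand in for the real sub-layers:

```python
def block(x, attn, mlp, norm):
    # Residual connections: each sub-layer's output is added back onto
    # its input, so information (and gradients) can flow straight through.
    x = [a + b for a, b in zip(x, attn(norm(x)))]  # attend over context
    x = [a + b for a, b in zip(x, mlp(norm(x)))]   # per-token MLP
    return x
```

Note the ordering: the input is normalized before each sub-layer, and the skip path around each sub-layer is left untouched, which is what keeps training stable as blocks stack up.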
The function signature tells the story: the model now takes a position (pos_id) and accumulated context (keys, values) — it’s no longer stateless.
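A toy version of that signature makes the statefulness concrete. Everything here is illustrative, not the tutorial’s code: two-dimensional embeddings, key = value = the embedded input, and no learned projections.

```python
import math

EMB = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # toy token embeddings (assumed)
POS = {0: [0.1, 0.0], 1: [0.0, 0.1]}   # toy position embeddings (assumed)

def forward(token_id, pos_id, keys, values):
    # Embed the token and add where it sits in the sequence.
    x = [t + p for t, p in zip(EMB[token_id], POS[pos_id])]
    # KV cache: append this step's key/value, so earlier tokens
    # never need recomputation on later steps.
    keys.append(x)
    values.append(x)
    # Scaled dot-product attention over everything cached so far.
    d = len(x)
    scores = [sum(a * b for a, b in zip(x, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi / total * v[i] for wi, v in zip(w, values))
            for i in range(d)]

keys, values = [], []
out0 = forward(0, 0, keys, values)  # first token: cache holds one entry
out1 = forward(1, 1, keys, values)  # second token: attends to both
```

After two calls the cache holds two keys and two values; the second output blends both positions’ values, weighted by how strongly the second token’s embedding matches each cached key.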