What changes
In Steps 1 and 2, the model processed each token independently. Given the token “m”, it predicted what comes next — but it had no idea whether “m” appeared at the start of a name, after “e”, or after “em”. Every position got the same prediction for the same token.
That’s a big limitation. In the name “emma”, predicting what follows the second “m” should depend on the “e” and first “m” that came before. Language is sequential — context matters.
Step 3 fixes this. The model can now look back at previous tokens and decide which ones are relevant. This mechanism is called attention, and it’s the core idea behind GPT and all modern large language models.
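The core of attention is small enough to sketch directly. Here is a minimal single-query version in plain Python (a simplification: there are no learned projections here, whereas the real model derives queries, keys, and values from the embeddings via linear layers):

```python
import math

def attention(query, keys, values):
    # Score the query against every visible key (dot product),
    # scaled by sqrt(d) so the variance stays stable as d grows.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted mix of the values.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With only one key in view, the output is exactly that key's value; with more context, it becomes a weighted blend, where the weights are decided by how well each cached key matches the query.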
What stays the same
- Dataset, tokenizer, vocabulary (27 tokens)
- Autograd (`Value` class, `loss.backward()`)
- SGD optimizer
- The MLP block (linear → ReLU → linear)
What’s new
- Position embeddings — the model knows where each token is in the sequence
- RMSNorm — normalization that keeps activations stable
- Single-head attention — each token can attend to all previous tokens
- Residual connections — skip connections around each block
- Separate output projection (`lm_head`) — decouples the output from the MLP
- KV cache — stores keys and values so previous tokens don’t need recomputation
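Of these pieces, RMSNorm is the simplest to show in isolation. A minimal scalar sketch (omitting the learned gain parameter that the full model would typically multiply in):

```python
import math

def rmsnorm(x, eps=1e-5):
    # Divide by the root-mean-square of x so activations keep a stable
    # scale. Unlike LayerNorm, there is no mean subtraction and no bias.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```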
The architecture
Step 2’s model was a straight pipeline:
Step 3 adds context:
New blocks are highlighted in green.
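In code terms, the new context path can be read as a pair of residual wraps. This is a hypothetical sketch, where `attn`, `mlp`, and `norm` stand in for the real sub-layers:

```python
def block(x, attn, mlp, norm):
    # Residual connections: each sub-layer's output is added back onto
    # its input, so information (and gradients) can flow straight through.
    x = [a + b for a, b in zip(x, attn(norm(x)))]  # attend over context
    x = [a + b for a, b in zip(x, mlp(norm(x)))]   # per-token MLP
    return x
```

Note the ordering: the input is normalized before each sub-layer, and the skip path around each sub-layer is left untouched, which is what keeps training stable as blocks stack up.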
The function signature tells the story: the model now takes a position (pos_id) and accumulated context (keys, values) — it’s no longer stateless.
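A toy version of that signature makes the statefulness concrete. Everything here is illustrative, not the tutorial’s code: two-dimensional embeddings, key = value = the embedded input, and no learned projections.

```python
import math

EMB = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # toy token embeddings (assumed)
POS = {0: [0.1, 0.0], 1: [0.0, 0.1]}   # toy position embeddings (assumed)

def forward(token_id, pos_id, keys, values):
    # Embed the token and add where it sits in the sequence.
    x = [t + p for t, p in zip(EMB[token_id], POS[pos_id])]
    # KV cache: append this step's key/value, so earlier tokens
    # never need recomputation on later steps.
    keys.append(x)
    values.append(x)
    # Scaled dot-product attention over everything cached so far.
    d = len(x)
    scores = [sum(a * b for a, b in zip(x, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi / total * v[i] for wi, v in zip(w, values))
            for i in range(d)]

keys, values = [], []
out0 = forward(0, 0, keys, values)  # first token: cache holds one entry
out1 = forward(1, 1, keys, values)  # second token: attends to both
```

After two calls the cache holds two keys and two values; the second output blends both positions’ values, weighted by how strongly the second token’s embedding matches each cached key.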