MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← Step 2: Autograd 3.2 Position Embeddings →
Step 3: Attention › 3.1

What Changes

In Steps 1 and 2, the model processed each token independently. Given the token “m”, it predicted what comes next — but it had no idea whether “m” appeared at the start of a name, after “e”, or after “em”. Every position got the same prediction for the same token.

That’s a big limitation. In the name “emma”, predicting what follows the second “m” should depend on the “e” and first “m” that came before. Language is sequential — context matters.
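To make the limitation concrete, here is a sketch (not the guide's actual code) of a context-free predictor: a lookup table mapping the current token to next-token probabilities. The vocabulary size and token indices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 27  # assumed: 26 letters plus an end-of-name token
# One row of next-token probabilities per current token.
table = rng.dirichlet(np.ones(vocab_size), size=vocab_size)

def predict(token_id):
    # The prediction is a pure function of the current token:
    # no position, no history.
    return table[token_id]

m = ord("m") - ord("a")  # assumed token index for "m"
first_m = predict(m)     # the "m" after "e"  in "emma"
second_m = predict(m)    # the "m" after "em" in "emma"
assert np.array_equal(first_m, second_m)  # identical, context ignored
```

No matter how good the table is, both "m"s in "emma" necessarily get the same answer.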

Step 3 fixes this. The model can now look back at previous tokens and decide which ones are relevant. This mechanism is called attention, and it’s the core idea behind GPT and all modern large language models.
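The mechanism can be sketched in a few lines of NumPy. This is a minimal single-head causal attention with random weights standing in for learned ones; the shapes and parameterization are assumptions, and the real model's dimensions may differ.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (T, d) embeddings for T tokens. Returns (T, d) context vectors."""
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[1])    # how relevant is token j to token i?
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                    # no peeking at future tokens
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # softmax over the visible past
    return w @ v, w                           # weighted mix of value vectors

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                   # 4 tokens, e.g. "e", "m", "m", "a"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = causal_attention(x, Wq, Wk, Wv)
assert np.allclose(w[0], [1, 0, 0, 0])        # position 0 can only see itself
assert np.allclose(out[0], (x @ Wv)[0])       # so its output is just its own value
```

Each row of `w` is a learned mix over earlier positions: this is precisely how the second "m" can weight the "e" and first "m" differently.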

What stays the same

The character-level setup carries over from the earlier steps: tokens go in as embeddings, a prediction for the next character comes out, and everything is trained with the autograd engine from Step 2.

What’s new

Between input and output sits an attention step, where each token mixes in information from the keys and values of the tokens before it.

The architecture

Step 2’s model was a straight pipeline:

Step 3 adds context:

New blocks are highlighted in green.

The function signature tells the story:

The model now takes a position (pos_id) and accumulated context (keys, values) — it’s no longer stateless.
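A sketch of what such a forward pass could look like for a single token. Apart from `pos_id`, `keys`, and `values` from the signature above, every name, size, and weight here is a hypothetical stand-in, and the position embedding is a placeholder for whatever Step 3.2 introduces.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, block_size = 27, 16, 16        # assumed sizes
tok_emb = rng.normal(size=(vocab_size, d))    # token embedding table
pos_emb = rng.normal(size=(block_size, d))    # position embedding table (see 3.2)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Wout = rng.normal(size=(d, vocab_size)) * 0.1 # projection to next-token logits

def forward(token_id, pos_id, keys, values):
    x = tok_emb[token_id] + pos_emb[pos_id]   # which token, and where it sits
    keys.append(x @ Wk)                       # cache this token's key...
    values.append(x @ Wv)                     # ...and value for later steps
    q = x @ Wq
    scores = np.stack(keys) @ q / np.sqrt(d)  # compare query to all cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # attention weights over the past
    context = w @ np.stack(values)            # blend the cached values
    return context @ Wout                     # logits for the next token

keys, values = [], []
for pos, tok in enumerate([4, 12, 12]):       # assumed ids for "e", "m", "m"
    logits = forward(tok, pos, keys, values)
assert len(keys) == 3                         # one cached key per token seen
```

Unlike the context-free model, the second "m" now sees the cached keys and values of "e" and the first "m", so its prediction can differ from the first "m"'s.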
