MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 4: Transformer › 4.1

What Changes

Step 3 gave the model its biggest upgrade: attention. The model can now look back at previous tokens, decide which ones matter, and use that context to make predictions. That’s the core idea behind every GPT.

But our attention mechanism has two limitations:

  1. One attention head. The model can only attend in one way. It might learn to focus on recent characters, but then it can’t simultaneously track vowel patterns or word boundaries.

  2. One layer. The attention + MLP block runs once. Deeper models stack multiple layers, each refining the representation further.

Step 4 fixes both. The changes are structural — the same building blocks, arranged more powerfully.
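To make the first change concrete, here is a runnable sketch of the one-head vs. multi-head contrast. The names (`n_embd`, `n_head`, `attend`) and the random inputs are illustrative assumptions, not the guide's actual implementation:

```python
import numpy as np

n_embd, n_head = 16, 4
head_dim = n_embd // n_head          # 4 dims per head
T = 5                                # sequence length

rng = np.random.default_rng(0)
q = rng.normal(size=(T, n_embd))
k = rng.normal(size=(T, n_embd))
v = rng.normal(size=(T, n_embd))

def attend(q, k, v):
    # causal scaled dot-product attention for a single head
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# One head: a single attention pattern over the full 16 dims.
single = attend(q, k, v)

# Four heads: slice the 16 dims into four 4-dim pieces, let each piece
# attend independently, then concatenate back to 16 dims.
slices = [slice(i * head_dim, (i + 1) * head_dim) for i in range(n_head)]
multi = np.concatenate([attend(q[:, s], k[:, s], v[:, s]) for s in slices],
                       axis=-1)
print(single.shape, multi.shape)     # both (5, 16)
```

Each slice computes its own attention weights, so the four heads can focus on four different things at once while the output stays 16-dimensional.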

What stays the same

The building blocks: token and position embeddings, the attention mechanism, the MLP, and the gpt(token_id, pos_id, keys, values) interface. Training still uses plain SGD.

What’s new

Attention runs in several heads at once (n_head), and the attention + MLP pair becomes a layer that can be stacked (n_layer).

The architecture

Step 3’s model ran its single attention + MLP block once, end to end.

Step 4 wraps the attention + MLP block in a repeatable layer, so the same structure can be stacked.

The function signature stays the same — gpt(token_id, pos_id, keys, values) — because the changes are inside the model, not at its interface.
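The restructured forward pass can be sketched as below. This is a minimal runnable stand-in: the internals of `block` (a cache-append plus an averaging "attention" and a residual add) and all weight shapes are placeholder assumptions, chosen only to show the new control flow around the unchanged signature:

```python
import numpy as np

n_layer, n_embd, vocab_size, block_size = 2, 16, 27, 8
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, n_embd)) * 0.1   # token embeddings
wpe = rng.normal(size=(block_size, n_embd)) * 0.1   # position embeddings
lm_head = rng.normal(size=(n_embd, vocab_size)) * 0.1

def block(x, k_cache, v_cache):
    # placeholder for attention + MLP: cache this position, then
    # "attend" by averaging everything cached so far
    k_cache.append(x)
    v_cache.append(x)
    attended = np.mean(v_cache, axis=0)
    return x + attended              # residual connection

def gpt(token_id, pos_id, keys, values):
    x = wte[token_id] + wpe[pos_id]  # same entry point as Step 3
    for layer in range(n_layer):     # new in Step 4: a loop over layers
        x = block(x, keys[layer], values[layer])
    return x @ lm_head               # logits over the vocabulary

# one KV cache per layer; the call site looks exactly like Step 3's
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]
logits = gpt(token_id=3, pos_id=0, keys=keys, values=values)
print(logits.shape)                  # (27,)
```

The only structural novelty is the `for layer in range(n_layer)` loop, with each layer owning its own key/value cache.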

New hyperparameters

With n_layer = 1, the model has the same depth as Step 3 — we’re not making it deeper yet. But the architecture now supports depth. And n_head = 4 splits the 16-dimensional embedding into four 4-dimensional heads, each attending independently.
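In code, the new knobs and the head-size arithmetic look roughly like this (variable names are assumptions mirroring common GPT implementations, with the values from this step):

```python
n_embd = 16              # embedding width (unchanged from Step 3)
n_layer = 1              # new: layers to stack; 1 keeps Step 3's depth
n_head = 4               # new: attention heads per layer
head_dim = n_embd // n_head
print(head_dim)          # 4: each head attends in a 4-dim subspace
```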

After this step, the model is structurally a complete GPT. The only remaining difference from the final version is the optimizer — Step 5 replaces SGD with Adam.
