MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide


Multi-Head Attention: The Idea

In Step 3, each token produced one query (Q), one key (K), and one value (V) — all 16-dimensional. The model had one way to decide what to attend to. If it learned to focus on the immediately preceding character, that was its only strategy.

But language has multiple kinds of structure simultaneously. In the name “emma”, predicting what follows “a” might benefit from several kinds of context at once — which character came immediately before, how far into the name we are, and so on.

One head can’t attend to all of these at once — it produces a single set of attention weights. Multi-head attention solves this by running multiple attention heads in parallel, each looking at a different slice of the embedding.

These patterns aren’t programmed — they’re learned during training. We don’t tell any head what to look for. The model discovers which attention strategies are useful for predicting the next character, and different heads end up specialising in different patterns.

The split

Instead of one 16-dimensional attention, we run four 4-dimensional attentions:

n_embd = 16 (full embedding dimension, unchanged)
n_head = 4 (number of independent attention heads)
head_dim = n_embd // n_head = 4 (dimensions per head)
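The split itself is just a reshape. A minimal NumPy sketch (not the guide’s actual code — the variable names simply follow the hyperparameters above):

```python
import numpy as np

n_embd = 16
n_head = 4
head_dim = n_embd // n_head    # 16 // 4 = 4 dimensions per head

T = 4                           # sequence length, e.g. the four tokens of "emma"
q = np.random.randn(T, n_embd)  # one 16-dimensional query per token

# Partition each 16-dim query into 4 groups of 4:
# shape (T, 16) -> (T, 4, 4) -> (n_head, T, head_dim)
q_heads = q.reshape(T, n_head, head_dim).transpose(1, 0, 2)
print(q_heads.shape)  # (4, 4, 4): 4 heads, 4 tokens, 4 dims each
```

Head 0 sees dimensions 0–3 of every query, head 1 sees dimensions 4–7, and so on; no values are copied or mixed, just regrouped.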

The Q, K, V projections still produce 16-dimensional vectors (same weight matrices). But instead of computing attention over all 16 dimensions at once, we partition them into four groups of 4:

Each head computes its own attention scores and weighted sum independently. Then the four 4-dimensional outputs are concatenated back into a 16-dimensional vector.
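Putting the two steps together — split, attend per head, concatenate — the whole thing fits in one function. This is a hedged sketch with NumPy and a causal mask, not the guide’s implementation; it assumes `q`, `k`, `v` are the full 16-dimensional projections from Step 3:

```python
import numpy as np

def multi_head_attention(q, k, v, n_head):
    """Split dims across heads, run attention independently in each, concat.
    q, k, v: (T, n_embd) arrays; n_embd must be divisible by n_head."""
    T, n_embd = q.shape
    head_dim = n_embd // n_head

    # (T, n_embd) -> (n_head, T, head_dim): each head gets its own slice
    split = lambda x: x.reshape(T, n_head, head_dim).transpose(1, 0, 2)
    qh, kh, vh = split(q), split(k), split(v)

    # Scaled dot-product scores, computed independently per head: (H, T, T)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # no attending to the future

    # Row-wise softmax -> attention weights, then weighted sum of values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ vh                            # (H, T, head_dim)

    # Concatenate the four 4-dim outputs back into one 16-dim vector per token
    return out.transpose(1, 0, 2).reshape(T, n_embd)
```

One sanity check: with the causal mask, the first token can only attend to itself, so its output is exactly its own value vector, reassembled from the four head slices.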

The committee analogy

Think of it as a committee of four specialists reviewing the same document. Each specialist reads the same text but focuses on different aspects — one might track syntax, another semantics, another position. They each write a 4-dimensional summary. The final report concatenates all four summaries into a 16-dimensional picture that’s richer than any single reviewer could produce.

The total computation is roughly the same as single-head attention (four heads × 4 dimensions ≈ one head × 16 dimensions), but the model can learn different attention patterns in each head. This is strictly more expressive — the model could learn to make all heads identical (recovering single-head behaviour), but it can also learn to specialise them.

Try it

Toggle heads on and off to see how they combine. Each head has a different hypothetical attention pattern when predicting what follows “a” in “emma”:
