MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 3.3 RMSNorm 3.5 Attention: Scores and Weights →
Step 3: Attention › 3.4

Attention: Q, K, V

Previously Defined

  • Token + position → combined embedding
  • rmsnorm(x) — rescales a vector so its root-mean-square magnitude is 1
  • linear(x, W) — matrix multiply (from Step 1.3)
  • state_dict — all the model’s learned weight matrices

This is the big one. Attention lets each token look back at all previous tokens and decide which ones are relevant for predicting what comes next.

The idea: each token produces three vectors from its embedding.

    x_residual = x
    x = rmsnorm(x)
    q = linear(x, state_dict['attn_wq'])
    k = linear(x, state_dict['attn_wk'])
    v = linear(x, state_dict['attn_wv'])
    ...

The x_residual = x line saves a copy of the input before normalization; we’ll need it for the residual connection in 3.6. Then come three learned projections, three different roles:

q = linear(x, state_dict['attn_wq']) Query — “what am I looking for?”
k = linear(x, state_dict['attn_wk']) Key — “what do I contain?”
v = linear(x, state_dict['attn_wv']) Value — “what do I offer if selected?”

Each is a 16-dimensional vector, produced by multiplying the (normalized) embedding by a learned weight matrix. These are called projections because each matrix extracts a different view of the same embedding — like shining three different lights on the same object and getting three different shadows. The embedding contains everything the model knows about the token; Q, K, and V each pick out the aspects relevant to their role.
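Putting the snippet above into runnable form: a minimal sketch with toy versions of rmsnorm and linear, and a randomly initialized stand-in for state_dict (in the real model these weights are learned, not random). It assumes linear(x, W) treats W as out×in, one output per row:

```python
import math, random

def rmsnorm(x, eps=1e-5):
    # scale x so its root-mean-square magnitude is 1
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W):
    # matrix multiply: one output per row of W (out_dim x in_dim)
    return [sum(w * v for w, v in zip(row, x)) for row in W]

random.seed(0)
dim = 16
# hypothetical stand-in for the learned state_dict
state_dict = {name: [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
              for name in ('attn_wq', 'attn_wk', 'attn_wv')}

x = [random.gauss(0, 1) for _ in range(dim)]  # a combined token+position embedding
x_residual = x    # saved for the residual connection in 3.6
x = rmsnorm(x)
q = linear(x, state_dict['attn_wq'])
k = linear(x, state_dict['attn_wk'])
v = linear(x, state_dict['attn_wv'])
print(len(q), len(k), len(v))  # each is a 16-dimensional vector
```

Because the three weight matrices differ, the same normalized embedding produces three different vectors: the three shadows of the same object.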

Unlike the MLP in Step 1.4, where the dimensions changed (16→64→27), here the input and output are both 16-dimensional — each projection reshapes the information without changing its size:

Three new parameter matrices — attn_wq, attn_wk, attn_wv — all the same shape (16×16).

The analogy

Think of it like a library search. You walk in with a query (“I need information about vowels that appeared recently”). Every book on the shelf has a key on its spine describing what’s inside. You compare your query against each key to find the best matches. Then you read the value — the actual content — from the books that matched.

The query comes from the current token. The keys and values come from all previous tokens (including the current one). The model learns what to put in each vector during training — the Q, K, V weight matrices are parameters, just like the embedding tables.
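In practice every position in the sequence runs through the same three projections, so a sequence of T tokens yields T queries, T keys, and T values. A hedged sketch, with toy helpers and random weights standing in for the real learned ones:

```python
import math, random

def rmsnorm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + 1e-5)
    return [v / rms for v in x]

def linear(x, W):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

random.seed(1)
dim, T = 16, 5  # embedding size, sequence length
W = {n: [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
     for n in ('attn_wq', 'attn_wk', 'attn_wv')}
embeds = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(T)]

# every position gets its own query, key, and value
Q = [linear(rmsnorm(x), W['attn_wq']) for x in embeds]
K = [linear(rmsnorm(x), W['attn_wk']) for x in embeds]
V = [linear(rmsnorm(x), W['attn_wv']) for x in embeds]
# token t's query will be compared against keys 0..t (its past, including itself)
print(len(Q), len(K[0]))  # 5 positions, 16 dims each
```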

The new parameters

Three new 16×16 weight matrices, one each for Q, K, and V.
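A quick back-of-envelope count of the new weights, assuming only the three 16×16 matrices named above:

```python
dim = 16
# the three attention projection matrices, all out_dim x in_dim
shapes = {'attn_wq': (dim, dim), 'attn_wk': (dim, dim), 'attn_wv': (dim, dim)}
total = sum(rows * cols for rows, cols in shapes.values())
print(total)  # 3 * 256 = 768 new learned parameters
```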

There’s also an output projection matrix attn_wo — we’ll see that in 3.6.
