MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 3.4 Attention: Q, K, V 3.6 Attention Output and Residuals →
Step 3: Attention › 3.5

Attention: Scores and Weights

Previously Defined

  • Each token produces three projections: q (query), k (key), and v (value)
  • softmax(logits) — converts scores to probabilities (from Step 1.3)

Now comes the actual “attending.” The current token’s query is compared against every previous token’s key to produce attention scores:

    keys.append(k)
    values.append(v)
    attn_logits = [sum(q[j] * keys[t][j] for j in range(n_embd)) / n_embd**0.5 for t in range(len(keys))]
    attn_weights = softmax(attn_logits)

First, the current token’s key and value are appended to the cache (more on the cache in 3.7). Then:

  • sum(q[j] * keys[t][j] for j in range(n_embd)) — dot product between the current query and the key at position t. Higher = better match
  • / n_embd**0.5 — scale by √n_embd = √16 = 4. Without this, dot products grow with the embedding dimension and softmax saturates
  • softmax(attn_logits) — convert scores to probabilities: a distribution over positions

The result: attn_logits is a list with one score per cached position — every earlier token plus the current one, since the current key was just appended — measuring how relevant each key is to the current token’s query. After softmax, these become attn_weights — a probability distribution that sums to 1.
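The snippet above can be run end to end with made-up toy vectors. This sketch uses a tiny n_embd of 4 and hand-picked numbers (both assumptions for illustration — the guide itself uses 16 learned dimensions), and redefines softmax inline so it is self-contained:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

n_embd = 4  # tiny embedding size for illustration

# Toy cache: keys and values for two earlier positions,
# plus the current token's q, k, v projections.
keys = [[0.5, -0.2, 0.1, 0.0], [1.0, 0.3, -0.4, 0.2]]
values = [[0.1] * n_embd, [0.2] * n_embd]
q = [0.9, 0.1, -0.3, 0.5]
k = [0.8, 0.0, -0.2, 0.4]
v = [0.3] * n_embd

# Same computation as in the text: append to the cache, then score every position.
keys.append(k)
values.append(v)
attn_logits = [sum(q[j] * keys[t][j] for j in range(n_embd)) / n_embd**0.5
               for t in range(len(keys))]
attn_weights = softmax(attn_logits)

print(attn_weights)             # one weight per cached position, three in total
print(sum(attn_weights))        # sums to 1.0 (up to floating-point error)
```

Note that after appending, the list has three entries, so the weights form a distribution over three positions — the two earlier tokens and the current one.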

Scaled dot-product

This is called scaled dot-product attention:

\[\text{attn logits}_t = \frac{q \cdot k_t}{\sqrt{d}}\]

where d is the embedding dimension (16). The scaling matters: a dot product of two random 16-dimensional vectors has variance proportional to 16. Without dividing by √d, the logits would be large, softmax would produce near-one-hot distributions (almost all the weight on a single position), and gradients would vanish. The scaling keeps the variance around 1, so softmax stays in a useful range.

What the scores mean

The attention weights form a probability distribution over previous positions — the model’s answer to “where should I look?” These weights are learned indirectly: the Q and K matrices are trained so that useful query-key pairs produce high dot products.

Try it

Click a token to see its attention weights — how much it attends to each previous position. The bar heights show the weights (which sum to 1).
