Attention: Scores and Weights
Previously Defined
- Each token produces three projections: q (query), k (key), and v (value)
- softmax(logits) — converts scores to probabilities (from Step 1.3)
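As a refresher, the softmax from Step 1.3 can be sketched as a minimal, numerically stable implementation (subtracting the max before exponentiating; this is an illustrative version, not necessarily the exact code used earlier):

```python
import math

def softmax(logits):
    # Subtract the max so the largest exponent is exp(0) = 1,
    # avoiding overflow; this does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The output always sums to 1, and larger logits get larger shares.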
Now comes the actual “attending.” The current token’s query is compared against every cached key — one per previous token, plus the current token’s own key, which is appended first — to produce attention scores:
keys.append(k)
values.append(v)
attn_logits = [sum(q[j] * keys[t][j] for j in range(n_embd)) / n_embd**0.5 for t in range(len(keys))]
attn_weights = softmax(attn_logits)
First, the current token’s key and value are appended to the cache (more on the cache in 3.7). Then:
| sum(q[j] * keys[t][j] for j in range(n_embd)) | → | Dot product between current query and key at position t (each previous position). Higher = better match |
| / n_embd**0.5 | → | Scale by √16 = 4. Without this, dot products grow with embedding dimension and softmax saturates |
| softmax(attn_logits) | → | Convert scores to probabilities — a distribution over previous positions |
The result: attn_logits is a list with one score per cached position (every previous token plus the current one), measuring how relevant each token’s key is to the current token’s query. After softmax, these become attn_weights — a probability distribution that sums to 1.
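Putting the pieces together, here is a self-contained sketch of the score-and-weight computation. The cache contents are hypothetical, and n_embd is shrunk to 4 to keep the numbers readable (the text uses 16):

```python
import math

n_embd = 4  # toy embedding size for illustration; the text uses 16

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical key cache after three tokens have been processed.
keys = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0]]

# Current token's query — points in the same direction as keys[2].
q = [1.0, 1.0, 0.0, 0.0]

# One scaled dot-product score per cached position.
attn_logits = [sum(q[j] * keys[t][j] for j in range(n_embd)) / n_embd**0.5
               for t in range(len(keys))]
attn_weights = softmax(attn_logits)
```

Position 2 has the best query-key match, so it receives the largest weight, and the weights sum to 1.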
Scaled dot-product
This is called scaled dot-product attention:

attn_logits[t] = (q · keys[t]) / √d,   attn_weights = softmax(attn_logits)

where d is the embedding dimension (16). The scaling matters: a dot product of two random 16-dimensional vectors has variance proportional to 16. Without dividing by √d, the logits would be large, softmax would produce near-one-hot distributions (almost all the weight on a single position), and gradients would vanish. The scaling keeps the variance around 1, so softmax stays in a useful range.
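The variance claim can be checked empirically. This quick sketch draws pairs of random unit-variance 16-dimensional vectors and compares the variance of the raw dot products against the scaled ones (the sample count and seed are arbitrary choices):

```python
import random
import statistics

random.seed(0)
d = 16
samples = 10_000

raw, scaled = [], []
for _ in range(samples):
    # Two independent vectors with unit-variance components.
    q = [random.gauss(0, 1) for _ in range(d)]
    k = [random.gauss(0, 1) for _ in range(d)]
    dot = sum(a * b for a, b in zip(q, k))
    raw.append(dot)
    scaled.append(dot / d**0.5)

var_raw = statistics.pvariance(raw)      # close to d = 16
var_scaled = statistics.pvariance(scaled)  # close to 1
```

Dividing by √d divides the variance by d, which is exactly what brings it back to roughly 1.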
What the scores mean
The attention weights form a probability distribution over previous positions — the model’s answer to “where should I look?” These weights are learned indirectly: the Q and K matrices are trained so that useful query-key pairs produce high dot products.
Try it
Click a token to see its attention weights — how much it attends to each previous position. The bar heights show the weights (which sum to 1).