Attention Output and Residuals
Previously Defined
The attention weights (attn_weights) tell us how much to attend to each previous position. Now we use them to create a weighted combination of the values:
x_attn = [sum(attn_weights[t] * values[t][j] for t in range(len(values))) for j in range(n_embd)]
x = linear(x_attn, state_dict['attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]
The weighted sum
Recall from 3.4 that the value vectors are “what I offer if selected.” Now the attention weights decide how much to select from each position:
Each value vector is scaled by its attention weight and the results are summed element-wise into x_attn. This is the “answer” that the attention mechanism produces — a blend of information from the positions the model found most relevant.
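The weighted sum above can be sketched with toy numbers (3 positions, n_embd shrunk to 2; all values here are made up for illustration):

```python
# Toy weighted sum: 3 previous positions, embedding dimension 2.
attn_weights = [0.5, 0.25, 0.25]               # attention over 3 positions (sums to 1)
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # one value vector per position
n_embd = 2

# Scale each value vector by its weight, then sum element-wise.
x_attn = [sum(attn_weights[t] * values[t][j] for t in range(len(values)))
          for j in range(n_embd)]
print(x_attn)  # [1.0, 0.75]
```

The result is a single n_embd-sized vector: mostly the first position's value, with smaller contributions from the other two.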
The output projection
The attn_wo matrix mentioned in 3.4 appears here — the “o” stands for output. In 3.4 we saw three projections into attention (Q, K, V — “what am I looking for?”, “what do I contain?”, “what do I offer?”). This is the projection out — transforming what attention found back into a form the rest of the model can use. It’s another 16×16 learned matrix, giving the model a chance to remix the attended information before it’s added back to the residual stream.
Everything stays 16-dimensional — the residual addition works because all the vectors are the same size.
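A minimal sketch of that projection, assuming `linear(x, w)` is a plain matrix-vector product (rows of `w` are output dimensions) and using a stand-in 3×3 matrix instead of the real learned 16×16 one:

```python
# Sketch of the output projection; weights and inputs are illustrative only.
def linear(x, w):
    # One dot product per row of w -> output has len(w) entries.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

x_attn = [1.0, 0.5, -1.0]                     # what attention produced
attn_wo = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]   # stand-in 3x3 "output" weights
x = linear(x_attn, attn_wo)
print(x)  # [1.0, 1.0, -1.0] -- same dimension as the input
```

Because the output dimension matches the input dimension, the result can be added straight back into the residual stream.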
The residual connection
Remember that x_residual = x was saved before the attention block started (in 3.4). The third line adds it back — element-wise addition of x and x_residual. This means the attention block’s output is added to the input, not replacing it.
Why? Two reasons:
- Gradient flow. Without the skip connection, gradients have to pass through every matrix multiplication on the way back. The residual provides a direct path for gradients to flow through, making training more stable.
- Incremental refinement. Each block refines the representation rather than building it from scratch. The model can learn “add a little attention information” rather than “reconstruct everything.”
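The "nudge, don't replace" behavior is easy to see with made-up numbers:

```python
# Residual connection sketch: the block's output is added to the saved input,
# so the block contributes a refinement rather than a replacement.
x_residual = [1.0, -2.0, 0.5]     # saved before the block (illustrative values)
block_out = [0.5, 0.25, -0.25]    # what the block computed
x = [a + b for a, b in zip(block_out, x_residual)]
print(x)  # [1.5, -1.75, 0.25]
```

Even if the block computed nothing useful (all zeros), the original representation would pass through unchanged.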
The MLP block — same pattern
# 2) MLP block
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict['mlp_fc1'])
x = [xi.relu() for xi in x]
x = linear(x, state_dict['mlp_fc2'])
x = [a + b for a, b in zip(x, x_residual)]
logits = linear(x, state_dict['lm_head'])
return logits
The MLP follows the same structure: save residual → normalize → compute → add residual back. The MLP itself is familiar from Step 2 — project up to 64 dimensions, ReLU, project back down — but now it projects back to n_embd (16) instead of directly to vocab_size (27):
A separate lm_head (language model head) matrix handles the final projection to logits — decoupling the MLP from the vocabulary size:
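The dimension flow can be sketched as follows. This is a shape-only illustration with zero-filled weights, assuming `linear` is a plain matrix-vector product; the real code operates on autograd `Value` objects (hence `xi.relu()`), whereas here plain floats and `max(0, ·)` stand in for ReLU:

```python
# Shape sketch of the MLP block and final projection: 16 -> 64 -> 16 -> 27.
n_embd, hidden, vocab_size = 16, 64, 27

def linear(x, w):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

x = [0.0] * n_embd
mlp_fc1 = [[0.0] * n_embd for _ in range(hidden)]      # projects 16 up to 64
mlp_fc2 = [[0.0] * hidden for _ in range(n_embd)]      # projects 64 back to 16
lm_head = [[0.0] * n_embd for _ in range(vocab_size)]  # projects 16 to 27 logits

h = [max(0.0, h_i) for h_i in linear(x, mlp_fc1)]  # project up + ReLU
x = linear(h, mlp_fc2)                             # back to n_embd for the residual add
logits = linear(x, lm_head)                        # one logit per vocabulary entry
```

Keeping the MLP's output at n_embd is what makes the residual addition possible; only lm_head knows about the vocabulary size.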
The full picture
The complete gpt() function alternates communication (attention — tokens talk to each other) and computation (MLP — each token thinks independently):