Attention Output and Residuals
Previously Defined
The attention weights (attn_weights) tell us how much to attend to each previous position. Now we use them to create a weighted combination of the values:
x_attn = [sum(attn_weights[t] * values[t][j] for t in range(len(values))) for j in range(n_embd)]
x = linear(x_attn, state_dict['attn_wo'])
x = [a + b for a, b in zip(x, x_residual)]
The weighted sum
Recall from 3.4 that the value vectors are “what I offer if selected.” Now the attention weights decide how much to select from each position:
Each value vector is scaled by its attention weight and the results are summed element-wise into x_attn. This is the “answer” that the attention mechanism produces — a blend of information from the positions the model found most relevant.
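The weighted sum above can be sketched with toy numbers (3 positions, n_embd shrunk to 2; all values here are made up for illustration):

```python
# Toy weighted sum: 3 previous positions, embedding dimension 2.
attn_weights = [0.5, 0.25, 0.25]               # attention over 3 positions (sums to 1)
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # one value vector per position
n_embd = 2

# Scale each value vector by its weight, then sum element-wise.
x_attn = [sum(attn_weights[t] * values[t][j] for t in range(len(values)))
          for j in range(n_embd)]
print(x_attn)  # [1.0, 0.75]
```

The result is a single n_embd-sized vector: mostly the first position's value, with smaller contributions from the other two.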
The output projection
The attn_wo matrix mentioned in 3.4 appears here — the “o” stands for output. In 3.4 we saw three projections into attention (Q, K, V — “what am I looking for?”, “what do I contain?”, “what do I offer?”). This is the projection out — transforming what attention found back into a form the rest of the model can use. It’s another 16×16 learned matrix, giving the model a chance to remix the attended information before it’s added back to the residual stream.
Everything stays 16-dimensional — the residual addition works because all the vectors are the same size.
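A minimal sketch of that projection, assuming `linear(x, w)` is a plain matrix-vector product (rows of `w` are output dimensions) and using a stand-in 3×3 matrix instead of the real learned 16×16 one:

```python
# Sketch of the output projection; weights and inputs are illustrative only.
def linear(x, w):
    # One dot product per row of w -> output has len(w) entries.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

x_attn = [1.0, 0.5, -1.0]                     # what attention produced
attn_wo = [[1, 0, 0], [0, 2, 0], [0, 0, 1]]   # stand-in 3x3 "output" weights
x = linear(x_attn, attn_wo)
print(x)  # [1.0, 1.0, -1.0] -- same dimension as the input
```

Because the output dimension matches the input dimension, the result can be added straight back into the residual stream.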
The residual connection
Remember that x_residual = x was saved before the attention block started (in 3.4). The third line adds it back — element-wise addition of x and x_residual. This means the attention block’s output is added to the input, not replacing it.
Why? Two reasons:
- Gradient flow. Without the skip connection, gradients have to pass through every matrix multiplication on the way back. The residual provides a direct path for gradients to flow through, making training more stable.
- Incremental refinement. Each block refines the representation rather than building it from scratch. The model can learn “add a little attention information” rather than “reconstruct everything.”
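The "nudge, don't replace" behavior is easy to see with made-up numbers:

```python
# Residual connection sketch: the block's output is added to the saved input,
# so the block contributes a refinement rather than a replacement.
x_residual = [1.0, -2.0, 0.5]     # saved before the block (illustrative values)
block_out = [0.5, 0.25, -0.25]    # what the block computed
x = [a + b for a, b in zip(block_out, x_residual)]
print(x)  # [1.5, -1.75, 0.25]
```

Even if the block computed nothing useful (all zeros), the original representation would pass through unchanged.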
The MLP block — same pattern
# 2) MLP block
x_residual = x
x = rmsnorm(x)
x = linear(x, state_dict['mlp_fc1'])
x = [xi.relu() for xi in x]
x = linear(x, state_dict['mlp_fc2'])
x = [a + b for a, b in zip(x, x_residual)]
logits = linear(x, state_dict['lm_head'])
return logits
The MLP follows the same structure: save residual → normalize → compute → add residual back. The MLP itself is familiar from Step 2 — project up to 64 dimensions, ReLU, project back down — but now it projects back to n_embd (16) instead of directly to vocab_size (27):
A separate lm_head (language model head) matrix handles the final projection to logits — decoupling the MLP from the vocabulary size:
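The dimension flow can be sketched as follows. This is a shape-only illustration with zero-filled weights, assuming `linear` is a plain matrix-vector product; the real code operates on autograd `Value` objects (hence `xi.relu()`), whereas here plain floats and `max(0, ·)` stand in for ReLU:

```python
# Shape sketch of the MLP block and final projection: 16 -> 64 -> 16 -> 27.
n_embd, hidden, vocab_size = 16, 64, 27

def linear(x, w):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

x = [0.0] * n_embd
mlp_fc1 = [[0.0] * n_embd for _ in range(hidden)]      # projects 16 up to 64
mlp_fc2 = [[0.0] * hidden for _ in range(n_embd)]      # projects 64 back to 16
lm_head = [[0.0] * n_embd for _ in range(vocab_size)]  # projects 16 to 27 logits

h = [max(0.0, h_i) for h_i in linear(x, mlp_fc1)]  # project up + ReLU
x = linear(h, mlp_fc2)                             # back to n_embd for the residual add
logits = linear(x, lm_head)                        # one logit per vocabulary entry
```

Keeping the MLP's output at n_embd is what makes the residual addition possible; only lm_head knows about the vocabulary size.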
The full picture
The complete gpt() function alternates communication (attention — tokens talk to each other) and computation (MLP — each token thinks independently):