Analytic Gradient
So far we have two pieces:

- forward(tokens, n) — computes the average loss
- numerical_gradient() — slow but correct
The chain rule from calculus lets us compute all 2,480 gradients in a single pass through the network — working backwards from the loss. This is backpropagation.
Let’s build it up piece by piece.
Setup and Forward Pass
```python
def analytic_gradient(tokens, n):
    grad = {k: [[0.0] * len(row) for row in mat] for k, mat in state_dict.items()}
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        x = list(state_dict['wte'][token_id])
        h_pre = linear(x, state_dict['mlp_fc1'])
        h = [max(0, v) for v in h_pre]
        logits = linear(h, state_dict['mlp_fc2'])
        probs = softmax(logits)
        loss_t = -math.log(probs[target_id])
        losses.append(loss_t)
```
This is the same forward pass as mlp() + softmax() + loss, but we save every intermediate value (x, h_pre, h, logits, probs). We’ll need them for the backward pass.
We also initialize grad — a dictionary of zero-matrices with the same shape as state_dict. This is where we’ll accumulate the gradients.
Step 1: Softmax + Cross-entropy → dlogits
```python
dlogits = [p / n for p in probs]
dlogits[target_id] -= 1.0 / n
```
$$\frac{\partial L}{\partial \text{logits}_i} = \frac{\text{probs}_i - \mathbb{1}[i = \text{target}]}{n}$$
This is the elegant result: when you combine softmax with cross-entropy loss, the gradient simplifies to just “probabilities minus the correct answer.” If the model gave the right token 80% probability, the gradient is small. If it gave it 1%, the gradient is large — a strong signal to change.
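You can verify this formula with a quick self-contained check. The logits and target below are made-up values, not from the model; the point is that the analytic gradient matches a finite-difference estimate of the loss.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

logits = [2.0, 1.0, 0.1]   # made-up logits for a 3-token vocabulary
target_id = 0
n = 1                      # one position, so the 1/n factor is 1

probs = softmax(logits)
dlogits = [p / n for p in probs]
dlogits[target_id] -= 1.0 / n

# Finite-difference check: nudge each logit and watch the loss move
eps = 1e-6
base = -math.log(softmax(logits)[target_id])
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (-math.log(softmax(bumped)[target_id]) - base) / eps
    assert abs(numeric - dlogits[i]) < 1e-4
```

Note that the gradient entries always sum to zero: pushing the correct logit up and the others down is a zero-sum reshuffle of probability mass.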
Step 2: Output layer → dh, grad[mlp_fc2]
```python
dh = [0.0] * len(h)
for i in range(len(dlogits)):
    for j in range(len(h)):
        grad['mlp_fc2'][i][j] += dlogits[i] * h[j]
        dh[j] += state_dict['mlp_fc2'][i][j] * dlogits[i]
```
The logits came from linear(h, mlp_fc2), meaning logits[i] = sum(mlp_fc2[i][j] * h[j]). The chain rule gives us two things at once:
| Quantity | Meaning |
| --- | --- |
| grad['mlp_fc2'][i][j] | How much to change this weight: dlogits[i] * h[j] |
| dh[j] | How much to blame this hidden unit: pass the error backward through the weights |
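The two-for-one pattern is easy to check on toy numbers. The weights, activations, and dlogits below are invented for illustration; the loop is the same as above with concrete shapes.

```python
# Toy shapes: 2 output logits, 3 hidden units (values invented for illustration)
W = [[0.5, -0.2, 0.1],
     [0.3,  0.4, -0.1]]        # stands in for state_dict['mlp_fc2']
h = [1.0, 2.0, 0.5]
dlogits = [0.25, -0.25]

gW = [[0.0] * 3 for _ in range(2)]
dh = [0.0] * 3
for i in range(2):
    for j in range(3):
        gW[i][j] += dlogits[i] * h[j]   # weight gradient: error times activation
        dh[j] += W[i][j] * dlogits[i]   # hidden gradient: error back through W

# gW[0] == [0.25, 0.5, 0.125]; dh[0] is approximately 0.5*0.25 + 0.3*(-0.25) = 0.05
```

Notice the symmetry: the weight gradient is an outer product of the error with the activations, while dh multiplies the error through the transposed weights.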
Step 3: ReLU → dh_pre
```python
dh_pre = [dh[j] * (1.0 if h_pre[j] > 0 else 0.0) for j in range(len(h_pre))]
```
ReLU’s gradient is binary: 1 if the input was positive, 0 otherwise (the gradient at exactly 0 is taken as 0 here by convention). Where ReLU zeroed out a value in the forward pass, the gradient is also zero, so that neuron gets no update this step.
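A quick illustration with made-up numbers:

```python
h_pre = [1.5, -0.3, 0.0, 2.0]   # made-up pre-activations
dh = [0.1, 0.2, 0.3, 0.4]       # made-up upstream gradients

# Gradient passes through only where the forward ReLU let the value through
dh_pre = [dh[j] * (1.0 if h_pre[j] > 0 else 0.0) for j in range(len(h_pre))]
# dh_pre == [0.1, 0.0, 0.0, 0.4]
```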
Step 4: Hidden layer → dx, grad[mlp_fc1]
```python
dx = [0.0] * len(x)
for i in range(len(dh_pre)):
    for j in range(len(x)):
        grad['mlp_fc1'][i][j] += dh_pre[i] * x[j]
        dx[j] += state_dict['mlp_fc1'][i][j] * dh_pre[i]
```
Same pattern as step 2. h_pre came from linear(x, mlp_fc1), so:
| Quantity | Meaning |
| --- | --- |
| grad['mlp_fc1'][i][j] | How much to change this weight: dh_pre[i] * x[j] |
| dx[j] | How much to blame this embedding dimension: pass the error backward again |
Step 5: Embedding → grad[wte]
```python
for j in range(len(x)):
    grad['wte'][token_id][j] += dx[j]
```
The embedding was just a table lookup (x = wte[token_id]), so dx goes directly into the gradient for that token’s row. No further backward pass needed — we’ve reached the input.
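One subtlety worth making concrete: the += matters. If the same token appears at several positions, its embedding row accumulates gradient from all of them. A toy sketch (the vocabulary size, dimensions, and dx values below are made up):

```python
# Hypothetical: 3-token vocab, 2-dim embeddings
grad_wte = [[0.0, 0.0] for _ in range(3)]

# Token 1 appears at two positions, each contributing its own dx
for token_id, dx in [(1, [0.1, -0.2]), (1, [0.3, 0.1])]:
    for j in range(len(dx)):
        grad_wte[token_id][j] += dx[j]

# grad_wte[1] is approximately [0.4, -0.1]: both positions' contributions add up
```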
Collecting the result
```python
loss = (1 / n) * sum(losses)
grad_flat = [g for mat in grad.values() for row in mat for g in row]
return loss, grad_flat
```
After processing all token pairs, we flatten the gradient dictionary into a single list — same shape as params — so the training loop can pair each gradient with its parameter.
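The flatten order must match however params was built from state_dict. Since Python 3.7, dicts preserve insertion order, so iterating the same dict twice yields the same order. A toy sketch with made-up shapes:

```python
# Made-up tiny matrices standing in for the real gradient dict
grad = {'wte': [[0.1, 0.2]], 'mlp_fc1': [[0.3], [0.4]]}
grad_flat = [g for mat in grad.values() for row in mat for g in row]
# grad_flat == [0.1, 0.2, 0.3, 0.4], matching a flatten of a same-shaped state_dict
```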
Why This Matters
The numerical gradient needs a separate forward pass for each of the 2,480 parameters. The analytic gradient needs one forward pass and one backward pass — roughly 2× the cost of a single forward pass, regardless of how many parameters there are.
This is what makes training neural networks practical. Without backpropagation, even this tiny 2,480-parameter model would be too slow to train.
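As a final sanity check, the whole pipeline can be verified end-to-end on a randomly initialized toy model. The sizes, helper functions, and token sequence below are invented, and analytic_gradient is condensed from the steps above; the real state_dict and forward() would slot in the same way. Every analytic gradient should match a central-difference estimate.

```python
import math, random

random.seed(0)
V, D, H = 4, 3, 5   # made-up vocab / embedding / hidden sizes
state_dict = {
    'wte':     [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)],
    'mlp_fc1': [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(H)],
    'mlp_fc2': [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(V)],
}

def linear(x, W):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(tokens, n):
    total = 0.0
    for pos in range(n):
        tok, tgt = tokens[pos], tokens[pos + 1]
        h = [max(0.0, v) for v in linear(state_dict['wte'][tok], state_dict['mlp_fc1'])]
        total += -math.log(softmax(linear(h, state_dict['mlp_fc2']))[tgt])
    return total / n

def analytic_gradient(tokens, n):
    grad = {k: [[0.0] * len(m[0]) for _ in m] for k, m in state_dict.items()}
    for pos in range(n):
        tok, tgt = tokens[pos], tokens[pos + 1]
        x = state_dict['wte'][tok]
        h_pre = linear(x, state_dict['mlp_fc1'])
        h = [max(0.0, v) for v in h_pre]
        probs = softmax(linear(h, state_dict['mlp_fc2']))
        dlogits = [p / n for p in probs]
        dlogits[tgt] -= 1.0 / n
        dh = [0.0] * H
        for i in range(V):
            for j in range(H):
                grad['mlp_fc2'][i][j] += dlogits[i] * h[j]
                dh[j] += state_dict['mlp_fc2'][i][j] * dlogits[i]
        dh_pre = [dh[j] * (1.0 if h_pre[j] > 0 else 0.0) for j in range(H)]
        for i in range(H):
            for j in range(D):
                grad['mlp_fc1'][i][j] += dh_pre[i] * x[j]
                grad['wte'][tok][j] += state_dict['mlp_fc1'][i][j] * dh_pre[i]
    return grad

tokens = [0, 2, 1, 3, 0]    # made-up token sequence
n = len(tokens) - 1
grad = analytic_gradient(tokens, n)

# Central-difference check against every parameter
eps, worst = 1e-5, 0.0
for name, mat in state_dict.items():
    for i in range(len(mat)):
        for j in range(len(mat[0])):
            old = mat[i][j]
            mat[i][j] = old + eps
            up = forward(tokens, n)
            mat[i][j] = old - eps
            down = forward(tokens, n)
            mat[i][j] = old
            worst = max(worst, abs((up - down) / (2 * eps) - grad[name][i][j]))
assert worst < 1e-6
```

The numerical check still costs two forward passes per parameter, which is exactly why it is only used as a test: the analytic gradient produced all of those numbers in one backward sweep.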