MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide


Analytic Gradient

So far

  • forward(tokens, n) — computes average loss
  • numerical_gradient() — slow but correct

The chain rule from calculus lets us compute all 2,480 gradients in a single pass through the network — working backwards from the loss. This is backpropagation.

Let’s build it up piece by piece.

Setup and Forward Pass

def analytic_gradient(tokens, n):
    grad = {k: [[0.0] * len(row) for row in mat] for k, mat in state_dict.items()}
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        x = list(state_dict['wte'][token_id])
        h_pre = linear(x, state_dict['mlp_fc1'])
        h = [max(0, v) for v in h_pre]
        logits = linear(h, state_dict['mlp_fc2'])
        probs = softmax(logits)
        loss_t = -math.log(probs[target_id])
        losses.append(loss_t)

This is the same forward pass as mlp() + softmax() + loss, but we save every intermediate value (x, h_pre, h, logits, probs). We’ll need them for the backward pass.

We also initialize grad — a dictionary of zero-matrices with the same shape as state_dict. This is where we’ll accumulate the gradients.
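As a sketch of that initialization, here is the same comprehension run on a hypothetical two-matrix state_dict (shapes invented for illustration):

```python
# Hypothetical state_dict standing in for the real one.
state_dict = {
    'wte': [[0.1, -0.2], [0.3, 0.4]],                  # 2 tokens x 2 dims
    'mlp_fc1': [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],   # 3 x 2
}

# Same comprehension as in analytic_gradient(): one zero per parameter,
# matrix by matrix, row by row.
grad = {k: [[0.0] * len(row) for row in mat] for k, mat in state_dict.items()}

print([len(m) for m in grad.values()])  # row counts match: [2, 3]
```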

Step 1: Softmax + Cross-entropy → dlogits

        dlogits = [p / n for p in probs]
        dlogits[target_id] -= 1.0 / n

$$\frac{\partial L}{\partial \text{logits}_i} = \frac{\text{probs}_i - \mathbb{1}[i = \text{target}]}{n}$$

This is the elegant result: when you combine softmax with cross-entropy loss, the gradient simplifies to just “probabilities minus the correct answer.” If the model gave the right token 80% probability, the gradient is small. If it gave it 1%, the gradient is large — a strong signal to change.
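To see the signal strength concretely, here is the same two-line gradient wrapped in a helper (dlogits_from_probs is our name for this sketch, not part of the tutorial's code) and applied to two toy distributions:

```python
def dlogits_from_probs(probs, target_id, n):
    # softmax + cross-entropy backward: (probs - one_hot(target)) / n
    d = [p / n for p in probs]
    d[target_id] -= 1.0 / n
    return d

# Confident model: 80% on the right token -> small gradient.
print(dlogits_from_probs([0.8, 0.1, 0.1], target_id=0, n=1))
# -> roughly [-0.2, 0.1, 0.1]

# Badly wrong model: 1% on the right token -> large gradient.
print(dlogits_from_probs([0.01, 0.9, 0.09], target_id=0, n=1))
# -> roughly [-0.99, 0.9, 0.09]
```

The confident model gets a gentle nudge; the wrong one gets pushed hard toward the target.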

Step 2: Output layer → dh, grad[mlp_fc2]

        dh = [0.0] * len(h)
        for i in range(len(dlogits)):
            for j in range(len(h)):
                grad['mlp_fc2'][i][j] += dlogits[i] * h[j]
                dh[j] += state_dict['mlp_fc2'][i][j] * dlogits[i]

The logits came from linear(h, mlp_fc2), meaning logits[i] = sum(mlp_fc2[i][j] * h[j] for j in range(len(h))). The chain rule gives us two things at once:

  • grad['mlp_fc2'][i][j]: how much to change this weight, dlogits[i] * h[j]
  • dh[j]: how much to blame this hidden unit; pass the error backward through the weights
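Both updates can be checked against a finite difference on a standalone toy layer (weights and upstream gradient invented for the sketch; the local linear() mirrors the tutorial's):

```python
def linear(x, W):
    # out[i] = sum over j of W[i][j] * x[j]
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

# Toy layer: 2 outputs, 3 inputs; dout plays the role of dlogits.
W = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
x = [1.0, 2.0, -1.0]
dout = [0.7, -0.3]

# Backward pass, exactly the step-2 pattern.
dW = [[dout[i] * x[j] for j in range(len(x))] for i in range(len(W))]
dx = [sum(W[i][j] * dout[i] for i in range(len(W))) for j in range(len(x))]

# Numerical spot-check on one weight, with L = dout . linear(x, W):
eps = 1e-6
W[0][1] += eps
L_plus = sum(d * o for d, o in zip(dout, linear(x, W)))
W[0][1] -= 2 * eps
L_minus = sum(d * o for d, o in zip(dout, linear(x, W)))
W[0][1] += eps  # restore
assert abs((L_plus - L_minus) / (2 * eps) - dW[0][1]) < 1e-6
```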

Step 3: ReLU → dh_pre

        dh_pre = [dh[j] * (1.0 if h_pre[j] > 0 else 0.0) for j in range(len(h_pre))]

ReLU’s gradient is binary: 1 if the input was positive, 0 otherwise. Where ReLU zeroed out a value in the forward pass, the gradient is also zero — that neuron gets no update this step.
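A tiny standalone example of the same comprehension, with toy values:

```python
h_pre = [2.0, -1.5, 0.5, -0.1]      # pre-activation values
h = [max(0.0, v) for v in h_pre]     # forward: ReLU zeroes the negatives
dh = [1.0, 1.0, 1.0, 1.0]            # pretend upstream gradient

# backward: the gradient passes through only where the input was positive
dh_pre = [dh[j] * (1.0 if h_pre[j] > 0 else 0.0) for j in range(len(h_pre))]
print(dh_pre)  # [1.0, 0.0, 1.0, 0.0]
```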

Step 4: Hidden layer → dx, grad[mlp_fc1]

        dx = [0.0] * len(x)
        for i in range(len(dh_pre)):
            for j in range(len(x)):
                grad['mlp_fc1'][i][j] += dh_pre[i] * x[j]
                dx[j] += state_dict['mlp_fc1'][i][j] * dh_pre[i]

Same pattern as step 2. h_pre came from linear(x, mlp_fc1), so:

  • grad['mlp_fc1'][i][j]: how much to change this weight, dh_pre[i] * x[j]
  • dx[j]: how much to blame this embedding dimension; pass the error backward again

Step 5: Embedding → grad[wte]

        for j in range(len(x)):
            grad['wte'][token_id][j] += dx[j]

The embedding was just a table lookup (x = wte[token_id]), so dx goes directly into the gradient for that token’s row. No further backward pass needed — we’ve reached the input.
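A standalone sketch with a hypothetical 3-token vocabulary shows that only the looked-up row receives a gradient:

```python
# Hypothetical 3-token vocabulary with 2-dim embeddings.
wte = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
grad_wte = [[0.0, 0.0] for _ in wte]

token_id = 1        # the token that was looked up this step
dx = [0.25, -0.5]   # gradient flowing back into the embedding

# dx lands directly in that token's row; all other rows stay zero.
for j in range(len(dx)):
    grad_wte[token_id][j] += dx[j]

print(grad_wte)  # [[0.0, 0.0], [0.25, -0.5], [0.0, 0.0]]
```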

Collecting the result

    loss = (1 / n) * sum(losses)
    grad_flat = [g for mat in grad.values() for row in mat for g in row]
    return loss, grad_flat

After processing all token pairs, we flatten the gradient dictionary into a single list — same shape as params — so the training loop can pair each gradient with its parameter.
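A minimal sketch of that flattening, on an invented two-matrix grad:

```python
grad = {
    'wte': [[0.1, 0.2], [0.3, 0.4]],
    'mlp_fc1': [[0.5, 0.6]],
}

# Flatten in dict order: every row of 'wte', then every row of 'mlp_fc1'.
grad_flat = [g for mat in grad.values() for row in mat for g in row]
print(grad_flat)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

This relies on Python dicts preserving insertion order (guaranteed since Python 3.7), so grad.values() flattens in the same order as state_dict.values() and each gradient lines up with its parameter.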

Why This Matters

The numerical gradient needs 2,480 forward passes, one per parameter. The analytic gradient needs one forward pass and one backward pass — roughly two to three times the cost of a single forward pass, no matter how many parameters there are.

This is what makes training neural networks practical. Without backpropagation, even this tiny 2,480-parameter model would be too slow to train.
