MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 1.5 The Forward Pass 1.7 Analytic Gradient →
Step 1: Gradient Descent › 1.6

Numerical Gradient

So far

  • forward(tokens, n) — computes average loss
  • params — flat list of all 2,480 parameters

In Step 0, we didn’t need gradients — we just incremented the right count. But with 2,480 parameters in a neural network, we need to know: for each parameter, which direction should we nudge it to reduce the loss?

That’s the gradient: the slope of the loss with respect to each parameter.

The simplest way to estimate it is to just try:

def numerical_gradient(tokens, n):
    loss = forward(tokens, n)
    eps = 1e-5
    grad = []
    for mat in state_dict.values():
        for row in mat:
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps
                loss_plus = forward(tokens, n)
                row[j] = old
                grad.append((loss_plus - loss) / eps)
    return loss, grad

For each parameter:

  • old = row[j]: save the current value
  • row[j] = old + eps: nudge it up by a tiny amount (0.00001)
  • loss_plus = forward(tokens, n): re-run the entire model to see the new loss
  • (loss_plus - loss) / eps: the slope, i.e. how much the loss changed per unit of nudge
  • row[j] = old: put it back before testing the next parameter
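The nudge-measure-restore loop above can be sketched in isolation on a toy "model" (the one-matrix state_dict and the quadratic forward below are hypothetical stand-ins for the guide's real ones, not its actual code):

```python
# Toy stand-in for the model: a single 1x2 parameter matrix.
state_dict = {"w": [[2.0, 1.0]]}

def forward():
    # Hypothetical loss: L(a, b) = a^2 + 3b, so the true gradient is (2a, 3).
    a, b = state_dict["w"][0]
    return a * a + 3 * b

def numerical_gradient():
    loss = forward()          # baseline loss, computed once
    eps = 1e-5
    grad = []
    for mat in state_dict.values():
        for row in mat:
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps        # nudge up
                loss_plus = forward()     # re-run the whole model
                row[j] = old              # restore before the next parameter
                grad.append((loss_plus - loss) / eps)
    return loss, grad

loss, grad = numerical_gradient()
print(loss)   # 7.0
print(grad)   # close to the true gradient [4.0, 3.0]
```

At a = 2.0, b = 1.0 the true gradient is (4, 3), and the finite-difference estimate lands within about eps of it.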

Try it

Click anywhere on the curve to pick a parameter value, then drag the ε slider to see how the slope estimate changes:

This is the derivative from calculus, with ε kept small but finite rather than taken to the limit $\varepsilon \to 0$: $\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \varepsilon) - L(\theta)}{\varepsilon}$.
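What the ε slider shows can be reproduced numerically. Here is a sketch on a hypothetical one-parameter loss $L(\theta) = \theta^2$, whose true slope at $\theta = 3$ is $2\theta = 6$:

```python
def L(theta):
    # Hypothetical one-parameter loss; true slope at theta is 2 * theta.
    return theta * theta

theta = 3.0
for eps in (1e-1, 1e-3, 1e-5):
    estimate = (L(theta + eps) - L(theta)) / eps
    print(eps, estimate)  # 6.1, then 6.001, then ~6.00001
```

For this loss the estimate works out to exactly $6 + \varepsilon$, so shrinking ε walks it toward the true slope of 6. (Make ε too small, though, and floating-point rounding starts to dominate, which is why the forward pass uses a fixed eps = 1e-5 rather than something tinier.)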

It works, but it’s extremely slow: we need one full forward pass for each of the 2,480 parameters. That’s 2,480 forward passes per training step.
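The cost is easy to tally: one baseline pass plus one nudged pass per parameter, every training step (the step count below is a hypothetical illustration, not a number from the guide):

```python
n_params = 2480            # parameter count from this guide
passes_per_step = 1 + n_params   # one baseline pass + one pass per nudge
steps = 1000               # hypothetical training run length
print(passes_per_step * steps)   # 2,481,000 full forward passes
```

Millions of forward passes to train a tiny model, and the count grows linearly with parameters, which is exactly why the next section replaces this with an analytic gradient.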

We need a faster way.
