Numerical Gradient
So far we have:

- `forward(tokens, n)` — computes the average loss
- `params` — flat list of all 2,480 parameters
In Step 0, we didn’t need gradients — we just incremented the right count. But with 2,480 parameters in a neural network, we need to know: for each parameter, which direction should we nudge it to reduce the loss?
That’s the gradient: the slope of the loss with respect to each parameter.
The simplest way to estimate it is to just try:
```python
def numerical_gradient(tokens, n):
    loss = forward(tokens, n)
    eps = 1e-5
    grad = []
    for mat in state_dict.values():
        for row in mat:
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps
                loss_plus = forward(tokens, n)
                row[j] = old
                grad.append((loss_plus - loss) / eps)
    return loss, grad
```
For each parameter:

| Code | Meaning |
| --- | --- |
| `old = row[j]` | Save the current value |
| `row[j] = old + eps` | Nudge it up by a tiny amount (0.00001) |
| `forward(tokens, n)` | Re-run the entire model to see the new loss |
| `(loss_plus - loss) / eps` | The slope: how much did the loss change per unit of nudge? |
| `row[j] = old` | Put it back before testing the next parameter |
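The save/nudge/restore loop is easiest to verify on a toy loss where the true gradient is known. The `forward` below is a stand-in quadratic over a flat parameter list, not the model's actual loss, so the finite-difference result can be checked against the analytic gradient `2*p`:

```python
# Toy stand-in: a quadratic "loss" whose true gradient is 2*p per parameter.
params = [0.5, -1.0, 2.0]

def forward():
    return sum(p * p for p in params)

def numerical_gradient():
    loss = forward()
    eps = 1e-5
    grad = []
    for j in range(len(params)):
        old = params[j]            # save the current value
        params[j] = old + eps      # nudge it up
        loss_plus = forward()      # re-run to see the new loss
        params[j] = old            # restore before the next parameter
        grad.append((loss_plus - loss) / eps)
    return loss, grad

loss, grad = numerical_gradient()
print(loss)   # 5.25
print(grad)   # ≈ [1.0, -2.0, 4.0], matching the analytic gradient 2*p
```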
This is the forward-difference approximation to the derivative from calculus, exact in the limit $\varepsilon \to 0$: $\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \varepsilon) - L(\theta)}{\varepsilon}$.
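You can watch the approximation tighten as ε shrinks on a function with a known derivative. The choice of `sin` here is purely for illustration (its true slope is `cos`):

```python
import math

theta = 1.0
true_slope = math.cos(theta)   # d/dθ sin(θ) = cos(θ)

errors = []
for eps in (1e-1, 1e-3, 1e-5):
    est = (math.sin(theta + eps) - math.sin(theta)) / eps
    errors.append(abs(est - true_slope))
    print(f"eps={eps:g}  estimate={est:.6f}  error={errors[-1]:.1e}")
```

Each thousand-fold reduction in ε cuts the error by roughly the same factor, until floating-point round-off eventually dominates for very small ε.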
It works, but it’s extremely slow: we need one full forward pass for each of the 2,480 parameters. That’s 2,480 forward passes per training step.
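The cost can be made concrete by counting calls. This sketch wraps a dummy `forward` with a counter (the names and the quadratic loss are illustrative, not the article's real model):

```python
n_params = 2480
params = [0.0] * n_params
calls = 0

def forward():
    global calls
    calls += 1               # count every full forward pass
    return sum(p * p for p in params)

def numerical_gradient():
    loss = forward()         # 1 baseline pass
    eps = 1e-5
    grad = []
    for j in range(n_params):   # + 1 pass per parameter
        old = params[j]
        params[j] = old + eps
        loss_plus = forward()
        params[j] = old
        grad.append((loss_plus - loss) / eps)
    return loss, grad

numerical_gradient()
print(calls)   # 2481: one baseline pass plus one per parameter
```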
We need a faster way.