MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 1.7 Analytic Gradient 1.9 Training & Inference →
Step 1: Gradient Descent › 1.8

Gradient Check

So far

  • numerical_gradient() — slow but correct
  • analytic_gradient() — fast, uses chain rule

How do we know the analytic gradient is correct? We check it against the numerical gradient:

doc = docs[0]  # one training example: a single name
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]  # BOS-delimited token ids
n = len(tokens) - 1  # number of next-token predictions
loss_n, grad_n = numerical_gradient(tokens, n)  # slow but trusted
loss_a, grad_a = analytic_gradient(tokens, n)   # fast, via the chain rule
grad_diff = max(abs(gn - ga) for gn, ga in zip(grad_n, grad_a))  # worst disagreement
print(f"gradient check | loss_n {loss_n:.6f} | loss_a {loss_a:.6f} | max diff {grad_diff:.8f}")

Take one name. Compute the gradient both ways. Compare.

The losses should be identical (both run the same forward pass). The gradients should be nearly identical — differing only by floating-point noise (around $10^{-7}$ or smaller).
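A threshold like $10^{-7}$ is not arbitrary. A finite-difference gradient carries two error sources: truncation error, which shrinks with the step size $h$, and floating-point round-off, which grows as $h$ shrinks. A toy sweep (on a hypothetical scalar function, not the tutorial's `numerical_gradient`) makes the trade-off visible:

```python
import math

# Toy illustration: a central-difference estimate of d/dx sin(x) at x = 1,
# for several step sizes h. Truncation error shrinks like h**2, but float
# round-off grows as h shrinks, so accuracy bottoms out somewhere between.
x = 1.0
exact = math.cos(x)  # known analytic derivative of sin
for h in (1e-2, 1e-4, 1e-6, 1e-10):
    approx = (math.sin(x + h) - math.sin(x - h)) / (2 * h)
    print(f"h = {h:.0e} | error = {abs(approx - exact):.2e}")
```

The error falls as `h` decreases, then rises again once round-off dominates, which is why "nearly identical, up to float noise" is the best we can expect.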

If the max difference were large, it would mean the chain rule derivation has a bug. This is the standard sanity check in neural network development: always gradient-check your backward pass.
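The same pattern applies to any model, not just this one. Here is a self-contained sketch on a two-parameter toy loss (hypothetical names, not the tutorial's functions):

```python
def loss(w):
    # toy loss: L(w) = w0**2 + 3*w0*w1
    return w[0] ** 2 + 3 * w[0] * w[1]

def analytic_grad(w):
    # hand-derived: dL/dw0 = 2*w0 + 3*w1, dL/dw1 = 3*w0
    return [2 * w[0] + 3 * w[1], 3 * w[0]]

def numerical_grad(w, h=1e-5):
    # central difference, perturbing one parameter at a time
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += h
        wm[i] -= h
        grad.append((loss(wp) - loss(wm)) / (2 * h))
    return grad

w = [0.5, -1.2]
diff = max(abs(gn - ga) for gn, ga in zip(numerical_grad(w), analytic_grad(w)))
print(f"max diff {diff:.2e}")  # tiny: the two methods agree up to float noise
```

If the derivatives in `analytic_grad` had a sign or index error, `diff` would jump to order 1 instead of vanishing, which is exactly how the check catches chain-rule bugs.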

Try it

Each dot is one of the 2,480 parameters, computed on the name “yuheng.” If the two methods agree, every dot falls on the diagonal:

(Interactive scatter plot: numerical gradient vs. analytic gradient, one dot per parameter.)

Once verified, we never use the numerical gradient again — it was only needed to validate the fast version.
