Gradient Check
So far:
- numerical_gradient() — slow but correct
- analytic_gradient() — fast, uses the chain rule
How do we know the analytic gradient is correct? We check it against the numerical gradient:
doc = docs[0]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
n = len(tokens) - 1
loss_n, grad_n = numerical_gradient(tokens, n)
loss_a, grad_a = analytic_gradient(tokens, n)
grad_diff = max(abs(gn - ga) for gn, ga in zip(grad_n, grad_a))
print(f"gradient check | loss_n {loss_n:.6f} | loss_a {loss_a:.6f} | max diff {grad_diff:.8f}")
Take one name. Compute the gradient both ways. Compare.
The losses should be identical (both run the same forward pass). The gradients should be nearly identical — differing only by floating-point noise (around $10^{-7}$ or smaller).
If the max difference were large, it would mean the chain rule derivation has a bug. This is the standard sanity check in neural network development: always gradient-check your backward pass.
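The same idea can be shown on a toy problem. The sketch below (a standalone illustration, not the tutorial's actual model: the loss, parameter values, and function names here are made up) compares a central-difference numerical gradient against a hand-derived analytic gradient for a small function with a known derivative:

```python
import math

# Toy loss standing in for the model's loss: L(w) = sum_i w_i^2 + sin(w_0)
def loss(params):
    return sum(w * w for w in params) + math.sin(params[0])

# Analytic gradient via calculus: dL/dw_i = 2*w_i, plus cos(w_0) for i = 0
def analytic_gradient(params):
    grad = [2 * w for w in params]
    grad[0] += math.cos(params[0])
    return grad

# Numerical gradient via central difference:
# dL/dw_i ≈ (L(w + eps*e_i) - L(w - eps*e_i)) / (2*eps)
def numerical_gradient(params, eps=1e-5):
    grad = []
    for i in range(len(params)):
        up = params[:]; up[i] += eps
        dn = params[:]; dn[i] -= eps
        grad.append((loss(up) - loss(dn)) / (2 * eps))
    return grad

params = [0.5, -1.2, 0.3]
gn = numerical_gradient(params)
ga = analytic_gradient(params)
max_diff = max(abs(a - b) for a, b in zip(gn, ga))
print(f"max diff {max_diff:.2e}")
```

With the central difference, the approximation error shrinks like eps², so the printed difference should be far below the 1e-7 tolerance mentioned above; a large value would indicate a bug in the analytic derivative.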
Each dot is one of the 2,480 parameters, computed on the name “yuheng.” If the two methods agree, every dot falls on the diagonal.

[scatter plot: numerical gradient vs. analytic gradient, one dot per parameter]
Once verified, we never use the numerical gradient again — it was only needed to validate the fast version.