What Changes
The model architecture is done. Step 4 gave us multi-head attention, a configurable layer loop, and per-layer KV caches — a complete GPT transformer. The only thing left is how we train it.
The problem with SGD
Since Step 1, we’ve used stochastic gradient descent (SGD) with a linearly decaying learning rate. SGD is simple: compute the gradient, multiply by the learning rate, subtract from the parameter.
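That update rule is small enough to sketch in full. This is a minimal illustrative version (the function name and flat-list parameter layout are placeholders, not the exact code from Step 1):

```python
def sgd_step(params, grads, lr):
    """One SGD update: p <- p - lr * g for every parameter."""
    for p, g in zip(params, grads):
        p_new = [pi - lr * gi for pi, gi in zip(p, g)]
        p[:] = p_new  # update in place

# A tiny worked example: one weight vector, one gradient.
weights = [[0.5, -0.2]]
grads = [[0.1, 0.3]]
sgd_step(weights, grads, lr=0.1)
print(weights)  # [[0.49, -0.23000000000000004]]
```

Note that `lr` is the only knob: every parameter, everywhere in the model, moves by exactly `lr` times its gradient.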
But SGD treats every parameter identically. The same learning rate applies to the token embeddings, the attention weights, and the output projection — even though their gradients behave very differently. Some parameters get large, noisy gradients. Others get small, consistent ones. A single learning rate can’t be right for all of them.
What Adam adds
Adam — short for Adaptive Moment Estimation — tracks two running statistics for each parameter:
- Momentum (first moment) — a smoothed average of recent gradients. Instead of reacting to each gradient in isolation, Adam remembers which direction a parameter has been moving and continues in that direction.
- Adaptive rate (second moment) — a smoothed average of recent squared gradients. Parameters with large, volatile gradients get smaller effective step sizes. Parameters with small, stable gradients get larger ones.
Together, these give each parameter its own effective learning rate that adapts over time.
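The two moments combine into one update rule. Here is a minimal sketch of a single Adam step; the hyperparameter names `beta1`, `beta2`, and `eps` follow the standard Adam formulation, and the flat-list layout is illustrative rather than the exact code from this step:

```python
import math

def adam_step(params, grads, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are per-parameter buffers; t is the step count (1-based)."""
    for i in range(len(params)):
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]       # first moment: momentum
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2  # second moment: adaptive rate
        m_hat = m[i] / (1 - beta1 ** t)                    # bias correction (buffers start at 0)
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size

params, grads = [1.0], [0.5]
m, v = [0.0], [0.0]  # the two extra buffers, initialized to zero
adam_step(params, grads, m, v, t=1)
print(params)  # ~[0.99]: the first step moves by roughly lr, regardless of gradient scale
```

On the very first step the bias-corrected moments reduce to the raw gradient and its square, so the update is approximately `lr * sign(gradient)` — a concrete example of the adaptive scaling: large and small gradients produce steps of similar size.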
What stays the same
Everything. The model, the dataset, the autograd, the inference loop — all identical to Step 4. The only changes are in the training loop.
What’s new
- Adam optimizer — momentum + adaptive learning rates
- Lower base learning rate (0.01 vs 0.1) — Adam doesn’t need as aggressive a rate
- Optimizer buffers — two extra arrays (`m` and `v`) tracking per-parameter statistics
After this step, the code is functionally identical to Karpathy’s original `train.py`.