Adam
Adam replaces SGD with per-parameter momentum and adaptive learning rates. The final piece. This is the complete microGPT.
The big idea: SGD treats every parameter the same. Adam gives each one its own effective step size, adapting to how noisy or stable its gradient has been. A small change in code, a big change in optimisation.
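To make that concrete, here is a minimal sketch of a standard Adam update in plain Python. The function name `adam_step` and the hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are illustrative defaults, not necessarily the names or values microGPT uses; it shows the two running moments and how they give each parameter its own effective step size.

```python
def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of scalar parameters (illustrative sketch).

    m, v: running first/second moment estimates, same length as params
    t:    1-based step counter, used for bias correction
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g         # momentum: smoothed gradient
        v[i] = beta2 * v[i] + (1 - beta2) * g * g     # smoothed squared gradient
        m_hat = m[i] / (1 - beta1 ** t)               # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)  # each parameter gets its own step size
    return params, m, v


# toy usage: two parameters whose gradients differ by 100x still take similar-sized steps
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    grads = [1.0, 0.01]
    params, m, v = adam_step(params, grads, m, v, t)
```

The division by the square root of the second moment is the adaptive part: a parameter with consistently large gradients gets its step scaled down, while one with tiny gradients still makes meaningful progress.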