Adam
Adam replaces SGD with per-parameter momentum and adaptive learning rates. The final piece. This is the complete microGPT.
The big idea: SGD treats every parameter the same. Adam gives each one its own effective step size, adapting to how noisy or stable its gradient has been. A small change in code, a big change in optimisation.
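To make that concrete, here is a minimal sketch of a standard Adam update in plain Python. The function name `adam_step` and the hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are illustrative defaults, not necessarily the names or values microGPT uses; it shows the two running moments and how they give each parameter its own effective step size.

```python
def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of scalar parameters (illustrative sketch).

    m, v: running first/second moment estimates, same length as params
    t:    1-based step counter, used for bias correction
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g         # momentum: smoothed gradient
        v[i] = beta2 * v[i] + (1 - beta2) * g * g     # smoothed squared gradient
        m_hat = m[i] / (1 - beta1 ** t)               # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)  # each parameter gets its own step size
    return params, m, v


# toy usage: two parameters whose gradients differ by 100x still take similar-sized steps
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    grads = [1.0, 0.01]
    params, m, v = adam_step(params, grads, m, v, t)
```

The division by the square root of the second moment is the adaptive part: a parameter with consistently large gradients gets its step scaled down, while one with tiny gradients still makes meaningful progress.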