MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 5: Adam › 5.3

The Code

Previously Defined

  • params — the list of all model parameters (from Step 4)
  • Exponential moving averages for momentum (\(m\)) and adaptive rate (\(v\))
  • Bias correction: divide each average by \(1 - \beta^t\), using its own \(\beta\)
  • Full update: \(\theta_t = \theta_{t-1} - \alpha_t \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\)

Hyperparameters

learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
  • learning_rate = 0.01: 10× smaller than SGD’s 0.1, since Adam adapts internally
  • beta1 = 0.85: \(\beta_1\), momentum decay (blends 85% old + 15% new gradient)
  • beta2 = 0.99: \(\beta_2\), adaptive-rate decay (longer memory for gradient magnitude)
  • eps_adam = 1e-8: \(\epsilon\), prevents division by zero when \(v\) is tiny

Note that beta1 and beta2 differ from the standard Adam defaults (0.9 and 0.999). Karpathy’s values are tuned for this small model: lower \(\beta_1\) means less momentum, and lower \(\beta_2\) means a shorter memory for gradient scale, so the adaptive rate tracks recent gradient magnitudes more closely.
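A quick way to read these decay rates: an exponential moving average with decay \(\beta\) weights past values by \(\beta^k\), so it effectively averages over roughly \(1/(1-\beta)\) recent steps. A small sketch of that rule of thumb (the comparisons to the defaults are for illustration, not from the source):

```python
# Effective averaging window of an EMA with decay beta: about 1 / (1 - beta) steps.
def ema_window(beta):
    return 1.0 / (1.0 - beta)

for name, beta in [("beta1 = 0.85", 0.85), ("Adam default beta1 = 0.9", 0.9),
                   ("beta2 = 0.99", 0.99), ("Adam default beta2 = 0.999", 0.999)]:
    print(f"{name}: ~{ema_window(beta):.0f}-step memory")
```

So beta1 = 0.85 remembers about 7 gradients instead of the default’s 10, and beta2 = 0.99 remembers about 100 squared gradients instead of 1,000.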

Optimizer buffers

m = [0.0] * len(params) # first moment buffer
v = [0.0] * len(params) # second moment buffer

One entry per parameter. With 4,192 parameters, that’s two arrays of 4,192 floats — the per-parameter memory that makes Adam adaptive. SGD had no buffers at all.

The update rule

SGD’s update was a single line — multiply gradient by learning rate, subtract:

$$\theta_t = \theta_{t-1} - \alpha_t \cdot g_t$$

Adam replaces this with five lines that implement the full update from 5.2:

        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
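In the training loop, these five lines sit inside a loop over the flat parameter list, indexed so each parameter keeps its own m[i] and v[i]. A minimal self-contained sketch, assuming a hypothetical Param stand-in with .data and .grad, a toy quadratic loss, and a constant lr_t (in the real loop a schedule may vary it per step):

```python
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8

class Param:  # hypothetical stand-in for the model's scalar parameters
    def __init__(self, data):
        self.data, self.grad = data, 0.0

params = [Param(0.5), Param(-0.3)]
m = [0.0] * len(params)  # first moment buffer
v = [0.0] * len(params)  # second moment buffer

for step in range(3):
    for p in params:
        p.grad = 2 * p.data       # pretend the loss is p.data ** 2
    lr_t = learning_rate          # a schedule could modify this per step
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)

print([round(p.data, 4) for p in params])  # both moved toward the minimum at 0
```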

Line by line, mapping code to math:

  • m[i] = beta1 * m[i] + (1 - beta1) * p.grad → \(m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\)
  • v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2 → \(v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\)
  • m_hat = m[i] / (1 - beta1 ** (step + 1)) → \(\hat{m}_t = m_t / (1 - \beta_1^t)\)
  • v_hat = v[i] / (1 - beta2 ** (step + 1)) → \(\hat{v}_t = v_t / (1 - \beta_2^t)\)
  • p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam) → \(\theta_t = \theta_{t-1} - \alpha_t \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\)

The division by \(\sqrt{\hat{v}_t} + \epsilon\) is where the adaptation happens. Parameters with large recent gradients (large \(\hat{v}_t\)) get smaller steps. Parameters with small gradients get larger steps. The \(\epsilon\) prevents division by zero when a parameter’s gradient has been near zero.
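One consequence worth seeing numerically: on the very first step, bias correction makes \(\hat{m}_t = g_t\) and \(\hat{v}_t = g_t^2\), so the update is about \(\pm\alpha\) regardless of the gradient’s magnitude. A small illustration of this normalization (a sketch, not code from the guide):

```python
beta1, beta2, eps_adam, lr = 0.85, 0.99, 1e-8, 0.01

def adam_first_step(grad):
    # One bias-corrected update starting from zeroed buffers (step = 0).
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)   # simplifies to grad
    v_hat = v / (1 - beta2)   # simplifies to grad ** 2
    return lr * m_hat / (v_hat ** 0.5 + eps_adam)

print(adam_first_step(100.0))  # step size ≈ lr, despite a huge gradient
print(adam_first_step(0.001))  # step size ≈ lr, despite a tiny gradient
```

Gradients a hundred thousand times apart produce nearly identical step sizes, which is exactly the scale invariance that lets Adam use one learning rate across all 4,192 parameters.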

Try it

Step through Adam’s update for a single parameter. Click to advance one line at a time and watch the numbers flow through the five-line update:
