MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 5: Adam › 5.3

The Code

Previously Defined

  • params — the list of all model parameters (from Step 4)
  • Exponential moving averages for momentum (\(m\)) and adaptive rate (\(v\))
  • Bias correction: divide each average by \(1 - \beta^t\), using its own \(\beta\)
  • Full update: \(\theta_t = \theta_{t-1} - \alpha_t \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\)

Hyperparameters

learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
  • learning_rate = 0.01: 10× smaller than SGD’s 0.1, since Adam adapts internally
  • beta1 = 0.85: \(\beta_1\), momentum decay (blends 85% old + 15% new gradient)
  • beta2 = 0.99: \(\beta_2\), adaptive-rate decay (longer memory for gradient magnitude)
  • eps_adam = 1e-8: \(\epsilon\), prevents division by zero when \(v\) is tiny

Note that beta1 and beta2 differ from the standard Adam defaults (0.9 and 0.999). Karpathy’s values are tuned for this small model: lower \(\beta_1\) means less momentum, and lower \(\beta_2\) means a shorter memory for gradient scale, so the adaptive rate tracks recent gradient magnitudes more closely.
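A quick way to read these decay rates: an exponential moving average with decay \(\beta\) weights past values by \(\beta^k\), so it effectively averages over roughly \(1/(1-\beta)\) recent steps. A small sketch of that rule of thumb (the comparisons to the defaults are for illustration, not from the source):

```python
# Effective averaging window of an EMA with decay beta: about 1 / (1 - beta) steps.
def ema_window(beta):
    return 1.0 / (1.0 - beta)

for name, beta in [("beta1 = 0.85", 0.85), ("Adam default beta1 = 0.9", 0.9),
                   ("beta2 = 0.99", 0.99), ("Adam default beta2 = 0.999", 0.999)]:
    print(f"{name}: ~{ema_window(beta):.0f}-step memory")
```

So beta1 = 0.85 remembers about 7 gradients instead of the default’s 10, and beta2 = 0.99 remembers about 100 squared gradients instead of 1,000.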

Optimizer buffers

m = [0.0] * len(params) # first moment buffer
v = [0.0] * len(params) # second moment buffer

One entry per parameter. With 4,192 parameters, that’s two arrays of 4,192 floats — the per-parameter memory that makes Adam adaptive. SGD had no buffers at all.

The update rule

SGD’s update was a single line — multiply gradient by learning rate, subtract:

$$\theta_t = \theta_{t-1} - \alpha_t \cdot g_t$$

Adam replaces this with five lines that implement the full update from 5.2:

        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
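In the training loop, these five lines sit inside a loop over the flat parameter list, indexed so each parameter keeps its own m[i] and v[i]. A minimal self-contained sketch, assuming a hypothetical Param stand-in with .data and .grad, a toy quadratic loss, and a constant lr_t (in the real loop a schedule may vary it per step):

```python
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8

class Param:  # hypothetical stand-in for the model's scalar parameters
    def __init__(self, data):
        self.data, self.grad = data, 0.0

params = [Param(0.5), Param(-0.3)]
m = [0.0] * len(params)  # first moment buffer
v = [0.0] * len(params)  # second moment buffer

for step in range(3):
    for p in params:
        p.grad = 2 * p.data       # pretend the loss is p.data ** 2
    lr_t = learning_rate          # a schedule could modify this per step
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)

print([round(p.data, 4) for p in params])  # both moved toward the minimum at 0
```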

Line by line, mapping code to math:

  • m[i] = beta1 * m[i] + (1 - beta1) * p.grad → \(m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\)
  • v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2 → \(v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\)
  • m_hat = m[i] / (1 - beta1 ** (step + 1)) → \(\hat{m}_t = m_t / (1 - \beta_1^t)\)
  • v_hat = v[i] / (1 - beta2 ** (step + 1)) → \(\hat{v}_t = v_t / (1 - \beta_2^t)\)
  • p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam) → \(\theta_t = \theta_{t-1} - \alpha_t \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\)

The division by \(\sqrt{\hat{v}_t} + \epsilon\) is where the adaptation happens. Parameters with large recent gradients (large \(\hat{v}_t\)) get smaller steps. Parameters with small gradients get larger steps. The \(\epsilon\) prevents division by zero when a parameter’s gradient has been near zero.
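One consequence worth seeing numerically: on the very first step, bias correction makes \(\hat{m}_t = g_t\) and \(\hat{v}_t = g_t^2\), so the update is about \(\pm\alpha\) regardless of the gradient’s magnitude. A small illustration of this normalization (a sketch, not code from the guide):

```python
beta1, beta2, eps_adam, lr = 0.85, 0.99, 1e-8, 0.01

def adam_first_step(grad):
    # One bias-corrected update starting from zeroed buffers (step = 0).
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)   # simplifies to grad
    v_hat = v / (1 - beta2)   # simplifies to grad ** 2
    return lr * m_hat / (v_hat ** 0.5 + eps_adam)

print(adam_first_step(100.0))  # step size ≈ lr, despite a huge gradient
print(adam_first_step(0.001))  # step size ≈ lr, despite a tiny gradient
```

Gradients a hundred thousand times apart produce nearly identical step sizes, which is exactly the scale invariance that lets Adam use one learning rate across all 4,192 parameters.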

Try it

Step through Adam’s update for a single parameter. Click to advance one line at a time and watch the numbers flow through the five-line update:
