MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 5: Adam › 5.2

Momentum and Adaptive Rates

Exponential moving averages

Both of Adam’s statistics use the same technique: an exponential moving average (EMA). Instead of keeping a running sum or storing every past gradient, an EMA blends the new value with the old:

$$\text{ema}_t = \beta \cdot \text{ema}_{t-1} + (1 - \beta) \cdot x_t$$

When \(\beta\) is close to 1 (like 0.85 or 0.99), the EMA changes slowly — it remembers the past. When \(\beta\) is small, it reacts quickly to new values. The choice of \(\beta\) controls how far back the “memory” reaches.

Try it

Adjust β to see how the EMA (orange) tracks a noisy gradient signal (blue). Higher β = smoother but slower to react:
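The EMA recurrence is a few lines of code. Here is a minimal sketch (the noisy signal is synthetic, purely for illustration — it stands in for the demo's blue curve):

```python
import numpy as np

def ema_trace(xs, beta):
    """Exponential moving average: ema = beta * ema + (1 - beta) * x."""
    ema, out = 0.0, []
    for x in xs:
        ema = beta * ema + (1 - beta) * x
        out.append(ema)
    return out

# A noisy "gradient" signal: true value 1.0 plus Gaussian noise.
rng = np.random.default_rng(0)
signal = 1.0 + rng.normal(0.0, 0.5, size=200)

smooth = ema_trace(signal, beta=0.99)  # long memory: heavily smoothed
fast = ema_trace(signal, beta=0.5)     # short memory: tracks each new sample
```

Comparing the two traces shows the trade-off from the demo: the high-β trace is much smoother, but takes many steps before it approaches the true value of 1.0.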

First moment: momentum

The first moment \(m\) is an EMA of the gradient itself:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

With \(\beta_1 = 0.85\), this blends 85% of the previous momentum with 15% of the current gradient \(g_t\). The effect: if a parameter’s gradient keeps pointing the same direction, momentum builds and the parameter moves faster. If the gradient oscillates, the positive and negative values partially cancel, damping the noise.

This is similar to a ball rolling downhill — it picks up speed in consistent directions and resists sudden changes.
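A quick sketch of both behaviors, using this guide's \(\beta_1 = 0.85\) (the two gradient sequences are hypothetical, chosen to contrast the cases):

```python
beta1 = 0.85  # momentum decay, as used in this guide

def momentum_trace(grads):
    """EMA of the gradient: m = beta1 * m + (1 - beta1) * g."""
    m, out = 0.0, []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        out.append(m)
    return out

consistent = momentum_trace([1.0] * 20)         # same sign: m builds toward 1.0
oscillating = momentum_trace([1.0, -1.0] * 10)  # alternating: mostly cancels
```

After 20 steps the consistent sequence has built momentum close to the gradient's value, while the oscillating one hovers near zero — the damping described above.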

Second moment: adaptive rate

The second moment \(v\) is an EMA of the squared gradient:

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

With \(\beta_2 = 0.99\), this tracks the recent magnitude of the gradient (regardless of direction). A parameter with large, noisy gradients will have a large \(v\). A parameter with small, stable gradients will have a small \(v\).

The update divides by \(\sqrt{v}\) — so parameters with large \(v\) take smaller steps, and parameters with small \(v\) take larger steps. Each parameter automatically gets a step size matched to its gradient’s scale.
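A sketch of this effect with \(\beta_2 = 0.99\) (the two gradient sequences are made up to contrast a large-noisy parameter with a small-stable one):

```python
import math

beta2 = 0.99  # second-moment decay, as used in this guide

def second_moment(grads):
    """EMA of the squared gradient: v = beta2 * v + (1 - beta2) * g^2."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
    return v

v_large = second_moment([10.0, -9.0, 11.0, -10.0] * 250)  # big, noisy gradients
v_small = second_moment([0.01] * 1000)                     # tiny, stable gradients

# The per-parameter step scale is ~ 1 / sqrt(v):
scale_large = 1 / math.sqrt(v_large)  # large v -> small steps
scale_small = 1 / math.sqrt(v_small)  # small v -> large steps
```

Note that \(v\) ignores sign: the alternating ±10-ish gradients still produce a large \(v\), because they are squared before averaging.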

Bias correction

Both \(m\) and \(v\) start at zero and are multiplied by \(\beta\) each step, so they’re biased toward zero early in training. Adam corrects for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

At step 1, \(1 - \beta_1^1 = 0.15\), so \(\hat{m}_t = m_t / 0.15\) — a large correction. By step 100, \(1 - \beta_1^{100} \approx 1.0\) — the correction vanishes. This ensures the first few updates aren’t artificially small.
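A minimal check of the correction, assuming a constant gradient of 1.0 so the "true" average is exactly 1.0: the raw \(m\) is biased low early on, while the corrected \(\hat{m}\) recovers 1.0 from the very first step.

```python
beta1 = 0.85
g = 1.0   # pretend every gradient is exactly 1.0
m = 0.0

for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
    # With constant g, m_t = 1 - beta1^t, so m_hat is 1.0 at every step
    # while the raw m is still climbing toward 1.0.
```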

The full update

Putting it all together, Adam’s update rule for each parameter is:

$$\theta_t = \theta_{t-1} - \alpha_t \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where \(\alpha_t\) is the learning rate (with decay), \(\hat{m}_t\) provides direction and momentum, \(\sqrt{\hat{v}_t}\) provides per-parameter scaling, and \(\epsilon\) (a tiny constant like \(10^{-8}\)) prevents division by zero.
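Putting the pieces into one function — a single-scalar sketch using this guide's \(\beta\) values; the learning rate, step count, and the toy objective \(f(\theta) = \theta^2\) are illustrative choices, not from the text:

```python
import math

def adam_step(theta, m, v, g, t, alpha=0.1,
              beta1=0.85, beta2=0.99, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * g        # first moment: direction + momentum
    v = beta2 * v + (1 - beta2) * g * g    # second moment: gradient scale
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2 * theta), starting from theta = 5.0:
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, g=2 * theta, t=t)
# theta ends near the minimum at 0
```

Note that `t` starts at 1, not 0 — with `t = 0` the bias-correction denominators \(1 - \beta^t\) would be zero.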
