Momentum and Adaptive Rates
Exponential moving averages
Both of Adam’s statistics use the same technique: an exponential moving average (EMA). Instead of keeping a running sum or storing every past gradient, an EMA blends the new value with the old:
$$\text{ema}_t = \beta \cdot \text{ema}_{t-1} + (1 - \beta) \cdot x_t$$
When \(\beta\) is close to 1 (like 0.85 or 0.99), the EMA changes slowly — it remembers the past. When \(\beta\) is small, it reacts quickly to new values. The choice of \(\beta\) controls how far back the “memory” reaches: roughly the last \(1/(1-\beta)\) values, so \(\beta = 0.99\) averages over about the last 100 gradients.
Try it
Adjust β to see how the EMA (orange) tracks a noisy gradient signal (blue). Higher β = smoother but slower to react:
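The behavior the demo illustrates can be sketched in a few lines of Python. This is a minimal illustration, not library code; `ema_trace` is a hypothetical helper, and seeding the EMA at the first value (rather than zero) is an assumption made here to sidestep the startup bias discussed below:

```python
def ema_trace(values, beta):
    """Return the exponential moving average of a sequence,
    seeded at the first value to avoid startup bias."""
    ema = values[0]
    out = [ema]
    for x in values[1:]:
        ema = beta * ema + (1 - beta) * x
        out.append(ema)
    return out

# Starting from 0 and feeding a constant 1.0 shows the blend directly:
# each step closes (1 - beta) of the remaining gap to the input.
trace = ema_trace([0.0, 1.0, 1.0, 1.0], beta=0.5)
# trace == [0.0, 0.5, 0.75, 0.875]
```

A higher `beta` would close the gap more slowly, which is exactly the “smoother but slower to react” trade-off in the demo.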
First moment: momentum
The first moment \(m\) is an EMA of the gradient itself:
$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$
With \(\beta_1 = 0.85\), this blends 85% of the previous momentum with 15% of the current gradient \(g_t\). The effect: if a parameter’s gradient keeps pointing the same direction, momentum builds and the parameter moves faster. If the gradient oscillates, the positive and negative values partially cancel, damping the noise.
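Both effects — buildup under consistent gradients and cancellation under oscillating ones — are easy to verify numerically. A minimal sketch (the `momentum_step` helper is hypothetical, using the \(\beta_1 = 0.85\) from the text):

```python
def momentum_step(m, g, beta1=0.85):
    """First-moment update: blend previous momentum with the new gradient."""
    return beta1 * m + (1 - beta1) * g

# Consistent gradients: momentum converges toward the gradient value.
m_consistent = 0.0
for _ in range(50):
    m_consistent = momentum_step(m_consistent, 1.0)
# m_consistent is now very close to 1.0

# Oscillating gradients: positive and negative values largely cancel.
m_oscillating = 0.0
for g in [1.0, -1.0] * 25:
    m_oscillating = momentum_step(m_oscillating, g)
# m_oscillating stays small (magnitude under 0.1)
```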
This is similar to a ball rolling downhill — it picks up speed in consistent directions and resists sudden changes.
Second moment: adaptive rate
The second moment \(v\) is an EMA of the squared gradient:
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$
With \(\beta_2 = 0.99\), this tracks the recent magnitude of the gradient (regardless of direction). A parameter with large, noisy gradients will have a large \(v\). A parameter with small, stable gradients will have a small \(v\).
The update divides by \(\sqrt{v}\) — so parameters with large \(v\) take smaller steps, and parameters with small \(v\) take larger steps. Each parameter automatically gets a step size matched to its gradient’s scale.
Bias correction
Both \(m\) and \(v\) start at zero and are multiplied by \(\beta\) each step, so they’re biased toward zero early in training. Adam corrects for this:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
At step 1, \(1 - \beta_1^1 = 0.15\), so \(\hat{m}_t = m_t / 0.15\) — a large correction. By step 100, \(1 - \beta_1^{100} \approx 1.0\) — the correction vanishes. This ensures the first few updates aren’t artificially small.
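One way to see that the correction works: feed a constant value into an EMA started at zero. The raw EMA is too small at first, but the corrected estimate recovers the true value exactly at every step, since \(m_t = x(1 - \beta^t)\) for constant input \(x\). A minimal sketch (`ema_with_correction` is a hypothetical helper):

```python
def ema_with_correction(values, beta):
    """Zero-initialized EMA with Adam-style bias correction at each step."""
    ema = 0.0
    corrected = []
    for t, x in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * x
        corrected.append(ema / (1 - beta ** t))
    return corrected

# For a constant input of 2.0 with beta = 0.85, the raw EMA at step 1
# would be only 0.3, but the corrected estimate is 2.0 from the start.
corrected = ema_with_correction([2.0] * 5, beta=0.85)
# corrected ≈ [2.0, 2.0, 2.0, 2.0, 2.0]
```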
The full update
Putting it all together, Adam’s update rule for each parameter is:
$$\theta_t = \theta_{t-1} - \alpha_t \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where \(\alpha_t\) is the learning rate (with decay), \(\hat{m}_t\) provides direction and momentum, \(\sqrt{\hat{v}_t}\) provides per-parameter scaling, and \(\epsilon\) (a tiny constant like \(10^{-8}\)) prevents division by zero.
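The full update translates almost line for line into code. The sketch below is a scalar, single-parameter version for clarity (a real implementation would vectorize over all parameters); `adam_step` and its `state` dict are illustrative names, the betas match the values used in this text, and the classic Adam defaults are \(\beta_1 = 0.9\), \(\beta_2 = 0.999\):

```python
import math

def adam_step(theta, g, state, lr=0.001, beta1=0.85, beta2=0.99, eps=1e-8):
    """One Adam update for a single scalar parameter.
    `state` holds the running moments m, v and the step count t."""
    state["t"] += 1
    t = state["t"]
    # Update the two exponential moving averages.
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Bias-correct both moments.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Scaled parameter update.
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

# Minimize f(theta) = theta^2, whose gradient is 2*theta.
state = {"m": 0.0, "v": 0.0, "t": 0}
theta = 3.0
for _ in range(2000):
    theta = adam_step(theta, 2 * theta, state, lr=0.01)
# theta ends up near the minimum at 0
```

Note how on the very first step the bias-corrected ratio \(\hat{m}/\sqrt{\hat{v}}\) has magnitude close to 1, so the update is approximately the learning rate itself, as the bias-correction section promised.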