Adam Optimizer¶
Formula¶
\[
m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t,\quad
v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2
\]
\[
\hat m_t=\frac{m_t}{1-\beta_1^t},\quad
\hat v_t=\frac{v_t}{1-\beta_2^t}
\]
\[
\theta_{t+1}=\theta_t-\eta \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}
\]
Parameters¶
- \(g_t\): gradient at step \(t\)
- \(m_t,v_t\): first and second moment estimates (exponential moving averages)
- \(\hat m_t,\hat v_t\): bias-corrected moment estimates
- \(\beta_1,\beta_2\): decay rates for the moment estimates
- \(\eta\): learning rate
- \(\epsilon\): small constant added for numerical stability
What it means¶
Adam combines momentum (an exponential moving average of gradients) with adaptive per-parameter step sizes (each update is scaled by an exponential moving average of squared gradients).
What it's used for¶
- Default optimizer for many deep learning models.
- Fast convergence on noisy, high-dimensional problems.
Key properties¶
- Uses bias correction to counteract the zero initialization of the moment estimates during early steps.
- Often works with less tuning than SGD.
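As a worked illustration of the bias correction: at \(t=1\) the raw first moment is shrunk toward its zero initialization, and dividing by \(1-\beta_1^t\) undoes exactly that shrinkage.
\[
m_1=(1-\beta_1)g_1,\qquad
\hat m_1=\frac{m_1}{1-\beta_1^{1}}=g_1
\]
With \(\beta_1=0.9\), the raw \(m_1\) is only \(0.1\,g_1\), while the corrected \(\hat m_1\) recovers the full gradient.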
Common gotchas¶
- Can generalize worse than SGD in some settings.
- Weight decay should usually use AdamW-style decoupling.
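To make the decoupling concrete, here is a minimal NumPy sketch of a single AdamW-style step; the function name `adamw_step` and the `weight_decay` coefficient are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW-style update (illustrative sketch): weight decay is applied
    directly to the parameters instead of being folded into the gradient,
    so it never contaminates the moment estimates m and v."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```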
Example¶
Transformer training often starts with Adam/AdamW plus learning-rate warmup.
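A minimal sketch of that recipe in PyTorch, assuming a stand-in linear model, random data, and illustrative hyperparameters (the learning rate, warmup length, and weight decay here are not prescribed values):

```python
import torch

model = torch.nn.Linear(512, 512)            # stand-in for a Transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)
warmup_steps = 1000                           # linear warmup, then constant lr
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()             # dummy loss just to drive updates
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```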
How to Compute (Pseudocode)¶
Input: gradients g_t, parameters theta, lr eta, betas beta1,beta2, epsilon, steps T
Output: updated parameters theta
initialize m <- 0, v <- 0
for t from 1 to T:
    m <- beta1 * m + (1 - beta1) * g_t
    v <- beta2 * v + (1 - beta2) * (g_t * g_t)
    m_hat <- m / (1 - beta1^t)
    v_hat <- v / (1 - beta2^t)
    theta <- theta - eta * m_hat / (sqrt(v_hat) + epsilon)
return theta
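For completeness, a small runnable NumPy translation of the pseudocode above; it is a sketch that assumes the gradient is supplied as a callable `grad(theta)`, and the toy quadratic at the end is only for demonstration.

```python
import numpy as np

def adam(grad, theta0, steps, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam, mirroring the pseudocode above.
    `grad(theta)` returns the gradient of the objective at theta."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment accumulator
    v = np.zeros_like(theta)   # second-moment accumulator
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
print(adam(lambda th: 2 * th, theta0=[1.0, -2.0], steps=2000, eta=0.05))
```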
Complexity¶
- Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
- Space: \(O(p)\) additional space for first- and second-moment accumulators (about 2 extra parameter-sized buffers)
- Assumptions: \(p\) parameters; bias-corrected Adam update shown; backprop/gradient computation usually dominates end-to-end training cost
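As a rough worked example of the extra memory: a model with \(10^9\) parameters and 32-bit optimizer state carries about \(2 \times 4\ \text{bytes} \times 10^9 \approx 8\) GB of additional Adam state, on top of the parameters and gradients themselves.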