AdamW Optimizer¶
Formula¶
\[
\theta_{t+1}=\theta_t-\eta \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}-\eta\lambda\theta_t
\]
Parameters¶
- \(\theta_t\): parameters
- \(\hat m_t,\hat v_t\): Adam moment estimates
- \(\eta\): learning rate
- \(\lambda\): weight decay coefficient
- \(\epsilon\): small constant for numerical stability
What it means¶
AdamW applies decoupled weight decay, separating parameter shrinkage from the gradient-based Adam update.
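Written as code, the update is literally two separate assignments. The following is a minimal NumPy sketch of one step, assuming the bias-corrected moments \(\hat m_t,\hat v_t\) have already been computed for this step; the function name and default hyperparameters are illustrative.

import numpy as np

def adamw_step(theta, m_hat, v_hat, eta=1e-3, lam=0.01, eps=1e-8):
    """One decoupled AdamW update: Adam step first, then separate shrinkage."""
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # gradient-based Adam step
    theta = theta - eta * lam * theta                      # decoupled weight decay
    return theta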
What it's used for¶
- Modern training of Transformers and large neural networks.
- Better control of regularization than L2-inside-Adam.
Key properties¶
- Decoupled weight decay behaves differently from adding \(\lambda\|\theta\|^2\) to the loss under Adam; see the comparison sketch after this list.
- Usually preferred over vanilla Adam for deep learning.
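To make the first property concrete, here is a small NumPy comparison of a single step taken from zero moment buffers; the numbers, the helper name adam_direction, and the hyperparameter choices are illustrative assumptions, not a reference implementation.

import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
eta, lam = 1e-3, 0.1
theta = np.array([2.0])
g = np.array([0.01])          # raw loss gradient for this step

def adam_direction(grad, t=1):
    # One Adam step from zero moment buffers, with bias correction at step t.
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return eta * m_hat / (np.sqrt(v_hat) + eps)

# Coupled L2: the gradient of lam * ||theta||^2 (= 2 * lam * theta) is folded
# into g before the moment estimates, so the shrinkage is rescaled by the
# adaptive denominator along with everything else.
theta_l2 = theta - adam_direction(g + 2 * lam * theta)

# Decoupled (AdamW): the decay term is applied outside the Adam step.
theta_adamw = theta - adam_direction(g) - eta * lam * theta

print(theta_l2, theta_adamw)   # the two resulting parameters differ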
Common gotchas¶
- Excluding biases/norm parameters from weight decay is common but implementation-specific; see the parameter-group sketch after this list.
- "weight_decay" arguments in different libraries may implement either coupled L2 regularization or decoupled decay, so the same value can behave differently across frameworks.
Example¶
A common setup is AdamW with warmup + cosine decay for Transformer training.
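A sketch of that setup, assuming PyTorch and a hand-written warmup + cosine schedule; the stand-in model, base learning rate, betas, decay value, and step counts are illustrative, not recommended defaults.

import math
import torch

model = torch.nn.Linear(512, 512)            # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    # Linear warmup, then cosine decay toward zero; multiplies the base lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop sketch: call optimizer.step() then scheduler.step() each iteration.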
How to Compute (Pseudocode)¶
Input: gradients g_t, parameters theta, lr eta, Adam hyperparameters (beta1, beta2, epsilon), weight decay lambda, steps T
Output: updated parameters theta
initialize moment buffers m <- 0, v <- 0
for t = 1..T:
    m <- beta1 * m + (1 - beta1) * g_t              # first moment estimate
    v <- beta2 * v + (1 - beta2) * g_t^2            # second moment estimate
    m_hat <- m / (1 - beta1^t)                      # bias correction
    v_hat <- v / (1 - beta2^t)
    theta <- theta - eta * m_hat / (sqrt(v_hat) + epsilon)   # Adam step
    theta <- theta - eta * lambda * theta                    # decoupled weight decay
return theta
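A direct NumPy translation of the pseudocode, in case a runnable version is useful; the grad_fn callable and the toy quadratic objective are illustrative assumptions.

import numpy as np

def adamw(grad_fn, theta, eta=1e-3, beta1=0.9, beta2=0.999,
          eps=1e-8, lam=0.01, T=1000):
    """Decoupled AdamW, following the pseudocode above."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # Adam step
        theta = theta - eta * lam * theta                      # decoupled decay
    return theta

# Toy usage: minimize ||theta - 1||^2, with decay pulling slightly toward 0.
theta0 = np.full(5, 5.0)
theta_star = adamw(lambda th: 2 * (th - 1.0), theta0, eta=1e-2, T=5000)
print(theta_star)   # approaches 1; weight decay biases it slightly toward 0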
Complexity¶
- Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
- Space: \(O(p)\) additional Adam moment buffers, similar to Adam
- Assumptions: \(p\) parameters; decoupled decay shown conceptually (implementations may fuse the operations)