
Weight Decay

Formula

\[ \theta_{t+1}=(1-\eta\lambda)\theta_t - \eta g_t \]

Parameters

  • \(\theta_t\): parameters
  • \(g_t\): gradient of the loss (or optimizer-produced update direction) at step \(t\)
  • \(\eta\): learning rate
  • \(\lambda\): decay strength

What it means

Weight decay shrinks parameters toward zero during training, discouraging overly large weights.
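A minimal numeric sketch of this shrinkage: with the gradient set to zero, the update multiplies the parameter by \((1-\eta\lambda)\) each step, so it decays geometrically. The values of eta and lmbda below are illustrative, not recommendations.

```python
# Pure decay: with g_t = 0, theta shrinks by (1 - eta * lmbda) per step.
eta, lmbda = 0.1, 0.01   # illustrative values
theta = 1.0
for _ in range(100):
    theta = (1 - eta * lmbda) * theta  # g_t = 0: pure decay
# After 100 steps: theta = 0.999**100 ≈ 0.905
```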

What it's used for

  • Regularization in neural network training.
  • Improving generalization and optimization stability.

Key properties

  • In plain SGD, exactly equivalent to L2 regularization: adding \(\lambda\theta_t\) to the gradient yields the same update.
  • In adaptive optimizers, the two differ, because an L2 penalty folded into the gradient gets rescaled by the adaptive preconditioner; decoupled decay (AdamW) is often preferred.
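The SGD case can be checked in one step: folding \(\lambda\theta_t\) into the gradient and applying a plain SGD update gives exactly the decoupled formula. A sketch with illustrative scalar values:

```python
# One SGD step, two formulations; values are illustrative.
eta, lmbda, g = 0.1, 0.01, 0.5
theta0 = 2.0

# (a) L2 penalty folded into the gradient: theta - eta * (g + lmbda * theta)
theta_l2 = theta0 - eta * (g + lmbda * theta0)

# (b) Decoupled decay applied directly: (1 - eta*lmbda) * theta - eta * g
theta_decoupled = (1 - eta * lmbda) * theta0 - eta * g

assert abs(theta_l2 - theta_decoupled) < 1e-12  # identical in plain SGD
```

The equivalence breaks under Adam-style updates, where the \(\lambda\theta_t\) term in (a) would be divided by the adaptive denominator while (b) applies decay at full strength.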

Common gotchas

  • Some parameters (biases, norm scales) are often excluded.
  • Library APIs may implement L2 penalty instead of decoupled decay.
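One common way to handle the exclusion is to split parameters into decay and no-decay groups by name. The names and the suffix rule below are illustrative assumptions, not any particular library's convention:

```python
# Hypothetical parameter dict; keys and values are illustrative.
params = {
    "linear.weight": [0.5, -0.3],
    "linear.bias": [0.1],
    "norm.scale": [1.0],
}

# Assumed exclusion rule: biases and norm scales get no decay.
NO_DECAY_SUFFIXES = ("bias", "scale")

decay = {n: p for n, p in params.items()
         if not n.endswith(NO_DECAY_SUFFIXES)}
no_decay = {n: p for n, p in params.items()
            if n.endswith(NO_DECAY_SUFFIXES)}
# decay -> {"linear.weight": ...}; no_decay -> the bias and scale entries
```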

Example

Applying weight decay to a large linear layer steadily shrinks its weight magnitudes unless the data gradient pushes in the opposite direction; weights the loss does not support decay toward zero.
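This balance can be made concrete: with a constant gradient \(g\), setting \(\theta_{t+1}=\theta_t\) in the update gives the fixed point \(\theta^\* = -g/\lambda\), where decay and gradient pull cancel. A sketch with illustrative values:

```python
# With a constant gradient, iterating the update converges to -g / lmbda
# (solve theta = (1 - eta*lmbda) * theta - eta * g). Values are illustrative.
eta, lmbda, g = 0.1, 0.1, -0.2   # constant "data gradient"
theta = 0.0
for _ in range(2000):
    theta = (1 - eta * lmbda) * theta - eta * g
# theta approaches -g / lmbda = 2.0
```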

How to Compute (Pseudocode)

Input: parameters theta, gradient g, learning rate eta, decay lambda
Output: updated parameters theta

theta <- (1 - eta * lambda) * theta - eta * g
return theta
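The pseudocode above translates directly into a vectorized update; this is a minimal sketch using NumPy, and the function name is an illustrative choice.

```python
import numpy as np

def weight_decay_step(theta, g, eta, lmbda):
    """One decoupled weight-decay SGD step: (1 - eta*lmbda)*theta - eta*g."""
    return (1.0 - eta * lmbda) * theta - eta * g

theta = np.array([1.0, -2.0, 0.5])
g = np.array([0.1, 0.0, -0.2])
theta = weight_decay_step(theta, g, eta=0.1, lmbda=0.01)
# -> [0.989, -1.998, 0.5195]
```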

Complexity

  • Time: \(O(p)\) elementwise parameter updates once gradients are available
  • Space: \(O(1)\) extra optimizer state for plain decoupled decay (beyond parameter/gradient storage)
  • Assumptions: \(p\) parameters; shown as a per-step update rule (overall training cost scales with the number of optimization steps)

See also