Weight Decay¶
Formula¶
\[
\theta_{t+1}=(1-\eta\lambda)\theta_t - \eta g_t
\]
Parameters¶
- \(\theta_t\): parameters at step \(t\)
- \(g_t\): gradient (or the optimizer's update direction, e.g. with momentum) at step \(t\)
- \(\eta\): learning rate
- \(\lambda\): weight decay coefficient
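As a worked one-step example with illustrative values, take \(\eta = 0.1\), \(\lambda = 0.01\), \(\theta_t = 2.0\), and \(g_t = 0.5\):
\[
\theta_{t+1} = (1 - 0.1 \cdot 0.01) \cdot 2.0 - 0.1 \cdot 0.5 = 0.999 \cdot 2.0 - 0.05 = 1.948
\]
The decay factor \(0.999\) shaves off a fixed fraction of the weight before the gradient step is applied.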
What it means¶
Each update multiplies the parameters by \(1-\eta\lambda\) before applying the gradient step, shrinking them toward zero over the course of training and discouraging overly large weights.
What it's used for¶
- Regularization in neural network training.
- Improving generalization and optimization stability.
Key properties¶
- Under plain SGD, exactly equivalent to adding the L2 penalty \(\frac{\lambda}{2}\lVert\theta\rVert_2^2\) to the loss (see the derivation after this list).
- With adaptive optimizers, the equivalence breaks because the penalty gradient is rescaled by the adaptive terms, so decoupled decay (AdamW) is often preferred.
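To see the SGD equivalence claimed above, add the penalty \(\frac{\lambda}{2}\lVert\theta\rVert_2^2\) to the loss and take one SGD step:
\[
\theta_{t+1} = \theta_t - \eta\left(g_t + \lambda\theta_t\right) = (1-\eta\lambda)\theta_t - \eta g_t,
\]
which is exactly the decoupled update in the formula above. Adaptive optimizers divide \(g_t + \lambda\theta_t\) by a per-coordinate scale, so the two forms no longer coincide.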
Common gotchas¶
- Some parameters (biases, normalization scales) are typically excluded from decay (see the sketch after this list).
- Some library APIs implement an L2 penalty added to the gradient rather than decoupled decay; check which behavior you are getting.
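A minimal sketch of per-group exclusion, assuming PyTorch's torch.optim.AdamW. The model, layer sizes, and hyperparameters are illustrative, and the "one-dimensional tensors are biases or norm scales" rule is a common heuristic, not a universal one:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.LayerNorm(64),
    torch.nn.Linear(64, 10),
)

# Heuristic: 1-D tensors are biases or norm scales; leave them undecayed.
decay, no_decay = [], []
for p in model.parameters():
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},    # decayed group
        {"params": no_decay, "weight_decay": 0.0},  # excluded group
    ],
    lr=1e-3,
)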
Example¶
Applying weight decay to a large linear layer shrinks its weight magnitudes geometrically unless the data gradient pushes them back up; the numeric sketch below makes this concrete.
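A small numeric sketch of that shrinkage in NumPy, with illustrative values; the gradient is set to zero to isolate the decay effect:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)     # stand-in for a large linear layer's weights
eta, lam = 0.1, 0.1               # illustrative values; decay factor is 0.99

initial_norm = np.linalg.norm(theta)
for _ in range(100):
    g = np.zeros_like(theta)      # no data gradient, so only decay acts
    theta = (1 - eta * lam) * theta - eta * g

# Norm shrinks geometrically: the ratio is 0.99**100, about 0.366
print(np.linalg.norm(theta) / initial_norm)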
How to Compute (Pseudocode)¶
Input: parameters theta, gradient g, learning rate eta, decay lambda
Output: updated parameters theta
theta <- (1 - eta * lambda) * theta - eta * g
return theta
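The pseudocode translates directly to Python; the arrays and values here are illustrative:

import numpy as np

def weight_decay_step(theta, g, eta, lam):
    # One decoupled update: shrink theta, then take the gradient step.
    return (1.0 - eta * lam) * theta - eta * g

theta = np.array([1.0, -2.0, 0.5])
g = np.array([0.2, -0.1, 0.0])
theta = weight_decay_step(theta, g, eta=0.1, lam=0.01)  # -> [0.979, -1.988, 0.4995]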
Complexity¶
- Time: \(O(p)\) elementwise parameter updates once gradients are available
- Space: \(O(1)\) extra optimizer state for plain decoupled decay (beyond parameter/gradient storage)
- Assumptions: \(p\) parameters; shown as a per-step update rule (overall training cost scales with the number of optimization steps)