Gradient Clipping¶
Formula¶
\[
g \leftarrow g \cdot \min\!\left(1,\frac{\tau}{\|g\|}\right)
\]
Parameters¶
- \(g\): gradient vector (or concatenated parameter gradients)
- \(\tau\): clipping threshold (max norm)
What it means¶
Gradient clipping limits gradient magnitude to prevent unstable, excessively large parameter updates.
What it's used for¶
- Stabilizing training (especially RNNs and large models).
- Reducing exploding-gradient failures.
Key properties¶
- Norm clipping preserves gradient direction when clipping occurs.
- Value clipping (clamping each component to \([-\tau, \tau]\)) is a distinct method that does not preserve direction.
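The distinction between the two methods can be sketched as follows (a minimal illustration assuming NumPy arrays; the function names are placeholders):

```python
import numpy as np

def clip_by_norm(g, tau):
    # Norm clipping: rescale the whole vector so that ||g|| <= tau,
    # preserving its direction.
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)
    return g

def clip_by_value(g, tau):
    # Value clipping: clamp each component to [-tau, tau];
    # this can change the gradient's direction.
    return np.clip(g, -tau, tau)

g = np.array([3.0, 4.0])      # ||g|| = 5
print(clip_by_norm(g, 1.0))   # direction preserved: [0.6, 0.8]
print(clip_by_value(g, 1.0))  # direction changed:   [1.0, 1.0]
```

Note how value clipping rotates the gradient toward the diagonal, while norm clipping only shrinks its length.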
Common gotchas¶
- Thresholds set too small shrink informative gradients and slow learning.
- Clipping frequency can indicate upstream instability (LR too high, bad initialization).
Example¶
If \(\|g\|=10\) and \(\tau=1\), the gradient is scaled by \(\tau/\|g\| = 0.1\), giving a clipped norm of exactly 1.
How to Compute (Pseudocode)¶
Input: gradient vector/tensor collection g, clipping threshold tau
Output: clipped gradient g_clipped
norm_g <- l2_norm(g)
scale <- min(1, tau / max(norm_g, eps))   # eps (e.g. 1e-6) guards against division by zero
g_clipped <- scale * g
return g_clipped
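The pseudocode above extends naturally to a list of parameter gradients, clipping by the global norm of their concatenation (a sketch using NumPy; frameworks provide equivalents, e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

def clip_grad_norm(grads, tau, eps=1e-6):
    """Scale every gradient in `grads` in place so that the global
    L2 norm of their concatenation is at most `tau`."""
    # Global norm over all tensors, as if concatenated into one vector.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, tau / (total_norm + eps))  # eps avoids division by zero
    for g in grads:
        g *= scale
    return total_norm

grads = [np.array([6.0, 8.0]), np.zeros(3)]  # global norm = 10
clip_grad_norm(grads, 1.0)                   # scales by ~0.1, as in the example above
```

Returning the pre-clipping norm (as real implementations typically do) makes it easy to log how often clipping fires, which, per the gotchas above, is itself a useful diagnostic.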
Complexity¶
- Time: \(O(p)\) to compute the gradient norm and rescale \(p\) parameters (once gradients are available)
- Space: \(O(1)\) extra space beyond the gradient storage (or \(O(p)\) if writing to a separate clipped copy)
- Assumptions: Norm clipping shown; value clipping and per-parameter-group clipping use different rules but similar linear-time scans