
Gradient Clipping

Formula

\[ g \leftarrow g \cdot \min\!\left(1,\frac{\tau}{\|g\|}\right) \]

Parameters

  • \(g\): gradient vector (or concatenated parameter gradients)
  • \(\|g\|\): L2 (Euclidean) norm of the gradient
  • \(\tau\): clipping threshold (max norm)

What it means

Gradient clipping limits gradient magnitude to prevent unstable, excessively large parameter updates.

What it's used for

  • Stabilizing training (especially RNNs and large models).
  • Reducing exploding-gradient failures.

Key properties

  • Norm clipping preserves gradient direction when clipping occurs.
  • Value clipping (clamping each component to \([-\tau, \tau]\)) is a different method and does not preserve direction; a sketch contrasting the two follows below.
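
A minimal NumPy sketch contrasting the two approaches (the array values and threshold below are illustrative assumptions, not taken from this page):

import numpy as np

def clip_by_norm(g, tau):
    # Norm clipping: rescale the whole vector, preserving its direction.
    norm = np.linalg.norm(g)
    return g * min(1.0, tau / norm) if norm > 0 else g

def clip_by_value(g, tau):
    # Value clipping: clamp each component independently,
    # which can change the gradient's direction.
    return np.clip(g, -tau, tau)

g = np.array([3.0, 4.0])         # norm 5
print(clip_by_norm(g, 1.0))      # [0.6 0.8] -- same direction, norm 1
print(clip_by_value(g, 1.0))     # [1. 1.]  -- direction changed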

Common gotchas

  • Too-small thresholds slow learning.
  • Clipping frequency can indicate upstream instability (learning rate too high, bad initialization); a simple way to track the clip rate is sketched below.
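
One way to monitor how often clipping fires, shown as a sketch; the random stand-in gradients and the threshold/step count are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(0)
tau, num_steps, clip_events = 1.0, 100, 0

for step in range(num_steps):
    # Stand-in gradient; in a real loop this comes from the backward pass.
    g = rng.normal(size=10) * (5.0 if step % 10 == 0 else 0.2)
    norm = np.linalg.norm(g)
    if norm > tau:
        clip_events += 1
        g = g * (tau / norm)

print(f"clipped on {clip_events}/{num_steps} steps")
# A persistently high clip rate usually points at the learning rate or
# initialization rather than the clipping threshold itself.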

Example

If \(\|g\|=10\) and \(\tau=1\), then \(\min\!\left(1, \tau/\|g\|\right)=0.1\), so the gradient is scaled down by a factor of \(0.1\) and its norm becomes \(1\).
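
The same arithmetic checked in a short NumPy snippet (the specific vector [6, 8] is an assumption chosen so its norm is 10):

import numpy as np

g = np.array([6.0, 8.0])                     # ||g|| = 10
tau = 1.0
scale = min(1.0, tau / np.linalg.norm(g))    # min(1, 1/10) = 0.1
print(scale * g, np.linalg.norm(scale * g))  # [0.6 0.8], norm 1.0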

How to Compute (Pseudocode)

Input: gradient vector/tensor collection g, clipping threshold tau
Output: clipped gradient g_clipped

norm_g <- l2_norm(g)
scale <- min(1, tau / max(norm_g, epsilon))   # epsilon guards against a zero norm
g_clipped <- scale * g
return g_clipped
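
A runnable sketch of the pseudocode for a collection of gradient arrays (global-norm clipping). The epsilon guard is an assumption the formula itself omits, and the sample arrays are illustrative:

import numpy as np

def clip_gradients(grads, tau, eps=1e-12):
    # Global L2 norm over every array in the collection.
    norm_g = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # eps guards against division by zero when all gradients are zero.
    scale = min(1.0, tau / (norm_g + eps))
    return [scale * g for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0, 0.0, 12.0])]   # global norm 13
clipped = clip_gradients(grads, tau=1.0)
print(np.sqrt(sum(float(np.sum(c * c)) for c in clipped)))   # ~1.0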

Complexity

  • Time: \(O(p)\) to compute the gradient norm and rescale \(p\) parameters (once gradients are available)
  • Space: \(O(1)\) extra space beyond the gradient storage (or \(O(p)\) if writing to a separate clipped copy)
  • Assumptions: Norm clipping shown; value clipping and per-parameter-group clipping use different rules but similar linear-time scans

See also