
Momentum (SGD with Momentum)

Formula

\[ v_t = \beta v_{t-1} + g_t,\qquad \theta_{t+1}=\theta_t-\eta v_t \]

Parameters

  • \(g_t\): stochastic gradient at step \(t\)
  • \(v_t\): velocity, the accumulated (exponentially weighted) update direction
  • \(\beta\): momentum coefficient, typically around 0.9
  • \(\eta\): learning rate
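
Unrolling the recursion (with \(v_0 = 0\)) makes the accumulation explicit: the velocity is an exponentially weighted sum of past gradients,

\[ v_t = \sum_{k=1}^{t} \beta^{t-k} g_k \]

so recent gradients count fully while older ones are discounted by powers of \(\beta\).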

What it means

Momentum smooths noisy mini-batch gradients by accumulating a running direction (the velocity), which often speeds up optimization.

What it's used for

  • Faster SGD training in deep networks.
  • Reducing oscillations in steep directions.

Key properties

  • Adds inertia to updates (quantified below).
  • Works especially well with learning-rate schedules.
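
To see the inertia quantitatively (a short derivation from the update rule above): if the gradient were constant at \(g\), the velocity would follow a geometric series,

\[ v_t = g\,\frac{1-\beta^t}{1-\beta} \;\longrightarrow\; \frac{g}{1-\beta}, \]

so the effective step size approaches \(\eta/(1-\beta)\), a 10x amplification at \(\beta = 0.9\). This is one reason learning rates are typically retuned or scheduled when momentum is introduced.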

Common gotchas

  • Too-large \(\beta\) or learning rate can destabilize training.
  • Implementation conventions differ (classical vs. damped variants; see the sketch below).
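
A minimal sketch of the two conventions mentioned above, in Python. The classical form matches the formula at the top of this page; the damped form scales the incoming gradient by (1 - dampening), as in, for example, PyTorch's SGD dampening option. Function names and default values here are illustrative, not any particular library's API.

def classical_momentum_step(theta, v, g, eta=0.01, beta=0.9):
    # Classical: accumulate gradients at full strength.
    v = beta * v + g
    theta = theta - eta * v
    return theta, v

def damped_momentum_step(theta, v, g, eta=0.01, beta=0.9, dampening=0.9):
    # Damped: scale the new gradient; when dampening == beta the velocity
    # is an exponential moving average and stays on the scale of one gradient.
    v = beta * v + (1 - dampening) * g
    theta = theta - eta * v
    return theta, v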

Example

With momentum, parameters keep moving in a consistent downhill direction even if mini-batch gradients are noisy.
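
As a concrete illustration with \(\beta = 0.9\): a gradient that is consistently \(+1\) pushes the velocity toward \(1/(1-\beta) = 10\), while a gradient that alternates \(+1, -1, +1, \dots\) leaves the velocity oscillating around magnitude \(1/(1+\beta) \approx 0.53\). Persistent directions are amplified; oscillating ones are damped.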

How to Compute (Pseudocode)

Input: stochastic gradients g_t, parameters theta, learning rate eta, momentum beta, steps T
Output: updated parameters theta

initialize velocity v <- 0
for t from 1 to T:
  v <- beta * v + g_t
  theta <- theta - eta * v
return theta
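
The same procedure as a minimal runnable sketch in Python/NumPy, applied to a toy quadratic \(f(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2\) so the gradient is simply \(\theta\); the objective and hyperparameter values are illustrative assumptions, not part of the definition above.

import numpy as np

def sgd_momentum(grad_fn, theta, eta=0.1, beta=0.9, steps=200):
    # Classical momentum: v <- beta * v + g, theta <- theta - eta * v
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)       # (stochastic) gradient at the current parameters
        v = beta * v + g         # accumulate velocity
        theta = theta - eta * v  # take a step along the velocity
    return theta

theta0 = np.array([5.0, -3.0])
theta_final = sgd_momentum(lambda th: th, theta0)  # gradient of 0.5*||theta||^2 is theta
print(theta_final)  # approaches [0, 0]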

Complexity

  • Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
  • Space: \(O(p)\) additional space for the velocity vector
  • Assumptions: \(p\) parameters; optimizer update cost is usually small relative to backprop/gradient computation

See also