Momentum (SGD with Momentum)¶
Formula¶
\[
v_t = \beta v_{t-1} + g_t,\qquad
\theta_{t+1}=\theta_t-\eta v_t
\]
Parameters¶
- \(g_t\): gradient
- \(v_t\): velocity / running direction
- \(\beta\): momentum coefficient
- \(\eta\): learning rate
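One scalar step makes the update concrete. The values below (\(\beta = 0.9\), \(\eta = 0.1\), a constant gradient of 1) are illustrative assumptions, not from the source:

```python
# Illustrative scalar momentum steps (assumed values: beta=0.9, eta=0.1).
beta, eta = 0.9, 0.1
theta, v = 0.0, 0.0
for t in range(3):
    g = 1.0                  # constant gradient g_t for illustration
    v = beta * v + g         # v_t = beta * v_{t-1} + g_t
    theta = theta - eta * v  # theta_{t+1} = theta_t - eta * v_t
print(v, theta)
```

Note that with a constant gradient the velocity approaches \(1/(1-\beta) = 10\), so the effective step size approaches \(\eta/(1-\beta)\); this is why \(\beta\) and \(\eta\) must be tuned together.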
What it means¶
Momentum smooths noisy stochastic gradients by accumulating an exponentially weighted running direction (the velocity), which often speeds convergence along consistent descent directions.
What it's used for¶
- Faster SGD training in deep networks.
- Reducing oscillations in steep directions.
Key properties¶
- Adds inertia to updates.
- Works especially well with learning-rate schedules.
Common gotchas¶
- Too-large \(\beta\) or learning rate can destabilize training.
- Implementation conventions differ (e.g., whether the velocity accumulates raw gradients or learning-rate-scaled steps, and whether a dampening factor scales the gradient term), so \(\beta\) values are not always portable across frameworks.
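The convention difference can be sketched concretely. The two formulations below (names are illustrative) trace the same trajectory when \(\eta\) is fixed, but diverge if \(\eta\) is scheduled, which is why porting hyperparameters between conventions needs care:

```python
# Two common momentum conventions (sketch; variable names are illustrative):
#   (a) gradient-accumulating:  v = beta*v + g;      theta -= eta*v
#   (b) step-accumulating:      u = beta*u - eta*g;  theta += u
# With a fixed eta, u == -eta*v, so both trace the same trajectory.
beta, eta = 0.9, 0.1
theta_a = theta_b = 1.0
v = u = 0.0
for g in [0.5, -0.2, 0.3]:   # toy gradient sequence
    v = beta * v + g
    theta_a -= eta * v
    u = beta * u - eta * g
    theta_b += u
assert abs(theta_a - theta_b) < 1e-9
```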
Example¶
With momentum, parameters keep moving in a consistent downhill direction even if mini-batch gradients are noisy.
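A minimal simulation illustrates this smoothing effect. The setup is an assumption for demonstration: minimize \(f(\theta) = \tfrac{1}{2}\theta^2\) (true gradient \(\theta\)) with Gaussian noise added to each "mini-batch" gradient:

```python
import random

# Toy demo (assumed setup): noisy gradients of f(theta) = 0.5 * theta^2.
random.seed(0)
beta, eta = 0.9, 0.01
theta, v = 5.0, 0.0
for _ in range(200):
    g = theta + random.gauss(0.0, 1.0)  # noisy stochastic gradient
    v = beta * v + g                    # velocity averages out the noise
    theta -= eta * v
print(theta)  # ends near the optimum at 0 despite the noise
```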
How to Compute (Pseudocode)¶
Input: stochastic gradients g_t, parameters theta, learning rate eta, momentum beta, steps T
Output: updated parameters theta
initialize velocity v <- 0
for t from 1 to T:
    obtain stochastic gradient g_t at theta
    v <- beta * v + g_t
    theta <- theta - eta * v
return theta
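The pseudocode above can be rendered directly in Python. This is a sketch, not a production optimizer: plain lists stand in for parameter vectors, and `grad_fn` is an assumed callable returning the stochastic gradient at the current parameters:

```python
# Direct rendering of the pseudocode (grad_fn is an assumed callable).
def sgd_momentum(theta, grad_fn, eta, beta, T):
    v = [0.0] * len(theta)                             # velocity v <- 0
    for _ in range(T):
        g = grad_fn(theta)                             # stochastic gradient g_t
        v = [beta * vi + gi for vi, gi in zip(v, g)]   # v <- beta*v + g_t
        theta = [ti - eta * vi for ti, vi in zip(theta, v)]  # theta <- theta - eta*v
    return theta

# Usage: minimize f(x, y) = x^2 + y^2, whose gradient is [2x, 2y].
result = sgd_momentum([3.0, -4.0], lambda th: [2 * t for t in th],
                      eta=0.05, beta=0.9, T=150)
```

Each iteration touches every parameter a constant number of times, matching the complexity figures given below.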
Complexity¶
- Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
- Space: \(O(p)\) additional space for the velocity vector
- Assumptions: \(p\) parameters; optimizer update cost is usually small relative to backprop/gradient computation