RMSProp

Formula

\[ s_t=\rho s_{t-1}+(1-\rho)g_t^2 \]
\[ \theta_{t+1}=\theta_t-\eta \frac{g_t}{\sqrt{s_t}+\epsilon} \]

Parameters

  • \(g_t\): gradient at step \(t\)
  • \(s_t\): running average of squared gradients
  • \(\rho\): decay rate of the running average (commonly 0.9)
  • \(\eta\): learning rate
  • \(\epsilon\): small constant added for numerical stability

What it means

RMSProp divides each update by a running root-mean-square estimate of recent gradient magnitudes, giving every parameter its own adaptive effective learning rate.

What it's used for

  • Training deep networks with noisy or poorly scaled gradients.
  • Historical precursor to Adam, which adds momentum and bias correction on top of the same squared-gradient average.

Key properties

  • Directions with persistently large gradients receive proportionally smaller updates.
  • Helps when gradient scales differ across parameters, as the sketch after this list illustrates.
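
A minimal sketch of this normalization effect in Python with NumPy; the two constant gradients are synthetic values chosen only to contrast scales.

import numpy as np

eta, rho, eps = 0.01, 0.9, 1e-8
g = np.array([100.0, 0.01])          # two parameters with very different gradient scales
s = np.zeros_like(g)                 # squared-gradient accumulator

for t in range(50):
    s = rho * s + (1 - rho) * g * g  # running average converges toward g^2
update = eta * g / (np.sqrt(s) + eps)
print(update)                        # both entries come out close to eta = 0.01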

Common gotchas

  • Often sensitive to learning-rate tuning.
  • Multiple variants exist across frameworks, e.g. whether \(\epsilon\) is added inside or outside the square root, and whether momentum or centering is applied; a usage sketch follows this list.
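
A minimal usage sketch, assuming PyTorch's torch.optim.RMSprop, where the decay rate \(\rho\) is exposed as alpha; the linear model and random batch are placeholders.

import torch

model = torch.nn.Linear(10, 1)                    # placeholder model
opt = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)

x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()                                        # applies the RMSProp update above
opt.zero_grad()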

Example

Parameters with consistently large gradients receive smaller effective step sizes; for a constant gradient the update magnitude approaches \(\eta\) regardless of the gradient's scale, as the worked case below shows.
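
A worked steady-state case, assuming a constant gradient \(g_t = g\): the accumulator converges to \(s_\infty = g^2\), so the update becomes

\[ \theta_{t+1}-\theta_t = -\eta \frac{g}{\sqrt{g^2}+\epsilon} \approx -\eta\,\mathrm{sign}(g) \quad \text{for } \epsilon \ll |g|. \]

With \(\eta = 0.01\), both \(g = 100\) and \(g = 0.1\) yield steps of magnitude \(\approx 0.01\).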

How to Compute (Pseudocode)

Input: initial parameters theta, learning rate eta, decay rho, epsilon, steps T
Output: updated parameters theta

initialize accumulator s <- 0                       # same shape as theta
for t from 1 to T:
  g_t <- gradient of the loss at current theta
  s <- rho * s + (1 - rho) * (g_t * g_t)            # elementwise running average of squared gradients
  theta <- theta - eta * g_t / (sqrt(s) + epsilon)  # elementwise adaptive step
return theta
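
A runnable sketch of this pseudocode in Python with NumPy; grad_fn and the poorly scaled quadratic below are illustrative assumptions, not part of the original.

import numpy as np

def rmsprop(grad_fn, theta, eta=1e-3, rho=0.9, eps=1e-8, steps=1000):
    """Run RMSProp for a fixed number of steps; grad_fn(theta) returns the loss gradient."""
    s = np.zeros_like(theta)                 # squared-gradient accumulator
    for _ in range(steps):
        g = grad_fn(theta)                   # gradient at current parameters
        s = rho * s + (1 - rho) * g * g      # elementwise running average
        theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta

# Example: minimize f(x) = 0.5 * (x1^2 + 100 * x2^2), whose gradient is (x1, 100 * x2)
scales = np.array([1.0, 100.0])
theta = rmsprop(lambda x: scales * x, np.array([1.0, 1.0]), eta=0.01)
print(theta)  # both coordinates are driven close to 0 despite the 100x scale gap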

Complexity

  • Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
  • Space: \(O(p)\) additional space for the squared-gradient accumulator
  • Assumptions: \(p\) parameters; elementwise operations shown for a dense parameter vector/tensor collection

See also