RMSProp¶
Formula¶
\[
s_t=\rho s_{t-1}+(1-\rho)g_t^2
\]
\[
\theta_{t+1}=\theta_t-\eta \frac{g_t}{\sqrt{s_t}+\epsilon}
\]
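As a concrete check of the two formulas, here is a single scalar update step; the numeric values (\(\rho = 0.9\), \(\eta = 0.1\), \(g = 2\)) are illustrative choices, not prescribed above:

```python
import math

# One RMSProp step on a scalar parameter (illustrative values).
rho, eta, eps = 0.9, 0.1, 1e-8    # decay rate, learning rate, epsilon
s_prev, theta, g = 0.0, 1.0, 2.0  # prior accumulator, parameter, gradient

s = rho * s_prev + (1 - rho) * g**2              # s_1 = 0.1 * 4 = 0.4
theta = theta - eta * g / (math.sqrt(s) + eps)   # theta_1 ≈ 1 - 0.2/0.632
```

Note the effective step, \(\eta g / \sqrt{s_1} \approx 0.316\), differs from the plain gradient step \(\eta g = 0.2\): the accumulator rescales it.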
Parameters¶
- \(g_t\): gradient at step \(t\)
- \(s_t\): exponential moving average of squared gradients
- \(\rho\): decay rate (commonly 0.9)
- \(\eta\): learning rate
- \(\epsilon\): small constant for numerical stability
What it means¶
RMSProp rescales updates by a running estimate of gradient magnitude, giving adaptive per-parameter learning rates.
What it's used for¶
- Training deep networks with noisy or non-stationary gradients.
- A historical precursor to Adam, which adds momentum and bias correction on top of the same squared-gradient average.
Key properties¶
- Large-gradient directions get smaller updates.
- Helps when gradient scales differ across parameters.
Common gotchas¶
- Often sensitive to learning-rate tuning.
- Multiple variants exist across frameworks (e.g., with momentum, or centered updates that subtract the mean gradient), and frameworks also differ in whether \(\epsilon\) is added inside or outside the square root.
Example¶
Parameters with consistently large gradients receive smaller effective step sizes.
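A small sketch of this effect, using two parameters whose gradient magnitudes differ by three orders of magnitude (the specific values and hyperparameters are assumptions for illustration):

```python
import numpy as np

# Two parameters with very different, but consistent, gradient scales.
rho, eta, eps = 0.9, 0.01, 1e-8
s = np.zeros(2)
theta = np.array([1.0, 1.0])
g = np.array([100.0, 0.1])  # consistently large vs. small gradients

for _ in range(50):
    s = rho * s + (1 - rho) * g**2
    step = eta * g / (np.sqrt(s) + eps)  # effective per-parameter step
    theta -= step

# Once s warms up, sqrt(s) approaches |g|, so both effective steps
# approach eta: the large-gradient direction is scaled down the most.
```

This is why RMSProp behaves like a per-parameter learning rate: the raw gradients differ by a factor of 1000, but the effective steps end up nearly equal.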
How to Compute (Pseudocode)¶
Input: gradients g_t, parameters theta, learning rate eta, decay rho, epsilon, steps T
Output: updated parameters theta

initialize accumulator s <- 0
for t from 1 to T:
    s <- rho * s + (1 - rho) * (g_t * g_t)            # elementwise square
    theta <- theta - eta * g_t / (sqrt(s) + epsilon)  # elementwise update
return theta
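The pseudocode above translates directly into a runnable NumPy sketch; the test problem (a badly scaled quadratic) and the hyperparameter defaults are assumptions chosen for illustration:

```python
import numpy as np

def rmsprop(grad_fn, theta, eta=0.01, rho=0.9, eps=1e-8, steps=500):
    """RMSProp loop following the pseudocode above.
    Defaults are common choices, not prescribed by the text."""
    s = np.zeros_like(theta)                      # accumulator s <- 0
    for _ in range(steps):
        g = grad_fn(theta)                        # gradient at step t
        s = rho * s + (1 - rho) * g * g           # elementwise square
        theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta

# Toy objective with badly scaled coordinates: f(x, y) = 100 x^2 + y^2.
grad = lambda th: np.array([200.0 * th[0], 2.0 * th[1]])
theta = rmsprop(grad, np.array([1.0, 1.0]))
# Both coordinates approach the minimum at the origin, despite the
# 100x difference in curvature between the two directions.
```

With plain gradient descent at the same learning rate, the steep coordinate would need a much smaller \(\eta\) to stay stable; here the accumulator equalizes progress across both directions.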
Complexity¶
- Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
- Space: \(O(p)\) additional space for the squared-gradient accumulator
- Assumptions: \(p\) parameters; elementwise operations shown for a dense parameter vector/tensor collection