RMSProp¶
Formula¶
\[
s_t=\rho s_{t-1}+(1-\rho)g_t^2
\]
\[
\theta_{t+1}=\theta_t-\eta \frac{g_t}{\sqrt{s_t}+\epsilon}
\]
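As a concrete check of the two formulas, here is a single scalar update step; the numeric values (\(\rho = 0.9\), \(\eta = 0.1\), \(g = 2\)) are illustrative choices, not prescribed above:

```python
import math

# One RMSProp step on a scalar parameter (illustrative values).
rho, eta, eps = 0.9, 0.1, 1e-8    # decay rate, learning rate, epsilon
s_prev, theta, g = 0.0, 1.0, 2.0  # prior accumulator, parameter, gradient

s = rho * s_prev + (1 - rho) * g**2              # s_1 = 0.1 * 4 = 0.4
theta = theta - eta * g / (math.sqrt(s) + eps)   # theta_1 ≈ 1 - 0.2/0.632
```

Note the effective step, \(\eta g / \sqrt{s_1} \approx 0.316\), differs from the plain gradient step \(\eta g = 0.2\): the accumulator rescales it.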
Parameters¶
- \(g_t\): gradient at step \(t\)
- \(s_t\): exponential moving average of squared gradients
- \(\rho\): decay rate (commonly 0.9)
- \(\eta\): learning rate
- \(\epsilon\): small constant for numerical stability
What it means¶
RMSProp rescales updates by a running estimate of gradient magnitude, giving adaptive per-parameter learning rates.
What it's used for¶
- Training deep networks with noisy or non-stationary gradients.
- A historical precursor to Adam, which adds momentum and bias correction on top of the same squared-gradient average.
Key properties¶
- Large-gradient directions get smaller updates.
- Helps when gradient scales differ across parameters.
Common gotchas¶
- Often sensitive to learning-rate tuning.
- Multiple variants exist across frameworks (e.g., with momentum, or centered updates that subtract the mean gradient), and frameworks also differ in whether \(\epsilon\) is added inside or outside the square root.
Example¶
Parameters with consistently large gradients receive smaller effective step sizes.
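A small sketch of this effect, using two parameters whose gradient magnitudes differ by three orders of magnitude (the specific values and hyperparameters are assumptions for illustration):

```python
import numpy as np

# Two parameters with very different, but consistent, gradient scales.
rho, eta, eps = 0.9, 0.01, 1e-8
s = np.zeros(2)
theta = np.array([1.0, 1.0])
g = np.array([100.0, 0.1])  # consistently large vs. small gradients

for _ in range(50):
    s = rho * s + (1 - rho) * g**2
    step = eta * g / (np.sqrt(s) + eps)  # effective per-parameter step
    theta -= step

# Once s warms up, sqrt(s) approaches |g|, so both effective steps
# approach eta: the large-gradient direction is scaled down the most.
```

This is why RMSProp behaves like a per-parameter learning rate: the raw gradients differ by a factor of 1000, but the effective steps end up nearly equal.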
How to Compute (Pseudocode)¶
Input: gradients g_t, parameters theta, learning rate eta, decay rho, epsilon, steps T
Output: updated parameters theta

initialize accumulator s <- 0
for t from 1 to T:
    s <- rho * s + (1 - rho) * (g_t * g_t)            # elementwise square
    theta <- theta - eta * g_t / (sqrt(s) + epsilon)  # elementwise update
return theta
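The pseudocode above translates directly into a runnable NumPy sketch; the test problem (a badly scaled quadratic) and the hyperparameter defaults are assumptions chosen for illustration:

```python
import numpy as np

def rmsprop(grad_fn, theta, eta=0.01, rho=0.9, eps=1e-8, steps=500):
    """RMSProp loop following the pseudocode above.
    Defaults are common choices, not prescribed by the text."""
    s = np.zeros_like(theta)                      # accumulator s <- 0
    for _ in range(steps):
        g = grad_fn(theta)                        # gradient at step t
        s = rho * s + (1 - rho) * g * g           # elementwise square
        theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta

# Toy objective with badly scaled coordinates: f(x, y) = 100 x^2 + y^2.
grad = lambda th: np.array([200.0 * th[0], 2.0 * th[1]])
theta = rmsprop(grad, np.array([1.0, 1.0]))
# Both coordinates approach the minimum at the origin, despite the
# 100x difference in curvature between the two directions.
```

With plain gradient descent at the same learning rate, the steep coordinate would need a much smaller \(\eta\) to stay stable; here the accumulator equalizes progress across both directions.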
Complexity¶
- Time: \(O(Tp)\) elementwise optimizer-state updates once gradients are available (plus gradient computation cost)
- Space: \(O(p)\) additional space for the squared-gradient accumulator
- Assumptions: \(p\) parameters; elementwise operations shown for a dense parameter vector/tensor collection