Learning Rate Schedule¶
Formula¶
\[
\eta_t = s(t)
\]
Plot¶
Example decay schedule (normalized): \(s(t) = e^{-0.6\,t}\) plotted for \(t \in [0, 8]\).
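A minimal matplotlib sketch that reproduces the plot above; the axis labels are assumptions, since the original figure carries only a title:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 8, 200)
plt.plot(t, np.exp(-0.6 * t))  # normalized decay curve s(t) = exp(-0.6 t)
plt.ylim(0, 1.05)
plt.title("Example decay schedule (normalized)")
plt.xlabel("step t")                    # assumed label, not in the original figure
plt.ylabel("normalized learning rate")  # assumed label, not in the original figure
plt.show()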
Parameters¶
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(s(t)\): schedule function (step, cosine, exponential, etc.)
What it means¶
A learning-rate schedule changes the step size over training instead of keeping it constant.
What it's used for¶
- Faster early training and better final convergence.
- Stabilizing large-model optimization.
Key properties¶
- Common schedules: step decay, cosine decay, exponential decay (sketched in plain Python after this list).
- Often combined with a warmup phase that ramps the rate up at the start of training.
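A minimal plain-Python sketch of these schedules; the function names, default constants, and the with_warmup helper are illustrative assumptions, not any library's API:

import math

def step_decay(t, eta0=0.1, drop=0.5, every=30):
    # Multiply the rate by `drop` every `every` steps/epochs.
    return eta0 * (drop ** (t // every))

def exponential_decay(t, eta0=0.1, k=0.01):
    # Smooth decay: eta0 * exp(-k t).
    return eta0 * math.exp(-k * t)

def cosine_decay(t, T, eta0=0.1, eta_min=0.0):
    # Anneal from eta0 down to eta_min over T steps, following half a cosine wave.
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

def with_warmup(schedule, t, warmup=500, **kwargs):
    # Linear warmup to the schedule's starting value, then the schedule itself.
    if t < warmup:
        return schedule(0, **kwargs) * (t + 1) / warmup
    return schedule(t - warmup, **kwargs)

For example, with_warmup(cosine_decay, t, warmup=500, T=10_000) ramps the rate up for 500 steps and then cosine-decays over the remaining horizon.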
Common gotchas¶
- Scheduler step timing (per batch step vs. per epoch) matters: stepping in the wrong unit decays the rate far faster or slower than intended (see the sketch after this list).
- Misconfigured schedules can decay too quickly and stall training.
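As an illustration of the timing gotcha, a PyTorch sketch (the toy model and loss are placeholders): T_max is counted in whatever unit scheduler.step() is called, so the call site must match the intended horizon.

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for _ in range(50):  # 50 batches per epoch
        optimizer.zero_grad()
        loss = model(torch.randn(8, 10)).pow(2).mean()
        loss.backward()
        optimizer.step()
        # Calling scheduler.step() here instead would finish the cosine
        # cycle after just 2 epochs (100 batch steps), decaying ~50x too fast.
    scheduler.step()  # per-epoch stepping matches T_max=100 epochs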
Example¶
Cosine decay starts high and gradually reduces the learning rate toward a small final value.
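One standard parameterization (cosine annealing without restarts) over a horizon of \(T\) steps, decaying from \(\eta_{\max}\) to \(\eta_{\min}\):

\[
\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)
\]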
How to Compute (Pseudocode)¶
Input: current step/epoch t and a schedule definition s(t)
Output: learning rate eta_t
eta_t <- s(t)
return eta_t
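A direct Python transcription of the pseudocode, assuming the schedule definition is passed in as a callable (a hypothetical interface):

import math

def learning_rate(t, s):
    # Evaluate the schedule s at step/epoch t to obtain eta_t.
    return s(t)

# e.g., the exponential schedule from the plot above: s(t) = exp(-0.6 t)
eta_5 = learning_rate(5, lambda t: math.exp(-0.6 * t))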
Complexity¶
- Time: \(O(1)\) per step for common closed-form schedules (step, cosine, exponential, linear)
- Space: \(O(1)\)
- Assumptions: these bounds cover evaluating the schedule value only; total training cost is dominated by the optimizer and model updates performed over many steps