Learning Rate Warmup¶
Formula¶
\[
\eta_t = \eta_{\max}\frac{t}{T_{\text{warmup}}}
\quad \text{for } t \le T_{\text{warmup}}
\]
Plot¶
Figure: Linear warmup phase (normalized). The learning rate rises linearly from 0 to \(\eta_{\max}\) as \(t/T_{\text{warmup}}\) goes from 0 to 1.
Parameters¶
- \(\eta_t\): learning rate at step \(t\)
- \(\eta_{\max}\): target learning rate after warmup
- \(T_{\text{warmup}}\): number of warmup steps
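As a quick worked example (the numbers are illustrative, not from this page): with \(\eta_{\max} = 3\times10^{-4}\) and \(T_{\text{warmup}} = 1000\), step \(t = 250\) gives
\[
\eta_{250} = 3\times10^{-4}\cdot\frac{250}{1000} = 7.5\times10^{-5},
\]
i.e. one quarter of the target rate.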
What it means¶
Warmup starts training with a small learning rate and ramps it up linearly to \(\eta_{\max}\) over the first \(T_{\text{warmup}}\) steps.
What it's used for¶
- Stabilizing early optimization in Transformers and large models.
- Preventing large unstable updates before statistics/activations settle.
Key properties¶
- Usually only applied at the beginning of training.
- Commonly followed by cosine or inverse-sqrt decay; one standard inverse-sqrt composition is shown below.
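For reference, the schedule from the original Transformer paper (Vaswani et al., 2017) folds linear warmup and inverse-sqrt decay into a single expression; it is not spelled out on this page and is included here only as a concrete instance of the composition:
\[
\eta_t = d_{\text{model}}^{-1/2}\cdot\min\!\left(t^{-1/2},\; t\cdot T_{\text{warmup}}^{-3/2}\right)
\]
For \(t \le T_{\text{warmup}}\) the second term is the smaller one, so the rate grows linearly in \(t\); afterwards the first term takes over and the rate decays as \(t^{-1/2}\).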
Common gotchas¶
- A warmup that is too short may still allow unstable early updates; one that is too long wastes training steps at a reduced learning rate.
- Warmup should be counted in the same units as the rest of the schedule (optimizer steps vs. epochs), or the two phases will misalign.
Example¶
Train with linear warmup for 1,000 steps, then cosine decay for the remaining steps.
How to Compute (Pseudocode)¶
Input: step t, warmup steps T_warmup, target lr eta_max, post-warmup schedule
Output: learning rate eta_t
if t <= T_warmup:
    eta_t <- eta_max * t / T_warmup
else:
    eta_t <- post_warmup_schedule(t)
return eta_t
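Below is a minimal Python sketch of this pseudocode, using the cosine decay from the Example above as the post-warmup schedule. The function name warmup_cosine_lr, the 10,000-step total budget, and the decay-to-zero floor are illustrative assumptions, not specified on this page.

import math

def warmup_cosine_lr(t, eta_max, T_warmup, T_total):
    """Learning rate at step t: linear warmup, then cosine decay to zero.

    Assumes 1-indexed steps and T_warmup < T_total (illustrative choices).
    """
    if t <= T_warmup:
        # Linear warmup: eta_t = eta_max * t / T_warmup
        return eta_max * t / T_warmup
    # Cosine decay over the remaining T_total - T_warmup steps
    progress = (t - T_warmup) / (T_total - T_warmup)
    return eta_max * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example from above: 1,000 warmup steps; a 10,000-step total budget is assumed.
for step in (1, 500, 1000, 5500, 10000):
    print(step, warmup_cosine_lr(step, eta_max=3e-4, T_warmup=1000, T_total=10000))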
Complexity¶
- Time: \(O(1)\) per step to compute the warmup-adjusted learning rate
- Space: \(O(1)\)
- Assumptions: Schedule evaluation only; warmup is typically composed with a longer decay schedule and does not dominate training cost