Learning Rate Warmup¶
Formula¶
\[
\eta_t = \eta_{\max}\frac{t}{T_{\text{warmup}}}
\quad \text{for } t \le T_{\text{warmup}}
\]
Plot¶
Figure: Linear warmup phase (normalized). The learning rate rises linearly from 0 to \(\eta_{\max}\) as \(t/T_{\text{warmup}}\) goes from 0 to 1.
Parameters¶
- \(\eta_t\): learning rate at step \(t\)
- \(\eta_{\max}\): target learning rate after warmup
- \(T_{\text{warmup}}\): number of warmup steps
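As a quick worked example (the numbers are illustrative, not from this page): with \(\eta_{\max} = 3\times10^{-4}\) and \(T_{\text{warmup}} = 1000\), step \(t = 250\) gives
\[
\eta_{250} = 3\times10^{-4}\cdot\frac{250}{1000} = 7.5\times10^{-5},
\]
i.e. one quarter of the target rate.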
What it means¶
Warmup starts training with a small learning rate and ramps it up linearly to \(\eta_{\max}\) over the first \(T_{\text{warmup}}\) steps.
What it's used for¶
- Stabilizing early optimization in Transformers and large models.
- Preventing large unstable updates before statistics/activations settle.
Key properties¶
- Usually only applied at the beginning of training.
- Commonly followed by cosine or inverse-sqrt decay; one standard inverse-sqrt composition is shown below.
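For reference, the schedule from the original Transformer paper (Vaswani et al., 2017) folds linear warmup and inverse-sqrt decay into a single expression; it is not spelled out on this page and is included here only as a concrete instance of the composition:
\[
\eta_t = d_{\text{model}}^{-1/2}\cdot\min\!\left(t^{-1/2},\; t\cdot T_{\text{warmup}}^{-3/2}\right)
\]
For \(t \le T_{\text{warmup}}\) the second term is the smaller one, so the rate grows linearly in \(t\); afterwards the first term takes over and the rate decays as \(t^{-1/2}\).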
Common gotchas¶
- A warmup that is too short may still allow unstable early updates; one that is too long wastes training steps at a reduced learning rate.
- Warmup should be counted in the same units as the rest of the schedule (optimizer steps vs. epochs), or the two phases will misalign.
Example¶
Train with linear warmup for 1,000 steps, then cosine decay for the remaining steps.
How to Compute (Pseudocode)¶
Input: step t, warmup steps T_warmup, target lr eta_max, post-warmup schedule
Output: learning rate eta_t
if t <= T_warmup:
    eta_t <- eta_max * t / T_warmup
else:
    eta_t <- post_warmup_schedule(t)
return eta_t
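Below is a minimal Python sketch of this pseudocode, using the cosine decay from the Example above as the post-warmup schedule. The function name warmup_cosine_lr, the 10,000-step total budget, and the decay-to-zero floor are illustrative assumptions, not specified on this page.

import math

def warmup_cosine_lr(t, eta_max, T_warmup, T_total):
    """Learning rate at step t: linear warmup, then cosine decay to zero.

    Assumes 1-indexed steps and T_warmup < T_total (illustrative choices).
    """
    if t <= T_warmup:
        # Linear warmup: eta_t = eta_max * t / T_warmup
        return eta_max * t / T_warmup
    # Cosine decay over the remaining T_total - T_warmup steps
    progress = (t - T_warmup) / (T_total - T_warmup)
    return eta_max * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example from above: 1,000 warmup steps; a 10,000-step total budget is assumed.
for step in (1, 500, 1000, 5500, 10000):
    print(step, warmup_cosine_lr(step, eta_max=3e-4, T_warmup=1000, T_total=10000))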
Complexity¶
- Time: \(O(1)\) per step to compute the warmup-adjusted learning rate
- Space: \(O(1)\)
- Assumptions: Schedule evaluation only; warmup is typically composed with a longer decay schedule and does not dominate training cost