Stochastic Gradient Descent¶
Formula¶
\[
x_{k+1} = x_k - \eta_k\,g_k,\quad \mathbb{E}[g_k\mid x_k]=\nabla f(x_k)
\]
Parameters¶
- \(g_k\): stochastic gradient estimate
- \(\eta_k\): step size
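The condition \(\mathbb{E}[g_k\mid x_k]=\nabla f(x_k)\) says the estimator is unbiased. A quick Monte Carlo sketch can illustrate this; the objective \(f(x)=\frac{1}{N}\sum_i (x-c_i)^2\) and the values `cs` below are made-up for illustration.

```python
import random

random.seed(0)
# Hypothetical objective f(x) = (1/N) * sum_i (x - c_i)^2, so the
# per-sample gradient is grad f_i(x) = 2 * (x - c_i). The c_i are made up.
cs = [1.0, 2.0, 3.0, 6.0]
x = 0.0

full = sum(2.0 * (x - c) for c in cs) / len(cs)  # true gradient at x: -6.0

# Average many single-sample estimates g_k = 2 * (x - c), c drawn uniformly
n = 100_000
est = sum(2.0 * (x - random.choice(cs)) for _ in range(n)) / n
# est approaches full as n grows: the estimator is unbiased but noisy
```

Each single-sample gradient is cheap but noisy; only its expectation matches the full gradient.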
What it means¶
Uses noisy gradient estimates to optimize large-scale objectives.
What it's used for¶
- Large-scale learning with noisy gradients.
- Online or streaming optimization.
Key properties¶
- Cheap per-iteration updates
- Converges in expectation under step-size schedules
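A common choice of schedule satisfies the Robbins–Monro conditions (\(\sum_t \eta_t = \infty\), \(\sum_t \eta_t^2 < \infty\)); the constants below are illustrative assumptions, not prescribed values.

```python
# Illustrative decaying step-size schedule satisfying the Robbins-Monro
# conditions: sum(eta_t) diverges while sum(eta_t^2) converges.
# eta0 and decay are made-up defaults, not canonical values.
def eta(t, eta0=0.1, decay=0.01):
    return eta0 / (1.0 + decay * t)

print([round(eta(t), 4) for t in (1, 10, 100, 1000)])
# eta shrinks gradually: 0.099, 0.0909, 0.05, 0.0091
```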
Common gotchas¶
- Variance can slow convergence; use momentum/averaging.
- Requires careful tuning of \(\eta_k\).
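One common variance-damping trick mentioned above is momentum. A minimal heavy-ball sketch, where the `eta` and `beta` defaults are illustrative assumptions:

```python
# Sketch of an SGD step with (heavy-ball) momentum: past gradients are
# accumulated into a velocity term that smooths out gradient noise.
# eta and beta defaults are illustrative, not tuned values.
def sgd_momentum_step(theta, velocity, grad, eta=0.1, beta=0.9):
    velocity = beta * velocity + grad   # running accumulation of gradients
    theta = theta - eta * velocity      # update along the smoothed direction
    return theta, velocity

# One step from theta=1.0 with gradient 2.0 and zero initial velocity:
theta, v = sgd_momentum_step(1.0, 0.0, 2.0)
# first step matches plain SGD (velocity starts at 0): theta == 0.8
```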
Example¶
If \(g=2\), \(x=1\), \(\eta=0.1\), then \(x_{\text{new}}=1-0.1\cdot2=0.8\).
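The single update step above, reproduced as code (values taken from the example):

```python
# One SGD update with the example's values: x=1, eta=0.1, g=2
x, eta, g = 1.0, 0.1, 2.0
x_new = x - eta * g
print(x_new)  # 0.8
```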
How to Compute (Pseudocode)¶
Input: data D, stochastic gradient estimator g(.), initial parameters theta0, learning-rate schedule eta_t, steps T
Output: trained parameters theta
theta <- theta0
for t from 1 to T:
    sample a mini-batch B_t from D
    g_t <- g(theta, B_t)
    theta <- theta - eta_t * g_t
return theta
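The pseudocode above can be sketched as a runnable mini-batch SGD loop. The 1-D least-squares problem, data, batch size, and constant step size below are illustrative assumptions, not part of the pseudocode.

```python
import random

random.seed(0)
# Illustrative 1-D least-squares objective (data and hyperparameters are
# assumptions): minimize f(theta) = (1/N) * sum_i (theta*x_i - y_i)^2,
# with synthetic data y_i = 3 * x_i, so the true slope is 3.0.
data = [(i / 20.0, 3.0 * i / 20.0) for i in range(1, 21)]

def stochastic_gradient(theta, batch):
    # Gradient of the mean squared error over the mini-batch
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

theta = 0.0                 # theta0
T, b, eta = 500, 4, 0.1     # steps, batch size, constant step size
for t in range(1, T + 1):
    B_t = random.sample(data, b)           # sample a mini-batch B_t from D
    g_t = stochastic_gradient(theta, B_t)  # g_t <- g(theta, B_t)
    theta = theta - eta * g_t              # theta <- theta - eta_t * g_t
# theta is now close to the true slope 3.0
```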
Complexity¶
- Time: \(O(T \cdot \mathrm{BatchGradCost})\); for dense models this is often proportional to batch size \(b\) and parameter count \(p\)
- Space: Typically \(O(p)\) model/gradient state plus mini-batch storage
- Assumptions: \(T\) update steps, mini-batch SGD workflow; data pipeline and hardware parallelism strongly affect runtime