
Stochastic Gradient Descent

Formula

\[ x_{k+1} = x_k - \eta_k\,g_k,\quad \mathbb{E}[g_k\mid x_k]=\nabla f(x_k) \]

Parameters

  • \(x_k\): current iterate (parameter vector)
  • \(g_k\): stochastic gradient estimate
  • \(\eta_k\): step size (learning rate)

What it means

Optimizes large-scale objectives using cheap, noisy gradient estimates whose conditional expectation equals the true gradient, so each step is inexpensive but correct on average.
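
For a finite-sum objective \(f(x) = \tfrac{1}{n}\sum_{i=1}^{n} f_i(x)\), the standard construction of such an estimate (assuming the mini-batch \(B_k\) is sampled uniformly at random) is the mini-batch gradient:

\[ g_k = \frac{1}{|B_k|}\sum_{i \in B_k} \nabla f_i(x_k), \qquad \mathbb{E}[g_k \mid x_k] = \nabla f(x_k) \]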

What it's used for

  • Large-scale learning with noisy gradients.
  • Online or streaming optimization.

Key properties

  • Cheap per-iteration updates
  • Converges in expectation under suitably decaying step-size schedules; a classical example follows this list
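
A classical sufficient condition (Robbins–Monro) is \(\sum_k \eta_k = \infty\) together with \(\sum_k \eta_k^2 < \infty\); for instance, the \(1/k\) decay

\[ \eta_k = \frac{\eta_0}{k+1} \]

satisfies both, whereas a constant step size typically only reaches a noise-dominated neighborhood of the optimum.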

Common gotchas

  • Variance can slow convergence; use momentum/averaging (a momentum sketch follows this list).
  • Requires careful tuning of \(\eta_k\).
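
One common mitigation is heavy-ball momentum, which smooths successive noisy gradients before stepping. A minimal Python sketch, assuming a user-supplied stochastic gradient function grad_fn (an illustrative name, not defined on this page):

import numpy as np

def sgd_momentum(grad_fn, theta0, eta=0.01, beta=0.9, steps=1000):
    """SGD with heavy-ball momentum: step along an average of recent gradients."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)        # velocity: exponentially weighted sum of gradients
    for _ in range(steps):
        g = grad_fn(theta)          # stochastic (noisy) gradient estimate
        v = beta * v + g            # accumulate momentum
        theta -= eta * v            # step along the smoothed direction
    return theta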

Example

If \(g=2\), \(x=1\), \(\eta=0.1\), then \(x_{\text{new}}=1-0.1\cdot2=0.8\).

How to Compute (Pseudocode)

Input: data D, stochastic gradient estimator g(.), initial parameters theta0, learning-rate schedule eta_t, steps T
Output: trained parameters theta

theta <- theta0
for t from 1 to T:
  sample a mini-batch B_t from D
  g_t <- g(theta, B_t)             # stochastic gradient estimate at the current parameters
  theta <- theta - eta_t * g_t     # descend using the step size for iteration t
return theta
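
The same loop as runnable Python, fit to a toy least-squares problem; the data, model, and schedule below are illustrative assumptions, not part of this page:

import numpy as np

def sgd(grad_fn, theta0, data, eta_fn, steps, batch_size, seed=0):
    """Mini-batch SGD: theta <- theta - eta_t * g_t."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    n = len(data[0])
    for t in range(1, steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a mini-batch B_t
        g_t = grad_fn(theta, idx, data)                      # stochastic gradient estimate
        theta -= eta_fn(t) * g_t                             # SGD update with step size eta_t
    return theta

# Toy objective: f(theta) = (1/2n) * ||X theta - y||^2 (assumed for illustration)
def lstsq_grad(theta, idx, data):
    X, y = data
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)
theta = sgd(lstsq_grad, np.zeros(3), (X, y),
            eta_fn=lambda t: 0.1 / (1 + 0.01 * t), steps=2000, batch_size=32)
print(theta)  # should land near [1.0, -2.0, 0.5]

With the decaying schedule the iterates settle close to the least-squares solution; with a fixed step size they would instead fluctuate around it.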

Complexity

  • Time: \(O(T \cdot \mathrm{BatchGradCost})\); for dense models this is often proportional to batch size \(b\) and parameter count \(p\)
  • Space: Typically \(O(p)\) model/gradient state plus mini-batch storage
  • Assumptions: \(T\) update steps, mini-batch SGD workflow; data pipeline and hardware parallelism strongly affect runtime

See also