Gradient Descent

Formula

\[ x_{k+1} = x_k - \eta_k\,\nabla f(x_k) \]

Parameters

  • \(\eta_k\): step size (learning rate)
  • \(\nabla f\): gradient

What it means

Iteratively moves downhill along the direction of steepest descent, i.e. the negative gradient \(-\nabla f(x_k)\).

What it's used for

  • Minimizing differentiable objectives.
  • Training ML models with large datasets.

Key properties

  • Converges for convex, smooth functions with suitable step sizes
  • Step size can be fixed or found by line search

Common gotchas

  • Too large \(\eta_k\) causes divergence; too small slows convergence.
  • Can stall at saddle points in nonconvex problems.
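The sensitivity to step size is easy to see on \(f(x)=x^2\), where one update gives \(x_{k+1}=(1-2\eta)x_k\): the iterates contract when \(0<\eta<1\) and blow up when \(\eta>1\). A minimal sketch (the factor-of-20 iteration count is arbitrary):

```python
def step(x, eta):
    # One gradient-descent step on f(x) = x^2, whose gradient is 2x.
    return x - eta * 2 * x

x_good, x_bad = 1.0, 1.0
for _ in range(20):
    x_good = step(x_good, 0.1)  # per-step factor 1 - 2*0.1 = 0.8: converges
    x_bad = step(x_bad, 1.1)    # per-step factor 1 - 2*1.1 = -1.2: diverges

print(abs(x_good))  # shrinks toward 0
print(abs(x_bad))   # grows without bound
```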

Example

For \(f(x)=x^2\) (so \(\nabla f(x)=2x\)), \(x_0=1\), \(\eta=0.1\): \(x_1=1-0.1\cdot2\cdot1=0.8\).
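The same single step, checked numerically:

```python
eta = 0.1
x = 1.0
x = x - eta * (2 * x)  # gradient of x^2 at x is 2x
print(x)  # 0.8 (up to floating-point rounding)
```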

How to Compute (Pseudocode)

Input: objective f, gradient function grad_f, initial x0, step sizes eta_k, iterations T
Output: approximate minimizer x

x <- x0
for k from 1 to T:
  g <- grad_f(x)
  x <- x - eta_k * g
return x
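The pseudocode translates directly to Python. The sketch below assumes a fixed step size `eta` for simplicity (the pseudocode allows a per-iteration \(\eta_k\)); the target function \(f(x)=(x-3)^2\) is an illustrative choice, not from the source:

```python
def gradient_descent(grad_f, x0, eta, T):
    """Run T fixed-step gradient-descent iterations starting from x0."""
    x = x0
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, eta=0.1, T=100)
print(x_min)  # close to the minimizer x = 3
```

With `eta=0.1` each step contracts the error by the factor \(1-2\eta=0.8\), so after 100 iterations the iterate is within about \(0.8^{100}\approx 2\times10^{-10}\) of the minimizer.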

Complexity

  • Time: \(O(T \cdot \mathrm{GradCost})\), where \(\mathrm{GradCost}\) is the cost of one full gradient evaluation
  • Space: Depends on problem dimension; typically \(O(p)\) for \(p\) parameters plus gradient storage
  • Assumptions: First-order iterative method; exact cost depends on gradient computation and step-size strategy

See also