Gradient Descent

Formula

\[ x_{k+1} = x_k - \eta_k\,\nabla f(x_k) \]

Parameters

  • \(\eta_k\): step size (learning rate)
  • \(\nabla f\): gradient

What it means

Iteratively moves downhill along the direction of steepest descent, i.e. the negative gradient \(-\nabla f(x_k)\).

What it's used for

  • Minimizing differentiable objectives.
  • Training ML models with large datasets.

Key properties

  • Converges for convex, smooth functions with suitable step sizes
  • Step size can be fixed or found by line search

Common gotchas

  • Too large \(\eta_k\) causes divergence; too small slows convergence.
  • Can stall at saddle points in nonconvex problems.
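The sensitivity to step size is easy to see on \(f(x)=x^2\), where one update gives \(x_{k+1}=(1-2\eta)x_k\): the iterates contract when \(0<\eta<1\) and blow up when \(\eta>1\). A minimal sketch (the factor-of-20 iteration count is arbitrary):

```python
def step(x, eta):
    # One gradient-descent step on f(x) = x^2, whose gradient is 2x.
    return x - eta * 2 * x

x_good, x_bad = 1.0, 1.0
for _ in range(20):
    x_good = step(x_good, 0.1)  # per-step factor 1 - 2*0.1 = 0.8: converges
    x_bad = step(x_bad, 1.1)    # per-step factor 1 - 2*1.1 = -1.2: diverges

print(abs(x_good))  # shrinks toward 0
print(abs(x_bad))   # grows without bound
```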

Example

For \(f(x)=x^2\) (so \(\nabla f(x)=2x\)), \(x_0=1\), \(\eta=0.1\): \(x_1=1-0.1\cdot2\cdot1=0.8\).
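The same single step, checked numerically:

```python
eta = 0.1
x = 1.0
x = x - eta * (2 * x)  # gradient of x^2 at x is 2x
print(x)  # 0.8 (up to floating-point rounding)
```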

How to Compute (Pseudocode)

Input: objective f, gradient function grad_f, initial x0, step sizes eta_k, iterations T
Output: approximate minimizer x

x <- x0
for k from 1 to T:
  g <- grad_f(x)
  x <- x - eta_k * g
return x
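The pseudocode translates directly to Python. The sketch below assumes a fixed step size `eta` for simplicity (the pseudocode allows a per-iteration \(\eta_k\)); the target function \(f(x)=(x-3)^2\) is an illustrative choice, not from the source:

```python
def gradient_descent(grad_f, x0, eta, T):
    """Run T fixed-step gradient-descent iterations starting from x0."""
    x = x0
    for _ in range(T):
        x = x - eta * grad_f(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, eta=0.1, T=100)
print(x_min)  # close to the minimizer x = 3
```

With `eta=0.1` each step contracts the error by the factor \(1-2\eta=0.8\), so after 100 iterations the iterate is within about \(0.8^{100}\approx 2\times10^{-10}\) of the minimizer.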

Complexity

  • Time: \(O(T \cdot \mathrm{GradCost})\), where \(\mathrm{GradCost}\) is the cost of one full gradient evaluation
  • Space: Depends on problem dimension; typically \(O(p)\) for \(p\) parameters plus gradient storage
  • Assumptions: First-order iterative method; exact cost depends on gradient computation and step-size strategy

See also