Backpropagation

Formula

\[ \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \frac{\partial y}{\partial x} \]

Parameters

  • \(\mathcal{L}\): loss function
  • \(x,y\): intermediate variables in a computation graph
  • \(\partial \mathcal{L}/\partial(\cdot)\): gradients

What it means

Backpropagation computes gradients efficiently by applying the chain rule backward through the computation graph.
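A minimal sketch of this idea for a two-step composition, using hand-written local derivatives and a finite-difference check (the functions \(g(x)=x^2\) and \(f(y)=\sin y\) are illustrative choices, not part of the definition above):

```python
import math

def forward(x):
    """Forward pass through y = g(x), L = f(y), returning intermediates."""
    y = x * x          # g(x) = x^2
    L = math.sin(y)    # f(y) = sin(y), standing in for a loss
    return y, L

x = 1.5
y, L = forward(x)

# Backward pass: apply the chain rule from L back toward x.
dL_dy = math.cos(y)    # local derivative of sin at y
dy_dx = 2 * x          # local derivative of x^2 at x
dL_dx = dL_dy * dy_dx  # chain rule: dL/dx = (dL/dy) * (dy/dx)

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (forward(x + eps)[1] - forward(x - eps)[1]) / (2 * eps)
assert abs(dL_dx - numeric) < 1e-6
```

Note that the backward pass reuses `y` cached from the forward pass rather than recomputing it, which is exactly the reuse property described below.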

What it's used for

  • Training neural networks with gradient-based optimizers.
  • Automatic differentiation in deep learning frameworks.

Key properties

  • Reuses intermediate computations from the forward pass.
  • Time complexity is typically a small constant-factor multiple of the forward pass's cost.

Common gotchas

  • In-place ops and detached tensors can break gradient flow.
  • Numerical instability can cause exploding/vanishing gradients.

Example

For \(y=wx\) and loss \(\mathcal{L}(y)\), backprop computes \(\partial \mathcal{L}/\partial w=(\partial \mathcal{L}/\partial y)\,x\).
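The same arithmetic in a few lines of Python, with the squared-error loss \(\mathcal{L}(y)=\tfrac{1}{2}y^2\) chosen here purely for illustration (the formula above holds for any differentiable loss):

```python
w, x = 3.0, 2.0
y = w * x            # forward: y = wx = 6.0

dL_dy = y            # dL/dy for L(y) = 0.5 * y^2 (assumed loss)
dL_dw = dL_dy * x    # backprop rule from the example: (dL/dy) * x

print(dL_dw)  # 12.0
```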

How to Compute (Pseudocode)

Input: computation graph, loss L
Output: gradients for parameters and intermediates

run forward pass and cache needed intermediates
initialize upstream gradient dL/dL <- 1
traverse graph in reverse topological order:
  apply each node's local derivative rule
  accumulate gradients for parent nodes/parameters
return parameter gradients
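The pseudocode above can be realized as a tiny scalar reverse-mode autodiff. This is a self-contained sketch, not any framework's actual API; the `Value` class and its attributes are invented for illustration:

```python
class Value:
    """A scalar node in a computation graph that records its parents."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes feeding into this one
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __mul__(self, other):
        return Value(self.data * other.data,
                     (self, other), (other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def backward(self):
        # Build reverse topological order, seed dL/dL = 1,
        # then accumulate gradients into each parent.
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += v.grad * g  # accumulate, don't overwrite

w, x = Value(3.0), Value(2.0)
L = w * x + w       # L = wx + w
L.backward()
print(w.grad, x.grad)  # 3.0 3.0  (dL/dw = x + 1, dL/dx = w)
```

Gradient accumulation (`+=`) matters because a node used in several places, like `w` above, receives contributions along every path through the graph.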

Complexity

  • Time: Typically on the same order as the forward pass (often a small constant-factor multiple), plus gradient accumulation overhead
  • Space: Depends on stored activations/intermediates for the backward pass (activation memory often dominates)
  • Assumptions: Reverse-mode autodiff for scalar loss; checkpointing/recomputation strategies trade time for memory

See also