Backpropagation¶
Formula¶
\[
\frac{\partial \mathcal{L}}{\partial x}
=
\frac{\partial \mathcal{L}}{\partial y}
\frac{\partial y}{\partial x}
\]
Parameters¶
- \(\mathcal{L}\): loss function
- \(x,y\): intermediate variables in a computation graph
- \(\partial \mathcal{L}/\partial(\cdot)\): gradients
What it means¶
Backpropagation computes gradients efficiently by applying the chain rule backward through the computation graph.
What it's used for¶
- Training neural networks with gradient-based optimizers.
- Automatic differentiation in deep learning frameworks.
Key properties¶
- Reuses intermediate computations from the forward pass.
- Backward-pass time is typically a small constant multiple of the forward-pass time.
Common gotchas¶
- In-place ops and detached tensors can break gradient flow (see the sketch after this list).
- Numerical instability can cause exploding/vanishing gradients.
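A minimal sketch of the detach gotcha, assuming PyTorch (the framework, the tensor values, and the squared loss are illustrative choices, not part of the text above):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

# Connected path: the loss depends on w through the graph, so backprop
# reaches w.  Here L = (w*x)**2, so dL/dw = 2*(w*x)*x = 36.
loss = (w * x) ** 2
loss.backward()
print(w.grad)              # tensor(36.)

# Detached path: detach() cuts the graph at y, so nothing can flow back to w.
y = (w * x).detach()
loss = y ** 2
print(loss.requires_grad)  # False -- calling backward() here would error
```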
Example¶
For \(y=wx\) and loss \(\mathcal{L}(y)\), backprop computes \(\partial \mathcal{L}/\partial w=(\partial \mathcal{L}/\partial y)\,x\).
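A quick numeric check of this formula in plain Python (the values \(w=2\), \(x=3\) and the loss \(\mathcal{L}(y)=y^2\) are illustrative assumptions, not from the text above):

```python
# Check dL/dw = (dL/dy) * x for y = w*x with the illustrative loss L(y) = y**2.
w, x = 2.0, 3.0

y = w * x                # forward pass: y = 6.0
dL_dy = 2.0 * y          # local derivative of L(y) = y**2
dL_dw = dL_dy * x        # chain rule: multiply by dy/dw = x
print(dL_dw)             # 36.0

# Finite-difference check: (L(w+h) - L(w-h)) / (2h) should agree.
h = 1e-5
numeric = (((w + h) * x) ** 2 - ((w - h) * x) ** 2) / (2 * h)
print(numeric)           # ~36.0
```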
How to Compute (Pseudocode)¶
Input: computation graph, loss L
Output: gradients for parameters and intermediates
run forward pass and cache needed intermediates
initialize upstream gradient dL/dL <- 1
traverse graph in reverse topological order:
    apply each node's local derivative rule
    accumulate gradients for parent nodes/parameters
return parameter gradients
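The pseudocode can be made runnable as a tiny hand-rolled sketch (a toy assuming scalar values and only multiply/power nodes, not any particular framework's API):

```python
# Minimal reverse-mode autodiff in the spirit of the pseudocode above:
# each node records its parents and a local backward rule, and gradients
# are accumulated while walking the graph in reverse topological order.

class Node:
    def __init__(self, value, parents=(), backward_rule=lambda g: ()):
        self.value = value
        self.parents = parents
        self.backward_rule = backward_rule   # upstream grad -> parent grads
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    lambda g: (g * other.value, g * self.value))

    def __pow__(self, k):
        return Node(self.value ** k, (self,),
                    lambda g: (g * k * self.value ** (k - 1),))

def backprop(loss):
    # Reverse topological order via depth-first post-order traversal.
    order, seen = [], set()
    def visit(node):
        if node not in seen:
            seen.add(node)
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(loss)

    loss.grad = 1.0                          # seed dL/dL = 1
    for node in reversed(order):
        for parent, g in zip(node.parents, node.backward_rule(node.grad)):
            parent.grad += g                 # accumulate into parents

w, x = Node(2.0), Node(3.0)
loss = (w * x) ** 2
backprop(loss)
print(loss.value, w.grad)   # 36.0 36.0
```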
Complexity¶
- Time: Typically on the same order as the forward pass (often a small constant-factor multiple), plus gradient accumulation overhead
- Space: Depends on stored activations/intermediates for the backward pass (activation memory often dominates)
- Assumptions: Reverse-mode autodiff for scalar loss; checkpointing/recomputation strategies trade time for memory (sketched below)
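As one concrete instance of that time-for-memory trade-off, a sketch of activation checkpointing assuming PyTorch's torch.utils.checkpoint (the layer sizes and batch size are arbitrary):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
)
x = torch.randn(32, 128, requires_grad=True)

# Activations inside `block` are not stored during the forward pass;
# they are recomputed when backward() reaches this segment, trading
# extra forward-pass time for lower activation memory.
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()
print(x.grad.shape)   # torch.Size([32, 128])
```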