Residual Connection (Skip Connection)¶
Formula¶
\[
y = x + F(x)
\]
Parameters¶
- \(x\): input
- \(F(x)\): learned transformation (e.g., attention or MLP sublayer)
- \(y\): output with skip connection
What it means¶
A residual connection adds the input directly to a sublayer output, making it easier to preserve and refine information.
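A minimal sketch of the formula above (using NumPy; the `tanh` sublayer here is just a stand-in for a learned \(F\)):

```python
import numpy as np

def sublayer(x):
    # Stand-in for a learned transformation F(x), e.g. an attention or MLP sublayer.
    return np.tanh(x)

def residual(x, F):
    # y = x + F(x): the sublayer only needs to learn a correction to x.
    return x + F(x)

x = np.array([0.5, -1.0, 2.0])
y = residual(x, sublayer)
```

Because the input is added back unchanged, setting \(F(x) = 0\) recovers the identity mapping, which is what makes very deep stacks easy to optimize.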
What it's used for¶
- Deep networks (ResNets, Transformers).
- Improving gradient flow in deep stacks.
Key properties¶
- Lets layers learn residual corrections rather than full mappings.
- Eases optimization of very deep models by providing a direct path for gradients.
Common gotchas¶
- Dimensions of \(x\) and \(F(x)\) must match; otherwise a projection is needed before the add.
- Large residual magnitudes can destabilize training without normalization/scaling.
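When the sublayer changes the feature width, the skip path needs its own projection before the add. A sketch of that case (the weight matrices `W_out` and `W_proj` are hypothetical, random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 4, 8
W_out = rng.standard_normal((d_in, d_out)) * 0.1   # sublayer weights (d_in -> d_out)
W_proj = rng.standard_normal((d_in, d_out)) * 0.1  # projection for the skip path

def sublayer(x):
    # Width-changing transformation: output has d_out features, not d_in.
    return x @ W_out

x = rng.standard_normal((2, d_in))
# x has shape (2, d_in) but sublayer(x) has shape (2, d_out),
# so project x to d_out before the residual add.
y = x @ W_proj + sublayer(x)
```

This mirrors the 1x1-convolution shortcut used in ResNets when a block changes channel count.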
Example¶
Transformer sublayers often use \(x + \mathrm{Attention}(x)\) and \(x + \mathrm{MLP}(x)\).
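The two sublayers above can be sketched as a small pre-norm Transformer block (a common variant that normalizes inside the residual branch). The attention and MLP functions here are simplified stand-ins, not full learned sublayers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-feature normalization, commonly paired with residual connections.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_stub(x):
    # Stand-in for self-attention (real attention uses learned Q/K/V projections).
    w = np.exp(x @ x.T)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def mlp_stub(x):
    # Stand-in for the feed-forward sublayer.
    return np.maximum(x, 0.0)

def transformer_block(x):
    # Each sublayer output is added back to its input: x + Sublayer(LayerNorm(x)).
    x = x + attention_stub(layer_norm(x))
    x = x + mlp_stub(layer_norm(x))
    return x

x = np.random.default_rng(1).standard_normal((3, 4))  # 3 tokens, width 4
y = transformer_block(x)
```

Note that both residual adds require the sublayer to preserve the input shape, as stated under gotchas.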
How to Compute (Pseudocode)¶
Input: tensor x and sublayer F
Output: residual output y
u <- F(x)
y <- x + u
return y
Complexity¶
- Time: \(O(m)\) for the elementwise add over \(m\) values, plus the cost of computing \(F(x)\)
- Space: \(O(m)\) for the output tensor (and any sublayer intermediates)
- Assumptions: Shapes match for the residual add (or a projection is used before addition)