Residual Connection (Skip Connection)¶
Formula¶
\[
y = x + F(x)
\]
Parameters¶
- \(x\): input
- \(F(x)\): learned transformation (e.g., attention or MLP sublayer)
- \(y\): output with skip connection
What it means¶
A residual connection adds the input directly to a sublayer output, making it easier to preserve and refine information.
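A minimal sketch of the formula above (using NumPy; the `tanh` sublayer here is just a stand-in for a learned \(F\)):

```python
import numpy as np

def sublayer(x):
    # Stand-in for a learned transformation F(x), e.g. an attention or MLP sublayer.
    return np.tanh(x)

def residual(x, F):
    # y = x + F(x): the sublayer only needs to learn a correction to x.
    return x + F(x)

x = np.array([0.5, -1.0, 2.0])
y = residual(x, sublayer)
```

Because the input is added back unchanged, setting \(F(x) = 0\) recovers the identity mapping, which is what makes very deep stacks easy to optimize.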
What it's used for¶
- Deep networks (ResNets, Transformers).
- Improving gradient flow in deep stacks.
Key properties¶
- Lets layers learn residual corrections rather than full mappings.
- Eases optimization of very deep models by providing a direct path for gradients.
Common gotchas¶
- Dimensions of \(x\) and \(F(x)\) must match; otherwise a projection is needed before the add.
- Large residual magnitudes can destabilize training without normalization/scaling.
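When the sublayer changes the feature width, the skip path needs its own projection before the add. A sketch of that case (the weight matrices `W_out` and `W_proj` are hypothetical, random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 4, 8
W_out = rng.standard_normal((d_in, d_out)) * 0.1   # sublayer weights (d_in -> d_out)
W_proj = rng.standard_normal((d_in, d_out)) * 0.1  # projection for the skip path

def sublayer(x):
    # Width-changing transformation: output has d_out features, not d_in.
    return x @ W_out

x = rng.standard_normal((2, d_in))
# x has shape (2, d_in) but sublayer(x) has shape (2, d_out),
# so project x to d_out before the residual add.
y = x @ W_proj + sublayer(x)
```

This mirrors the 1x1-convolution shortcut used in ResNets when a block changes channel count.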
Example¶
Transformer sublayers often use \(x + \mathrm{Attention}(x)\) and \(x + \mathrm{MLP}(x)\).
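The two sublayers above can be sketched as a small pre-norm Transformer block (a common variant that normalizes inside the residual branch). The attention and MLP functions here are simplified stand-ins, not full learned sublayers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-feature normalization, commonly paired with residual connections.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_stub(x):
    # Stand-in for self-attention (real attention uses learned Q/K/V projections).
    w = np.exp(x @ x.T)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def mlp_stub(x):
    # Stand-in for the feed-forward sublayer.
    return np.maximum(x, 0.0)

def transformer_block(x):
    # Each sublayer output is added back to its input: x + Sublayer(LayerNorm(x)).
    x = x + attention_stub(layer_norm(x))
    x = x + mlp_stub(layer_norm(x))
    return x

x = np.random.default_rng(1).standard_normal((3, 4))  # 3 tokens, width 4
y = transformer_block(x)
```

Note that both residual adds require the sublayer to preserve the input shape, as stated under gotchas.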
How to Compute (Pseudocode)¶
Input: tensor x and sublayer F
Output: residual output y
u <- F(x)
y <- x + u
return y
Complexity¶
- Time: \(O(m)\) for the elementwise add over \(m\) values, plus the cost of computing \(F(x)\)
- Space: \(O(m)\) for the output tensor (and any sublayer intermediates)
- Assumptions: Shapes match for the residual add (or a projection is used before addition)