Residual Connection (Skip Connection)

Formula

\[ y = x + F(x) \]

Parameters

  • \(x\): input
  • \(F(x)\): learned transformation (e.g., attention or MLP sublayer)
  • \(y\): output with skip connection

What it means

A residual connection adds the input \(x\) directly to the sublayer output \(F(x)\), so information from earlier layers is preserved and the sublayer only needs to learn a refinement on top of it.

What it's used for

  • Deep networks (ResNets, Transformers).
  • Improving gradient flow in deep stacks.
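The gradient-flow benefit follows directly from differentiating the residual formula (a standard derivation, not specific to any one architecture):

\[ \frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x} \]

The identity term gives gradients a direct path backward through the stack, so they do not vanish even when \(\partial F / \partial x\) is small.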

Key properties

  • Lets layers learn residual corrections rather than full mappings.
  • Helps optimization of very deep models.

Common gotchas

  • The shapes of \(x\) and \(F(x)\) must match; if the sublayer changes dimensionality, \(x\) must be projected (e.g., by a linear layer) before the addition.
  • Large residual magnitudes can destabilize training without normalization/scaling.
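As a minimal sketch of the shape-mismatch case (the sublayer `F` and the projection matrix `W_proj` are illustrative toys, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 8
x = rng.standard_normal(d_in)

# Toy sublayer F that changes dimensionality: d_in -> d_out
W_f = rng.standard_normal((d_in, d_out))
def F(x):
    return np.tanh(x @ W_f)

# x and F(x) have different shapes, so project x before the residual add
W_proj = rng.standard_normal((d_in, d_out))
y = x @ W_proj + F(x)  # residual add is now well-defined
print(y.shape)         # (8,)
```

In ResNets this projection is typically a 1x1 convolution on the shortcut path; in Transformers, sublayers are usually shape-preserving, so no projection is needed.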

Example

Transformer sublayers often use \(x + \mathrm{Attention}(x)\) and \(x + \mathrm{MLP}(x)\).

How to Compute (Pseudocode)

Input: tensor x and sublayer F
Output: residual output y

u <- F(x)
y <- x + u
return y
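The pseudocode above translates directly into a runnable version; this sketch (numpy, with an illustrative sublayer passed as a function) also checks the shape assumption noted under Complexity:

```python
import numpy as np

def residual(x, F):
    """Compute y = x + F(x); assumes F preserves x's shape."""
    u = F(x)
    assert u.shape == x.shape, "shapes must match for the residual add"
    return x + u

x = np.ones(3)
y = residual(x, lambda v: 2 * v)  # F(x) = 2x, so y = x + 2x = 3x
print(y)                          # [3. 3. 3.]
```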

Complexity

  • Time: \(O(m)\) for the elementwise add over \(m\) values, plus the cost of computing \(F(x)\)
  • Space: \(O(m)\) for the output tensor (and any sublayer intermediates)
  • Assumptions: Shapes match for the residual add (or a projection is used before addition)

See also