
Feedforward Network (Transformer FFN)

Formula

\[ \mathrm{FFN}(x)=W_2\,\phi(W_1x+b_1)+b_2 \]

Parameters

  • \(x\): token representation
  • \(W_1,W_2,b_1,b_2\): FFN weight matrices and biases (shapes given after this list)
  • \(\phi\): activation function (commonly ReLU or GELU; gated variants such as SwiGLU modify the layer's structure)
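
With model width \(d\) and FFN hidden width \(d_{ff}\), and using the column-vector convention of the formula above, the shapes are:

\[ x \in \mathbb{R}^{d},\quad W_1 \in \mathbb{R}^{d_{ff}\times d},\quad b_1 \in \mathbb{R}^{d_{ff}},\quad W_2 \in \mathbb{R}^{d\times d_{ff}},\quad b_2 \in \mathbb{R}^{d}. \]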

What it means

A position-wise, two-layer MLP applied independently to each token's representation, typically following the attention sub-layer in a Transformer block.

What it's used for

  • Nonlinear feature transformation after attention.
  • Increasing model capacity with an expansion-projection pattern.

Key properties

  • Same weights reused across sequence positions.
  • The hidden width \(d_{ff}\) is usually larger than the model width \(d\); a 4× expansion is common.

Common gotchas

  • It does not mix information across sequence positions; attention handles that.
  • Activation choice affects model quality and speed (common definitions are given below).
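
For reference, two of the most common activations are shown below; the GELU expression is the widely used tanh approximation rather than the exact Gaussian-CDF form.

\[ \mathrm{ReLU}(x)=\max(0,x), \qquad \mathrm{GELU}(x)\approx \tfrac{1}{2}\,x\left(1+\tanh\!\left(\sqrt{2/\pi}\,\bigl(x+0.044715\,x^{3}\bigr)\right)\right) \]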

Example

A Transformer block with model width \(d=768\) may expand each token to a hidden width of \(3072\), apply GELU, then project back to \(768\).
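
As a rough sanity check on those dimensions (counting only the FFN weights and biases), the parameter count works out to about 4.7M:

\[ 3072\cdot 768 \;+\; 768\cdot 3072 \;+\; 3072 \;+\; 768 \;=\; 4{,}722{,}432 \;\approx\; 4.7\times 10^{6}. \]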

How to Compute (Pseudocode)

Input: token representations X
Output: transformed token representations

for each token vector x in X:
  h <- activation(W1 x + b1)
  y <- W2 h + b2
  emit y
return all token outputs
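
A minimal runnable sketch of the same loop in NumPy, assuming a GELU activation and row-major inputs of shape (L, d); because tokens are rows here, the weight shapes are transposed relative to the column-vector formula above, and the names gelu and ffn are illustrative rather than from any particular library.

import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(X, W1, b1, W2, b2):
    # X: (L, d); W1: (d, d_ff); b1: (d_ff,); W2: (d_ff, d); b2: (d,)
    H = gelu(X @ W1 + b1)   # expand every token to the hidden width d_ff
    return H @ W2 + b2      # project back to the model width d

# Toy usage with the dimensions from the example above.
rng = np.random.default_rng(0)
L, d, d_ff = 4, 768, 3072
X = rng.standard_normal((L, d))
W1 = 0.02 * rng.standard_normal((d, d_ff)); b1 = np.zeros(d_ff)
W2 = 0.02 * rng.standard_normal((d_ff, d)); b2 = np.zeros(d)
print(ffn(X, W1, b1, W2, b2).shape)  # -> (4, 768)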

Complexity

  • Time: \(O(B L d\, d_{ff})\) for batch size \(B\), sequence length \(L\), model width \(d\), and FFN hidden width \(d_{ff}\) (dense implementation; a worked estimate follows this list)
  • Space: \(O(B L d_{ff})\) for the hidden activations plus the input/output tensors (training adds cached activations for the backward pass)
  • Assumptions: Position-wise dense FFN with shared weights across tokens; activation cost is lower-order relative to matrix multiplies
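
As a rough illustration of the time bound with the example dimensions above, counting one multiply-accumulate per weight in the two dense layers:

\[ \text{MACs per token} \;\approx\; d\,d_{ff} + d_{ff}\,d \;=\; 2\cdot 768\cdot 3072 \;\approx\; 4.7\times 10^{6}, \]

multiplied by the \(B L\) tokens in a batch; the bias additions and the activation contribute only lower-order \(O(B L d_{ff})\) work.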

See also