Feedforward Network (Transformer FFN)¶
Formula¶
\[
\mathrm{FFN}(x)=W_2\,\phi(W_1x+b_1)+b_2
\]
Parameters¶
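- \(x \in \mathbb{R}^{d}\): token representation (one position of the sequence)
- \(W_1 \in \mathbb{R}^{d_{ff}\times d},\ b_1 \in \mathbb{R}^{d_{ff}},\ W_2 \in \mathbb{R}^{d\times d_{ff}},\ b_2 \in \mathbb{R}^{d}\): FFN parameters
- \(\phi\): elementwise activation (often ReLU or GELU; gated variants such as SwiGLU use an additional projection and do not fit this formula exactly)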
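As a minimal sketch of the formula above for a single token, assuming NumPy and the tanh approximation of GELU (the names `ffn` and `gelu`, the random initialization, and the small sizes `d = 8`, `d_ff = 32` are illustrative and not from the source):

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, applied elementwise
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def ffn(x, W1, b1, W2, b2, phi=gelu):
    """FFN(x) = W2 phi(W1 x + b1) + b2 for one token vector x of shape (d,)."""
    h = phi(W1 @ x + b1)   # expand: shape (d_ff,)
    return W2 @ h + b2     # project back: shape (d,)

rng = np.random.default_rng(0)
d, d_ff = 8, 32                               # illustrative sizes
W1 = rng.normal(size=(d_ff, d)) * 0.1         # (d_ff, d)
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d, d_ff)) * 0.1         # (d, d_ff)
b2 = np.zeros(d)

x = rng.normal(size=d)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,)
```

Passing a different `phi` (for example `lambda z: np.maximum(z, 0.0)` for ReLU) swaps the activation without touching the rest of the computation.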
What it means¶
A position-wise two-layer MLP: the same weights are applied independently to each token representation, following the attention sublayer in a Transformer block.
What it's used for¶
- Nonlinear feature transformation after attention.
- Increasing model capacity with an expansion-projection pattern.
Key properties¶
- Same weights reused across sequence positions.
- Hidden width \(d_{ff}\) is usually larger than the model dimension \(d\) (commonly \(d_{ff}=4d\)).
Common gotchas¶
- It is not a sequence-mixing step (attention does that).
- Activation choice affects model quality and speed.
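The first gotcha can be checked numerically. The sketch below assumes NumPy, ReLU, and random weights with made-up sizes (`ffn_rows` is a hypothetical helper name): it shows that permuting tokens only permutes the outputs, and that editing one token leaves every other token's output untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, d_ff = 5, 4, 16  # illustrative sizes, not from the source
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)

def ffn_rows(X):
    """Apply the FFN to every row (token) of X; ReLU is used for brevity."""
    H = np.maximum(X @ W1.T + b1, 0.0)
    return H @ W2.T + b2

X = rng.normal(size=(L, d))
Y = ffn_rows(X)

# Shuffling the tokens shuffles the outputs the same way: no token sees another.
perm = rng.permutation(L)
print(np.allclose(ffn_rows(X[perm]), Y[perm]))  # True

# Editing token 2 leaves the outputs of all other tokens unchanged.
X_edit = X.copy()
X_edit[2] += 1.0
others = [0, 1, 3, 4]
print(np.allclose(ffn_rows(X_edit)[others], Y[others]))  # True
```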
Example¶
A Transformer block with \(d=768\) may expand each token to \(d_{ff}=3072\) (a 4x expansion), apply GELU, then project back to \(768\).
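As a worked count for these dimensions, assuming the formula above with both bias terms included, the FFN parameters add up as follows:

\[
\underbrace{768\cdot 3072}_{W_1}+\underbrace{3072}_{b_1}+\underbrace{3072\cdot 768}_{W_2}+\underbrace{768}_{b_2}
=2{,}359{,}296+3{,}072+2{,}359{,}296+768
=4{,}722{,}432
\]

That is roughly 4.7M parameters for the FFN of a single block, almost all of it in the two weight matrices.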
How to Compute (Pseudocode)¶
Input: token representations X
Output: transformed token representations
for each token vector x in X:
    h <- activation(W1 x + b1)
    y <- W2 h + b2
    emit y
return all token outputs
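The pseudocode above can be sketched in NumPy in two ways: a direct per-token loop and an equivalent batched form that uses two matrix multiplies over all tokens. The function names, the `tanh` activation, and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def ffn_loop(X, W1, b1, W2, b2, activation):
    """Direct transcription of the pseudocode: X has shape (L, d)."""
    outputs = []
    for x in X:                          # one token vector at a time
        h = activation(W1 @ x + b1)      # (d_ff,)
        y = W2 @ h + b2                  # (d,)
        outputs.append(y)
    return np.stack(outputs)             # (L, d)

def ffn_batched(X, W1, b1, W2, b2, activation):
    """Same computation for all tokens at once: X has shape (..., d)."""
    H = activation(X @ W1.T + b1)        # (..., d_ff)
    return H @ W2.T + b2                 # (..., d)

rng = np.random.default_rng(0)
B, L, d, d_ff = 2, 6, 4, 16              # illustrative sizes
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)

X = rng.normal(size=(L, d))
same = np.allclose(ffn_loop(X, W1, b1, W2, b2, np.tanh),
                   ffn_batched(X, W1, b1, W2, b2, np.tanh))
print(same)  # True

# The batched form also accepts a leading batch axis without changes.
X_batch = rng.normal(size=(B, L, d))
print(ffn_batched(X_batch, W1, b1, W2, b2, np.tanh).shape)  # (2, 6, 4)
```

The batched form is what dense implementations typically run, since the weights are shared across tokens and the per-token loop collapses into a matrix product.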
Complexity¶
- Time: \(O(B L d d_{ff})\) for batch size \(B\), sequence length \(L\), model width \(d\), and FFN hidden width \(d_{ff}\) (dense implementation)
- Space: \(O(B L d_{ff})\) for hidden activations, plus \(O(B L d)\) for the output and \(O(d\,d_{ff})\) for the weights (caching activations for the backward pass during training adds more)
- Assumptions: Position-wise dense FFN with shared weights across tokens; activation cost is lower-order relative to matrix multiplies
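To put rough numbers on these bounds, the sketch below counts forward-pass FLOPs and hidden-activation memory under the stated assumptions (dense matrix multiplies dominate, one multiply and one add per weight per token). The helper `ffn_cost`, the batch size, the sequence length, and the 4-byte value size are hypothetical choices, not from the source.

```python
def ffn_cost(B, L, d, d_ff, bytes_per_value=4):
    """Estimate forward-pass matmul FLOPs and hidden-activation bytes for a dense FFN."""
    tokens = B * L
    # Two matmuls (d -> d_ff and d_ff -> d), each 2 * d * d_ff FLOPs per token.
    flops = 2 * (2 * tokens * d * d_ff)
    # One hidden vector of width d_ff per token.
    hidden_bytes = tokens * d_ff * bytes_per_value
    return flops, hidden_bytes

flops, hidden_bytes = ffn_cost(B=8, L=512, d=768, d_ff=3072)
print(flops)                   # 38654705664, about 38.7 GFLOPs
print(hidden_bytes / 2**20)    # 48.0 MiB of hidden activations
```

Doubling either \(B\) or \(L\) doubles both figures, which matches the linear dependence on \(B L\) in the bounds above.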