Transformer

Formula

\[ \mathrm{TransformerBlock}(X)=\mathrm{FFN}\big(\mathrm{AttnBlock}(X)\big) \]

Parameters

  • \(X\): sequence representations
  • \(\mathrm{AttnBlock}\): (self-/cross-)attention + residual + normalization
  • \(\mathrm{FFN}\): position-wise feedforward network + residual + normalization (see the expanded formula after this list)
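
Written out in one common arrangement (the original post-norm convention; many newer models instead normalize before each sublayer), with \(\mathrm{MHA}\) denoting multi-head attention and \(\mathrm{MLP}\) the position-wise two-layer network:

\[ \mathrm{AttnBlock}(X)=\mathrm{LayerNorm}\big(X+\mathrm{MHA}(X)\big),\qquad \mathrm{FFN}(Z)=\mathrm{LayerNorm}\big(Z+\mathrm{MLP}(Z)\big) \]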

What it means

A Transformer is a sequence model built from attention and feedforward blocks instead of recurrence or convolutions.

What it's used for

  • Language models and translation.
  • Vision/audio/multimodal sequence modeling.

Key properties

  • Highly parallelizable over tokens during training.
  • Captures long-range interactions via attention.

Common gotchas

  • Needs positional information because attention alone is permutation-equivariant (see the positional-encoding sketch after this list).
  • Memory cost grows quickly with sequence length for standard attention.
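
A minimal sketch of one standard remedy, the fixed sinusoidal positional encoding from the original Transformer paper; learned or rotary position embeddings are common alternatives, and the function name and sizes here are illustrative assumptions.

import numpy as np

def sinusoidal_positions(max_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings, shape (max_len, d_model); d_model assumed even."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Added to the token embeddings before the first block:
# X = token_embeddings + sinusoidal_positions(seq_len, d_model)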

Example

A decoder-only Transformer stacks masked self-attention + MLP blocks to predict the next token.
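
To illustrate the "masked" part, here is a minimal sketch of the causal (look-ahead) mask used in decoder-only self-attention; the function name and the additive-mask convention are assumptions for the example, not a particular library's API.

import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask: 0 where attention is allowed, -inf strictly above the diagonal."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Added to the attention scores before the softmax, so position i can only
# attend to positions j <= i:
# scores = Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len)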

How to Compute (Pseudocode)

Input: token embeddings + positions, stack of Transformer blocks
Output: contextualized sequence representations (or logits via output head)

X <- input embeddings with positional information
for each Transformer block:
  X <- attention sublayer + residual + normalization
  X <- FFN sublayer + residual + normalization
apply task/output head if needed
return outputs
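
A minimal runnable sketch of the loop above in PyTorch, using a post-norm arrangement; the dimensions, layer count, activation choice, and omission of dropout and masking are illustrative assumptions, not a canonical implementation.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # attention sublayer + residual + normalization
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # FFN sublayer + residual + normalization
        x = self.norm2(x + self.ffn(x))
        return x

# Stack of blocks over a (batch, seq_len, d_model) input; in practice the
# input would be token embeddings plus positional information.
blocks = nn.ModuleList([TransformerBlock() for _ in range(2)])
x = torch.randn(1, 10, 256)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([1, 10, 256])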

Complexity

  • Time: Dominated by attention and FFN costs across layers; for a standard dense block, attention is roughly \(O(n^2 d)\) and the FFN \(O(n d^2)\) per layer, where \(n\) is sequence length and \(d\) is model width
  • Space: Dominated by stored activations and the \(n \times n\) attention matrices during training (quadratic in sequence length for standard dense attention; see the worked example after this list)
  • Assumptions: Standard dense Transformer blocks; exact cost depends on layer count, hidden width, head count, sequence length, and implementation kernels
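
A rough worked example of the quadratic space term (the concrete numbers are assumptions chosen for illustration): with sequence length 4096, 32 heads, and 2-byte values, materializing one layer's dense attention matrices already takes about 1 GiB.

n, heads, bytes_per_value = 4096, 32, 2        # illustrative values only
attn_bytes = n * n * heads * bytes_per_value   # one layer's dense attention matrices
print(attn_bytes / 2**30)                      # -> 1.0 (GiB), before any activations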

See also