Transformer¶
Formula¶
\[
\mathrm{TransformerBlock}(X)=\mathrm{FFN}\big(\mathrm{AttnBlock}(X)\big)
\]
Parameters¶
- \(X\): sequence representations
- \(\mathrm{AttnBlock}\): (self-/cross-)attention + residual + normalization
- \(\mathrm{FFN}\): position-wise feedforward network + residual + normalization (both sublayer blocks are written out after this list)
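Written out, one common post-norm instantiation of the two sublayer blocks (the arrangement in the original Transformer; pre-norm variants apply the normalization before each sublayer instead) is:
\[
\mathrm{AttnBlock}(X)=\mathrm{LayerNorm}\big(X+\mathrm{Attention}(X)\big),
\qquad
\mathrm{FFN}(H)=\mathrm{LayerNorm}\big(H+\mathrm{MLP}(H)\big)
\]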
What it means¶
A Transformer is a sequence model built from attention and feedforward blocks instead of recurrence or convolutions.
What it's used for¶
- Language modeling and machine translation.
- Vision/audio/multimodal sequence modeling.
Key properties¶
- Highly parallelizable over tokens during training.
- Captures long-range interactions via attention.
Common gotchas¶
- Needs positional information because attention alone is permutation-equivariant (see the sketch after this list).
- Memory cost grows quadratically with sequence length for standard dense attention.
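For the positional-information point, here is a minimal sketch of the sinusoidal encodings used in the original Transformer, which are added to the token embeddings to break permutation symmetry (NumPy; the function name is illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: even dims use sin, odd dims use cos.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# X <- token embeddings + positional information
# X = token_embeddings + sinusoidal_positions(seq_len, d_model)
```

Learned positional embeddings and rotary embeddings are common alternatives to the sinusoidal scheme.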
Example¶
A decoder-only Transformer stacks masked self-attention + MLP blocks to predict the next token.
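The masking can be realized as an additive causal mask on the attention scores: positions above the diagonal (future tokens) get \(-\infty\) before the softmax, so they receive zero weight. A minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive mask: 0 on/below the diagonal, -inf above it (future positions)."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# scores: (seq_len, seq_len) attention logits
# weights = softmax(scores + causal_mask(seq_len), axis=-1)
```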
How to Compute (Pseudocode)¶
Input: token embeddings + positions, stack of Transformer blocks
Output: contextualized sequence representations (or logits via output head)
X <- input embeddings with positional information
for each Transformer block:
    X <- attention sublayer + residual + normalization
    X <- FFN sublayer + residual + normalization
apply task/output head if needed
return outputs
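The pseudocode above maps to the following minimal NumPy sketch of a single post-norm block. This is a sketch under simplifying assumptions: single-head attention, no mask, no learned LayerNorm parameters, and random placeholder weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature dimension, without learned scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(X, Wq, Wk, Wv, Wo):
    """Single-head scaled dot-product self-attention (no mask)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
    return (softmax(scores, axis=-1) @ V) @ Wo

def ffn(X, W1, b1, W2, b2):
    """Position-wise feedforward network with ReLU."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

def transformer_block(X, params):
    """Post-norm block: TransformerBlock(X) = FFN(AttnBlock(X))."""
    H = layer_norm(X + self_attention(X, *params["attn"]))   # AttnBlock
    return layer_norm(H + ffn(H, *params["ffn"]))            # FFN sublayer

# Tiny usage example with random weights (seq_len=4, d_model=8, d_ff=16).
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 16, 4
params = {
    "attn": [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)],
    "ffn": [rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff),
            rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model)],
}
X = rng.standard_normal((seq_len, d_model))
print(transformer_block(X, params).shape)  # (4, 8)
```

In practice the attention sublayer is multi-head and masked as needed, dropout and learned normalization parameters are included, and many such blocks are stacked between an embedding layer and an output head.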
Complexity¶
- Time: Dominated by attention and FFN costs across layers (standard dense self-attention gives quadratic dependence on sequence length per layer; see the rough accounting after this list)
- Space: Dominated by activations and attention matrices during training (quadratic in sequence length for standard dense attention)
- Assumptions: Standard dense Transformer blocks; exact cost depends on layer count, hidden width, head count, sequence length, and implementation kernels
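As a rough per-layer accounting under those assumptions, with sequence length \(n\), model width \(d\), and feedforward width \(d_{\mathrm{ff}}\) (symbols introduced here for illustration; constants and head count omitted):
\[
\text{time} \approx O(n^{2} d)\ \text{(attention)} + O(n\, d\, d_{\mathrm{ff}})\ \text{(FFN)},
\qquad
\text{attention-weights memory} \approx O(n^{2})\ \text{per head per layer}
\]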