Multi-Head Attention¶
Formula¶
\[
\mathrm{MHA}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W_O
\]
\[
\mathrm{head}_i=\mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V)
\]
Parameters¶
- \(h\): number of heads
- \(W_i^Q,W_i^K,W_i^V\): per-head projections
- \(W_O\): output projection
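Concretely, under the common convention that the model dimension \(d\) is split evenly across heads (so \(d_h = d/h\)), the projections have the shapes:
\[
W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_h},
\qquad W_O \in \mathbb{R}^{h\,d_h \times d}
\]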
What it means¶
Runs several attention mechanisms in parallel so the model can attend to different patterns/subspaces at once.
What it's used for¶
- Core attention module in Transformers.
- Improving representational flexibility vs single-head attention.
Key properties¶
- Same input can produce multiple distinct attention maps.
- Output dimension is typically restored by \(W_O\).
Common gotchas¶
- In most implementations the model dimension \(d\) must be divisible by the number of heads \(h\), so each head gets an integer per-head dimension \(d_h = d/h\) (see the sketch after this list).
- More heads does not always mean better performance.
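A minimal sketch of the first gotcha as a framework-style argument check (the variable names are illustrative, not from a specific library):

```python
d, h = 512, 8  # model dimension and number of heads
assert d % h == 0, "model dimension must be divisible by the number of heads"
d_h = d // h   # 64: per-head dimension
```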
Example¶
One head may focus on local syntax while another captures long-range dependencies.
How to Compute (Pseudocode)¶
Input: Q, K, V and number of heads h
Output: multi-head attention output
for each head i in 1..h:
    project Q, K, V into head i subspace as Q_i, K_i, V_i
    compute head_i <- Attention(Q_i, K_i, V_i)
concat all head_i outputs
apply output projection W_O
return result
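The pseudocode translates fairly directly into code. Below is a minimal NumPy sketch (function and variable names are illustrative rather than taken from any particular library), assuming scaled dot-product attention for the per-head Attention step and self-attention in the usage example:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)       # (L_q, L_k) attention logits
    return softmax(scores, axis=-1) @ V   # (L_q, d_h) weighted values

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Q, K, V: (L, d). W_q/W_k/W_v: lists of h per-head (d, d_h) matrices.
    W_o: (h * d_h, d) output projection."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Project into this head's subspace, then run attention there.
        heads.append(attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i))
    concat = np.concatenate(heads, axis=-1)  # (L, h * d_h)
    return concat @ W_o                      # (L, d)

# Illustrative usage with random weights: d = 64, h = 4, d_h = 16.
rng = np.random.default_rng(0)
L, d, h = 10, 64, 4
d_h = d // h
X = rng.normal(size=(L, d))
W_q = [rng.normal(size=(d, d_h)) for _ in range(h)]
W_k = [rng.normal(size=(d, d_h)) for _ in range(h)]
W_v = [rng.normal(size=(d, d_h)) for _ in range(h)]
W_o = rng.normal(size=(h * d_h, d))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # self-attention
print(out.shape)  # (10, 64)
```

The final multiplication by `W_o` maps the concatenated \(h \cdot d_h\) features back to the model dimension \(d\), which is the sense in which \(W_O\) restores the output dimension.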
Complexity¶
- Time: \(O(h L^2 d_h) \approx O(L^2 d)\) for dense self-attention, plus \(O(L d^2)\) for the Q/K/V and output projections (the projection term dominates when \(L < d\))
- Space: \(O(h L^2)\) for the per-head attention matrices, so still quadratic in sequence length for dense attention
- Assumptions: Hidden dimension \(d = h d_h\); dense attention implementation without sparsity/linear-attention approximations
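As a rough worked example with illustrative numbers: for \(L = 1024\), \(h = 8\), \(d_h = 64\) (so \(d = 512\)), the score computation is on the order of \(h L^2 d_h = 8 \cdot 1024^2 \cdot 64 \approx 5.4 \times 10^8\) multiply-accumulates, and the attention weights alone hold \(h L^2 = 8 \cdot 1024^2 \approx 8.4 \times 10^6\) entries per sequence.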