Attention

Formula

\[ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]

Parameters

  • \(Q\): queries
  • \(K\): keys
  • \(V\): values
  • \(d_k\): dimensionality of the key (and query) vectors, used in the scaling term

What it means

Attention computes weighted combinations of value vectors, where weights are determined by similarity between queries and keys.

What it's used for

  • Letting models dynamically focus on relevant tokens/features.
  • Sequence modeling in transformers and multimodal models.

Key properties

  • Weights are data-dependent (not fixed convolution kernels).
  • Softmax-normalized weights make each output a convex combination of values (row-wise).
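The convex-combination property above can be checked directly. A minimal NumPy sketch (random shapes and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 3))   # 6 values, d_v = 3

scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V

# Row-wise convex combination: nonnegative weights summing to 1,
# so each output coordinate stays within the range spanned by the values.
print(np.allclose(weights.sum(axis=-1), 1.0))
print((weights >= 0).all())
print((out >= V.min(axis=0) - 1e-9).all() and (out <= V.max(axis=0) + 1e-9).all())
```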

Common gotchas

  • Attention weights are not always faithful explanations.
  • Omitting the \(\sqrt{d_k}\) scaling lets the variance of the dot products grow with \(d_k\), which saturates the softmax and can destabilize training at large dimensions.
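The scaling gotcha is easy to demonstrate numerically. A small sketch (assuming unit-variance random queries and keys; `d_k = 512` is an arbitrary large dimension):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 512
q = rng.normal(size=d_k)
K = rng.normal(size=(10, d_k))

raw = K @ q                   # dot-product variance grows like d_k
scaled = raw / np.sqrt(d_k)   # rescaled back to roughly unit variance

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Unscaled logits are an order of magnitude larger, so the softmax
# concentrates almost all mass on one key (near-zero gradients elsewhere).
print(softmax(raw).max())     # typically near 1.0
print(softmax(scaled).max())  # noticeably more spread out
```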

Example

If a query strongly matches the key for token 3, the output becomes a weighted average dominated by value 3.
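This toy scenario can be worked through with hypothetical numbers: below, the query is aligned with the third key (index 2), so the softmax weight on that key dominates and the output lands near the third value vector:

```python
import numpy as np

K = np.array([[1., 0.], [0., 1.], [5., 5.]])   # third key points along (1, 1)
V = np.array([[10., 0.], [0., 10.], [7., 7.]])
q = np.array([5., 5.])                          # strongly matches K[2]

scores = K @ q / np.sqrt(2)                     # d_k = 2
w = np.exp(scores - scores.max())
w /= w.sum()
out = w @ V

print(w)    # weight on index 2 is overwhelmingly the largest
print(out)  # close to V[2] = [7, 7]
```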

How to Compute (Pseudocode)

Input: queries Q, keys K, values V
Output: attention outputs

scores <- (Q K^T) / sqrt(d_k)
weights <- softmax(scores)   # row-wise over keys
output <- weights V
return output
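The pseudocode above translates directly into a runnable NumPy sketch (single head, no masking; the max-subtraction inside the softmax is a standard numerical-stability step not shown in the pseudocode):

```python
import numpy as np

def attention(Q, K, V):
    """Dense scaled dot-product attention, one head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L_q, L_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # (L_q, d_v) outputs

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5
out = attention(Q, K, V)
print(out.shape)  # one d_v-dimensional output per query
```

With a single key, the softmax weight is exactly 1, so the output reproduces that key's value; this is a quick sanity check on any implementation.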

Complexity

  • Time: \(O(L_q L_k d_k + L_q L_k d_v)\) for dense attention (often summarized as quadratic in sequence length when \(L_q \approx L_k\))
  • Space: \(O(L_q L_k)\) for attention score/weight matrices, plus input/output tensors
  • Assumptions: Dense scaled dot-product attention without sparsity/flash-style kernels; batch and head dimensions are omitted for readability
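A back-of-envelope check of the quadratic cost, under assumed values (float32, single head, `L = 4096`, `d = 64`; the FLOP count treats each of the two matmuls as roughly `2·L·L·d` operations):

```python
# Memory for the (L, L) score/weight matrix and a rough FLOP estimate
L, d = 4096, 64
scores_bytes = L * L * 4            # float32 score matrix: O(L^2) memory
flops = 2 * L * L * d * 2           # QK^T plus weights @ V

print(scores_bytes / 2**20, "MiB for the score matrix")
print(flops / 1e9, "GFLOPs")
```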

See also