Attention¶
Formula¶
\[
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Parameters¶
- \(Q\): queries
- \(K\): keys
- \(V\): values
- \(d_k\): key dimension (scaling term)
What it means¶
Attention computes weighted combinations of value vectors, where weights are determined by similarity between queries and keys.
What it's used for¶
- Letting models dynamically focus on relevant tokens/features.
- Sequence modeling in transformers and multimodal models.
Key properties¶
- Weights are data-dependent (not fixed convolution kernels).
- Softmax-normalized weights make each output a convex combination of values (row-wise).
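The convexity property above can be checked numerically; a minimal sketch (assuming NumPy, with arbitrary random shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys of dimension d_k = 8

scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

# Each row of `weights` is a probability distribution over the 6 keys,
# so each output row is a convex combination of the value vectors.
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
print((weights >= 0).all())                    # True
```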
Common gotchas¶
- Attention weights are not always faithful explanations.
- Omitting the \(\sqrt{d_k}\) scaling lets the dot products grow with dimension (variance \(\approx d_k\) for unit-variance inputs), which saturates the softmax into near-one-hot weights and can destabilize training for large \(d_k\).
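The scaling gotcha can be demonstrated directly; a small sketch (assuming NumPy, with `d_k = 512` and 10 keys as illustrative values), where unit-variance inputs give unscaled logits with standard deviation around \(\sqrt{d_k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)          # one query
K = rng.normal(size=(10, d_k))    # 10 keys

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

unscaled = softmax(q @ K.T)                 # logit std ~ sqrt(512) ≈ 22.6
scaled = softmax(q @ K.T / np.sqrt(d_k))    # logit std ~ 1

print(unscaled.max())  # typically near 1: almost one-hot, tiny gradients
print(scaled.max())    # noticeably smaller: a smoother distribution
```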
Example¶
If a query strongly matches the key for token 3, the output becomes a weighted average dominated by the value vector of token 3.
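This example can be made concrete with hand-picked numbers (a NumPy sketch; the keys, values, and query below are illustrative, with token 3 at index 2):

```python
import numpy as np

K = np.array([[1., 0.], [0., 1.], [5., 5.]])   # token 3's key aligns with q
V = np.array([[10., 0.], [0., 10.], [7., 7.]])
q = np.array([1., 1.])                          # query pointing toward token 3's key

scores = (q @ K.T) / np.sqrt(K.shape[-1])       # approx [0.71, 0.71, 7.07]
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(weights.round(3))  # weight on token 3 dominates (> 0.99)
print(weights @ V)       # output ≈ token 3's value, roughly [7, 7]
```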
How to Compute (Pseudocode)¶
Input: queries Q, keys K (dimension d_k), values V
Output: attention outputs
scores <- (Q K^T) / sqrt(d_k)
weights <- softmax(scores) # row-wise over keys
output <- weights V
return output
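The pseudocode maps directly to a few lines of NumPy; a sketch of dense scaled dot-product attention, with batch and head dimensions omitted as in the complexity notes below:

```python
import numpy as np

def attention(Q, K, V):
    """Dense scaled dot-product attention.

    Shapes: Q (L_q, d_k), K (L_k, d_k), V (L_k, d_v) -> output (L_q, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L_q, L_k)
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # (L_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
print(attention(Q, K, V).shape)  # (4, 3)
```

As a sanity check, all-zero queries and keys give uniform weights, so each output row is simply the mean of the value vectors.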
Complexity¶
- Time: \(O(L_q L_k d_k + L_q L_k d_v)\) for dense attention (often summarized as quadratic in sequence length when \(L_q \approx L_k\))
- Space: \(O(L_q L_k)\) for attention score/weight matrices, plus input/output tensors
- Assumptions: Dense scaled dot-product attention without sparsity/flash-style kernels; batch and head dimensions are omitted for readability