Attention¶
Formula¶
\[
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Parameters¶
- \(Q\): queries
- \(K\): keys
- \(V\): values
- \(d_k\): key dimension (scaling term)
What it means¶
Attention computes weighted combinations of value vectors, where weights are determined by similarity between queries and keys.
What it's used for¶
- Letting models dynamically focus on relevant tokens/features.
- Sequence modeling in transformers and multimodal models.
Key properties¶
- Weights are data-dependent (not fixed convolution kernels).
- Softmax-normalized weights make each output a convex combination of values (row-wise).
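The convexity property above can be checked numerically; a minimal sketch (assuming NumPy, with arbitrary random shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries of dimension d_k = 8
K = rng.normal(size=(6, 8))   # 6 keys of dimension d_k = 8

scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax

# Each row of `weights` is a probability distribution over the 6 keys,
# so each output row is a convex combination of the value vectors.
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
print((weights >= 0).all())                    # True
```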
Common gotchas¶
- Attention weights are not always faithful explanations.
- Omitting the \(\sqrt{d_k}\) scaling lets the dot products grow with dimension (variance \(\approx d_k\) for unit-variance inputs), which saturates the softmax into near-one-hot weights and can destabilize training for large \(d_k\).
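The scaling gotcha can be demonstrated directly; a small sketch (assuming NumPy, with `d_k = 512` and 10 keys as illustrative values), where unit-variance inputs give unscaled logits with standard deviation around \(\sqrt{d_k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)          # one query
K = rng.normal(size=(10, d_k))    # 10 keys

def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()

unscaled = softmax(q @ K.T)                 # logit std ~ sqrt(512) ≈ 22.6
scaled = softmax(q @ K.T / np.sqrt(d_k))    # logit std ~ 1

print(unscaled.max())  # typically near 1: almost one-hot, tiny gradients
print(scaled.max())    # noticeably smaller: a smoother distribution
```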
Example¶
If a query strongly matches the key for token 3, the output becomes a weighted average dominated by the value vector of token 3.
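This example can be made concrete with hand-picked numbers (a NumPy sketch; the keys, values, and query below are illustrative, with token 3 at index 2):

```python
import numpy as np

K = np.array([[1., 0.], [0., 1.], [5., 5.]])   # token 3's key aligns with q
V = np.array([[10., 0.], [0., 10.], [7., 7.]])
q = np.array([1., 1.])                          # query pointing toward token 3's key

scores = (q @ K.T) / np.sqrt(K.shape[-1])       # approx [0.71, 0.71, 7.07]
weights = np.exp(scores - scores.max())
weights /= weights.sum()

print(weights.round(3))  # weight on token 3 dominates (> 0.99)
print(weights @ V)       # output ≈ token 3's value, roughly [7, 7]
```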
How to Compute (Pseudocode)¶
Input: queries Q, keys K (dimension d_k), values V
Output: attention outputs
scores <- (Q K^T) / sqrt(d_k)
weights <- softmax(scores) # row-wise over keys
output <- weights V
return output
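The pseudocode maps directly to a few lines of NumPy; a sketch of dense scaled dot-product attention, with batch and head dimensions omitted as in the complexity notes below:

```python
import numpy as np

def attention(Q, K, V):
    """Dense scaled dot-product attention.

    Shapes: Q (L_q, d_k), K (L_k, d_k), V (L_k, d_v) -> output (L_q, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L_q, L_k)
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # (L_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 3))
print(attention(Q, K, V).shape)  # (4, 3)
```

As a sanity check, all-zero queries and keys give uniform weights, so each output row is simply the mean of the value vectors.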
Complexity¶
- Time: \(O(L_q L_k d_k + L_q L_k d_v)\) for dense attention (often summarized as quadratic in sequence length when \(L_q \approx L_k\))
- Space: \(O(L_q L_k)\) for attention score/weight matrices, plus input/output tensors
- Assumptions: Dense scaled dot-product attention without sparsity/flash-style kernels; batch and head dimensions are omitted for readability