Positional Encoding¶
Formula¶
\[
\mathrm{PE}(pos,2i)=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\quad
\mathrm{PE}(pos,2i+1)=\cos\!\left(\frac{pos}{10000^{2i/d}}\right)
\]
\[
X_{\text{in}} = E + \mathrm{PE}
\]
Parameters¶
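- \(pos\): token position in the sequence (typically counted from 0)
- \(i\): index of the sine/cosine pair, \(i = 0, \dots, d/2 - 1\); features \(2i\) and \(2i+1\) share one frequency
- \(d\): model dimension (assumed even so that features pair up)
- \(E\): token embeddings, one \(d\)-dimensional vector per position
- \(\mathrm{PE}\): positional encodings, same shape as \(E\)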
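As a small worked instance of the formula, take the illustrative values \(d = 4\) and \(pos = 1\) (chosen here only for the arithmetic). The two frequency pairs use denominators \(10000^{0} = 1\) and \(10000^{2/4} = 100\):
\[
\mathrm{PE}(1,\cdot)=\bigl(\sin(1),\ \cos(1),\ \sin(0.01),\ \cos(0.01)\bigr)\approx(0.8415,\ 0.5403,\ 0.0100,\ 0.99995)
\]
For comparison, \(\mathrm{PE}(0,\cdot) = (0, 1, 0, 1)\), since \(\sin(0)=0\) and \(\cos(0)=1\) at every frequency.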
What it means¶
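Self-attention on its own is insensitive to token order. Adding a position-dependent vector to each token embedding gives the model a signal it can use to tell positions apart.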
What it's used for¶
- Transformer inputs in text and other sequences.
- Absolute or relative position representations.
Key properties¶
- Can be fixed (sinusoidal) or learned.
- Relative schemes, which encode the offset between tokens rather than each token's absolute index, often improve long-context behavior.
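To illustrate the "learned" option, here is a minimal NumPy sketch of a lookup-table scheme. The names `pos_table`, `learned_pe`, `max_len`, and the sizes are hypothetical, and the table is only randomly initialized here; in a real model its rows would be trained along with the other parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 8, 4  # hypothetical maximum length and model dimension

# One vector per position. Random values stand in for trained parameters.
pos_table = rng.normal(scale=0.02, size=(max_len, d))

def learned_pe(seq_len: int) -> np.ndarray:
    """Look up the first seq_len rows of the position table."""
    if seq_len > max_len:
        # A lookup table has no rows for positions it was never given.
        raise ValueError(f"seq_len {seq_len} exceeds max_len {max_len}")
    return pos_table[:seq_len]

PE = learned_pe(3)  # shape (3, 4), added to E exactly like the sinusoidal case
```

Unlike the sinusoidal formula, this table cannot produce a vector for a position beyond `max_len` without being resized or retrained.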
Common gotchas¶
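- Position indexing and masking must align: when sequences are padded, each real token must receive the position it would have under the convention used in training.
- Extrapolation behavior depends on the positional encoding type: a learned lookup table has no entries beyond its maximum trained length, while sinusoidal encodings can be computed for any position but are not guaranteed to generalize there.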
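The alignment issue can be made concrete with a sketch of one common convention: derive position indices from the padding mask so that the first real token in each row gets position 0. The function name `position_ids_from_mask` is hypothetical, and the sketch assumes a binary mask with 1 for real tokens and 0 for padding.

```python
import numpy as np

def position_ids_from_mask(mask: np.ndarray) -> np.ndarray:
    """mask: (batch, seq_len) array of 0/1. Returns integer positions per token."""
    # The running count of real tokens, minus one, is a 0-based position.
    ids = np.cumsum(mask, axis=-1) - 1
    # Padded slots get a dummy position 0; attention masks them out anyway.
    return np.where(mask == 1, ids, 0)

mask = np.array([
    [0, 0, 1, 1, 1],   # left-padded row
    [1, 1, 1, 1, 1],   # unpadded row
])
print(position_ids_from_mask(mask))
# [[0 0 0 1 2]
#  [0 1 2 3 4]]
```

Using a plain `0..L-1` range instead would give the first real token of the left-padded row position 2, which is the kind of mismatch the gotcha above refers to.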
Example¶
Two identical tokens at different positions get different input vectors after adding positional encodings.
How to Compute (Pseudocode)¶
Input: token embeddings E[positions], positional-encoding scheme
Output: position-aware inputs X_in
compute positional vectors PE for each position (fixed sinusoidal or learned lookup)
X_in <- E + PE
return X_in
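The pseudocode above could be realized for the fixed sinusoidal case roughly as follows. This is a NumPy sketch under stated assumptions: the function names `sinusoidal_pe` and `add_positions` are illustrative, positions are 0-based, and \(d\) is even.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Return a (seq_len, d) array of fixed sinusoidal positional encodings."""
    if d % 2 != 0:
        raise ValueError("d must be even so sine/cosine features pair up")
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d // 2)[None, :]          # (1, d/2), one index per sin/cos pair
    angles = pos / base ** (2 * i / d)      # (seq_len, d/2) via broadcasting
    pe = np.empty((seq_len, d))
    pe[:, 0::2] = np.sin(angles)            # even features: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)            # odd features:  PE(pos, 2i+1)
    return pe

def add_positions(E: np.ndarray) -> np.ndarray:
    """X_in = E + PE for token embeddings E of shape (seq_len, d)."""
    seq_len, d = E.shape
    return E + sinusoidal_pe(seq_len, d)

# The same token embedding placed at positions 0 and 1.
E = np.tile(np.array([[0.5, -0.5, 0.25, 0.0]]), (2, 1))
X_in = add_positions(E)
print(np.allclose(X_in[0], X_in[1]))  # False
```

The two rows of `E` are identical, yet the rows of `X_in` differ because each receives a different positional vector.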
Complexity¶
- Time: Typically \(O(Ld)\) to generate/lookup and add positional encodings for sequence length \(L\) and model dimension \(d\)
- Space: \(O(Ld)\) for position encodings (or \(O(d)\) if generated on the fly per position) plus outputs
- Assumptions: Absolute positional encodings shown; relative position schemes alter computation and memory patterns in attention layers