Positional Encoding¶

Formula¶

\[ \mathrm{PE}(pos,2i)=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\quad \mathrm{PE}(pos,2i+1)=\cos\!\left(\frac{pos}{10000^{2i/d}}\right) \]

\[ X_{\text{in}} = E + \mathrm{PE} \]

Parameters¶

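\(pos\): token position in the sequence (typically 0-based)
\(i\): index of the sine/cosine pair, \(0 \le i < d/2\); features \(2i\) and \(2i+1\) share one frequency
\(d\): model (embedding) dimension, assumed even in the formula above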
\(E\): token embeddings
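
The sinusoidal formula can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `sinusoidal_pe` and the `base` parameter are assumptions introduced here, positions are taken to be 0-based, and \(d\) is assumed even.

```python
import math

def sinusoidal_pe(length: int, d: int, base: float = 10000.0) -> list[list[float]]:
    """Return `length` rows of size `d`, one row per position 0..length-1."""
    if d % 2 != 0:
        raise ValueError("this sketch assumes an even model dimension d")
    table = []
    for pos in range(length):
        row = [0.0] * d
        for i in range(d // 2):
            # One frequency per (sin, cos) pair: pos / base^(2i/d)
            angle = pos / base ** (2 * i / d)
            row[2 * i] = math.sin(angle)      # even feature 2i: sine
            row[2 * i + 1] = math.cos(angle)  # odd feature 2i+1: cosine
        table.append(row)
    return table

pe = sinusoidal_pe(length=4, d=6)
print(len(pe), len(pe[0]))  # 4 6
print(pe[0])                # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Position 0 always maps to alternating 0 and 1 because every angle is zero there. In practice an array library would likely compute the whole table at once.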

What it means¶

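Adds position information to token embeddings. Self-attention itself has no built-in notion of token order, so this signal is what lets the model tell apart different orderings of the same tokens.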

What it's used for¶

Inputs to Transformer models over text and other sequential data.
Representing position either absolutely (index in the sequence) or relatively (offset between tokens).

Key properties¶

Can be fixed (sinusoidal) or learned.
Relative schemes, which encode the offset between tokens rather than each token's absolute index, often improve long-context behavior.
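
For contrast with the fixed sinusoidal variant, a learned scheme can be pictured as a lookup table with one row per position. The sketch below is only an illustration under stated assumptions: `init_learned_pe` and `lookup_positions` are hypothetical helper names, the 0.02 standard deviation is an arbitrary choice, and in a real model the table entries would be trainable parameters updated by the optimizer.

```python
import random

def init_learned_pe(max_len: int, d: int, seed: int = 0) -> list[list[float]]:
    """Hypothetical helper: a (max_len, d) table of small random initial values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(max_len)]

def lookup_positions(table: list[list[float]], positions: list[int]) -> list[list[float]]:
    """Fetch one row per position; positions outside the table are rejected."""
    for p in positions:
        if not 0 <= p < len(table):
            raise IndexError(f"position {p} is outside the table of length {len(table)}")
    return [table[p] for p in positions]

table = init_learned_pe(max_len=8, d=4)
rows = lookup_positions(table, [0, 2, 2])
print(len(rows), len(rows[0]))  # 3 4
```

Unlike the sinusoidal formula, the table has a fixed number of rows, so a position at or beyond `max_len` has no encoding at all.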

Common gotchas¶

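Position indices must be consistent with the padding/attention mask: with left padding, for example, the first real token should typically still receive the position it would have had without padding, not its raw offset in the padded sequence.
Extrapolation depends on the scheme: a learned lookup table has no entry for positions beyond its maximum trained length, while sinusoidal encodings can be computed for any position but are not guaranteed to generalize to lengths unseen in training.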
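
One common way to keep positions and masks aligned is to derive position indices from the mask itself. The following is a sketch, assuming a mask with 1 for real tokens and 0 for padding; the function name `position_ids_from_mask` is hypothetical.

```python
from itertools import accumulate

def position_ids_from_mask(mask: list[int]) -> list[int]:
    """mask[j] is 1 for a real token and 0 for padding."""
    # Running count of real tokens up to and including slot j.
    running = accumulate(mask)
    # The first real token gets position 0; leading padding is clamped to 0.
    return [max(count - 1, 0) for count in running]

left_padded = [0, 0, 1, 1, 1]
print(position_ids_from_mask(left_padded))  # [0, 0, 0, 1, 2]
```

Padded slots still receive some index, but that value should not matter as long as the same mask hides those slots from attention.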

Example¶

Two identical tokens at different positions get different input vectors after adding positional encodings.
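
As a worked instance with sinusoidal encodings, assume \(d = 4\) and a token with embedding \(e\) that appears at positions 0 and 1 (values chosen only for illustration). For \(i = 1\) the denominator is \(10000^{2/4} = 100\), so:

\[ \mathrm{PE}(0,\cdot) = (\sin 0,\ \cos 0,\ \sin 0,\ \cos 0) = (0,\ 1,\ 0,\ 1) \]

\[ \mathrm{PE}(1,\cdot) = (\sin 1,\ \cos 1,\ \sin 0.01,\ \cos 0.01) \approx (0.84147,\ 0.54030,\ 0.01000,\ 0.99995) \]

The two inputs are then \(e + \mathrm{PE}(0,\cdot)\) and \(e + \mathrm{PE}(1,\cdot)\), which differ even though the token embedding \(e\) is the same.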

How to Compute (Pseudocode)¶

Input: token embeddings E (one row per position), positional-encoding scheme
Output: position-aware inputs X_in

compute positional vectors PE for each position (fixed sinusoidal or learned lookup)
X_in <- E + PE
return X_in
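
The pseudocode above could be made runnable roughly as follows. This sketch is scheme-agnostic: it assumes the positional table `pe_table` has already been produced (by a sinusoidal formula or a learned lookup), and the function name `add_positional_encoding` and the placeholder numbers are illustrative only.

```python
def add_positional_encoding(E: list[list[float]],
                            pe_table: list[list[float]]) -> list[list[float]]:
    """Return X_in = E + PE for a single sequence.

    E: one embedding vector per token, in sequence order.
    pe_table: one positional vector per position 0..max_len-1, from any scheme.
    """
    if len(E) > len(pe_table):
        raise ValueError("sequence is longer than the positional table")
    X_in = []
    for token_vec, pos_vec in zip(E, pe_table):
        if len(token_vec) != len(pos_vec):
            raise ValueError("embedding and positional dimensions differ")
        # Element-wise sum of the token embedding and its position vector.
        X_in.append([e + p for e, p in zip(token_vec, pos_vec)])
    return X_in

# Placeholder values: tokens 0 and 1 share the same embedding.
E = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
pe_table = [[0.0, 0.5], [0.25, 0.5], [0.5, 0.5], [0.75, 0.5]]
X_in = add_positional_encoding(E, pe_table)
print(X_in)  # [[1.0, 2.5], [1.25, 2.5], [3.5, 4.5]]
```

The first two tokens start from identical embeddings but end up with different input vectors because they sit at different positions.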

Complexity¶

Time: Typically \(O(Ld)\) to generate/lookup and add positional encodings for sequence length \(L\) and model dimension \(d\)
Space: \(O(Ld)\) for position encodings (or \(O(d)\) if generated on the fly per position) plus outputs
Assumptions: Absolute positional encodings shown; relative position schemes alter computation and memory patterns in attention layers