Embedding¶
Formula¶
\[
e_i = E[i] \in \mathbb{R}^d
\]
Parameters¶
- \(E\in\mathbb{R}^{V\times d}\): embedding matrix
- \(i\): token/item index
- \(e_i\): dense vector representation
What it means¶
An embedding maps a discrete ID (token, item, node) to a learned dense vector.
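As a minimal sketch of that mapping (toy sizes and a randomly initialized matrix are assumed here, using NumPy), the lookup is just a row index into \(E\):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                  # toy vocabulary size and embedding dimension (assumed)
E = rng.normal(size=(V, d))   # embedding matrix, randomly initialized

token_id = 7
e = E[token_id]               # row lookup: the dense vector for ID 7
print(e.shape)                # (4,)
```

In a real model the rows of \(E\) are learned parameters updated by gradient descent, not fixed random values.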
What it's used for¶
- Token representations in language models.
- Categorical feature representations in ML systems.
Key properties¶
- Semantically similar entities tend to learn nearby vectors.
- Usually trained jointly with the task model.
Common gotchas¶
- OOV/unknown handling depends on tokenization scheme.
- Vocabulary size heavily affects memory cost.
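One common way to handle the OOV gotcha (a hypothetical vocabulary with a reserved `<unk>` index; the names here are illustrative, not a fixed convention) is to map every unseen token to a dedicated unknown row:

```python
# Hypothetical vocabulary with a reserved <unk> row for out-of-vocabulary tokens.
vocab = {"<unk>": 0, "the": 1, "cat": 2}

def token_to_id(token: str) -> int:
    # Unseen tokens fall back to the <unk> index instead of raising an error.
    return vocab.get(token, vocab["<unk>"])

print(token_to_id("cat"))    # 2
print(token_to_id("zebra"))  # 0 (mapped to <unk>)
```

Subword tokenizers sidestep the problem differently, by decomposing unseen words into known pieces so no ID is ever truly out of vocabulary.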
Example¶
A vocabulary of size \(50{,}000\) with dimension \(768\) uses an embedding matrix \(E\in\mathbb{R}^{50000\times768}\).
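The memory cost of that example can be checked directly (assuming 4-byte float32 parameters, which is a common but not universal choice):

```python
V, d = 50_000, 768
params = V * d             # number of learned weights in E
bytes_fp32 = params * 4    # 4 bytes per float32 parameter (assumed dtype)
print(params)              # 38400000
print(bytes_fp32 / 1e6)    # 153.6  (MB)
```

At roughly 154 MB for the matrix alone, the embedding table is often one of the largest single parameter blocks in a model, which is why vocabulary size dominates memory cost.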
How to Compute (Pseudocode)¶
Input: index sequence i[1..L], embedding matrix E
Output: embedding vectors e[1..L]
for each position t in 1..L:
    e[t] <- E[i[t]]   # row lookup
return e
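The loop above vectorizes to a single fancy-indexing operation; a minimal NumPy sketch (toy sizes assumed):

```python
import numpy as np

def embed(ids, E):
    # Vectorized row lookup: equivalent to looping over positions
    # and copying one row of E per index.
    return E[ids]

rng = np.random.default_rng(0)
V, d, L = 100, 8, 5            # toy sizes (assumed)
E = rng.normal(size=(V, d))
ids = np.array([3, 17, 42, 3, 99])
out = embed(ids, E)
print(out.shape)               # (5, 8)
```

Note that repeated indices (position 0 and 3 both hold ID 3) return identical rows, so during training their gradients accumulate into the same row of \(E\).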
Complexity¶
- Time: \(O(Ld)\) to gather \(L\) embedding vectors of dimension \(d\) (each row lookup is \(O(1)\), plus an \(O(d)\) vector copy)
- Space: \(O(Vd)\) for the embedding matrix, plus \(O(Ld)\) for the looked-up vectors
- Assumptions: vocabulary size \(V\); during training, optimizer state for the full matrix can dominate memory for large embeddings unless sparse updates are used