Embedding¶
Formula¶
\[
e_i = E[i] \in \mathbb{R}^d
\]
Parameters¶
- \(E\in\mathbb{R}^{V\times d}\): embedding matrix
- \(i\): token/item index
- \(e_i\): dense vector representation
What it means¶
An embedding maps a discrete ID (token, item, node) to a learned dense vector.
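As a minimal sketch of that mapping (toy sizes and a randomly initialized matrix are assumed here, using NumPy), the lookup is just a row index into \(E\):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                  # toy vocabulary size and embedding dimension (assumed)
E = rng.normal(size=(V, d))   # embedding matrix, randomly initialized

token_id = 7
e = E[token_id]               # row lookup: the dense vector for ID 7
print(e.shape)                # (4,)
```

In a real model the rows of \(E\) are learned parameters updated by gradient descent, not fixed random values.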
What it's used for¶
- Token representations in language models.
- Categorical feature representations in ML systems.
Key properties¶
- Semantically similar entities tend to learn nearby vectors.
- Usually trained jointly with the task model.
Common gotchas¶
- OOV/unknown handling depends on tokenization scheme.
- Vocabulary size heavily affects memory cost.
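One common way to handle the OOV gotcha (a hypothetical vocabulary with a reserved `<unk>` index; the names here are illustrative, not a fixed convention) is to map every unseen token to a dedicated unknown row:

```python
# Hypothetical vocabulary with a reserved <unk> row for out-of-vocabulary tokens.
vocab = {"<unk>": 0, "the": 1, "cat": 2}

def token_to_id(token: str) -> int:
    # Unseen tokens fall back to the <unk> index instead of raising an error.
    return vocab.get(token, vocab["<unk>"])

print(token_to_id("cat"))    # 2
print(token_to_id("zebra"))  # 0 (mapped to <unk>)
```

Subword tokenizers sidestep the problem differently, by decomposing unseen words into known pieces so no ID is ever truly out of vocabulary.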
Example¶
A vocabulary of size \(50{,}000\) with dimension \(768\) uses an embedding matrix \(E\in\mathbb{R}^{50000\times768}\).
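The memory cost of that example can be checked directly (assuming 4-byte float32 parameters, which is a common but not universal choice):

```python
V, d = 50_000, 768
params = V * d             # number of learned weights in E
bytes_fp32 = params * 4    # 4 bytes per float32 parameter (assumed dtype)
print(params)              # 38400000
print(bytes_fp32 / 1e6)    # 153.6  (MB)
```

At roughly 154 MB for the matrix alone, the embedding table is often one of the largest single parameter blocks in a model, which is why vocabulary size dominates memory cost.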
How to Compute (Pseudocode)¶
Input: index sequence i[1..L], embedding matrix E
Output: embedding vectors e[1..L]
for each position t in 1..L:
    e[t] <- E[i[t]]   # row lookup
return e
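The loop above vectorizes to a single fancy-indexing operation; a minimal NumPy sketch (toy sizes assumed):

```python
import numpy as np

def embed(ids, E):
    # Vectorized row lookup: equivalent to looping over positions
    # and copying one row of E per index.
    return E[ids]

rng = np.random.default_rng(0)
V, d, L = 100, 8, 5            # toy sizes (assumed)
E = rng.normal(size=(V, d))
ids = np.array([3, 17, 42, 3, 99])
out = embed(ids, E)
print(out.shape)               # (5, 8)
```

Note that repeated indices (position 0 and 3 both hold ID 3) return identical rows, so during training their gradients accumulate into the same row of \(E\).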
Complexity¶
- Time: \(O(Ld)\) to gather \(L\) embedding vectors of dimension \(d\) (each row lookup is \(O(1)\), plus an \(O(d)\) vector copy)
- Space: \(O(Vd)\) for the embedding matrix, plus \(O(Ld)\) for the looked-up vectors
- Assumptions: vocabulary size \(V\); during training, optimizer state for the full matrix can dominate memory for large embeddings unless sparse updates are used