Subword Tokenization (BPE / WordPiece)¶
Formula¶
\[
\text{word} \rightarrow \text{subword pieces}
\]
Parameters¶
- Vocabulary of subword units
- Merge rules / tokenization model
What it means¶
Subword tokenization represents text using pieces smaller than words, reducing unknown-word problems.
What it's used for¶
- Modern language models and translation systems.
- Handling rare words and morphology efficiently.
Key properties¶
- Balances vocabulary size vs sequence length.
- Frequently occurring patterns become reusable tokens.
Common gotchas¶
- Token boundaries differ across tokenizer families.
- Retokenizing with a different vocabulary can break model compatibility.
Example¶
"playing" may tokenize as play + ing in a subword scheme.
How to Compute (Pseudocode)¶
Input: text and a trained subword tokenizer (BPE/WordPiece-like)
Output: subword token sequence
pre-tokenize text if required by the tokenizer
for each text span/word:
    iteratively apply subword merges or longest-match lookup rules
    emit resulting subword pieces
map subword pieces to IDs
return subword IDs
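The merge-application step of the pseudocode can be sketched as a minimal BPE encoder. The merge list below is a hypothetical example chosen so that "playing" splits as in the example above; a trained tokenizer would supply its own learned merges:

```python
# Minimal BPE encoding given a learned merge list (illustrative;
# MERGES is hypothetical, ordered by merge priority as in BPE training).
MERGES = [("p", "l"), ("pl", "a"), ("pla", "y"), ("i", "n"), ("in", "g")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def bpe_encode(word, rank=RANK):
    """Repeatedly merge the highest-priority adjacent pair until none applies."""
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # rank every adjacent pair; unknown pairs get infinite rank
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_encode("playing"))  # ['play', 'ing']
```

Mapping the resulting pieces to integer IDs is then a plain dictionary lookup against the tokenizer's vocabulary.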
Complexity¶
- Time: Depends on tokenizer implementation and text length; commonly near-linear in characters/tokens with trie/hash-based lookup and merge rules
- Space: Linear in output token count plus tokenizer vocabulary/merge table storage
- Assumptions: Trained tokenizer is already available; training a tokenizer is a separate workflow with different costs