Subword Tokenization (BPE / WordPiece)¶
Formula¶
\[
\text{word} \rightarrow \text{subword pieces}
\]
Parameters¶
- Vocabulary of subword units
- Merge rules / tokenization model
What it means¶
Subword tokenization represents text using pieces smaller than words, reducing unknown-word problems.
What it's used for¶
- Modern language models and translation systems.
- Handling rare words and morphology efficiently.
Key properties¶
- Balances vocabulary size vs sequence length.
- Frequently occurring patterns become reusable tokens.
Common gotchas¶
- Token boundaries differ across tokenizer families.
- Retokenizing with a different vocabulary can break model compatibility.
Example¶
"playing" may tokenize as play + ing in a subword scheme.
How to Compute (Pseudocode)¶
Input: text and a trained subword tokenizer (BPE/WordPiece-like)
Output: subword token sequence
pre-tokenize text if required by the tokenizer
for each text span/word:
    iteratively apply subword merges or longest-match lookup rules
    emit resulting subword pieces
map subword pieces to IDs
return subword IDs
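The merge-application step of the pseudocode can be sketched as a minimal BPE encoder. The merge list below is a hypothetical example chosen so that "playing" splits as in the example above; a trained tokenizer would supply its own learned merges:

```python
# Minimal BPE encoding given a learned merge list (illustrative;
# MERGES is hypothetical, ordered by merge priority as in BPE training).
MERGES = [("p", "l"), ("pl", "a"), ("pla", "y"), ("i", "n"), ("in", "g")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def bpe_encode(word, rank=RANK):
    """Repeatedly merge the highest-priority adjacent pair until none applies."""
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # rank every adjacent pair; unknown pairs get infinite rank
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies anymore
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_encode("playing"))  # ['play', 'ing']
```

Mapping the resulting pieces to integer IDs is then a plain dictionary lookup against the tokenizer's vocabulary.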
Complexity¶
- Time: Depends on tokenizer implementation and text length; commonly near-linear in characters/tokens with trie/hash-based lookup and merge rules
- Space: Linear in output token count plus tokenizer vocabulary/merge table storage
- Assumptions: Trained tokenizer is already available; training a tokenizer is a separate workflow with different costs