Tokenization¶
Formula¶
\[
\text{text} \rightarrow (t_1,t_2,\dots,t_n)
\]
Parameters¶
- Raw text input
- Tokenizer rules/model
- Output tokens or token IDs
What it means¶
Tokenization splits raw text into discrete units (tokens), which are typically mapped to integer IDs that a model can process.
What it's used for¶
- NLP preprocessing and indexing.
- Preparing inputs for language models (see the sketch below).
Key properties¶
- Token granularity can be word, subword, character, or byte-level.
- Tokenization choice affects vocabulary size and sequence length (illustrated in the sketch below).
Common gotchas¶
- Different tokenizers produce different token counts and IDs for the same text.
- Whitespace, punctuation, and Unicode normalization choices all affect the output (both points are demonstrated below).
Example¶
"unhappiness" might be one word token, multiple subword tokens, or many character tokens.
How to Compute (Pseudocode)¶
Input: raw text string and tokenizer rules/model
Output: token sequence (and optionally token IDs)
1. Normalize the text according to tokenizer settings (if applicable).
2. Apply the tokenizer rules/model to split the text into tokens/subwords/bytes.
3. Map tokens to IDs using the tokenizer vocabulary.
4. Return the tokens/IDs.
Complexity¶
- Time: Typically linear in input text length for common tokenizers; the exact cost depends on tokenizer type and implementation
- Space: Linear in the number of produced tokens plus tokenizer vocabulary/model storage
- Assumptions: Exact complexity depends on tokenization scheme (word, subword, byte-level) and preprocessing rules