Tokenization

Formula

\[ \text{text} \rightarrow (t_1,t_2,\dots,t_n) \]

Parameters

  • Raw text input
  • Tokenizer rules/model
  • Output tokens \(t_1, \dots, t_n\) or their token IDs

What it means

Tokenization splits text into units (tokens) that a model can process.

What it's used for

  • NLP preprocessing and indexing.
  • Preparing inputs for language models.

Key properties

  • Token granularity can be word, subword, character, or byte-level.
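  • Granularity trades vocabulary size against sequence length: finer units (characters, bytes) need a smaller vocabulary but yield longer sequences, while word-level tokens do the opposite.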
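
As a rough illustration of how granularity changes sequence length, the sketch below tokenizes one string at word, character, and byte level. The regular expression used for word splitting is an assumption chosen for this example, not a rule any particular tokenizer follows.

```python
import re

text = "na\u00efve users"  # "naïve users", with 'ï' as one code point

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: one token per Unicode code point (spaces included).
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte, so the vocabulary has at most 256 symbols.
byte_tokens = list(text.encode("utf-8"))

print(len(word_tokens), len(char_tokens), len(byte_tokens))  # 2 11 12
```

The byte sequence is one element longer than the character sequence because 'ï' occupies two bytes in UTF-8.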

Common gotchas

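  • Different tokenizers produce different token counts and IDs for the same text, so counts and IDs from one tokenizer are not valid for a model trained with another.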
  • Whitespace/punctuation/Unicode normalization choices matter.
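
The normalization and whitespace gotchas can be seen with the Python standard library alone. This is a small sketch; the strings are made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The two strings render identically but differ at the code-point level,
# so a tokenizer without normalization sees different inputs.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# Normalizing both to the same form (here NFC) removes the difference.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True

# Whitespace handling also changes the result: splitting on a single space
# keeps an empty token for the doubled space, while split() collapses it.
strict_split = "a  b".split(" ")
loose_split = "a  b".split()
print(strict_split, loose_split)       # ['a', '', 'b'] ['a', 'b']
```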

Example

"unhappiness" might be one word token, multiple subword tokens, or many character tokens.
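
The three cases can be sketched as follows. The subword split uses a simplified greedy longest-match strategy, loosely in the style of WordPiece but without continuation markers. The function name `greedy_subword_split` and the three-entry vocabulary are hypothetical; real subword tokenizers learn their vocabularies from data and may split this word differently.

```python
def greedy_subword_split(word, vocab):
    """Split `word` by repeatedly taking the longest vocabulary entry
    that matches at the current position (hypothetical helper)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


toy_vocab = {"un", "happi", "ness"}  # assumed vocabulary, for illustration only
word = "unhappiness"

word_level = [word]
subword_level = greedy_subword_split(word, toy_vocab)
char_level = list(word)

print(word_level)       # ['unhappiness']
print(subword_level)    # ['un', 'happi', 'ness']
print(len(char_level))  # 11
```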

How to Compute (Pseudocode)

Input: raw text string and tokenizer rules/model
Output: token sequence (and optionally token IDs)

normalize text according to tokenizer settings (if applicable)
apply tokenizer rules/model to split text into tokens/subwords/bytes
map tokens to IDs using the tokenizer vocabulary (if IDs are needed)
return tokens/IDs
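
The steps above can be sketched in Python as a minimal word-level tokenizer. The normalization choice (NFKC plus lowercasing), the splitting regex, the `<unk>` fallback, and the toy vocabulary are all assumptions made for this example, not the settings of any specific tokenizer.

```python
import re
import unicodedata


def tokenize(text, vocab, unk_token="<unk>"):
    """Return (tokens, ids) for `text` using a word-level toy scheme."""
    # 1. Normalize the text (assumed settings: NFKC and lowercasing).
    normalized = unicodedata.normalize("NFKC", text).lower()
    # 2. Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", normalized)
    # 3. Map each token to its ID; unseen tokens get the unknown-token ID.
    unk_id = vocab[unk_token]
    ids = [vocab.get(tok, unk_id) for tok in tokens]
    return tokens, ids


toy_vocab = {"<unk>": 0, "hello": 1, ",": 2, "world": 3, "!": 4}

tokens, ids = tokenize("Hello, World!", toy_vocab)
print(tokens)  # ['hello', ',', 'world', '!']
print(ids)     # [1, 2, 3, 4]
```

A word that is missing from the vocabulary, such as "there" in "Hello there", maps to the `<unk>` ID 0. Subword and byte-level schemes exist largely to avoid that fallback.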

Complexity

  • Time: Depends on tokenizer type and implementation; typically linear in input text length for common tokenizers
  • Space: Linear in the number of produced tokens plus tokenizer vocabulary/model storage
  • Assumptions: Exact complexity depends on tokenization scheme (word, subword, byte-level) and preprocessing rules

See also