TF-IDF

Formula

\[ \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \mathrm{idf}(t) \]

Parameters

  • \(t\): term
  • \(d\): document
  • \(\mathrm{tf}(t,d)\): term frequency of \(t\) within \(d\)
  • \(\mathrm{idf}(t)\): inverse document frequency of \(t\) across the corpus
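
The base formula leaves \(\mathrm{tf}\) and \(\mathrm{idf}\) underspecified. One common convention (relative term frequency with log-scaled idf) is:

\[ \mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)} \]

where \(f_{t,d}\) is the raw count of \(t\) in \(d\), \(N\) is the number of documents, and \(\mathrm{df}(t)\) is the number of documents containing \(t\). Libraries differ in the details; scikit-learn, for example, defaults to a smoothed idf, \(\ln\frac{1+N}{1+\mathrm{df}(t)} + 1\).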

What it means

Weights terms by frequency in a document and rarity across documents.

What it's used for

  • Text feature extraction.
  • Document ranking and similarity (see the sketch below).
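
A minimal ranking sketch using scikit-learn (TfidfVectorizer and cosine_similarity are real scikit-learn APIs; the toy corpus and query are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query, hypothetical and for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quarterly revenue grew steadily",
]
vectorizer = TfidfVectorizer()                 # default tokenization + L2 normalization
doc_matrix = vectorizer.fit_transform(docs)    # sparse (n_docs x vocab_size) matrix

query_vec = vectorizer.transform(["cat on a mat"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
print(scores.argsort()[::-1])                  # document indices, best match first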

Key properties

  • Emphasizes terms that are frequent in a document but rare overall.
  • Sparse, high-dimensional representation.

Common gotchas

  • Needs consistent tokenization and normalization (see the sketch after this list).
  • For short documents, TF-IDF can be noisy.
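
One way to enforce the first gotcha, sketched with scikit-learn (the corpora here are hypothetical): fit the vectorizer once, then reuse the fitted object unchanged so every document passes through the same tokenization, vocabulary, and idf weights.

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["The cat sat.", "The dog ran."]  # hypothetical reference corpus
new_docs = ["A cat ran."]

vectorizer = TfidfVectorizer(lowercase=True)   # same preprocessing everywhere
X_train = vectorizer.fit_transform(train_docs) # learns vocabulary and idf once
X_new = vectorizer.transform(new_docs)         # reuses the fitted state; calling
                                               # fit_transform here would silently
                                               # change the feature space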

Example

If a term appears in 1 of 100 documents, \(\mathrm{idf}=\log(100/1)\approx 4.605\) (natural log); with \(\mathrm{tf}=0.03\) in a given document, \(\mathrm{tfidf}=0.03\times 4.605\approx 0.138\).

How to Compute (Pseudocode)

Input: tokenized corpus/documents
Output: TF-IDF features

compute TF values per document
compute corpus-level IDF values
for each document-term pair used in the representation:
  tfidf(term, doc) <- tf(term, doc) * idf(term)
return TF-IDF vectors/matrix
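
A minimal runnable Python sketch of the steps above, assuming relative-frequency tf and \(\log(N/\mathrm{df})\) idf (library implementations typically add smoothing and normalization on top of this):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists (the tokenized corpus from the pseudocode).
    # Assumes non-empty documents.
    n_docs = len(docs)
    # Corpus-level document frequencies: how many documents contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Corpus-level idf values.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # Per-document tf * idf for each term occurring in the document.
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({term: (c / total) * idf[term] for term, c in counts.items()})
    return vectors

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
for vec in tfidf_vectors(docs):
    print(vec)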

Complexity

  • Time: Dominated by TF and IDF computation plus sparse feature assembly (typically linear in corpus token count for sparse pipelines)
  • Space: Depends on vocabulary size and sparsity of the document-term matrix (sparse storage is standard)
  • Assumptions: Tokenized corpus and sparse vectorization workflow; exact costs depend on preprocessing and vocabulary filtering