Inverse Document Frequency (IDF)¶
Formula¶
\[
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
\]
Parameters¶
- \(N\): total number of documents
- \(\mathrm{df}(t)\): number of documents containing term \(t\)
What it means¶
Downweights terms that appear in many documents, so ubiquitous words (e.g. stopwords) contribute less than rare, discriminative ones.
What it's used for¶
- Weighting terms by rarity.
- Combining with TF for TF-IDF.
Key properties¶
- Larger for rare terms.
- Zero if \(\mathrm{df}(t)=N\).
Common gotchas¶
- If \(\mathrm{df}(t)=0\) (a term absent from the corpus, e.g. an unseen query term), the raw formula divides by zero. A common smoothed variant is \(\log\frac{N+1}{\mathrm{df}(t)+1}+1\), which is defined for all \(\mathrm{df}(t)\) and never negative.
- Log base changes scale but not ranking.
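The gotchas above can be illustrated directly. This is a minimal sketch; the smoothed variant shown matches the \(\log\frac{N+1}{\mathrm{df}(t)+1}+1\) formula above, and the function names are illustrative, not from any library:

```python
import math

def idf_raw(N, df):
    # raw IDF: log(N / df); undefined (ZeroDivisionError) when df == 0
    return math.log(N / df)

def idf_smooth(N, df):
    # smoothed variant: log((N + 1) / (df + 1)) + 1
    # defined for df == 0 and never negative
    return math.log((N + 1) / (df + 1)) + 1

print(idf_raw(1000, 1000))     # 0.0 -- term appears in every document
print(idf_smooth(1000, 1000))  # 1.0 -- floor of the smoothed variant
print(idf_smooth(1000, 0))     # defined even for an unseen term
```

Note that the smoothed variant shifts every value, so raw and smoothed IDF are not interchangeable within one system; pick one and use it consistently.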
Example¶
If \(N=1000\) and \(\mathrm{df}(t)=10\), then \(\mathrm{idf}(t)=\log\frac{1000}{10}=\log(100)\), which is \(\approx 4.61\) with the natural log or exactly \(2\) with base 10.
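The example can be checked in a couple of lines, also showing how the log base changes the scale:

```python
import math

# N = 1000 documents, term appears in df = 10 of them
print(math.log(1000 / 10))    # natural log: ~4.605
print(math.log10(1000 / 10))  # base 10: 2.0
```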
How to Compute (Pseudocode)¶
Input:  corpus of N documents
Output: IDF value for each term

initialize document-frequency counts df(term) <- 0
for each document:
    collect the unique terms in the document
    increment df(term) once per term present
for each term:
    idf(term) <- log(N / df(term))   # or a smoothed variant
return IDF values
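The pseudocode above translates directly to Python. This is a minimal sketch assuming a pre-tokenized corpus (a list of token lists) and the raw, unsmoothed formula; `compute_idf` is an illustrative name, not a library function:

```python
import math
from collections import Counter

def compute_idf(corpus):
    """Compute raw IDF for each term in a tokenized corpus.

    corpus: list of documents, each a list of token strings.
    """
    N = len(corpus)
    df = Counter()
    for doc in corpus:
        # count each term at most once per document
        for term in set(doc):
            df[term] += 1
    return {term: math.log(N / count) for term, count in df.items()}

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
idf = compute_idf(corpus)
print(idf["the"])  # 0.0 -- appears in all 3 documents
print(idf["dog"])  # log(3) ~ 1.099 -- appears in only 1
```

Using `set(doc)` before incrementing is what makes this document frequency rather than term frequency: a term repeated many times within one document still counts once.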
Complexity¶
- Time: Linear in the total number of tokens (or unique-term occurrences) across the corpus, plus per-term IDF computation
- Space: Depends on vocabulary size \(V\) (typically \(O(V)\) for document-frequency tables)
- Assumptions: Tokenized corpus; smoothing and log-base choices affect values but not the overall asymptotic pattern