Inverse Document Frequency (IDF)

Formula

\[ \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)} \]

Parameters

  • \(N\): total number of documents
  • \(\mathrm{df}(t)\): number of documents containing term \(t\)

What it means

Downweights terms that appear in many documents.

What it's used for

  • Weighting terms by rarity.
  • Combining with TF for TF-IDF.
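Given precomputed IDF values, combining with TF is an elementwise product. A minimal sketch, assuming raw term counts as the TF component (this page does not prescribe a TF variant); the example weights are hypothetical:

```python
from collections import Counter

def tf_idf(doc, idf):
    """doc: list of tokens; idf: dict mapping term -> IDF weight."""
    tf = Counter(doc)  # raw counts as the TF component (an assumption)
    # terms missing from the IDF table get weight 0.0
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

weights = tf_idf(["cat", "cat", "sat"], {"cat": 0.4, "sat": 1.1})
# cat: 2 * 0.4 = 0.8; sat: 1 * 1.1 = 1.1
```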

Key properties

  • Larger for rare terms.
  • Zero if \(\mathrm{df}(t)=N\).

Common gotchas

  • If \(\mathrm{df}(t)=0\) (e.g. a query term absent from the corpus), the plain formula divides by zero; a common smoothed variant is \(\log\frac{N+1}{\mathrm{df}(t)+1}+1\).
  • Log base changes scale but not ranking.
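A minimal sketch of the smoothed variant from the first bullet, assuming the natural log:

```python
import math

def idf_smooth(N, df):
    # log((N+1)/(df+1)) + 1 stays defined even when df == 0
    return math.log((N + 1) / (df + 1)) + 1

idf_smooth(1000, 0)     # unseen term: finite, no division by zero
idf_smooth(1000, 1000)  # term in every document -> exactly 1.0
```

Note that under this variant a term appearing in every document gets weight 1.0 rather than 0, so it is never fully discarded.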

Example

If \(N=1000\) and \(\mathrm{df}(t)=10\), then \(\mathrm{idf}=\log(100)\).
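Checking the arithmetic, and illustrating the log-base gotcha above:

```python
import math

idf_natural = math.log(1000 / 10)   # natural log of 100, about 4.605
idf_base10 = math.log10(1000 / 10)  # base 10: 2.0

# different scale, same ordering of terms by rarity
```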

How to Compute (Pseudocode)

Input: corpus of N documents
Output: IDF values for terms

initialize document-frequency counts df(term) <- 0
for each document:
  collect unique terms in the document
  increment df(term) once per term present
for each term:
  idf(term) <- log(N / df(term))   # or a smoothed variant
return IDF values
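The pseudocode above can be sketched in Python, assuming pre-tokenized documents, the natural log, and no smoothing (matching the plain formula):

```python
import math
from collections import Counter

def compute_idf(docs):
    """docs: list of tokenized documents (lists of terms)."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        # set() ensures each term is counted once per document
        df.update(set(doc))
    return {term: math.log(N / count) for term, count in df.items()}

docs = [["the", "cat"], ["the", "dog"], ["a", "cat"]]
idf = compute_idf(docs)  # "the", "cat" -> log(3/2); "a", "dog" -> log(3)
```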

Complexity

  • Time: Linear in the total number of tokens (or unique-term occurrences) across the corpus, plus per-term IDF computation
  • Space: Depends on vocabulary size \(V\) (typically \(O(V)\) for document-frequency tables)
  • Assumptions: Tokenized corpus; smoothing and log-base choices affect values but not the overall asymptotic pattern