Inverse Document Frequency (IDF)¶
Formula¶
\[
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
\]
Parameters¶
- \(N\): total number of documents
- \(\mathrm{df}(t)\): number of documents containing term \(t\)
What it means¶
Downweights terms that appear in many documents, so ubiquitous words (e.g. stopwords) contribute less than rare, discriminative ones.
What it's used for¶
- Weighting terms by rarity.
- Combining with TF for TF-IDF.
Key properties¶
- Larger for rare terms.
- Zero if \(\mathrm{df}(t)=N\).
Common gotchas¶
- If \(\mathrm{df}(t)=0\) (a term absent from the corpus, e.g. an unseen query term), the raw formula divides by zero. A common smoothed variant is \(\log\frac{N+1}{\mathrm{df}(t)+1}+1\), which is defined for all \(\mathrm{df}(t)\) and never negative.
- Log base changes scale but not ranking.
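The gotchas above can be illustrated directly. This is a minimal sketch; the smoothed variant shown matches the \(\log\frac{N+1}{\mathrm{df}(t)+1}+1\) formula above, and the function names are illustrative, not from any library:

```python
import math

def idf_raw(N, df):
    # raw IDF: log(N / df); undefined (ZeroDivisionError) when df == 0
    return math.log(N / df)

def idf_smooth(N, df):
    # smoothed variant: log((N + 1) / (df + 1)) + 1
    # defined for df == 0 and never negative
    return math.log((N + 1) / (df + 1)) + 1

print(idf_raw(1000, 1000))     # 0.0 -- term appears in every document
print(idf_smooth(1000, 1000))  # 1.0 -- floor of the smoothed variant
print(idf_smooth(1000, 0))     # defined even for an unseen term
```

Note that the smoothed variant shifts every value, so raw and smoothed IDF are not interchangeable within one system; pick one and use it consistently.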
Example¶
If \(N=1000\) and \(\mathrm{df}(t)=10\), then \(\mathrm{idf}(t)=\log\frac{1000}{10}=\log(100)\), which is \(\approx 4.61\) with the natural log or exactly \(2\) with base 10.
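The example can be checked in a couple of lines, also showing how the log base changes the scale:

```python
import math

# N = 1000 documents, term appears in df = 10 of them
print(math.log(1000 / 10))    # natural log: ~4.605
print(math.log10(1000 / 10))  # base 10: 2.0
```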
How to Compute (Pseudocode)¶
Input:  corpus of N documents
Output: IDF value for each term

initialize document-frequency counts df(term) <- 0
for each document:
    collect the unique terms in the document
    increment df(term) once per term present
for each term:
    idf(term) <- log(N / df(term))   # or a smoothed variant
return IDF values
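The pseudocode above translates directly to Python. This is a minimal sketch assuming a pre-tokenized corpus (a list of token lists) and the raw, unsmoothed formula; `compute_idf` is an illustrative name, not a library function:

```python
import math
from collections import Counter

def compute_idf(corpus):
    """Compute raw IDF for each term in a tokenized corpus.

    corpus: list of documents, each a list of token strings.
    """
    N = len(corpus)
    df = Counter()
    for doc in corpus:
        # count each term at most once per document
        for term in set(doc):
            df[term] += 1
    return {term: math.log(N / count) for term, count in df.items()}

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
idf = compute_idf(corpus)
print(idf["the"])  # 0.0 -- appears in all 3 documents
print(idf["dog"])  # log(3) ~ 1.099 -- appears in only 1
```

Using `set(doc)` before incrementing is what makes this document frequency rather than term frequency: a term repeated many times within one document still counts once.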
Complexity¶
- Time: Linear in the total number of tokens (or unique-term occurrences) across the corpus, plus per-term IDF computation
- Space: Depends on vocabulary size \(V\) (typically \(O(V)\) for document-frequency tables)
- Assumptions: Tokenized corpus; smoothing and log-base choices affect values but not the overall asymptotic pattern