Term Frequency (TF)

Formula

\[ \mathrm{tf}(t,d) = \frac{\mathrm{count}(t,d)}{\sum_{t'} \mathrm{count}(t',d)} \]

Parameters

  • \(t\): term
  • \(d\): document
  • \(\mathrm{count}(t,d)\): occurrences of \(t\) in \(d\)

What it means

The relative frequency of a term within a document: the fraction of the document's tokens that are \(t\).

What it's used for

  • Bag-of-words feature construction.
  • Input to TF-IDF weighting.

Key properties

  • Normalized by document length.
  • Values in \([0,1]\).
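
Both properties follow from the formula. Counts are non-negative and no single count can exceed the total, so each value lies in \([0,1]\). Assuming a non-empty document, summing over all terms gives exactly 1:

\[ \sum_{t} \mathrm{tf}(t,d) = \frac{\sum_{t} \mathrm{count}(t,d)}{\sum_{t'} \mathrm{count}(t',d)} = 1 \]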

Common gotchas

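  • Raw (unnormalized) counts overweight long documents; divide by document length as in the formula.
  • Stopwords (e.g., "the", "of") tend to have the highest TF unless they are filtered out or down-weighted, for example by IDF.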
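
A minimal Python sketch of both gotchas. The helper `tf`, the toy documents, and the tiny `STOPWORDS` set are hypothetical and chosen only for illustration; a real stopword list would be much larger.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "of", "and", "on"}


def tf(tokens):
    """Normalized term frequency: count / total tokens in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


short_doc = "the cat sat on the mat".split()
long_doc = short_doc * 10  # same content, ten times longer

# Gotcha 1: raw counts grow with document length, normalized TF does not.
raw_short = Counter(short_doc)["cat"]
raw_long = Counter(long_doc)["cat"]
tf_short = tf(short_doc)["cat"]
tf_long = tf(long_doc)["cat"]
print(raw_short, raw_long)  # raw count is 10x larger in the long document
print(tf_short, tf_long)    # normalized values agree

# Gotcha 2: without filtering, a stopword has the highest TF.
scores = tf(short_doc)
top_unfiltered = max(scores, key=scores.get)
tf_filtered = tf([t for t in short_doc if t not in STOPWORDS])
print(top_unfiltered)       # the most frequent term before filtering
print(sorted(tf_filtered))  # content terms that remain after filtering
```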

Example

If \(t\) appears 3 times in a 100-token document, \(\mathrm{tf}(t,d) = 3/100 = 0.03\).

How to Compute (Pseudocode)

Input: tokenized document d and target term t (or all terms)
Output: TF values

count occurrences of each term in d
document_length <- total number of tokens in d (sum of all term counts)
if document_length = 0: return empty result
for each term:
  tf(term, d) <- count(term, d) / document_length
return TF values
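
The steps above can be sketched in Python as follows. This is one possible implementation, not a canonical one: the function name `term_frequencies` is hypothetical, the input is assumed to be already tokenized, and an empty document is assumed to return an empty mapping rather than raise an error.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: count / total tokens} for one tokenized document."""
    counts = Counter(tokens)      # one pass over the document
    total = sum(counts.values())  # document length in tokens
    if total == 0:                # assumption: empty document yields no TF values
        return {}
    return {term: n / total for term, n in counts.items()}


doc = "the cat sat on the mat".split()
tf_values = term_frequencies(doc)
print(tf_values["the"])  # "the" is 2 of the 6 tokens
```

`collections.Counter` is a dictionary subclass, so only terms that occur in the document are stored, which matches the sparse-map assumption.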

Complexity

  • Time: \(O(L)\) expected for a document of length \(L\): one pass to count terms, then one pass over the unique terms to normalize
  • Space: \(O(U)\) for \(U\) unique terms in the document, where \(U \le L\)
  • Assumptions: Tokenized input document; sparse maps/dictionaries used for term counts