Categorical Encoding¶
Formula¶
\[
\text{one-hot}(c_i) \in \{0,1\}^{K},\quad \sum_{j=1}^K \text{one-hot}(c_i)_j = 1
\]
Parameters¶
- \(c_i\): category value
- \(K\): number of categories
What it means¶
Converts categorical variables into numeric representations that models can use.
What it's used for¶
- One-hot encoding for nominal categories.
- Ordinal encoding for ordered categories.
- Target/statistical encodings (with leakage controls).
Key properties¶
- Encoding choice affects model behavior and dimensionality.
- High-cardinality categories often need grouping, hashing, or regularized target encoding.
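The hashing option for high cardinality can be sketched as follows; the bucket count is an arbitrary choice, and the stable hash here (MD5 via the standard library) is one of several reasonable picks:

```python
import hashlib

def hash_encode(value, num_buckets=32):
    """Map a category string to a fixed bucket index via a stable hash.

    Collisions are possible by design: num_buckets trades collision rate
    against output dimensionality. Unseen categories need no refit,
    since any string maps into [0, num_buckets).
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

idx = hash_encode("user_12345")  # always the same bucket for this string
```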
Common gotchas¶
- Target encoding must be computed out-of-fold to avoid leakage.
- Ordinal encoding imposes an arbitrary order on unordered categories, which linear and distance-based models will treat as meaningful.
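The out-of-fold requirement for target encoding can be sketched as below; the round-robin fold assignment and the smoothing weight are illustrative choices, not a prescribed scheme:

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=5, prior_weight=10.0):
    """Out-of-fold mean-target encoding, smoothed toward the global mean.

    Each row is encoded with statistics computed WITHOUT its own fold,
    so a row's target value never leaks into its own feature.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    folds = [i % n_folds for i in range(n)]  # simple round-robin assignment
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if folds[i] != fold:  # fit statistics on the other folds only
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(n):
            if folds[i] == fold:
                c = categories[i]
                # Smoothed mean: shrink rare/unseen categories toward
                # the global mean; an unseen category gets exactly the prior.
                encoded[i] = (sums[c] + prior_weight * global_mean) / (
                    counts[c] + prior_weight
                )
    return encoded
```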
Example¶
For a color feature {red, green, blue}, one-hot creates three binary columns, exactly one of which is 1 per row.
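The color example can be sketched in a few lines; sorting the categories to fix the column order is an assumption made here for determinism:

```python
def fit_one_hot(train_values):
    """Build a category-to-column mapping from training data."""
    categories = sorted(set(train_values))  # fixed, deterministic order
    return {cat: col for col, cat in enumerate(categories)}

def one_hot(value, mapping):
    """Emit a binary indicator vector; exactly one entry is 1."""
    vec = [0] * len(mapping)
    vec[mapping[value]] = 1
    return vec

mapping = fit_one_hot(["red", "green", "blue", "green"])
# sorted order: blue, green, red -> three binary columns
print(one_hot("green", mapping))  # [0, 1, 0]
```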
How to Compute (Pseudocode)¶
```
Input:  categorical feature values, encoding scheme
Output: encoded numeric representation

fit encoder on training data categories/statistics
if one-hot encoding:
    build category-to-column mapping
    emit a binary indicator vector per example
if ordinal encoding:
    map each category to an integer code
if target/statistical encoding:
    compute training-only category statistics (preferably out-of-fold)
apply the fitted encoder to validation/test data
```
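The fit-then-apply flow above can be sketched end to end for the one-hot case; how to treat categories unseen at fit time is an assumption (here, an all-zeros row):

```python
class SimpleOneHotEncoder:
    """Minimal one-hot encoder following the pseudocode: fit the
    category-to-column mapping on training data, then apply that fixed
    mapping to validation/test data."""

    def fit(self, train_values):
        self.mapping = {c: i for i, c in enumerate(sorted(set(train_values)))}
        return self

    def transform(self, values):
        rows = []
        for v in values:
            row = [0] * len(self.mapping)
            col = self.mapping.get(v)  # unseen category -> all-zeros row
            if col is not None:
                row[col] = 1
            rows.append(row)
        return rows

enc = SimpleOneHotEncoder().fit(["red", "green", "blue"])
# columns in sorted order: blue, green, red
print(enc.transform(["red", "purple"]))  # [[0, 0, 1], [0, 0, 0]]
```

Keeping the fitted mapping fixed is what makes train and test representations line up column for column.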
Complexity¶
- Time: Depends on the encoding scheme; one-hot/ordinal encoders are typically \(O(n)\) after building mappings, while target encodings add aggregation/CV overhead
- Space: Depends on output dimensionality (for one-hot, often \(O(nK)\) dense or sparse equivalent for \(K\) categories)
- Assumptions: \(n\) examples; complexity varies substantially with cardinality, sparse storage, and leakage-control strategy