Categorical Encoding¶
Formula¶
\[
\text{one-hot}(c_i) \in \{0,1\}^{K},\quad \sum_{j=1}^K \text{one-hot}(c_i)_j = 1
\]
Parameters¶
- \(c_i\): category value
- \(K\): number of categories
What it means¶
Converts categorical variables into numeric representations that models can use.
What it's used for¶
- One-hot encoding for nominal categories.
- Ordinal encoding for ordered categories.
- Target/statistical encodings (with leakage controls).
Key properties¶
- Encoding choice affects model behavior and dimensionality.
- High-cardinality categories often need grouping, hashing, or regularized target encoding.
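The hashing option for high cardinality can be sketched as follows; the bucket count is an arbitrary choice, and the stable hash here (MD5 via the standard library) is one of several reasonable picks:

```python
import hashlib

def hash_encode(value, num_buckets=32):
    """Map a category string to a fixed bucket index via a stable hash.

    Collisions are possible by design: num_buckets trades collision rate
    against output dimensionality. Unseen categories need no refit,
    since any string maps into [0, num_buckets).
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

idx = hash_encode("user_12345")  # always the same bucket for this string
```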
Common gotchas¶
- Target encoding must be computed out-of-fold to avoid leakage.
- Ordinal encoding imposes an arbitrary order on unordered categories, which linear and distance-based models will treat as meaningful.
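The out-of-fold requirement for target encoding can be sketched as below; the round-robin fold assignment and the smoothing weight are illustrative choices, not a prescribed scheme:

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=5, prior_weight=10.0):
    """Out-of-fold mean-target encoding, smoothed toward the global mean.

    Each row is encoded with statistics computed WITHOUT its own fold,
    so a row's target value never leaks into its own feature.
    """
    n = len(categories)
    global_mean = sum(targets) / n
    folds = [i % n_folds for i in range(n)]  # simple round-robin assignment
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for i in range(n):
            if folds[i] != fold:  # fit statistics on the other folds only
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        for i in range(n):
            if folds[i] == fold:
                c = categories[i]
                # Smoothed mean: shrink rare/unseen categories toward
                # the global mean; an unseen category gets exactly the prior.
                encoded[i] = (sums[c] + prior_weight * global_mean) / (
                    counts[c] + prior_weight
                )
    return encoded
```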
Example¶
For a color feature {red, green, blue}, one-hot creates three binary columns, exactly one of which is 1 per row.
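The color example can be sketched in a few lines; sorting the categories to fix the column order is an assumption made here for determinism:

```python
def fit_one_hot(train_values):
    """Build a category-to-column mapping from training data."""
    categories = sorted(set(train_values))  # fixed, deterministic order
    return {cat: col for col, cat in enumerate(categories)}

def one_hot(value, mapping):
    """Emit a binary indicator vector; exactly one entry is 1."""
    vec = [0] * len(mapping)
    vec[mapping[value]] = 1
    return vec

mapping = fit_one_hot(["red", "green", "blue", "green"])
# sorted order: blue, green, red -> three binary columns
print(one_hot("green", mapping))  # [0, 1, 0]
```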
How to Compute (Pseudocode)¶
```
Input:  categorical feature values, encoding scheme
Output: encoded numeric representation

fit encoder on training data categories/statistics
if one-hot encoding:
    build category-to-column mapping
    emit a binary indicator vector per example
if ordinal encoding:
    map each category to an integer code
if target/statistical encoding:
    compute training-only category statistics (preferably out-of-fold)
apply the fitted encoder to validation/test data
```
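The fit-then-apply flow above can be sketched end to end for the one-hot case; how to treat categories unseen at fit time is an assumption (here, an all-zeros row):

```python
class SimpleOneHotEncoder:
    """Minimal one-hot encoder following the pseudocode: fit the
    category-to-column mapping on training data, then apply that fixed
    mapping to validation/test data."""

    def fit(self, train_values):
        self.mapping = {c: i for i, c in enumerate(sorted(set(train_values)))}
        return self

    def transform(self, values):
        rows = []
        for v in values:
            row = [0] * len(self.mapping)
            col = self.mapping.get(v)  # unseen category -> all-zeros row
            if col is not None:
                row[col] = 1
            rows.append(row)
        return rows

enc = SimpleOneHotEncoder().fit(["red", "green", "blue"])
# columns in sorted order: blue, green, red
print(enc.transform(["red", "purple"]))  # [[0, 0, 1], [0, 0, 0]]
```

Keeping the fitted mapping fixed is what makes train and test representations line up column for column.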
Complexity¶
- Time: Depends on the encoding scheme; one-hot/ordinal encoders are typically \(O(n)\) after building mappings, while target encodings add aggregation/CV overhead
- Space: Depends on output dimensionality (for one-hot, often \(O(nK)\) dense or sparse equivalent for \(K\) categories)
- Assumptions: \(n\) examples; complexity varies substantially with cardinality, sparse storage, and leakage-control strategy