Cross-Validation¶
Formula¶
\[
\mathrm{CVScore}=\frac{1}{K}\sum_{k=1}^{K} M\big(f^{(-k)}, \mathcal{D}^{(k)}\big)
\]
Parameters¶
- \(K\): number of folds
- \(f^{(-k)}\): model trained on all folds except \(k\)
- \(\mathcal{D}^{(k)}\): validation fold
- \(M\): metric
What it means¶
Estimates out-of-sample performance by repeatedly training on subsets and validating on held-out folds.
What it's used for¶
- Hyperparameter tuning.
- Comparing models when data is limited.
Key properties¶
- Reduces variance of a single split estimate.
- Stratified K-fold is common for class imbalance.
Common gotchas¶
- Preprocessing must be fit inside each fold to avoid leakage.
- CV score can still be optimistic if tuning choices are repeatedly adapted.
Example¶
Use 5-fold CV to select regularization strength, then refit on train+val before final test evaluation.
How to Compute (Pseudocode)¶
Input: dataset D, model/hyperparameters, metric M, folds K
Output: cross-validation score
split D into K folds (often stratified)
scores <- empty list
for k from 1 to K:
train_data <- all folds except fold k
val_data <- fold k
fit preprocessing + model on train_data
score_k <- evaluate metric M on val_data
append score_k to scores
return average(scores)
Complexity¶
- Time: Approximately \(K\) times the cost of fitting/evaluating the model pipeline on one train/validation split
- Space: Depends on the model/pipeline and data representation; typically includes one fold split plus model state
- Assumptions: Preprocessing is refit inside each fold; hyperparameter search multiplies this cost further by the number of candidate settings