
Cross-Validation

Formula

\[ \mathrm{CVScore}=\frac{1}{K}\sum_{k=1}^{K} M\big(f^{(-k)}, \mathcal{D}^{(k)}\big) \]

Parameters

  • \(K\): number of folds
  • \(f^{(-k)}\): model trained on all folds except \(k\)
  • \(\mathcal{D}^{(k)}\): the \(k\)-th fold, held out for validation
  • \(M\): evaluation metric (e.g., accuracy or mean squared error)
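
As a worked illustration of the formula with made-up numbers (the per-fold scores below are hypothetical, not results from any real model), take \(K=5\) and accuracy as \(M\):

\[ \mathrm{CVScore}=\frac{0.80+0.82+0.78+0.81+0.79}{5}=\frac{4.00}{5}=0.80 \]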

What it means

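Estimates out-of-sample performance by training on \(K-1\) folds and validating on the remaining held-out fold, rotating so that each fold serves as the validation set exactly once, then averaging the \(K\) scores.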

What it's used for

  • Hyperparameter tuning.
  • Comparing models when data is limited.

Key properties

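  • Averaging over \(K\) folds gives a lower-variance estimate than a single train/validation split.
  • Stratified K-fold, which keeps class proportions roughly equal across folds, is common for imbalanced classification.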
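
One simple way to stratify is to deal out the indices of each class round-robin across the folds. The sketch below assumes that approach; the function name `stratified_folds` and the toy labels are hypothetical, and it does no shuffling, which a real split usually would.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within each class."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 16 negatives, 4 positives (80/20).
labels = [0] * 16 + [1] * 4
folds = stratified_folds(labels, k=4)

# Count the minority-class samples that landed in each fold.
minority_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

With these toy labels every fold receives five samples, one of them from the minority class, so each fold mirrors the overall 80/20 ratio.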

Common gotchas

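  • Preprocessing (e.g., scaling, imputation, feature selection) must be fit on each fold's training portion only; fitting it on the full dataset leaks validation information into training.
  • The CV score can still be optimistic if tuning choices are repeatedly adapted to the same folds, so keep a separate test set for the final estimate.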
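
The leakage point can be illustrated with a standardization step. This is a minimal sketch with hypothetical helper names (`fit_scaler`, `transform`) and toy values: the correct scaler sees only the training portion of a fold, while the leaky one also sees the validation point and ends up with different statistics.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if all values are equal
    return mu, sigma


def transform(values, scaler):
    """Standardize values using previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy fold: four training values and one validation value.
train_values = [1.0, 2.0, 3.0, 4.0]
val_values = [10.0]

# Correct: statistics come from the training portion only.
scaler = fit_scaler(train_values)
train_scaled = transform(train_values, scaler)
val_scaled = transform(val_values, scaler)

# Leaky (shown only for contrast): the validation value shifts the statistics.
leaky_scaler = fit_scaler(train_values + val_values)
```

Here the training-only mean is 2.5, while the leaky mean is 4.0, so the model would be trained on features that already encode something about the validation point.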

Example

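Use 5-fold CV on the training data to select the regularization strength, then refit the chosen setting on all of that data (train + validation folds) before a single final evaluation on the held-out test set.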
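
A minimal sketch of this workflow, under loud assumptions: the model is a one-feature ridge regression without intercept (chosen only because it has a closed form), the data and candidate strengths are made up, folds are assigned by index modulo \(K\), and the held-out test set is assumed to exist elsewhere and is not shown. All function names are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_mse(xs, ys, lam, k=5):
    """Average validation MSE over k folds for one regularization strength."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        val = [i for i in range(n) if i % k == fold]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_scores.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(fold_scores) / k


# Made-up training data: y is roughly 2 * x plus fixed noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x + e for x, e in zip(xs, noise)]

# Score every candidate strength with 5-fold CV and keep the lowest MSE.
candidates = [0.0, 0.1, 1.0, 10.0]
cv_scores = {lam: cv_mse(xs, ys, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)

# Refit on all of the training data with the selected strength.
final_w = fit_ridge_1d(xs, ys, best_lam)
```

Only `final_w` would then be evaluated on the test set, and only once.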

How to Compute (Pseudocode)

Input: dataset D, model/hyperparameters, metric M, folds K
Output: cross-validation score

split D into K folds (often stratified)
scores <- empty list

for k from 1 to K:
  train_data <- all folds except fold k
  val_data <- fold k
  fit preprocessing + model on train_data only
  predictions <- apply fitted preprocessing + model to val_data
  score_k <- evaluate metric M on predictions vs. val_data targets
  append score_k to scores

return average(scores)
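
The pseudocode above can be sketched in plain Python as follows. This is one possible rendering, not a reference implementation: `k_fold_cv_score`, `fit_mean`, and `mean_abs_error` are hypothetical names, folds are formed by simple striding without shuffling or stratification, and the "model" is a mean predictor chosen to keep the example short.

```python
def k_fold_cv_score(data, k, fit, metric):
    """Average the metric over k folds, refitting from scratch on each split."""
    folds = [data[i::k] for i in range(k)]  # fold i takes every k-th row
    scores = []
    for i in range(k):
        val_data = folds[i]
        train_data = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train_data)  # any preprocessing belongs inside fit()
        scores.append(metric(model, val_data))
    return sum(scores) / len(scores)


def fit_mean(train_data):
    """Toy model: always predict the mean training target."""
    targets = [y for _, y in train_data]
    mu = sum(targets) / len(targets)
    return lambda x: mu


def mean_abs_error(model, val_data):
    """Mean absolute error of the model on (x, y) pairs."""
    return sum(abs(model(x) - y) for x, y in val_data) / len(val_data)


# Toy (x, y) rows with targets 1.0 through 6.0.
data = [(x, float(x)) for x in range(1, 7)]
score = k_fold_cv_score(data, k=3, fit=fit_mean, metric=mean_abs_error)
# score is 1.5 for this toy data (each of the three folds scores 1.5)
```

Passing `fit` and `metric` as callables keeps the loop independent of the model, and putting preprocessing inside `fit` ensures it is refit on each fold's training portion.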

Complexity

  • Time: Approximately \(K\) times the cost of fitting/evaluating the model pipeline on one train/validation split
  • Space: Depends on the model/pipeline and data representation; typically includes one fold split plus model state
  • Assumptions: Preprocessing is refit inside each fold; hyperparameter search multiplies this cost further by the number of candidate settings
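
To make the multiplication concrete (an illustrative count, with \(C\) introduced here as an assumed symbol for the number of candidate settings), a search over \(C=20\) settings with \(K=5\) folds needs

\[ K \times C = 5 \times 20 = 100 \]

pipeline fits, plus one more if the selected setting is refit on all training data afterwards.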
