Train/Validation/Test Split

Formula

\[ \mathcal{D}=\mathcal{D}_{train}\ \cup\ \mathcal{D}_{val}\ \cup\ \mathcal{D}_{test} \]
\[ \mathcal{D}_{train}\cap\mathcal{D}_{val}=\mathcal{D}_{train}\cap\mathcal{D}_{test}=\mathcal{D}_{val}\cap\mathcal{D}_{test}=\varnothing \]

Parameters

  • \(\mathcal{D}\): full dataset
  • \(\mathcal{D}_{train},\mathcal{D}_{val},\mathcal{D}_{test}\): disjoint splits

What it means

Separates the data into one subset for fitting model parameters, one for model selection and hyperparameter tuning, and one for a final, unbiased estimate of generalization performance.

What it's used for

  • Training on one subset and tuning on another.
  • Holding out a final test set for one-time reporting.

Key properties

  • The test set should be touched only after all model decisions (features, hyperparameters, architecture) are finalized.
  • Stratified splits are common for classification.
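A stratified split can be sketched in plain Python by shuffling indices within each class and slicing each class proportionally; the function name and default fractions below are illustrative, not a standard API:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Split example indices so each class keeps roughly the same
    proportion in train, validation, and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                 # shuffle within each class
        n_test = round(len(idxs) * test_frac)
        n_val = round(len(idxs) * val_frac)
        test.extend(idxs[:n_test])
        val.extend(idxs[n_test:n_test + n_val])
        train.extend(idxs[n_test + n_val:])
    return train, val, test
```

Because each class is sliced into disjoint index ranges, the three splits cannot overlap, and per-class proportions match the requested fractions up to rounding.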

Common gotchas

  • Random splits can leak future information in time series.
  • Repeated peeking at the test set turns it into a validation set.
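For time series, the leakage gotcha above is usually avoided by splitting chronologically rather than randomly: train on the earliest examples, validate and test on strictly later ones. A minimal sketch (function name and fractions are illustrative):

```python
def time_split(timestamps, val_frac=0.15, test_frac=0.15):
    """Chronological split: fit on the past, tune and evaluate on the future."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = order[: n - n_val - n_test]   # earliest examples
    val = order[n - n_val - n_test : n - n_test]
    test = order[n - n_test :]            # most recent examples
    return train, val, test
```

Every training timestamp precedes every validation timestamp, which in turn precedes every test timestamp, so no future information leaks into model fitting.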

Example

Common choices are 70/15/15 or 80/10/10. Very large datasets can afford smaller validation/test fractions in absolute terms, while small or imbalanced datasets may call for stratification or cross-validation instead.

How to Compute (Pseudocode)

Input: dataset D, split ratios, random seed (or time-based rule)
Output: D_train, D_val, D_test

choose a split strategy (random, stratified, grouped, or time-based)
partition D into disjoint subsets according to the strategy and ratios
verify no overlap between train/val/test
reserve D_test for final evaluation only
return D_train, D_val, D_test
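The pseudocode above, specialized to the random strategy, can be written as a short Python function; the name, ratio tuple, and seed default are illustrative assumptions:

```python
import random

def train_val_test_split(data, ratios=(0.7, 0.15, 0.15), seed=0):
    """Randomly partition `data` into disjoint train/val/test subsets.
    Assumes i.i.d. examples; use a stratified, grouped, or time-based
    strategy when that assumption does not hold."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)      # reproducible shuffle
    n_train = int(len(idx) * ratios[0])
    n_val = int(len(idx) * ratios[1])
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

The "verify no overlap" step holds by construction here: each example index lands in exactly one of the three disjoint slices.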

Complexity

  • Time: Typically \(O(n)\) to assign \(n\) examples to splits (plus sorting/grouping costs for time- or group-aware strategies)
  • Space: \(O(n)\) to store split assignments or index lists
  • Assumptions: \(n\) is dataset size; exact cost depends on stratification/grouping constraints and implementation

See also