Train/Validation/Test Split¶
Formula¶
\[
\mathcal{D}=\mathcal{D}_{train}\ \cup\ \mathcal{D}_{val}\ \cup\ \mathcal{D}_{test}
\]
\[
\mathcal{D}_{train}\cap\mathcal{D}_{val}=\mathcal{D}_{train}\cap\mathcal{D}_{test}=\mathcal{D}_{val}\cap\mathcal{D}_{test}=\varnothing
\]
Parameters¶
- \(\mathcal{D}\): full dataset
- \(\mathcal{D}_{train},\mathcal{D}_{val},\mathcal{D}_{test}\): disjoint splits
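The coverage and disjointness conditions above can be checked directly with Python sets (a minimal sketch; the index sets for a 10-example dataset are hypothetical):

```python
# Hypothetical index sets for a 10-example dataset.
D = set(range(10))
D_train = {0, 1, 2, 3, 4, 5, 6}
D_val = {7, 8}
D_test = {9}

# Union covers the full dataset: D = D_train ∪ D_val ∪ D_test.
assert D_train | D_val | D_test == D

# Pairwise disjoint: no example appears in more than one split.
assert D_train & D_val == set()
assert D_train & D_test == set()
assert D_val & D_test == set()
```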
What it means¶
Separates the data into subsets for fitting model parameters (train), model selection and hyperparameter tuning (validation), and a final, unbiased performance estimate (test).
What it's used for¶
- Training on one subset and tuning on another.
- Holding out a final test set for one-time reporting.
Key properties¶
- Test set should be touched only after model decisions are finalized.
- Stratified splits are common for classification, so each subset preserves the class proportions of the full dataset.
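A stratified split can be sketched in pure Python by shuffling within each class and allocating indices proportionally (a minimal sketch; `stratified_split` and the 70/15/15 ratios are illustrative, not a library API):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Return (train, val, test) index lists, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                 # shuffle within each class
        n = len(idxs)
        n_test = int(n * test_frac)
        n_val = int(n * val_frac)
        test.extend(idxs[:n_test])
        val.extend(idxs[n_test:n_test + n_val])
        train.extend(idxs[n_test + n_val:])
    return train, val, test

labels = [0] * 80 + [1] * 20              # imbalanced binary labels
train, val, test = stratified_split(labels)
```

Because the allocation is done per class, the 80/20 class imbalance is mirrored in each of the three subsets.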
Common gotchas¶
- Random splits can leak future information in time series.
- Repeated peeking at the test set turns it into a validation set.
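For time series, the leakage gotcha above is usually avoided by splitting on time order instead of at random, so the test set always lies in the future relative to training (a minimal sketch; `time_split` and its cutoff fractions are illustrative):

```python
def time_split(n, val_frac=0.15, test_frac=0.15):
    """Chronological split: earliest rows go to train, latest rows to test.

    Assumes examples 0..n-1 are already sorted by timestamp.
    """
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    train = list(range(n_train))
    val = list(range(n_train, n_train + n_val))
    test = list(range(n_train + n_val, n))
    return train, val, test

train, val, test = time_split(100)
# Every validation index comes after every training index,
# and every test index after every validation index.
assert max(train) < min(val) <= max(val) < min(test)
```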
Example¶
Use 70/15/15 or 80/10/10 splits, adjusted for dataset size and class balance.
How to Compute (Pseudocode)¶
Input: dataset D, split ratios, random seed (or time-based rule)
Output: D_train, D_val, D_test
choose a split strategy (random, stratified, grouped, or time-based)
partition D into disjoint subsets according to the strategy and ratios
verify no overlap between train/val/test
reserve D_test for final evaluation only
return D_train, D_val, D_test
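The pseudocode above, instantiated with the random strategy, can be sketched as follows (a minimal sketch; `random_split` is an illustrative name, not a library function):

```python
import random

def random_split(n, val_frac=0.15, test_frac=0.15, seed=42):
    """Randomly partition indices 0..n-1 into disjoint train/val/test lists."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)     # fixed seed for reproducibility
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idxs[:n_test]
    val = idxs[n_test:n_test + n_val]
    train = idxs[n_test + n_val:]
    # Verify no overlap between train/val/test.
    assert len(set(train) | set(val) | set(test)) == n
    return train, val, test

train, val, test = random_split(1000)
# Reserve `test` for a single final evaluation; tune on `val` only.
```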
Complexity¶
- Time: Typically \(O(n)\) to assign \(n\) examples to splits (plus sorting/grouping costs for time- or group-aware strategies)
- Space: \(O(n)\) to store split assignments or index lists
- Assumptions: \(n\) is dataset size; exact cost depends on stratification/grouping constraints and implementation