Machine Learning Pipeline

Formula

\[ \hat y = (f\circ T_n\circ\cdots\circ T_1)(x) \]

Parameters

  • \(T_i\): preprocessing/transformation steps, applied in order \(T_1, \dots, T_n\)
  • \(f\): estimator, applied last to the transformed input

What it means

A pipeline packages preprocessing and modeling into one reproducible object with consistent train/serve behavior.

What it's used for

  • Preventing data leakage during cross-validation (CV).
  • Reproducible training and deployment preprocessing.
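A small sketch of what leakage looks like in practice, using only the Python standard library. The data and the `kfold_indices` helper are hypothetical and exist only for illustration. The point is that the scaling statistics are recomputed from each training fold, instead of once from all rows.

```python
from statistics import mean, pstdev

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (illustrative helper)."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = (i + 1) * fold if i < k - 1 else n
        val = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, val

# Hypothetical one-feature dataset.
x = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]

# Leaky: the mean is computed from every row, including rows
# that will later be used for validation.
leaky_mean = mean(x)

# Leak-free: statistics come from the training fold only,
# then are applied unchanged to the validation fold.
fold_means = []
for train_idx, val_idx in kfold_indices(len(x), k=3):
    train = [x[j] for j in train_idx]
    mu, sigma = mean(train), pstdev(train)
    val_scaled = [(x[j] - mu) / sigma for j in val_idx]
    fold_means.append(mu)

print(fold_means)  # [52.0, 51.0, 2.5]
```

Each fold sees a different mean, and none of them equals the mean over all rows. Wrapping the scaler in a pipeline gives this per-fold refitting without extra bookkeeping.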

Key properties

  • Transforms are fit on training data and applied consistently to validation/test/production data.
  • Makes hyperparameter tuning across preprocessing + model easier.

Common gotchas

  • Ad hoc notebook preprocessing often diverges from production scoring.
  • Pipelines still need schema/version checks in production.
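One possible shape for a production schema check, sketched in plain Python. The column names, the `EXPECTED_SCHEMA` mapping, and the `validate_row` function are all hypothetical. A real deployment would more likely use a dedicated validation library and would also record the schema version next to the fitted pipeline.

```python
# Hypothetical schema that the pipeline was trained against.
EXPECTED_SCHEMA = {"age": float, "income": float, "city": str}

def validate_row(row, schema=EXPECTED_SCHEMA):
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    missing = set(schema) - set(row)
    extra = set(row) - set(schema)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    for name, expected_type in schema.items():
        value = row.get(name)
        # None is treated as a missing value and left for the imputer to handle.
        if value is not None and not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems

ok = validate_row({"age": 31.0, "income": 52.5, "city": "A"})
bad = validate_row({"age": "31", "city": "A"})
```

Here `ok` is empty, while `bad` reports the missing `income` column and the string-typed `age`. Running a check like this before scoring catches such inputs before the stored transforms are applied to them.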

Example

Combine imputation, one-hot encoding, scaling, and logistic regression in a single CV-tunable pipeline.
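A minimal sketch of this example, assuming scikit-learn and pandas are available. The toy data, the column names, and the parameter grid are made up for illustration. Imputation is applied to the numeric columns only, which is one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
income = rng.normal(50, 15, n)
city = rng.choice(["A", "B", "C"], n)
y = (age + 0.5 * income + 5 * (city == "A") + rng.normal(0, 5, n) > 67).astype(int)
age[rng.random(n) < 0.1] = np.nan  # knock out some values so imputation has work to do
X = pd.DataFrame({"age": age, "income": income, "city": city})

# Numeric branch: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route columns to the right preprocessing branch.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tune a preprocessing choice and a model hyperparameter together.
param_grid = {
    "preprocess__num__impute__strategy": ["mean", "median"],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

preds = search.predict(X.head(5))
```

Because the whole pipeline is the estimator passed to `GridSearchCV`, every fold refits the imputer, encoder, and scaler on its own training split. The double-underscore names in `param_grid` address parameters of nested steps.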

How to Compute (Pseudocode)

Input: raw data X, targets y, ordered transforms T1..Tn, estimator f
Output: fitted pipeline and predictions

# Training
Z <- X
for each transform T_i in order:
  fit T_i on Z (and y if needed)
  Z <- transform(T_i, Z)
fit estimator f on Z and y

# Inference
for new input x_new:
  z_new <- apply stored transforms T1..Tn in the same order (transform only, no refitting)
  y_hat <- predict with f on z_new

return fitted pipeline (T1..Tn, f) and predictions y_hat
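The pseudocode above can be sketched in plain Python as follows. The class names (`MeanImputer`, `Standardizer`, `NearestCentroid`, `SimplePipeline`) and the tiny dataset are hypothetical stand-ins, not any library's API. The estimator is a deliberately simple nearest-centroid classifier so the example needs only the standard library.

```python
from statistics import mean, pstdev

class MeanImputer:
    """Transform: replace None with the column mean learned during fit."""
    def fit(self, X, y=None):
        n_cols = len(X[0])
        self.means_ = [mean(r[j] for r in X if r[j] is not None) for j in range(n_cols)]
        return self
    def transform(self, X):
        return [[self.means_[j] if v is None else v for j, v in enumerate(r)] for r in X]

class Standardizer:
    """Transform: subtract the training mean, divide by the training std."""
    def fit(self, X, y=None):
        cols = list(zip(*X))
        self.means_ = [mean(c) for c in cols]
        self.stds_ = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
        return self
    def transform(self, X):
        return [[(v - m) / s for v, m, s in zip(r, self.means_, self.stds_)] for r in X]

class NearestCentroid:
    """Estimator: predict the label whose class centroid is closest."""
    def fit(self, Z, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            rows = [z for z, t in zip(Z, y) if t == label]
            self.centroids_[label] = [mean(c) for c in zip(*rows)]
        return self
    def predict(self, Z):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda k: dist2(z, self.centroids_[k])) for z in Z]

class SimplePipeline:
    def __init__(self, transforms, estimator):
        self.transforms = transforms
        self.estimator = estimator
    def fit(self, X, y):
        Z = X
        for t in self.transforms:          # fit each transform on the output of the previous one
            Z = t.fit(Z, y).transform(Z)
        self.estimator.fit(Z, y)
        return self
    def predict(self, X_new):
        Z = X_new
        for t in self.transforms:          # transform only: stored state, same order, no refitting
            Z = t.transform(Z)
        return self.estimator.predict(Z)

X_train = [[1.0, 10.0], [2.0, None], [8.0, 50.0], [9.0, 60.0]]
y_train = [0, 0, 1, 1]
pipe = SimplePipeline([MeanImputer(), Standardizer()], NearestCentroid()).fit(X_train, y_train)
preds = pipe.predict([[1.5, 12.0], [8.5, None]])
print(preds)  # [0, 1]
```

Note that `predict` never calls `fit`: the missing value in the second new row is filled with the mean learned from the training data, which is the train/serve consistency the pipeline is meant to guarantee.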

Complexity

  • Time: Sum of the fit/predict costs of all pipeline steps (transforms plus estimator); repeated evaluation multiplies this by the number of fits (for example, k-fold CV over a grid of g candidate settings refits the whole pipeline about k × g times)
  • Space: Sum of fitted state across transforms and estimator, plus intermediate representations
  • Assumptions: Exact complexity is pipeline-dependent; ordering and data representation (dense/sparse) can materially change runtime/memory