
Ridge Regression (L2)

Formula

\[
\hat{\beta}=\arg\min_\beta \|y-X\beta\|_2^2 + \lambda\|\beta\|_2^2
\]
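Setting the gradient of the objective to zero gives the standard closed-form solution:

\[
-2X^\top(y - X\beta) + 2\lambda\beta = 0
\quad\Longrightarrow\quad
\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y
\]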

Parameters

  • \(X, y\): design matrix and target vector
  • \(\beta\): coefficient vector
  • \(\lambda \ge 0\): regularization strength (larger values shrink coefficients more)

What it means

Adds an L2 penalty to ordinary least squares, shrinking coefficients toward zero; this trades a small increase in bias for a reduction in variance.
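The shrinkage effect can be seen directly: as \(\lambda\) grows, the norm of the fitted coefficients decreases. A minimal NumPy sketch on synthetic data (all values here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    # Closed-form solve of (X^T X + lam*I) beta = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Coefficient norm shrinks monotonically as lambda grows
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 100.0)]
```

With \(\lambda = 0\) this recovers ordinary least squares; larger values pull the solution toward the origin.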

What it's used for

  • Handling multicollinearity.
  • Improving generalization in high-dimensional settings.

Key properties

  • Usually keeps all coefficients nonzero.
  • Penalty strength \(\lambda\) is tuned by validation/CV.
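One simple way to tune \(\lambda\) is a hold-out grid search. A minimal NumPy sketch on synthetic data (a real workflow would typically use k-fold CV, e.g. scikit-learn's `RidgeCV`):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=80)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Single train/validation split; k-fold CV averages over several splits
X_tr, X_val = X[:60], X[60:]
y_tr, y_val = y[:60], y[60:]

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
errs = [np.mean((y_val - X_val @ ridge(X_tr, y_tr, lam)) ** 2) for lam in grid]
best_lam = grid[int(np.argmin(errs))]
```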

Common gotchas

  • Standardize features before tuning \(\lambda\): the penalty is scale-sensitive, so unscaled features are penalized unevenly.
  • Coefficients are biased toward zero, so their magnitudes understate the underlying effects.
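The standardization step is short. A NumPy sketch (the column scales below are invented to make the problem obvious):

```python
import numpy as np

rng = np.random.default_rng(2)
# Columns on wildly different scales: without scaling, the L2 penalty
# punishes the large-scale feature's coefficient far less per unit of effect
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma  # zero mean, unit variance per column
```

The same `mu` and `sigma` must be reused to transform any held-out or test data.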

Example

Ridge is often a strong baseline when many correlated features exist.
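A small illustration of that point with two nearly collinear features (all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
z = rng.normal(size=n)
# Two nearly identical columns -> ill-conditioned X^T X
X = np.column_stack([z, z + 1e-3 * rng.normal(size=n)])
y = z + 0.1 * rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# OLS can assign large opposite-signed weights to the correlated columns;
# ridge keeps the solution moderate, splitting weight between them
```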

How to Compute (Pseudocode)

Input: design matrix X (n x d), targets y, regularization lambda
Output: ridge coefficients beta

# One common closed-form solve
A <- X^T X + lambda * I
b <- X^T y
beta <- solve_linear_system(A, b)
return beta
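The same steps as runnable Python, a minimal NumPy translation of the pseudocode (in practice a library routine such as scikit-learn's `Ridge` would normally be used):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solve: (X^T X + lam*I) beta = X^T y."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return np.linalg.solve(A, b)

# Sanity check on a tiny exactly-solvable problem
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 4.0])
beta = ridge_fit(X, y, lam=1.0)  # -> [1.0, 2.0], since (I + I) beta = y
```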

Complexity

  • Time: Dense direct solving is typically \(O(nd^2 + d^3)\) (forming \(X^T X\) and solving the \(d \times d\) system)
  • Space: Typically \(O(nd + d^2)\) for dense data plus the normal-equation matrix
  • Assumptions: \(n\) samples, \(d\) features; iterative solvers (especially for large sparse data) have different costs and memory profiles

See also