Ridge Regression (L2)¶
Formula¶
\[
\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\]
Parameters¶
- \(X,y\): data and targets
- \(\beta\): coefficients
- \(\lambda\): regularization strength
- \(\alpha\): L1/L2 mixing weight (elastic net only; not a parameter of plain ridge)
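Setting the gradient of the objective to zero gives the standard closed-form solution; for \(\lambda > 0\) the regularized matrix is positive definite and hence invertible:
\[
\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y
\]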
What it means¶
Adds an L2 penalty to the least-squares objective, shrinking coefficients toward zero to reduce variance (at the cost of some bias).
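To make the penalty concrete, a minimal NumPy sketch of the penalized objective (the function name is illustrative, not from any library):
import numpy as np

def ridge_objective(X, y, beta, lam):
    # ||y - X beta||_2^2 + lambda * ||beta||_2^2
    residual = y - X @ beta
    return residual @ residual + lam * (beta @ beta)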
What it's used for¶
- Handling multicollinearity.
- Improving generalization in high-dimensional settings.
Key properties¶
- Usually keeps all coefficients nonzero.
- Penalty strength \(\lambda\) is tuned by validation or cross-validation (see the sketch below).
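A minimal tuning sketch, assuming scikit-learn is available; note that scikit-learn names the penalty strength alpha, which corresponds to \(\lambda\) here, and the data below is just a placeholder:
import numpy as np
from sklearn.linear_model import RidgeCV

X, y = np.random.randn(100, 20), np.random.randn(100)  # placeholder data
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)   # 5-fold CV over a grid of penalty strengths
model.fit(X, y)
print(model.alpha_)                                     # selected penalty strength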
Common gotchas¶
- Scale (standardize) features before tuning \(\lambda\) in most workflows, since the penalty is scale-sensitive (see the sketch after this list).
- Interpretation changes because coefficients are biased toward 0.
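A minimal scaling sketch, again assuming scikit-learn and placeholder data; the scaler is fit inside the pipeline so its statistics come from the training data only:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = np.random.randn(100, 20), np.random.randn(100)     # placeholder data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # scikit-learn's alpha plays the role of lambda
pipe.fit(X, y)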
Example¶
Ridge is often a strong baseline when many correlated features exist.
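An illustrative sketch of that setting (synthetic data; exact numbers will vary): two nearly collinear features, where ridge typically keeps the coefficients far smaller and more stable than plain least squares.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))  # ridge norm is typically much smaller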
How to Compute (Pseudocode)¶
Input: design matrix X (n x d), targets y, regularization lambda
Output: ridge coefficients beta
# One common closed-form solve
A <- X^T X + lambda * I
b <- X^T y
beta <- solve_linear_system(A, b)
return beta
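A minimal NumPy translation of that pseudocode, assuming dense \(X\) and \(\lambda > 0\):
import numpy as np

def ridge_closed_form(X, y, lam):
    # Solve (X^T X + lambda * I) beta = X^T y
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y
    return np.linalg.solve(A, b)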
Complexity¶
- Time: Dense direct solving is typically \(O(nd^2 + d^3)\) (forming \(X^T X\) and solving the \(d \times d\) system)
- Space: Typically \(O(nd + d^2)\) for dense data plus the normal-equation matrix
- Assumptions: \(n\) samples, \(d\) features; iterative solvers (especially for large sparse data) have different costs and memory profiles