Lasso Regression (L1)¶
Formula¶
\[
\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1
\]
Parameters¶
- \(X, y\): design matrix and target vector
- \(\beta\): coefficients
- \(\lambda\): regularization strength
- \(\alpha\): L1/L2 mixing weight (elastic net only; it does not appear in the pure lasso objective above)
What it means¶
Adds an L1 penalty, which can drive some coefficients exactly to zero and perform feature selection.
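Why exact zeros arise: with the squared-error term scaled by \(1/2\) and unit-norm columns (a common derivation convention; the formula above omits the \(1/2\)), the coordinate-wise minimizer is a soft-threshold:
\[
\beta_j \leftarrow S(\rho_j, \lambda) = \operatorname{sign}(\rho_j)\,\max(|\rho_j| - \lambda,\ 0),
\qquad \rho_j = x_j^\top \Big(y - \sum_{k \neq j} x_k \beta_k\Big)
\]
Whenever \(|\rho_j| \le \lambda\), the update lands exactly at zero, which is why whole coefficients drop out rather than merely shrinking.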
What it's used for¶
- Sparse linear models.
- Feature selection when many weak features exist.
Key properties¶
- Can produce exact zeros.
- Solution path changes with \(\lambda\) and feature scaling.
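The path behavior can be inspected directly; a minimal sketch using scikit-learn's lasso_path (assumed available; scikit-learn names the penalty weight alpha), counting nonzero coefficients at a few penalty values:

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 20)))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

# coefs has shape (n_features, n_alphas); the active set grows as alpha shrinks
alphas, coefs, _ = lasso_path(X, y, n_alphas=5)
for a, c in zip(alphas, coefs.T):
    print(f"alpha={a:.4f}  nonzero={np.count_nonzero(c)}")
```

Note that the features are standardized first; without that, the penalty weighs coefficients on different scales unevenly and the path changes.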
Common gotchas¶
- With highly correlated features, lasso may choose one essentially arbitrarily and zero out the others (see the sketch after this list).
- Do not treat selected features as stable without resampling checks.
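To make the correlated-features gotcha concrete, a small sketch with scikit-learn (assumed available; scikit-learn's Lasso scales the squared loss by \(1/(2n)\)):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# two almost perfectly correlated copies of the same underlying signal
X = np.column_stack([x, x + 1e-3 * rng.normal(size=300)])
y = x + 0.1 * rng.normal(size=300)

# weight tends to land on one column while the other is zeroed out;
# which column "wins" can flip under small perturbations of the data
print(Lasso(alpha=0.05).fit(X, y).coef_)
```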
Example¶
For instance, lasso can shrink a model with 500 candidate features down to a much smaller active set of nonzero coefficients.
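A minimal sketch of that scenario with scikit-learn (assumed available); the sizes and the alpha value are illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 500 candidate features, only 10 of which actually drive the target
X, y = make_regression(n_samples=300, n_features=500, n_informative=10,
                       noise=5.0, random_state=0)
model = Lasso(alpha=1.0).fit(StandardScaler().fit_transform(X), y)
print("active set size:", np.count_nonzero(model.coef_))  # far fewer than 500
```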
How to Compute (Pseudocode)¶
Input: design matrix X, targets y, regularization lambda
Output: lasso coefficients beta
initialize beta
repeat until convergence:
    for each coordinate j:
        compute partial residual excluding feature j
        update beta[j] using a soft-thresholding coordinate step
return beta
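A runnable NumPy sketch of the loop above, written for the \(\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\) scaling convention (for the unscaled formula at the top, use \(\lambda/2\) in place of \(\lambda\)); a didactic implementation, not a production solver:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=100, tol=1e-6):
    """Coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)   # ||x_j||^2 for each column
    residual = y - X @ beta         # full residual, kept up to date
    for _ in range(n_iters):
        max_delta = 0.0
        for j in range(d):
            if col_sq[j] == 0.0:    # skip all-zero columns
                continue
            # rho_j = x_j^T (partial residual excluding feature j)
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual -= delta * X[:, j]  # keep residual consistent
                beta[j] = new_bj
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:          # converged: no coordinate moved much
            break
    return beta

# tiny usage demo: only the first two coefficients should survive
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
print(lasso_cd(X, y, lam=5.0))
```

Maintaining the residual incrementally (rather than recomputing y - X @ beta inside the inner loop) is what keeps each coordinate step at \(O(n)\).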
Complexity¶
- Time: Depends on the solver; a common coordinate-descent estimate is roughly \(O(Tnd)\) for \(T\) full passes over \(d\) coordinates with dense data.
- Space: Typically \(O(nd + d)\) for dense data plus coefficient and work vectors.
- Assumptions: \(n\) samples, \(d\) features; actual cost depends on sparsity, screening rules, and the convergence tolerance.