Skip to content

Lasso Regression (L1)

Formula

\[\n\hat{\beta}=\arg\min_\beta \|y-X\beta\|_2^2 + \lambda\|\beta\|_1\n\]

Parameters

  • \(X,y\): data and targets
  • \(\beta\): coefficients
  • \(\lambda\): regularization strength
  • \(\alpha\): L1/L2 mixing weight (elastic net only)

What it means

Adds an L1 penalty, which can drive some coefficients exactly to zero and perform feature selection.

What it's used for

  • Sparse linear models.
  • Feature selection when many weak features exist.

Key properties

  • Can produce exact zeros.
  • Solution path changes with \(\lambda\) and feature scaling.

Common gotchas

  • With highly correlated features, lasso may choose one arbitrarily.
  • Do not treat selected features as stable without resampling checks.

Example

Lasso can shrink 500 features down to a smaller active set for a simpler model.

How to Compute (Pseudocode)

Input: design matrix X, targets y, regularization lambda
Output: lasso coefficients beta

initialize beta
repeat until convergence:
  for each coordinate j:
    compute partial residual excluding feature j
    update beta[j] using a soft-thresholding coordinate step

return beta

Complexity

  • Time: Depends on the solver; a common coordinate-descent view is roughly \(O(Tnd)\) for \(T\) passes over \(d\) coordinates with dense data
  • Space: Typically \(O(nd + d)\) for dense data and coefficient/work vectors
  • Assumptions: \(n\) samples, \(d\) features; complexity depends on sparsity, screening rules, and convergence tolerance

See also