18 Lasso Regression

In this chapter we describe Lasso.

Though ridge regression has a great many applications and uses, there is one thing to note: it does not perform variable selection.

Let us parse through this statement in more detail. In the previous chapter, we saw that the Ridge Regression estimate of \(\mathbf{b}_{RR}\) is given by

\[ \mathbf{b}_{RR} = \left( \mathbf{X^\mathsf{T} X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X^\mathsf{T}y} \tag{18.1} \]

We also encountered the following geometric interpretation of Ridge Regression:

Figure 18.1: Illustration of Ridge Regression

Geometrically, we see that ridge coefficients may be set close to zero, but never exactly equal to zero.

You may ask why we want some coefficients to be equal to zero. To answer this question, let us return to our general model:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{18.2} \]

where \(Y\) denotes our response variable and \(X_1, \dots, X_p\) denote the predictors. If it truly were the case that \(\beta_i = 0\), the entire variable \(X_i\) is actually irrelevant and can be omitted our model completely. As such, if we can find an estimation technique that sets the coefficients of “unnecessary” variables equal to 0, we can use this technique to shrink the space of input variables.

So, we’ve now seen why we want some coefficients to be set equal to 0, and we’ve seen that Ridge Regression doesn’t actually do this. As such, we posit a new estimation technique: the Least Absolute Shrinkage and Selection Operator (it even has the word “Selection” in its name!), or LASSO for short.

18.1 Mathematical Setup

We begin, as we did in Ridge Regression, with a constrained minimization problem:

\[ \min_{\mathbf{b} \in \mathbb{R}^{p}} \left\{ \frac{1}{n} \left\| \mathbf{y} - \mathbf{Xb} \right\|_2^2 \right\} \hspace{5mm} \text{subject to} \hspace{5mm} \text{norm}\left( \mathbf{b} \right) < c \tag{18.3} \]

Now, instead of using the \(\ell_2\) norm for our constraint, we’ll actually use an \(\ell_1\) norm:

\[ \min_{\mathbf{b} \in \mathbb{R}^{p}} \left\{ \frac{1}{n} \left\| \mathbf{y} - \mathbf{Xb} \right\|_2^2 \right\} \hspace{5mm} \text{subject to} \hspace{5mm} \left\| \mathbf{b} \right\|_1 \leq c \tag{18.4} \]

In case you’re not familiar with the \(\ell_1\) norm, what we mean by this constraint is

\[ \left\| \mathbf{b} \right\|_1 \leq c \iff \sum_{j=1}^{p} | b_j| \leq c \tag{18.5} \]

It can be shown that the minimization in equation (1) is exactly equivalent to

\[ \min_{\mathbf{b}} \left\{ \frac{1}{n} \left( \mathbf{y} - \mathbf{Xb} \right)^\mathsf{T} \left( \mathbf{y} - \mathbf{Xb} \right) + \lambda \sum_{j=1}^{p} |b_j| \right\} \tag{18.6} \]

18.1.1 Closed Form?

Recall that the vector of Ridge Regression coefficients had a simple closed-form solution:

\[ \mathbf{b}_{RR} = \left( \mathbf{X^\mathsf{T} X} + \lambda \mathbf{I} \right)^{-1} \mathbf{X^\mathsf{T} y} \tag{18.7} \]

One might ask: do we have a closed-form solution for the LASSO? Unfortunately, the answer is, in general, no. Even without extensive mathematical background, you might be able to see why this is: minimization problems involve differentiation; the LASSO involves an absolute value; differentiation and absolute values don’t play well together.

Now, there are a few exceptions. In particular, when we have only \(p = 1\) predictors, one can actually find a piecewise-defined (but closed-form) solution to the LASSO. When the design matrix \(\mathbf{X}\) is orthonormal, it can be shown that the LASSO can be expressed as

\[ \mathrm{sgn}(b_j) \cdot \left( |b_j| - \lambda \right)_{+} \tag{18.8} \]

where \(b_j\) denotes the OLS estimate of \(b_j\), and \(x_{+}\) denotes the positive-part function (i.e. the positive part of \(x\)). However, in general, it is nearly impossible to extract a closed-form solution to the LASSO.

This is not to say we can’t find the coefficients numerically; there exist a number of extremely effective and efficient algorithms that can solve \(\ell_1\)-minimization problems. For example, we could use Gradient Descent (perhaps, more accurately, sub-gradient descent), or we could use coordinate descent.

Note, Gradient Descent cannot actually be used here. The problem is that the \(\ell_1\) norm term does not take well to gradients (think about derivatives and absolute value signs), and as such we don’t have a well-defined update step.

In practice, however, we use built-in functions to do the heavy lifting. For example, in R, the function glmnet() from the package "glmnet" can compute both Ridge and LASSO coefficients quite easily.

18.2 Geometric Visualization

Just because we can’t always find neat closed-form solutions to the LASSO, that isn’t to say we can gain some intuition behind what’s going on. As we did with Ridge Regression, we can sketch our constraint region and level sets of constant error (MSE):

Figure 18.2: Error contours with constraint region

Here’s the kicker: in LASSO, we could have the following situation:

Figure 18.3: Error contours with constraint region

Here we see the variable selection advertised in the beginning of this chapter. In this particular scenario, the ideal point has \(b_1-\)coordinate equal to 0, meaning the LASSO has completely zeroed-out \(b_1\) in the model, reducing the number of covariates from \(2\) down to \(1\).

18.2.1 Some More Math: Variable Selection in Action

It may be insightful to see some concrete math behind the idea that the LASSO performs variable selection, whereas Ridge Regression does not. For now, we make the simplifying assumption that \(\mathbf{X}\) is orthonormal (this ensures a closed-form solution to the LASSO). As mentioned in ESL (page 71), we have

\[ b_{j-RR} = \frac{b_j}{1 + \lambda} ; \hspace{5mm} b_{j-LASSO} = \mathrm{sign}(b_j) (|b_j| - \lambda)_{+} \tag{18.9} \]

It should be clear now that \(\mathbf{b}_{RR} = \mathbf{0}\) only when \(\lambda = \infty\) (which, of course, is a mathematically ill-phrased statement) which is to say that \(\mathbf{b}_{RR}\) is never equal to zero. With LASSO, on the other hand, setting \(\lambda = |b_j|\) results in a \(\mathbf{b}_{LASSO}\) value of exactly zero.

18.2.2 Example with `mtcars`

As an example, consider the built-in mtcars dataset from R. We may use glmnet() to compute the LASSO coefficients, for different values of the tuning parameter \(\lambda\) (\(\lambda\) was chosen to increase from 0 to 1 in steps of 0.1):

INSERT code for ggplot figure

LASSO coefficients for different \(\lambda\) values; as we see, the LASSO eventually sets coefficients equal to \(0\), thereby eliminating their corresponding covariate from the original model.

Compare this with the coefficients obtained through Ridge Regression (for the same \(\lambda\) values):

INSERT code for ggplot figure

Ridge coefficients for different \(\lambda\) values; as we see, Ridge Regression does not set coefficients equal to 0.

As with Ridge Regression, there isn’t an analytic expression we can use to find the optimal value of \(\lambda\). Rather, \(\lambda\) should be chosen using cross-validation.

18.3 Going Beyond

So far, we’ve seen \(\ell_2\) norms (through Ridge Regression) and \(\ell_1\) norms (through the LASSO). In general, we define the \(\ell_p\) norm to be

\[ \left\| \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \\ \end{bmatrix} \right\|_p = \sum_{j=1}^{n} | v_i |^p \]

One could plausibly replace the \(\ell_1\) norm in the LASSO with an \(\ell_p\) norm (for a different choice of \(p\)) to yield yet another type of estimator! One note; for \(p > 1\), the resulting estimator will not perform variable selection. This is because \(|b_j|^p\) is differentiable at \(0\) for \(p > 1\).

Another popular penalty (widely regarded as a “compromise” between Ridge Regression and LASSO) is the elastic-net penalty (Zou and Hastie, 2005), which replaces the \(\ell_1\) norm in equation (1) with

\[ \lambda \sum_{j=1}^{p} \left[ \alpha b_j^2 + (1 - \alpha) | b_j | \right] \]

The elastic net does perform variable selection (due to the absolute value term in its penalty), but also performs shrinkage on the remaining variables (like Ridge).

The LASSO is just the tip of the iceberg. In fact, we can consider a whole plethora of models with the form

\[ \min_{\mathbf{b}} \left\{ \frac{1}{n} \left( \mathbf{y} - \mathbf{Xb} \right)^\mathsf{T} \left( \mathbf{y} - \mathbf{Xb} \right) + \lambda \sum_{j=1}^{p} |b_j| \right\} \]