17 Ridge Regression

From ordinary least squares regression, we know that the predicted response is given by:

\[ \mathbf{\hat{y}} = \mathbf{X} (\mathbf{X^\mathsf{T} X})^{-1} \mathbf{X^\mathsf{T} y} \tag{17.1} \]

(provided the inverse exists). We also know that if \(\mathbf{X^\mathsf{T} X}\) is near-singular, finding its inverse could lead to trouble. For example, near-perfect multicollinearity corresponds to small eigenvalues of \(\mathbf{X^\mathsf{T} X}\), which will cause estimated regression coefficients to have large variance.
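To make this concrete, here is a minimal numerical sketch (in Python with NumPy; the design matrix and the degree of collinearity are made up for illustration) showing that a nearly collinear design yields a tiny eigenvalue of \(\mathbf{X^\mathsf{T} X}\) and highly unstable OLS coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)     # x2 is almost a copy of x1
X = np.column_stack([x1, x2])

# near-perfect multicollinearity: one eigenvalue of X'X is tiny
print(np.linalg.eigvalsh(X.T @ X))

# refitting OLS on different noise realizations of y gives wildly
# different coefficient estimates (high variance)
b_true = np.array([1.0, 1.0])
for _ in range(3):
    y = X @ b_true + rng.normal(size=n)
    print(np.linalg.solve(X.T @ X, X.T @ y))
```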

Hence, a natural idea arises: modify \(\mathbf{X^\mathsf{T} X}\) to avoid this danger of near-singularity. Specifically, let us add a positive constant \(\lambda\) to the diagonal of \(\mathbf{X^\mathsf{T} X}\):

\[ \mathbf{X^\mathsf{T} X} + \lambda \mathbf{I} \tag{17.2} \]

Note that \(\lambda\) is not an eigenvalue here; it is simply a positive constant. In what way does the addition of \(\lambda > 0\) to the diagonal of \(\mathbf{X^\mathsf{T} X}\) change the structure of this matrix?

By properties of eigenstructures, it turns out that \(\mathbf{X^\mathsf{T} X} + \lambda \mathbf{I}\) and \(\mathbf{X^\mathsf{T} X}\) have the same eigenvectors; however, the eigenvalues of the former matrix will never be \(0\).

Specifically, say \(\mathbf{X^\mathsf{T} X}\) has eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_p\). The eigenvalues of \(\mathbf{X^\mathsf{T} X} + \lambda \mathbf{I}\) will therefore be \(\overset{*}{\lambda_1}, \overset{*}{\lambda_2}, \dots, \overset{*}{\lambda_p}\) where \(\overset{*}{\lambda_j} = \lambda_j + \lambda\).
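This eigenstructure fact is easy to check numerically; the following sketch (Python with NumPy, on an arbitrary made-up design matrix) compares the eigendecompositions of \(\mathbf{X^\mathsf{T} X}\) and \(\mathbf{X^\mathsf{T} X} + \lambda \mathbf{I}\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
lam = 2.5

A = X.T @ X
evals_A, evecs_A = np.linalg.eigh(A)
evals_B, evecs_B = np.linalg.eigh(A + lam * np.eye(3))

# eigenvalues are simply shifted by lambda ...
print(np.allclose(evals_B, evals_A + lam))             # True
# ... and the eigenvectors are the same (up to sign)
print(np.allclose(np.abs(evecs_A), np.abs(evecs_B)))   # True
```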

Another interesting aspect is that there is actually a minimization problem behind this method of adding a constant \(\lambda\) to the diagonal (and hence to the eigenvalues). In Ridge Regression, the predicted output is:

\[ \mathbf{\hat{y}}_{RR} = \mathbf{X} (\mathbf{X^\mathsf{T} X} + \lambda \mathbf{I} )^{-1} \mathbf{X^\mathsf{T} y} \tag{17.3} \]
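As an illustrative sketch (Python with NumPy, made-up data), the following code computes fitted values from both (17.1) and (17.3) on a nearly collinear design; note how adding \(\lambda \mathbf{I}\) dramatically improves the conditioning of the matrix being inverted:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.001, size=n)])  # near-collinear
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
lam = 1.0

# condition numbers: adding lambda*I makes the system far better behaved
print(np.linalg.cond(X.T @ X))                       # huge
print(np.linalg.cond(X.T @ X + lam * np.eye(2)))     # modest

# fitted values from eq (17.1) and eq (17.3)
y_hat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
y_hat_rr  = X @ np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```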

To understand where this estimator comes from, let’s consider again the geometry of the MSE function in a regression context.

17.1 A New Minimization Problem

As you know, in linear regression the overall error function \(E()\) that we are interested in minimizing is the mean squared error (\(\text{MSE}\)). Moreover, we can look at the geometry of this error measure from the perspective of the parameters (i.e. the regression coefficients). From this perspective, we denote the error function as \(E(\mathbf{b})\), making explicit its dependency on the vector of coefficients \(\mathbf{b}\). For illustration purposes, let’s consider two inputs \(X_1\) and \(X_2\), and their corresponding parameters \(b_1\) and \(b_2\). The error function \(E(\mathbf{b})\) generates a convex error surface with the shape of a bowl, or paraboloid (see figure below).

Figure 17.1: Error Surface

As usual we can express MSE for \(E(\mathbf{b})\) as follows:

\[ E(\mathbf{b}) = \frac{1}{n} (\mathbf{Xb} - \mathbf{y})^{\mathsf{T}} (\mathbf{Xb} - \mathbf{y}) \tag{17.4} \]
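As a quick sanity check (a small Python/NumPy sketch with made-up data), the matrix form (17.4) agrees with the familiar “average of squared residuals” computation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = rng.normal(size=40)
b = np.array([0.5, -1.0])
n = len(y)

# matrix form of eq (17.4) ...
E_matrix = (X @ b - y) @ (X @ b - y) / n
# ... equals the usual mean of squared residuals
print(np.isclose(E_matrix, np.mean((X @ b - y) ** 2)))   # True
```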

In OLS, we minimize \(E(\mathbf{b})\) unconditionally; that is, without any type of restriction. The associated solution is indicated with a blue dot at the center of the elliptical contours of constant error.

Figure 17.2: Error contours with OLS solution

17.1.1 Constraining Regression Coefficients

An alternative to minimizing \(E(\mathbf{b})\) unconditionally is to impose a restriction on the squared magnitude of the regression coefficients. In other words, we still want to minimize the mean squared error \(E(\mathbf{b})\), but now we don’t let \(\mathbf{b}\) take arbitrary values. Instead, we are going to require the following condition on \(b_1, \dots, b_p\):

\[ \textsf{constraining coefficients:} \quad \sum_{j=1}^{p} b_{j}^2 \leq c \tag{17.5} \]

Here’s an easy way to think about—and remember—this condition: think of \(c\) as a budget for how much you can spend on the size of all coefficients \(b_1, \dots, b_p\) (or their squared values to be more precise).

Adding this restriction to the MSE, we now have the following constrained minimization of \(E(\mathbf{b})\):

\[ \min_{\mathbf{b}} \left\{ \frac{1}{n} (\mathbf{Xb} - \mathbf{y})^{\mathsf{T}} (\mathbf{Xb} - \mathbf{y}) \right\} \quad \mathrm{s.t.} \quad \| \mathbf{b} \|_{2}^{2} = \mathbf{b^\mathsf{T} b} \leq c \tag{17.6} \]

for some “budget” \(c\).

On the \(b_1, b_2\) plane, what is the locus of points such that \(\mathbf{b^\mathsf{T}b} \leq c\)? The answer is a disk of radius \(\sqrt{c}\), centered at the origin. Hence, sketching the level curves (curves of constant error) together with this restriction, we obtain the following picture (for some value of \(c\)):

Figure 17.3: Error contours with constraint region

Depending on the chosen value of \(c\), we could end up with a constraint region big enough to include the OLS solution, like in the following picture. As you would expect, in this case the restriction does not constrain anything at all, and the constrained solution coincides with the OLS one.

Figure 17.4: Error contours with a large budget

Of course, we could make our budget stricter by reducing the value of \(c\) we choose, something depicted in the next figure.

Figure 17.5: Error Contours with small budget

17.2 A New Minimization Solution

Let’s consider one generic elliptical contour of constant error, a given budget \(c\), and a point \(\mathbf{b}\) satisfying the budget constraint, like in the picture below.

Figure 17.6: Generic error contour with candidate point

As you can tell from the above diagram, the candidate point \(\mathbf{b}\) is fulfilling the restriction \(\mathbf{b^\mathsf{T} b} = c\). However, this point is not fully minimizing \(E(\mathbf{b})\), given the constraint. Visually speaking, we could find other candidate values for \(\mathbf{b}\) along the red circumference that would give us smaller \(E(\mathbf{b})\).

How do we know that the shown candidate point \(\mathbf{b}\) can be improved? To answer this question we should visualize the gradient \(\nabla E(\mathbf{b})\) which, as we know, will be pointing in the direction orthogonal to the contour ellipse—i.e. direction of largest change of \(E(\mathbf{b})\). In addition, we should also pay attention to the direction of the candidate point, illustrated with the red vector in the figure below. Notice that this vector is normal (i.e. orthogonal) to the circumference of the constraint.

Figure 17.7: Gradient of error surface associated to candidate point

Notice also that the angle between the normal vector and the gradient is less than 180 degrees. This is an indication that we can find better \(\mathbf{b}\) points that make the error smaller. But is there an optimal \(\mathbf{b^*}\) that achieves the smallest \(E(\mathbf{b})\) while satisfying the budget constraint? With this visualization, we get the intuition that \(\mathbf{b}^{*}\) must lie on the boundary of the constraint region (i.e., in the 2-D case, on the circle of radius \(\sqrt{c}\) centered at the origin).

It turns out that the optimal vector \(\mathbf{b^*}\) is the one that points in exactly the opposite direction of the gradient \(\nabla E(\mathbf{b}^*)\).

Figure 17.8: Optimal point opposite to the gradient

At this point, the gradient and the normal vector will be exactly antiparallel which, in mathematical terms, means that the gradient is proportional to the negative of the normal vector:

\[ \nabla E(\mathbf{b}^*) \propto - \mathbf{b}^{*} \tag{17.7} \]

For mathematical convenience, let us write the proportionality constant as \(2\lambda/n\), for some \(\lambda > 0\). That is, we want the vector \(\mathbf{b}^{*}\) such that

\[ \nabla E(\mathbf{b}^*) = - 2 \frac{\lambda}{n} \mathbf{b}^{*} \tag{17.8} \]

Rearranging some terms, the above equation becomes:

\[ \nabla E(\mathbf{b}^*) + 2 \frac{\lambda}{n} \mathbf{b}^{*} = \mathbf{0} \tag{17.9} \]

Now, notice that the above equation looks like the gradient of something… But what is that something? The answer turns out to be:

\[ f(\mathbf{b}) = E(\mathbf{b}) + (\lambda/n) \mathbf{b}^\mathsf{T} \mathbf{b} \tag{17.10} \]

You can verify directly that \(\nabla f(\mathbf{b})\) is precisely the left hand side of equation (17.9):

\[ \nabla f(\mathbf{b}^*) = \nabla E(\mathbf{b}^*) + 2 \frac{\lambda}{n} \mathbf{b}^{*} \tag{17.11} \]
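If you prefer a numerical verification, the following sketch (Python with NumPy; the helper `num_grad` is just an ad hoc finite-difference routine, not part of any library) checks this identity at an arbitrary point \(\mathbf{b}\):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.8
b = rng.normal(size=p)

E = lambda v: np.mean((X @ v - y) ** 2)       # E(b), eq (17.4)
f = lambda v: E(v) + (lam / n) * (v @ v)      # f(b), eq (17.10)

def num_grad(fun, v, h=1e-6):
    """Central finite-difference gradient of fun at v (ad hoc helper)."""
    g = np.zeros_like(v)
    for j in range(len(v)):
        step = np.zeros_like(v)
        step[j] = h
        g[j] = (fun(v + step) - fun(v - step)) / (2 * h)
    return g

# equation (17.11): grad f(b) = grad E(b) + (2*lambda/n) * b
print(np.allclose(num_grad(f, b), num_grad(E, b) + (2 * lam / n) * b, atol=1e-6))
```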

Let us rephrase what we have seen so far. From our initial minimization problem (17.6), we saw that its solution is given by the vector \(\mathbf{b}^{*}\) that solves another minimization problem:

\[ \min_{\mathbf{b}} \left\{ E(\mathbf{b}) + \frac{\lambda}{n} \mathbf{b}^\mathsf{T} \mathbf{b} \right\} \tag{17.12} \]

In other words, the minimizations (17.6) and (17.12) are exactly equivalent.
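The equivalence can also be checked numerically. The sketch below (Python, assuming NumPy and SciPy are available; the data are made up) solves the penalized problem (17.12) in closed form, sets the budget to \(c = \mathbf{b}^\mathsf{T}\mathbf{b}\) of that solution, and then solves the constrained problem (17.6) with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam = 5.0
# penalized (ridge) solution of (17.12), in closed form
b_pen = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
c = b_pen @ b_pen                      # the budget implied by this lambda

# constrained problem (17.6): minimize MSE subject to b'b <= c
E = lambda b: np.mean((X @ b - y) ** 2)
res = minimize(E, x0=np.zeros(p),
               constraints={"type": "ineq", "fun": lambda b: c - b @ b})

print(np.allclose(res.x, b_pen, atol=1e-3))    # ~True (up to solver tolerance)
```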

From (17.12), we can obtain the solution, namely, the ridge regression estimate of \(\mathbf{b}\):

\[\begin{align*} f(\mathbf{b}) &= E(\mathbf{b}) + (\lambda/n) \mathbf{b}^\mathsf{T} \mathbf{b} \\ &= \frac{1}{n} (\mathbf{Xb} - \mathbf{y})^\mathsf{T} (\mathbf{Xb} - \mathbf{y}) + (\lambda/n) \mathbf{b}^\mathsf{T} \mathbf{b} \\ &= \frac{1}{n} \mathbf{b^\mathsf{T}X^\mathsf{T}Xb} - \frac{2}{n} \mathbf{b^\mathsf{T}X^\mathsf{T}y} + \frac{1}{n} \mathbf{y^\mathsf{T}y} + \frac{\lambda}{n} \mathbf{b}^\mathsf{T} \mathbf{b} \tag{17.13} \end{align*}\]

Differentiating \(f(\mathbf{b})\) with respect to \(\mathbf{b}\), we get:

\[ \nabla f(\mathbf{b}) = \frac{2}{n} \mathbf{X^\mathsf{T}Xb} - \frac{2}{n} \mathbf{X^\mathsf{T}y} + \frac{2 \lambda}{n} \mathbf{b} \tag{17.14} \]

Setting \(\nabla f(\mathbf{b})\) equal to zero:

\[\begin{align*} \mathbf{X^\mathsf{T}Xb} - \mathbf{X^\mathsf{T}y} + \lambda \mathbf{b} &= \mathbf{0} \\ \Longrightarrow (\mathbf{X^\mathsf{T}X} + \lambda \mathbf{I}) \mathbf{b} &= \mathbf{X^\mathsf{T} y} \\ \Longrightarrow \mathbf{b} &= (\mathbf{X^\mathsf{T}X} + \lambda \mathbf{I})^{-1} \mathbf{X^\mathsf{T} y} \tag{17.15} \end{align*}\]

Therefore, the solution is:

\[ \textsf{ridge coefficients:} \quad \mathbf{b}_{RR} = (\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \tag{17.16} \]
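Here is a minimal sketch of (17.16) in Python with NumPy (the function name and the toy data are made up for illustration). In practice one solves the linear system rather than explicitly inverting \(\mathbf{X}^\mathsf{T}\mathbf{X} + \lambda \mathbf{I}\); the sketch also checks that the gradient (17.14) vanishes at the solution:

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """b_RR = (X'X + lam*I)^(-1) X'y, eq (17.16), via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# toy usage with made-up data
rng = np.random.default_rng(5)
n, p, lam = 50, 4, 1.0
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)
b_rr = ridge_coefficients(X, y, lam)

# sanity check: the gradient (17.14) vanishes at b_RR
grad = (2 / n) * (X.T @ X @ b_rr - X.T @ y) + (2 * lam / n) * b_rr
print(np.allclose(grad, 0))   # True
```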

As you can imagine, the optimal vector \(\mathbf{b}\) that minimizes \(E\) and satisfies the constraint will lie on a given contour of constant error. This situation is illustrated in the image below. The image also depicts the general case in which the estimated ridge parameters are biased: the optimal vector \(\mathbf{b}\) does not point in the direction of the OLS solution.

Figure 17.9: Error contour intersecting the constraint circle

Note that, from the discussion at the beginning of this chapter, the ridge regression estimate always exists and is always unique for any \(\lambda > 0\) (unlike the OLS estimate, which requires \(\mathbf{X^\mathsf{T} X}\) to be invertible).

17.3 What does RR accomplish?

In ridge regression, the key ingredient is the hyperparameter \(\lambda\). Setting \(\lambda = 0\) causes the ridge solution to be the same as the OLS solution:

\[\begin{align*} \mathbf{b}_{RR} &= (\mathbf{X}^\mathsf{T} \mathbf{X} + 0 \mathbf{I})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \\ &= (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y} \\ &= \mathbf{b}_{OLS} \tag{17.17} \end{align*}\]

But what about specifying a very large value for \(\lambda\)? Let’s think about this case. As \(\lambda\) becomes larger and larger, the terms on the diagonal of \((\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})\) will be dominated by \(\lambda\). Consequently, the matrix behaves more and more like \(\lambda \mathbf{I}\), and its inverse, \((\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})^{-1}\), approaches \(\frac{1}{\lambda} \mathbf{I}\):

\[ \lambda \gg 0 \quad \Longrightarrow \quad (\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})^{-1} \rightarrow \frac{1}{\lambda} \mathbf{I} \tag{17.18} \]

Taking this idea to the extreme, as \(\lambda\) grows without bound, \(1/\lambda\) approaches zero, essentially “collapsing” \((\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})^{-1}\) into a \(p \times p\) matrix full of zeros:

\[ \lambda \rightarrow \infty \quad \Longrightarrow \quad (\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda \mathbf{I})^{-1} \rightarrow \mathbf{0}_{(p,p)} \tag{17.19} \]

thus, the ridge regression coefficients will also collapse to zero:

\[ \lambda \rightarrow \infty \quad \Longrightarrow \quad \mathbf{b}_{RR} = \mathbf{0} \tag{17.20} \]
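The two extremes can be seen in a short sketch (Python with NumPy, made-up data): with \(\lambda = 0\) the ridge coefficients reproduce OLS exactly, and as \(\lambda\) grows they shrink toward zero:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 1.0, 10.0, 100.0, 1e6]:
    b_rr = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.round(b_rr, 4))   # coefficients shrink as lam grows

# lam = 0 reproduces the OLS solution exactly
print(np.allclose(np.linalg.solve(X.T @ X + 0.0 * np.eye(p), X.T @ y), b_ols))
```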

Relation between \(\lambda\) and \(c\)?

What is the relationship between the budget \(c\) and the non-negative constant \(\lambda\)? These two scalars are intimately related. Imposing a large budget \(c\) results in a small \(\lambda\); the smaller the \(\lambda\), the closer \(\mathbf{b}_{RR}\) is to the OLS solution. Conversely, imposing a small budget \(c\) causes \(\lambda\) to become large; the larger the \(\lambda\), the smaller the ridge coefficients.

\[ \uparrow c \quad \Longrightarrow \quad \downarrow \lambda \\ \downarrow c \quad \Longrightarrow \quad \uparrow \lambda \]

17.3.1 How to find \(\lambda\)?

In ridge regression, \(\lambda\) is a tuning parameter, and therefore it cannot be found analytically. Instead, you have to carry out a trial-and-error process with various values of \(\lambda\), and determine which one seems to work best. How? Typically with cross-validation, or some other type of resampling approach. Here is a description of the steps to be carried out.

Assume that we have a training data set consisting of \(n\) data points: \(\mathcal{D}_{train} = \{(\mathbf{x_1}, y_1), \dots, (\mathbf{x_n}, y_n)\}\). Using \(K\)-fold cross-validation, we (randomly) split the data into \(K\) folds:

\[ \mathcal{D}_{train} = \mathcal{D}_{fold-1} \cup \mathcal{D}_{fold-2} \cup \dots \cup \mathcal{D}_{fold-K} \]

Each fold set \(\mathcal{D}_{fold-k}\) will play the role of an evaluation set \(\mathcal{D}_{eval-k}\). Having defined the \(K\) fold sets, we form the corresponding \(K\) retraining sets:

  • \(\mathcal{D}_{train-1} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-1}\)
  • \(\mathcal{D}_{train-2} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-2}\)
  • \(\dots\)
  • \(\mathcal{D}_{train-K} = \mathcal{D}_{train} \setminus \mathcal{D}_{fold-K}\)

The cross-validation procedure then repeats the following loops:

  • For each candidate value \(\lambda_b\) in a grid \(\lambda_1, \lambda_2, \dots, \lambda_B\) (e.g. \(0.001, 0.002, \dots\))
    • For \(k = 1, \dots, K\)
      • fit RR model \(h_{b,k}\) with \(\lambda_b\) on \(\mathcal{D}_{train-k}\)
      • compute and store \(E_{eval-k} (h_{b,k})\) using \(\mathcal{D}_{eval-k}\)
    • end for \(k\)
    • compute and store \(E_{cv_{b}} = \frac{1}{K} \sum_k E_{eval-k}(h_{b,k})\)
  • end for \(\lambda_b\)
  • Compare all cross-validation errors \(E_{cv_1}, E_{cv_2}, \dots, E_{cv_B}\) and choose the smallest of them, say \(E_{cv_{b^*}}\), with associated value \(\lambda^{*} = \lambda_{b^*}\)
  • Use \(\lambda^*\) to fit the (finalist) Ridge Regression model on the full training set: \(\mathbf{b}_{RR} = (\mathbf{X}^\mathsf{T} \mathbf{X} + \lambda^* \mathbf{I})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y}\) (a code sketch of the whole procedure is given below)
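Below is a rough sketch of this cross-validation procedure in Python with NumPy. The function names, the grid of \(\lambda\) values, and the choice \(K = 5\) are all arbitrary; a hand-rolled fold splitter is used instead of any particular library routine:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam*I)^(-1) X'y via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_choose_lambda(X, y, lambdas, K=5, seed=0):
    """Return the lambda with the smallest K-fold cross-validation MSE."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)    # random split into K folds
    cv_errors = []
    for lam in lambdas:                              # loop over the grid
        fold_errors = []
        for k in range(K):
            eval_idx = folds[k]
            train_idx = np.setdiff1d(np.arange(n), eval_idx)
            b = ridge_fit(X[train_idx], y[train_idx], lam)   # fit on D_train-k
            resid = X[eval_idx] @ b - y[eval_idx]            # evaluate on D_eval-k
            fold_errors.append(np.mean(resid ** 2))
        cv_errors.append(np.mean(fold_errors))       # E_cv for this lambda
    return lambdas[int(np.argmin(cv_errors))]

# toy usage with made-up data and an arbitrary grid of lambda values
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)
lam_star = cv_choose_lambda(X, y, lambdas=np.logspace(-3, 3, 25))
b_final = ridge_fit(X, y, lam_star)    # refit on the full training set
```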