# 14 Regularization Techniques

In this part of the book we will talk about the notion of regularization (what is regularization, what is the purpose of regularization, what approaches are used for regularization) all of this within the context of linear models. However, keep in mind that you can also use regularization in non-linear contexts.

So what is regularization? If you look at the dictionary definition of the term
*regularization* you should find something like this:

regularization; the act of bringing to uniformity; make (something) regular

Simply put: regularization has to do with “making things regular”.

But what about *regularization* in the context of supervised learning? What are
the “irregular” things that need to be made “regular”? In order to answer this
question, we will first motivate the discussion by looking at the simulation
example used in chapter Overfitting, and then we will focus on
the effects of having predictors with high multicollinearity.

## 14.1 Example: Regularity in Models

Let’s use the simulation example from chapter Overfitting.

We consider a target function with some noise given by the following expression:

\[ f(x) = sin(1 + x^2) + \varepsilon \tag{14.1} \]

with the input variable \(x\) in the interval \([0,1]\), and the noise term \(\varepsilon \sim N(\mu=0, \sigma=0.03)\).

The function of the signal, \(sin(1 + x^2)\), is depicted in the figure below. Keep in mind that in real life we will never have this knowledge.

We randomly simulate seven *in-sample* sets each of size 10 points
\((x_i, y_i)\); and for each set we fit several regression models using
polynomials of various degrees from 1 to 9 (see figure below).

Notice again the high flexibility of the high-order polynomials, namely those of degree 7, 8 and 9. These polynomials show an extremely volatile behavior. In contrast, the low-order polynomials (e.g. degrees 1 t o 3) are much more stable.

In general, we can say that the low-order polynomials are very “regular” because even though the data sets change for every fitted model, the obtained fits are fairly similar. Likewise, we can say that the mid-order polynomials of degrees 4-to-6 are somewhat “regular” with one or two data sets resulting in atypical fits compared to the rest. Finally, the high-order polynomials are very “irregular”; by this we mean that changing from one in-sample data set to another produces a remarkably different fit.

On one hand, we know that low-complexity models behave more regularly but they tend to have low learning capacity. Put in more formal terms: they tend to have small variance but at the expense of exhibiting large bias.

On the other hand, we also know that high-complexity models tend to have more learning capacity with a better opportunity to get closer to the target model. However, this increased capacity comes at the expense of larger variance. In other words, they tend to have low bias but at the expense of exhibiting a more irregular behavior.

In summary, the notion of regularization is tightly connected to the bias-variance tradeoff. The larger the variance of models, the more irregular they will be, and viceversa. What regularization pursuits is to find ways of controlling the behavior (i.e. the variance) of more flexible models in order to obtain lower generalization errors. How? By simplifying or reducing the flexibility of models in a manner that their variance goes down without sacrifing too much bias.

Having exposed the high-level intuition of regularization, let’s discuss one of
the typical causes for having irregular behavior in linear regression models:
**multicollinearity**.

## 14.2 Multicollinearity Issues

One of the issues when fitting regression models is due to multicollinearity: the condition that arises when two or more predictors are highly correlated.

How does this affect OLS regression?

When one or more predictors are linear combinations of other predictors, then
\(\mathbf{X^\mathsf{T} X}\) is singular. This is known as *exact collinearity*.
When this happens, there is no unique least squares estimate \(\mathbf{b}\).

A more challenging problem arises when \(\mathbf{X^\mathsf{T} X}\) is close but
not exactly singular.
This is usually referred to as *near perfect collinearity* or, more commonly,
as *multicollinearity*. Why is this a problem? Because multicollinearity leads
to imprecise (i.e. unstable, irregular) estimates \(\mathbf{b}\).

What are the typical causes of multicollinearity?

- One or more predictors are linear combinations of other predictors
- One or more predictors are almost perfect linear combinations of other predictors
- There are more predictors than observations \(p > n\)

### 14.2.1 Toy Example

To illustrate the issues of dealing with multicollinearity, let’s play with
`mtcars`

, one of the built-in data sets in R.

```
mtcars[1:10, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
```

Let’s use `mpg`

as response, and `disp`

, `hp`

, and `wt`

as predictors.

```
# response
mpg <- mtcars$mpg
# predictors
disp <- mtcars$disp
hp <- mtcars$hp
wt <- mtcars$wt
# standardized predictors
X <- scale(cbind(disp, hp, wt))
```

Let’s inspect the correlation matrix given by the equation below, assuming the data matrix of predictors \(\mathbf{X}\) is mean-centered:

\[ \mathbf{R} = \frac{1}{n-1} \mathbf{X^\mathsf{T} X} \tag{14.2} \]

```
# correlation matrix
cor(X)
#> disp hp wt
#> disp 1.0000000 0.7909486 0.8879799
#> hp 0.7909486 1.0000000 0.6587479
#> wt 0.8879799 0.6587479 1.0000000
```

If we look at the circle of correlations from a principal components analysis,
we see the following pattern: `hp`

, `disp`

, and `wt`

are positively correlated;
indicated by the fact the arrows are pointing in a common general direction.

Carrying out an ordinary least squares regression, that is, regressing `mpg`

onto `disp`

, `hp`

, and `wt`

we get the following output provided the
linear model function `lm()`

:

```
#>
#> Call:
#> lm(formula = mpg ~ disp + hp + wt)
#>
#> Coefficients:
#> (Intercept) disp hp wt
#> 37.105505 -0.000937 -0.031157 -3.800891
#>
#> Call:
#> lm(formula = mpg ~ disp + hp + wt)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.891 -1.640 -0.172 1.061 5.861
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.105505 2.110815 17.579 < 2e-16 ***
#> disp -0.000937 0.010350 -0.091 0.92851
#> hp -0.031157 0.011436 -2.724 0.01097 *
#> wt -3.800891 1.066191 -3.565 0.00133 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.639 on 28 degrees of freedom
#> Multiple R-squared: 0.8268, Adjusted R-squared: 0.8083
#> F-statistic: 44.57 on 3 and 28 DF, p-value: 8.65e-11
```

Analytically, the estimated vector of coefficients \(\mathbf{b}\), and the predicted response \(\mathbf{\hat{y}}\) are obtained as:

\(\mathbf{b} = (\mathbf{X^\mathsf{T} X})^{-1} \mathbf{X^\mathsf{T} y}\)

\(\mathbf{\hat{y}} = \mathbf{X} (\mathbf{X^\mathsf{T} X})^{-1} \mathbf{X^\mathsf{T} y}\)

So let’s take a look at the matrix \((\mathbf{X^\mathsf{T} X})^{-1}\)

```
#> disp hp wt
#> disp 0.23627475 -0.08598357 -0.15316573
#> hp -0.08598357 0.08827847 0.01819843
#> wt -0.15316573 0.01819843 0.15627798
```

#### Exact Collinearity

Let’s introduce exact collinearity. For example, let’s create a fourth variable
`disp1`

that is a multiple of `disp`

, and then let’s add this variable to the
data matrix `X`

. What happens if we try to compute the inverse of
\(\mathbf{X_{1}^\mathsf{T} X_1}\)?

```
disp1 <- 10 * disp
X1 <- scale(cbind(disp, disp1, hp, wt))
solve(t(X1) %*% X1)
#> Error in solve.default(t(X1) %*% X1): system is computationally singular: reciprocal condition number = 1.55757e-17
```

As you can tell, R detected perfect collinearity, and the computation of the inverse shows that the matrix \(\mathbf{X_{1}^\mathsf{T} X_1}\) is singular.

#### Near-Exact Collinearity

Let’s introduce near-exact collinearity by creating a new variable `disp2`

which is almost a clone of `disp`

, and put everything together in a new
matrix `X2`

:

```
set.seed(123)
disp2 <- disp + rnorm(length(disp))
X2 <- scale(cbind(disp, disp2, hp, wt))
solve(t(X2) %*% X2)
#> disp disp2 hp wt
#> disp 588.167214 -590.721826 1.01055902 1.99383316
#> disp2 -590.721826 593.525960 -1.10174784 -2.15719062
#> hp 1.010559 -1.101748 0.09032362 0.02220277
#> wt 1.993833 -2.157191 0.02220277 0.16411837
```

In this case, R was able to invert the matrix \(\mathbf{X_{2}^\mathsf{T} X_2}\). How does this compare with \((\mathbf{X^\mathsf{T} X})^{-1}\)?

```
solve(t(X) %*% X)
#> disp hp wt
#> disp 0.23627475 -0.08598357 -0.15316573
#> hp -0.08598357 0.08827847 0.01819843
#> wt -0.15316573 0.01819843 0.15627798
```

When we compare \((\mathbf{X^\mathsf{T} X})^{-1}\) against
\((\mathbf{X_{2}^\mathsf{T} X_2})^{-1}\), we see that the first entry in the
diagonal of both matrices (the entry associated to `disp`

) are very different.

#### More Extreme Near-Exact Collinearity

Let’s make things even more extreme! What about increasing the amount of
near-exact collinearity by creating a new variable `disp3`

which is almost a perfect clone of `disp? This time we put everything in a new matrix`

X3`:

```
set.seed(123)
disp3 <- disp + rnorm(length(disp), mean = 0, sd = 0.1)
X3 <- scale(cbind(disp, disp3, hp, wt))
cor(disp, disp3)
#> [1] 0.9999997
```

And to make things more interesting, let’s create one more matrix,
denoted `X31`

, that is a copy of `X3`

but contains a tiny modification,
that in theory, would hardly have any effect:

```
# small changes may have a "butterfly" effect
disp31 <- disp3
# change just one observation
disp31[1] <- disp3[1] * 1.01
X31 <- scale(cbind(disp, disp31, hp, wt))
cor(disp, disp31)
#> [1] 0.9999973
```

When we calculate the inverse of both \(\mathbf{X}_{3}^\mathsf{T} \mathbf{X}_{3}\) and \(\mathbf{X}_{31}^\mathsf{T} \mathbf{X}_{31}\), something weird happens:

```
solve(t(X3) %*% X3)
#> disp disp3 hp wt
#> disp 59175.36325 -59202.97211 10.91501090 21.38646548
#> disp3 -59202.97211 59230.83035 -11.00617104 -21.54976679
#> hp 10.91501 -11.00617 0.09032362 0.02220277
#> wt 21.38647 -21.54977 0.02220277 0.16411837
solve(t(X31) %*% X31)
#> disp disp31 hp wt
#> disp 5941.5946101 -5942.3977358 0.30661752 0.64947961
#> disp31 -5942.3977358 5943.4373181 -0.39266978 -0.80278577
#> hp 0.3066175 -0.3926698 0.08830442 0.01825147
#> wt 0.6494796 -0.8027858 0.01825147 0.15638642
```

This is what we may call a “butterfly effect.” By modifying just one cell in \(\mathbf{X}_{31}\) by adding a little amount of random noise, the inverses \((\mathbf{X}_{3}^\mathsf{T} \mathbf{X}_{3})^{-1}\) and \((\mathbf{X}_{31}^\mathsf{T} \mathbf{X}_{31})^{-1}\) have changed dramatically. Which illustrates some of the issues when dealing with matrices that are not technically singular, but have a large degree of multicollinearity.

For comparison purposes:

```
# using predictors in X
reg <- lm(mpg ~ disp + hp + wt)
reg$coefficients
#> (Intercept) disp hp wt
#> 37.1055052690 -0.0009370091 -0.0311565508 -3.8008905826
# using predictors in X
reg2 <- lm(mpg ~ disp + disp2 + hp + wt)
reg2$coefficients
#> (Intercept) disp disp2 hp wt
#> 36.59121161 0.43074188 -0.43323471 -0.02970117 -3.60121209
# using predictors in X3
reg3 <- lm(mpg ~ disp + disp3 + hp + wt)
reg3$coefficients
#> (Intercept) disp disp3 hp wt
#> 36.59121161 4.32985428 -4.33234711 -0.02970117 -3.60121209
# using predictors in X31
reg31 <- lm(mpg ~ disp + disp31 + hp + wt)
reg31$coefficients
#> (Intercept) disp disp31 hp wt
#> 37.19129692 2.02426582 -2.02580803 -0.03091464 -3.76623514
```

## 14.3 Irregular Coefficients

Another interesting thing to look at in linear regression—via OLS—, is the theoretical variance-covariance matrix of the regression coefficients, given by:

\[ Var(\mathbf{\boldsymbol{\hat{\beta}}}) = \ \begin{bmatrix} Var(\hat{\beta}_1) & Cov(\hat{\beta}_1, \hat{\beta}_2) & \cdots & Cov(\hat{\beta}_1, \hat{\beta}_p) \\ Cov(\hat{\beta}_2, \hat{\beta}_1) & Var(\hat{\beta}_2) & \cdots & Cov(\hat{\beta}_2, \hat{\beta}_p) \\ \vdots & & \ddots & \vdots \\ Cov(\hat{\beta}_p, \hat{\beta}_1) & Cov(\hat{\beta}_p, \hat{\beta}_2) & \cdots & Var(\hat{\beta}_p) \\ \end{bmatrix} \tag{14.3} \]

Doing some algebra, the matrix \(Var(\boldsymbol{\hat{\beta}})\) can be compactly expressed in matrix notation as:

\[ Var(\boldsymbol{\hat{\beta}}) = \sigma^2 (\mathbf{X^\mathsf{T} X})^{-1} \tag{14.4} \]

The variance of a particular coefficient \(\hat{\beta}_j\) is given by:

\[ Var(\hat{\beta}_j) = \sigma^2 \left [ (\mathbf{X^\mathsf{T} X})^{-1} \right ]_{jj} \tag{14.5} \]

where \(\left [ (\mathbf{X^\mathsf{T} X})^{-1} \right ]_{jj}\) is the \(j\)-th diagonal element of \((\mathbf{X^\mathsf{T} X})^{-1}\)

A couple of remarks:

- Recall again that we don’t know \(\sigma^2\). How can we find an estimator \(\hat{\sigma}^2\)?
- We don’t observe the error terms \(\boldsymbol{\varepsilon}\) but we do have the residuals \(\mathbf{e = y - \hat{y}}\)
- We also have the Residual Sum of Squares (RSS)

\[ \text{RSS} = \sum_{i=1}^{n} e_{i}^{2} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{14.6} \]

#### Effect of Multicollinearity

Assuming standardized variables, \(\mathbf{X^\mathsf{T} X} = n \mathbf{R}\), it can be shown that:

\[ Var(\boldsymbol{\hat{\beta}}) = \sigma^2 \left ( \frac{\mathbf{R}^{-1}}{n} \right ) \tag{14.7} \]

and \(Var(\hat{\beta}_j)\) can then be expressed as:

\[ Var(\hat{\beta}_j) = \frac{\sigma^2}{n} [\mathbf{R}^{-1}]_{jj} \tag{14.8} \]

It turns out that:

\[ [\mathbf{R}^{-1}]_{jj} = \frac{1}{1 - R_{j}^{2}} \tag{14.9} \]

which is known as the **Variance Inflation Factor** or VIF. If \(R_{j}^{2}\) is
close to 1, then VIF will be large, and so \(Var(\hat{\beta}_j)\) will also be large.

Let us write the eigenvalue decomposition of \(\mathbf{R}\) as:

\[ \mathbf{R} = \mathbf{V \boldsymbol{\Lambda} V^\mathsf{T}} \tag{14.10} \]

And then express \(\mathbf{R}^{-1}\) in terms of the above decomposition:

\[\begin{align} \mathbf{R}^{-1} &= (\mathbf{V \Lambda} \mathbf{V}^\mathsf{T})^{-1} \\ &= \mathbf{V} \mathbf{\Lambda}^{-1} \mathbf{V}^\mathsf{T} \\ &= \begin{bmatrix} \mid & \mid & & \mid \\ \mathbf{v_1} & \mathbf{v_2} & \dots & \mathbf{v_p} \\ \mid & \mid & & \mid \\ \end{bmatrix} \begin{pmatrix} \frac{1}{\lambda_1} & 0 & \dots & 0 \\ 0 & \frac{1}{\lambda_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac{1}{\lambda_p} \\ \end{pmatrix} \begin{bmatrix} - \mathbf{v^\mathsf{T}_1} - \\ - \mathbf{v^\mathsf{T}_2} - \\ \vdots \\ - \mathbf{v^\mathsf{T}_p} - \\ \end{bmatrix} \tag{14.11} \end{align}\]

In this way, we obtain a formula for \(V(\hat{\beta}_j)\) involving the eigenvectors and eigenvalues of \(\mathbf{R}\) (as well as \(\sigma^2 / n\)):

\[ V(\hat{\beta}_j) = \frac{1}{n} \left ( \frac{\sigma^2}{\lambda_1} v_{j1}^2 + \frac{\sigma^2}{\lambda_2} v_{j2}^2 + \dots + \frac{\sigma^2}{\lambda_p} v_{jp}^2 \right) = \frac{\sigma^2}{n} \sum_{\ell=1}^{p} \frac{1}{\lambda_\ell} v_{j \ell}^2 \tag{14.12} \]

where \(v_{j \ell}\) denotes the \(j\)-th entry of the \(\ell\) eigenvector.

\[ Var(\hat{\beta}_j) = \left ( \frac{\sigma^2}{n} \right ) \sum_{\ell=1}^{p} \frac{v^{2}_{j \ell}}{\lambda_\ell} \tag{14.13} \]

The block matrix diagram above is designed to help visualize how we got these \(v_{\ell j}^2\) terms. As you can tell, the variance of the estimators depends on the inverses of the eigenvalues of \(\mathbf{R}\).

This formula is useful in interpreting collinearity. If we have perfect collinearity, some of the eigenvalues of \((\mathbf{X}^\mathsf{T} \mathbf{X})^{-1}\) would be equal to 0. In the case of near-perfect collinearity, some eigenvalues will be very close to 0. That is, \(\frac{1}{\lambda_\ell}\) will be very large. Stated differently: if we have multicollinearity, our OLS coefficients will have large variance.

In summary:

- the standard errors of \(\hat{\beta}_j\) are inflated.
- the fit is unstable, and becomes very sensitive to small perturbations.
- small changes in \(Y\) can lead to large changes in the coefficients.

What would you do to overcome multicollinearity?

- Reduce number of predictors
- If \(p > n\), then try to get more observations (increase \(n\))
- Find an orthogonal basis for the predictors
- Impose constraints on the estimated coefficients
- A mix of some or all of the above?
**Other ideas?**

## 14.4 Connection to Regularization

So how all of the previous discussion connects with regularization?
Generally speaking, we know that *regularization* has to do with
“making things regular.” So, if you think about it, regularization implies
that something is irregular.

In the context of regression, what is it that is irregular? What are the things that we are interested in “making regular”?

Within linear regression, regularization means reducing the
variability—and therefore the size—of the vector of predicted coefficients.
What we have seen is that if there is multicollinearity, some of the elements
of \(\mathbf{b}_{ols}\) will have high variance. In other words, the norm
\(\| \mathbf{b}_{ols} \|\) will be very large. When this is the case,
we say that the coefficients are **irregular**. Consequently, regularization
involves finding ways to mitigate the effects of these irregular coefficients.

#### Regularization Metaphor

Here’s a metaphor that we like to use when talking about regularization and the issue with “irregular” coefficients. As you know, in OLS regression, we want to find parameters that give us a linear combination of our response variables.

Think of finding the parameters \(b_1, \dots, b_p\) as if you go “shopping” for parameters. We assume (initially) that we have a blank check, and can pick whatever parameters we can buy. No need to worry about about how much they cost. If we are not careful (if there’s high multicollinearity) we could end up spending quite a lot of money.

To have some sort of budgeting control, we use regularization. This is like imposing some sort of restriction on what coefficients we can buy. We could use penalized methods or we could also use dimension reduction methods.

In penalized methods, we explicitly impose a budget on how much money we can spend when buying parameters.

In dimension reduction, there are no explicit penalties, but we still impose a “cost” on the regressors that are highly collinear. We reduce the number of parameters that we can buy.

The approaches that we will study—in the next chapters—for achieving regularization are:

**Dimension Reduction**: Principal Components Regression (PCR), Partial Least Squares regression (PLSR)**Penalized Methods**: Ridge Regression, Lasso regression.