5 Linear Regression
Before entering supervised learning territory, we want to discuss the general framework of linear regression. We will introduce this topic from a pure model-fitting point of view. In other words, we will postpone the learning aspect (the prediction of new data) until after the chapter on the theory of learning.
The reason for covering linear regression in this way is to have something concrete to work with when we start talking about the theory of supervised learning.
5.1 Motivation
Consider, again, the NBA dataset example from previous chapters. Suppose we want to use these data to predict the salary of NBA players, in terms of certain variables such as the player’s team, height, weight, position, years of professional experience, number of 2-pointers, number of 3-pointers, number of blocks, etc. Of course, we need information on the salaries of some current NBA players:
Player | Height | Weight | Yrs Expr | 2 Points | 3 Points |
---|---|---|---|---|---|
1 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
2 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
3 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
… | … | … | … | … | … |
As usual, we use the symbol \(\mathbf{x_i}\) to denote the vector of measurements of player \(i\)’s statistics (e.g. height, weight, etc.); in turn, the salary of the \(i\)-th player is represented with \(y_i\).
Ideally, we assume the existence of some function \(f : \mathcal{X} \to \mathcal{Y}\) (i.e. a function that takes values from the \(\mathcal{X}\) space and maps them to a single value \(y\)). We will refer to this function as the ideal “formula” for salary. Here we are using the word “formula” in a very loose sense, not necessarily in its mathematical sense. We now seek a hypothesized model (which we call \(\widehat{f} : \mathcal{X} \to \mathcal{Y}\)), which we select from some set of candidate functions \(h_1, h_2, \dots, h_m\). Our task is to obtain \(\widehat{f}\) in a way that we can claim it is a good approximation of the (unknown) function \(f\).
5.2 The Idea/Intuition of Regression
Let’s go back to our example with NBA players. Recall that \(y_i\) denotes the salary for the \(i\)-th player. For simplicity’s sake let’s not worry about inflation. Say we now have a new prospective player from Europe; and we are tasked with predicting their salary denoted by \(y_0\). Let’s review a couple of scenarios to get a high-level intuition for this task.
Scenario 1. Suppose we have no information on this new player. How would we compute \(y_{0}\) (i.e. the salary of this new player)? One possibility is to guesstimate \(y_0\) using the historical average salary \(\bar{y}\) of NBA players. In other words, we would simply calculate: \(\hat{y}_0 = \bar{y}\). In this case we are using \(\bar{y}\), a typical value (i.e. a measure of center), as a plausible guesstimate for \(y_0\). We could also look at the median of the existing salaries, if we are concerned about outliers or a skewed distribution of salaries.
Scenario 2. Now, suppose we know that this new player will sign on to the LA Lakers. Compared to scenario 1, we now have a new bit of information since we know which team will hire this player. Therefore, we can use this fact to make a more educated guess for \(y_0\). How? Instead of using the salaries of all players, we can focus on the salaries of the Lakers’ players. We could then use \(\hat{y}_0 = \text{avg}(\text{Lakers' salaries})\): that is, the average salary of all Lakers players. It is reasonable to expect that this guesstimate is “closer” to the new player’s actual salary than the overall average salary of all NBA players would be.
Scenario 3. Similarly, if we know this new player’s years of experience (e.g. 6 years), we would look at the average of salaries corresponding to players with the same level of experience.
What do the three previous scenarios correspond to? In all of these examples, the prediction is basically a conditional mean:
\[ \hat{y}_0 = \text{ave}(y_i|x_i = x_0) \tag{5.1} \]
Of course, the previous strategy only makes sense when we have data points \(x_i\) that are equal to the query point \(x_0\). But what if none of the available \(x_i\) values are equal to \(x_0\)? We’ll talk about this later.
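To make this concrete, here is a minimal sketch of the conditional-mean idea in Python. The teams and salary figures are made up purely for illustration; they are not actual NBA data.

```python
import numpy as np

# Hypothetical roster data: team of each player and salary in millions.
teams = np.array(["Lakers", "Celtics", "Lakers", "Bulls", "Lakers", "Celtics"])
salaries = np.array([30.0, 12.0, 25.0, 8.0, 20.0, 15.0])

# Scenario 1: no information about the new player -> overall average salary.
y0_hat_overall = salaries.mean()

# Scenario 2: we know the new player joins the Lakers -> conditional average.
y0_hat_lakers = salaries[teams == "Lakers"].mean()

print(y0_hat_overall)  # 18.33...
print(y0_hat_lakers)   # 25.0
```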
The previous hypothetical scenarios illustrate the core idea of regression: we obtain predictions \(\hat{y}_0\) using quantities of the form \(\text{ave}(y_i|x_i = x_0)\) which can be formalized—under some assumptions—into the notion of conditional expectations of the form:
\[ \mathbb{E}(y_i \mid x_{i1}^{*} , x_{i2}^{*}, \dots, x_{ip}^{*}) \longrightarrow \hat{y} \tag{5.2} \]
where \(x_{ij}^{*}\) represents the \(i\)-th measurement of the \(j\)-th variable. The above equation is what we call the regression function; note that the regression function is nothing more than a conditional expectation!
5.3 The Linear Regression Model
In a regression model we use one or more features \(X\) to say something about the response \(Y\). In turn, a linear regression model tells us to combine our features in a linear way in order to approximate the response. In the univariate case, we have a linear equation:
\[ \hat{Y} = b_0 + b_1 X \tag{5.3} \]
In pointwise format, that is for a given individual \(i\), we have:
\[ \hat{y}_i = b_0 + b_1 x_i \tag{5.4} \]
In vector notation:
\[ \mathbf{\hat{y}} = b_0 + b_1 \mathbf{x} \tag{5.5} \]
To simplify notation, sometimes we prefer to add an auxiliary constant feature in the form of a vector of 1’s with \(n\) elements, and then use matrix notation with the following elements:
\[ \mathbf{X} = \ \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \\ \end{bmatrix}, \qquad \mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \\ \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_{0} \\ b_{1} \\ \end{bmatrix} \]
In the multidimensional case when we have \(p>1\) predictors:
\[ \mathbf{X} = \ \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \\ \end{bmatrix}, \qquad \mathbf{\hat{y}} = \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \\ \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} b_{0} \\ b_{1} \\ \vdots \\ b_{p} \\ \end{bmatrix} \]
With the matrix of features, the response, and the coefficients we have a compact expression for the predicted outcomes:
\[ \mathbf{\hat{y}} = \mathbf{Xb} \tag{5.6} \]
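As a quick illustration, the sketch below builds the augmented matrix \(\mathbf{X}\) (with the column of 1’s) and computes \(\mathbf{\hat{y}} = \mathbf{Xb}\) for arbitrary toy values of \(\mathbf{x}\) and \(\mathbf{b}\):

```python
import numpy as np

# Toy values, not taken from the NBA example.
x = np.array([1.0, 2.0, 3.0, 4.0])          # one predictor, n = 4
X = np.column_stack([np.ones_like(x), x])   # prepend the constant column of 1's
b = np.array([0.5, 2.0])                    # b0 = 0.5, b1 = 2.0

y_hat = X @ b                               # predicted values y_hat = X b
print(y_hat)                                # [2.5 4.5 6.5 8.5]
```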
In path diagram form, the linear model looks like this:
If we assume that the predictors and the response are mean-centered, then we don’t have to worry about the constant term \(\mathbf{x_0}\):
Obviously the question becomes: how do we obtain the vector of coefficients \(\mathbf{b}\)?
5.4 The Error Measure
We would like \(\hat{y}_i\) to be “as close as possible” to \(y_i\). This requires coming up with some type of measure of closeness. Among the various functions that we can use to measure how close \(\hat{y}_i\) and \(y_i\) are, the most common option is the squared distance between these values:
\[ d^2(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 = (\hat{y}_i - y_i)^2 \tag{5.7} \]
Replacing \(\hat{y}_i\) with \(\mathbf{b^\mathsf{T}\vec{x}_i}\) we have:
\[ d^2(y_i, \hat{y}_i) = (\mathbf{b^\mathsf{T}\vec{x}_i} - y_i)^2 \tag{5.8} \]
Notice that \(d^2(y_i, \hat{y}_i)\) is a pointwise error measure that we can generally denote as \(\text{err}_i\). But we also need to define a global measure of error. This is typically done by adding all the pointwise error measures \(\text{err}_{i}\).
There are two flavors of overall error measures based on squared pointwise differences:
- the sum of squared errors or \(\text{SSE}\), and
- the mean squared error or \(\text{MSE}\).
The sum of squared errors, \(\text{SSE}\), is defined as:
\[ \text{SSE} = \sum_{i=1}^{n} \text{err}_i \tag{5.9} \]
The mean squared error, \(\text{MSE}\), is defined as:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \text{err}_i \tag{5.10} \]
As you can tell, \(\text{SSE} = n \, \text{MSE}\) and, vice versa, \(\text{MSE} = \text{SSE} / n\).
Throughout this book, unless mentioned otherwise, when dealing with regression problems, we will consider the \(\text{MSE}\) as the default overall error function to be minimized (you could also take \(\text{SSE}\) instead). Let \(e_i\) be:
\[ e_i = (\hat{y}_i - y_i) \quad \to \quad e_i^2 = (\hat{y}_i - y_i)^2 = \text{err}_i \tag{5.11} \]
Doing some algebra, it’s easy to see that:
\[\begin{align} \text{MSE} &= \frac{1}{n} \sum_{i=1}^{n} e_{i}^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} (\mathbf{b^\mathsf{T}\vec{x}_i} - y_i)^2 \\ &= \frac{1}{n} (\mathbf{Xb} - \mathbf{y})^\mathsf{T} (\mathbf{Xb} - \mathbf{y}) \\ &= \frac{1}{n} \| \mathbf{Xb} - \mathbf{y} \|^2 \\ &= \frac{1}{n} \| \mathbf{\hat{y}} - \mathbf{y} \|^2 \tag{5.12} \end{align}\]
As you can tell, the Mean Squared Error \(\text{MSE}\) is proportional to the squared norm of the residual vector \(\mathbf{e} = \mathbf{\hat{y}} - \mathbf{y}\),
\[ \text{MSE} = \frac{1}{n} \| \mathbf{e} \|^2 = \frac{1}{n} (\mathbf{\hat{y}} - \mathbf{y})^\mathsf{T} (\mathbf{\hat{y}} - \mathbf{y}) \tag{5.13} \]
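The equivalence between the elementwise and the vector forms of the \(\text{MSE}\) is easy to verify numerically; here is a small sketch with arbitrary toy vectors:

```python
import numpy as np

# Arbitrary toy values for the observed and predicted responses.
y     = np.array([3.0, -1.0, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 6.0])

e = y_hat - y
mse_sum  = np.mean((y_hat - y) ** 2)   # (1/n) * sum of squared pointwise errors
mse_norm = (e @ e) / len(y)            # (1/n) * ||e||^2

print(mse_sum, mse_norm)               # both 0.5625
```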
5.5 The Least Squares Algorithm
In (ordinary) least squares regression, we want to minimize the mean of squared errors (\(\text{MSE}\)). This minimization problem involves computing the derivative of \(\text{MSE}\) with respect to \(\mathbf{b}\). In other words, we compute the gradient of \(\text{MSE}\), denoted \(\nabla \text{MSE}(\mathbf{b})\), which is the vector of partial derivatives of \(\text{MSE}\) with respect to each parameter \(b_0, b_1, \dots, b_p\):
\[\begin{align*} \nabla \text{MSE}(\mathbf{b}) &= \frac{\partial }{\partial \mathbf{b}}\text{MSE}(\mathbf{b}) \\ &= \frac{\partial }{\partial \mathbf{b}} \left (\frac{1}{n} \mathbf{b^\mathsf{T}X^\mathsf{T}Xb} - \frac{2}{n} \mathbf{b^\mathsf{T}X^\mathsf{T}y} + \frac{1}{n} \mathbf{y^\mathsf{T}y} \right) \tag{5.14} \end{align*}\]
which becomes:
\[ \nabla \text{MSE}(\mathbf{b}) = \frac{2}{n} \mathbf{X^\mathsf{T}Xb} - \frac{2}{n} \mathbf{X^\mathsf{T}y} \tag{5.15} \]
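Before setting this gradient to zero, we can sanity-check formula (5.15) numerically by comparing it against finite differences of the \(\text{MSE}\). The sketch below uses random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant column + p predictors
y = rng.normal(size=n)
b = rng.normal(size=p + 1)

def mse(b):
    r = X @ b - y
    return (r @ r) / n

# analytic gradient: (2/n) X^T (Xb - y)
grad_analytic = (2 / n) * X.T @ (X @ b - y)

# central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (mse(b + eps * np.eye(p + 1)[j]) - mse(b - eps * np.eye(p + 1)[j])) / (2 * eps)
    for j in range(p + 1)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```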
Setting the gradient equal to zero, we get:
\[ \mathbf{X^\mathsf{T}Xb} = \mathbf{X^\mathsf{T}y} \quad (\text{normal equations}) \tag{5.16} \]
The above equation defines a system of equations that most authors refer to as the Normal equations. It is a system of \(p+1\) equations with \(p+1\) unknowns (assuming we have a constant term \(b_0\)).
If the cross-product matrix \(\mathbf{X^\mathsf{T}X}\) is invertible, which is not a minor assumption, then the vector of regression coefficients \(\mathbf{b}\) that we are looking for is given by:
\[ \mathbf{b} = (\mathbf{X^\mathsf{T}X})^{-1} \mathbf{X^\mathsf{T}y} \tag{5.17} \]
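In code, this amounts to solving the normal equations. The sketch below simulates toy data from a known line and recovers the coefficients; it uses `np.linalg.solve` rather than forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

# Simulated toy data: a "true" line y = 3 + 1.5 x plus Gaussian noise.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)   # solves X^T X b = X^T y

print(b)   # approximately [3.0, 1.5]
```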
Having obtained \(\mathbf{b}\), we can easily compute the vector of predicted responses:
\[\begin{align*} \mathbf{\hat{y}} &= \mathbf{Xb} \\ &= \mathbf{X} (\mathbf{X^\mathsf{T}X})^{-1} \mathbf{X^\mathsf{T} y} \tag{5.18} \end{align*}\]
If we denote \(\mathbf{H} = \mathbf{X} (\mathbf{X^\mathsf{T}X})^{-1} \mathbf{X^\mathsf{T}}\), then the predicted response is:
\[ \mathbf{\hat{y}} = \mathbf{Hy} \tag{5.19} \]
This matrix \(\mathbf{H}\) is better known as the hat matrix, because it puts the hat on the response. More importantly, the matrix \(\mathbf{H}\) is an orthogonal projector.
From linear algebra, orthogonal projectors have very interesting properties (checked numerically in the sketch below):
- they are symmetric: \(\mathbf{H}^\mathsf{T} = \mathbf{H}\)
- they are idempotent: \(\mathbf{H}\mathbf{H} = \mathbf{H}\)
- their eigenvalues are either 0 or 1
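These three properties are easy to verify numerically; here is a small sketch using a random toy \(\mathbf{X}\) with a constant column and two predictors:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])  # n = 10, p = 2
H = X @ np.linalg.inv(X.T @ X) @ X.T                          # the hat matrix

print(np.allclose(H, H.T))      # symmetric
print(np.allclose(H @ H, H))    # idempotent
eigvals = np.linalg.eigvalsh(H) # eigenvalues in ascending order
print(np.allclose(eigvals, np.r_[np.zeros(7), np.ones(3)]))   # seven 0's, three 1's
```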
5.6 Geometries of OLS
Now that we’ve seen the algebra, it’s time to look at the geometric interpretation of all the action that is going on within linear regression via OLS. We will discuss three geometric perspectives:
- OLS from the individuals point of view (i.e. rows of the data matrix).
- OLS from the variables point of view (i.e. columns of the data matrix).
- OLS from the parameters point of view, and the error surface.
5.6.1 Rows Perspective
This is probably the most popular perspective covered in most textbooks.
For illustration purposes let’s assume that our data has just \(p=1\) predictor. In other words, we have the response \(Y\) and one predictor \(X\). We can depict individuals as points in this space:
In linear regression, we want to predict \(y_i\) by linearly mixing the inputs \(\hat{y}_{i} = b_0 + b_1 x_i\). In two dimensions, the fitted model corresponds to a line. In three dimensions it would correspond to a plane. And in higher dimensions this would correspond to a hyperplane.
With a fitted line, we obtain predicted values \(\hat{y}_i\). Some predicted values may be equal to the observed values. Other predicted values will be greater than the observed values. And some predicted values will be smaller than the observed values.
As you can imagine, given a set of data points, you can (in general) fit an infinite number of lines. So which line are we looking for? We want the line that minimizes the sum of the squared errors \(e_i = \hat{y}_{i} - y_{i}\). In the figure below, these errors (also known as *residuals*) are represented by the vertical difference between the observed values \(y_i\) and the predicted values \(\hat{y}_i\).
Combining all residuals, we want to obtain parameters \(b_0, \dots, b_p\) that minimize the squared norm of the vector of residuals:
\[ \sum_{i=1}^{n} e_{i}^{2} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (b_0 + b_1 x_i - y_i)^2 \tag{5.20} \]
In vector-matrix form we have:
\[\begin{align*} \| \mathbf{e} \|^2 &= \| \mathbf{\hat{y}} - \mathbf{y} \|^2 \\ &= \| \mathbf{Xb} - \mathbf{y} \|^2 \\ &= (\mathbf{Xb} - \mathbf{y})^\mathsf{T} (\mathbf{Xb} - \mathbf{y}) \\ &\propto \text{MSE} \tag{5.21} \end{align*}\]
As you can tell, minimizing the squared norm of the vector of residuals is equivalent to minimizing the mean squared error.
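To see the “best line” idea in action, the sketch below compares the sum of squared residuals of the OLS line against two other candidate lines on simulated toy data; by construction, the OLS line attains the smallest value:

```python
import numpy as np

# Simulated toy data from a "true" line y = 2 + 0.8 x plus noise.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=50)

def ssr(b0, b1):
    """Sum of squared residuals for the candidate line b0 + b1 * x."""
    return np.sum((b0 + b1 * x - y) ** 2)

X = np.column_stack([np.ones_like(x), x])
b0_ols, b1_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(ssr(b0_ols, b1_ols))   # the OLS line: smallest of the three
print(ssr(2.0, 0.8))         # the "true" line: close, but (almost surely) a bit larger
print(ssr(0.0, 1.0))         # an arbitrary line: noticeably worse
```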
5.6.2 Columns Perspective
We can also look at the geometry of OLS from the columns perspective. This is less common than the rows perspective, but still very enlightening.
Imagine the variables, both the response and the predictors, as vectors in an \(n\)-dimensional space. In this space, the \(X\) variables span some subspace \(\mathbb{S}_{X}\). In general, this subspace will not contain the response, unless \(Y\) happens to be an exact linear combination of \(X_1, \dots, X_p\).
What are we looking for? We’re looking for a linear combination \(\mathbf{Xb}\) that gives us a good approximation to \(\mathbf{y}\). As you can tell, there’s an infinite number of linear combinations that can be formed with \(X_1, \dots, X_p\).
The mix of features that we are interested in, \(\mathbf{\hat{y}} = \mathbf{Xb}\), is the one that gives us the closest approximation to \(\mathbf{y}\).
Now, what do we mean by closest approximation? How do we determine the closeness between \(\mathbf{\hat{y}}\) and \(\mathbf{y}\)? By looking at the difference, which results in a vector \(\mathbf{e} = \mathbf{\hat{y}} - \mathbf{y}\). And then measuring the size or norm of this vector. Well, the squared norm to be precise. In other words, we want to obtain \(\mathbf{\hat{y}}\) such that the squared norm \(\| \mathbf{e} \|^2\) is as small as possible.
\[ \text{Minimize} \quad \| \mathbf{e} \|^2 = \| \mathbf{\hat{y}} - \mathbf{y} \|^2 \tag{5.22} \]
Minimizing the squared norm of \(\mathbf{e}\) is equivalent to minimizing the mean squared error. Geometrically, the optimal \(\mathbf{\hat{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto the subspace \(\mathbb{S}_{X}\), which is precisely what the hat matrix \(\mathbf{H}\) computes.
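A small numerical sketch of this projection view, on random toy data: after a least-squares fit, the residual vector is orthogonal to every column of \(\mathbf{X}\), i.e. to the whole subspace \(\mathbb{S}_{X}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant + 2 predictors
y = rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
y_hat = X @ b
e = y_hat - y

print(np.allclose(X.T @ e, 0))   # residual is orthogonal to the columns of X
```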
5.6.3 Parameters Perspective
In addition to the two previously discussed perspectives (rows and columns), we could also visualize the regression problem from the point of view of the parameters \(\mathbf{b}\) and the error surface defined by the \(\text{MSE}\). This is the least common perspective in the general linear regression literature. However, it is not that uncommon within the Statistical Learning literature.
For illustration purposes, assume that we have only two predictors \(X_1\) and \(X_2\). Recall that the Mean Squared Error (MSE) is:
\[ E(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{n} \left( \mathbf{b^\mathsf{T} X^\mathsf{T}} \mathbf{X b} - 2 \mathbf{b^\mathsf{T} X^\mathsf{T} y} + \mathbf{y^\mathsf{T}y} \right) \tag{5.23} \]
Now, from the point of view of \(\mathbf{b} = (b_1, b_2)\), we can classify the order of each term:
\[ E(\mathbf{y}, \mathbf{\hat{y}}) = \frac{1}{n} ( \underbrace{\mathbf{b^\mathsf{T} X^\mathsf{T}} \mathbf{X b}}_{\text{Quadratic Form}} - \underbrace{ 2 \mathbf{b^\mathsf{T} X^\mathsf{T} y}}_{\text{Linear}} + \underbrace{ \mathbf{y^\mathsf{T}y}}_{\text{Constant}} ) \tag{5.24} \]
Since \(\mathbf{X^\mathsf{T} X}\) is positive semidefinite, we know that \(\mathbf{b^\mathsf{T} X^\mathsf{T} Xb} \geq 0\). Furthermore, from vector calculus we know that the error surface is a convex, bowl-shaped paraboloid in the \((E, b_1, b_2)\) space (strictly convex when \(\mathbf{X^\mathsf{T} X}\) is positive definite). The following diagram depicts this situation.
Imagine that we get horizontal slices of the error surface. For any of those slices, we can project them onto the plane spanned by the parameters \(b_1\) and \(b_2\). The resulting projections will be like a topographic map, with error contours on this plane. In general, those contours will be ellipses.
Error surfaces like this have a minimum value: when \(\mathbf{X^\mathsf{T} X}\) is positive definite (i.e. invertible), we are guaranteed the existence of a unique \(\mathbf{b}^* = (b_1^{*}, b_2^{*})\) such that the error \(E(\mathbf{y}, \mathbf{\hat{y}})\) is minimized. This is a powerful result! Consider, for example, a parabolic cylinder. Such a shape has no unique minimum; rather, it has an infinite number of points (all lying on a line) that minimize the function. The point being: this degenerate case can only arise when \(\mathbf{X^\mathsf{T} X}\) is positive semidefinite but singular; with a positive definite \(\mathbf{X^\mathsf{T} X}\), we never run into it.
The minimum of the error surface occurs at the point \((b_{1}^{*}, b_{2}^{*})\). This is precisely the OLS solution.
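To close this perspective, here is a sketch that brute-forces the error surface on a grid of \((b_1, b_2)\) values (for mean-centered toy data, so there is no intercept) and checks that the grid minimum sits next to the OLS solution:

```python
import numpy as np

# Toy data with two predictors and "true" coefficients (1, -2), then mean-centered.
rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Evaluate the MSE over a grid of (b1, b2) values: the error surface.
b1_grid, b2_grid = np.meshgrid(np.linspace(-4, 4, 201), np.linspace(-4, 4, 201))
mse = np.array([
    np.mean((X @ np.array([b1, b2]) - y) ** 2)
    for b1, b2 in zip(b1_grid.ravel(), b2_grid.ravel())
]).reshape(b1_grid.shape)

i, j = np.unravel_index(mse.argmin(), mse.shape)
print(b_ols)                         # roughly [1.0, -2.0]
print(b1_grid[i, j], b2_grid[i, j])  # grid minimum close to b_ols (within grid spacing)
```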