19 Beyond Linear Regression

In this part of the book we will talk about some extensions of the linear model, as well as alternative approaches to relax the assumption of linearity.

19.1 Introduction

So far, we have talked about regression models in their most basic form: the linear regression model. Let’s circle back to the general regression framework, in which the main idea is to model an output value \(y_i\) in terms of one or more input features gathered in a \(p\)-vector \(\mathbf{x_i}\). In addition, we suppose that there is a noise or error term \(\varepsilon_i\) that accounts for the imperfect nature of our model:

\[ y_i = f(\mathbf{x_i}) + \varepsilon_i \tag{19.1} \]

Usually, we assume that the error terms are independent of the input features, and that they have zero mean and constant variance:

\[ \mathbb{E}(\varepsilon_i) = 0; \qquad Var(\varepsilon_i) = \sigma^2 \tag{19.2} \]
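To make this data-generating model concrete, here is a minimal simulation sketch in Python. The particular choice of \(f()\) and the noise level are arbitrary assumptions made purely for illustration, not part of the framework itself.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def f(x):
    # a toy target function (any other choice of f() would work just as well)
    return np.sin(2 * x) + 0.5 * x

n = 100
x = rng.uniform(0, 5, size=n)                 # input feature
eps = rng.normal(loc=0, scale=0.3, size=n)    # zero-mean noise, constant variance
y = f(x) + eps                                # y_i = f(x_i) + eps_i, as in Eq. (19.1)
```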

To formalize things further, we assume that there is a joint distribution \(P(y, \mathbf{x})\), a marginal distribution for the input features \(P(\mathbf{x})\), and a conditional distribution \(P(y | \mathbf{x})\).

With all the previous assumptions, the theoretical essence of regression is to model \(f()\) as the conditional expectation of \(y|\mathbf{x}\), which is precisely the so-called regression function:

\[ \text{regression function} \longrightarrow \mathbb{E} (y|\mathbf{x}) = \widehat{f}(\mathbf{x}) \tag{19.3} \]

The regression function \(\widehat{f}()\) takes in one or more input features, in the form of a vector \(\mathbf{x}\), and returns a predicted output \(\hat{y}\).

One important thing to highlight is that the theoretical framework of regression says nothing about what the target function \(f()\) should or could look like. This is good news because we have a lot of freedom to decide on almost any form for \(f()\), and consequently, on any form for \(\widehat{f}()\). Again, the term function here is not meant in the strict mathematical sense. Instead, think of a function as a machine that takes input features \(\mathbf{x}\) and returns an output \(y\).

Up to now, we have been working with the most standard form for \(\widehat{f}()\), which is a linear model. The estimated regression is simply a linear combination of the \(p\) input features, possibly including a constant term \(b_0\):

\[\begin{align*} \hat{y}_i &= b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} \\ &= \mathbf{b}^\mathsf{T} \mathbf{x_i} \tag{19.4} \end{align*}\]

In vector-matrix notation, the vector of predictions is expressed as:

\[ \mathbf{\hat{y}} = \mathbf{Xb} \tag{19.5} \]

which we can graphically represent using a path diagram like the one shown below. Keep in mind that the terms \(\mathbf{x_0}, \dots, \mathbf{x_p}\) refer to predictor variables (assuming a constant term \(\mathbf{x_0} = \mathbf{1}\)):

Figure 19.1: Linear regression model in diagram form
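To see Equations (19.4) and (19.5) in action, here is a small sketch in Python; the design matrix and coefficient values below are made up for illustration. With a column of ones for the constant term, the vector of predictions is just the matrix-vector product \(\mathbf{Xb}\).

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# design matrix with a constant column x0 = 1 followed by p = 2 predictors
n, p = 5, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# a vector of coefficients b = (b0, b1, b2)
b = np.array([1.0, 2.0, -0.5])

# vector of predictions: y_hat = X b   (Eq. 19.5)
y_hat = X @ b

# in practice b is unknown and estimated from observed responses y,
# for instance by least squares: np.linalg.lstsq(X, y, rcond=None)
```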

A linear regression model like the one above can be very useful, not to mention its importance as the foundation for many other derived methods, and it should certainly be part of your machine learning toolbox. On its own, however, it is far too limiting. Therefore, we need to discuss some of the ways in which the standard linear regression model can be enriched and made more flexible.

19.2 Expanding the Regression Horizon

There are several ways in which we can enrich and extend a linear regression model. Our objective is to expose you to an assortment of methods, and give you a taste of some interesting notions that are employed in more sophisticated approaches. Having said that, we will describe just a few of these methods. Trying to cover all possible extensions would require us to write a whole separate book.

To make the discussion more organized, we have decided to classify the covered approaches into two major classes that we are calling: (1) parametric models, and (2) nonparametric models.

At this point, we have two magic words that deserve some clarification: linear and (non)parametric:

  • When people talk about linear regression models, what do they exactly mean by “linear”?

  • When people talk about parametric -vs- nonparametric models, what do they mean by this? What kind of parameters are they referring to?

19.2.1 Linearity

The standard linear regression model:

\[ f(\mathbf{x_i}) = b_0 + b_1 x_{i1} + \dots + b_p x_{ip} \tag{19.6} \]

is linear in two senses. On one hand, it is linear in the input variables \(X_1, \dots, X_p\). On the other hand, it is also linear in the parameters \(b_0, b_1, \dots, b_p\). As we said before, although a model like this is very friendly to work with, its double linearity may be fairly restrictive.

In the regression world, the most general type of linearity is the one that applies to the regression coefficients or parameters \(b_0, b_1, \dots, b_p\). Here’s an example of a model that is nonlinear with respect to the predictors, but it is linear with respect to the parameters:

\[ f(\mathbf{x_i}) = b_0 + b_1 x_{i1} + b_2 x_{i2} + b_3 x_{i1}^2 + b_4 x_{i2}^2 \tag{19.7} \]

In contrast, the model below is nonlinear in both the predictors and the parameters:

\[ f(\mathbf{x_i}) = b_0 + \exp(x_{i1}^{b_1}) + \sqrt{b_2 x_{i2}} \tag{19.8} \]

So, when using the term “linear model”, we typically refer to a model that is linear in its coefficients or parameters, not necessarily in the input features.
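To emphasize that linearity in the parameters is what matters computationally, here is a sketch in Python of fitting a model like Equation (19.7) by ordinary least squares. The simulated data and coefficient values are assumptions made purely for illustration; the point is that the squared predictors simply become extra columns of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# simulated data from a model that is quadratic in the predictors
n = 200
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(-2, 2, size=n)
y = 1 + 2*x1 - x2 + 0.5*x1**2 + 1.5*x2**2 + rng.normal(scale=0.2, size=n)

# the model is nonlinear in x1, x2 but linear in b0, ..., b4,
# so ordinary least squares still applies to the augmented design matrix
X = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```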

19.2.2 Parametric and Nonparametric

The word “parametric”, and its sibling “nonparametric”, are the kind of terms commonly used in various branches of statistics. For better or worse, they are also terms to which different people assign different meanings. Depending on who you talk to, some people will define “nonparametric” in the sense of distribution-free methods, the so-called nonparametric statistics. This is NOT the meaning of nonparametric that we use in this book.

So what do we mean by parametric and nonparametric?

By parametric, we refer to a functional form of \(f()\) that is fully described by a finite set of parameters, like in the standard linear model

\[ f(\mathbf{x_i}) = b_0 + b_1 x_{i1} + \dots + b_p x_{ip} \tag{19.9} \]

In this book, we use the notion of nonparametric models to mean a more relaxed way of specifying a function \(f()\), without directly imposing a known functional form. Somewhat paradoxically, nonparametric does not necessarily mean that a model has no parameters. As we’ll see, nonparametric models do have parameters (or hyperparameters).

A common example of a nonparametric method is \(k\)-Nearest-Neighbors (KNN). The functional form of \(f()\) is much more relaxed: all we do is average the response values \(y_i\) of the \(k\) points \(x_i\) closest to a query value \(x_0\):

\[ \hat{y}_0 = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x_0)} y_i \tag{19.10} \]

where the notation \(\mathcal{N}_k(x_0)\) indicates the set of \(k\) closest neighbors to \(x_0\).

Notice that KNN does not impose any functional form that specifies how to combine the \(X\)-input feature(s). In this sense, we say that this is a nonparametric model.
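As a concrete illustration, here is a minimal sketch in Python of KNN regression with a single predictor. The helper function `knn_predict()` and the simulated data are our own assumptions for illustration, not a reference implementation.

```python
import numpy as np

def knn_predict(x0, x, y, k=5):
    """Average the responses of the k training points closest to x0 (Eq. 19.10)."""
    dist = np.abs(x - x0)                 # distances from the query point
    neighbors = np.argsort(dist)[:k]      # indices of the k nearest neighbors
    return y[neighbors].mean()

# example usage with simulated data
rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 5, size=100)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=100)

y0_hat = knn_predict(x0=2.5, x=x, y=y, k=7)
```

Notice that nothing in `knn_predict()` specifies a functional form for how \(x\) enters the prediction; the only tuning quantity is the hyperparameter \(k\).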

Now that we have provided some clarifications on the meanings of terms such as “linear”, “parametric”, and “nonparametric”, we can continue with our introduction of ways in which we can extend linear models beyond linearity, as well as extensions of models in a more relaxed (nonparametric) sense.

19.3 Transforming Features

Let us consider parametric models first. A first option for making a linear model more sophisticated is to apply transformations to some or all of the input features. This is not a new idea. In fact, we have already used this strategy when we studied Principal Components Regression and Partial Least Squares Regression.

Consider a linear model that uses some type of dimension reduction approach. The general recipe is to obtain new variables by taking linear combinations of the input features, as illustrated in the following diagram.

Figure 19.2: PCs as linear combinations of input variables

We can think of any given component \(\mathbf{z_q}\) as a transformation applied to all the \(X\)-input variables, that is:

\[ \mathbf{z_q} = v_{1q} \mathbf{x_1} + \dots + v_{pq} \mathbf{x_p} \quad \longrightarrow \quad Z_q = \phi_q (X_1, \dots, X_p) \tag{19.11} \]

We can explicitly think of a function \(\phi_q : \mathbb{R}^p \rightarrow \mathbb{R}\) that transforms the inputs into a new synthetic feature. In the case of PCR and PLSR, the transformation functions \(\phi_q()\) happen to be linear functions.

Figure 19.3: Linear combinations of input variables

In PCR, with the matrix \(\mathbf{Z}\) containing the transformed features, we can still use ordinary least squares to obtain the predicted response as:

\[ \mathbf{\hat{y}} = \mathbf{Z} (\mathbf{Z}^\mathsf{T} \mathbf{Z})^{-1} \mathbf{Z}^\mathsf{T} \mathbf{y} \tag{19.12} \]

This first approach gives us the opportunity to enrich a linear model by transforming the \(p\) original input features into \(r\) new synthetic features.
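The following sketch in Python pulls these pieces together for PCR: the components are obtained as linear combinations of the (centered) inputs, as in Equation (19.11), and the predictions follow Equation (19.12). The simulated data and the choice of \(r = 2\) components are assumptions made just for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# simulated data: n observations of p correlated predictors
n, p, r = 100, 6, 2
X = rng.normal(size=(n, p))
X[:, 3:] = X[:, :3] + 0.1 * rng.normal(size=(n, 3))   # induce collinearity
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# principal components as linear combinations of the centered inputs (Eq. 19.11)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:r].T                       # scores on the first r components

# ordinary least squares on the component scores (Eq. 19.12);
# since Xc was centered, we center y as well and add its mean back
yc = y - y.mean()
y_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ yc) + y.mean()
```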

In the next chapter, we will describe a particular type of linear model (i.e., linear in the parameters) that uses what are called basis functions as the transformation functions \(\phi_q()\).