# 8 Theoretical Framework

Finally, we have arrived at the part of the book in which we provide a
framework for the theory of learning. Well, to be more precise, the framework
is really about the theory of *supervised* learning. The purpose of this chapter
is to give you a *mental map* of the conceptual elements that are present
in a supervised learning problem.

Keep in mind that most of what is covered in this chapter is highly theoretical. It has to do with the concepts and principles that, ideally, we expect to find in a prediction task (e.g. regression, classification). Having said that, in the upcoming chapters we will also discuss what to do in practice in order to handle most of these theoretical elements.

## 8.1 Mental Map

So far, we have seen an example of Unsupervised Learning (PCA), as well as one method of Supervised Learning (linear regression). Now, we begin discussing learning ideas at an abstract level.

Let’s return to our example of predicting NBA players’ salaries. Suppose we have data on NBA players: each player’s height, weight, years of professional experience, number of 2-pointers scored, number of 3-pointers scored, etc. Assume also that we have data about the players’ salaries.

Player | Height | Weight | Yrs Expr | 2-Pts | 3-Pts
---|---|---|---|---|---
1 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\)
2 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\)
… | … | … | … | … | …
n | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\)

We are interested in predicting the salary of players, which is the **output**
variable, denoted as \(Y\). The rest of the variables (e.g. height, weight,
experience, 2PTS, 3PTS) are the **inputs** denoted as \(X_1, \dots, X_p\).

Moreover, we assume the existence of a **target function** \(f()\):

\[ \textsf{target function} \qquad f : \mathcal{X} \to \mathcal{Y} \tag{8.1} \]

which is a function mapping from the input space \(\mathcal{X}\) to the output
space \(\mathcal{Y}\). Keep in mind that this function is an *ideal*, and remains
unknown throughout our computations. Here’s a metaphor that we like to use.
Pretend that the target function is some sort of mythical creature, like a
unicorn (or your preferred creature). We are trying to find this elusive guy.

More generally, we have a **dataset** \(\mathcal{D}\):

\[ \textsf{dataset} \qquad \mathcal{D} = \{ (\mathbf{x_1}, y_1), (\mathbf{x_2}, y_2), \dots, (\mathbf{x_n}, y_n) \} \tag{8.2} \]

where \(\mathbf{x_i}\) represents the vector of features for the \(i\)-th player, and \(y_i\) represents his salary.
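To make equation (8.2) concrete, here is a minimal sketch in Python of a dataset \(\mathcal{D}\) stored as pairs \((\mathbf{x_i}, y_i)\). All the numbers are made up for illustration; they are not real NBA data.

```python
import numpy as np

# Hypothetical feature vectors x_i:
# each row is (height cm, weight kg, years of experience, 2-pts, 3-pts)
X = np.array([
    [198.0,  98.0, 5, 410, 120],
    [185.0,  84.0, 2, 250,  95],
    [211.0, 112.0, 9, 520,  15],
])
# corresponding salaries y_i (in millions, also made up)
y = np.array([8.5, 2.1, 14.0])

# the dataset D as a list of (x_i, y_i) pairs
D = list(zip(X, y))
```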

From this data, we wish to obtain a fitted model, formally known as
a **hypothesis model** \(\widehat{f}()\):

\[ \textsf{hypothesis model} \qquad \widehat{f}: \mathcal{X} \to \mathcal{Y} \tag{8.3} \]

and then use \(\widehat{f}\) to approximate the unknown target function \(f\).

In order to find \(\widehat{f}()\), we typically consider a set of candidate
models, also known as a *hypothesis set*,
\(\mathcal{H} = \{ h_1, h_2, \dots, h_m \}\). The selected hypothesis \(h^*\)
will be the one used as our *final* model \(\widehat{f}\).
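As an illustration, assuming a toy one-dimensional dataset (our own invention, not from the NBA example), a hypothesis set \(\mathcal{H}\) could be the family of polynomial models of degrees 1 through 4. Fitting each candidate and comparing errors is a crude sketch of how a candidate \(h^*\) might get selected:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = 2.0 * x + 0.1 * rng.normal(size=30)   # toy data with a linear trend

# hypothesis set H: polynomial models of degrees 1 to 4
H = [1, 2, 3, 4]

def in_sample_mse(degree):
    coefs = np.polyfit(x, y, degree)      # fit candidate h_degree by least squares
    y_hat = np.polyval(coefs, x)
    return np.mean((y_hat - y) ** 2)

errors = {degree: in_sample_mse(degree) for degree in H}
h_star = min(errors, key=errors.get)      # candidate with the smallest error
```

Note that selecting by in-sample error alone tends to favor the most flexible candidate; as this chapter discusses later, fitting the observed data well is necessary but not sufficient for a good model.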

We can sketch these basic ideas in a sort of mental map; we will refer to the following picture as the “diagram for supervised learning”.

Here’s how to read this diagram. We use orange clouds around those concepts
that are more intangible than those appearing within rectangular or oval shapes.
One of these orange clouds is the target function \(f\), which, as its name
indicates, is **unknown**. This implies that we never really “discover” the
target function \(f\) in its entirety; rather, we just find a good enough
approximation to it by estimating \(\widehat{f}\). Now, as you can tell from the
mental map, the other orange cloud has to do with precisely this idea of
*good approximation*: \(\widehat{f} \approx f\). It is also theoretical because of
what we just said: we don’t know \(f\). We should let you know that as we
modify our diagram, we will encounter more of these orange concepts of a highly
theoretical nature.

Now let’s turn our attention to the blue rectangular elements. One of them
is the data set \(\mathcal{D}\), which is influenced by the unknown target function.
The other blue rectangle has to do with a set of candidate models \(\mathcal{H}\),
which is sometimes referred to as the *hypothesis set*. Both the data set and
the set of models are tangible ingredients. Moreover, the set of candidate
models is totally under our control: we get to decide what types of models
we want to try out (e.g. linear models, polynomial models, non-parametric models).

Then we go to the yellow oval shape which right now is simply labeled as the
*learning algorithm* \(\mathcal{A}\). This corresponds to the set of instructions
and steps to be carried out when learning from data. It is also the stage of
the diagram in which most computations take place.

Finally, we arrive at the yellow rectangle containing the final model \(\widehat{f}\). This is the model selected by the learning algorithm from the set of hypothesis models. Ideally, this model is the one that provides a good approximation to the target function \(f\).

Going back to the holy grail of supervised learning, our goal is to find a model \(\widehat{f}\) that gives “good” or accurate predictions. Before discussing what exactly we mean by “accurate”, we first need to talk about predictions.

## 8.2 Kinds of Predictions

What is the ultimate goal in supervised learning? Quick answer: we want a
“good” model. What does a “good” model mean? Simply put, it means that we want to
estimate an unknown model \(f\) with some model \(\widehat{f}\) that gives “good”
predictions. What do we mean by “good” predictions? Loosely speaking, it means
that we want to obtain “accurate” predictions. Before clarifying the notion of
*accurate predictions*, let’s first discuss the concept of **predictions**.

Think of a simple linear regression model (e.g. with one predictor). Having a fitted model \(\widehat{f}(x)\), we can use it to make two types of predictions. On one hand, for an observed point \(x_i\), we can compute \(\hat{y}_i = \widehat{f}(x_i)\). By observed point we mean that \(x_i\) was part of the data used to find \(\widehat{f}\). On the other hand, we can also compute \(\hat{y}_0 = \widehat{f}(x_0)\) for a point \(x_0\) that was not part of the data used when deriving \(\widehat{f}\).
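The two types of predictions can be sketched with a small simulated example (our own toy data, with an assumed linear-plus-noise structure), using an ordinary least squares line fitted with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = rng.uniform(0, 10, 20)                  # observed points x_i
y_train = 3.0 * x_train + rng.normal(size=20)     # simulated noisy responses

# fit a simple linear regression by least squares
slope, intercept = np.polyfit(x_train, y_train, 1)
f_hat = lambda x: slope * x + intercept

y_hat_i = f_hat(x_train[0])   # prediction at an observed point x_i
y_hat_0 = f_hat(7.3)          # prediction at a new point x_0, not used in fitting
```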

### 8.2.1 Two Types of Data

The two distinct types of predictions involve two slightly different kinds of data. The data points \(x_i\) that we use to fit a model are what we call training or learning data. The data points \(x_0\) that we use to assess the performance of a model are NOT supposed to be part of the training set.

This implies that, at least in theory, we need two kinds of data sets:

- **In-sample data**, denoted \(\mathcal{D}_{in}\), used to fit a model
- **Out-of-sample data**, denoted \(\mathcal{D}_{out}\), used to measure the predictive quality of a model

### 8.2.2 Two Types of Predictions

Given the two kinds of data points, we have two types of predictions:

- predictions \(\hat{y}_i\) of observed/seen values \(x_i\)
- predictions \(\hat{y}_0\) of unobserved/unseen values \(x_0\)

Each type of prediction is associated with a certain behavioral feature of a model. The predictions of observed data, \(\hat{y}_i\), have to do with the memorizing aspect (apparent error, resubstitution error). The predictions of unobserved data, \(\hat{y}_0\), have to do with the generalization aspect (generalization error, prediction error).

Both kinds of predictions are important, and each of them is interesting in its own right. However, from the supervised learning standpoint, it is the second type of predictions that we are ultimately interested in. That is, we want to find models that are able to give predictions \(\hat{y}_0\) as accurate as possible for the real value \(y_0\).

Don’t get us wrong. Having good predictions \(\hat{y}_i\) of observed values is important and desirable. And to a large extent, it is a necessary condition for a good model. However, it is not a sufficient condition. It is not enough to fit the observed data well, in order to get a good predictive model. Sometimes, you can perfectly fit the observed data, but have a terrible performance for unobserved values \(x_0\).

## 8.3 Two Types of Errors

In theory, we are dealing with two types of predictions, each of which is associated with a certain type of data points.

Because we are interested in obtaining models that give accurate predictions, we need a way to measure the accuracy of such predictions. At the conceptual level we need some mechanism to quantify how different the fitted model is from the target function \(f\):

\[ \widehat{f} \text{ -vs- } f \]

It would be nice to have some measure of how much discrepancy exists between
the estimated model and the target model. This means that we need a function
that summarizes, somehow, the total amount of error. We will denote such a term
as an *Overall Measure of Error*:

\[ \text{Overall Measure of Error:} \quad E(\widehat{f},f) \tag{8.4} \]

The typical way in which an overall measure of error is defined is in terms of individual or pointwise errors \(err_i(\hat{y}_i, y_i)\) that quantify the difference between an observed value \(y_i\) and its predicted value \(\hat{y}_i\). As a matter of fact, most overall errors focus on the addition of the pointwise errors:

\[ E(\widehat{f},f) = \text{measure} \left( \sum err_i(\hat{y}_i, y_i) \right ) \tag{8.5} \]

Unless otherwise said, in this book we will use the mean sum of errors as the default overall error measure:

\[ E(\widehat{f},f) = \frac{1}{n} \left( \sum_i err_i (\hat{y}_i, y_i) \right) \tag{8.6} \]

### 8.3.1 Individual Errors

What form does the individual error function \(err()\) take? In theory, it can take any form you want. This means that you can invent your own individual error function. However, the most common ones are:

- **squared error**: \(\quad err(\hat{y}_i, y_i) = \left( \hat{y}_i - y_i \right)^2\)
- **absolute error**: \(\quad err(\hat{y}_i, y_i) = \left| \hat{y}_i - y_i \right|\)
- **misclassification error**: \(\quad err(\hat{y}_i, y_i) = [\![ \hat{y}_i \neq y_i ]\!]\)

In the machine learning literature, these individual errors are formally
known as **loss functions**.
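These three loss functions can be sketched directly in Python, assuming predictions and observed values given as numbers or NumPy arrays:

```python
import numpy as np

def squared_error(y_hat, y):
    # (y_hat - y)^2
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    # |y_hat - y|
    return np.abs(y_hat - y)

def misclassification_error(y_hat, y):
    # Iverson bracket [[y_hat != y]]: 1 when the prediction misses, 0 otherwise
    return (np.asarray(y_hat) != np.asarray(y)).astype(int)
```

For instance, `squared_error(2.5, 2.0)` gives 0.25, while `misclassification_error(["a"], ["b"])` gives `[1]`.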

### 8.3.2 Overall Errors

As you can imagine, there are actually two types of overall error measures, based on the type of data that is used to assess the individual errors:

- **In-sample Error**, denoted \(E_{in}\)
- **Out-of-sample Error**, denoted \(E_{out}\)

The in-sample error is the average of pointwise errors from data points of the in-sample data \(\mathcal{D}_{in}\):

\[ E_{in} (\widehat{f}, f) = \frac{1}{n} \sum_{i=1}^{n} err_i (\hat{y}_i, y_i) \tag{8.7} \]

The out-of-sample error is the theoretical mean, or expected value, of the pointwise errors over the entire input space:

\[ E_{out} (\widehat{f}, f) = \mathbb{E}_{\mathcal{X}} \left[ err \left( \widehat{f}(x), f(x) \right) \right] \tag{8.8} \]

The point \(x\) denotes a general data point in the input space \(\mathcal{X}\). And, as we said, the expectation is taken over the entire input space \(\mathcal{X}\), which means that \(E_{out}\) is a highly theoretical quantity. In practice, you will never, ever, be able to compute it exactly.
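One situation where \(E_{out}\) can be approximated is a simulation, where we know \(f\) and the input distribution \(P\) by construction. The sketch below (our own toy setup, with a sine target and a cubic fit chosen purely for illustration) approximates the expectation in (8.8) by Monte Carlo over a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)     # target, known only because we simulate

# a small in-sample data set, then a cubic polynomial fit f_hat
x_in = rng.uniform(0, 1, 10)
y_in = f(x_in)
coefs = np.polyfit(x_in, y_in, 3)
f_hat = lambda x: np.polyval(coefs, x)

# in-sample error: average of pointwise squared errors over D_in
E_in = np.mean((f_hat(x_in) - y_in) ** 2)

# Monte Carlo approximation of E_out: average over many fresh draws from P
x_fresh = rng.uniform(0, 1, 100_000)
E_out_mc = np.mean((f_hat(x_fresh) - f(x_fresh)) ** 2)
```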

In the machine learning literature, these overall measures of error tend to
be formally known as **cost functions** or **risks**.

Let’s update our supervised learning diagram to include error measures (see figure below). We add a new box (in blue) involving an overall error measure as well as some pointwise error function.

Notice the connections of the error elements to both the learning algorithm \(\mathcal{A}\) and the final model \(\hat{f}\). Why is this? As we will learn in the upcoming chapters, learning algorithms use, implicitly or explicitly, a pointwise error function \(err()\). In turn, in order to determine which candidate model \(h()\) is the best approximation to the target model \(f()\), we need to use an overall measure of error \(E()\).

### 8.3.3 Auxiliary Technicality

We need to assume some probability distribution \(P\) on \(\mathcal{X}\). That is, we assume our vectors \(\mathbf{x_1}, \dots, \mathbf{x_n}\) are independent identically distributed (i.i.d.) samples from this distribution \(P\). (Exactly what distribution you pick - normal, chi-squared, \(t\), etc. - is, for the moment, irrelevant).

Recall that out-of-sample data is highly theoretical; we will never be able to obtain it in its entirety. The best we can do is obtain a subset of the out-of-sample data (the test data), and estimate the rest. Our imposition of a distributional structure on \(\mathcal{X}\) enables us to link the in-sample error with the out-of-sample error.

Recall that our ultimate goal is to get a good function \(\widehat{f} \approx f\). What do we mean by the symbol “\(\approx\)”? Technically speaking, we want \(E_{\mathrm{out}}(\widehat{f}) \approx 0\). If this is the case, we can safely say that our model has been successfully trained. However, we can never check if this is the case, since we don’t have access to \(E_{\mathrm{out}}\).

To solve this, we break our goal into two sub-goals:

\[ E_{\mathrm{out}} (\widehat{f}) \approx 0 \ \Rightarrow \begin{cases} E_{\mathrm{in}}(\widehat{f}) \approx 0 & \text{practical result} \\ & \\ E_{\mathrm{out}}(\widehat{f}) \approx E_{\mathrm{in}}(\widehat{f}) & \text{technical/theoretical result} \\ \end{cases} \]

The first condition is easy to check. How do we check the second? We check the second condition by invoking our distributional assumption \(P\) on \(\mathcal{X}\). Using our assumption, we can cite various theorems to assert that the second result indeed holds true. We will later find ways to estimate \(E_{\mathrm{out}}(\widehat{f})\).
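A common practical stand-in for \(E_{\mathrm{out}}(\widehat{f})\), previewed here with simulated data of our own choosing, is the error on a held-out test set that played no role in fitting:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 200)
y = 2.0 * x + 0.3 * rng.normal(size=200)   # simulated noisy linear data

# hold out the last 50 points to stand in for out-of-sample data
x_tr, y_tr = x[:150], y[:150]
x_te, y_te = x[150:], y[150:]

slope, intercept = np.polyfit(x_tr, y_tr, 1)
f_hat = lambda x: slope * x + intercept

E_in = np.mean((f_hat(x_tr) - y_tr) ** 2)      # in-sample error
E_test = np.mean((f_hat(x_te) - y_te) ** 2)    # estimate of E_out
```

With i.i.d. data, the held-out error tends to track \(E_{\mathrm{out}}\), which is what makes the two-part decomposition of our goal usable in practice.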

## 8.4 Noisy Targets

In practice, the relationship between inputs and output won’t necessarily be
given by a nice (deterministic) function. Rather, there will be some **noise**.
Hence, instead of saying \(y = f(x)\) where
\(f : \mathcal{X} \to \mathcal{Y}\), a better statement might be something
like \(y = f(x) + \varepsilon\). But even this notation has some flaws;
for example, we could have the same input mapping to different outputs
(which cannot happen if \(f\) is a proper “function”). That is, we may have two
individuals with the exact same inputs \(\mathbf{x_A} = \mathbf{x_B}\) but with
different response variables \(y_A \neq y_B\). Instead, it makes more sense to
consider some **target conditional distribution** \(P(y \mid x)\). In this way,
we can think of our data as forming a joint probability distribution
\(P(\mathbf{x}, y)\). That is because:

\[ P(\mathbf{x}, y) = P(\mathbf{x}) P(y \mid \mathbf{x}) \tag{8.9} \]
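The factorization in (8.9) suggests a direct way to simulate data: first draw \(\mathbf{x}\) from \(P(\mathbf{x})\), then draw \(y\) from \(P(y \mid \mathbf{x})\). A sketch with a uniform input distribution and additive Gaussian noise (both toy choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: 1.5 * x + 2.0                # deterministic part of the target

# draw x from P(x), then y from P(y | x): here, f(x) plus Gaussian noise
x = rng.uniform(0, 10, size=5)
y = f(x) + rng.normal(scale=0.5, size=5)

# identical inputs can now yield different outputs
y_a = f(4.0) + rng.normal(scale=0.5)
y_b = f(4.0) + rng.normal(scale=0.5)
```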

Here’s the updated version of the supervised learning diagram with a modified orange cloud containing the unknown target distribution, and the noisy function.

In supervised learning, we want to learn the conditional distribution
\(P(y \mid \mathbf{x})\). Again, we can think of this probability in terms of
\(y = f(\mathbf{x}) + \text{noise}\). Also, sometimes the Hypothesis Set and the Learning
Algorithm boxes are combined into one, called the **Learning Model**.