8 Theoretical Framework
Finally, we have arrived at the part of the book in which we provide a framework for the theory of learning. Well, to be more precise, the framework is really about the theory of supervised learning. The purpose of this chapter is to give you a mental map of the conceptual elements that are present in a supervised learning problem.
Keep in mind that most of what is covered in this chapter is highly theoretical. It has to do with the concepts and principles that we ideally expect to find in a prediction task (e.g. regression, classification). Having said that, we will also need to discuss what to do in practice in order to handle most of these theoretical elements (in the upcoming chapters).
8.1 Mental Map
So far, we have seen an example of Unsupervised Learning (PCA), as well as one method of Supervised Learning (linear regression). Now, we begin discussing learning ideas at an abstract level.
Let’s return to our example of predicting NBA players’ salaries. Suppose we have data on NBA players: each player’s height, weight, years of professional experience, number of 2-point shots made, number of 3-point shots made, etc. And assume also that we have data about the players’ salaries.
| Player | Height | Weight | Yrs Expr | 2-Pts | 3-Pts | Salary |
|--------|--------|--------|----------|-------|-------|--------|
| 1 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
| 2 | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
| … | … | … | … | … | … | … |
| n | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) |
We are interested in predicting the salary of players, which is the output variable, denoted as \(Y\). The rest of the variables (e.g. height, weight, experience, 2PTS, 3PTS) are the inputs denoted as \(X_1, \dots, X_p\).
In addition, we assume the existence of a target function \(f()\):
\[ \textsf{target function} \qquad f : \mathcal{X} \to \mathcal{Y} \tag{8.1} \]
which is a function mapping the input space \(\mathcal{X}\) to the output space \(\mathcal{Y}\). Keep in mind that this function is an ideal, and remains unknown throughout our computations. Here’s a metaphor that we like to use. Pretend that the target function is some sort of mythical creature, like a unicorn (or your preferred creature). We are trying to find this elusive guy.
More generally, we have a dataset \(\mathcal{D}\):
\[ \textsf{dataset} \qquad \mathcal{D} = \{ (\mathbf{x_1}, y_1), (\mathbf{x_2}, y_2), \dots, (\mathbf{x_n}, y_n) \} \tag{8.2} \]
where \(\mathbf{x_i}\) represents the vector of features for the \(i\)-th player, and \(y_i\) represents his salary.
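To make this notation concrete, here is a minimal sketch in Python (with entirely made-up numbers) of how such a dataset \(\mathcal{D}\) could be represented as a feature matrix and a response vector:

```python
import numpy as np

# Hypothetical in-sample data for n = 4 players and p = 5 features:
# height (cm), weight (kg), years of experience, 2-pointers, 3-pointers.
# All values are made up for illustration.
X = np.array([
    [198,  98, 5, 320, 110],
    [211, 112, 9, 450,  15],
    [185,  84, 2, 150, 180],
    [203, 104, 7, 380,  60],
])

# Salaries (in millions of dollars), also made up
y = np.array([4.5, 12.0, 1.8, 8.3])

# The dataset D is the collection of pairs (x_i, y_i)
D = list(zip(X, y))
```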
From this data, we wish to obtain a fitted model, formally known as a hypothesis model \(\widehat{f}()\):
\[ \textsf{hypothesis model} \qquad \widehat{f}: \mathcal{X} \to \mathcal{Y} \tag{8.3} \]
and then use \(\widehat{f}\) to approximate the unknown target function \(f\).
In order to find \(\widehat{f}()\), we typically consider a set of candidate models, also known as a hypothesis set, \(\mathcal{H} = \{ h_1, h_2, \dots, h_m \}\). The selected hypothesis \(h^*\) will be the one used as our final model \(\widehat{f}\).
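As a sketch of this idea (the choice of polynomial candidates here is purely an assumption for illustration), a hypothesis set can be coded as a collection of model-fitting routines, one per candidate:

```python
import numpy as np

def make_poly_hypothesis(degree):
    """Return a candidate model: fit a polynomial of the given degree
    by least squares, and return the fitted prediction function h()."""
    def fit(x, y):
        coefs = np.polyfit(x, y, deg=degree)
        return lambda x_new: np.polyval(coefs, x_new)
    return fit

# Hypothesis set H = {h_1, h_2, h_3}: linear, quadratic, cubic models
H = [make_poly_hypothesis(d) for d in (1, 2, 3)]
```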
We can sketch these basic ideas in a sort of mental map; we will refer to the following picture as the “diagram for supervised learning”.
Here’s how to read this diagram. We are using orange clouds around those concepts that are more intangible than those appearing within rectangular or oval shapes. One of these orange clouds is the unknown target function \(f\), which, as its name indicates, is unknown. This implies that we never really “discover” the target function \(f\) in its entirety; rather, we just find a good enough approximation to it by estimating \(\widehat{f}\). Now, as you can tell from the mental map, the other orange cloud has to do with precisely this idea of good approximation: \(\widehat{f} \approx f\). It is also theoretical because of what we just said: that we don’t know \(f\). We should let you know that, as we modify our diagram, we will encounter more orange concepts, also of a highly theoretical nature.
Now let’s turn our attention to the blue rectangular elements. One of them is the data set \(\mathcal{D}\), which is influenced by the unknown target function. The other blue rectangle has to do with the set of candidate models \(\mathcal{H}\), which is sometimes referred to as the hypothesis set. Both the data set and the set of models are tangible ingredients. Moreover, the set of candidate models is totally under our control. We get to decide what types of models we want to try out (e.g. linear models, polynomial models, non-parametric models).
Then we move to the yellow oval shape, which for now is simply labeled as the learning algorithm \(\mathcal{A}\). This corresponds to the set of instructions and steps to be carried out when learning from data. It is also the stage of the diagram in which most computations take place.
Finally, we arrive at the yellow rectangle containing the final model \(\widehat{f}\). This is the model selected by the learning algorithm from the set of hypothesis models. Ideally, this model is the one that provides a good approximation to the target function \(f\).
Going back to the holy grail of supervised learning, our goal is to find a model \(\widehat{f}\) that gives “good” or accurate predictions. Before discussing what exactly we mean by “accurate”, we first need to talk about predictions.
8.2 Kinds of Predictions
What is the ultimate goal in supervised learning? Quick answer: we want a “good” model. What does a “good” model mean? Simply put, it means that we want to estimate an unknown model \(f\) with some model \(\widehat{f}\) that gives “good” predictions. What do we mean by “good” predictions? Loosely speaking, it means that we want to obtain “accurate” predictions. Before clarifying the notion of accurate predictions, let’s first discuss the concept of predictions itself.
Think of a simple linear regression model (e.g. with one predictor). Having a fitted model \(\widehat{f}(x)\), we can use it to make two types of predictions. On one hand, for an observed point \(x_i\), we can compute \(\hat{y}_i = \widehat{f}(x_i)\). By observed point we mean that \(x_i\) was part of the data used to find \(\widehat{f}\). On the other hand, we can also compute \(\hat{y}_0 = \widehat{f}(x_0)\) for a point \(x_0\) that was not part of the data used when deriving \(\widehat{f}\).
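Here is a minimal numerical sketch of these two types of predictions, using simulated data and a simple linear fit (the data-generating numbers are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated learning data (x_i, y_i): a made-up linear signal plus noise
x = rng.uniform(0, 10, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=30)

# Fit a simple linear regression f_hat(x) = b0 + b1 * x by least squares
b1, b0 = np.polyfit(x, y, deg=1)
f_hat = lambda x_new: b0 + b1 * x_new

y_hat_i = f_hat(x[0])   # prediction of an observed point x_i
x_0 = 4.2               # a point NOT used to fit the model
y_hat_0 = f_hat(x_0)    # prediction of an unobserved point x_0
```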
8.2.1 Two Types of Data
The two distinct types of predictions involve two slightly different kinds of data. The data points \(x_i\) that we use to fit a model are what most authors call learning data. The data points \(x_0\) that we use to assess the performance of a model are NOT supposed to be part of the learning set. In this book we are going to give special names to these data points. We will use the name In-sample data to refer to the learning data, and Out-of-sample data to refer to those data points not used to learn a model.
To summarize, we have, at least in theory, two kinds of data sets:
- In-sample data, denoted \(\mathcal{D}_{in}\), used to fit a model
- Out-of-sample data, denoted \(\mathcal{D}_{out}\), used to measure the predictive quality of a model
8.2.2 Two Types of Predictions
Given the two kinds of data points, we have two corresponding types of predictions:
- predictions \(\hat{y}_i\) of observed/seen values \(x_i\)
- predictions \(\hat{y}_0\) of unobserved/unseen values \(x_0\)
Each type of prediction is associated with a certain behavioral feature of a model. The predictions of observed data, \(\hat{y}_i\), have to do with the memorizing aspect (i.e. apparent error, resubstitution error). The predictions of unobserved data, \(\hat{y}_0\), have to do with the generalization aspect (i.e. generalization error, prediction error).
Both kinds of predictions are important, and each of them is interesting in its own right. However, from the supervised learning standpoint, it is the second type of prediction that we are ultimately interested in. That is, we want to find models that give predictions \(\hat{y}_0\) that are as accurate as possible with respect to the real values \(y_0\).
Don’t get us wrong. Having good predictions \(\hat{y}_i\) of observed values is important and desirable. And to a large extent, it is a necessary condition for a good model. However, it is not a sufficient condition. Fitting the observed data well is not enough to get a good predictive model. Sometimes, you can fit the observed data perfectly, yet have a terrible performance for unobserved values \(x_0\).
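To see this numerically, here is a small simulated example (assuming, purely for illustration, a sine-shaped target with Gaussian noise): a degree-9 polynomial passes through all 10 observed points, yet its predictions at unobserved points can be wildly off.

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 noisy observations of a sine-shaped target (made-up setup)
x = np.sort(rng.uniform(0, 10, size=10))
y = np.sin(x) + rng.normal(scale=0.2, size=10)

# A degree-9 polynomial can interpolate all 10 points ...
coefs = np.polyfit(x, y, deg=9)
print(np.max(np.abs(np.polyval(coefs, x) - y)))   # near-zero in-sample error

# ... but may behave badly at unobserved points x_0
x_0 = np.array([0.5, 5.5, 9.5])
print(np.polyval(coefs, x_0))   # typically far from sin(x_0)
```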
8.3 Two Types of Errors
In theory, we are dealing with two types of predictions, each of which is associated with a certain type of data points.
Because we are interested in obtaining models that give accurate predictions, we need a way to measure the accuracy of such predictions. At the conceptual level we need some mechanism to quantify how different the fitted model is from the target function \(f\):
\[ \widehat{f} \text{ -vs- } f \]
It would be nice to have some measure of how much discrepancy exists between the estimated model and the target model. This means that we need a function that summarizes, somehow, the total amount of error. We will refer to such a term as an Overall Measure of Error:
\[ \text{Overall Measure of Error:} \quad E(\widehat{f},f) \tag{8.4} \]
The typical way in which an overall measure of error is defined is in terms of individual or pointwise errors \(err_i(\hat{y}_i, y_i)\) that quantify the difference between an observed value \(y_i\) and its predicted value \(\hat{y}_i\). As a matter of fact, most overall errors are based on the sum of the pointwise errors:
\[ E(\widehat{f},f) = \text{measure} \left( \sum_i err_i(\hat{y}_i, y_i) \right) \tag{8.5} \]
Unless otherwise said, in this book we will use the mean of the pointwise errors as the default overall error measure:
\[ E(\widehat{f},f) = \frac{1}{n} \left( \sum_i err_i (\hat{y}_i, y_i) \right) \tag{8.6} \]
8.3.1 Individual Errors
What form does the individual error function \(err()\) take? In theory, it can take any form you want. This means that you can invent your own individual error function. However, the most common ones are:
- squared error: \(\quad err(\hat{y}_i, y_i) = \left( \hat{y}_i - y_i \right)^2\)
- absolute error: \(\quad err(\hat{y}_i, y_i) = \left| \hat{y}_i - y_i \right|\)
- misclassification error: \(\quad err(\hat{y}_i, y_i) = [\![ \hat{y}_i \neq y_i ]\!]\)
In the machine learning literature, these individual errors are formally known as loss functions.
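Translated into code, these three pointwise loss functions might look like the following sketch:

```python
import numpy as np

def squared_error(y_hat, y):
    """Squared difference between a prediction and the observed value."""
    return (y_hat - y) ** 2

def absolute_error(y_hat, y):
    """Absolute difference between a prediction and the observed value."""
    return np.abs(y_hat - y)

def misclassification_error(y_hat, y):
    """1 if the predicted label differs from the observed label, else 0."""
    return float(y_hat != y)
```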
8.3.2 Overall Errors
As you can imagine, there are actually two types of overall error measures, based on the type of data that is used to assess the individual errors:
- In-sample Error, denoted \(E_{in}\)
- Out-of-sample Error, denoted \(E_{out}\)
The in-sample error is typically defined as the average of pointwise errors from data points of the in-sample data \(\mathcal{D}_{in}\):
\[ E_{in} (\widehat{f}, f) = \frac{1}{n} \sum_{i=1}^{n} err(\hat{y}_i, y_i) \tag{8.7} \]
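In code, computing \(E_{in}\) amounts to averaging a pointwise error over the in-sample data; a minimal sketch (reusing one of the loss functions sketched above) could be:

```python
import numpy as np

def in_sample_error(f_hat, X_in, y_in, err):
    """Average pointwise error of the model f_hat over D_in."""
    return np.mean([err(f_hat(x), y) for x, y in zip(X_in, y_in)])

# Example usage, assuming a fitted model f_hat and in-sample arrays:
# E_in = in_sample_error(f_hat, X_in, y_in, squared_error)
```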
The out-of-sample error is the theoretical mean, or expected value, of the pointwise errors over the entire input space:
\[ E_{out} (\widehat{f}, f) = \mathbb{E}_{\mathcal{X}} \left[ err \left( \widehat{f}(x), f(x) \right) \right] \tag{8.8} \]
The point \(x\) denotes a generic data point in the input space \(\mathcal{X}\), and, as we said, the expectation is taken over the entire input space \(\mathcal{X}\). This means that the nature of \(E_{out}\) is highly theoretical. In practice, you will never, never, be able to compute this quantity.
In the machine learning literature, these overall measures of error tend to be formally known as cost functions or risks.
Let’s update our supervised learning diagram to include error measures (see figure below). We add a new box (in blue) involving an overall error measure as well as some pointwise error function.
Notice the connections of the error elements to both the learning algorithm \(\mathcal{A}\) and the final model \(\widehat{f}\). Why is this? As we will learn in the upcoming chapters, learning algorithms use—implicitly or explicitly—a pointwise error function \(err()\). In turn, in order to determine which candidate model \(h()\) is the best approximation to the target model \(f()\), we need to use an overall measure of error \(E()\).
8.3.3 Probability as an Auxiliary Technicality
For theoretical reasons, we need to assume some probability distribution over the input space: \(P\) on \(\mathcal{X}\). That is, we assume our vectors \(\mathbf{x_1}, \dots, \mathbf{x_n}\) are independent identically distributed (i.i.d.) samples from this distribution \(P\). (Exactly what distribution you pick—normal, chi-squared, \(t\), etc.—is, for the moment, irrelevant).
Recall that out-of-sample data is highly theoretical; we will never be able to obtain it in its entirety. The best we can do is to obtain a subset of the out-of-sample data (called the test data). Our imposition of a distributional structure on \(\mathcal{X}\) enables us to link the in-sample error with the out-of-sample error.
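Here is a sketch of this practice (with simulated data, and a plain 70/30 split chosen arbitrarily): hold out a test set and use the test error as a finite stand-in for \(E_{out}\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: a made-up linear signal plus noise
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Hold out 30 points as test data (a finite subset of the out-of-sample world)
x_train, x_test = x[:70], x[70:]
y_train, y_test = y[:70], y[70:]

# Fit on the in-sample data only
b1, b0 = np.polyfit(x_train, y_train, deg=1)

mse = lambda y_hat, y: np.mean((y_hat - y) ** 2)
E_in = mse(b0 + b1 * x_train, y_train)    # in-sample error
E_test = mse(b0 + b1 * x_test, y_test)    # test error, a proxy for E_out
```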
Our ultimate goal is to get a good function \(\widehat{f} \approx f\). What do we mean by the symbol “\(\approx\)”? Technically speaking, we want \(E_{\mathrm{out}}(\widehat{f}) \approx 0\). If this is the case, we can safely say that our model has been successfully trained. However, we can never check if this is the case, since we don’t have access to \(E_{\mathrm{out}}\).
To solve this, we break our goal into two sub-goals:
\[ E_{\mathrm{out}} (\widehat{f}) \approx 0 \ \Rightarrow \begin{cases} E_{\mathrm{in}}(\widehat{f}) \approx 0 & \text{practical result} \\ & \\ E_{\mathrm{out}}(\widehat{f}) \approx E_{\mathrm{in}}(\widehat{f}) & \text{technical/theoretical result} \\ \end{cases} \]
The first condition is easy to check because we do have access to the in-sample data. How do we check the second condition? We do so by invoking our distributional assumption \(P\) on \(\mathcal{X}\). Using this assumption, we can cite various theorems to assert that the second result indeed holds true (this part is beyond the scope of this book). We will later find ways to estimate \(E_{\mathrm{out}}(\widehat{f})\).
8.4 Noisy Targets
In practice, the target function won’t necessarily be a nice (or smooth) function. Rather, there will be some noise. Hence, instead of saying \(y = f(x)\) where \(f : \mathcal{X} \to \mathcal{Y}\), a better statement might be something like \(y = f(x) + \varepsilon\). But even this notation has some flaws; for example, we could have multiple inputs mapping to the same output (which cannot happen if \(f\) is a proper “function”). That is, we may have two individuals with the exact same inputs \(\mathbf{x_A} = \mathbf{x_B}\) but with different response variables \(y_A \neq y_B\). Instead, it makes more sense to consider some target conditional distribution \(P(y \mid x)\). In this way, we can think of our data as forming a joint probability distribution \(P(\mathbf{x}, y)\). That is because:
\[ P(\mathbf{x}, y) = P(\mathbf{x}) P(y \mid \mathbf{x}) \tag{8.9} \]
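A minimal simulation sketch of this factorization (assuming, purely for illustration, a linear \(f\) and Gaussian noise): draw \(\mathbf{x}\) from \(P(\mathbf{x})\) first, then draw \(y\) from \(P(y \mid \mathbf{x})\):

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: 2.0 + 0.5 * x        # deterministic part of the noisy target

# Sample from P(x, y) = P(x) * P(y | x)
x = rng.uniform(0, 10, size=5)                 # x ~ P(x)
y = f(x) + rng.normal(scale=1.0, size=5)       # y | x ~ P(y | x)

# Identical inputs can produce different outputs because of the noise
x_A = x_B = 5.0
y_A = f(x_A) + rng.normal(scale=1.0)
y_B = f(x_B) + rng.normal(scale=1.0)           # almost surely y_A != y_B
```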
Here’s the updated version of the supervised learning diagram with a modified orange cloud containing the unknown target distribution, and the noisy function.
In supervised learning, we want to learn the conditional distribution \(P(y \mid \mathbf{x})\). Again, we can think of this probability in terms of \(y = f(\mathbf{x}) + \text{noise}\). Also, sometimes the Hypothesis Set and the Learning Algorithm boxes are combined into one, called the Learning Model.