12 Learning Phases

In this chapter we discuss further theoretical elements of the supervised learning framework. In particular, we take a deep dive into some of the activities to be performed in every learning process, namely model training, model selection, and model assessment.

A word of caution needs to be said about the terminology that we use here. If you look at other books or resources about statistical/machine learning, you will find that there is no consistent use of words such as training, testing, validation, evaluation, assessment, selection, and other similar terms. We have decided to use specific words—for certain concepts—that other authors or practitioners may handle differently.

Also, bear in mind that many of the notions and ideas described in this chapter have a decidedly theoretical flavor. As we move on, we will provide more details and explanations on how to make such ideas more concrete, and how to execute them in practice.

12.1 Introduction

Let’s bring back a simplified version of the supervised learning diagram depicted in the figure below. We know that there are more elements in the full-fledged diagram, but we want to reduce the number of displayed elements in order to focus on three main stages that are present in all supervised learning contexts.

Figure 12.1: Simplified Supervised Learning Diagram

In every learning situation, there is a handful of tasks we need to carry out.

1) First, we need to fit several models; typically this involves working with different classes of hypotheses. For example, we can consider four types of regression methods: principal components regression (\(\mathcal{H_1}\)), partial least squares regression (\(\mathcal{H_2}\)), ridge regression (\(\mathcal{H_3}\)), and lasso (\(\mathcal{H_4}\)). All of these types of regression models have tuning parameters that cannot be derived analytically from the data, but have to be determined through trial-and-error steps (see the sketch after this list).

2) For each hypothesis family, we need to find the optimal tuning parameter, which involves choosing a finalist model: the best principal components regression, the best partial least squares regression, etc. Then, we need to select the best model among the finalist models. This will be the final model \(\widehat{f}\) to be delivered.

3) Finally, we need to measure the predictive performance of the final model: that is, measure how the model will behave on out-of-sample points.
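To make the first task more tangible, here is a minimal sketch of the four hypothesis classes and their tuning parameters, written in Python with scikit-learn (our choice purely for illustration; nothing in this chapter depends on it). The parameter grids are arbitrary values picked for the example, and PCR is assembled here as a PCA step followed by ordinary least squares.

```python
# Four hypothesis classes, each indexed by a tuning parameter that cannot be
# derived analytically from the data. The grids below are illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.cross_decomposition import PLSRegression

hypothesis_classes = {
    # H1: principal components regression, tuned by the number of components
    "pcr": [Pipeline([("pca", PCA(n_components=c)), ("ols", LinearRegression())])
            for c in (1, 2, 3, 4, 5)],
    # H2: partial least squares regression, tuned by the number of components
    "pls": [PLSRegression(n_components=c) for c in (1, 2, 3, 4, 5)],
    # H3: ridge regression, tuned by the penalty parameter
    "ridge": [Ridge(alpha=a) for a in (0.01, 0.1, 1.0, 10.0)],
    # H4: lasso, tuned by the penalty parameter
    "lasso": [Lasso(alpha=a) for a in (0.01, 0.1, 1.0, 10.0)],
}
```

Each inner list is one family of candidate models; the learning phases described below decide which candidate within each family, and then which family overall, gets delivered.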

We can formalize these tasks in three major phases encountered in every supervised learning system: 1) model training, 2) model selection, and 3) model assessment.

  • Training: Given some data, and a certain modeling hypothesis \(\mathcal{H}_m\), how can we fit/estimate a model \(h_m\)? Often, we are also interested in saying something about its predictive performance (in a limited sense).

  • Selection: This involves choosing a model from a set of candidate models. Typically, this has to do with two types of selection:

    • Pre-Selection: choosing a finalist model from a set of models belonging to a certain class of hypotheses.

    • Final-Selection: choosing a final model among a set of finalist models; that is, choosing the very final model.

  • Assessment: Given a final model \(\widehat{f}\), how can we provide an estimate \(\widehat{E}_{out}(\hat{f})\) of the out-of-sample error? In other words, how do we measure the performance of the final model in order to estimate its predictive behavior out-of-sample?

To expand our mental map, let’s place the learning phases in the supervised learning diagram (see image below). As you can tell, now we are taking a peek inside the learning algorithm \(\mathcal{A}\) section. This is where the model training, the pre-selection, and the final-selection tasks occur. In turn, the model assessment part has to do with estimating the out-of-sample performance of the final model.

Figure 12.2: Schematic of Learning Phases

Having identified the main learning tasks, and knowing how they show up inside the supervised learning diagram, the next thing to consider is: What data should we use to perform each task? We will answer this question in the following sections, starting backwards, that is: first with the model assessment phase, then the selection of the final model, then the pre-selection of a finalist model, and finally the training of candidate models.

12.2 Model Assessment

Let us consider a situation in which we have a final model \(h\). The next logical step should involve measuring its prediction quality. In other words, we want to see how good (or how bad) the predictions of \(h\) are. The general idea is fairly simple, at least conceptually. Given an input \(\mathbf{x_i}\), we need to assess the discrepancy between the predicted value \(\hat{y}_i = h(\mathbf{x_i})\) and the observed value \(y_i\).

Simple, right? … Well, not really.

As you may recall from the chapter about the theoretical framework of supervised learning, there are two major types of predictions. On one hand, we have predictions \(\hat{y}_i = h(\mathbf{x_i})\) of in-sample points \((\mathbf{x_i}, y_i) \in \mathcal{D}_{in}\). In-sample points are those data points that we used to fit a given model. On the other hand, we have predictions \(\hat{y}_0 = h(\mathbf{x_0})\) of out-of-sample points \((\mathbf{x_0}, y_0) \in \mathcal{D}_{out}\). Out-of-sample points are data points not used to fit a model.

Because of this distinction between in-sample data \(\mathcal{D}_{in}\), and out-of-sample data \(\mathcal{D}_{out}\), we can measure the predictive quality of a model from these two perspectives. This obviously implies having predictions, and errors, of two different natures. Aggregating all the discrepancies between the predicted values \(\hat{y}_i = h(\mathbf{x_i})\) of the in-sample points and their observed values \(y_i\) allows us to quantify the in-sample error \(E_{in}\), which informs us about the resubstitution power of the model \(h\) (i.e. how well the model fits the learning data).

\[ E_{in}(h) = \frac{1}{|\mathcal{D}_{in}|} \sum_{i \in \mathcal{D}_{in}} err \big( h(x_i) = \hat{y}_i, y_i \big) \qquad \text{resubstitution error} \tag{12.1} \]

The more interesting part comes with the out-of-sample data points. If we had access to all points in \(\mathcal{D}_{out}\), measuring the discrepancies between the predicted values \(\hat{y}_0 = h(\mathbf{x_0})\) of the out-of-sample points and their observed values \(y_0\) would allow us to quantify the out-of-sample error \(E_{out}\), truly measuring the generalization power of the model \(h\). In practice, unfortunately, we won’t have access to the entire out-of-sample data.

\[ E_{out}(h) = \mathbb{E} [err (h(x_0) = \hat{y}_0 , y_0)] \qquad \text{generalization error} \tag{12.2} \]

Knowing that the whole out-of-sample data set is not within our reach, the second best thing that we can do is to find a proxy set for \(\mathcal{D}_{out}\). If we can obtain a data set \(\mathcal{D}_{proxy}\) that is a representative subset of \(\mathcal{D}_{out}\), then we can compute \(E_{proxy}\) and use it to approximate \(E_{out}\), thus having a fair estimation of the generalization power of a final model \(h\). This is precisely the idea behind the term Model Assessment: How can we estimate \(E_{out}(h)\) of a final model? This basically reduces to: how can we find an unbiased sample \(\mathcal{D}_{proxy} \subset \mathcal{D}_{out}\) in order to get \(\widehat{E}_{out}\)?

\[ \underbrace{\widehat{E}_{out}(h)}_{\mathcal{D}_{proxy}} \ \approx \ \underbrace{E_{out}(h)}_{\mathcal{D}_{out}} \tag{12.3} \]

12.2.1 Holdout Test Set

In order to answer the question “How can we estimate \(E_{out}\)?”, let us discuss the theoretical rationale behind the so-called Holdout Method, also known as Holdout Test Set.

In practice, we always have some available data \(\mathcal{D} = \{ (\mathbf{x_1}, y_1), \dots, (\mathbf{x_n}, y_n) \}\). If we used all the \(n\) data points to train/fit a model, we could perfectly measure \(E_{in}\), but then we wouldn’t have any out-of-sample points to get an honest approximation \(\widehat{E}_{out}\) of \(E_{out}\). Perhaps you are thinking: “Why not use \(E_{in}\) as an estimate \(\widehat{E}_{out}\) of \(E_{out}\)?” Because it won’t be a reliable estimate. We may have a model \(h\) that produces a very small in-sample error, \(E_{in} \approx 0\), but that doesn’t necessarily mean that it has good generalization performance out-of-sample.
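To see how misleading \(E_{in}\) can be, here is a small sketch in Python with scikit-learn and NumPy (assumed available). The data and the 1-nearest-neighbor model are made up solely to exaggerate the effect, since 1-NN essentially interpolates its training data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Simulated learning data: a noisy quadratic signal
X = rng.uniform(-2, 2, size=(50, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=50)

# A 1-nearest-neighbor regressor reproduces its own training responses,
# so its in-sample (resubstitution) error is essentially zero
h = KNeighborsRegressor(n_neighbors=1).fit(X, y)
E_in = np.mean((h.predict(X) - y) ** 2)

# Fresh points drawn from the same process tell a different story
X_new = rng.uniform(-2, 2, size=(1000, 1))
y_new = X_new[:, 0] ** 2 + rng.normal(scale=0.3, size=1000)
E_new = np.mean((h.predict(X_new) - y_new) ** 2)

print(E_in, E_new)  # E_in is (near) zero; the error on fresh points is not
```

An \(E_{in}\) of zero here says nothing about how the model behaves on points it has not seen, which is exactly why we need held-out data.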

Given that the only available data set is \(\mathcal{D}\), the Holdout Method proposes to split \(\mathcal{D}\) into two subsets:

  1. a subset \(\mathcal{D}_{train}\), called the training set, used to train/fit the model, and

  2. another subset \(\mathcal{D}_{test}\), called the test set, to be used as a proxy of \(\mathcal{D}_{out}\) for testing/assessment purposes.

Figure 12.3: Data split into training and test sets

How do we actually obtain these two sets? Usually by taking a random sample of size \(a\), without replacement, from \(\mathcal{D}\) to form the test set, leaving the remaining \(n - a\) points for training (although there are exceptions).

\[ \mathcal{D} \to \begin{cases} \text{training } \ \mathcal{D}_{train} & \to \ \text{size } n - a \\ & \\ \text{test } \ \mathcal{D}_{test} & \to \ \text{size } a \\ \end{cases} \tag{12.4} \]

We then fit a particular model using \(\mathcal{D}_{train}\), obtaining a model that we’ll call \(h^{-}(x)\), “\(h\)-minus”, because it is a model fitted with the training set \(\mathcal{D}_{train}\), which is a subset of the available data \(\mathcal{D}\). With the remaining points in \(\mathcal{D}_{test}\), we can measure the performance of the model \(h^{-}(x)\) as:

\[ E_{test}(h^{-}) = \frac{1}{a} \sum_{\ell=1}^{a} err\left( h^{-}(\mathbf{x_\ell}) , y_\ell \right) ; \hspace{5mm} (\mathbf{x_\ell}, y_\ell) \in \mathcal{D}_{test} \tag{12.5} \]

As long as \(\mathcal{D}_{test}\) is a representative sample of \(\mathcal{D}_{out}\), \(E_{test}\) should give an unbiased estimate of \(E_{out}\). Let’s see why.

12.2.2 Why does a test set work?

Consider an out-of-sample point \((\mathbf{x_0}, y_0)\) that is part of the test set:

\[ (\mathbf{x_0}, y_0) \in \mathcal{D}_{test} \]

Given a pointwise error function \(err()\), we can measure the error:

\[ err(h(\mathbf{x_0}), y_0) \tag{12.6} \]

Moreover, we can treat it as a point estimate of \(E_{out}(h)\).

Here’s a relevant question: Is \(err(h(\mathbf{x_0}), y_0)\) an unbiased estimate of \(E_{out}(h)\)?

To see whether the point estimate \(err(h(\mathbf{x_0}), y_0)\) is an unbiased estimate of \(E_{out}(h)\), let’s determine its expectation over the input space \(\mathcal{X}\):

\[ \mathbb{E}_{\mathcal{X}} [ err(h(\mathbf{x_0}), y_0) ] \tag{12.7} \]

Remember what the above expression represents? Yes, it is precisely the out-of-sample error \(E_{out}(h)\)! Therefore, \(err(h(\mathbf{x_0}), y_0)\) is an unbiased point estimate of the out-of-sample error.

What about the variance: \(Var[err(h(\mathbf{x_0}), y_0)]\)? For the sake of simplicity let’s assume that this variance is constant:

\[ Var[err(h(\mathbf{x_0}), y_0)] = s^2 \tag{12.8} \]

Obviously, this variance could be large (or small). So a single test point \((\mathbf{x_0}, y_0)\), even though its pointwise error is an unbiased estimate of \(E_{out}\), does not give us any idea of how reliable that estimate is.

Well, let’s consider a set \(\mathcal{D}_{test} = \{ (\mathbf{x_1}, y_1), \dots, (\mathbf{x_a}, y_a) \}\) containing \(a > 1\) points. We can average their pointwise errors to get \(E_{test}\):

\[ E_{test}(h) = \frac{1}{a} \sum_{\ell=1}^{a} err[ h(\mathbf{x_\ell}), y_\ell ] \tag{12.9} \]

The question becomes: is \(E_{test}(h)\) an unbiased estimate of \(E_{out}(h)\)?

Let’s find out:

\[\begin{align*} \mathbb{E}_{\mathcal{X}} [E_{test}(h)] &= \mathbb{E}_{\mathcal{X}} \left [ \frac{1}{a} \sum_{\ell=1}^{a} err[ h(\mathbf{x_\ell}), y_\ell ] \right ] \\ &= \frac{1}{a} \sum_{\ell=1}^{a} \mathbb{E}_{\mathcal{X}} \left [ err \left ( h(\mathbf{x_\ell}), y_\ell \right) \right ] \\ &= \frac{1}{a} \sum_{\ell=1}^{a} E_{out}(h) \\ &= E_{out}(h) \tag{12.10} \end{align*}\]

Yes, it turns out that \(E_{test}(h)\) is an unbiased estimate of \(E_{out}(h)\). But what about the variance? Let’s assume that the errors across points are independent (this may not be the case in practice, but we make this assumption to ease computation). Under this assumption, the variance of \(E_{test}(h)\) is given by:

\[ Var[E_{test}(h)] = \frac{1}{a^2} \sum_{\ell=1}^{a} Var[ err(h(\mathbf{x_\ell}), y_\ell) ] = \frac{s^2}{a} \tag{12.11} \]

The above equation tells us that, as we increase the number \(a\) of test points, the variance of the estimator \(E_{test}(h)\) will decrease. Simply put, the more test points we use, the more reliably \(E_{test}(h)\) estimates \(E_{out}(h)\). Of course, \(a\) is not freely selectable; the larger \(a\) is, the smaller our training dataset will be.
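The \(s^2/a\) behavior is easy to verify numerically. The following sketch (plain NumPy; the error distribution is made up just to have something to average, with pointwise errors simulated as squared standard-normal residuals, whose variance is \(s^2 = 2\)) compares the empirical variance of \(E_{test}\) with the \(s^2/a\) prediction for increasing test-set sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
s2 = 2.0  # variance of a single simulated pointwise error (a squared standard normal)

for a in (10, 100, 1000):
    # 5000 simulated test sets of size a; each E_test is the mean of a pointwise errors
    pointwise = rng.normal(size=(5000, a)) ** 2
    E_test = pointwise.mean(axis=1)
    print(f"a = {a:4d}   empirical Var[E_test] = {E_test.var():.4f}   s^2 / a = {s2 / a:.4f}")
```

The empirical variances track \(s^2/a\) closely: multiplying the test-set size by 10 divides the variance of \(E_{test}\) by roughly 10.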

In any case, the important thing is that reserving some points \(\mathcal{D}_{test}\) from a learning set \(\mathcal{D}\) to use them for testing purposes is a very wise idea. It definitely allows us to have a way for estimating the performance of a model \(h\), when applied to out-of-sample points.

Again, the holdout method is simply a conceptual starting point. Also, depending on how you form your training and test sets, you may end up with a split that is not truly representative of the studied phenomenon. So instead of using just one split, some authors propose to use several splits of training-test sets obtained through resampling methods. We talk about this topic in more detail in the next chapter.

Holdout Algorithm

Here’s the conceptual algorithm behind the holdout method.

  1. Compile the available data into a set \(\mathcal{D} = \{(\mathbf{x_1}, y_1), \dots, (\mathbf{x_n}, y_n) \}\).

  2. Choose \(a \in \mathbb{Z}^{+}\) elements from \(\mathcal{D}\) to form the test set \(\mathcal{D}_{test}\), and place the remaining \(n - a\) points into the training set \(\mathcal{D}_{train}\).

  3. Use \(\mathcal{D}_{train}\) to fit a particular model \(h^{-}(x)\).

  4. Measure the performance of \(h^{-}\) using \(\mathcal{D}_{test}\); specifically, compute

\[ E_{test}(h^{-}) = \frac{1}{a} \sum_{\ell=1}^{a} err_\ell[ h^{-}(\mathbf{x_\ell}), y_\ell ] \tag{12.12} \]

where \((\mathbf{x_\ell}, y_\ell) \in \mathcal{D}_{test}\), and \(err_\ell\) is some measure of pointwise error.

  5. Generate the final model \(\widehat{h}\) by refitting \(h^{-}\) to the entire dataset \(\mathcal{D}\).

There are many different conventions as to how to pick \(a\): one common rule-of-thumb is to assign 80% of your data to the training set, and the remaining 20% to the test set.

Also, keep in mind that the model that you ultimately deliver will not be \(h^{-}\); rather, you need to refit \(h^{-}\) using the entire data \(\mathcal{D}\). This yields the final hypothesis model \(\widehat{h}\).
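Here is a minimal sketch of the five steps in Python with scikit-learn (assumed available). The simulated data, the ridge regression model, and the 80-20 split are illustrative choices and not part of the holdout method itself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Step 1: the available data D (simulated here just to have something to run)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Step 2: split D into D_train (n - a points) and D_test (a points); 80-20 here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Step 3: fit the "h-minus" model on D_train only
h_minus = Ridge(alpha=1.0).fit(X_train, y_train)

# Step 4: measure performance on D_test, here with squared pointwise errors
E_test = np.mean((h_minus.predict(X_test) - y_test) ** 2)
print("E_test(h-minus):", E_test)

# Step 5: deliver the final model, refitted on the entire data set D
h_hat = Ridge(alpha=1.0).fit(X, y)
```

Note that \(E_{test}\) is computed from h_minus, while the delivered model h_hat is refitted on all of \(\mathcal{D}\), mirroring the distinction made above.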

12.3 Model Selection

Now that we’ve talked about how to measure the generalization performance of a given final model, the next thing to discuss is how to compare different models, in order to select the best one.

Say we have \(M\) models to choose from. Consider, for example, the following three cases (each with \(M = 3\)):

We could have three different types of hypotheses:

  • \(h_1:\) linear model
  • \(h_2:\) neural network
  • \(h_3:\) regression tree

Or we could also have a particular type of model, e.g. polynomials, with three different degrees:

  • \(h_1:\) quadratic model
  • \(h_2:\) cubic model
  • \(h_3:\) 15th-degree model

Or maybe a principal components regression with three options for the number of components to retain:

  • \(h_1:\) PCR \((c = 1)\)
  • \(h_2:\) PCR \((c = 2)\)
  • \(h_3:\) PCR \((c = 3)\)

How do we select the best model from a set of candidate models?

12.3.1 Three-way Holdout Method

We can extend the idea behind the holdout method, to go from one holdout set to two holdout sets. Namely, instead of splitting \(\mathcal{D}\) into two sets (training and test), we split it into three sets that we will call: training \(\mathcal{D}_{train}\), validation \(\mathcal{D}_{val}\), and test \(\mathcal{D}_{test}\).

Figure 12.4: Data split into training, validation, and test sets

The test set \(\mathcal{D}_{test}\) is the set that will be used for assessing the performance of a final model. This means that once we have created \(\mathcal{D}_{test}\), the only time we use this set is at the end of the learning process. We only use it to quantify the generalization error of the final model. And that’s it. We don’t use this set to make any learning decision.

What about the validation set \(\mathcal{D}_{val}\)? What do we use it for?

We recommend using the validation set for selecting the final model from a set of finalist models. However, keep in mind that other authors may recommend other uses for \(\mathcal{D}_{val}\).

Here’s our ideal suggestion on how to use the validation set.

Figure 12.5: Using a validation set for final-selection

\(\mathcal{H}_m\) represents the \(m\)-th hypothesis class, and \(h^{-}_m\) represents the best fit from that class obtained using \(\mathcal{D}_{train}\). In other words, \(h^{-}_m\) is the finalist model from class \(\mathcal{H}_m\).

After a finalist model \(h^{-}_m\) has been pre-selected for each hypothesis class, we use \(\mathcal{D}_{val}\) to compute validation errors \(E^{m}_{val}\). The model with the smallest validation error is then selected as the final model.

After the best model \(h^{-}_{m}\) has been selected (the one with the smallest \(E_{val}\)), a model \(h^{*}_{m}\) is fitted using \(\mathcal{D}_{train} \cup \mathcal{D}_{val}\). The performance of this model is assessed by using \(\mathcal{D}_{test}\) to obtain \(E^{m}_{test} = E_{test}(h_{m}^{*})\). Finally, the “official” model is the model fitted on the entire data set \(\mathcal{D}\), but the reported performance is \(E^{m}_{test}\).

One important thing to notice is that \(E^{m}_{test}\) is an unbiased estimate of the out-of-sample performance of \(h^{*}_{m}\), even though the model that is actually delivered is the one fitted on the entire data.

Why are we calling \(\mathcal{D}_{val}\) a “validation” set, when it appears to be serving the same role as a “test” set? Because choosing the “best” model (i.e. the model with the smallest \(E_{val}\)) is a learning decision. Had we stopped our procedure before making this choice (i.e. if we had stopped after computing \(E_{val}^m\) for \(m = 1, 2, ..., M\)), we could have plausibly called these errors “test errors.” However, we went one step further, and as a result obtained an optimistically biased estimate of \(E_{out}\); namely, \(E_{val}^{m}\) for the chosen model \(m\).

Of course, there is still a tradeoff when working with a 3-way split, which is precisely the fact that we need to split our data into three sets. The larger our test and validation sets are, the more reliable the estimates of the out-of-sample performance will be. At the same time, however, the more data points we set aside for validation and testing, the smaller our training set will be, which will very likely produce far-from-optimal finalist models, as well as a suboptimal final model.
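To tie the pieces together, here is a sketch of the whole 3-way procedure in Python with scikit-learn (assumed available). The three finalist models and the 60-20-20 split are stand-ins chosen for the example; in practice the finalists would come out of a pre-selection step like the one described in the next section.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

def mse(model, A, b):
    # mean of squared pointwise errors
    return np.mean((model.predict(A) - b) ** 2)

# Split D into training (60%), validation (20%), and test (20%) sets
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=2)

# Finalist models (one per hypothesis class), fitted on D_train only
finalists = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}
E_val = {}
for name, model in finalists.items():
    model.fit(X_train, y_train)
    E_val[name] = mse(model, X_val, y_val)  # validation error of each finalist

# Final-selection: keep the finalist with the smallest validation error
best = min(E_val, key=E_val.get)

# Refit the chosen hypothesis on D_train plus D_val, then assess it on D_test
h_star = clone(finalists[best]).fit(np.vstack([X_train, X_val]),
                                    np.concatenate([y_train, y_val]))
E_test = mse(h_star, X_test, y_test)  # the reported estimate of out-of-sample error

# The "official" delivered model is refitted on the entire data set D
h_final = clone(finalists[best]).fit(X, y)
print(best, E_val[best], E_test)
```

Notice that \(\mathcal{D}_{test}\) is touched exactly once, at the very end, and only to report a test error; every learning decision (fitting and final-selection) relies on \(\mathcal{D}_{train}\) and \(\mathcal{D}_{val}\) alone.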

12.4 Model Training

In order to perform the pre-selection of finalist models, the selection of the final model, and its corresponding assessment, we need to be able to fit all candidate models belonging to different classes of hypotheses.

The data set that we use to fit/train such models is, surprise-surprise, the training set \(\mathcal{D}_{train}\). This will also be the set that we’ll use to pre-select the finalist models from each class. For example, say the class \(\mathcal{H}_{m}\) is principal components regression (PCR), and we need to train several models \(h_{1,m}, h_{2,m}, \dots, h_{q,m}\) with different numbers of principal components (i.e. the tuning parameter).
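As a concrete sketch (again Python with scikit-learn, assumed available; the simulated data and the range of one to six components are arbitrary), training the PCR candidates on \(\mathcal{D}_{train}\) might look like this:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X_train = rng.normal(size=(150, 10))
y_train = X_train[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=150)

# One PCR candidate per number of retained components (the tuning parameter)
pcr_candidates = {}
E_train = {}
for c in range(1, 7):
    pcr = Pipeline([("pca", PCA(n_components=c)), ("ols", LinearRegression())])
    pcr.fit(X_train, y_train)
    pcr_candidates[c] = pcr
    E_train[c] = np.mean((pcr.predict(X_train) - y_train) ** 2)  # in-sample error only
```

The in-sample errors are recorded only for reference; as the next paragraph argues, they are not a sound basis for pre-selecting the finalist.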

Obviously, if we simply use \(\mathcal{D}_{train}\) to fit all possible PCR models, and also to choose the one with the smallest error \(E_{train}\), then we run the risk of overfitting. After all, we know that \(E_{train}\) is an optimistically biased, and therefore unreliable, estimate of \(E_{out}\). One may ask, “Why don’t we use \(\mathcal{D}_{val}\) to compute validation errors, and select the model with the smallest validation error?” Quick answer: you could. But then you are going to run out of fresh points for the final-selection phase, and you also run the risk of choosing a model that overfits the validation data.

It looks like we are heading toward a dead end. On one hand, we cannot use all the training points for both model training and pre-selection of finalists. On the other hand, if we use the validation points for the pre-selection phase, then we are left with no fresh points to choose the final model, unless we exhaust the points in the test set.

In theory, it seems that we should use different data sets for each phase: something like a \(\mathcal{D}_{train}\) for training candidate models, \(\mathcal{D}_{pre}\) for pre-selecting finalist models, \(\mathcal{D}_{final}\) for selecting the very final model, and \(\mathcal{D}_{assess}\) for assessing the performance of the final model. We’ve seen what theory says about reserving some out-of-sample points for both 1) making learning decisions (e.g. choosing a finalist), and 2) assessing the performance of a model. It is a wise idea to have fresh unseen points to spend as we move into a new learning phase. The problem is that theory doesn’t tell us exactly how many holdout sets we should have, or how big they should be.

The main limitation is that we only have so much data. Can we really afford splitting our limited data resources into four different sets? With, say, 500 observations, even a generous 40-20-20-20 split would leave only 200 points for training. Are we doomed …?

Resampling methods to the rescue!

Fortunately, we can use (re)sampling methods that will allow us to make the most out of—at least—the training set. Because of its practical relevance, we discuss this topic in the next chapter.