37 Bagging

37.1 Introduction

Bagging is short for “bootstrap aggregating”, a method developed by Leo Breiman in the early 1990s and formally published in his paper Bagging Predictors, in the journal Machine Learning. A preliminary report is still available on Leo’s archived academic website:

https://www.stat.berkeley.edu/~breiman/bagging.pdf

We will not discuss bagging in detail; rather, we will focus only on its conceptual essence, following Leo’s description: “a method for generating multiple versions of a predictor and using these to get an aggregated predictor”.

37.1.1 Idea of Bagging

Suppose that, instead of having a single initial data set \(\mathcal{D}\), we are able to obtain \(M\) data sets \(\mathcal{D}_1, \dots, \mathcal{D}_M\). From each set \(\mathcal{D}_m\), we then obtain a model \(h_m\). Each model can be a single decision tree, a generic regression model (e.g. ridge regression), or a generic classification model (e.g. logistic regression).

Figure 37.1: Several data sets and fitted models

Having the set of \(M\) models, we can compute the (empirical) average model \(\bar{h}(x)\) as:

\[ \text{empirical} \quad \bar{h}(x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x) \]

which we can include in our previous diagram and get an updated figure:

Figure 37.2: Empirical average model
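To make this concrete, here is a minimal sketch in Python (assuming scikit-learn is available; the generating function `f`, the noise level, and the tree depth are all made up for illustration) that draws \(M\) independent data sets, fits a model on each, and averages their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical setup: pretend we can draw as many independent training sets
# as we like from a known data-generating process f(x) + noise.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # made-up "true" function

def draw_dataset(n=100):
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)   # noisy target
    return x.reshape(-1, 1), y

M = 50
models = []
for _ in range(M):
    X_m, y_m = draw_dataset()                  # independent data set D_m
    models.append(DecisionTreeRegressor(max_depth=3).fit(X_m, y_m))

# Empirical average model: h_bar(x) = (1/M) * sum_m h_m(x)
def h_bar(X):
    return np.mean([h.predict(X) for h in models], axis=0)

x_grid = np.linspace(0, 1, 200).reshape(-1, 1)
print(h_bar(x_grid)[:5])                       # averaged predictions on a grid
```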

As the number of data sets grows very large, i.e. as \(M \to \infty\), the empirical average model approaches the theoretical average hypothesis:

\[\begin{align*} \text{empirical} \quad \bar{h}(x) & \to \text{theoretical} \quad \bar{h}(x) \\ \text{as} \hspace{5mm} M & \to \infty \end{align*}\]

where the theoretical average hypothesis \(\bar{h}\) is given by the expectation over all learning data sets:

\[ \text{theoretical} \quad \bar{h}(x) = \mathbb{E}_{\mathcal{D}} [h^{\mathcal{D}} (x)] \]

Now, there’s a catch. In order for the approximation to be valid, that is, to approximate the theoretical average hypothesis with the empirical one, the data sets \(\mathcal{D}_1, \dots, \mathcal{D}_M\) have to be independent.

The problem is that, in practice, we don’t have \(M\) independent datasets. However, what we can do is to resample from \(\mathcal{D}\), sampling with replacement.

If \(|\mathcal{D}| = n\), we wish to produce a set of data sets \(\{\mathcal{D}_m \}_{m=1}^{M}\), each with \(|\mathcal{D}_m|=n\); that is, these are bootstrap samples. Given these \(M\) bootstrap samples, we proceed as we did in our theoretical experiment above: generate the set of individual models \(\{h_m \}_{m=1}^{M}\), and compute the empirical average hypothesis:

\[ \bar{h}(x) = \frac{1}{M} \sum_{m=1}^{M} h_m(x) \]
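A minimal sketch of this procedure in Python (assuming scikit-learn decision trees as the base model, and with `X`, `y` denoting the original learning set \(\mathcal{D}\)):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_models(X, y, M=100, seed=0):
    """Fit M models, each on a bootstrap sample (size n, drawn with replacement) of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)       # bootstrap indices: n draws with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Empirical average hypothesis: average of the M individual predictions."""
    return np.mean([h.predict(X) for h in models], axis=0)
```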

Now, we cannot invoke the weak law of large numbers, because the \(\mathcal{D}_m\)’s are not independent. However, it turns out that, despite this, the empirical \(\bar{h}(x)\) gets quite close to the theoretical average hypothesis.

This process of taking bootstrap samples, fitting a model to each, and averaging the results to approximate the average hypothesis is known as bagging, which is a portmanteau of bootstrap aggregating. To summarize: we generate a series of data sets \(\mathcal{D}_1, \dots, \mathcal{D}_M\) by drawing bootstrap samples from the original learning set \(\mathcal{D}\); we fit the corresponding model \(h_m(x)\) on each bootstrap set \(\mathcal{D}_m\); and then we aggregate them (by averaging) to obtain the final model:

Figure 37.3: Bagging Scheme

The way we obtain a prediction \(\hat{y}_0\) for a query point \(\mathbf{x_0}\) is by taking the average of all \(M\) predictions, that is:

\[ \hat{y}_0 \rightarrow \bar{h}(\mathbf{x_0}) = \frac{1}{M} \sum_{m=1}^{M} h_m(\mathbf{x_0}) \]
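For example, using the hypothetical `bagged_models` and `bagged_predict` helpers sketched above:

```python
models = bagged_models(X, y, M=100)    # X, y: the original learning set D
x0 = np.array([[0.25]])                # a query point, as a 1-row matrix
y0_hat = bagged_predict(models, x0)    # average of the M predictions at x0
```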

37.2 Why Bother Bagging?

Recall the bias-variance decomposition. Given a learning data set \(\mathcal{D}\) of \(n\) points, and a hypothesis \(h(x)\), the expectation of the Squared Error for a given out-of-sample point \(x_0\), over all possible learning sets, is expressed as:

\[ \mathbb{E}_{\mathcal{D}} \left [ \left( h^{(\mathcal{D})}(x_0) - f(x_0) \right)^2 \right ] \]

When we have a noisy target \(y = f(x) + \epsilon\), with \(\epsilon\) a zero-mean noise random variable with variance \(\sigma^2\), the bias-variance decomposition becomes:

\[ \mathbb{E}_{\mathcal{D}} \left [ \left (h^{(\mathcal{D})}(x_0) - y_0 \right)^2 \right ] = \text{var} + \text{bias}^2 + \sigma^2 \]

which can be expanded as:

\[\begin{align*} \mathbb{E}_{\mathcal{D}} \left [ \left( h^{(\mathcal{D})}(x_0) - f(x_0) - \epsilon \right)^2 \right ] &= \underbrace{\mathbb{E}_{\mathcal{D}} \left [ \left (h^{(\mathcal{D})}(x_0) - \bar{h}(x_0) \right)^2 \right ]}_{\text{variance}} \\ &+ \underbrace{ \big( \bar{h}(x_0) - f(x_0) \big)^2 }_{\text{bias}^2} \\ &+ \underbrace{\sigma^2}_{\text{noise}} \end{align*}\]
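To see where the three terms come from, write \(a = h^{(\mathcal{D})}(x_0) - \bar{h}(x_0)\) and \(b = \bar{h}(x_0) - f(x_0)\), so that \(h^{(\mathcal{D})}(x_0) - y_0 = a + b - \epsilon\). Since \(\mathbb{E}_{\mathcal{D}}[a] = 0\), \(\mathbb{E}[\epsilon] = 0\), and the noise is assumed independent of \(\mathcal{D}\), every cross term vanishes in expectation:

\[\begin{align*} \mathbb{E} \left[ (a + b - \epsilon)^2 \right] &= \mathbb{E}[a^2] + b^2 + \mathbb{E}[\epsilon^2] + 2b \, \mathbb{E}[a] - 2 \, \mathbb{E}[a] \, \mathbb{E}[\epsilon] - 2b \, \mathbb{E}[\epsilon] \\ &= \mathbb{E}_{\mathcal{D}} \left[ \left( h^{(\mathcal{D})}(x_0) - \bar{h}(x_0) \right)^2 \right] + \big( \bar{h}(x_0) - f(x_0) \big)^2 + \sigma^2 \end{align*}\]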

Suppose we use our training set to fit only one model; if this is a decision tree, we know it will tend to suffer from high variance, and hence there is a strong chance this model will be wildly off from the true model. Using bagging, however, we are able to reduce that variance.

Now, will bagging increase bias? We know that, in general, reducing variance comes at the cost of increasing bias. However, take a closer look at the bias term:

\[ \text{bias}^2(x_0) = \big( \bar{h}(x_0) - f(x_0) \big)^2 \]

Notice that the bias does not depend directly on the individual data sets. Sure, there is an implicit dependency through the average model; however, there is no \(\mathcal{D}\) in the bias formula above. Hence, with bagging, we can decrease variance without sacrificing bias.
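As a quick sanity check (a sketch only, reusing the hypothetical `f`, `draw_dataset`, `bagged_models`, and `bagged_predict` helpers from the earlier snippets), we can simulate many learning sets, fit a single tree and a bagged ensemble on each, and compare the estimated variance and squared bias of their predictions at a fixed query point:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x0 = np.array([[0.25]])                  # fixed query point
true_value = float(f(x0).ravel()[0])     # assumed true function value at x0

single_preds, bagged_preds = [], []
for rep in range(200):                   # 200 simulated learning sets
    X_m, y_m = draw_dataset(n=100)
    single_preds.append(DecisionTreeRegressor().fit(X_m, y_m).predict(x0)[0])
    bagged_preds.append(bagged_predict(bagged_models(X_m, y_m, M=50, seed=rep), x0)[0])

for name, preds in [("single tree", single_preds), ("bagged", bagged_preds)]:
    preds = np.array(preds)
    print(name, "variance:", preds.var(), "bias^2:", (preds.mean() - true_value) ** 2)
```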

Figure 37.4: Ensemble of models, and bagged model

Surely there is still a cost somewhere? Well, there is. On one hand, we lose interpretability: if we apply bagging with decision trees, we obtain a bagged forest in which there is no single “global tree” whose partitions we can visualize and follow. On the other hand, there is a higher computational cost associated with bagging, since we have to resample and fit a model \(M\) times. Luckily, there is a “cheat” for this: parallel processing. The resample-and-fit procedure is highly parallelizable; we could run different parts on different machines and combine the results at the end. Hence, if we have enough computational resources, this should not be a huge problem.
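In fact, off-the-shelf implementations already exploit this. For instance, scikit-learn’s `BaggingRegressor` fits its bootstrap models (decision trees by default) and can train them in parallel via `n_jobs`; a minimal sketch, with illustrative hyperparameters only:

```python
from sklearn.ensemble import BaggingRegressor

# 100 bootstrap models, trained in parallel on all available cores.
bagger = BaggingRegressor(n_estimators=100, n_jobs=-1, random_state=0)
bagger.fit(X, y)               # X, y: the original learning set D
y0_hat = bagger.predict(x0)    # averaged prediction at a query point
```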