38 Random Forests

38.1 Introduction

Random Forests are another creation of Leo Breiman, co-developed with Adele Cutler in the late 1990s. Like bagging, random forests are documented on Leo's archived academic website:


As the name indicates, a random forest entails a set of trees. The building blocks are still single trees, but instead of fitting just one tree, we grow many of them. Depending on the type of response variable being analyzed, you will end up with a forest of regression trees or a forest of classification trees.

Conceptually, growing many trees is something that can also be done with bagging, in which case we talk about a bagged forest.

Several data sets and fitted models

Figure 38.1: Several data sets and fitted models

The starting point is the same as in bagging: use of bootstrap samples to generate multiple learning data sets \(\mathcal{D}_1, \dots, \mathcal{D}_B\), which in turn will be used to grow several trees, like in the following diagram:

Conceptual diagram of a Random Forest

Figure 38.2: Conceptual diagram of a Random Forest
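The bootstrap step in the diagram can be sketched in a few lines. A minimal illustration in Python (the names `rng`, `D`, and `boot_samples` are ours, and an array of indices stands in for the learning set):

```python
import numpy as np

rng = np.random.default_rng(42)
n, B = 8, 3            # size of the learning set, number of bootstrap samples
D = np.arange(n)       # indices standing in for the learning set

# each bootstrap sample D_b: draw n points from D with replacement
boot_samples = [rng.choice(D, size=n, replace=True) for _ in range(B)]
```

Each `boot_samples[b]` typically repeats some indices and omits others, which is exactly what makes the out-of-bag machinery discussed later in this chapter possible.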

So far, so good.

Now, what exactly makes a random forest different from a bagged forest? The answer has to do with the way in which single trees are grown in a random forest. We are going to introduce a twist in the tree-building algorithm; more specifically, we are going to modify the node-splitting procedure. Assuming we have \(p\) features, we predetermine a number \(k \ll p\) such that at each node, \(k\) features are randomly selected (from the set of \(p\) predictors) to find the best split.

The fundamental reason to introduce another random sampling operation, repeated for every node partition, is to attempt to reduce the number of correlated predictors that could potentially be present in the construction of a single tree. This seemingly awkward operation turns out to pay high dividends when a forest is deployed in practice.

38.2 Algorithm

Say we decide to grow \(B\) trees (e.g. \(B = 100\)). Here is how we would grow our forest with trees \(T_1, T_2, \dots, T_B\):

  • Specify a number \(k \ll p\) of features to be used in each node-splitting operation.

  • For each \(b = 1, 2, \dots, B\), draw a bootstrap sample \(\mathcal{D}_b\) of size \(n\) (that is, sample with replacement) from the learning set \(\mathcal{D}\), where \(n\) is the size of the learning set.

  • Grow a tree \(T_b\) using the bootstrapped data \(\mathcal{D}_b\), by recursively repeating the following steps at each node:

      1. Randomly select \(k\) features from the \(p\) available features.

      2. Pick the feature \(X_{\ell}\), among the \(k\) selected features, that produces the best split.

      3. Split the node into two child nodes according to the rule obtained in step 2.

  • Output the set of trees, that is, the forest \(\{T_1, T_2, \dots, T_B\}\) (also known as the ensemble of trees).
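The algorithm above can be sketched in code. The following is a minimal illustration, not a production implementation: to keep it short, each tree is a depth-one tree (a stump), so the per-node feature sampling happens exactly once per tree; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_stump(X, y, feat_idx):
    # Among the k candidate features, find the (feature, threshold)
    # pair minimizing the summed squared error of the two children.
    best = None
    for j in feat_idx:
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            left, right = y[mask], y[~mask]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t, left.mean(), right.mean())
    return best[1:]  # (feature, threshold, left mean, right mean)

def grow_forest(X, y, B=25, k=None):
    n, p = X.shape
    k = k if k is not None else int(np.ceil(np.sqrt(p)))  # default k = ceil(sqrt(p))
    forest = []
    for _ in range(B):
        rows = rng.integers(0, n, size=n)              # bootstrap sample D_b
        feats = rng.choice(p, size=k, replace=False)   # k of the p features
        forest.append(best_stump(X[rows], y[rows], feats))
    return forest

def predict(forest, x0):
    # average the B single-tree predictions
    return float(np.mean([lm if x0[j] <= t else rm for j, t, lm, rm in forest]))
```

A full implementation would recurse on each child node, redrawing the \(k\) candidate features at every split.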

Here, \(k\) is a tuning parameter. The good news, however, is that we typically set \(k = \lceil \sqrt{p} \rceil\): take the square root of \(p\) and round up (some authors prefer to round down, \(k = \lfloor \sqrt{p} \rfloor\)). Why is this good news? Because with this formula for \(k\), we don't really need to tune anything.

There is a theoretical justification for the choice

\[ k = \lceil \sqrt{p} \rceil \]

This choice gives us a good chance that the random feature subsamples are very nearly independent.
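As a quick worked example of the rounding rule (the numbers here are ours):

```python
import math

p = 13                             # suppose we have 13 features
k_up = math.ceil(math.sqrt(p))     # sqrt(13) is about 3.61, rounded up
k_down = math.floor(math.sqrt(p))  # rounded down instead
```

so \(k\) would be 4 (rounding up) or 3 (rounding down).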

38.2.1 Two Sources of Randomness

Randomness in the data points

  • Each tree gets to see a different data set (that is, no tree gets to train on the entire dataset)

  • Behind this, we have the notion of bagging (recall, this is a portmanteau of Bootstrap and Aggregating)

Randomness in the features/predictors

  • Each tree gets to see a different set of features (that is, no tree gets to train on the entire set of features)

  • This is the distinguishing feature of random forests. Specifically, it is in this step that the trees become de-correlated. Note that this source of randomness is not present in bagging.

38.2.2 Regression and Classification Forests

Once you grow a random forest, how does it get used to make predictions? Let's consider each type of prediction based on the kind of response variable.

Forest of Regression Trees

Suppose we are in a regression setting. Given a test point \((\mathbf{x_0}, y_0)\), we pass this query point to all the trees. We would obtain \(B\) predictions for \(y_0\): denote these by \(\hat{y}_0^{(1)}, \hat{y}_0^{(2)}, \dots, \hat{y}_0^{(B)}\).

Random forest of regression trees

Figure 38.3: Random forest of regression trees

To get the predicted value \(\hat{y}_0\), we would then average these single-tree predictions:

\[ \hat{y}_0 = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_0^{(b)} \]
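A one-line illustration of this averaging, with hypothetical single-tree predictions:

```python
import numpy as np

# hypothetical single-tree predictions for y_0 from B = 4 trees
tree_preds = np.array([2.1, 1.8, 2.4, 2.0])

y0_hat = tree_preds.mean()  # (2.1 + 1.8 + 2.4 + 2.0) / 4 = 2.075
```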

Forest of Classification Trees

Suppose now that we are in a classification setting with a binary response, say class “red” and class “yellow”. Given a test point \((\mathbf{x_0}, y_0)\), we pass this query point to all the trees:

Random forest of classification trees

Figure 38.4: Random forest of classification trees

The single-tree predictions \(\hat{y}_0^{(b)}\) would be obtained by examining the distribution of each class within each terminal node.

In this hypothetical example (see the diagram above), \(\hat{y}_0^{(1)} = \verb|red|\), \(\hat{y}_0^{(2)} = \verb|yellow|\), \(\hat{y}_0^{(3)} = \verb|red|\), and \(\hat{y}_0^{(B)} = \verb|red|\). We would then assign to \(\hat{y}_0\) the mode of all these single-tree predictions:

\[ \hat{y}_0 = \mathrm{mode} \left\{ T_b(\mathbf{x_0}) : b = 1, \dots, B \right\} \]
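In code, the majority vote is just the mode of the vote list; a small sketch with hypothetical votes:

```python
from collections import Counter

# hypothetical single-tree class predictions for the query point
votes = ["red", "yellow", "red", "red"]

y0_hat = Counter(votes).most_common(1)[0][0]  # the majority class
```

Here `y0_hat` is `"red"`, matching the hand computation above.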

38.2.3 Key Advantage of Random Forests

Let’s first start by recapping what we typically do with most techniques covered in this book. We start with a dataset \(\mathcal{D}\), and then split it into some number of sub-datasets:

\[ \mathcal{D} \to \begin{cases} \mathcal{D}_{\text{training}} \\ \mathcal{D}_{\text{testing}} \\ \end{cases} \hspace{5mm} \text{or} \hspace{5mm} \mathcal{D} \to \begin{cases} \mathcal{D}_{\text{training}} \\ \mathcal{D}_{\text{validation}} \\ \mathcal{D}_{\text{testing}} \\ \end{cases} \]

In bagging, as well as in random forests, we do NOT take this approach. Instead, we rely on the so-called Out-of-Bag error (OOB, for short). For example, suppose we have a forest of 10 trees \(T_1, T_2, \dots, T_{10}\).

Out-of-Bag points in Random Forest

Figure 38.5: Out-of-Bag points in Random Forest

Not every point \((\mathbf{x}_i, y_i)\) is used by every tree. For instance, say a data point \((\mathbf{x}_i, y_i)\) is never used in growing trees \(T_2, T_3, T_5,\) and \(T_8\). This means that \((\mathbf{x}_i, y_i)\) is effectively a test point for these trees. Hence, we can plug \((\mathbf{x}_i, y_i)\) into these trees to obtain predictions \(\hat{y}_i^{(2)}, \hat{y}_i^{(3)}, \hat{y}_i^{(5)},\) and \(\hat{y}_i^{(8)}\). We would then compute the OOB error for point \(i\) in the following manner:

\[ \mathrm{OOB}_i = \frac{ (\hat{y}_i^{(2)} - y_i)^2 + (\hat{y}_i^{(3)} - y_i)^2 + (\hat{y}_i^{(5)} - y_i)^2 + (\hat{y}_i^{(8)} - y_i)^2 }{4} \]

More generally, we define the overall OOB error to be

\[ E_{\mathrm{OOB}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{OOB}_i \]

where the OOB\(_i\)’s are computed as above. Here is the key observation: it turns out that \(E_{\mathrm{OOB}}\) is an unbiased estimate of \(E_{out}\).
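The OOB bookkeeping can be sketched as follows. To keep the sketch self-contained we replace each tree by a trivial predictor (its bootstrap-sample mean); the point is only the indexing, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 30, 10
y = rng.normal(size=n)

# tree b is "trained" on the bootstrap index set boot[b];
# here it simply predicts its bootstrap sample's mean response
boot = [rng.integers(0, n, size=n) for _ in range(B)]
tree_pred = [y[idx].mean() for idx in boot]

oob = []
for i in range(n):
    # trees whose bootstrap sample never contained point i
    out_trees = [b for b in range(B) if not (boot[b] == i).any()]
    if out_trees:
        # OOB_i: average squared error of those trees on point i
        oob.append(np.mean([(tree_pred[b] - y[i]) ** 2 for b in out_trees]))

E_oob = float(np.mean(oob))  # overall OOB error
```

With \(B = 10\) trees, each point is out of bag for roughly \(10 \times 0.368 \approx 4\) trees (since a point misses a given bootstrap sample with probability about \(e^{-1} \approx 0.368\)), so almost every point contributes an \(\mathrm{OOB}_i\) term.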


Last but not least, we want to highlight an extremely attractive property of ensemble methods: we get to reduce variance without increasing bias. And with random forests, we get an unbiased estimate of \(E_{out}\) (almost for free!).

Typical error curves of random forests

Figure 38.6: Typical error curves of random forests