# 34 Binary Splits and Impurity

As we mentioned in the last part of the previous chapter, the building process of a binary decision tree requires three main components:

Establish the set of admissible divisions for each node: compute all possible binary splits.

Define a criterion to find the “best” split of a node: some criterion to be optimized. This, in turn, requires a measure of heterogeneity (within a given node).

Determine a rule to declare a given node as either an internal node or a leaf node.

In this chapter we’ll review these components in more detail.

## 34.1 Binary Partitions

In binary segmentation trees, the set of admissible splits for each node depends on the nature of the predictor features in the data. Predictor variables can be of different types, not only of a categorical nature but also continuous. We can divide them into four classes:

- Binary variables
- Ordinal variables
- Nominal variables
- Continuous variables

The number of possible binary splits for each type of segmentation variable is described below.

### 34.1.1 Splits of Binary Variables

Because binary variables have two values, they only generate one binary
partition. For example, say we have a variable *Gender* with two values,
female and male. This feature can only be divided in one possible way:

### 34.1.2 Splits of Nominal Variables

A nominal variable may have many categories and the number of binary splits depends on the number of distinct categories for the corresponding variable. The total number of possible binary partitions for a nominal variable with \(q\) categories is

\[ 2^{q-1} - 1 \]

For example, suppose a variable *Color* with three
categories: blue, white and red. The number of binary splits for this nominal
feature is:

\[ 2^{3-1} - 1 = 3 \]

### 34.1.3 Splits of Ordinal Variables

Binary splits for ordinal variables must respect the order property of the categories, that is, the grouping of categories must preserve the order among the values. The total number of binary partitions for an ordinal variable with \(q\) categories is

\[ q-1 \]

For example, the *Apparel size* may have four
categories: small (S), medium (M), large (L) and extra-large (XL). In this
case, the number of binary splits will be

\[ 4-1 = 3 \]

### 34.1.4 Splits of Continuous Variables

The treatment of continuous variables depends on the number of distinct values the variable takes. Suppose a continuous variable takes \(q\) different values. If we consider \(q\) to be “adequate” in some sense, we can treat the continuous variable like an ordinal variable. If we consider \(q\) to be large, we may group its values to reduce their number to \(q^*\) (\(q^* < q\)). In both cases, the variable is treated as an ordinal variable and the number of binary splits will be

\[ q-1 \quad \text{or} \quad q^*-1 \]

For example, imagine the continuous variable *Age*
(measured in years) with five values 5, 7, 8, 9 and 10. The number of binary
splits is

\[ 5 - 1 = 4 \]

The following table shows the number of binary splits depending on the type of scale for a given feature.

| Scale | Order matters? | Binary splits |
|---|---|---|
| Binary (\(q = 2\) categories) | No | 1 |
| Nominal (\(q > 2\) categories) | No | \(2^{q-1} - 1\) |
| Ordinal (\(q > 2\) categories) | Yes | \(q - 1\) |
| Continuous (\(q\) values) | Yes | \(q - 1\) |
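The counts in the table above can be expressed as a small helper function (a sketch; the name `n_binary_splits` is ours, not from the text):

```python
def n_binary_splits(scale, q):
    """Number of admissible binary splits for a feature with q
    distinct categories (or values), by measurement scale."""
    if scale == "binary":
        return 1
    if scale == "nominal":
        # any nonempty proper subset, up to symmetry
        return 2 ** (q - 1) - 1
    # ordinal and continuous: splits must preserve the order
    return q - 1

print(n_binary_splits("nominal", 3))     # Color example: 3
print(n_binary_splits("ordinal", 4))     # Apparel size example: 3
print(n_binary_splits("continuous", 5))  # Age example: 4
```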

## 34.2 Measures of Impurity

The second aspect behind any binary tree building process has to do with determining a criterion to find the “best” node split. This, in turn, requires some sort of optimization criterion. The general idea is to find the node partition that gives the largest reduction in the amount of heterogeneity (or variability) of a parent node.

For illustrative purposes, let us focus on a single node with \(n = 14\) objects, divided into \(2\) classes; one class denoted as \(\bigcirc\), the other denoted as \(\Delta\); both classes having \(7\) objects. Suppose this node splits in the following manner:

We have an intuitive feeling that the parent node has more heterogeneity than its child nodes. In fact, this is a general result: a split cannot result in child nodes with more heterogeneity than the parent node. Stated differently, you can never do worse than the parent node.

In the parent node of the previous figure, the proportion of \(\bigcirc\) equals the proportion of \(\Delta\). In other words, there is an equal probability of being a \(\bigcirc\) as there is to be a \(\Delta\). However, in the left node, there is a higher chance of being a \(\bigcirc\) than \(\Delta\) (the reverse is true of the right node).

Most measures of heterogeneity (and therefore homogeneity) relate to proportions of classes in a parent node. Let’s spitball some ideas as to how to do this:

- \(\mathrm{prop}(\bigcirc) - \mathrm{prop}(\Delta)\); you would need to assign some sort of interpretation to the value \(0\) in this case.

- Entropy (information, as viewed from a computer science standpoint). Could also refer to this as a measure of “disorder” within a node.

- Impurity (how mixed, diverse a node is).

### 34.2.1 Entropy

The first type of measure of heterogeneity (or impurity) that we’ll study is
the so-called *Entropy*.

Let’s start with an (abstract) toy example: say we have a box containing 12 objects, all of the same class (say, \(\bigcirc\)).

If we randomly select an object from this box, we have zero uncertainty about
its class: in other words, there is **zero entropy** associated with this
situation.

Now, suppose three of those \(\bigcirc\)’s were replaced with \(\square\)’s.

If we were to randomly draw an object from this box, there would be *some*
amount of uncertainty as to its class; that is, there is **some entropy**.
Probabilistically speaking, there’s a chance of 9/12 that the selected object
is \(\bigcirc\), and a chance of 3/12 that the drawn object is \(\square\).

Finally, in the most extreme case, say we have 12 objects, all of different classes, like in the diagram below:

We are quite uncertain as to the classes of objects drawn from this box;
that is, we have **maximum entropy**.

In short: the intuition behind entropy relates to how “mixed” a box is, with respect to the classes contained inside. If a box is “pure” (i.e. has only one class), it has zero entropy, whereas if a box is “highly impure” (i.e. has many different classes), it has some nonzero amount of entropy.

Now, let’s see how we can link our idea of entropy with the idea of
*probability*. From the examples above, we can see the following relationship
when randomly selecting an object from the box:

\[ \begin{array}{lcl} \text{large probability of guessing class} & \Rightarrow & \text{small entropy value} \\ \text{small probability of guessing class} & \Rightarrow & \text{large entropy value} \\ \end{array} \]

Hence, it makes sense to try and model entropy as a function of \(P\), the probability of selecting an object of a certain class. From our discussion above, we see that our entropy function should be small when \(P\) is around 1, and large when \(P\) is around \(0\).

There are quite a few functions that display this behavior. The one we will use is the logarithm, of any base \(b\). Note that

\[ \log_b(P) : [0, 1] \to (-\infty, 0] \]

That is, since our probability values are restricted to lie in the unit interval, the entropy will lie somewhere on the negative real axis.

### 34.2.2 The Math Behind Entropy

Letting \(H(\text{node})\) denote the entropy of a particular node, we use the following definition:

\[ H(\text{node}) = - \sum_{k=1}^{K} p_k \log_2(p_k) \]

where \(p_k\) denotes the probability of randomly selecting an object that comes from class \(k\). Note the negative sign in the above formula; the reason for this negative sign is to have positive values from logarithms of probabilities. By convention, we use a log of base 2. The rationale behind picking \(2\) comes from information theory, as base-2 logs can be interpreted in terms of bits. Additionally, if \(p_j = 0\) for some class \(j \in \{1, \dots, K\}\), we set \(p_j \log_b(p_j) = 0\) because \[ \lim_{p_j \to 0} \left[ p_j \log_b(p_j) \right] = 0 \] In this way, we never have to directly evaluate \(\log_b(0)\).

Let’s see if this definition of entropy is consistent with our intuition. For
example, let us consider a **pure** node; that is, a node in which all
objects share the same class. In this case \(p_k = 1\), and we have only one
class so

\[\begin{align*} H(\text{node}) &= - p_1 \cdot \log_2 (p_1) \\ &= -(1) \cdot \log_2(1) \\ &= -1 \cdot 0 = 0 \end{align*}\]

Now, for our second example above, we have \(K = 2\) classes with \(p_1 = 9/12\) and \(p_2 = 3/12\), so that

\[\begin{align*} H(\text{node}) &= - p_1 \log_2 (p_1) - p_2 \log_2 (p_2) \\ &= - \frac{9}{12} \log_2\left( \frac{9}{12} \right) - \frac{3}{12} \log_2\left( \frac{3}{12} \right) \approx 0.8113 \end{align*}\]

In a node where there is a 50-50 split, we have

\[\begin{align*} H(\text{node}) &= - p_1 \log_2 (p_1) - p_2 \log_2 (p_2) \\ &= - \frac{1}{2} \log_2\left( \frac{1}{2} \right) - \frac{1}{2} \log_2\left( \frac{1}{2} \right) = 1 \end{align*}\]
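The computations above can be checked with a short function (a sketch; `props` holds a node’s class proportions, and the `p > 0` filter implements the \(p_j \log_2(p_j) = 0\) convention for empty classes):

```python
from math import log2

def entropy(props):
    """Entropy of a node given its class proportions (base-2 logs).
    Terms with p == 0 contribute 0, by the limit convention."""
    return -sum(p * log2(p) for p in props if p > 0)

print(entropy([1.0]))          # pure node: 0
print(entropy([9/12, 3/12]))   # about 0.811
print(entropy([1/2, 1/2]))     # 50-50 split: 1.0
```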

The following figure depicts several examples of sets and their entropy values.

### 34.2.3 Gini Impurity

The second type of measure of heterogeneity (or impurity) that we’ll describe
is the so-called *Gini* impurity, also referred to as the Gini index.

Let us return to the second box example from the previous section. There are 9 \(\bigcirc\)’s and 3 \(\square\)’s.

Let’s suppose I randomly select a ball, and note what class it belongs to. I then ask you (without giving you any information about the box, other than the fact that there are \(\bigcirc\)’s and \(\square\)’s), to guess what class my selected object belongs to.

There are 4 possible outcomes to this situation:

- I select \(\bigcirc\), you guess \(\bigcirc\).
- I select \(\bigcirc\), you guess \(\square\).
- I select \(\square\), you guess \(\bigcirc\).
- I select \(\square\), you guess \(\square\).

The Gini index pays attention to the **misclassifications**; that is, the
situations in which your guess was wrong. For instance, in our example above,
there is a total of 2 possible misclassifications:

- I select \(\bigcirc\), you guess \(\square\).
- I select \(\square\), you guess \(\bigcirc\).

The Gini Index involves the probability of misclassification of an object randomly selected from the box. It is computed as follows:

\[\begin{align*} \text{Gini Index} & = \underbrace{ Prob(\bigcirc) }_{\text{me}} \underbrace{ Prob(\square) }_{\text{you}} + \underbrace{ Prob(\square) }_{\text{me}} \underbrace{ Prob(\bigcirc) }_{\text{you}} \\ & = \left( \frac{9}{12} \right) \left( \frac{3}{12} \right) + \left( \frac{3}{12} \right) \left( \frac{9}{12} \right) = \frac{3}{8} = 0.375 \end{align*}\]

For a given node containing objects of \(K\) classes, the Gini impurity is given by:

\[ Gini(\text{node}) = \sum_{k=1}^{K} p_k (1 - p_k) \]

Doing a bit of algebra, it is possible to find alternative formulas for the Gini impurity:

\[\begin{align*} Gini(\text{node}) &= \sum_{k=1}^{K} p_k (1 - p_k) \\ &= 1 - \sum_{k=1}^{K} p^{2}_{k} \\ &= \sum_{k \neq \ell} p_k p_{\ell} \end{align*}\]
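The definition and its algebraic equivalents are easy to verify numerically (a sketch; `props` holds a node’s class proportions):

```python
def gini(props):
    """Gini impurity: probability that a randomly drawn object is
    misclassified by a random guess with the same class proportions."""
    return sum(p * (1 - p) for p in props)

def gini_alt(props):
    # equivalent form: 1 minus the sum of squared proportions
    return 1 - sum(p * p for p in props)

print(gini([9/12, 3/12]))   # box with 9 circles, 3 squares: 0.375
print(gini([1/2, 1/2]))     # balanced two-class node: 0.5
```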

#### Some Examples

For illustration purposes, let’s compute the Gini impurity of a node containing 12 objects, all of the same class:

\[\begin{align*} Gini(\text{node}) &= \sum_{k=1}^{1} p_k (1 - p_k) \\ &= \left( \frac{12}{12} \right) \left( \frac{0}{12} \right) = 0 \end{align*}\]

Now consider a balanced node with objects of two classes.

The Gini impurity in this case is:

\[\begin{align*} Gini(\text{node}) &= \sum_{k=1}^{2} p_k (1 - p_k) \\ &= \left( \frac{6}{12} \right) \left( \frac{6}{12} \right) + \left( \frac{6}{12} \right) \left( \frac{6}{12} \right) = 0.5 \end{align*}\]

The following figure depicts several examples of sets and their Gini impurities.

### 34.2.4 Variance-based Impurity

So far we’ve been considering measures of impurity when we have a categorical response. But what about a quantitative response? What measure of impurity can we use in this case?

The typical way in which we measure the amount of impurity of a node, with respect to a quantitative response \(Y\), is by computing the sum of squared deviations from the mean (or alternatively the variance).

For instance, consider a parent node denoted by \(N\). The sum of squared deviations from the mean is given by:

\[ \text{rss} = \sum_{i \in N} (y_i - \bar{y})^2 \]

where \(\bar{y}\) is the mean response for the individuals in node \(N\).

Now, suppose we are looking for the best binary split among a set of features \(X_1, \dots, X_m\).

Say we take predictor \(X_j\), and we are evaluating a splitting condition given by the cutoff value \(s\). This means that \(s\) splits the feature space in two regions \(\mathcal{R_1}\) and \(\mathcal{R_2}\):

\[ \mathcal{R_1}(j,s) = \{ X | X_j < s\} \quad \mathrm{and} \quad \mathcal{R_2}(j,s) = \{ X | X_j \geq s \} \]

The goal is to find the feature \(j\), and the value \(s\) that minimize the following equation:

\[ \sum_{i \in \mathcal{R_1}} (y_i - \bar{y}_{\mathcal{R_1}})^2 + \sum_{i \in \mathcal{R_2}} (y_i - \bar{y}_{\mathcal{R_2}})^2 \]

where \(\bar{y}_{\mathcal{R_1}}\) is the mean response for the individuals in region \(\mathcal{R_1}(j,s)\), and \(\bar{y}_{\mathcal{R_2}}\) is the mean response for the individuals in region \(\mathcal{R_2}(j,s)\).
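This search can be sketched as a brute-force loop over features and cutoffs (an illustrative sketch, not an optimized implementation; the name `best_split` is ours):

```python
def best_split(X, y):
    """Exhaustive search for the (feature j, cutoff s) pair minimizing
    the total within-region sum of squared deviations.
    X: list of feature rows, y: list of responses."""
    def rss(vals):
        # sum of squared deviations from the mean of vals
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            if not left or not right:
                continue  # skip splits that leave a region empty
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best  # (total rss, feature index j, cutoff s)
```

For example, with one feature and responses that jump from 1 to 5 between \(x = 3\) and \(x = 10\), the search recovers the cutoff \(s = 10\) with zero residual sum of squares.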

When \(j\) and the value \(s\) have been identified, the predicted response \(\hat{y}_0\) of a new observation \(\mathbf{x}_0\) is calculated as the mean response of the in-sample individuals in the region to which \(\mathbf{x}_0\) belongs.