# 26 Preamble for Discriminant Analysis

We now turn our attention to a different kind of classification method, one that
belongs to a general framework known as Discriminant Analysis.
In this chapter, we introduce some preliminary concepts that should give you
a better understanding of the ideas underlying discriminant
analysis methods. The starting point involves discussing some aspects that are
typically studied in Analysis of Variance (ANOVA). Overall, we focus on certain
notions and formulas to measure variation
(or dispersion) within classes *and* between classes. Keep in mind that we
won’t describe the inferential aspects that are commonly used in statistical
tests for comparing class means.

## 26.1 Motivation

Let’s consider the famous Iris data set collected by Edgar
Anderson (1935), and used by
Ronald Fisher (1936) in his
seminal paper about Discriminant Analysis:
*The use of multiple measurements in taxonomic problems*.

This data set consists of 5 variables measured on \(n = 150\) iris flowers: there are \(p = 4\) predictors, and one response. The four predictors are:

- `Sepal.Length`
- `Sepal.Width`
- `Petal.Length`
- `Petal.Width`

The response is a categorical (i.e. qualitative) variable that indicates the species of iris flower, with three categories:

- `setosa`
- `versicolor`
- `virginica`

We should say that the iris data set is a classic textbook example:

- clean data
- tidy data
- fairly well-separated classes
- small in size
- good for learning and teaching purposes

Keep in mind that most real data sets won’t be like the iris data.

```
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
```

Some summary statistics:

```
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
```
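For reference, output like the above can be reproduced in base R (a minimal sketch; the `iris` data frame ships with R itself):

```r
# the iris data frame is built into base R
head(iris)     # first six rows, as shown above
summary(iris)  # per-column summary statistics
```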

Let’s look at the distribution of the four input variables without making distinction between species of iris flowers:

As you can tell from the boxplots above, the predictors have different ranges, as well as differently shaped distributions.

Now let’s take into account the group structure according to the species:

You should be able to observe that the boxplots of some predictors are fairly
different between iris species. For example, take a look at the boxplots of
`Petal.Length`, and those of `Petal.Width`. In contrast, predictors like
`Sepal.Length` and `Sepal.Width` have boxplots that are not as different across
species as those of `Petal.Length`.

The same differences can be seen if we take a look at density curves:

### 26.1.1 Distinguishing Species

Let us consider the following question:

**Which predictor provides the “best” distinction between Species?**

- In classification problems, the response variable \(Y\) provides a group or class structure to the data.
- We expect that the predictors will help us to differentiate (i.e. discriminate) between one class and the others.
- The general idea is to look for systematic differences among classes. But how?
- A “natural” way to look for differences is paying attention to class means.

Let’s begin with a single predictor \(X\) and a categorical response \(Y\) measured on \(n\) individuals. Let’s take into account the class structure conveyed by \(Y\):

- Assume there are \(K\) classes (or categories)
- Let \(\mathcal{C_k}\) represent the \(k\)-th class in \(Y\)
- Let \(n_k\) be the number of observations in class \(\mathcal{C_k}\)

Then:

\[ n = n_1 + n_2 + \dots + n_K = \sum_{k=1}^{K} n_k \tag{26.1} \]

The (global) mean value of \(X\) is:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{26.2} \]

Each class \(k\) will have its mean \(\bar{x}_k\):

\[ \bar{x}_k = \frac{1}{n_k} \sum_{i \in \mathcal{C_k}} x_{ik} \tag{26.3} \]

```
#> overall mean of Sepal.Length
#> [1] 5.843333
```
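As a quick sketch in base R (not necessarily the book’s own code), the global mean (26.2) and the class means (26.3) can be computed with `mean()` and `tapply()`:

```r
# global mean of Sepal.Length (eq. 26.2)
xbar <- mean(iris$Sepal.Length)

# one mean per species (eq. 26.3)
xbar_k <- tapply(iris$Sepal.Length, iris$Species, mean)

xbar    # 5.843333
xbar_k  # setosa: 5.006, versicolor: 5.936, virginica: 6.588
```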

Recall that a measure of (global) dispersion in \(X\) is given by the
**total sum of squares** (\(\text{tss}\)):

\[ \text{tss} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{26.4} \]

```
#> total sums-of-squares of Sepal.Length
#> [1] 102.1683
```
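Equation (26.4) translates directly into R; a minimal sketch:

```r
x   <- iris$Sepal.Length
tss <- sum((x - mean(x))^2)  # total sum of squares (eq. 26.4)
tss  # 102.1683
```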

Each class \(k\) will also have its own sum-of-squares \(\text{ss}_k\):

\[ \text{ss}_k = \sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x}_k)^2 \tag{26.5} \]

One way to look for systematic differences between the classes is to compare their means. If there’s no group difference in \(X\), then the group means \(\bar{x}_{k}\) should be similar. If there is really a difference, it is likely that one or more of the mean values will differ.

A useful measure to compare differences among the \(k\) means is the deviation from the overall mean:

\[ \bar{x}_{k} - \bar{x} \]

An effective summary of these deviations is the so-called
**between-group sum of squares** (\(\text{bss}\)) given by:

\[ \text{bss} = \sum_{k=1}^{K} n_k (\bar{x}_{k} - \bar{x})^2 \tag{26.6} \]

```
#> between sums-of-squares of Sepal.Length
#> [1] 63.21213
```
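A sketch of (26.6) in R, weighting each squared mean deviation by its class size:

```r
x <- iris$Sepal.Length
y <- iris$Species
n_k    <- table(y)            # class sizes n_k
xbar_k <- tapply(x, y, mean)  # class means
bss <- sum(n_k * (xbar_k - mean(x))^2)  # eq. 26.6
bss  # 63.21213
```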

To assess the relative magnitude of the between sum of squares (\(\text{bss}\)), we need to compare it to a measure of the “background” variation.

Such a measure of background variation can be formed by combining the class
sums-of-squares into a pooled estimate called the **within-group sum of squares**
(\(\text{wss}\)):

\[ \text{wss} = \sum_{k=1}^{K} \sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x}_k)^2 = \text{ss}_1 + \dots + \text{ss}_K \tag{26.7} \]

```
#> within sums-of-squares of Sepal.Length
#> [1] 38.9562
```
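A sketch of (26.5) and (26.7) in R, computing each class sum-of-squares and pooling them:

```r
x <- iris$Sepal.Length
y <- iris$Species
# ss_k for each class (eq. 26.5)
ss_k <- tapply(x, y, function(v) sum((v - mean(v))^2))
# pooled within-group sum of squares (eq. 26.7)
wss <- sum(ss_k)
wss  # 38.9562
```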

So far we have three types of sums of squares:

\[\begin{align*} \textsf{total} \quad \text{tss} &= \sum_{i=1}^{n} (x_i - \bar{x})^2\\ \textsf{between} \quad \text{bss} &= \sum_{k=1}^{K} n_k (\bar{x}_{k} - \bar{x})^2 \\ \textsf{within} \quad \text{wss} &= \sum_{k=1}^{K} \sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x}_k)^2 \tag{26.8} \end{align*}\]

### 26.1.2 Sum of Squares Decomposition

An important aspect has to do with looking at the squared deviations, \((x_{i} - \bar{x})^2\), in terms of the class structure.

A useful trick is to subtract and add the class mean \(\bar{x}_k\), rewriting each deviation \(x_{i} - \bar{x}\) as:

\[\begin{align*} x_{i} - \bar{x} &= x_{i} - \bar{x}_{k} + \bar{x}_{k} - \bar{x} \\ &= (x_{i} - \bar{x}_{k}) + (\bar{x}_{k} - \bar{x}) \tag{26.9} \end{align*}\]

Squaring these deviations and summing over all observations, the cross products vanish because the deviations within each class sum to zero: \(\sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x}_k) = 0\). We can therefore decompose \(\text{tss}\) in terms of \(\text{bss}\) and \(\text{wss}\) as follows:

\[ \underbrace{\sum_{k=1}^{K} \sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x})^2}_{\text{tss}} = \underbrace{\sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})^2}_{\text{bss}} + \underbrace{\sum_{k=1}^{K} \sum_{i \in \mathcal{C_k}} (x_{ik} - \bar{x}_k)^2}_{\text{wss}} \tag{26.10} \]

In summary:

\[ \text{tss} = \text{bss} + \text{wss} \tag{26.11} \]
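The identity (26.11) can be checked numerically; a quick sketch with `Sepal.Length`:

```r
x <- iris$Sepal.Length
y <- iris$Species
tss <- sum((x - mean(x))^2)
bss <- sum(table(y) * (tapply(x, y, mean) - mean(x))^2)
wss <- sum(tapply(x, y, function(v) sum((v - mean(v))^2)))
all.equal(tss, bss + wss)  # TRUE
```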

## 26.2 Derived Ratios from Sum-of-Squares

We now present two ratios derived from these sums of squares:

- Correlation ratio
- F-ratio

### 26.2.1 Correlation Ratio

The correlation ratio measures the dispersion between groups relative to the total dispersion across all individuals.

The correlation ratio \(\eta^2\) (originally proposed by Karl Pearson) is given by:

\[ \eta^2(X,Y) = \frac{\text{bss}}{\text{tss}} \tag{26.12} \]

- \(\eta^2\) takes values between 0 and 1.
- \(\eta^2 = 0\) represents the special case of no dispersion among the means of the different groups.
- \(\eta^2 = 1\) refers to no dispersion within the respective groups.

```
#> correlation ratio of Sepal Length and Species
#> [1] 0.6187057
```
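A sketch of (26.12) in R:

```r
x <- iris$Sepal.Length
y <- iris$Species
tss  <- sum((x - mean(x))^2)
bss  <- sum(table(y) * (tapply(x, y, mean) - mean(x))^2)
eta2 <- bss / tss  # correlation ratio (eq. 26.12)
eta2  # 0.6187057
```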

### 26.2.2 F-Ratio

With \(\text{tss} = \text{bss} + \text{wss}\), we can also calculate the \(F\)-ratio (proposed by R.A. Fisher):

\[ F = \frac{\text{bss} / (K-1)}{\text{wss} / (n-K)} \tag{26.13} \]

The larger these ratios, the more variability there is between groups relative to within groups.

```
#> F-ratio of Sepal Length and Species
#> [1] 119.2645
```
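A sketch of (26.13) in R; as a sanity check, the same statistic shows up in the output of R’s one-way ANOVA:

```r
x <- iris$Sepal.Length
y <- iris$Species
n <- length(x)
K <- nlevels(y)
bss <- sum(table(y) * (tapply(x, y, mean) - mean(x))^2)
wss <- sum(tapply(x, y, function(v) sum((v - mean(v))^2)))
Fratio <- (bss / (K - 1)) / (wss / (n - K))  # eq. 26.13
Fratio  # 119.2645

# same F statistic via a one-way ANOVA
anova(aov(Sepal.Length ~ Species, data = iris))
```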

The \(F\)-ratio can be used for hypothesis testing purposes. More formally, the null hypothesis postulates that the population means do not differ (\(H_0: \mu_1 = \mu_2 = \dots = \mu_K = \mu\)), versus the alternative hypothesis \(H_1\) that one or more population means differ among the \(K\) normally distributed populations.

Assuming (or knowing) that the \(K\) sampled populations share a common variance \(\sigma^2\), a test statistic to assess the null hypothesis is:

\[ F = \frac{\text{bss} / (K-1)}{\text{wss} / (n-K)} \tag{26.14} \]

which has an \(F\)-distribution with \(K-1\) and \(n-K\) degrees of freedom under the null hypothesis.

#### Example with Iris data

Let’s compute the dispersion decompositions for all predictors, and obtain the correlation ratios and \(F\)-ratios

```
#> correlation ratios
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 0.6187057 0.4007828 0.9413717 0.9288829
```

```
#> F-ratios
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 119.26450 49.16004 1180.16118 960.00715
```
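One way to reproduce the two tables above (a sketch, not necessarily the book’s own code) is to loop over the predictors:

```r
ratios <- sapply(iris[, 1:4], function(x) {
  y   <- iris$Species
  tss <- sum((x - mean(x))^2)
  bss <- sum(table(y) * (tapply(x, y, mean) - mean(x))^2)
  wss <- tss - bss  # by the decomposition (26.11)
  c(eta2 = bss / tss,
    F = (bss / (nlevels(y) - 1)) / (wss / (length(x) - nlevels(y))))
})
round(ratios, 4)
```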

## 26.3 Geometric Perspective

As we’ve been doing throughout most chapters in the book, let’s provide a geometric perspective of what’s going on with the data, and the classification setting behind discriminant analysis. As usual, assume that the objects form a cloud of points in \(p\)-dimensional space.

One of the things that we can do is to look at the average individual, denoted by \(\mathbf{g}\), also known as the global centroid (i.e. the center of gravity of the cloud of points).

The global centroid \(\mathbf{g}\) is the point of averages which consists of the point formed with all the variable means:

\[ \mathbf{g} = [\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p] \tag{26.15} \]

where:

\[ \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \tag{26.16} \]

If all variables are mean-centered, the centroid is the origin

\[ \mathbf{g} = \underbrace{[0, 0, \dots, 0]}_{p \text{ times}} \tag{26.17} \]

Taking the global centroid as a point of reference, we can look at the amount of spread or dispersion in the data.

Assuming centered features, a matrix of total dispersion is given by the
**Total Sums of Squares** (\(\text{TSS}\)):

\[ \text{TSS} = \mathbf{X^\mathsf{T} X} \tag{26.18} \]

Alternatively, we can get the (sample) variance-covariance matrix \(\mathbf{V}\):

\[ \mathbf{V} = \frac{1}{n-1} \mathbf{X^\mathsf{T} X} \tag{26.19} \]
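A sketch of (26.18) and (26.19) in R, using the mean-centered iris predictors:

```r
# mean-centered n x p matrix of predictors
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
TSS <- t(X) %*% X           # eq. 26.18
V   <- TSS / (nrow(X) - 1)  # eq. 26.19; matches cov()
all.equal(V, cov(iris[, 1:4]), check.attributes = FALSE)  # TRUE
```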

### 26.3.1 Clouds from Class Structure

Here’s some notation that we’ll be using while covering discriminant methods:

- Let \(n_k\) be the number of observations in the \(k\)-th class \(\mathcal{C_k}\)
- Let \(x_{ijk}\) represent the \(i\)-th observation of the \(j\)-th variable in the \(k\)-th class
- Let \(x_{ik}\) represent the \(i\)-th observation in class \(k\)
- Let \(x_{jk}\) represent the \(j\)-th variable in class \(k\)
- The total number of individuals is \(n = n_1 + n_2 + \dots + n_K = \sum_{k=1}^{K} n_k\)

Let’s now take into account the class structure given by the response variable. Each class is denoted by \(\mathcal{C_k}\), and is formed by \(n_k\) individuals.

We can also look at the local or class centroids (one per class). The class centroid \(\mathbf{g_k}\) is the point of averages for those observations in class \(k\):

\[ \mathbf{g_k} = [\bar{x}_{1k}, \bar{x}_{2k}, \dots, \bar{x}_{pk}] \tag{26.20} \]

where:

\[ \bar{x}_{jk} = \frac{1}{n_k} \sum_{i \in \mathcal{C_k}} x_{ij} \tag{26.21} \]

We can now focus on the dispersion within each class cloud.

Each group will have an associated spread or dispersion matrix given by a
**Class Sums of Squares** (\(\text{CSS}\)):

\[ \text{CSS}_k = \mathbf{X_{k}^{\mathsf{T}} X_k} \tag{26.22} \]

Equivalently, there is an associated variance matrix \(\mathbf{W_k}\) for each class

\[ \mathbf{W_k} = \frac{1}{n_k - 1} \mathbf{X_{k}^{\mathsf{T}} X_k} \tag{26.23} \]

where \(\mathbf{X_k}\) is the data matrix of the \(k\)-th class, mean-centered with respect to its class centroid \(\mathbf{g_k}\).

We can combine the class dispersions to obtain the **Within-class Sums of Squares** (\(\text{WSS}\)) matrix:

\[\begin{align*} \text{WSS} &= \sum_{k=1}^{K} \mathbf{X_{k}^{\mathsf{T}} X_k} \\ &= \sum_{k=1}^{K} \text{CSS}_k \tag{26.24} \end{align*}\]

Likewise, we can combine the class variances \(\mathbf{W_k}\) as a weighted average
to get the **Within-class variance matrix** \(\mathbf{W}\):

\[ \mathbf{W} = \sum_{k=1}^{K} \frac{n_k - 1}{n - 1} \mathbf{W_k} \tag{26.25} \]
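A sketch of (26.22), (26.24), and (26.25) in R; note that each class matrix \(\mathbf{X_k}\) is centered by its own centroid \(\mathbf{g_k}\):

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species

# CSS_k for each class (eq. 26.22), summed into WSS (eq. 26.24)
WSS <- Reduce(`+`, lapply(levels(y), function(lev) {
  Xk <- scale(X[y == lev, ], center = TRUE, scale = FALSE)  # centered by g_k
  t(Xk) %*% Xk
}))

# within-class variance matrix (eq. 26.25)
W <- WSS / (nrow(X) - 1)
```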

What if we focus on just the centroids?

Note that the global centroid \(\mathbf{g}\) can be expressed as a weighted average of the group centroids:

\[\begin{align*} \mathbf{g} &= \frac{n_1}{n} \mathbf{g_1} + \frac{n_2}{n} \mathbf{g_2} + \dots + \frac{n_K}{n} \mathbf{g_K} \\ &= \sum_{k=1}^{K} \left ( \frac{n_k}{n} \right ) \mathbf{g_k} \tag{26.26} \end{align*}\]

Focusing on just the centroids, we can get their corresponding matrix of dispersion
given by the **Between Sums of Squares** (\(\text{BSS}\)):

\[ \text{BSS} = \sum_{k=1}^{K} n_k (\mathbf{g_k - g})(\mathbf{g_k - g})^\mathsf{T} \tag{26.27} \]

Equivalently, there is an associated **Between-class variance matrix** \(\mathbf{B}\):

\[ \mathbf{B} = \sum_{k=1}^{K} \frac{n_k}{n - 1} (\mathbf{g_k - g})(\mathbf{g_k - g})^\mathsf{T} \tag{26.28} \]
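A sketch of (26.27) and (26.28) in R:

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species
g <- colMeans(X)  # global centroid

# BSS: sum of n_k (g_k - g)(g_k - g)^T over classes (eq. 26.27)
BSS <- Reduce(`+`, lapply(levels(y), function(lev) {
  nk <- sum(y == lev)
  d  <- colMeans(X[y == lev, ]) - g  # g_k - g
  nk * tcrossprod(d)
}))

# between-class variance matrix (eq. 26.28)
B <- BSS / (nrow(X) - 1)
```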

#### Three types of Dispersions

Let’s recap. We have three types of sums-of-squares matrices:

- \(\text{TSS}\): Total Sums of Squares
- \(\text{WSS}\): Within-class Sums of Squares
- \(\text{BSS}\): Between-class Sums of Squares

Alternatively, we also have three types of variance matrices:

- \(\mathbf{V}\): Total variance
- \(\mathbf{W}\): Within-class variance
- \(\mathbf{B}\): Between-class variance

### 26.3.2 Dispersion Decomposition

It can be shown (by Huygens’ theorem), for both sums-of-squares and variance matrices, that the total dispersion (\(\text{TSS}\) or \(\mathbf{V}\)) can be decomposed as:

\(\text{TSS} = \text{BSS} + \text{WSS}\)

\(\mathbf{V} = \mathbf{B} + \mathbf{W}\)

Let \(\mathbf{X}\) be the \(n \times p\) mean-centered matrix of predictors, and \(\mathbf{Y}\) be the \(n \times K\) dummy matrix of classes:

\(\text{TSS} = \mathbf{X^\mathsf{T} X}\)

\(\text{BSS} = \mathbf{X^\mathsf{T} Y (Y^\mathsf{T} Y)^{-1} Y^\mathsf{T} X}\)

\(\text{WSS} = \mathbf{X^\mathsf{T} (I - Y (Y^\mathsf{T} Y)^{-1} Y^\mathsf{T}) X}\)
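These matrix formulas can be verified numerically; a sketch in R where the dummy matrix \(\mathbf{Y}\) is built with `model.matrix()`:

```r
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
Y <- model.matrix(~ Species - 1, data = iris)  # n x K dummy matrix
H <- Y %*% solve(t(Y) %*% Y) %*% t(Y)          # projector onto the class indicators

TSS <- t(X) %*% X
BSS <- t(X) %*% H %*% X
WSS <- t(X) %*% (diag(nrow(X)) - H) %*% X

all.equal(TSS, BSS + WSS)  # TRUE: the Huygens decomposition
```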