28 Discriminant Analysis
So far we have considered Discriminant Analysis (DA) from a largely conceptual standpoint. In this lecture, we start to formalize these notions into a mathematical framework that we will call Probabilistic Discriminant Analysis.
28.1 Probabilistic DA
We start with the notion of the Bayes Classifier, which is explained in more detail in the introductory chapter about Classification. Suppose we have a response with \(K\) classes \((\mathcal{C}_1, \dots, \mathcal{C}_K)\), and further suppose we have \(p\) predictors. The Bayes Classifier gives us the following classification rule:
\[ \text{assign } \mathbf{x_i} \text{ to the class for which } P( y_i = k \mid \mathbf{x_i}) \text{ is the largest} \]
This rule is optimal in the sense that it minimizes the misclassification error.
By Bayes' theorem, this conditional probability is given by:
\[ \underbrace{ P(y_i = k \mid \mathbf{x_i}) }_{\text{posterior}} = \frac{ \overbrace{ P(\mathbf{x_i} \mid y_i = k) }^{\text{likelihood}} \overbrace{ P(y_i = k) }^{\text{prior}} }{P(\mathbf{x_i}) } \tag{28.1} \]
where the denominator is obtained as:
\[ P(\mathbf{x_i}) = \sum_{k=1}^{K} P(y_i = k) P(\mathbf{x_i} \mid y_i = k) \tag{28.2} \]
Changing some of the notation, let:
- \(P(y = k) = \pi_k\), the prior probability for class \(k\);
- \(P(X = \mathbf{x} \mid y = k) = f_k(\mathbf{x})\), the class-conditional density for inputs \(X\) in class \(k\).
Thus, the posterior probability (the conditional probability of the response given the inputs) is:
\[ P(y = k \mid X = \mathbf{x}) = \frac{f_k(\mathbf{x}) \hspace{1mm} \pi_k}{\sum_{l=1}^{K} f_l(\mathbf{x}) \pi_l} \tag{28.3} \]
Note that the denominator in the above equation is the same for all classes \(k = 1, \dots, K\); only the numerator changes from class to class.
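To make Equation (28.3) concrete, here is a minimal numeric sketch in Python (the priors and density values below are made up purely for illustration), showing how the posteriors are assembled and how the Bayes rule picks a class:

```python
import numpy as np

# Hypothetical setup: one observation x and K = 3 classes
priors = np.array([0.5, 0.3, 0.2])        # pi_k, must sum to 1
densities = np.array([0.10, 0.40, 0.05])  # f_k(x): class-conditional densities evaluated at x

# Equation (28.3): posterior_k = pi_k * f_k(x) / sum_l pi_l * f_l(x)
numerators = priors * densities
posteriors = numerators / numerators.sum()

print(posteriors)           # roughly [0.278, 0.667, 0.056]
print(posteriors.argmax())  # Bayes rule: pick the class with the largest posterior (index 1 here)
```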
28.1.1 Normal Distributions
Now, here comes one of the key assumptions of Discriminant Analysis: we assume that our class-conditional probabilities follow a Gaussian distribution.
Univariate Data (\(\boldsymbol{p = 1}\))
In the case of univariate data with just one input \(x\), we have that the class-conditional density is a Normal distribution: \(x \mid \mathcal{C_k} \sim N(\mu_k, \sigma_k^2)\).
\[ f_k(x) = \frac{1}{\sqrt{2 \pi \sigma_k^2}} \hspace{1mm} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_k}{\sigma_k} \right)^2 \right\} \tag{28.4} \]
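As a small illustration, the density in Equation (28.4) can be coded directly; the class parameters below are placeholders, not estimates from any real data:

```python
import numpy as np

def normal_density(x, mu, sigma):
    """Class-conditional density f_k(x) for x | C_k ~ N(mu, sigma^2), Equation (28.4)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# Two hypothetical classes evaluated at the same point x = 1.0
print(normal_density(1.0, mu=0.0, sigma=1.0))  # f_1(1.0)
print(normal_density(1.0, mu=3.0, sigma=2.0))  # f_2(1.0)
```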
Multivariate Data (\(\boldsymbol{p > 1}\))
In the case of multivariate data, we use a multivariate normal distribution for our class-conditional density:
\[ \mathbf{x} \mid \mathcal{C}_k \sim \textit{MVN}(\boldsymbol{\mu_k}, \mathbf{\Sigma_k}) \tag{28.5} \]
Thus, the density function is given by:
\[ f_k(\mathbf{x}) = \frac{1}{(2 \pi)^{p/2} | \mathbf{\Sigma_k} |^{1/2} } \exp\left\{ - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu_k})^{\mathsf{T}} \mathbf{\Sigma_k}^{-1} (\mathbf{x} - \boldsymbol{\mu_k}) \right\} \tag{28.6} \]
where \(\mathbf{\Sigma_k}\) denotes the variance-covariance matrix associated with \(\mathcal{C_k}\), and \(\boldsymbol{\mu_k}\) denotes the centroid of class \(\mathcal{C_k}\). Note that the quadratic form in the exponent of the density is simply the squared Mahalanobis distance between \(\mathbf{x}\) and \(\boldsymbol{\mu_k}\):
\[ d^2(\mathbf{x}, \boldsymbol{\mu_k}) = (\mathbf{x} - \boldsymbol{\mu_k})^{\mathsf{T}} \mathbf{\Sigma_k}^{-1} (\mathbf{x} - \boldsymbol{\mu_k}) \tag{28.7} \]
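A minimal numpy sketch of Equations (28.6) and (28.7), with a made-up centroid and covariance matrix, might look as follows:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance d^2(x, mu), Equation (28.7)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def mvn_density(x, mu, Sigma):
    """Multivariate normal density f_k(x), Equation (28.6)."""
    p = len(mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm_const

# Hypothetical bivariate class: centroid mu_k and covariance Sigma_k
mu_k = np.array([0.0, 0.0])
Sigma_k = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
x = np.array([1.0, -1.0])

print(mahalanobis_sq(x, mu_k, Sigma_k))
print(mvn_density(x, mu_k, Sigma_k))
```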
The following interactive plot shows an example of a (pseudo) bivariate normal distribution:
28.1.2 Estimating Parameters of Normal Distributions
Let’s parse through our equation from above. Clearly, in order for this to be of any practical use, we need to estimate some of these quantities:
- prior probabilities: \(\hat{\pi}_k\)
- mean-vectors: \(\boldsymbol{\hat{\mu}_k}\)
- variance-covariance matrices: \(\mathbf{\widehat{\Sigma}_k}\)
Priors
Estimating \(\pi_k\) is relatively intuitive:
\[ \hat{\pi}_k = \frac{n_k}{n} \tag{28.8} \]
where \(n_k = |\mathcal{C}_k|\) denotes the size of class \(k\) and \(n\) denotes the total number of data points.
Mean Vectors
For \(\boldsymbol{\hat{\mu}_k}\), we can use the centroid of \(\mathcal{C}_k\); i.e. the average individual of class \(k\).
\[ \boldsymbol{\hat{\mu}_k} = \mathbf{g_k} \tag{28.9} \]
Variance-Covariance Matrices
For \(\mathbf{\widehat{\Sigma}_k}\), we can use the within-variance matrix:
\[ \mathbf{\widehat{\Sigma}_k} = \frac{1}{n_k - 1} \mathbf{X_k}^\mathsf{T} \mathbf{X_k} \tag{28.10} \]
where \(\mathbf{X_k}\) is the mean-centered data matrix for objects of class \(\mathcal{C_k}\).
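Putting the three estimators together, here is a minimal sketch of how they might be computed from a (hypothetical) data matrix `X` and label vector `y`, following Equations (28.8)–(28.10):

```python
import numpy as np

def estimate_parameters(X, y):
    """Estimate priors, centroids, and per-class covariance matrices (Eqs. 28.8-28.10)."""
    n = len(y)
    params = {}
    for k in np.unique(y):
        X_k = X[y == k]                  # rows belonging to class C_k
        n_k = X_k.shape[0]
        pi_k = n_k / n                   # prior, Equation (28.8)
        g_k = X_k.mean(axis=0)           # centroid, Equation (28.9)
        Xc = X_k - g_k                   # mean-centered class data matrix
        Sigma_k = Xc.T @ Xc / (n_k - 1)  # covariance, Equation (28.10)
        params[k] = (pi_k, g_k, Sigma_k)
    return params

# Hypothetical data: 6 observations, p = 2 predictors, two classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([1, 1, 1, 2, 2, 2])
print(estimate_parameters(X, y))
```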
28.2 Discriminant Functions
Given all of the above estimates, we can now compute an estimate of the posterior probability \(P(y_i = k \mid \mathbf{x_i})\):
\[ \widehat{P(y_i = k \mid \mathbf{x} )} = \frac{\hat{\pi}_k \hspace{1mm} \hat{f_k} (\mathbf{x}) }{\sum_{l=1}^{K} \hat{\pi}_l \hat{f_l} (\mathbf{x}) } \tag{28.11} \]
Now, note that the denominator remains constant across classes. Furthermore, we have a Gaussian expression in our numerator; hence, it makes sense to take logarithms. Ignoring the additive term coming from the denominator, which does not depend on \(k\), we get:
\[ \ln\left[ P(y_i = k \mid \mathbf{x} ) \right] \propto \ln( \hat{\pi}_k) + \ln\left[ \hat{f_k} (\mathbf{x}) \right] \tag{28.12} \]
We now substitute the multivariate normal p.d.f. into our expression above. Dropping the term \(-\frac{p}{2}\ln(2\pi)\), which does not depend on \(k\), we see:
\[\begin{align*} \ln\left[ \hat{f_k}(\mathbf{x}) \right] & = \ln \left[ (2\pi)^{-p/2} | \mathbf{\hat{\Sigma}_k} |^{-1/2} \exp\left\{ - \frac{1}{2} (\mathbf{x} - \boldsymbol{\hat{\mu}_k})^{\mathsf{T}} \mathbf{\hat{\Sigma}_k}^{-1} (\mathbf{x} - \boldsymbol{\hat{\mu}_k}) \right\} \right] \\ & \to \ln\left( | \mathbf{\hat{\Sigma}_k} |^{-1/2} \right) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\hat{\mu}_k})^{\mathsf{T}} \mathbf{\hat{\Sigma}_k}^{-1} (\mathbf{x} - \boldsymbol{\hat{\mu}_k}) \\ & = -\frac{1}{2} \ln\left( | \mathbf{\hat{\Sigma}_k} | \right) - \frac{1}{2} \left[ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \mathbf{x} - 2 \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} + \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} \right] \tag{28.13} \end{align*}\]
Substituting this back into our expression for \(\ln[P(y_i = k \mid \mathbf{x})]\) leads us to the so-called discriminant functions:
\[ \delta_k(\mathbf{x}) = -\frac{1}{2} \ln\left( | \mathbf{\hat{\Sigma}_k} | \right) - \frac{1}{2} \left[ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \mathbf{x} - 2 \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} + \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} \right] + \ln\left( \hat{\pi}_k \right) \tag{28.14} \]
In general, our classification rule is as follows:
\[ \text{assign } \mathbf{x} \text{ to the class for which } \delta_k(\mathbf{x}) \text{ is the largest} \]
28.3 Quadratic Discriminant Analysis (QDA)
Let us identify the order of each term in \(\delta_k(\mathbf{x})\), w.r.t. \(\mathbf{x}\):
\[ \delta_k(\mathbf{x}) = -\frac{1}{2} \underbrace{ \ln\left( | \mathbf{\hat{\Sigma}_k} | \right) }_{\text{constant}}- \frac{1}{2} \left[ \underbrace{ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \mathbf{x} }_{\text{quadratic}} - \underbrace{ 2 \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} + \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}_k^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \right] + \underbrace{ \ln\left( \hat{\pi}_k \right) }_{\text{constant}} \tag{28.15} \]
Of course, we could group terms to obtain an expression of the form \(\text{const} + \text{linear} + \text{quadratic}\). The important thing is that we obtain a quadratic function of \(\mathbf{x}\); this leads us to Quadratic Discriminant Analysis (QDA).
Having a quadratic discriminant function causes the decision boundaries in QDA to be quadratic surfaces.
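As a sketch, the quadratic rule of Equation (28.15) might be coded as follows, assuming the per-class parameters come from an estimation step like the one above (here they are simply made up):

```python
import numpy as np

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """Quadratic discriminant function delta_k(x), Equation (28.15)."""
    diff = x - mu_k
    quad = diff @ np.linalg.solve(Sigma_k, diff)  # the bracket equals the squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(Sigma_k)        # numerically stable log-determinant
    return -0.5 * logdet - 0.5 * quad + np.log(pi_k)

def classify_qda(x, params):
    """Assign x to the class with the largest discriminant score."""
    scores = {k: qda_discriminant(x, *v) for k, v in params.items()}
    return max(scores, key=scores.get)

# Hypothetical per-class parameters: (prior, centroid, covariance matrix)
params = {
    1: (0.5, np.array([1.0, 2.0]), np.array([[0.5, 0.1], [0.1, 0.3]])),
    2: (0.5, np.array([3.0, 0.5]), np.array([[0.4, 0.0], [0.0, 0.2]])),
}
print(classify_qda(np.array([1.1, 1.9]), params))  # close to the first centroid, so class 1
```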
28.4 Linear Discriminant Analysis
What if, in our expression for \(\delta_k(\mathbf{x})\), all of the covariance matrices are the same?
\[ \mathbf{\hat{\Sigma}_1} = \mathbf{\hat{\Sigma}_2} = \dots = \mathbf{\hat{\Sigma}_K} = \mathbf{\hat{\Sigma}} \tag{28.16} \]
In this case, the discriminant function becomes
\[ \delta_k(\mathbf{x}) = \underbrace{ \ln\left( \hat{\pi}_k \right) }_{\text{constant}} - \underbrace{ \frac{1}{2} \ln\left( | \mathbf{\hat{\Sigma}} | \right) }_{\text{no k dependency}} - \frac{1}{2} \left[ \underbrace{ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \mathbf{x} }_{\text{no k dependency}} - \underbrace{ 2 \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} + \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \right] \tag{28.17} \]
In other words, ignoring terms that do not depend on \(k\), we obtain (again, w.r.t. \(\mathbf{x}\))
\[ \delta_k(\mathbf{x}) = \underbrace{ \ln\left( \hat{\pi}_k \right) }_{\text{constant}} - \frac{1}{2} \left[ - \underbrace{ 2 \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} + \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \right] \tag{28.18} \]
That is, our discriminant function is now linear w.r.t. \(\mathbf{x}\); hence, we end up with Linear Discriminant Analysis (LDA) in its most general form.
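Because Equation (28.18) is linear in \(\mathbf{x}\), each class effectively gets a coefficient vector \(\mathbf{\hat{\Sigma}}^{-1}\boldsymbol{\hat{\mu}_k}\) and an intercept. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def lda_discriminant(x, pi_k, mu_k, Sigma):
    """Linear discriminant function, Equation (28.18): linear in x."""
    w_k = np.linalg.solve(Sigma, mu_k)     # coefficient vector Sigma^{-1} mu_k
    b_k = np.log(pi_k) - 0.5 * mu_k @ w_k  # intercept: ln(pi_k) - 0.5 * mu_k' Sigma^{-1} mu_k
    return x @ w_k + b_k

# Hypothetical common covariance matrix, class priors, and centroids
Sigma = np.array([[0.5, 0.1],
                  [0.1, 0.3]])
classes = {1: (0.6, np.array([1.0, 2.0])),
           2: (0.4, np.array([3.0, 0.5]))}

x = np.array([2.0, 1.0])
scores = {k: lda_discriminant(x, pi, mu, Sigma) for k, (pi, mu) in classes.items()}
print(max(scores, key=scores.get))  # class with the largest linear score
```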
28.4.1 Canonical Discriminant Analysis
Let us continue adding more assumptions. We again assume that all covariance matrices are the same across classes, as well as the prior probabilities:
\[ \mathbf{\hat{\Sigma}_1} = \mathbf{\hat{\Sigma}_2} = \dots = \mathbf{\hat{\Sigma}_K} = \mathbf{\hat{\Sigma}} \quad \text{and} \quad \hat{\pi}_1 = \dots = \hat{\pi}_K = \hat{\pi} \tag{28.19} \]
Then, the discriminant functions \(\delta_k(\mathbf{x})\) become:
\[ \delta_k(\mathbf{x}) = \underbrace{ \ln\left( \hat{\pi} \right) }_{\text{no k dependency}} + \underbrace{ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} - \frac{1}{2} \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \tag{28.20} \]
which becomes, after ignoring \(k\)-independent terms,
\[ \delta_k(\mathbf{x}) = \underbrace{ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} - \frac{1}{2} \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \tag{28.21} \]
Now, what is going on here? Since the term \(\mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \mathbf{x}\) is the same for every class, maximizing this expression is equivalent to minimizing the Mahalanobis distance between \(\mathbf{x}\) and the centroid of \(\mathcal{C_k}\), using the within-class variance matrix as the metric matrix. In other words, the above expression is our good old friend CDA!
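To see this explicitly, note the identity
\[ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} - \frac{1}{2} \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\hat{\mu}_k})^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} (\mathbf{x} - \boldsymbol{\hat{\mu}_k}) + \frac{1}{2} \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \mathbf{x} \]
where the last term does not depend on \(k\); hence, maximizing \(\delta_k(\mathbf{x})\) amounts to assigning \(\mathbf{x}\) to the class with the closest centroid in the Mahalanobis sense.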
28.4.2 Naive Bayes
Let’s add one more supposition to the list of assumptions considered so far: equal covariance matrices, equal priors, and now also a diagonal \(\mathbf{\hat{\Sigma}}\), which means that the predictors are uncorrelated:
\[ \mathbf{\hat{\Sigma}} = \begin{pmatrix} Var(X_1) & 0 & \dots & 0 \\ 0 & Var(X_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & Var(X_p) \\ \end{pmatrix} \tag{28.22} \]
So we have the following assumptions:
\[ \mathbf{\hat{\Sigma}_1} = \mathbf{\hat{\Sigma}_2} = \dots = \mathbf{\hat{\Sigma}_K} = \mathbf{\hat{\Sigma}}, \qquad \hat{\pi}_1 = \dots = \hat{\pi}_K = \hat{\pi}, \quad \text{and} \quad \text{diagonal } \mathbf{\hat{\Sigma}} \]
Then, the discriminant function becomes (with diagonal \(\mathbf{\hat{\Sigma}}\)):
\[ \delta_k(\mathbf{x}) = \underbrace{ \mathbf{x}^\mathsf{T} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{linear}} - \frac{1}{2} \underbrace{ \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \mathbf{\hat{\Sigma}^{-1}} \boldsymbol{\hat{\mu}_k} }_{\text{constant}} \tag{28.23} \]
This leads to what is known as Naive Bayes.
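With a diagonal \(\mathbf{\hat{\Sigma}}\), the discriminant reduces to a sum of per-predictor terms, so no matrix inversion is needed. A minimal sketch under these assumptions (the variances and centroids below are hypothetical):

```python
import numpy as np

def naive_bayes_discriminant(x, mu_k, variances):
    """Equation (28.23) with a diagonal covariance matrix: a sum of per-predictor terms."""
    # x' Sigma^{-1} mu_k - 0.5 * mu_k' Sigma^{-1} mu_k, computed coordinate by coordinate
    return np.sum(x * mu_k / variances) - 0.5 * np.sum(mu_k ** 2 / variances)

# Hypothetical shared per-predictor variances and two class centroids
variances = np.array([0.5, 0.3])
centroids = {1: np.array([1.0, 2.0]), 2: np.array([3.0, 0.5])}

x = np.array([1.1, 1.9])
scores = {k: naive_bayes_discriminant(x, mu, variances) for k, mu in centroids.items()}
print(max(scores, key=scores.get))  # class 1: x is closest to its centroid under this metric
```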
28.4.3 Fifth Case
Again assume that \(\mathbf{\hat{\Sigma}_1} = \mathbf{\hat{\Sigma}_2} = \dots = \mathbf{\hat{\Sigma}_K} = \mathbf{\hat{\Sigma}}\) as well as \(\hat{\pi}_1 = \dots = \hat{\pi}_K =: \hat{\pi}\). Also assume that \(\mathbf{\hat{\Sigma}} = \mathbf{I_p}\), the \(p \times p\) identity matrix. In other words, our Mahalanobis distance collapses to the ordinary Euclidean distance.
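Concretely, with \(\mathbf{\hat{\Sigma}} = \mathbf{I_p}\) the discriminant functions reduce to
\[ \delta_k(\mathbf{x}) = \mathbf{x}^\mathsf{T} \boldsymbol{\hat{\mu}_k} - \frac{1}{2} \boldsymbol{\hat{\mu}_k}^{\mathsf{T}} \boldsymbol{\hat{\mu}_k} = -\frac{1}{2} \| \mathbf{x} - \boldsymbol{\hat{\mu}_k} \|^2 + \frac{1}{2} \mathbf{x}^\mathsf{T} \mathbf{x} \]
and, since the last term does not depend on \(k\), we simply assign \(\mathbf{x}\) to the class with the nearest centroid in the Euclidean sense.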
28.5 Comparing the Cases
Here’s a summary table of the five cases discussed above. All cases assume that the class-conditional distributions \(f_k(\mathbf{x})\) follow a Normal or Multivariate-Normal Distribution, depending on the number of input features \(p\).
| Cov Matrix | Priors | Method | Decision Boundary | Flexibility | Bias |
|---|---|---|---|---|---|
| Unequal across classes | Unequal across classes | QDA | quadratic | +++++ | + |
| Equal across classes | Unequal across classes | LDA | linear | ++++ | ++ |
| Equal across classes | Equal across classes | CDA | linear | +++ | +++ |
| Equal across classes, diagonal | Equal across classes | Naive Bayes | linear | ++ | ++++ |
| Equal across classes, identity | Equal across classes | Euclid. Dist. | linear | + | +++++ |
Notice that the cases are listed from most flexible to least flexible. Discriminant methods such as Naive Bayes impose fairly restrictive assumptions. Also, notice that discriminant methods with linear decision boundaries can still differ in their degree of flexibility.