27 Canonical Discriminant Analysis

In this chapter we talk about Canonical Discriminant Analysis (CDA), which is a special case of Linear Discriminant Analysis (LDA). The main reason why we introduce CDA separately, is because this method has a somewhat hybrid learning nature with two aspects:

Semi-supervised: On one hand, CDA has an unsupervised or descriptive aspect that aims to tackle the following question. How to find a representation of the objects which provides the best separation between classes?
Supervised: On the other hand, CDA also has a decisively supervised aspect that tackles the question: How to find the rules for assigning a class to a given object?

We begin with the semi-supervised part of CDA, approaching its exploratory nature from a purely geometric approach. This discussion covers most of the material in this chapter. And then at the end, we present its supervised aspect, describing how to use CDA for classification purposes.

27.1 CDA: Semi-Supervised Aspect

We can formulate the classfication problem behind CDA in a geometric way. For sake of illustration, let’s consider three classes in a 2-dimensional space, like depicted in the figure below.

Figure 27.1: Three classes in 2-dim space

From an exploratory/descriptive perspective, we are looking for a good low dimensional representation that separates the three classes. One option is to consider the axis associated with the predictor \(X_1\), that is, the horizontal axis in the next figure:

Figure 27.2: Axis \(X_1\) separates class 1 from groups 2 and 3

If we project the individuals onto the \(X_1\) axis, class 1 is well separated from classes 2 and 3. However, class 2 is largely confounded with class 3.

Another option is to consider the axis associated with the predictor \(X_2\), that is, the vertical axis in the figure below:

Figure 27.3: Axis \(X_1\) separates class 3 from groups 1 and 2

As you can tell, if we project the individuals onto this axis \(X_2\), class 3 is well separated from classes 1 and 2. However, class 1 is largely confounded with class 2.

Is there an axis or dimension that “best” separates the three clouds?

We can try to look for an optimal representation in the sense of finding an axis that best separates the clouds, like in the following diagram:

Figure 27.4: An axis the best separates all three classes

To summarize, the exploratory aspect of canonical discriminant analysis has to do with seeking a low dimensional representation in which the class of objects are well separated.

27.2 Looking for a discriminant axis

How do we find a low dimensional representation of the objects which provides the best separation between classes? To answer this question, we can look for an axis \(\Delta_u\), spanned by some vector \(\mathbf{u}\), such that the linear combination

\[ Z = u_1 X_1 + u_2 X_2 \tag{27.1} \]

separates all three groups in an adequate way.

Figure 27.5: Looking for a discriminant axis

Algebraically, the idea is to look for a linear combination of the predictors:

\[ \mathbf{z} = \mathbf{Xu} \tag{27.2} \]

that ideally could achieve the following two goals:

Minimize within-class dispersion (\(\text{wss}\)): \(\quad min \{ \mathbf{u^\mathsf{T} W u \}}\)
Maximize between-class dispersion (\(\text{bss}\)): \(\quad max \{ \mathbf{u^\mathsf{T} B u \}}\)

On one hand, it would be nice to have \(\mathbf{u}\), such that the between-class dispersion is maximized. This corresponds to a situation in which the class centroids are well separated:

Figure 27.6: Maximize between-class dispersion

On the other hand, it would also make sense to have \(\mathbf{u}\), such that the within-class dispersion is minimized. This implies having classes in which, on average, the “inner” variation is small (i.e. concentrated local dispersion)

Figure 27.7: Minimize within-class dispersion

Can we find such a mythical vector \(\mathbf{u}\)?

We have not so good news. It is generally impossible to find an axis \(\Delta_u\), spanned by \(\mathbf{u}\), which simoultaneously:

maximizes the between-groups variance
minimizes the within-groups variance

Let’s see a cartoon picture of this issue. Say we have two classes, each one with its centroid.

Figure 27.8: Minimize within-class dispersion

Maximizing the between-class separation, involves choosing \(\mathbf{u}_a\) parallel to the segment linking the centroids. In other words, the direction of \(\mathbf{u}\) is the one that crosses the centroids:

Figure 27.9: Maximize between-class dispersion

Minimizing the within-class separation, involves finding \(\mathbf{u}_b\) perpendicular to the principal axis of the ellipses. This type of \(\mathbf{u}\) does not necessarily crosses through the centroids:

Figure 27.10: Minimize within-class dispersion

In general, we end up with two possibilities \(\mathbf{u}_a \neq \mathbf{u}_b\):

Figure 27.11: Double goal of discriminant analysis … generally impossible

Let’s summarize the geometric motivation behind canonical discriminant analysis. We seek to find the linear combination of the predictors such that the between-class variance is maximized relative to the within-class variance. In other words, we want to find the combination of the predictors that gives maximum separation between the centroids of the data while at the same time minimizing the variation within each class of data. In general, accomplishing these two goals in a simultaneous way is impossible.

27.3 Looking for a Compromise Criterion

So far we have an impossible simultaneity involving a minimization criterion, as well as a maximization criterion:

\[ min \{ \mathbf{u^\mathsf{T} W u} \} \Longrightarrow \mathbf{W u} = \alpha \mathbf{u} \\ \textsf{and} \\ max \{ \mathbf{u^\mathsf{T} B u} \} \Longrightarrow \mathbf{B u} = \beta \mathbf{u} \tag{27.3} \]

What can we do about this simultaneous impossibility?

Well, we need to look for a compromise. This is where the variance decomposition discussed in the previous chapter comes handy:

\[ \mathbf{V} = \mathbf{W} + \mathbf{B} \tag{27.4} \]

Doing some algebra, it can be shown that the quadratic form \(\mathbf{u^\mathsf{T} V u}\) can be decomposed as:

\[ \mathbf{u^\mathsf{T} V u} = \mathbf{u^\mathsf{T} W u} + \mathbf{u^\mathsf{T} B u} \tag{27.5} \]

Again, we are pursuing a dual goal that is, in general, hard to accomplish:

\[ \mathbf{u^\mathsf{T} V u} = \underbrace{\mathbf{u^\mathsf{T} W u}}_{\text{minimize}} + \underbrace{\mathbf{u^\mathsf{T} B u}}_{\text{maximize}} \tag{27.6} \]

We have two options for the compromise:

\[ max \left \{ \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} V u}} \right \} \quad \textsf{OR} \quad max \left \{ \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} W u}} \right \} \tag{27.7} \]

which are actually associated to the two ratios described in the previous chapter: correlation ratio \(\eta^2\), and \(F\)-ratio:

\[ \eta^2 \leftrightarrow \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} V u}}, \qquad F \leftrightarrow\frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} W u}} \tag{27.8} \]

Which option to choose? The good news is that both options are equivalent, so using any one of them should give you similar—although not exactly identical—results.

27.3.1 Correlation Ratio Criterion

Suppose we decide to work with the first criterion. We look for \(\mathbf{u}\) such that:

\[ max \left \{ \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} V u}} \right \} \tag{27.9} \]

This criterion is scale invariant, meaning that we use any scale variation of \(\mathbf{u}\): i.e. \(\alpha \mathbf{u}\). For convenience, we can impose a normalizing restriction: \(\mathbf{u^\mathsf{T} V u} = 1\). Consequently:

\[ max \left \{ \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} V u}} \right \} \Longleftrightarrow max \{\mathbf{u^\mathsf{T} B u}\} \quad \text{s.t.} \quad \mathbf{u^\mathsf{T} V u} = 1 \tag{27.10} \]

Using the method of Lagrangian multiplier:

\[ L(\mathbf{u}) = \mathbf{u^\mathsf{T} B u} - \lambda (\mathbf{u^\mathsf{T} V u} - 1) \tag{27.11} \]

Deriving w.r.t \(\mathbf{u}\) and equating to zero:

\[ \frac{\partial L(\mathbf{u})}{\partial \mathbf{u}} = 2 \mathbf{Bu} - 2 \lambda \mathbf{Vu} = \mathbf{0} \tag{27.12} \]

The optimal vector \(\mathbf{u}\) is such that:

\[ \mathbf{Bu} = \lambda \mathbf{Vu} \tag{27.13} \]

If the matrix \(\mathbf{V}\) is inversible (which it is in general) then:

\[ \mathbf{V^{-1}Bu} = \lambda \mathbf{u} \tag{27.14} \]

that is, the optimal vector \(\mathbf{u}\) is eigenvector of \(\mathbf{V^{-1}B}\). Keep in mind that, in general, \(\mathbf{V^{-1}B}\) is not symmetric.

27.3.2 F-ratio Criterion

If we decide to work with the criterion associated to the \(F\)-ratio, then the criterion to be maximized is:

\[ max \left \{ \frac{\mathbf{u^\mathsf{T} B u}}{\mathbf{u^\mathsf{T} W u}} \right \} \Longleftrightarrow max \{\mathbf{u^\mathsf{T} B u}\} \quad \text{s.t.} \quad \mathbf{u^\mathsf{T} W u} = 1 \tag{27.15} \]

Applying the same Lagrangian procedure, with a multiplier \(\rho\), we have that \(\mathbf{u}\) is such a vector that:

\[ \mathbf{Bu} = \rho \mathbf{Wu} \tag{27.16} \]

and if \(\mathbf{W}\) is invertible (which it is in most cases), then it can be shown that \(\mathbf{u}\) is also eigenvector of \(\mathbf{W^{-1}B}\), associated to eigenvalue \(\rho\):

\[ \mathbf{W^{-1}Bu} = \rho \mathbf{u} \tag{27.17} \]

Likewise, keep in mind that, in general, \(\mathbf{W^{-1}B}\) is not symmetric.

The relationship between the eigenvalues \(\lambda\) and \(\rho\) is:

\[ \rho = \frac{\lambda}{1 - \lambda} \tag{27.18} \]

Why is it important to keep in mind that, in general, both matrices \(\mathbf{V^{-1}B}\) and \(\mathbf{W^{-1}B}\) are not symmetric? Because most software routines that diagonilize a matrix (to find its eigenvalue decomposition) use the Jacobi method, which works with symmetric matrices.

27.3.3 A Special PCA

What do \(\mathbf{u}\) and \(\mathbf{W^{-1}}\) correspond in geometric terms? The vector \(\mathbf{u}\) is the axis from the PCA on the cloud of centroids \(\mathbf{g_1}, \mathbf{g_2}, \dots, \mathbf{g_K}\), but it is an axis on which the points are projected obliquely, not orthogonally.

Figure 27.12: Oblique Projection

Without this obliqueness, corresponding to the equivalent metrics \(\mathbf{V^{-1}}\) and \(\mathbf{W^{-1}}\), this would be a simple PCA performed on the centroids, in which the classes would be less well separated, because of an orthogonal projection.

Figure 27.13: Orthogonal Projection

The implication of using a metric matrix such as \(\mathbf{W^{-1}}\) (or \(\mathbf{V^{-1}}\)), is that the separation of two points depends not only on a Euclidean measurement, but also on the variance and correlation of the variables. So, in summary:

\(\mathbf{u}\) is the vector associated to the so-called canonical axis
When the first canonical axis has been determined, we search for a 2nd one
The second axis should be the most discriminant and uncorrelated with the first one
This procedure is repeated until the number of axis reaches the minimum of: \(K - 1\) and \(p\)

In fact, it is not the canonical axes that are manipulated directly, but the canonical variables or vectors associated to the canonical axes.

27.4 CDA: Supervised Aspect

The supervised learning aspect of CDA has to do with the question: how do we use it for classification purposes? This involves establishing a decision rule that let us predict the class of an object.

CDA proposes a geometric rule of classification. As we will see, such rule has a linear form. This is the reason why we consider CDA to be a special member of the family of Linear Discriminant Analysis.

27.4.1 Distance behind CDA

The first thing we need to discuss is the notion of distance used in CDA. For this, consider two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^{p}\). Recall that the inner product of \(\mathbf{a}\) and \(\mathbf{b}\) is defined to be:

\[ \langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\mathsf{T} \mathbf{b} \tag{27.19} \]

Note that we can equivalently write the inner product as:

\[ \langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\mathsf{T} \mathbf{I_p} \mathbf{b} \tag{27.20} \]

where \(\mathbf{I_p}\) denotes the \(p \times p\) identity matrix. This notation allows us to generalize inner products: for a symmetric matrix \(\mathbf{M}\) (called the metric matrix) of conformable dimensions, we define the inner product of \(\mathbf{a}\) and \(\mathbf{b}\), under metric \(\mathbf{M}\), to be

\[ \langle \mathbf{a}, \mathbf{b} \rangle_{\mathbf{M}} = \mathbf{a}^\mathsf{T} \mathbf{M} \mathbf{b} \tag{27.21} \]

With this new notion of an inner product, we can also expand our definition of distance. Recall that the squared Euclidean distance between two vectors \(\mathbf{x_i}\) and \(\mathbf{x_\ell}\) is defined as:

\[\begin{align*} d^2(i, \ell) &= \langle \mathbf{x_i} - \mathbf{x_\ell}, \mathbf{x_i} - \mathbf{x_\ell} \rangle \\ &= (\mathbf{x_i} - \mathbf{x_\ell})^{\mathsf{T}} (\mathbf{x_i} - \mathbf{x_\ell}) \\ &= (\mathbf{x_i} - \mathbf{x_\ell})^{\mathsf{T}} \mathbf{I_p} (\mathbf{x_i} - \mathbf{x_\ell}) \tag{27.22} \end{align*}\]

Now, we can play the same game as we did for the inner product and replace \(\mathbf{I_p}\) with any metric matrix \(\mathbf{M}\) to obtain a generalized distance metric:

\[ d^2_{\mathbf{M}} (i, \ell) = (\mathbf{x_i} - \mathbf{x_\ell})^{\mathsf{T}} \mathbf{M} (\mathbf{x_i} - \mathbf{x_\ell}) \tag{27.23} \]

27.4.2 Predictive Idea

The classification rule used in CDA consists of assigning each individual \(\mathbf{x_i}\) to the class \(\mathcal{C_k}\) for which the distance to the centroid \(\mathbf{g_k}\) is minimal.

\[ \text{assign object } i \text{ to the class for which } d^2(\mathbf{x_i},\mathbf{g_k}) \text{ is minimal} \]

But here we don’t use the typical Euclidean distance. Instead, in CDA we use a different distance: the Mahalanobis distance. This other type of distance is based on the Mahalanobis metric matrix \(\mathbf{W^{-1}}\). Consequently, the formula of the (squared) distance is given by:

\[ \textsf{Mahalanobis:} \quad d^2(\mathbf{x_i, g_k}) = (\mathbf{x_i} - \mathbf{g_k})^\mathsf{T} \mathbf{W}^{-1} (\mathbf{x_i} - \mathbf{g_k}) \tag{27.24} \]

What does this distance do? The Mahalanobis distance measures the (squared) distance of a point \(\mathbf{x_i}\) to a centroid \(\mathbf{g_k}\) by taking into account the correlational structure of the variables.

The following figure illustrates what this distance is doing compared to the Euclidean distance. Suppose we have two points \(\mathbf{x_i}\) and \(\mathbf{x_\ell}\) lying on the same density ellipsoid.

Figure 27.14: Euclidean -vs- Mahalanobis distances

Without taking into account the correlational structure of the features, the points \(\mathbf{x_i}\) and \(\mathbf{x_\ell}\) are at different distances from the centroid \(\mathbf{g_k}\). This is the case of the Euclidean distance which does not take into account such correlational structure.

However, if we use a metric such as the Mahalanobis metric based on \(\mathbf{W}^{-1}\) (or an equivalent metric \(\mathbf{V}^{-1}\)), the distance of a particular point \(\mathbf{x_i}\) to the centroid \(\mathbf{g_k}\) depends on how spread out is the cloud of points in class \(k\). Moreover, if two points \(i\) and \(\ell\) are on the same density ellipsoid, then they are equidistant to the centroid:

\[ d^{2}_{\mathbf{W}^{-1}} (\mathbf{x_i}, \mathbf{g_k}) = d^{2}_{\mathbf{W}^{-1}} (\mathbf{x_\ell}, \mathbf{g_k}) \tag{27.25} \]

27.4.3 CDA Classifier

As we said before, the classification rule behind CDA says that we should assign an object \(\mathbf{x_i}\) to the class it is nearest to, using the \(\mathbf{W^{-1}}\) metric to calculate the distance of the object from the centroid of the group. Expanding the Mahalanobis distance of \(\mathbf{x_i}\) to centroid \(\mathbf{g_k}\) we get:

\[\begin{align*} d^2(\mathbf{x_i, g_k}) &= (\mathbf{x_i} - \mathbf{g_k})^\mathsf{T} \mathbf{W}^{-1} (\mathbf{x_i} - \mathbf{g_k}) \\ &= \mathbf{x_i}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_i} - 2 \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_i} + \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_k} \tag{27.26} \end{align*}\]

As you can tell, there are three terms: the first one does not depend on \(k\), but the second and third terms do depend on \(k\):

\[ d^2(\mathbf{x_i, g_k}) = \underbrace{\mathbf{x_i}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_i}}_{constant} - \underbrace{2 \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_i}}_{\text{depends on } k} + \underbrace{ \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_k} }_{\text{depends on } k} \tag{27.27} \]

Note that minimizing \(d^2(\mathbf{x_i, g_k})\) is equivalent to maximizing the negative of those terms that depend on \(k\):

\[ 2 \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_i} - \mathbf{g_k}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_k} \tag{27.28} \]

Let’s go back to our hypothetical example at the beginning of the chapter. Now suppose that we have three unclassified individuals A, B, and C. In canonical discriminant analysis we compute the Mahalanobis distances of the unclassified objects to the three centroids.

Figure 27.15: Mahalanobis distance of an object to the class centroids

For instance, with individual A, we would have to compute:

\[ d^2(\mathbf{x_A}, \mathbf{g_1}), \quad d^2(\mathbf{x_A}, \mathbf{g_2}), \quad and \quad d^2(\mathbf{x_A}, \mathbf{g_3}) \tag{27.29} \]

and then select the class, in this case class \(C_2\), for which the Mahalanobis distance is minimal. An equivalent approach is to look for the maximum of:

\(2 \mathbf{g_1}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_A} - \mathbf{g_1}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_1}\)
\(2 \mathbf{g_2}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_A} - \mathbf{g_2}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_2}\)
\(2 \mathbf{g_3}^\mathsf{T} \mathbf{W}^{-1} \mathbf{x_A} - \mathbf{g_3}^\mathsf{T} \mathbf{W}^{-1} \mathbf{g_3}\)

Figure 27.16: Assign class for which Mahalanobis distance is minimal

27.4.4 Limitations of CDA classifier

There is an underlying assumption behind the geometric rule of CDA. This rule should not be used if the two classes have different a priori probabilities or variances (we discuss this in the next chapter).

Figure 27.17: To which class should we assign the new object?

In the next chapter we shift gears by introducing a more generic framework for discriminant analysis based on a probabilistic approach, as opposed to the geometric distance-based approach of CDA.