30 Clustering
In this part of the book, we return to unsupervised learning methods, in particular the so-called clustering methods.
Clustering refers to a very broad set of techniques for finding groups, or clusters, of objects in a data set. Here are some examples that have become famous in the clustering literature:
- Marketing: discover groups of customers and use them for targeted marketing.
- Medical Field: discover groups of patients suitable for particular treatment protocols.
- Astronomy: find groups of similar stars and galaxies.
- Genomics: find groups of genes with similar expressions.
The main idea of clustering (or cluster analysis) is to find groups of objects in data. When we say “objects”, it can be groups of individuals, but also groups of variables. So, more formally, clustering refers to the statistical process of forming groups of objects (individuals or variables) into a limited number of groups known as clusters.
Notice from the above diagram that both clustering and classification involve dealing with a variable of interest that is categorical. However, the difference between classification and clustering has to do with whether that feature of interest is observed or not. In classification we actually observe the class of the individuals through the response variable \(Y\). In contrast, we don’t have an explicit response in clustering; we want to (in a sense) find a “hidden” variable that exposes group structure in the data. The groups are not defined in advance. In other words, there is no observed variable that imposes a group structure.
30.1 About Clustering
Clustering has its roots in what used to be called numerical taxonomy (later referred to as numerical ecology), whose origins can be traced back to ancient Greece and the work of Aristotle and his disciples.
In a more mathematical (numeric) sense, the first formal approaches can be traced back to the late 1950s and early 1960s. The main reason: the appearance of computers during this period. Before computers, how did people find groups? Typically, they would generate a scatterplot and visually ascertain whether or not there were groups. In other words, the humans were the machine learners!
Luckily, the human brain is quite good at seeing patterns, so this approach isn’t actually that bad. But, of course, computers made this quite a bit easier (not to mention, for more than 3 variables, visualizing a scatterplot is pretty much impossible).
30.1.1 Types of Clustering
We can divide clustering approaches into two main types: Hard (aka crisp), and Soft (aka fuzzy). In this book, we will only focus on hard clustering (for more about soft clustering, see ESL).
Within hard/crisp clustering, there are two primary approaches. The first is direct partitioning, e.g. through \(k\)-means and \(k\)-medoids. The other is hierarchical (also called embedded) partitioning, in which partitions may contain sub-partitions.
- Direct Partitioning: clusters are always separated, and their number is defined a priori.
- Hierarchical Partitions: clusters are either separate or embedded, defining a hierarchy of partitions.
You may have what are referred to as “natural” groups; that is, groups that are somewhat obvious from the data at first glance. Of course, there will be cases where there are no natural groups; in such a case, we might say we have “artificial” groups.
30.1.2 Hard Clustering
In hard clustering, we want objects in a particular cluster to be relatively similar to one another (of course, we will need to define what “similar” means). Additionally, we want there to be some notion of separation/spread; that is, we want each cluster to be relatively distinguishable from the other cluster(s). More specifically, we want objects in different clusters to be less similar than objects within clusters. We give names to these concepts:
- Within-cluster Homogeneity: objects in the same cluster should be similar enough; each group is as homogeneous as possible.
- Between-cluster Heterogeneity: objects in different clusters should be less similar than objects within the same cluster; groups are as distinct from one another as possible.
In other words: we want high within-cluster homogeneity and between-cluster heterogeneity.
In summary:
- We seek a partition of the data into distinct groups.
- We want the observations within each group to be quite similar to each other.
- We must define what it means for two or more observations to be similar or different.
- This is often a domain-specific consideration that must be made based on knowledge of the data being studied.
Of course, to quantify “similar” and “different,” we need some sort of mathematical notion of closeness; that is, we need a distance metric. As usual, we will assume that the rows of the data matrix correspond to the individuals to be clustered (although you could also cluster variables).
30.2 Dispersion Measures
Let’s assume that the individuals are embedded in a Euclidean space (e.g. quantitative variables, or the output of a dimension-reduction method). As usual, suppose that the individuals form a cloud of points in a \(p\)-dimensional space, and that we also have the centroid, or average individual, \(\mathbf{g}\), as in the following figure:
We can look at the amount of spread or dispersion in the data with respect to the global centroid. One way to do this is by computing the total sum of squared distances from the centroid, given by:
\[ \text{TSSD} = \sum_{i=1}^{n} d^2(\mathbf{x_i}, \mathbf{g}) \tag{30.1} \]
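To make this concrete, here is a minimal R sketch (using a small, randomly generated matrix `X` of our own as a stand-in for the data) that computes the total sum of squared distances around the global centroid:

```r
# hypothetical data: n = 10 individuals measured on p = 3 quantitative variables
set.seed(123)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)

# global centroid g (the "average individual")
g <- colMeans(X)

# TSSD: sum of squared Euclidean distances from every individual to g
TSSD <- sum(sweep(X, 2, g)^2)
TSSD
```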
Now suppose that we have some group structure in which objects are divided into several clusters \(\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_K\).
Because each cluster \(\mathcal{C}_k\) will form its own cloud of points, we can make the visual representation more apparent by using different shapes for the points. Likewise, we suppose that each cluster has its own centroid \(\mathbf{g_k}\):
Within Cluster Dispersion
Each cluster \(\mathcal{C}_k\) will have an associated amount of spread around its centroid, that is, a cluster-specific sum of squared distances (\(\text{CSSD}_k\)):
\[ \text{CSSD}_k = \sum_{i \in \mathcal{C}_k} d^2(\mathbf{x_i}, \mathbf{g_k}) \tag{30.2} \]
We can combine the cluster sum-of-squared distances to obtain the Within-clusters sum of squared distances \(\text{WSSD}\):
\[ \text{WSSD} = \sum_{k=1}^{K} \text{CSSD}_k = \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} d^2(\mathbf{x_i}, \mathbf{g_k}) \tag{30.3} \]
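Continuing the sketch above, and assuming a hypothetical vector `cl` of cluster memberships for the rows of `X`, the within-cluster sum of squared distances is obtained by adding up the per-cluster terms:

```r
# hypothetical cluster memberships for the 10 rows of X (K = 2 clusters)
cl <- rep(c(1, 2), each = 5)

# CSSD_k: squared distances from the members of one cluster to its centroid g_k
cssd <- function(Xk) {
  gk <- colMeans(Xk)
  sum(sweep(Xk, 2, gk)^2)
}

# WSSD: sum of CSSD_k over all clusters
WSSD <- sum(sapply(unique(cl), function(k) cssd(X[cl == k, , drop = FALSE])))
WSSD
```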
Between Cluster Dispersion
Let’s now turn our attention to the centroids:
Focusing on just the centroids, and taking into account the size \(n_k\) of each cluster, we can examine the distances between the group centroids and the overall centroid. Let us call this between-cluster dispersion \(\text{BSSD}\), given by:
\[ \text{BSSD} = \sum_{k=1}^{K} n_k d^2(\mathbf{g_k}, \mathbf{g}) \tag{30.4} \]
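A matching sketch for the between-cluster dispersion, still using the hypothetical `X`, `cl`, and `g` defined above:

```r
# BSSD: cluster sizes n_k times squared distances from each cluster
# centroid g_k to the global centroid g
BSSD <- sum(sapply(unique(cl), function(k) {
  Xk <- X[cl == k, , drop = FALSE]
  nrow(Xk) * sum((colMeans(Xk) - g)^2)
}))
BSSD
```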
Dispersion Decomposition
Let’s recap. We have three types of sums-of-squared distances:
- \(\text{TSSD}\): Total sum of squared distances
- \(\text{WSSD}\): Within-clusters sum of squared distances
- \(\text{BSSD}\): Between-clusters sum of squared distances
It can be shown (Huygens’ theorem) that
\[ \text{TSSD} = \text{BSSD} + \text{WSSD} \]
that is:
\[ \sum_{i=1}^{n} d^2(\mathbf{x_i}, \mathbf{g}) = \sum_{k=1}^{K} n_k d^2(\mathbf{g_k}, \mathbf{g}) + \sum_{k=1}^{K} \sum_{i \in \mathcal{C}_k} d^2(\mathbf{x_i}, \mathbf{g_k}) \tag{30.5} \]
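With the quantities computed in the sketches above (`TSSD`, `WSSD`, and `BSSD`), the decomposition can be checked numerically:

```r
# Huygens' theorem: TSSD should equal BSSD + WSSD (up to floating-point error)
all.equal(TSSD, BSSD + WSSD)
```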
There are two criteria for a “correct” clustering: \(\text{BSSD}\) should be large and \(\text{WSSD}\) should be small. From the decomposition formula, for a fixed number of clusters \(K\), minimizing \(\text{WSSD}\) and maximizing \(\text{BSSD}\) are equivalent.
Notice that in discriminant analysis, we cared a lot about \(\text{BSSD}\). Now, in clustering, we will want to minimize \(\text{WSSD}\).
Cluster analysis aims to minimize the within-clusters sum of squared distances \(\text{WSSD}\) for a fixed number of clusters \(K\). A cluster becomes more homogeneous as its sum of squared distances decreases, and the clustering of the data set improves as \(\text{WSSD}\) decreases. Likewise, as \(\text{BSSD}\) increases, the separation between clusters increases, indicating a more satisfactory clustering.
30.3 Complexity in Clustering
Consider the following (fictitious) dataset with \(n = 5\) observations and 2 continuous variables:
x y
1 0.7 0.0
2 1.0 0.4
3 0.0 0.7
4 0.3 0.6
5 0.4 1.0
and their corresponding scatterplot
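For reference, here is a short sketch that recreates this toy data set in R and draws the scatterplot (the object name `dat` is ours):

```r
# toy data set: n = 5 observations on 2 continuous variables
dat <- data.frame(
  x = c(0.7, 1.0, 0.0, 0.3, 0.4),
  y = c(0.0, 0.4, 0.7, 0.6, 1.0)
)

# scatterplot with the observations labeled by their row number
plot(dat$x, dat$y, pch = 19, xlab = "x", ylab = "y")
text(dat$x, dat$y, labels = 1:5, pos = 3)
```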
To measure the distance between observations, we compute the squared Euclidean distance. Recall that this distance between objects \(i\) and \(\ell\) (i.e. \(\mathbf{x_i}\) and \(\mathbf{x_{\ell}}\)) is given by:
\[ d^{2}_{i\ell} = d^2(\mathbf{x_i}, \mathbf{x_{\ell}}) = \sum_{j=1}^{p} (x_{ij} - x_{\ell j})^2 = \| \mathbf{x_i} - \mathbf{x_{\ell}} \|^{2}_{2} \tag{30.6} \]
If we compute all distances between pairs of objects, we obtain the distance matrix given below (only the diagonal and the entries below it are shown, since the matrix is symmetric)
1 2 3 4 5
1 0.00
2 0.25 0.00
3 0.98 1.09 0.00
4 0.52 0.53 0.10 0.00
5 1.09 0.72 0.25 0.17 0.00
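These values can be reproduced with R’s `dist()` function by squaring the Euclidean distances (reusing the `dat` object from the sketch above):

```r
# squared Euclidean distances between all pairs of observations;
# printing a "dist" object shows only the entries below the diagonal
D2 <- dist(dat, method = "euclidean")^2
round(D2, 2)
```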
We could consider two possible clustering partitions or configurations. One possible partition is the blue configuration that produces \(K = 2\) clusters. The other partition is the red configuration which produces another set of \(K = 2\) clusters.
We could compute the centroids of each group and obtain exact values for \(\mathrm{CSSD}_1\) and \(\mathrm{CSSD}_2\) (using the squared Euclidean distance as our metric), combine them, and obtain \(\mathrm{WSSD}\). Equivalently, we can work directly with the distance matrix above, because \(\mathrm{CSSD}_k = \frac{1}{n_k} \sum_{i < \ell \in \mathcal{C}_k} d^2_{i\ell}\), that is, the sum of the pairwise squared distances within a cluster divided by the cluster size; this is what the calculations below use. We would then select the partition that gives the smallest \(\mathrm{WSSD}\).
Red clustering:
\[ \text{W}_{red} = \frac{0.25 + 0.53 + 0.52}{3} + \frac{0.25}{2} = 0.56 \]
Blue clustering:
\[ \text{W}_{blue} = \frac{0.25}{2} + \frac{0.10 + 0.17 + 0.25}{3} = 0.30 \]
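As a check, both values can be reproduced in R from the squared-distance matrix via the pairwise identity mentioned above (this sketch reuses `D2` from the earlier sketch; the cluster memberships, red = {1, 2, 4} with {3, 5} and blue = {1, 2} with {3, 4, 5}, can be read off the terms in the two sums):

```r
D2m <- as.matrix(D2)   # full symmetric 5 x 5 matrix of squared distances

# W for a given partition: for each cluster, sum the within-cluster squared
# distances (each pair appears twice in the submatrix) and divide by 2 * n_k
W <- function(D2m, cl) {
  sum(sapply(unique(cl), function(k) {
    idx <- which(cl == k)
    sum(D2m[idx, idx]) / (2 * length(idx))
  }))
}

cl_red  <- c(1, 1, 2, 1, 2)   # red partition: {1, 2, 4} and {3, 5}
cl_blue <- c(1, 1, 2, 2, 2)   # blue partition: {1, 2} and {3, 4, 5}

W(D2m, cl_red)    # about 0.56
W(D2m, cl_blue)   # about 0.30
```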
Smaller \(\text{W}\) is better, so here the blue clustering is preferred. This suggests that, to find the partition with the smallest within-cluster sum of squares \(\text{W}\), we would need to try all possible assignments of the \(n\) objects into \(K\) groups. So why don’t we just do that, and directly find the clustering partition that minimizes \(\text{W}\)?
The problem is that this brute-force approach only works for a small number of individuals and a small number of partitions. Unfortunately, the computational cost increases drastically as we add more clusters and more data points. To see this, consider four objects \(a, b, c\) and \(d\), and suppose we are interested in partitions with different numbers of clusters.
The possible (non-overlapping) partitions for \(n = 4\) objects are:
- 1 partition with 1 cluster: \(\{abcd\}\)
- 7 partitions with 2 clusters: \(\{ab,cd\}, \hspace{1mm} \{ac,bd\}, \hspace{1mm} \{ad,bc\}, \hspace{1mm} \{a,bcd\}, \hspace{1mm} \{b,acd\}, \hspace{1mm} \{c,abd\}, \hspace{1mm} \{d,abc\}\)
- 6 partitions with 3 clusters: \(\{a,b,cd\}, \hspace{1mm} \{a,c,bd\}, \hspace{1mm} \{a,d,bc\}, \hspace{1mm} \{b,c,ad\}, \hspace{1mm} \{b,d,ac\}, \hspace{1mm} \{c,d,ab\}\)
- 1 partition with 4 clusters: \(\{a,b,c,d\}\)
If we wanted only one cluster, we have only \(\{abcd\}\). If we wanted 4 clusters, we would have \(\{a, b, c, d\}\).
In general, how many possible assignments of \(n\) objects into \(K\) groups are there?
The number of possible assignments of \(n\) objects into \(K\) groups is given by:
\[ A(n, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{(K-k)} \binom{K}{k} k^n \tag{30.7} \]
With four objects, we could form partitions with \(K = 1, 2, 3,\) or \(4\) clusters. The corresponding numbers of assignments are:
- \(A(4, 1) = 1\)
- \(A(4, 2) = 7\)
- \(A(4, 3) = 6\)
- \(A(4, 4) = 1\)
Hence, with our toy data set of four objects \(a, b, c\) and \(d\), the total number of (non-overlapping) partitions, across all possible numbers of clusters, is:
\[ A(4, 1) + A(4, 2) + A(4, 3) + A(4, 4) = 15 \]
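A short sketch of Equation (30.7) in R (the function name `A` is ours) reproduces these counts:

```r
# number of ways to partition n objects into exactly K non-empty clusters
# (Stirling numbers of the second kind), as in Equation (30.7)
A <- function(n, K) {
  k <- 1:K
  sum((-1)^(K - k) * choose(K, k) * k^n) / factorial(K)
}

sapply(1:4, function(K) A(4, K))        # 1 7 6 1
sum(sapply(1:4, function(K) A(4, K)))   # 15
```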
Note that:
- \(A(10, 4) = 34105\), and
- \(A(25, 4) \approx 5 \times 10^{13}\) … huge!
It turns out that the number of (non-overlapping) partitions for \(n\) objects is the so-called Bell number \(B_n\):
\[ B_n = \frac{1}{e} \sum_{k=1}^{\infty} \frac{k^n}{k!} \tag{30.8} \]
For \(n = 4\) objects, we have \(B_4 = 15\). This is for only 4 objects! If we had \(n = 30\) objects (which is still relatively small), we would find
\[ B_{30} \approx 8.47 \times 10^{23}, \]
a huge number greater than Avogadro’s number (\(6.022 \times 10^{23}\)), which is the number of molecules in one mole of any gas!
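These Bell numbers are easy to compute with the standard recurrence \(B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k\); here is a small R sketch (the helper name `bell` is ours):

```r
# Bell numbers via the recurrence B_m = sum_{k=0}^{m-1} choose(m-1, k) * B_k
bell <- function(n) {
  B <- numeric(n + 1)
  B[1] <- 1                          # B_0 = 1
  for (m in 1:n) {
    B[m + 1] <- sum(choose(m - 1, 0:(m - 1)) * B[1:m])
  }
  B[n + 1]
}

bell(4)    # 15
bell(30)   # roughly 8.47e23, as quoted above
```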
As a general rule,
\[ B_n > \exp(n), \]
which means that exhaustively searching over all possible partitions is computationally intractable!
Because of this combinatorial complexity, we will have to settle for an approximation: typically, instead of finding the global minimum of \(\text{WSSD}\), we accept a local minimum.
Methodological Considerations
Before moving on to the next chapters, which cover direct partitioning and hierarchical partitions, we want to provide a list of general considerations worth keeping in mind when undertaking any clustering endeavor.
- The definition of natural clusters is a tricky matter.
- Clusters are not always as obvious as presented in textbooks.
- Cluster configurations differ according to the algorithm employed.
- Determining the “real” number of clusters is as much an empirical matter as a theoretical one.
- Practical matters must be taken into account when choosing the number of clusters.
- Clusters are often required to be readable and interpretable, rather than theoretically optimal; e.g. the data are naturally clustered into three groups, but four clusters are requested.
- In practice, analysts sometimes mix or combine different clustering methods to consolidate results; e.g. perform hierarchical clustering on a sample, determine the number of clusters, and then apply direct partitioning to the entire data set.