31 K-Means
With direct partitioning methods, any two clusters are always disjoint, and the number of clusters is generally defined a priori. Some common methods are:
- \(K\)-means
- \(K\)-medoids
- \(K\)-prototypes
- Methods based on a concept of density
- Kohonen networks (or Kohonen maps)
31.1 Toy Example
Imagine we are in \(\mathbb{R}^p\) (or \(\mathbb{R}^n\), depending on whether we look at things from the perspective of the individuals or of the variables). For illustrative purposes, let’s consider two clusters, that is, \(K = 2\).
We start by initializing two centroids \(\mathbf{g_1}\) and \(\mathbf{g_2}\). Typically, the way centroids are initialized is by randomly selecting two objects. Sometimes, however, a more “educated” initialization may be possible when we have additional information about where the clusters may be located.
Given a data point, we compute its distances to the two centroids and assign the point to the cluster of the closest centroid.
The distances of all objects to the centroids are computed as well, and the objects get assigned to the group of the closest centroid.
Now, with this newly grouped data, we have to recompute the centroids.
We repeat this procedure, that is, we calculate the distances of all objects to the centroids and re-assign objects to clusters.
Likewise, after objects have been assigned to new clusters, the centroids are updated again.
After enough iterations, the centroids will barely move from one iteration to the next. At this point we say the algorithm has converged, and we stop (the resulting clusters are the ones we keep).
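To make this procedure concrete, here is a minimal sketch of a single assign-and-update iteration for \(K = 2\), written in Python with NumPy. The simulated data, the random seed, and the variable names are illustrative assumptions, not part of the example above.

```python
import numpy as np

# Simulated toy data in R^2: two loose groups of points (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),    # points roughly around (0, 0)
               rng.normal(5, 1, size=(10, 2))])   # points roughly around (5, 5)

# Initialize the two centroids by randomly selecting two objects.
g1, g2 = X[rng.choice(len(X), size=2, replace=False)]

# One iteration: (1) assign each object to the cluster of the closest centroid ...
d1 = np.sum((X - g1) ** 2, axis=1)   # squared distances to g1
d2 = np.sum((X - g2) ** 2, axis=1)   # squared distances to g2
labels = np.where(d1 <= d2, 0, 1)

# ... and (2) recompute each centroid as the mean of its assigned objects.
g1 = X[labels == 0].mean(axis=0)
g2 = X[labels == 1].mean(axis=0)
```

Repeating these two steps until the centroids stop moving gives the full algorithms described in the following sections.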
Another question that may arise pertains to efficiency and convergence time. One benefit of the \(K\)-means algorithm is that it converges quite quickly; the complexity is of order \(O(n \cdot K)\).
31.2 What does K-means do?
Consider the transition from the \(s\)-th step to the \((s+1)\)-th step of the algorithm:
\[ \text{step } s \quad \longrightarrow \quad \text{intermediate step } (s \to s+1) \quad \longrightarrow \quad \text{step } s+1 \]
Specifically, let us focus on the intermediate step, when step \(s\) goes to step \(s+1\). In other words, the centroids have moved but we have not reassigned points.
At step \(s\) we would compute \(\mathrm{WSSD}_1\) and \(\mathrm{WSSD}_2\), whose sum, if we are using the squared Euclidean distance, becomes
\[ \mathrm{WSSD}^{(s)} = \sum_{i \in \mathcal{C}_1^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s)} \big) + \sum_{i \in \mathcal{C}_2^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s)} \big) \tag{31.1} \]
At the end of the \(s\)-th step, we update centroids, obtaining \(\mathbf{g_1}^{(s + 1)}\) and \(\mathbf{g_2}^{(s + 1)}\). These are the centroids used during the assignment of step \(s+1\). Notice that there is an intermediate step, with new centroids, but objects clustered according to \(\mathcal{C}^{(s)}_1\) and \(\mathcal{C}^{(s)}_2\). Technically, we can compute within-group dispersions during this intermediate step, given by:
\[ \mathrm{WSSD}^{(s \to s+1)} = \sum_{i \in \mathcal{C}_1^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s + 1)} \big) + \sum_{i \in \mathcal{C}_2^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s + 1)} \big) \tag{31.2} \]
At the end of step \(s+1\), objects are clustered under the new configurations \(\mathcal{C}^{(s+1)}_1\) and \(\mathcal{C}^{(s+1)}_2\). Likewise, the within-group dispersions at the end of step \(s+1\) become:
\[ \mathrm{WSSD}^{(s+1)} = \sum_{i \in \mathcal{C}_1^{(s + 1)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s + 1)} \big) + \sum_{i \in \mathcal{C}_2^{(s + 1)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s + 1)} \big) \tag{31.3} \]
It can be proved that these three dispersions are ordered from largest to smallest: the intermediate dispersion is no larger than the dispersion at step \(s\), because within each group the mean is the point that minimizes the sum of squared distances; and the dispersion at step \(s+1\) is no larger than the intermediate one, because each object is re-assigned to its closest centroid. In other words,
\[ \mathrm{WSSD}^{(s)} \geq \mathrm{WSSD}^{(s \to s+1)} \geq \mathrm{WSSD}^{(s+1)} \tag{31.4} \]
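A quick numerical check can make inequality (31.4) concrete. The following sketch (again with simulated data and \(K = 2\); the names and the seed are illustrative assumptions) computes the three dispersions for one transition of the algorithm:

```python
import numpy as np

def wssd(X, labels, centroids):
    """Total within-group sum of squared distances for a given assignment
    (labels) and a given set of centroids."""
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(15, 2)), rng.normal(4, 1, size=(15, 2))])

centroids = X[rng.choice(len(X), size=2, replace=False)]                       # g^(s)
labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)   # C^(s)
wssd_s = wssd(X, labels, centroids)                                            # WSSD^(s)

new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])      # g^(s+1)
wssd_mid = wssd(X, labels, new_centroids)                                      # WSSD^(s -> s+1)

new_labels = np.argmin(((X[:, None, :] - new_centroids[None]) ** 2).sum(-1), axis=1)  # C^(s+1)
wssd_s1 = wssd(X, new_labels, new_centroids)                                   # WSSD^(s+1)

print(wssd_s >= wssd_mid >= wssd_s1)   # should print True
```

The comparison should print `True`: updating the centroids can only decrease the dispersion, and so can re-assigning each object to its closest centroid.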
31.3 K-Means Algorithms
In \(K\)-means clustering, we must first define the number of clusters we want (for example, \(K = 3\) clusters).
31.3.1 Classic Version
The original \(K\)-means algorithm was proposed by MacQueen in 1967, and is as follows:
1. Define \(K\), the number of clusters we want.
2. Start with \(K\) random centroids.
3. Examine a particular object, and assign it to the group of the closest centroid.
4. Recompute the centroid of this cluster.
5. Assign the next object to the group of the closest centroid.
6. Recompute the centroid of this cluster.
7. Repeat this procedure until all objects have been assigned.
Note that this algorithm proceeds object-by-object, re-computing a centroid after each object has been assigned. That is, if we have \(n\) objects, a centroid gets recomputed \(n\) times.
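As a rough sketch of these steps (not a faithful reproduction of MacQueen's original formulation; the function name, the use of NumPy, and the random initialization are assumptions), the object-by-object procedure could look like this:

```python
import numpy as np

def kmeans_macqueen(X, K, seed=None):
    """One pass of the object-by-object procedure listed above: each object is
    assigned to the closest centroid, and the centroid of that cluster is
    recomputed right away (a sketch, not MacQueen's exact formulation)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = X[rng.choice(n, size=K, replace=False)]   # K random centroids
    members = [[] for _ in range(K)]                       # object indices per cluster

    for i in range(n):                                     # examine objects one by one
        d = np.sum((centroids - X[i]) ** 2, axis=1)        # distances to the K centroids
        k = int(np.argmin(d))                              # closest centroid
        members[k].append(i)                               # assign the object ...
        centroids[k] = X[members[k]].mean(axis=0)          # ... and update that centroid

    labels = np.empty(n, dtype=int)
    for k, idx in enumerate(members):
        labels[idx] = k
    return labels, centroids
```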
31.3.2 Moving Centers Algorithm
This version of \(K\)-means was proposed by Forgy in 1965, and is as follows:
1. Define \(K\), the number of clusters we want.
2. Start with \(K\) random centroids.
3. Assign each object to the nearest centroid, without updating the centroids until all objects have been assigned.
4. Recalculate the centroids.
5. Repeat steps (3) and (4) until convergence.
Of course, this algorithm requires some criterion to define “convergence.” Some stopping mechanisms could be:
- Stop when \(\mathrm{WSSD}^{(s)}\) is equal (or roughly equal) to \(\mathrm{WSSD}^{(s + 1)}\).
- Set a maximum number of iterations (e.g. stop the algorithm after 10 iterations).
This algorithm is the most popular of the three variants discussed in this chapter.
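Here is a hedged sketch of the moving-centers iteration with both stopping mechanisms above (the parameter names, defaults, and the rule for empty clusters are illustrative choices, not part of Forgy's original description):

```python
import numpy as np

def kmeans_forgy(X, K, max_iter=10, tol=1e-6, seed=None):
    """Moving-centers iteration as listed above, with both stopping rules:
    the change in WSSD falls below `tol`, or `max_iter` iterations are reached."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_wssd = np.inf

    for _ in range(max_iter):
        # Step (3): assign every object to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step (4): recompute the centroids (keep the old one if a cluster is empty).
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
        # Stopping rule: WSSD^(s) roughly equal to WSSD^(s+1).
        wssd = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
        if prev_wssd - wssd < tol:
            break
        prev_wssd = wssd
    return labels, centroids, wssd
```

In practice one would usually rely on an existing implementation such as `sklearn.cluster.KMeans`, which follows the same batch assign-then-update scheme.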
31.3.3 Dynamic Clouds
This algorithm was proposed by Diday in 1971, and is as follows:
- Instead of centroids, we work with “kernels”: each cluster is represented not by a single point (its centroid) but by a small group of objects.
- The remaining steps mirror those of the previous algorithms, with distances now computed to kernels rather than to single centroids (see the sketch below).
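The description above is deliberately brief, so the following sketch is only one possible reading of the idea: each cluster is represented by a kernel of \(m\) objects, distances are measured as average squared distances to the kernel, and kernels are refreshed from the current clusters. All of these choices (and the names) are assumptions made for illustration.

```python
import numpy as np

def assign_to_kernels(X, kernels):
    """Assign each object to the cluster whose kernel (a small set of
    representative objects) is closest on average."""
    # d2[i, k] = average squared distance from object i to the points of kernel k
    d2 = np.array([((X[:, None, :] - kern[None, :, :]) ** 2).sum(axis=2).mean(axis=1)
                   for kern in kernels]).T
    return d2.argmin(axis=1)

def update_kernels(X, labels, K, m=3):
    """Replace each kernel by the m objects of its cluster that lie closest
    to the cluster mean (assumes no cluster is empty)."""
    kernels = []
    for k in range(K):
        Xk = X[labels == k]
        d = np.sum((Xk - Xk.mean(axis=0)) ** 2, axis=1)
        kernels.append(Xk[np.argsort(d)[:m]])
    return kernels
```

Alternating these two functions, starting from randomly chosen kernels, mimics the assign-then-update loop of the previous algorithms.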
31.3.4 Choosing \(K\)
How many groups should we pick? Well, the answer is highly dependent on the initial centroids. The algorithm will always converge to a local minimum, but there is no guarantee that it will arrive at the global minimum. One idea is to run \(K\)-means multiple times, to get a sense of how stable the groups are.
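A sketch of this idea using scikit-learn (the toy data, the number of runs, and the seeds are arbitrary illustrations):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, size=(30, 2)) for m in (0, 4, 8)])  # toy data: 3 blobs

# Ten runs of K-means, each from a different random initialization.
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)]

# Compare the within-group dispersions (inertia) across runs ...
print([round(r.inertia_, 1) for r in runs])
# ... and how much two of the partitions agree (1 = identical up to relabeling).
print(round(adjusted_rand_score(runs[0].labels_, runs[1].labels_), 2))
```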
Note that cross-validation cannot be used to determine \(K^*\), the optimal number of clusters. For an explanation of why this is, see pages 518-9 in ESL.
There is another proposal on how to choose a “good” number of clusters, but it involves hierarchical clustering, the topic of the next chapter.
31.3.5 Comments
The main advantage of most flavors of K-means is their computational complexity, which in general is linear in the number of observations: we compute \(n \cdot K\) distances at each step, so the cost is of order \(O(n \cdot K)\) per iteration. This is the reason why K-means procedures are favored when dealing with large data sets.
The main disadvantage of direct partitioning methods is that the clustering configuration depends on the initial centroids, which means we don’t obtain the global optimum, but just a local optimum. An attempt to overcome this limitation is to run the algorithm several times, starting from different initial centroids, and then compare the obtained configurations in order to detect strong or stable groups.
Another suggestion is to run K-means on various random samples of observations, and then use the centroids of a good configuration as the initial centroids for the algorithm applied to the entire data set.
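A sketch of this suggestion, again with scikit-learn and made-up data (sample size, seeds, and names are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1, size=(500, 2)) for m in (0, 5, 10)])  # toy "large" data

# Run K-means on a small random sample of observations ...
sample = X[rng.choice(len(X), size=100, replace=False)]
pilot = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

# ... and reuse its centroids to initialize the run on the entire data set.
final = KMeans(n_clusters=3, init=pilot.cluster_centers_, n_init=1).fit(X)
```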