31 K-Means
With direct partitioning methods, any two clusters are always disjoint, and the number of clusters is generally defined a priori. Some common methods are:
- \(K\)-means
- \(K\)-medoids
- \(K\)-prototypes
- Methods based on a concept of density
- Kohonen networks (or Kohonen maps)
31.1 Toy Example
Imagine we are in \(\mathbb{R}^p\) (or \(\mathbb{R}^n\), depending on whether we look at things from the perspective of the individuals or of the variables). For illustrative purposes, let’s consider two clusters, that is, \(K = 2\).
We start by initializing two centroids \(\mathbf{g_1}\) and \(\mathbf{g_2}\). Typically, the way centroids are initialized is by randomly selecting two objects. Sometimes, however, a more “educated” initialization may be possible when we have additional information about where the clusters may be located.
Given a data point, we compute its distances to the two centroids and assign the point to the cluster of the closest centroid.
The distances of all objects to the centroids are computed as well, and the objects get assigned to the group of the closest centroid.
Now, with this newly grouped data, we have to recompute the centroids.
We repeat this procedure, that is, we calculate the distances of all objects to the centroids and re-assign objects to clusters.
Likewise, after objects have been assigned to new clusters, the centroids are updated again.
After enough iterations, the centroids will barely move from one iteration to the next. At this point we say the algorithm has converged, and we stop (the resulting clusters are the ones we keep).
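To make this procedure concrete, here is a minimal sketch of a single assign-and-update iteration for \(K = 2\), written in Python with NumPy. The simulated data, the random seed, and the variable names are illustrative assumptions, not part of the example above.

```python
import numpy as np

# Simulated toy data in R^2: two loose groups of points (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),    # points roughly around (0, 0)
               rng.normal(5, 1, size=(10, 2))])   # points roughly around (5, 5)

# Initialize the two centroids by randomly selecting two objects.
g1, g2 = X[rng.choice(len(X), size=2, replace=False)]

# One iteration: (1) assign each object to the cluster of the closest centroid ...
d1 = np.sum((X - g1) ** 2, axis=1)   # squared distances to g1
d2 = np.sum((X - g2) ** 2, axis=1)   # squared distances to g2
labels = np.where(d1 <= d2, 0, 1)

# ... and (2) recompute each centroid as the mean of its assigned objects.
g1 = X[labels == 0].mean(axis=0)
g2 = X[labels == 1].mean(axis=0)
```

Repeating these two steps until the centroids stop moving gives the full algorithms described in the following sections.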
Another question that may arise pertains to efficiency and convergence time. One benefit of the \(K\)-means algorithm is that it converges quite quickly; the complexity is of order \(O(n \cdot K)\).
31.2 What does K-means do?
Consider the transition from the \(s\)-th step to the \((s+1)\)-th step of the algorithm:
\[ \text{step } s \quad \longrightarrow \quad \text{intermediate step } (s \to s+1) \quad \longrightarrow \quad \text{step } s+1 \]
Specifically, let us focus on the intermediate step, when step \(s\) goes to step \(s+1\). In other words, the centroids have moved but we have not reassigned points.
At step \(s\) we would compute \(\mathrm{WSSD}_1\) and \(\mathrm{WSSD}_2\), whose sum, if we are using the squared Euclidean distance, becomes
\[ \mathrm{WSSD}^{(s)} = \sum_{i \in \mathcal{C}_1^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s)} \big) + \sum_{i \in \mathcal{C}_2^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s)} \big) \tag{31.1} \]
At the end of the \(s\)-th step, we update centroids, obtaining \(\mathbf{g_1}^{(s + 1)}\) and \(\mathbf{g_2}^{(s + 1)}\). These are the centroids used during the assignment of step \(s+1\). Notice that there is an intermediate step, with new centroids, but objects clustered according to \(\mathcal{C}^{(s)}_1\) and \(\mathcal{C}^{(s)}_2\). Technically, we can compute within-group dispersions during this intermediate step, given by:
\[ \mathrm{WSSD}^{(s \to s+1)} = \sum_{i \in \mathcal{C}_1^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s + 1)} \big) + \sum_{i \in \mathcal{C}_2^{(s)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s + 1)} \big) \tag{31.2} \]
At the end of step \(s+1\), objects are clustered under the new configurations \(\mathcal{C}^{(s+1)}_1\) and \(\mathcal{C}^{(s+1)}_2\). Likewise, the within-group dispersions at the end of step \(s+1\) become:
\[ \mathrm{WSSD}^{(s+1)} = \sum_{i \in \mathcal{C}_1^{(s + 1)}} d^2 \big(\mathbf{x_i}, \mathbf{g_1}^{(s + 1)} \big) + \sum_{i \in \mathcal{C}_2^{(s + 1)}} d^2 \big(\mathbf{x_i}, \mathbf{g_2}^{(s + 1)} \big) \tag{31.3} \]
It can be proved that these three dispersions are ordered from largest to smallest: the intermediate dispersion is no larger than the dispersion at step \(s\), because within each group the mean is the point that minimizes the sum of squared distances; and the dispersion at step \(s+1\) is no larger than the intermediate one, because each object is re-assigned to its closest centroid. In other words,
\[ \mathrm{WSSD}^{(s)} \geq \mathrm{WSSD}^{(s \to s+1)} \geq \mathrm{WSSD}^{(s+1)} \tag{31.4} \]
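A quick numerical check can make inequality (31.4) concrete. The following sketch (again with simulated data and \(K = 2\); the names and the seed are illustrative assumptions) computes the three dispersions for one transition of the algorithm:

```python
import numpy as np

def wssd(X, labels, centroids):
    """Total within-group sum of squared distances for a given assignment
    (labels) and a given set of centroids."""
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(15, 2)), rng.normal(4, 1, size=(15, 2))])

centroids = X[rng.choice(len(X), size=2, replace=False)]                       # g^(s)
labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)   # C^(s)
wssd_s = wssd(X, labels, centroids)                                            # WSSD^(s)

new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])      # g^(s+1)
wssd_mid = wssd(X, labels, new_centroids)                                      # WSSD^(s -> s+1)

new_labels = np.argmin(((X[:, None, :] - new_centroids[None]) ** 2).sum(-1), axis=1)  # C^(s+1)
wssd_s1 = wssd(X, new_labels, new_centroids)                                   # WSSD^(s+1)

print(wssd_s >= wssd_mid >= wssd_s1)   # should print True
```

The comparison should print `True`: updating the centroids can only decrease the dispersion, and so can re-assigning each object to its closest centroid.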
31.3 K-Means Algorithms
In \(K\)-means clustering, we must first define the number of clusters we want (for example, \(K = 3\) clusters).
31.3.1 Classic Version
The original \(K\)-means algorithm was proposed by MacQueen in 1967, and is as follows:
1. Define \(K\), the number of clusters we want.
2. Start with \(K\) random centroids.
3. Examine a particular object, and assign it to the group of the closest centroid.
4. Recompute the centroid of this cluster.
5. Assign the next object to the group of the closest centroid.
6. Recompute the centroid of this cluster.
7. Repeat this procedure until all objects have been assigned.
Note that this algorithm proceeds object-by-object, re-computing a centroid after each object has been assigned. That is, if we have \(n\) objects, a centroid gets recomputed \(n\) times.
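As a rough sketch of these steps (not a faithful reproduction of MacQueen's original formulation; the function name, the use of NumPy, and the random initialization are assumptions), the object-by-object procedure could look like this:

```python
import numpy as np

def kmeans_macqueen(X, K, seed=None):
    """One pass of the object-by-object procedure listed above: each object is
    assigned to the closest centroid, and the centroid of that cluster is
    recomputed right away (a sketch, not MacQueen's exact formulation)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = X[rng.choice(n, size=K, replace=False)]   # K random centroids
    members = [[] for _ in range(K)]                       # object indices per cluster

    for i in range(n):                                     # examine objects one by one
        d = np.sum((centroids - X[i]) ** 2, axis=1)        # distances to the K centroids
        k = int(np.argmin(d))                              # closest centroid
        members[k].append(i)                               # assign the object ...
        centroids[k] = X[members[k]].mean(axis=0)          # ... and update that centroid

    labels = np.empty(n, dtype=int)
    for k, idx in enumerate(members):
        labels[idx] = k
    return labels, centroids
```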
31.3.2 Moving Centers Algorithm
This version of \(K\)-means was proposed by Forgy in 1965, and is as follows:
1. Define \(K\), the number of clusters we want.
2. Start with \(K\) random centroids.
3. Assign each object to the nearest centroid, without updating the centroids until all objects have been assigned.
4. Recalculate the centroids.
5. Repeat steps (3) and (4) until convergence.
Of course, this algorithm requires some criterion to define “convergence.” Some stopping mechanisms could be:
- Stop when \(\mathrm{WSSD}^{(s)}\) is equal (or roughly equal) to \(\mathrm{WSSD}^{(s + 1)}\).
- Set a maximum number of iterations (e.g. stop the algorithm after 10 iterations).
This algorithm is the most popular of the three variants discussed in this chapter.
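Here is a hedged sketch of the moving-centers iteration with both stopping mechanisms above (the parameter names, defaults, and the rule for empty clusters are illustrative choices, not part of Forgy's original description):

```python
import numpy as np

def kmeans_forgy(X, K, max_iter=10, tol=1e-6, seed=None):
    """Moving-centers iteration as listed above, with both stopping rules:
    the change in WSSD falls below `tol`, or `max_iter` iterations are reached."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_wssd = np.inf

    for _ in range(max_iter):
        # Step (3): assign every object to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step (4): recompute the centroids (keep the old one if a cluster is empty).
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
        # Stopping rule: WSSD^(s) roughly equal to WSSD^(s+1).
        wssd = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(K))
        if prev_wssd - wssd < tol:
            break
        prev_wssd = wssd
    return labels, centroids, wssd
```

In practice one would usually rely on an existing implementation such as `sklearn.cluster.KMeans`, which follows the same batch assign-then-update scheme.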
31.3.3 Dynamic Clouds
This algorithm was proposed by Diday in 1971, and is as follows:
- Instead of centroids, we work with “kernels”: each cluster is represented not by a single point (its centroid) but by a small group of objects.
- The remaining steps mirror those of the previous algorithms, with distances now computed to kernels rather than to single centroids (see the sketch below).
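The description above is deliberately brief, so the following sketch is only one possible reading of the idea: each cluster is represented by a kernel of \(m\) objects, distances are measured as average squared distances to the kernel, and kernels are refreshed from the current clusters. All of these choices (and the names) are assumptions made for illustration.

```python
import numpy as np

def assign_to_kernels(X, kernels):
    """Assign each object to the cluster whose kernel (a small set of
    representative objects) is closest on average."""
    # d2[i, k] = average squared distance from object i to the points of kernel k
    d2 = np.array([((X[:, None, :] - kern[None, :, :]) ** 2).sum(axis=2).mean(axis=1)
                   for kern in kernels]).T
    return d2.argmin(axis=1)

def update_kernels(X, labels, K, m=3):
    """Replace each kernel by the m objects of its cluster that lie closest
    to the cluster mean (assumes no cluster is empty)."""
    kernels = []
    for k in range(K):
        Xk = X[labels == k]
        d = np.sum((Xk - Xk.mean(axis=0)) ** 2, axis=1)
        kernels.append(Xk[np.argsort(d)[:m]])
    return kernels
```

Alternating these two functions, starting from randomly chosen kernels, mimics the assign-then-update loop of the previous algorithms.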
31.3.4 Choosing \(K\)
How many groups should we pick? Well, the answer is highly dependent on the initial centroids. The algorithm will always converge to a local minimum, but there is no guarantee that it will arrive at the global minimum. One idea is to run \(K\)-means multiple times, to get a sense of how stable the groups are.
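A sketch of this idea using scikit-learn (the toy data, the number of runs, and the seeds are arbitrary illustrations):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, size=(30, 2)) for m in (0, 4, 8)])  # toy data: 3 blobs

# Ten runs of K-means, each from a different random initialization.
runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit(X) for s in range(10)]

# Compare the within-group dispersions (inertia) across runs ...
print([round(r.inertia_, 1) for r in runs])
# ... and how much two of the partitions agree (1 = identical up to relabeling).
print(round(adjusted_rand_score(runs[0].labels_, runs[1].labels_), 2))
```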
Note that cross-validation cannot be used to determine \(K^*\), the optimal number of clusters. For an explanation of why this is, see pages 518-9 in ESL.
There is another proposal on how to choose a “good” number of clusters, but it involves hierarchical clustering, the topic of the next chapter.
31.3.5 Comments
The main advantage of most flavors of K-means is their computational complexity, which in general is linear in the number of observations: we compute \(n \cdot K\) distances at each step, so the cost is of order \(O(n \cdot K)\) per iteration. This is the reason why K-means procedures are favored when dealing with large data sets.
The main disadvantage of direct partitioning methods is that the clustering configuration depends on the initial centroids, which means we don’t obtain the global optimum, but just a local optimum. An attempt to overcome this limitation is to run the algorithm several times, starting from different initial centroids, and then compare the obtained configurations in order to detect strong or stable groups.
Another suggestion is to run K-means on various random samples of observations, and then use the centroids of a good configuration as the initial centroids for the algorithm applied to the entire data set.
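A sketch of this suggestion, again with scikit-learn and made-up data (sample size, seeds, and names are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1, size=(500, 2)) for m in (0, 5, 10)])  # toy "large" data

# Run K-means on a small random sample of observations ...
sample = X[rng.choice(len(X), size=100, replace=False)]
pilot = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

# ... and reuse its centroids to initialize the run on the entire data set.
final = KMeans(n_clusters=3, init=pilot.cluster_centers_, n_init=1).fit(X)
```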