Reference: Introduction to Statistical Learning Chapter 10.3
You've seen a lot of models for \( p(y \mid x) \). This is supervised learning (knowing outcomes \( y \) = “supervision”).
The next few topics are all about models for \( x \) alone.
Today: clustering
Clusters should partition the data:
Data points within the same cluster should be close/similar, and data points in different clusters should be far apart/dissimilar.
Clusters should (ideally) be balanced. A sensible clustering of living creatures:
A less sensible clustering of living creatures:
There are many algorithms that do this, i.e. try to find clusters that are homogeneous, well separated, balanced, etc.
Key fact: we need to know about distances to quantify similarity (within a cluster) and difference (between clusters).
Generically, if \( x_i \) and \( x_j \) are two data points, we let \( d(x_i, x_j) \) denote the distance between them.
\( d(x_i, x_j) \geq 0 \)
\( d(x_i, x_j) = 0 \iff x_i = x_j \).
\( d(x_i, x_j) = d(x_j, x_i) \) (symmetry)
\( d(x_i, x_j) \leq d(x_i, x_k) + d(x_j, x_k) \) (triangle inequality, i.e. “If you want to get from Austin to Houston, don't go through Dallas!”)
NB: in math, distance functions are called “metrics.”
Euclidean (\( \ell^2 \)): \[ d(x_i, x_j) = \left[ \sum_{d=1}^D (x_{id} - x_{jd})^2 \right]^{1/2} \] (just the Pythagorean theorem!)
Manhattan (\( \ell^1 \)): \[ d(x_i, x_j) = \sum_{d=1}^D |x_{id} - x_{jd}| \] (also called “taxicab” distance)
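For concreteness, here's a quick check in base R (the two points are made up for illustration):

# two made-up points in R^3
x_i = c(1, 2, 3)
x_j = c(4, 0, 3)

# Euclidean (l2): square root of the sum of squared coordinate differences
sqrt(sum((x_i - x_j)^2))                      # 3.6056
dist(rbind(x_i, x_j), method = "euclidean")   # same thing

# Manhattan (l1): sum of absolute coordinate differences
sum(abs(x_i - x_j))                           # 5
dist(rbind(x_i, x_j), method = "manhattan")   # same thing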
The \( x \) points can be arbitrary objects in potentially crazy spaces:
As long as we can measure the distance between any two points, we can cluster them.
Clusters can be ambiguous:
How many clusters do you see?
We can organize clustering algorithms into two broad classes: partitioning methods like K-means, which produce a single flat grouping, and hierarchical methods, which build a nested tree of clusters.
Distance-based clustering isn't magic.
Should blue cluster with red or with yellow?
(We can often deal with situations like this by redefining what distance is. The term here is “manifold learning.”)
The basic K-means algorithm is super simple: pick K initial centroids; assign each point to its nearest centroid; recompute each centroid as the mean of the points assigned to it; and repeat until the assignments stop changing.
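To make those steps concrete, here is a minimal from-scratch sketch (my own illustration, assuming a numeric data matrix X and ignoring edge cases like empty clusters; in practice, just call R's built-in kmeans()):

kmeans_sketch = function(X, K, max_iter = 100) {
  # Step 1: pick K initial centroids at random from the data points
  centroids = X[sample(nrow(X), K), , drop = FALSE]
  z = rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # Step 2: assign each point to its nearest centroid (squared Euclidean distance)
    d2 = sapply(1:K, function(k) colSums((t(X) - centroids[k, ])^2))
    z_new = apply(d2, 1, which.min)
    if (all(z_new == z)) break    # assignments stopped changing: converged
    z = z_new
    # Step 3: recompute each centroid as the mean of its assigned points
    for (k in 1:K) centroids[k, ] = colMeans(X[z == k, , drop = FALSE])
  }
  list(centers = centroids, cluster = z)
}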
Each pair of clusters has two initial centroids.
Not bad!
Now one pair of clusters has one centroid (another pair has three).
Not as good!
Multiple restarts (helps, but probability is not on your side)
Select more than K initial centroids, then keep the most widely separated among them
Postprocessing
Different initialization strategies (e.g. K-means++)
Sample and use hierarchical clustering to determine initial centroids (sampling reduces computational cost of hierarchical clustering)
Initialize by choosing 1 centroid at random, \( m_1 \).
Then for \( k=2, \ldots, K \):
1) For each point, compute \( d_i \) as the minimum distance of \( x_i \) to the existing centroids.
2) Choose \( x_i \) as initial centroid \( k \) with probability proportional to \( d_i^2 \).
Thus points that are far away from existing cluster centroids are more likely to get chosen.
Note: this is just a different initialization strategy. After initialization, it works the same as K-means.
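Here's a small sketch of that initialization (my own illustration; X is assumed to be a numeric data matrix, and the resulting centroids can then be handed to kmeans() via its centers argument):

kmeanspp_init = function(X, K) {
  N = nrow(X)
  centroid_ids = sample(N, 1)      # first centroid: chosen uniformly at random
  for (k in 2:K) {
    # d_i^2 = squared distance from each point to its nearest existing centroid
    d2 = sapply(centroid_ids, function(j) colSums((t(X) - X[j, ])^2))
    d2_min = apply(as.matrix(d2), 1, min)
    # next centroid: chosen with probability proportional to d_i^2
    centroid_ids = c(centroid_ids, sample(N, 1, prob = d2_min))
  }
  X[centroid_ids, , drop = FALSE]
}

# e.g. kmeans(X, centers = kmeanspp_init(X, 4))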
Let \( m_k \) be cluster centroid \( k \), and let \( C_k \) be the set of points in cluster \( k \) (i.e. for which \( \gamma_i = k \)).
Within-cluster sum of squares should be low: \[ \mbox{SSE}_W = \sum_{k=1}^K \sum_{x_i \in C_k} d(m_k, x_i)^2 \]
Between-cluster sum of squares should be high: \[ \mbox{SSE}_B = \sum_{k=1}^K n_k \, d(m_k, \bar{x})^2 \] where \( n_k = |C_k| \) and \( \bar{x} \) is the overall sample mean.
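In R, a fitted kmeans object reports both quantities directly (for Euclidean distance); a quick sketch with made-up data:

X = matrix(rnorm(200), ncol = 2)            # made-up data: 100 points in 2D
fit = kmeans(X, centers = 3, nstart = 25)
fit$tot.withinss    # SSE_W: within-cluster sum of squares (want low)
fit$betweenss       # SSE_B: between-cluster sum of squares (want high)
fit$totss           # total sum of squares = SSE_W + SSE_B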
Let's see an example in cars.R.
K-means tries to minimize within-cluster SSE. But it can only hope to find a local optimum, because the number of possible cluster assignments is far too large to search exhaustively:
\[ \begin{aligned} A(N, K) &= \mbox{# of assignments of $N$ points to $K$ groups} \\ &= \frac{1}{K!} \sum_{j=1}^K (-1)^{K-j} \cdot \binom{K}{j} \cdot j^N \\ \\ A(10, 4) &= 34,105 \\ A(25, 4) &\approx 5 \times 10^{13} \quad \mbox{(ouch!)} \end{aligned} \]
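That formula is easy to check numerically (a quick sketch in R):

# number of ways to assign N points to K non-empty groups
A = function(N, K) {
  j = 1:K
  sum((-1)^(K - j) * choose(K, j) * j^N) / factorial(K)
}
A(10, 4)    # 34105
A(25, 4)    # roughly 5e13 -- far too many to check exhaustively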
Let \( m_1, \ldots, m_K \) be the cluster centroids, and define
\[ V_k = \{x \in \mathbb{R}^D : d(x, m_k) \leq d(x, m_j) \; \mbox{for $j = 1, \ldots, K$} \} \]
These \( V_k \) are convex polyhedra, and they form a Voronoi tessellation.
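You can see the tessellation without any special packages: color a fine grid of points by which centroid is nearest (the centroids below are made up for illustration):

centroids = rbind(c(0, 0), c(2, 1), c(1, 3), c(-1, 2))    # made-up centroids
grid = expand.grid(x1 = seq(-3, 4, length = 200), x2 = seq(-2, 5, length = 200))
d2 = sapply(1:nrow(centroids), function(k)
  (grid$x1 - centroids[k, 1])^2 + (grid$x2 - centroids[k, 2])^2)
nearest = apply(d2, 1, which.min)                          # Voronoi cell of each grid point
plot(grid$x1, grid$x2, col = nearest, pch = 15, cex = 0.4)
points(centroids, pch = 19, cex = 1.5)                     # the centroids themselves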
Upshot: K-means will not successfully recover clusters of points that aren't shaped like convex polyhedra (rings, arcs, amoeba-like blobs, bananas, etc.).
K-means is best at recovering clusters shaped like multivariate Gaussians.
But: K-means can often be successfully applied in conjunction with another algorithm to find non-convex clusters.
Remember: K is an input to K-means, not an estimated parameter. There is no generally accepted “best” method for choosing K, and there never will be. Recall this example:
Unlike in supervised learning, we can't really cross validate, because we don't know the “true” cluster labels of points in a training set.
Some heuristics that people use in practice:
But probably the most popular principle is the elbow method: plot the within-cluster SSE against K and look for an “elbow,” the point where adding more clusters stops helping much.
Let's try clusterings with many different values of K:
library(foreach)
cars = read.csv('../data/cars.csv')
cars = scale(cars[,10:18]) # cluster on measurables
k_grid = seq(2, 20, by=1)
SSE_grid = foreach(k = k_grid, .combine='c') %do% {
  cluster_k = kmeans(cars, k, nstart=50)
  cluster_k$tot.withinss
}
plot(k_grid, SSE_grid)
Do you see an elbow?
The CH index is just one of many model-selection criteria that people have proposed for choosing K in K-means clustering. It tries to balance fit with simplicity.
Let \( n_j \) be the number of points in cluster \( j \), and define \[ \begin{aligned} B(K) &= \sum_{j=1}^K n_j d(m_j, \bar{x})^2 \quad \mbox{("between")} \\ W(K) &= \sum_{j=1}^K \sum_{x_i \in C_j} d(x_i, m_j)^2 \quad \mbox{("within")} \end{aligned} \]
The CH-index is defined as \[ CH(K) = \frac{B(K)}{W(K)} \cdot \frac{N-K}{K-1} \]
So choose \( \hat{K} = \arg \max CH(K) \).
N = nrow(cars)
CH_grid = foreach(k = k_grid, .combine='c') %do% {
  cluster_k = kmeans(cars, k, nstart=50)
  W = cluster_k$tot.withinss
  B = cluster_k$betweenss
  CH = (B/W)*((N-k)/(k-1))
  CH
}
plot(k_grid, CH_grid)
Probably K=2 or K=3?
See the clusGap function in the cluster package.
library(cluster)
cars_gap = clusGap(cars, FUNcluster = kmeans, nstart = 50, K.max = 10, B = 100)
plot(cars_gap)
Look for a dip: here after K=5.
cars_gap
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = cars, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 50)
B=100 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstSEmax', SE.factor=1): 5
logW E.logW gap SE.sim
[1,] 5.892765 6.624251 0.7314856 0.013369867
[2,] 5.613163 6.347348 0.7341850 0.009533179
[3,] 5.454826 6.260008 0.8051826 0.009670759
[4,] 5.387105 6.203786 0.8166817 0.009178549
[5,] 5.315312 6.154320 0.8390086 0.009082719
[6,] 5.275226 6.113331 0.8381042 0.009544935
[7,] 5.222457 6.078473 0.8560163 0.009185893
[8,] 5.182540 6.047143 0.8646029 0.009235927
[9,] 5.150924 6.018937 0.8680134 0.009271739
[10,] 5.113672 5.994129 0.8804574 0.008804019
Look for a dip: here after K=5. (The output helpfully tells you.)
Note: I've generally had the most luck with the gap statistic as a model-selection heuristic. But don't be afraid to just pick a value of \( K \) that seems reasonable to you!
A nice feature of hierarchical clustering: you don't have to assume any particular number of clusters.
The hierarchy itself may correspond to meaningful taxonomies:
There are two main types of hierarchical clustering:
Agglomerative: start with each point in its own cluster and repeatedly merge the closest pair of clusters (“bottom-up”).
Divisive: start with all points in one cluster and repeatedly split clusters (“top-down”).
The basic (agglomerative) algorithm is straightforward: start with every point in its own cluster and compute a proximity matrix \( D(i, j) \) between every pair of points \( i \) and \( j \). Then repeatedly merge the two closest clusters and update the proximity matrix, until only one cluster remains.
The key operation is computing the proximity of two clusters. Different ways of defining the distance between clusters distinguish the different algorithms.
Start with a bunch of points and a proximity matrix:
After some merging steps, we have a handful of clusters:
We want to merge C2 and C5 and update the proximity matrix:
But how?
The key math question is: how do we measure similarity/distance between two clusters of points?
Four obvious options: min, max, average, centroids. These are called “linkage functions.”
Min (single linkage): the minimum distance between two points, one in each cluster.
Max (complete linkage): the maximum distance between two points, one in each cluster.
Average (average linkage): the average distance over all pairs of points, one in each cluster.
Centroid linkage: the distance between the centroids of the two clusters.
Each has its own pros and cons:
See linkage_minmax.R and hclust_examples.R
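A quick sketch with R's built-in hclust, using the scaled cars matrix from the earlier code (the cut at 5 clusters is just for illustration):

D = dist(cars)                                 # Euclidean distances between cars
hc_single   = hclust(D, method = "single")     # min linkage
hc_complete = hclust(D, method = "complete")   # max linkage
hc_average  = hclust(D, method = "average")    # average linkage
hc_centroid = hclust(D^2, method = "centroid") # centroid linkage (conventionally on squared distances)

plot(hc_complete)                              # the dendrogram, i.e. the full hierarchy
table(cutree(hc_complete, k = 5))              # cut the tree into 5 clusters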