So far, we’ve learned about dimension reduction techniques that can help us identify groups of similar points with respect to several potentially correlated variables.
All of the color-coding/grouping in the resulting biplots was by way of a pre-specified grouping variable:
Olive oil region
Fast food type
Olympic medal winner/non-winner
Clustering techniques are ways of identifying clusters of similar data values when potentially there is no pre-defined grouping variable.
Motivation
We’ll look at a variety of methods, all of which are variations on the theme:
Find clusters of data points that are more similar to each other than they are to points in other clusters.
As this goal suggests, measuring dissimilarity between points using the distance measures we’ve previously discussed is a crucial aspect of all clustering methods!
Hierarchical clustering
Hierarchical clustering includes 2 different variants:
Agglomerative clustering
Divisive clustering
Agglomerative clustering
Start with each point in its own cluster
Fuse points that are most similar
Fuse groups that are most similar
Continue until data are in one big cluster
Necessary ingredients:
A way to measure dissimilarity between points (covered)
A way to measure dissimilarity between clusters (new)
Simple example
Let’s start simple by looking at a small 2-dimensional data set.
simple_data
id x y
A A 1.0 6.0
B B 1.5 5.5
C C 3.1 4.0
D D 3.5 5.0
E E 4.5 5.0
F F 3.5 4.5
G G 5.0 7.0
Note the letters here are row IDs, not grouping variables!!
All data points start off in their own cluster
First step
As the data are numeric, we'll use the Euclidean metric to measure inter-point dissimilarity:
A B C D E F
B 0.71
C 2.90 2.19
D 2.69 2.06 1.08
E 3.64 3.04 1.72 1.00
F 2.92 2.24 0.64 0.50 1.12
G 4.12 3.81 3.55 2.50 2.06 2.92
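For reference, the matrix above can be reproduced with dist(); below is a minimal sketch that also re-creates simple_data from the printed table (the column selection and rounding are my assumptions, since the original code isn't shown).

# Re-create the simple example data (coordinates taken from the table above)
simple_data <- data.frame(
  id = LETTERS[1:7],
  x  = c(1.0, 1.5, 3.1, 3.5, 4.5, 3.5, 5.0),
  y  = c(6.0, 5.5, 4.0, 5.0, 5.0, 4.5, 7.0),
  row.names = LETTERS[1:7]
)

# Euclidean distances between all pairs of points, rounded to 2 decimals
round(dist(simple_data[, c("x", "y")]), 2)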
Points D and F are most similar, forming the first cluster
Decision time:
Do we add a point to the (D,F) cluster (maybe E or C)?
Do we form a new cluster (e.g., (A,B))?
Linkage
We now need a way to define distance between two clusters, or between a point and a cluster.
In agglomerative clustering, this distance is called linkage.
Several different linkage methods exist, which can impact the cluster results.
Single linkage
Let \(C_1\) and \(C_2\) represent two clusters
Note: either of these may consist of only a single point
Single linkage defines inter-cluster difference \(d(C_1,C_2)\) as the minimum distance between any point in \(C_1\) and any point in \(C_2\).
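In symbols:

\[d_{single}(C_1, C_2) = \min_{i \in C_1,\; j \in C_2} d(i, j)\]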
Referring to the distance matrix above, with \(C_1 = (D,F)\):
\(d_{single}(C_1, E) = \min(d(D,E), d(F,E)) = \min(1.00, 1.12) = 1.00\)
\(d_{single}(C_1, C) = \min(d(D,C), d(F,C)) = \min(1.08, 0.64) = 0.64\)
\(d_{single}(A, B) = d(A, B) = 0.71\)
Second step result: single linkage
\(d_{single}(C_1, E) = 1.00\)
\(d_{single}(C_1, C) = 0.64\)
\(d_{single}(A,B) = 0.71\)
\(\implies\) second step: \(C\) is added to the \((D,F)\) cluster.
Can you identify the third step in this process?
Will \(E\) be added to the \((C,D,F)\) cluster, or
Will \((A,B)\) form a new cluster?
Complete linkage
Let \(C_1\) and \(C_2\) represent two clusters
Note: either of these may consist of only a single point
Complete linkage defines inter-cluster difference \(d(C_1,C_2)\) as the maximum distance between any point in \(C_1\) and any point in \(C_2\).
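In symbols:

\[d_{complete}(C_1, C_2) = \max_{i \in C_1,\; j \in C_2} d(i, j)\]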
Returning to the example, with \(C_1 = (D,F)\):
\(d_{complete}(C_1, E) = \max(d(D,E), d(F,E)) = \max(1.00, 1.12) = 1.12\)
\(d_{complete}(C_1, C) = \max(d(D,C), d(F,C)) = \max(1.08, 0.64) = 1.08\)
\(d_{complete}(A, B) = 0.71\)
Since 0.71 is now the smallest of these, under complete linkage \((A,B)\) forms a cluster before \(C\) joins \((D,F)\).
Note how, for single linkage, the \((C,D,F)\) cluster forms before the \((A,B)\) cluster.
For average linkage (and Ward and complete too), \((A,B)\) clusters before \(C\) joins \((D,F)\).
Unsurprisingly, \(G\) is the last point to join a cluster.
Agglomerative coefficient
The agglomerative coefficient (AC) measures the amount of clustering structure found for a given linkage.
For each object \(i\) in the data set, let \(m(i)\) denote its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm.
The AC is then defined as the average of all \(1 - m(i)\).
AC close to 1: strong, well-defined clustering structure
AC close to 0: clusters are less distinct, more heterogeneous data are merged together
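In symbols:

\[AC = \frac{1}{n}\sum_{i=1}^{n} \bigl(1 - m(i)\bigr)\]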
cluster_simple_single$ac
[1] 0.5761809
cluster_simple_complete$ac
[1] 0.6963881
cluster_simple_average$ac
[1] 0.6513288
cluster_simple_ward$ac
[1] 0.7426257
The Ward linkage appears to identify the best structure for this tiny data set.
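For reference, objects like the ones above could be produced with agnes() from the cluster package. This is a sketch only: the original call isn't shown, and the column selection and default options are my assumptions, so the exact AC values may differ.

library(cluster)

# Agglomerative clustering of the simple example under four linkages;
# the agglomerative coefficient is stored in the $ac component
cluster_simple_single   <- agnes(simple_data[, c("x", "y")], method = "single")
cluster_simple_complete <- agnes(simple_data[, c("x", "y")], method = "complete")
cluster_simple_average  <- agnes(simple_data[, c("x", "y")], method = "average")
cluster_simple_ward     <- agnes(simple_data[, c("x", "y")], method = "ward")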
Cutting the tree: visualization
From a hierarchical clustering result, final clusters can be determined by:
Specifying the number of clusters \(k\) (most common), or
Specifying the height \(h\) at which to cut the tree
A visual of the resulting clusters is easily produced by specifying k= or h= in the fviz_dend() function.
The code below visualizes the result of cutting the single-linkage tree by specifying k=, and the average-linkage tree by specifying h=. Both result in 5 clusters.
fviz_dend(cluster_simple_single, k = 5) + labs(title = 'Single linkage')
fviz_dend(cluster_simple_average, h = 2) + labs(title = 'Average linkage')
Cutting the tree: execution
fviz_dend() shows the results of cutting the tree at a certain place
cutree() (or factoextra's hcut()) actually performs the cutting, producing cluster labels you can use, e.g., in a biplot
You must specify the number of clusters k to produce in the cut.
Notes:
Cluster labels are arbitrary!! Points in “cluster 1” for one cut are not necessarily in “cluster 1” for a second cut.
Will often want this treated as a factor rather than numeric in plots; hence use of factor().
(single_clusters <- cutree(cluster_simple_single, k = 5))
[1] 1 2 3 3 4 3 5
(average_clusters <- cutree(cluster_simple_average, k = 5))
[1] 1 1 2 3 4 3 5
# Adding to the data frame
bind_cols(simple_data,
          sing_clust = factor(single_clusters),
          ave_clust = factor(average_clusters)) %>%
  as_tibble()
# A tibble: 7 × 5
id x y sing_clust ave_clust
<chr> <dbl> <dbl> <fct> <fct>
1 A 1 6 1 1
2 B 1.5 5.5 2 1
3 C 3.1 4 3 2
4 D 3.5 5 3 3
5 E 4.5 5 4 4
6 F 3.5 4.5 3 3
7 G 5 7 5 5
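As a quick visual check, the cluster labels can be used to color a scatterplot of the original points. The sketch below assumes the objects created above; the plotting choices are my own.

library(dplyr)
library(ggplot2)

# Label each point by its id, colored by the average-linkage cluster assignment
simple_clustered <- bind_cols(simple_data,
                              sing_clust = factor(single_clusters),
                              ave_clust  = factor(average_clusters))
ggplot(simple_clustered, aes(x = x, y = y, color = ave_clust, label = id)) +
  geom_text(size = 5) +
  labs(title = 'Average-linkage clusters (k = 5)', color = 'Cluster')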
Divisive clustering
Works in the opposite direction of agglomerative clustering:
All observations start in 1 cluster
In each subsequent step, the largest available cluster is split into two
The size of cluster \(C_k\) is determined by its "diameter": \(D_k = \max_{i,j \in C_k}d(i,j)\)
The split is done in a way that optimizes inter-cluster dissimilarity
Less common than agglomerative, and can be computationally intensive for large data sets.
Divisive clustering with diana()
The diana() (DIvisive ANAlysis) function in the cluster package performs divisive clustering.
Like agnes(), you can feed it a dissimilarity matrix instead of a data frame if you want to use something like Bray-Curtis to measure dissimilarity.
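A minimal sketch on the small example from earlier (the object name cluster_simple_diana is illustrative, and diana()'s defaults are assumed):

library(cluster)
library(factoextra)

# Divisive clustering of the simple example; the divisive coefficient
# (analogous to the AC) is stored in the $dc component
cluster_simple_diana <- diana(simple_data[, c("x", "y")], metric = "euclidean")
cluster_simple_diana$dc
fviz_dend(cluster_simple_diana, k = 3)   # visualize a 3-cluster cut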
The plot of WSS vs \(k\) shows a steep drop-off until \(k=3\), beyond which we don't gain much improvement.
\(k=3\) clusters seems like the best choice.
It's not always this clear-cut!
Silhouette
While WSS is only really appropriate when a Euclidean "distance from centroid" makes sense, the silhouette can be used when clustering is performed with any distance metric.
Let \(C_k\) represent cluster \(k\), with \(|C_k|\) points in the cluster.
\(a(i)\) measures the average dissimilarity of point \(i\) to all other points within its own cluster.
\(b(i)\) measures the average dissimilarity of point \(i\) to all points within the nearest other cluster \(C_l\).
Silhouette width
We define the silhouette width \(s(i)\) of point \(i\) as:
\[s(i) = \frac{b(i)-a(i)}{\max\{a(i),\, b(i)\}}\]
\(-1 \le s(i) \le 1\)
\(s(i)\) near 1: point \(i\) is much more similar to points in its own cluster than to points in the nearest cluster (i.e., it’s probably in the right cluster).
\(s(i)\) near -1: point \(i\) is much more similar to points in the nearest cluster than to points in its own cluster (i.e., it’s probably in the wrong cluster).
\(s(i)\) near 0: point \(i\) is about equally similar to the members of its own cluster and those of the nearest cluster.
If point \(i\) is the only point in its cluster, we define \(s(i) = 0\).
Average silhouette width
Averaging:
\[\bar s = \frac{1}{n} \sum_{i=1}^n s(i)\]
gives us a measure of the overall cluster quality, with larger = better.
Computing silhouette widths
To compute the silhouette given a cluster definition, we can use the silhouette() function in the cluster package.
Arguments:
x, a vector of cluster assignments as integers;
dist, the distance matrix used.
(k3_sils <- silhouette(x = cutree(df1_agnes, k = 3), dist = dist(df1))) %>% head()
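The average silhouette width \(\bar s\) for this cut can then be obtained by averaging the sil_width column of the returned object (a small sketch, assuming k3_sils as created above):

# silhouette() returns a matrix-like object with columns cluster, neighbor, sil_width;
# the mean of sil_width is the average silhouette width
mean(k3_sils[, "sil_width"])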
Plotting the WSS and mean silhouette width vs \(k\), using the Ward linkage clustering results:
fviz_nbclust(drugs, FUNcluster = hcut, method = 'wss', diss = drug_dist,
             hc_method = 'ward', hc_func = 'agnes') +
  labs(title = 'Plot of WSS vs k')
fviz_nbclust(drugs, FUNcluster = hcut, method = 'silhouette', diss = drug_dist,
             hc_method = 'ward', hc_func = 'agnes') +
  labs(title = 'Plot of mean silhouette width vs k')
The WSS plot suggests \(k=3\), while the silhouette method suggests \(k=2\), with \(k=3\) the second-best choice. I also re-created these using hc_method='complete'; the results looked nearly identical.
Will further explore both \(k=2\) and \(k=3\).
Visualizing resulting clusters
# The adding of plots is made possible by patchwork
fviz_dend(drug_ward, k = 2) + ggtitle('Ward linkage, 2 clusters') +
  fviz_dend(drug_ward, k = 3) + ggtitle('Ward linkage, 3 clusters')
Cutting the tree
Proceeding to cut the trees:
(ward_clusters <- cutree(drug_ward, k = 2:3) %>%
    data.frame() %>%
    setNames(c('k2', 'k3'))) %>% head()
k2 k3
1 1 1
2 2 2
3 1 1
4 1 1
5 1 1
6 2 2
Cluster characterizations
It remains to characterize the clusters, using one of the dimension-reduction techniques we’ve learned about.
Both solutions cluster the high-poverty, high-HS-dropout states
High-poverty: southern states, e.g. Mississippi, Alabama, Tennessee, etc.
High HS dropout rates: Western states, e.g. California, Oregon, Arizona; plus DC and Florida
The 3-cluster solution makes a pretty informative separation between the “binge-drinking” states (hello, Wisconsin and Minnesota), and the “high-income/high drug use” states, many in the north Atlantic region of the country
Would probably proceed with 3 clusters, in spite of its slightly lower average silhouette width
fviz_pca(drug_pca, habillage = factor(average_clusters), repel = TRUE) +
  ggtitle('3 clusters with average linkage') +
  guides(color = 'none', shape = 'none') +
  theme(plot.title = element_text(hjust = 0.5, size = 20))
Single and average linkage methods produce very lopsided clusters, with 49 states in one cluster, and solo clusters of 1 state each.
The 3-cluster complete-linkage solution looks similar to the Ward solution, but groups Maine and Vermont in the "green" cluster with other, more geographically similar states (they were in the "blue", Midwestern-heavy cluster with Ward linkage).
Although Maine and Vermont are geographically more similar to states in the "green" cluster, they do appear to be more similar to the Midwestern binge-drinking states with respect to the variables used.
Boxplots for inter-cluster differences
Boxplots can be a more intuitive way to visualize inter-cluster differences.
# Add Ward k3 clusters to data set
# pivot to prepare for faceting
(drug_update <- drugs %>%
   rownames_to_column('State') %>%
   mutate(ward_k3 = factor(ward_clusters$k3)) %>%
   pivot_longer(cols = DrugUse:Income, names_to = 'variable', values_to = 'value')) %>%
  head()
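A faceted set of boxplots based on the pivoted data might then look like the sketch below (column names as created above; the plotting choices are my own):

library(ggplot2)

# One boxplot per variable, comparing the three Ward clusters
ggplot(drug_update, aes(x = ward_k3, y = value, fill = ward_k3)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Ward k = 3 cluster", y = NULL, fill = "Cluster")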