4.1 - Hierarchical clustering

Motivation

  • So far, we’ve learned about dimension reduction techniques that can help us identify groups of similar points with respect to several potentially correlated variables.
  • All color-coding/grouping in the resulting biplots has been by way of a pre-specified grouping variable:
    • Olive oil region
    • Fast food type
    • Olympic medal winner/non-winner
  • Clustering techniques are ways of identifying clusters of similar data values when there is potentially no pre-defined grouping variable.

Motivation

  • We’ll look at a variety of methods, all of which are variations on the theme:
Find clusters of data points that are more similar to each other than they are to points in other clusters.
  • As this goal suggests, measuring dissimilarity between points using the distance measures we’ve previously discussed is a crucial aspect of all clustering methods!

Hierarchical clustering

  • Hierarchical clustering includes 2 different variants:
    • Agglomerative clustering
    • Divisive clustering

Agglomerative clustering

  • Start with each point in its own cluster
  • Fuse points that are most similar
  • Fuse groups that are most similar
  • Continue until data are in one big cluster
  • Necessary ingredients:
    • Way to measure dissimilarity between points (covered)
    • Way to measure similarity between clusters (new)

Simple example

  • Let’s start simple by looking at a small 2-dimensional data set.
simple_data
  id   x   y
A  A 1.0 6.0
B  B 1.5 5.5
C  C 3.1 4.0
D  D 3.5 5.0
E  E 4.5 5.0
F  F 3.5 4.5
G  G 5.0 7.0
  • Note the letters here are row IDs, not grouping variables!!
  • All data points start off in their own cluster

First step

  • As the data are numeric, we'll use the Euclidean metric to measure inter-point dissimilarity (the matrix below can be reproduced with dist(); see the sketch after this list)
     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92
  • Points D and F are most similar (distance 0.50), forming the first cluster \(C_1 = (D,F)\)
  • Decision time:
    • Do we add a point to the (D,F) cluster (maybe E or C)?
    • Do we form a new cluster (e.g., (A,B))?
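For reference, a minimal sketch reproducing the distance matrix above with dist(), rounded to two decimals:

# Euclidean inter-point distances for the simple data (matches the matrix above)
round(dist(simple_data[, c('x','y')]), 2)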

Linkage

  • We now need a way to define distance between two clusters, or between a point and a cluster.
  • In agglomerative clustering, this distance is called linkage.
  • Several different linkage methods exist, which can impact the cluster results.

Single linkage

  • Let \(C_1\) and \(C_2\) represent two clusters
    • Note: either of these may consist of only a single point
  • Single linkage defines inter-cluster difference \(d(C_1,C_2)\) as the minimum distance between any point in \(C_1\) and any point in \(C_2\).

\(d_{single}(C_1,E)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{single}(C_1, E) = \min(1.00, 1.12) = 1.00\)

\(d_{single}(C_1,C)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{single}(C_1, C) = \min(1.08, 0.64) = 0.64\)

\(d_{single}(A,B)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{single}(A,B) = 0.71\)

Second step result: single linkage

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92
  • \(d_{single}(C_1, E) = 1.00\)
  • \(d_{single}(C_1, C) = 0.64\)
  • \(d_{single}(A,B) = 0.71\)
  • \(\implies\) second step: \(C\) is added to the \((D,F)\) cluster.
  • Can you identify the third step in this process?
    • Will \(E\) be added to the \((C,D,F)\) cluster, or
    • Will \((A,B)\) form a new cluster?

Complete linkage

  • Let \(C_1\) and \(C_2\) represent two clusters
    • Note: either of these may consist of only a single point
  • Complete linkage defines inter-cluster difference \(d(C_1,C_2)\) as the maximum distance between any point in \(C_1\) and any point in \(C_2\).

\(d_{complete}(C_1,E)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{complete}(C_1, E) = \max(1.00, 1.12) = 1.12\)

\(d_{complete}(C_1,C)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{complete}(C_1, C) = \max(1.08, 0.64) = 1.08\)

\(d_{complete}(A,B)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{complete}(A,B) = 0.71\)

Second step result: complete linkage

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92
  • \(d_{complete}(C_1, E) = 1.12\)
  • \(d_{complete}(C_1, C) = 1.08\)
  • \(d_{complete}(A,B) = 0.71\)
  • \(\implies\) second step: \((A,B)\) forms a new cluster.
  • Can you identify the third step in this process?

Average linkage

  • Let \(C_1\) and \(C_2\) represent two clusters
    • Note: either of these may consist of only a single point
  • Average linkage defines inter-cluster difference \(d(C_1,C_2)\) as the average of all pairwise distances between the points in \(C_1\) and \(C_2\).

\(d_{average}(C_1,E)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{average}(C_1, E) = mean(1.00, 1.12) = 1.06\)

\(d_{average}(C_1,C)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{average}(C_1, C) = mean(1.08, 0.64) = 0.86\)

\(d_{average}(A,B)\)

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(d_{average}(A,B) = 0.71\)

Second step result: average linkage

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92
  • \(d_{average}(C_1, E) = 1.06\)
  • \(d_{average}(C_1, C) = 0.86\)
  • \(d_{average}(A,B) = 0.71\)
  • \(\implies\) second step: \((A,B)\) forms a new cluster.

\(d_{average}(C_1, C_2)\)

Note that with \(C_1 = (D,F)\) and \(C_2 = (A,B)\):

     A    B    C    D    E    F
B 0.71                         
C 2.90 2.19                    
D 2.69 2.06 1.08               
E 3.64 3.04 1.72 1.00          
F 2.92 2.24 0.64 0.50 1.12     
G 4.12 3.81 3.55 2.50 2.06 2.92

\(\small d_{average}(C_1, C_2) = \frac{2.69+2.92+2.06+2.24}{4} = 2.4775\)

Ward’s linkage

  • Joins clusters in an attempt to minimize within-cluster variability
  • Joins the clusters whose centroids are closest, weighting by the sizes of the clusters
  • Define \(C_1,C_2\) as two clusters, with centroids \(\bar C_1\) and \(\bar C_2\).

\[d_{Ward}(C_1,C_2) = \frac{|C_1||C_2|}{|C_1|+|C_2|} d(\bar C_1, \bar C_2)^2\]

Example: finding \(d_{Ward}\)

If \(C_1 = (D,F)\) and \(C_2 = (A,B)\), \(d(\bar C_1, \bar C_2)\) is visualized below.

  • Finding \(d(\bar C_1, \bar C_2)\) is straightforward with raw data
    • \(\bar C_1\) = the \(p\)-dimensional mean vector of all points in \(C_1\)
    • \(\bar C_2\) = the \(p\)-dimensional mean vector of all points in \(C_2\)
    • Compute \(d(\bar C_1, \bar C_2)\) using whatever distance metric you're using for everything else (a worked sketch follows this list)
  • Can also be computed from distance matrix recursively using Lance-Williams algorithm
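For concreteness, here is a small sketch of this computation in R for \(C_1 = (D,F)\) and \(C_2 = (A,B)\), assuming the rows of simple_data are named A-G as printed earlier:

# Centroids of C1 = (D,F) and C2 = (A,B), selecting rows by row name
C1_bar <- colMeans(simple_data[c('D','F'), c('x','y')])   # (3.50, 4.75)
C2_bar <- colMeans(simple_data[c('A','B'), c('x','y')])   # (1.25, 5.75)
# Euclidean distance between the centroids
d_centroids <- sqrt(sum((C1_bar - C2_bar)^2))              # about 2.46
# Ward distance: (|C1||C2| / (|C1| + |C2|)) * d^2 = (2*2/4) * d^2
(2 * 2) / (2 + 2) * d_centroids^2                          # 6.0625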

Linkages summarized

Table modified from Everitt, Landau, and Leese, Cluster Analysis (4th ed.), p. 62:

| Method | How defined | Remarks |
|--------|-------------|---------|
| Single | Minimum distance between inter-cluster pairs | Tends to produce unbalanced and straggly clusters (“chaining”), especially in large data sets. |
| Complete | Maximum distance between inter-cluster pairs | Tends to find compact clusters with equal diameters (maximum distance between objects within clusters). |
| Average | Average of all inter-cluster pair distances | Tends to join clusters with small within-cluster variances. Intermediate between single and complete. Relatively robust. |
| Ward | Weighted squared distance between cluster centroids | Tends to find same-size, spherical clusters; sensitive to outliers. |

This is not an exhaustive list of possible linkages, but is enough to get you started.

Agglomerative clustering: execution

  • Two primary functions:
    • hclust() (base R)
    • agnes() (cluster package) ← will use this one
  • agnes() stands for AGglomerative NESting
  • Accepts either a data set, or a dissimilarity object
    • If data set, only dissimilarity options are metric='euclidean' or metric = 'manhattan'
    • If you want to use something else (Gower, Bray-Curtis, etc.), you have to compute the dissimilarity beforehand (see the sketch after this list)
  • Specify linkage with method =, default is method='average'
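As a small sketch of the second route, a dissimilarity can be pre-computed (here with daisy() and the Gower metric, purely for illustration) and passed straight to agnes():

library(cluster)
# Pre-compute a Gower dissimilarity, then cluster on it
gower_diss <- daisy(simple_data[, c('x','y')], metric = 'gower')
agnes(gower_diss, method = 'average')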

Execution for simple data set

  • Let’s implement agnes on our simple data set:
  • I am not pre-scaling here, but you should in practice if using Euclidean/Manhattan!!
  • Any sort of row labeling should be stored as rownames, not as a variable (sorry, tidy data philosophy) - use column_to_rownames() if you need to.
library(cluster)
(simple_numeric <- simple_data %>% dplyr::select(x,y))
    x   y
A 1.0 6.0
B 1.5 5.5
C 3.1 4.0
D 3.5 5.0
E 4.5 5.0
F 3.5 4.5
G 5.0 7.0
cluster_simple_single <- agnes(simple_numeric, metric='euclidean', method='single')
cluster_simple_complete <- agnes(simple_numeric, metric='euclidean', method='complete')
cluster_simple_average <- agnes(simple_numeric, metric='euclidean', method='average')
cluster_simple_ward <- agnes(simple_numeric, metric='euclidean', method='ward')

Dendrograms

  • Dendrograms, aka “tree diagrams,” are frequently used to explore the results of hierarchical clustering.
  • The fviz_dend() function, from the factoextra package, produces a ggplot-styled version of a dendrogram from a clustering result.
    • Caution: it can take a long time even for data sets with a few hundred observations; consider base plotting methods for larger examples.

Producing dendrograms

Producing the dendrograms for our simple example, for single and average linkages (complete and Ward look very similar to average):

library(factoextra)
fviz_dend(cluster_simple_single) + 
  labs(title='Single linkage')

fviz_dend(cluster_simple_average) + 
  labs(title='Average linkage')

  • Note for single linkage how the \((C,D,F)\) cluster forms before the \((A,B)\) cluster
  • For the average linkage (and Ward and complete too), \((A,B)\) cluster before \(C\) joins \((D,F)\).
  • Unsurprisingly, \(G\) clusters as a last resort.

Agglomerative coefficient

  • The agglomerative coefficient (AC) measures the amount of clustering structure found for a given linkage.
  • For each object \(i\) in the data set, \(d(i)\) represents its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm.
  • AC is then defined as the average of all \(1 - d(i)\).
  • AC close to 1: strong, well-defined clustering structure
  • AC close to 0: clusters are less distinct; more heterogeneous data are being merged together
cluster_simple_single$ac
[1] 0.5761809
cluster_simple_complete$ac
[1] 0.6963881
cluster_simple_average$ac
[1] 0.6513288
cluster_simple_ward$ac
[1] 0.7426257
  • The Ward linkage appears to identify the best structure for this tiny data set.
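For convenience, a small sketch comparing all four ACs in one call (using simple_numeric from above):

# Agglomerative coefficients for each linkage, side by side
sapply(c('single', 'complete', 'average', 'ward'),
       function(m) agnes(simple_numeric, metric = 'euclidean', method = m)$ac)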

Cutting the tree: visualization

  • From a hierarchical clustering result, final clusters can be determined by:
    • Specifying number of clusters (most common);
    • Specifying the height at which to cut the tree
  • A visual of the resulting clusters is easily found by specifying k= or h= in the fviz_dend() function.
  • The code below visualizes the results of cutting the single-linkage tree by specifying k=, and in the average-linkage tree by specifying h=. Both result in 5 clusters.
fviz_dend(cluster_simple_single, k = 5) + 
  labs(title='Single linkage')

fviz_dend(cluster_simple_average, h = 2) + 
  labs(title='Average linkage')

Cutting the tree: execution

  • fviz_dend() shows the results of cutting the tree at a certain place
  • cutree() (used below) actually performs the cutting, producing cluster labels you can use, e.g., in a biplot (a quick scatterplot sketch follows the output below)
  • Must specify number of clusters k to produce in the cut.
  • Notes:
    • Cluster labels are arbitrary!! Points in “cluster 1” for one cut are not necessarily in “cluster 1” for a second cut.
    • Will often want this treated as a factor rather than numeric in plots; hence use of factor().
(single_clusters <- cutree(cluster_simple_single, k = 5))
[1] 1 2 3 3 4 3 5
(average_clusters <- cutree(cluster_simple_average, k = 5))
[1] 1 1 2 3 4 3 5
# Adding to the data frame
bind_cols(simple_data, 
          sing_clust=factor(single_clusters), 
          ave_clust=factor(average_clusters)
) %>% as_tibble()
# A tibble: 7 × 5
  id        x     y sing_clust ave_clust
  <chr> <dbl> <dbl> <fct>      <fct>    
1 A       1     6   1          1        
2 B       1.5   5.5 2          1        
3 C       3.1   4   3          2        
4 D       3.5   5   3          3        
5 E       4.5   5   4          4        
6 F       3.5   4.5 3          3        
7 G       5     7   5          5        
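As promised, a quick sketch using the average-linkage labels in a plot (assuming ggplot2 is loaded, e.g. via the tidyverse):

# Scatterplot of the simple data, colored by average-linkage cluster label
simple_plot_data <- bind_cols(simple_data, ave_clust = factor(average_clusters))
ggplot(simple_plot_data, aes(x = x, y = y, color = ave_clust, label = id)) + 
  geom_text() + 
  labs(color = 'Cluster')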

Divisive clustering

  • Works opposite of agglomerative clustering
  • All observations start in 1 cluster
  • In each subsequent step, largest available cluster split into two
  • Size of cluster \(C_k\) determined by its “diameter”: \(D_k = \max_{i,j \in C_k}d(i,j)\)
  • Split is done in a way to optimize inter-cluster dissimilarity
  • Less common than agglomerative, and can be computationally intensive for large data sets.

Divisive clustering with diana()

  • The diana() (DIvisive ANAlysis) function in the cluster package performs divisive clustering.
  • Like agnes(), can feed it a dissimilarity matrix instead of data frame if you want to use something like Bray-Curtis to measure dissimilarity
simple_divisive <- diana(simple_data[,c('x','y')], metric='euclidean')
fviz_dend(simple_divisive)

  • Interestingly, points \(E\) and \(G\) are grouped into a cluster
  • If this tree is cut to form 4 clusters, \((E,G)\) will be together while \(C\) is on its own, which doesn't seem right.

How many clusters?

  • Hierarchical clustering defines methods for forming clusters, given \(k\)
  • What should \(k\) be?
  • Two prominent ways of choosing \(k\):
    • Within-cluster sum-of-squares (“wss”); only appropriate for numeric data where a cluster “centroid” can be defined
    • Silhouette widths; applicable for any distance metric (Jaccard, Gower, etc.)

Simulated data example

Consider the following simulated 2-dimensional data:

  • Agglomerative clustering, Euclidean metric, Ward’s linkage
  • Cutting the tree at \(k=2,3,4,5\), and converting cluster labels to factor variables:
df1_agnes <- agnes(df1, method='ward')
clusters <- cutree(df1_agnes, k = 2:5) %>% 
  data.frame() %>% 
  setNames(c('clust2','clust3','clust4','clust5')) %>% 
  mutate(across(everything(), as.factor))

df1_with_clusters <- bind_cols(df1,clusters) 

Plotting the data with defined clusters

How many clusters would you pick?

Within-cluster sum-of-squares (WSS)

  • This method computes the \(p\)-dimensional centroid (mean vector) of each cluster
  • It then sums the squared Euclidean distances between each point and its cluster's centroid (a by-hand sketch follows this list)

\[WSS = \sum_{j=1}^k\sum_{i=1}^{n_j}(||X_{ij}- \bar X_j||_{2})^2\]

  • WSS always decreases as \(k\) increases
  • Plotting as a function of \(k\), choose \(k\) that yields an “elbow” in the plot, beyond which there are diminishing returns (like a scree plot)
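As a sanity check on the formula, here is a by-hand sketch for the \(k=3\) cut of the simulated data; wss_by_hand is just an illustrative helper, not a library function:

# WSS for a given set of cluster labels (illustrative helper)
wss_by_hand <- function(X, clusters) {
  sum(sapply(split(as.data.frame(X), clusters), function(cl) {
    cl <- as.matrix(cl)
    sum(sweep(cl, 2, colMeans(cl))^2)   # squared deviations from the cluster centroid
  }))
}
wss_by_hand(df1, cutree(df1_agnes, k = 3))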

fviz_nbclust()

  • The function fviz_nbclust() is a versatile function for plotting metrics like WSS as a function of \(k\).
  • Basic utilization for our example:
fviz_nbclust(df1,
             FUNcluster = hcut,
             k.max = 10,
             method='wss',
             hc_method='ward',
             hc_func = 'agnes')

  • Steep drop off until \(k=3\), beyond which we don’t gain much improvement
  • \(k=3\) clusters seems like the best choice.
  • Not always this clear-cut!

Silhouette

  • While WSS only makes sense when a Euclidean “distance from centroid” is meaningful, the silhouette can be used with any distance metric.
  • Let \(C_k\) represent cluster \(k\), with \(|C_k|\) points in the cluster.

  • Define, for data point \(i \in C_k\):

\[a(i) = \frac{1}{|C_k|-1} \sum_{j\in C_k,\ j\ne i} d(i,j)\]

  • \(a(i)\) measures the average dissimilarity of point \(i\) to all other points within its own cluster \(C_k\).
  • Now consider a different cluster \(C_l\), and define:

\[b(i) = \min_{l\ne k}\frac{1}{|C_l|} \sum_{j\in C_l}d(i,j)\]

  • \(b(i)\) measures the average dissimilarity of point \(i\) to the points in the nearest other cluster \(C_l\).

Silhouette width

  • We define the silhouette width \(s(i)\) of point \(i\) as:

\[s(i) = \frac{b(i)-a(i)}{\max(a(i),b(i))}\]

  • \(-1 \le s(i) \le 1\)
  • \(s(i)\) near 1: point \(i\) is much more similar to points in its own cluster than to points in the nearest cluster (i.e., it’s probably in the right cluster).
  • \(s(i)\) near -1: point \(i\) is much more similar to points in the nearest cluster than to points in its own cluster (i.e., it’s probably in the wrong cluster).
  • \(s(i)\) near 0: point \(i\) is equally similar to its own members and nearest members.
    • If point \(i\) is in a cluster by itself (a singleton), we define \(s(i) = 0\)

Average silhouette width

Averaging:

\[\bar s = \frac{1}{n} \sum_{i=1}^n s(i)\]

gives us a measure of the overall cluster quality, with larger = better.

Computing silhouette widths

  • To compute the silhouette given a cluster definition, we can use the silhouette() function in the cluster package.
    • Arguments:
      • x, a vector of cluster assignments as integers;
      • dist, the distance matrix used.
(k3_sils <- silhouette(x = cutree(df1_agnes, k = 3), 
           dist = dist(df1))
 ) %>% head
     cluster neighbor sil_width
[1,]       1        2 0.7960441
[2,]       1        2 0.7162754
[3,]       1        2 0.7894861
[4,]       1        2 0.6817848
[5,]       1        2 0.6571660
[6,]       1        2 0.7463931
(k5_sils <- silhouette(x = cutree(df1_agnes, k = 5), 
           dist = dist(df1))
 ) %>% head
     cluster neighbor   sil_width
[1,]       1        2 -0.08724185
[2,]       2        1  0.56753787
[3,]       1        2  0.45855553
[4,]       1        2  0.47252905
[5,]       2        1  0.46549746
[6,]       1        2  0.47466167
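The overall average silhouette width \(\bar s\) for each solution can be computed directly from these objects:

# Average silhouette width (s-bar) for the k = 3 and k = 5 solutions
mean(k3_sils[, 'sil_width'])
mean(k5_sils[, 'sil_width'])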

Silhouette plots

  • Silhouette plots display \(s(i)\) for every data point, arranged in decreasing order within each cluster.
  • For this we use the fviz_silhouette() function:
fviz_silhouette(k3_sils)
  cluster size ave.sil.width
1       1  100          0.74
2       2   99          0.71
3       3  101          0.73

fviz_silhouette(k5_sils)
  cluster size ave.sil.width
1       1   69          0.31
2       2   31          0.44
3       3   99          0.65
4       4   18          0.47
5       5   83          0.30

The 3-cluster solution has much better \(\bar s\) (0.73 vs 0.44), and only a few tiny negative blue silhouette widths

Can you identify the points with negative widths?

Plotting \(\bar s\) vs \(k\)

  • Plotting the average silhouette widths versus \(k\), our goal is to identify \(k\) with the largest \(\bar s\).
  • Easily implemented with fviz_nbclust():
fviz_nbclust(df1,
             FUNcluster = hcut,
             k.max = 10,
             method='silhouette',
             hc_method='ward',
             hc_func = 'agnes')
  • As expected, \(k=3\) appears to be the best number of clusters.

Real data

  • Let’s analyze some real data!
library(tidyverse)
drugs <- read.csv('Data/IllicitDrug.csv') %>% 
  column_to_rownames('State')
head(drugs)
           DrugUse BingeDrink Poverty HSdrop Income
Alabama        3.3       15.6    14.5   12.6  22946
Alaska         8.5       19.8     9.4   10.9  28523
Arizona        5.4       17.4    16.6   14.4  25307
Arkansas       2.8       16.8    14.8   11.4  22114
California     6.2       16.7    15.4   14.2  29819
Colorado       6.5       19.8     9.2    9.8  31678
  • DrugUse = estimate of the % of the population in the state that use illicit drugs.
  • BingeDrink = estimate of the % of the population in the state that binge drink.
  • Poverty = estimate of the % of the population in the state living in poverty.
  • HSdrop = estimate of the % of the population that has dropped out of high school.
  • Income = estimate of the per capita income ($).

Distance matrix

  • For this example, we'll demonstrate clustering using the distance matrix as input to agnes() instead of the data matrix.
  • Since we have numeric data, we'll use Euclidean distance.
  • Clearly need to scale (or Income will drive everything):
drugs_scaled <- scale(drugs)
drug_dist  <- dist(drugs_scaled, method='euclidean')

Agglomerative clustering

  • Implementation of agglomerative clustering, with various linkages:
(drug_single <- agnes(drug_dist, method='single'))$ac
[1] 0.6173047
(drug_complete <- agnes(drug_dist, method='complete'))$ac
[1] 0.8164902
(drug_average <- agnes(drug_dist, method='average'))$ac
[1] 0.7364308
(drug_ward <- agnes(drug_dist, method='ward'))$ac
[1] 0.9015376
  • Ward once again has the best agglomerative coefficient, with complete linkage in second.

Dendrograms

The “scraggliness” of single linkage, and the more balanced cluster sizes of complete and Ward linkage, are evident here:

p_single <- fviz_dend(drug_single) + ggtitle('Single linkage')
p_complete <- fviz_dend(drug_complete)+ ggtitle('Complete linkage')
p_average <- fviz_dend(drug_average)+ ggtitle('Average linkage')
p_ward <- fviz_dend(drug_ward)+ ggtitle('Ward linkage')

library(patchwork)
(p_single + p_complete)/(p_average + p_ward)

Choosing number of clusters

  • Plotting the WSS and mean silhouette width vs \(k\), using the Ward linkage clustering results:
fviz_nbclust(drugs, 
             FUNcluster = hcut,
             method='wss',
             diss = drug_dist,
             hc_method='ward',
             hc_func = 'agnes') + 
  labs(title = 'Plot of WSS vs k')

fviz_nbclust(drugs, 
             FUNcluster = hcut,
             method='silhouette',
             diss = drug_dist,
             hc_method='ward',
             hc_func = 'agnes') + 
  labs(title = 'Plot of mean silhouette width vs k')

  • The WSS plot suggests \(k=3\), while the silhouette method suggests \(k=2\), with \(k=3\) the second-best choice. I also re-created these using hc_method='complete'; the results looked nearly identical.
  • Will further explore both \(k=2\) and \(k=3\).

Visualizing resulting clusters

# Combining the plots side by side is made possible by patchwork
fviz_dend(drug_ward,k=2)+ ggtitle('Ward linkage, 2 clusters') +
fviz_dend(drug_ward,k=3)+ ggtitle('Ward linkage, 3 clusters')

Cutting the tree

Proceeding to cut the trees:

(ward_clusters <- cutree(drug_ward, k = 2:3) 
    %>% data.frame()
    %>% setNames(c('k2','k3'))
) %>% head
  k2 k3
1  1  1
2  2  2
3  1  1
4  1  1
5  1  1
6  2  2

Cluster characterizations

  • It remains to characterize the clusters, using one of the dimension-reduction techniques we’ve learned about.
  • Let’s apply PCA:
drug_pca <- prcomp(drugs, center=TRUE, scale. = TRUE)

PCA biplots with cluster labels

  • Using fviz_pca() to create the biplot, color-coding by cluster label:
k2_biplot <- fviz_pca(drug_pca, 
         habillage = factor(ward_clusters$k2),
         repel = TRUE) + 
      ggtitle('2-cluster solution') + 
      guides(color='none',shape='none')
k3_biplot <- fviz_pca(drug_pca, 
         habillage = factor(ward_clusters$k3),
         repel = TRUE) + 
        ggtitle('3-cluster solution') + 
        guides(color='none',shape='none')
k2_biplot + k3_biplot

  • Both solutions cluster the high-poverty, high-HS-dropout states
    • High-poverty: southern states, e.g. Mississippi, Alabama, Tennessee, etc.
    • High HS dropout rates: Western states, e.g. California, Oregon, Arizona; plus DC and Florida
  • The 3-cluster solution makes a pretty informative separation between the “binge-drinking” states (hello, Wisconsin and Minnesota), and the “high-income/high drug use” states, many in the north Atlantic region of the country
  • Would probably proceed with 3 clusters, in spite of its slightly lower average silhouette width

3 cluster solutions, other linkages

single_clusters <- cutree(drug_single, k=3)
complete_clusters <- cutree(drug_complete, k=3)
average_clusters <- cutree(drug_average, k=3)
fviz_pca(drug_pca, 
         habillage = factor(single_clusters),
         repel = TRUE) + 
      ggtitle('3 clusters with single linkage') + 
      guides(color='none',shape='none') +
  theme(plot.title = element_text(hjust = 0.5, size = 20))

fviz_pca(drug_pca, 
         habillage = factor(complete_clusters),
         repel = TRUE) + 
      ggtitle('3 clusters with complete linkage') + 
      guides(color='none',shape='none')+
  theme(plot.title = element_text(hjust = 0.5, size = 20))

fviz_pca(drug_pca, 
         habillage = factor(average_clusters),
         repel = TRUE) + 
      ggtitle('3 clusters with average linkage') + 
      guides(color='none',shape='none')+
  theme(plot.title = element_text(hjust = 0.5, size = 20))

  • Single and average linkage methods produce very lopsided clusters, with 49 states in one cluster, and solo clusters of 1 state each.
  • The 3-cluster complete solution looks similar to the Ward solution, but groups Maine and Vermont in the “green” cluster with other, more geographically similar states (they were in the “blue”, Midwestern-heavy cluster with Ward linkage)

Where do Maine and Vermont belong?

complete_k3_sil <- silhouette(complete_clusters, drug_dist)
rownames(complete_k3_sil) <- rownames(drugs)
fviz_silhouette(complete_k3_sil, print.summary = FALSE) + 
  guides(fill=FALSE,color=FALSE) + 
  theme_classic() +
  ggtitle('Complete linkage,\n average silhouette width = 0.24') +
  theme(axis.text.x = element_text(angle = 90,size = 8))

ward_k3_sil <- silhouette(ward_clusters$k3, drug_dist)
rownames(ward_k3_sil) <- rownames(drugs)
fviz_silhouette(ward_k3_sil, print.summary=FALSE) + 
  guides(fill=FALSE,color=FALSE) + 
  theme_classic() +
  ggtitle('Ward linkage,\n average silhouette width = 0.25') + 
  theme(axis.text.x = element_text(angle = 90,size = 8))

  • Although Maine and Vermont are geographically closer to the states in the “green” cluster, they do appear more similar to the Midwestern binge-drinking states with respect to the variables used.

Boxplots for inter-cluster differences

  • Boxplots can be a more intuitive way to visualize inter-cluster differences
# Add Ward k3 clusters to data set
# pivot to prepare for faceting
(drug_update <- drugs 
      %>% rownames_to_column('State')
      %>% mutate(ward_k3 = factor(ward_clusters$k3))  
      %>% pivot_longer(cols=DrugUse:Income, 
                       names_to = 'variable', 
                       values_to = 'value')
) %>% head()
# A tibble: 6 × 4
  State   ward_k3 variable     value
  <chr>   <fct>   <chr>        <dbl>
1 Alabama 1       DrugUse        3.3
2 Alabama 1       BingeDrink    15.6
3 Alabama 1       Poverty       14.5
4 Alabama 1       HSdrop        12.6
5 Alabama 1       Income     22946  
6 Alaska  2       DrugUse        8.5
#Visualize:
ggplot(data = drug_update) + 
  geom_boxplot(aes(x = ward_k3, y = value, fill = ward_k3)) + 
  labs(fill='Cluster', x = 'Cluster') + 
  facet_wrap(~variable, scales = 'free_y') + 
  theme_classic()

  • Visualizing all variables is manageable with only 5; in practice you may need to filter down to the "most important" ones (a numeric summary sketch follows below)
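As a small complementary sketch, cluster-wise means of each variable can be tabulated from drug_update:

# Mean of each variable within each Ward k = 3 cluster
drug_update %>% 
  group_by(ward_k3, variable) %>% 
  summarize(mean_value = mean(value), .groups = 'drop') %>% 
  pivot_wider(names_from = variable, values_from = mean_value)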