So far, we’ve learned about dimension reduction techniques that can help us identify groups of similar points with respect to several potentially correlated variables.
All of the color-coding/grouping in the resulting biplots was by way of a pre-specified grouping variable:
Olive oil region
Fast food type
Olympic medal winner/non-winner
Clustering techniques are ways of identifying clusters of similar data values when potentially there is no pre-defined grouping variable.
Motivation
We’ll look at a variety of methods, all of which are variations on the theme:
Find clusters of data points that are more similar to each other than they are to points in other clusters.
As this goal suggests, measuring dissimilarity between points using the distance measures we’ve previously discussed is a crucial aspect of all clustering methods!
Hierarchical clustering
Hierarchical clustering includes 2 different variants:
Agglomerative clustering
Divisive clustering
Agglomerative clustering
Start with each point in its own cluster
Fuse points that are most similar
Fuse groups that are most similar
Continue until data are in one big cluster
Necessary ingredients:
A way to measure dissimilarity between points (covered)
A way to measure dissimilarity between clusters (new)
Simple example
Let’s start simple by looking at a small 2-dimensional data set.
simple_data
id x y
A A 1.0 6.0
B B 1.5 5.5
C C 3.1 4.0
D D 3.5 5.0
E E 4.5 5.0
F F 3.5 4.5
G G 5.0 7.0
Note the letters here are row IDs, not grouping variables!!
All data points start off in their own cluster
First step
As the data are numeric, we'll use the Euclidean metric to measure inter-point dissimilarity:
A B C D E F
B 0.71
C 2.90 2.19
D 2.69 2.06 1.08
E 3.64 3.04 1.72 1.00
F 2.92 2.24 0.64 0.50 1.12
G 4.12 3.81 3.55 2.50 2.06 2.92
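For reference, the matrix above can be reproduced with dist(); below is a minimal sketch that also re-creates simple_data from the printed table (the column selection and rounding are my assumptions, since the original code isn't shown).

# Re-create the simple example data (coordinates taken from the table above)
simple_data <- data.frame(
  id = LETTERS[1:7],
  x  = c(1.0, 1.5, 3.1, 3.5, 4.5, 3.5, 5.0),
  y  = c(6.0, 5.5, 4.0, 5.0, 5.0, 4.5, 7.0),
  row.names = LETTERS[1:7]
)

# Euclidean distances between all pairs of points, rounded to 2 decimals
round(dist(simple_data[, c("x", "y")]), 2)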
Points D and F are most similar, forming the first cluster
Decision time:
Do we add a point to the (D,F) cluster (maybe E or C)?
Do we form a new cluster (e.g., (A,B))?
Linkage
We now need a way to define distance between two clusters, or between a point and a cluster.
In agglomerative clustering, this distance is called linkage.
Several different linkage methods exist, which can impact the cluster results.
Single linkage
Let \(C_1\) and \(C_2\) represent two clusters
Note: either of these may consist of only a single point
Single linkage defines inter-cluster difference \(d(C_1,C_2)\) as the minimum distance between any point in \(C_1\) and any point in \(C_2\).
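In symbols:

\[d_{single}(C_1, C_2) = \min_{i \in C_1,\; j \in C_2} d(i, j)\]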
Referring to the distance matrix above, with \(C_1 = (D,F)\):
\(d_{single}(C_1, E) = \min(d(D,E), d(F,E)) = \min(1.00, 1.12) = 1.00\)
\(d_{single}(C_1, C) = \min(d(D,C), d(F,C)) = \min(1.08, 0.64) = 0.64\)
\(d_{single}(A, B) = d(A, B) = 0.71\)
Second step result: single linkage
\(d_{single}(C_1, E) = 1.00\)
\(d_{single}(C_1, C) = 0.64\)
\(d_{single}(A,B) = 0.71\)
\(\implies\) second step: \(C\) is added to the \((D,F)\) cluster.
Can you identify the third step in this process?
Will \(E\) be added to the \((C,D,F)\) cluster, or
Will \((A,B)\) form a new cluster?
Complete linkage
Let \(C_1\) and \(C_2\) represent two clusters
Note: either of these may consist of only a single point
Complete linkage defines inter-cluster difference \(d(C_1,C_2)\) as the maximum distance between any point in \(C_1\) and any point in \(C_2\).
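In symbols:

\[d_{complete}(C_1, C_2) = \max_{i \in C_1,\; j \in C_2} d(i, j)\]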
Returning to the example, with \(C_1 = (D,F)\):
\(d_{complete}(C_1, E) = \max(d(D,E), d(F,E)) = \max(1.00, 1.12) = 1.12\)
\(d_{complete}(C_1, C) = \max(d(D,C), d(F,C)) = \max(1.08, 0.64) = 1.08\)
\(d_{complete}(A, B) = 0.71\)
Since 0.71 is now the smallest of these, under complete linkage \((A,B)\) forms a cluster before \(C\) joins \((D,F)\).
Note how, for single linkage, the \((C,D,F)\) cluster forms before the \((A,B)\) cluster.
For average linkage (and Ward and complete too), \((A,B)\) clusters before \(C\) joins \((D,F)\).
Unsurprisingly, \(G\) is the last point to join a cluster.
Agglomerative coefficient
The agglomerative coefficient (AC) measures the amount of clustering structure found for a given linkage.
For each object \(i\) in the data set, let \(m(i)\) denote its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm.
The AC is then defined as the average of all \(1 - m(i)\).
AC close to 1: strong, well-defined clustering structure
AC close to 0: clusters are less distinct, more heterogeneous data are merged together
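In symbols:

\[AC = \frac{1}{n}\sum_{i=1}^{n} \bigl(1 - m(i)\bigr)\]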
cluster_simple_single$ac
[1] 0.5761809
cluster_simple_complete$ac
[1] 0.6963881
cluster_simple_average$ac
[1] 0.6513288
cluster_simple_ward$ac
[1] 0.7426257
The Ward linkage appears to identify the best structure for this tiny data set.
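For reference, objects like the ones above could be produced with agnes() from the cluster package. This is a sketch only: the original call isn't shown, and the column selection and default options are my assumptions, so the exact AC values may differ.

library(cluster)

# Agglomerative clustering of the simple example under four linkages;
# the agglomerative coefficient is stored in the $ac component
cluster_simple_single   <- agnes(simple_data[, c("x", "y")], method = "single")
cluster_simple_complete <- agnes(simple_data[, c("x", "y")], method = "complete")
cluster_simple_average  <- agnes(simple_data[, c("x", "y")], method = "average")
cluster_simple_ward     <- agnes(simple_data[, c("x", "y")], method = "ward")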
Cutting the tree: visualization
From a hierarchical clustering result, final clusters can be determined by:
Specifying the number of clusters \(k\) (most common), or
Specifying the height \(h\) at which to cut the tree
A visual of the resulting clusters is easily produced by specifying k= or h= in the fviz_dend() function.
The code below visualizes the result of cutting the single-linkage tree by specifying k=, and the average-linkage tree by specifying h=. Both result in 5 clusters.
fviz_dend(cluster_simple_single, k = 5) + labs(title = 'Single linkage')
fviz_dend(cluster_simple_average, h = 2) + labs(title = 'Average linkage')
Cutting the tree: execution
fviz_dend() shows the results of cutting the tree at a certain place
cutree() (or factoextra's hcut()) actually performs the cutting, producing cluster labels you can use, e.g., in a biplot
You must specify the number of clusters k to produce in the cut.
Notes:
Cluster labels are arbitrary!! Points in “cluster 1” for one cut are not necessarily in “cluster 1” for a second cut.
Will often want this treated as a factor rather than numeric in plots; hence use of factor().
(single_clusters <- cutree(cluster_simple_single, k = 5))
[1] 1 2 3 3 4 3 5
(average_clusters <- cutree(cluster_simple_average, k = 5))
[1] 1 1 2 3 4 3 5
# Adding to the data frame
bind_cols(simple_data,
          sing_clust = factor(single_clusters),
          ave_clust = factor(average_clusters)) %>%
  as_tibble()
# A tibble: 7 × 5
id x y sing_clust ave_clust
<chr> <dbl> <dbl> <fct> <fct>
1 A 1 6 1 1
2 B 1.5 5.5 2 1
3 C 3.1 4 3 2
4 D 3.5 5 3 3
5 E 4.5 5 4 4
6 F 3.5 4.5 3 3
7 G 5 7 5 5
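As a quick visual check, the cluster labels can be used to color a scatterplot of the original points. The sketch below assumes the objects created above; the plotting choices are my own.

library(dplyr)
library(ggplot2)

# Label each point by its id, colored by the average-linkage cluster assignment
simple_clustered <- bind_cols(simple_data,
                              sing_clust = factor(single_clusters),
                              ave_clust  = factor(average_clusters))
ggplot(simple_clustered, aes(x = x, y = y, color = ave_clust, label = id)) +
  geom_text(size = 5) +
  labs(title = 'Average-linkage clusters (k = 5)', color = 'Cluster')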
Divisive clustering
Works in the opposite direction of agglomerative clustering:
All observations start in 1 cluster
In each subsequent step, the largest available cluster is split into two
The size of cluster \(C_k\) is determined by its "diameter": \(D_k = \max_{i,j \in C_k}d(i,j)\)
The split is done in a way that optimizes inter-cluster dissimilarity
Less common than agglomerative, and can be computationally intensive for large data sets.
Divisive clustering with diana()
The diana() (DIvisive ANAlysis) function in the cluster package performs divisive clustering.
Like agnes(), you can feed it a dissimilarity matrix instead of a data frame if you want to use something like Bray-Curtis to measure dissimilarity.
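A minimal sketch on the small example from earlier (the object name cluster_simple_diana is illustrative, and diana()'s defaults are assumed):

library(cluster)
library(factoextra)

# Divisive clustering of the simple example; the divisive coefficient
# (analogous to the AC) is stored in the $dc component
cluster_simple_diana <- diana(simple_data[, c("x", "y")], metric = "euclidean")
cluster_simple_diana$dc
fviz_dend(cluster_simple_diana, k = 3)   # visualize a 3-cluster cut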
The plot of WSS vs \(k\) shows a steep drop-off until \(k=3\), beyond which we don't gain much improvement.
\(k=3\) clusters seems like the best choice.
It's not always this clear-cut!
Silhouette
While WSS is only really appropriate when a Euclidean "distance from centroid" makes sense, the silhouette can be used when clustering is performed with any distance metric.
Let \(C_k\) represent cluster \(k\), with \(|C_k|\) points in the cluster.
\(a(i)\) measures the average dissimilarity of point \(i\) to all other points within its own cluster.
\(b(i)\) measures the average dissimilarity of point \(i\) to all points within the nearest other cluster \(C_l\).
Silhouette width
We define the silhouette width \(s(i)\) of point \(i\) as:
\[s(i) = \frac{b(i)-a(i)}{\max\{a(i),\, b(i)\}}\]
\(-1 \le s(i) \le 1\)
\(s(i)\) near 1: point \(i\) is much more similar to points in its own cluster than to points in the nearest cluster (i.e., it’s probably in the right cluster).
\(s(i)\) near -1: point \(i\) is much more similar to points in the nearest cluster than to points in its own cluster (i.e., it’s probably in the wrong cluster).
\(s(i)\) near 0: point \(i\) is about equally similar to the members of its own cluster and those of the nearest cluster.
If point \(i\) is the only point in its cluster, we define \(s(i) = 0\).
Average silhouette width
Averaging:
\[\bar s = \frac{1}{n} \sum_{i=1}^n s(i)\]
gives us a measure of the overall cluster quality, with larger = better.
Computing silhouette widths
To compute the silhouette given a cluster definition, we can use the silhouette() function in the cluster package.
Arguments:
x, a vector of cluster assignments as integers;
dist, the distance matrix used.
(k3_sils <- silhouette(x = cutree(df1_agnes, k = 3), dist = dist(df1))) %>% head()
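The average silhouette width \(\bar s\) for this cut can then be obtained by averaging the sil_width column of the returned object (a small sketch, assuming k3_sils as created above):

# silhouette() returns a matrix-like object with columns cluster, neighbor, sil_width;
# the mean of sil_width is the average silhouette width
mean(k3_sils[, "sil_width"])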
Plotting the WSS and mean silhouette width vs \(k\), using the Ward linkage clustering results:
fviz_nbclust(drugs, FUNcluster = hcut, method = 'wss', diss = drug_dist,
             hc_method = 'ward', hc_func = 'agnes') +
  labs(title = 'Plot of WSS vs k')
fviz_nbclust(drugs, FUNcluster = hcut, method = 'silhouette', diss = drug_dist,
             hc_method = 'ward', hc_func = 'agnes') +
  labs(title = 'Plot of mean silhouette width vs k')
The WSS plot suggests \(k=3\), while the silhouette method suggests \(k=2\), with \(k=3\) the second-best choice. I also re-created these using hc_method='complete'; the results looked nearly identical.
Will further explore both \(k=2\) and \(k=3\).
Visualizing resulting clusters
# The adding of plots is made possible by patchwork
fviz_dend(drug_ward, k = 2) + ggtitle('Ward linkage, 2 clusters') +
  fviz_dend(drug_ward, k = 3) + ggtitle('Ward linkage, 3 clusters')
Cutting the tree
Proceeding to cut the trees:
(ward_clusters <- cutree(drug_ward, k = 2:3) %>%
    data.frame() %>%
    setNames(c('k2', 'k3'))) %>% head()
k2 k3
1 1 1
2 2 2
3 1 1
4 1 1
5 1 1
6 2 2
Cluster characterizations
It remains to characterize the clusters, using one of the dimension-reduction techniques we’ve learned about.
Both solutions cluster the high-poverty, high-HS-dropout states
High-poverty: southern states, e.g. Mississippi, Alabama, Tennessee, etc.
High HS dropout rates: Western states, e.g. California, Oregon, Arizona; plus DC and Florida
The 3-cluster solution makes a pretty informative separation between the “binge-drinking” states (hello, Wisconsin and Minnesota), and the “high-income/high drug use” states, many in the north Atlantic region of the country
Would probably proceed with 3 clusters, in spite of its slightly lower average silhouette width
fviz_pca(drug_pca, habillage = factor(average_clusters), repel = TRUE) +
  ggtitle('3 clusters with average linkage') +
  guides(color = 'none', shape = 'none') +
  theme(plot.title = element_text(hjust = 0.5, size = 20))
Single and average linkage methods produce very lopsided clusters, with 49 states in one cluster, and solo clusters of 1 state each.
The 3-cluster complete-linkage solution looks similar to the Ward solution, but groups Maine and Vermont in the "green" cluster with other, more geographically similar states (they were in the "blue", Midwestern-heavy cluster with Ward linkage).
Although Maine and Vermont are geographically more similar to states in the "green" cluster, they do appear to be more similar to the Midwestern binge-drinking states with respect to the variables used.
Boxplots for inter-cluster differences
Boxplots can be a more intuitive way to visualize inter-cluster differences.
# Add Ward k3 clusters to data set
# pivot to prepare for faceting
(drug_update <- drugs %>%
   rownames_to_column('State') %>%
   mutate(ward_k3 = factor(ward_clusters$k3)) %>%
   pivot_longer(cols = DrugUse:Income, names_to = 'variable', values_to = 'value')) %>%
  head()
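A faceted set of boxplots based on the pivoted data might then look like the sketch below (column names as created above; the plotting choices are my own):

library(ggplot2)

# One boxplot per variable, comparing the three Ward clusters
ggplot(drug_update, aes(x = ward_k3, y = value, fill = ward_k3)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Ward k = 3 cluster", y = NULL, fill = "Cluster")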