Week 5 Coding Practice: Cluster Analysis in R

K-Means Analysis in R

Data Preparation

To perform cluster analysis in R, prepare data as follows:

  1. Rows are observations (individuals) and columns are variables.
  2. Any missing value in the data must be removed or estimated.
  3. The data must be standardized (i.e., scaled) to make variables comparable.
    • Standardization consists of transforming the variables such that they have mean zero and standard deviation one.

The data set: US Arrests

Overview:
  • Contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973.
  • Includes the percent of the population living in urban areas.

Import the built-in R dataset USArrests into a data frame

df <- USArrests

Remove any missing values present in the data

df <- na.omit(df)

Scale/standardize the data using the R function scale()

df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Clustering Distance Measures

Classifying observations into groups requires computing the distance or dissimilarity between each pair of observations. The result is known as the dissimilarity or distance matrix.

The choice of distance measures is critical in clustering; it defines how the similarity of 2 elements (x,y) is calculated, influencing the shape of the clusters. For most common clustering software, the default distance measure is the Euclidean distance. However, other measures may be preferred depending on the research questions or type of data.

The classical methods for distance measures (Where x and y are 2 vectors of length n):

  1. Euclidean Distance

\[d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2} \]

  2. Manhattan Distance

\[d_{man}(x,y) = \sum^n_{i=1}|(x_i - y_i)| \]

Other dissimilarity measures such as correlation-based distances:

  1. Pearson Correlation Distance

\[ d_{cor}(x, y) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(y_i - \bar y)^2}}\]

  2. Spearman Correlation Distance:
  • Computes the correlation between the rank of x and the rank of y variables.

\[ d_{spear}(x, y) = 1 - \frac{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)(y^\prime_i - \bar y^\prime)}{\sqrt{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)^2\sum^n_{i=1}(y^\prime_i - \bar y^\prime)^2}} \]

  • Where \(x^\prime_i = rank(x_i)\) and \(y^\prime_i = rank(y_i)\)

  3. Kendall Correlation Distance
  • Measures the correspondence between the ranking of x and y variables.
  • The total number of possible pairings of x and y observations is n(n − 1)/2, where n is the size of x and y.
  • Begin by ordering the pairs by the x values.
    • If x and y are correlated, then they would have the same relative rank orders.
  • For each yi, count the number of yj > yi (concordant pairs (c)) and the number of yj < yi (discordant pairs (d)).

\[d_{kend}(x,y) = 1 - \frac{n_c - n_d}{\frac{1}{2}n(n - 1)} \]
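
To make these formulas concrete, here is a minimal sketch (assuming df is the scaled data prepared above) that evaluates each measure for a single pair of observations, and uses base R's dist() for all pairs. Note that cor(..., method = "kendall") returns the tie-corrected tau, which matches the formula above when there are no ties.

# Sketch: evaluate the distance measures for one pair of observations (rows of df)
x <- df["Alabama", ]
y <- df["Alaska", ]

sqrt(sum((x - y)^2))                 # Euclidean distance
sum(abs(x - y))                      # Manhattan distance
1 - cor(x, y, method = "pearson")    # Pearson correlation distance
1 - cor(x, y, method = "spearman")   # Spearman correlation distance
1 - cor(x, y, method = "kendall")    # Kendall correlation distance

# The classical measures for all pairs of observations, using base R
d_euc <- dist(df, method = "euclidean")
d_man <- dist(df, method = "manhattan")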

Compute and Visualize the distance matrix in R

Using the factoextra R package:

  • get_dist(): for computing a distance matrix between the rows of a data matrix.
    • The default distance computed is Euclidean, but other measures are also supported.
  • fviz_dist(): For visualizing a distance matrix
library(factoextra)   # provides get_dist() and fviz_dist()
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

K-Means Clustering

  • The most commonly used unsupervised ML algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst.
  • Classifies objects into multiple groups such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity) and objects in different clusters are as dissimilar as possible.
  • Each cluster is represented by its center (i.e., centroid), which corresponds to the mean of points assigned to the cluster.
  • Basic idea: Define clusters so that the total intra-cluster variation (aka the total within-cluster variation) is minimized.

Hartigan-Wong algorithm

  • The standard K-means algorithm.
  • Defines the total within-cluster variation as the sum of the squared Euclidean distances between items and the corresponding centroid. \[ W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2\]
  • Where:
    • \(x_i\) is a data point belonging to the cluster \(C_k\)
    • \(\mu_k\) is the mean value of the points assigned to the cluster \(C_k\)
  • Each observation (\(x_i\)) is assigned to a given cluster such that the sum of squared (SS) distances of the observations to their assigned cluster centers (\(\mu_k\)) is minimized.
  • The total within-cluster variation:
    • Measures the compactness (i.e., goodness) of the clustering.
    • Smaller = better (a quick numerical check follows this list): \[tot.withinss = \sum^k_{j=1}W(C_j) = \sum^k_{j=1}\sum_{x_i \in C_j}(x_i - \mu_j)^2\]
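
As a quick numerical check of this definition (a minimal sketch, assuming df is the scaled data prepared above), we can recompute the total within-cluster sum of squares by hand and compare it with what kmeans() reports:

# Recompute tot.withinss from the cluster assignments and centroids
set.seed(123)
km <- kmeans(df, centers = 2, nstart = 25)
manual_wss <- sum(sapply(seq_len(nrow(km$centers)), function(k) {
  pts <- df[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, km$centers[k, ])^2)   # squared deviations from the cluster centroid
}))
c(manual = manual_wss, reported = km$tot.withinss)   # the two values agree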

K-Means Algorithm

Steps:

  1. Indicate the number of clusters (k).
    • Algorithm will randomly select k objects from the data set to serve as initial centers for the clusters, aka centroids or cluster means.
  2. Cluster Assignment
    • Each of the remaining objects is assigned to its closest centroid where closest is defined using Euclidean distances between the object and the cluster mean.
  3. Centroid Update
    • Algorithm computes the new mean value of each cluster, recalculating the centers.
  4. Recheck each observation and reassign it to the updated cluster means.
  5. Iteratively minimize the total within-cluster sum of squares.
    • Repeat until cluster assignments stop changing or the maximum number of iterations is reached (the default in R is 10). A bare-bones illustration of these steps follows this list.
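
For intuition only, here is a bare-bones sketch of these steps in base R. It follows the basic Lloyd-style loop described above rather than the Hartigan-Wong refinement that R's kmeans() uses by default, and the helper name simple_kmeans is made up for this illustration.

# Illustrative sketch of the k-means steps above (assumes no cluster becomes empty)
simple_kmeans <- function(x, k, max_iter = 10) {
  # Step 1: randomly select k observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(1L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Step 2: assign each observation to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    assignment <- apply(d, 1, which.min)
    # Step 3: recompute each centroid as the mean of the points assigned to it
    new_centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
    # Steps 4-5: stop once the centroids (and hence the assignments) stop changing
    converged <- all(abs(new_centers - centers) < 1e-12)
    centers <- new_centers
    if (converged) break
  }
  list(cluster = assignment, centers = centers)
}

set.seed(123)
simple_kmeans(df, k = 2)$centers   # compare with kmeans(df, centers = 2)$centers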

Computing K-means Clustering in R

  • Use the kmeans() function.
  • nstart option attempts multiple initial configurations and reports on the best one.
    • Ex: nstart=25 will generate 25 initial configurations.
    • This approach is often recommended.
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
## List of 9
##  $ cluster     : Named int [1:50] 2 2 2 1 2 2 1 1 2 2 ...
##   ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ centers     : num [1:2, 1:4] -0.67 1.005 -0.676 1.014 -0.132 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ totss       : num 196
##  $ withinss    : num [1:2] 56.1 46.7
##  $ tot.withinss: num 103
##  $ betweenss   : num 93.1
##  $ size        : int [1:2] 30 20
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Output of kmeans()

  • cluster: A vector of integers from 1:k indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centers
  • totss: The total sum of squares
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss (these identities are checked in the snippet below).
  • size: The number of points in each cluster
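
As a quick sanity check on these relationships (using the k2 object computed above):

# Verify the relationships among the kmeans() output components
all.equal(k2$tot.withinss, sum(k2$withinss))          # tot.withinss = sum(withinss)
all.equal(k2$totss, k2$tot.withinss + k2$betweenss)   # totss = tot.withinss + betweenss
k2$betweenss / k2$totss                               # proportion of variance explained (about 47.5%)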

If we print the R example above, we can see that the grouping resulted in 2 clusters of sizes 30 and 20. We also get the cluster centers (means) for the 2 groups across the 4 variables (Murder, Assault, UrbanPop, Rape), as well as the cluster assignment for each observation (e.g., Alabama was assigned to cluster 2, Arkansas to cluster 1, etc.).

Print kmeans

k2
## K-means clustering with 2 clusters of sizes 30, 20
## 
## Cluster means:
##      Murder    Assault   UrbanPop       Rape
## 1 -0.669956 -0.6758849 -0.1317235 -0.5646433
## 2  1.004934  1.0138274  0.1975853  0.8469650
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              1              1              2              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              1              1              2              1              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              1              1              2              1              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              1              2              1              2              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              2              1              1 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              2              1              1 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              1              1              1              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              2              2              1              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              1              1              1              1              1 
## 
## Within cluster sum of squares by cluster:
## [1] 56.11445 46.74796
##  (between_SS / total_SS =  47.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualize Results

fviz_cluster(k2, data = df)

Note: If there are more than 2 dimensions (variables), fviz_cluster will perform PCA and plot the data points according to the first two principal components that explain the majority of the variance.
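
As a rough manual equivalent (a sketch only, not exactly what fviz_cluster() does internally), you can project the data onto the first two principal components yourself and color the points by their cluster assignment:

# Plot the k = 2 assignment in the space of the first two principal components
pca <- prcomp(df)   # df is already centered and scaled
plot(pca$x[, 1:2], col = k2$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means (k = 2) in PCA space")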

Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.

library(tidyverse)   # provides as_tibble(), mutate(), and ggplot()
df %>%
  as_tibble() %>%
  mutate(cluster = k2$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

Because k must be set before we start the algorithm, it’s often advantageous to use several different values of k and examine the differences in results. We can execute the same process for 3, 4, and 5 clusters.

k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = df) + ggtitle("k = 5")

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, p3, p4, nrow = 2)

Although the above visuals tell us where delineations occur (or do not occur), they do not tell us the optimal number of clusters.

Determining Optimal Clusters
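
The notes do not include the code for these approaches in the k-means case. A minimal sketch, mirroring the hierarchical-clustering section later in this document, would use fviz_nbclust() from factoextra and clusGap() from the cluster package:

# Sketch: three common approaches for choosing k with k-means
library(cluster)   # provides clusGap()
set.seed(123)

fviz_nbclust(df, kmeans, method = "wss")          # Elbow method
fviz_nbclust(df, kmeans, method = "silhouette")   # Average silhouette method

gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)                           # Gap statistic method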

Extracting Results

Most of the approaches suggested 4 was the optimal number of clusters.

K-means clustering with k=4

# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 11.952463 16.212213 19.922437
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Visualize the results with fviz_cluster()

fviz_cluster(final, data=df)

Extract the clusters and add to our initial data to do descriptive statistics at the cluster level

USArrests %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 4 × 5
##   Cluster Murder Assault UrbanPop  Rape
##     <int>  <dbl>   <dbl>    <dbl> <dbl>
## 1       1  13.9    244.      53.8  21.4
## 2       2   3.6     78.5     52.1  12.2
## 3       3   5.66   139.      73.9  18.8
## 4       4  10.8    257.      76    33.2

Additional comments

  • K-means clustering is a very fast and simple algorithm.
  • It can efficiently deal with very large data sets.
  • Some weaknesses:
    • Requires pre-specified number of clusters
      • Hierarchical clustering is an alternative approach which does not require this commitment.
    • K-means is sensitive to outliers and different results can occur if you change the ordering of your data.
      • The Partitioning Around Medoids (PAM) clustering approach is less sensitive to outliers and provides a robust alternative to K-means (see the brief sketch below).
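
A minimal sketch of the PAM alternative mentioned above (pam() is provided by the cluster package, and fviz_cluster() can visualize the result):

# Sketch: Partitioning Around Medoids as a more robust alternative to k-means
library(cluster)              # provides pam()
pam_res <- pam(df, k = 2)     # uses k medoids (actual observations) instead of k means
pam_res$medoids               # the observations chosen as cluster centers
fviz_cluster(pam_res)         # analogous to the k-means plot above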

Hierarchical Analysis in R

  • An alternative to K-means clustering for identifying groups in a dataset.
  • Does not require us to pre-specify the number of clusters to be generated.
  • It has an added advantage over K-means in that it results in an attractive tree-based representation of the observations, called a dendrogram.
# Required packages
library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
## 
## ---------------------
## Welcome to dendextend version 1.17.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree

Hierarchical Clustering Algorithms

  • 2 main types:
    1. Agglomerative clustering (aka AGNES, Agglomerative Nesting):
      • Works in a bottom-up manner.
      • Each object is initially considered a single-element cluster (leaf), and at each step of the algorithm the two clusters that are the most similar are combined into a bigger cluster (node).
      • This is iterated until all points are members of just one single big cluster (the root).
      • Result is a tree that can be plotted as a dendrogram.
      • Good at identifying small clusters.
    2. Divisive Hierarchical Clustering (aka DIANA, Divisive Analysis):
      • Works in a top-down manner.
      • Algorithm works in the inverse order of AGNES.
      • Begins with the root, in which all objects are grouped in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into 2.
      • Process is iterated until all objects are in their own cluster.
      • Good at identifying large clusters.
  • How do we measure the dissimilarity between 2 clusters of observations?
    • Linkage Methods: a variety of different cluster agglomeration methods.
    • Most common linkage methods:
      1. Maximum or Complete Linkage Clustering:
        • Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters.
        • Tends to produce more compact clusters.
      2. Minimum or Single Linkage Clustering:
        • Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion.
        • Tends to produce long, “loose” clusters.
      3. Mean or Average Linkage Clustering:
        • Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
      4. Ward's Minimum Variance Method:
        • Minimizes the total within-cluster variance.
        • At each step, the pair of clusters with the minimum between-cluster distance is merged (a quick side-by-side comparison of these linkage choices follows this list).
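
A quick side-by-side comparison of these linkage choices on the scaled USArrests data (a sketch, assuming df from Part 1 is still in the workspace):

# Sketch: the same distance matrix under different linkage methods
d <- dist(df, method = "euclidean")
par(mfrow = c(2, 2))
plot(hclust(d, method = "complete"), cex = 0.5, main = "Complete linkage")
plot(hclust(d, method = "single"),   cex = 0.5, main = "Single linkage")
plot(hclust(d, method = "average"),  cex = 0.5, main = "Average linkage")
plot(hclust(d, method = "ward.D2"),  cex = 0.5, main = "Ward's method")
par(mfrow = c(1, 1))   # reset the plotting layout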

Data Preparation

To perform cluster analysis in R, prepare data as follows:

  1. Rows are observations (individuals) and columns are variables.
  2. Any missing value in the data must be removed or estimated.
  3. The data must be standardized (i.e., scaled) to make variables comparable.
    • Standardization consists of transforming the variables such that they have mean zero and standard deviation one.

Dataset

Same as in Part 1: K Means Analysis Coding practice above.

df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Hierarchical Clustering with R

Commonly used functions in R for computing hierarchical clustering:

  • hclust() (in the stats package) and agnes() (in the cluster package) for agglomerative hierarchical clustering (HC)
  • diana() (in the cluster package) for divisive HC

Agglomerative Hierarchical Clustering

  1. Compute the dissimilarity values with dist().
  2. Feed these values into hclust() and specify the agglomeration method to be used (“complete”, “average”, “single”, “ward.D”).
  3. Plot the dendrogram
# Dissimilarity matrix
d <- dist(df, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)

Alternatively, use the agnes() function. It behaves similarly, but with agnes() you can also get the agglomerative coefficient, which measures the amount of clustering structure found. Values closer to 1 suggest a strong clustering structure.

# Compute with agnes
hc2 <- agnes(df, method = "complete")

# Agglomerative coefficient
hc2$ac
## [1] 0.8531583

This allows us to compare hierarchical clustering methods and find those that identify stronger clustering structures. Here we see that Ward's method identifies the strongest clustering structure of the four methods assessed.

# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(df, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.7379371 0.6276128 0.8531583 0.9346210

Similar to before, we can visualize the dendrogram:

hc3 <- agnes(df, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes") 

Divisive Hierarchical Clustering

Use the R function diana() in the cluster package. diana() works similarly to agnes(); however, there is no method argument to provide.

# compute divisive hierarchical clustering
hc4 <- diana(df)

# Divise coefficient; amount of clustering structure found
hc4$dc
## [1] 0.8514345

# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")

Working with Dendrograms

In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.

The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are.

Note that conclusions about the proximity of two observations can be drawn only based on the height at which the branches containing those two observations are first fused. We cannot use the proximity of two observations along the horizontal axis as a criterion of their similarity.

The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k in k-means clustering. In order to identify sub-groups (i.e. clusters), we can cut the dendrogram with cutree():

# Ward's method
hc5 <- hclust(d, method = "ward.D2" )

# Cut tree into 4 groups
sub_grp <- cutree(hc5, k = 4)

# Number of members in each cluster
table(sub_grp)
## sub_grp
##  1  2  3  4 
##  7 12 19 12

We can also use the cutree() output to add the cluster each observation belongs to back to our original data.

USArrests %>%
  mutate(cluster = sub_grp) %>%
  head
##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       1
## Alaska       10.0     263       48 44.5       2
## Arizona       8.1     294       80 31.0       2
## Arkansas      8.8     190       50 19.5       3
## California    9.0     276       91 40.6       2
## Colorado      7.9     204       78 38.7       2

It’s also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:

plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 4, border = 2:5)

We can also use the fviz_cluster() function from the factoextra package to visualize the result in a scatter plot.

fviz_cluster(list(data = df, cluster = sub_grp))

To use cutree() with agnes() and diana() you can perform the following:

# Cut agnes() tree into 4 groups
hc_a <- agnes(df, method = "ward")
cutree(as.hclust(hc_a), k = 4)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3
# Cut diana() tree into 4 groups
hc_d <- diana(df)
cutree(as.hclust(hc_d), k = 4)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3

We can also compare two dendrograms. Here we compare hierarchical clustering with complete linkage versus Ward’s method. The function tanglegram() plots two dendrograms, side by side, with their labels connected by lines.

# Compute distance matrix
res.dist <- dist(df, method = "euclidean")

# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "complete")
hc2 <- hclust(res.dist, method = "ward.D2")

# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)

tanglegram(dend1, dend2)

The output displays “unique” nodes, with a combination of labels/items not present in the other tree, highlighted with dashed lines. The quality of the alignment of the two trees can be measured using the function entanglement().

Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a better alignment. The output of tanglegram() can be customized using many other options, as follows:

dend_list <- dendlist(dend1, dend2)

tanglegram(dend1, dend2,
  highlight_distinct_edges = FALSE, # Turn-off dashed lines
  common_subtrees_color_lines = FALSE, # Turn-off line colors
  common_subtrees_color_branches = TRUE, # Color common branches 
  main = paste("entanglement =", round(entanglement(dend_list), 2))
  )

Determining Optimal Clusters

Elbow Method

To perform the elbow method in hierarchical clustering, we just need to change the second argument of fviz_nbclust() to FUN = hcut.

fviz_nbclust(df, FUN = hcut, method = "wss")

Average Silhouette Method

fviz_nbclust(df, FUN = hcut, method = "silhouette")

Gap Statistic Method

gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)

Additional Comments

  • Clustering can be a very useful tool for data analysis in the unsupervised setting. However, there are a number of issues that arise in performing clustering.
  • In the case of hierarchical clustering, we need to be concerned about:
    • What dissimilarity measure should be used?
    • What type of linkage should be used?
    • Where should we cut the dendrogram in order to obtain clusters?
  • Each of these decisions can have a strong impact on the results obtained.
  • In practice, try several different choices and look for the one with the most useful or interpretable solution.
  • With these methods, there is no single right answer - any solution that exposes some interesting aspects of the data should be considered.