To perform cluster analysis in R, prepare data as follows:
Import the built-in R dataset USArrests into a data frame:
df <- USArrests
Remove any missing values present in the data
df <- na.omit(df)
Scale/standardize the data using the R function scale():
df <- scale(df)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
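As a quick sanity check, scale() leaves each column with mean 0 and standard deviation 1, which we can verify:
# each scaled column should have mean ~0 and sd 1
round(colMeans(df), 10)
apply(df, 2, sd)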
Classifying observations into groups requires computing the distance or dissimilarity between each pair of observations. The result is known as the dissimilarity or distance matrix.
The choice of distance measure is critical in clustering; it defines how the similarity of two elements (x, y) is calculated, influencing the shape of the clusters. For most common clustering software, the default distance measure is the Euclidean distance. However, other measures may be preferred depending on the research questions or the type of data. Commonly used measures include:
Euclidean distance:
\[d_{euc}(x,y) = \sqrt{\sum^n_{i=1}(x_i - y_i)^2} \]
Manhattan distance:
\[d_{man}(x,y) = \sum^n_{i=1}|x_i - y_i| \]
Pearson correlation distance:
\[ d_{cor}(x, y) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(y_i - \bar y)^2}}\]
Spearman correlation distance, where \(x^\prime_i = rank(x_i)\) and \(y^\prime_i = rank(y_i)\):
\[ d_{spear}(x, y) = 1 - \frac{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)(y^\prime_i - \bar y^\prime)}{\sqrt{\sum^n_{i=1}(x^\prime_i-\bar x^\prime)^2\sum^n_{i=1}(y^\prime_i - \bar y^\prime)^2}} \]
Kendall correlation distance, where \(n_c\) and \(n_d\) are the numbers of concordant and discordant pairs:
\[d_{kend}(x,y) = 1 - \frac{n_c - n_d}{\frac{1}{2}n(n - 1)} \]
Using the factoextra R package:
library(factoextra)   # provides get_dist() and fviz_dist()
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
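get_dist() also accepts a method argument for selecting among the measures above; per the factoextra documentation, supported values include "euclidean", "manhattan", "pearson", "spearman", and "kendall". A sketch using a correlation-based distance:
# Pearson correlation-based distance instead of the default Euclidean
distance_cor <- get_dist(df, method = "pearson")
fviz_dist(distance_cor, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))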
We can compute k-means in R with the kmeans() function, here grouping the data into two clusters (centers = 2); the nstart = 25 option tries 25 random initial configurations and reports only the best one.
k2 <- kmeans(df, centers = 2, nstart = 25)
str(k2)
## List of 9
## $ cluster : Named int [1:50] 2 2 2 1 2 2 1 1 2 2 ...
## ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ centers : num [1:2, 1:4] -0.67 1.005 -0.676 1.014 -0.132 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## $ totss : num 196
## $ withinss : num [1:2] 56.1 46.7
## $ tot.withinss: num 103
## $ betweenss : num 93.1
## $ size : int [1:2] 30 20
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
If we print the R example above, we can see that the grouping resulted in 2 clusters of sizes 30 and 20, the cluster centers (means) for the 2 groups across the 4 variables (Murder, Assault, UrbanPop, Rape), and the cluster assignment for each observation (e.g., Alabama was assigned to cluster 2, Arkansas to cluster 1, etc.).
Print the kmeans object:
k2
## K-means clustering with 2 clusters of sizes 30, 20
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.669956 -0.6758849 -0.1317235 -0.5646433
## 2 1.004934 1.0138274 0.1975853 0.8469650
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 1 1 2 2
## Hawaii Idaho Illinois Indiana Iowa
## 1 1 2 1 1
## Kansas Kentucky Louisiana Maine Maryland
## 1 1 2 1 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 2 1 2 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 2 1 1
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 2 1 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 1 1 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 2 1 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 56.11445 46.74796
## (between_SS / total_SS = 47.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Visualize Results
fviz_cluster(k2, data = df)
Note: if there are more than 2 dimensions (variables), fviz_cluster will perform PCA and plot the data points according to the first two principal components that explain the majority of the variance.
Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.
library(tidyverse)   # for %>%, mutate(), and ggplot()
df %>%
as_tibble() %>%
mutate(cluster = k2$cluster,
state = row.names(USArrests)) %>%
ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
geom_text()
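To view all pairwise combinations of variables at once, base R's pairs() can color the points by cluster (a minimal sketch):
# pairwise scatter plots of the scaled variables, colored by cluster
pairs(df, col = k2$cluster, pch = 19)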
Because k must be set before we start the algorithm, it’s often advantageous to use several different values of k and examine the differences in the results. We can execute the same process for 3, 4, and 5 clusters.
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3, p4, nrow = 2)
Although the above visuals tell us where delineations occur (or do not occur) between clusters, they do not tell us the optimal number of clusters.
1. Elbow Method
The elbow method plots the total within-cluster sum of squares (WSS) against the number of clusters and looks for a bend ("elbow") in the curve, beyond which adding more clusters yields little improvement.
In R (the long way):
set.seed(123)
# function to compute total within-cluster sum of square
wss <- function(k) {
kmeans(df, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for 1-15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
In R (using fviz_nbclust())
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss")
2. Average Silhouette Method
The average silhouette method measures how well each observation lies within its cluster; the optimal number of clusters is the one that maximizes the average silhouette width across all observations.
In R (the long way):
library(cluster)   # provides silhouette()
# function to compute average silhouette for k clusters
avg_sil <- function(k) {
km.res <- kmeans(df, centers = k, nstart = 25)
ss <- silhouette(km.res$cluster, dist(df))
mean(ss[, 3])
}
# Compute and plot average silhouette for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")
In R (using fviz_nbclust())
fviz_nbclust(df, kmeans, method = "silhouette")
3. Gap Statistic
Compares the total intra-cluster variation for different values of k with their expected values under a null reference distribution of the data (i.e., a distribution with no obvious clustering).
It can be applied to any clustering method (e.g., k-means clustering, hierarchical clustering). The gap statistic for a given k is defined as:
\[ Gap_n(k) = E^*_n\{\log(W_k)\} - \log(W_k) \]
The gap statistic measures the deviation of the observed \(W_k\) value from its expected value under the null hypothesis.
The estimate of the optimal clusters (\(\hat k\)) will be the value that maximizes \(Gap_n(k)\). This means that the clustering structure is far away from the uniform distribution of points.
Algorithm Steps:
1. Cluster the observed data, varying the number of clusters from \(k = 1\) to \(k_{max}\), and compute the total within-cluster variation \(W_k\).
2. Generate \(B\) reference data sets from a random uniform distribution over the range of the observed data, cluster each for the same values of \(k\), and compute the corresponding \(W^*_{kb}\).
3. Compute the gap statistic \(Gap_n(k) = \frac{1}{B}\sum^B_{b=1}\log(W^*_{kb}) - \log(W_k)\), along with its simulation standard error \(s_k\).
4. Choose the smallest \(k\) such that \(Gap(k) \geq Gap(k+1) - s_{k+1}\).
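For intuition, here is a minimal manual sketch of steps 1-3 for a single k (illustrative only; the function name and B are our own choices, and clusGap() below is the robust implementation):
# manual gap statistic for one value of k (assumes the scaled df from above)
gap_manual <- function(data, k, B = 20) {
  # step 1: log within-cluster variation for the observed data
  logW <- log(kmeans(data, k, nstart = 10)$tot.withinss)
  # step 2: B uniform reference sets over the range of each column
  logW_ref <- replicate(B, {
    ref <- apply(data, 2, function(col) runif(nrow(data), min(col), max(col)))
    log(kmeans(ref, k, nstart = 10)$tot.withinss)
  })
  # step 3: gap = mean reference log(W) minus observed log(W)
  mean(logW_ref) - logW
}
set.seed(123)
sapply(2:6, function(k) gap_manual(df, k))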
In R (using clusGap(), which provides the gap statistic and the standard error of the output):
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25,
K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 4
## logW E.logW gap SE.sim
## [1,] 3.458369 3.640154 0.1817845 0.04422857
## [2,] 3.135112 3.372283 0.2371717 0.03559601
## [3,] 2.977727 3.233771 0.2560446 0.03749193
## [4,] 2.826221 3.119172 0.2929511 0.04067348
## [5,] 2.738868 3.019965 0.2810969 0.04185469
## [6,] 2.666967 2.930002 0.2630347 0.04105040
## [7,] 2.609895 2.852152 0.2422572 0.04184725
## [8,] 2.539156 2.778562 0.2394054 0.04292750
## [9,] 2.468162 2.711752 0.2435901 0.04344197
## [10,] 2.407265 2.647595 0.2403307 0.04548446
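The step-4 selection rule can also be applied directly to the gap table via cluster::maxSE(); "Tibs2001SEmax" is the standard-error rule, while "firstmax" matches the print() call above (a sketch):
# pick k from the gap values and their simulation standard errors
maxSE(gap_stat$Tab[, "gap"], gap_stat$Tab[, "SE.sim"], method = "Tibs2001SEmax")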
Visualize the results with fviz_gap_stat(), which suggests that 4 clusters is the optimal number:
fviz_gap_stat(gap_stat)
Most of the approaches suggested that 4 was the optimal number of clusters.
K-means clustering with k = 4:
# Compute k-means clustering with k = 4
set.seed(123)
final <- kmeans(df, 4, nstart = 25)
print(final)
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Visualize the results with fviz_cluster()
fviz_cluster(final, data=df)
Extract the clusters and add them to our initial data to compute descriptive statistics at the cluster level:
USArrests %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 4 × 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 13.9 244. 53.8 21.4
## 2 2 3.6 78.5 52.1 12.2
## 3 3 5.66 139. 73.9 18.8
## 4 4 10.8 257. 76 33.2
Part 2: Hierarchical Clustering
# Required packages
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
To perform hierarchical cluster analysis in R, prepare the data the same way as in Part 1: K-Means Analysis above:
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Commonly used functions in R for computing hierarchical clustering:
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(d, method = "complete" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
Alternatively, use the agnes() function. It behaves similarly to hclust(), but agnes() also returns the agglomerative coefficient, which measures the amount of clustering structure found. Values closer to 1 suggest a strong clustering structure.
# Compute with agnes
hc2 <- agnes(df, method = "complete")
# Agglomerative coefficient
hc2$ac
## [1] 0.8531583
This allows us to compare hierarchical clustering methods and find those that identify stronger clustering structures. Here we see that Ward's method identifies the strongest clustering structure of the four methods assessed.
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(df, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.7379371 0.6276128 0.8531583 0.9346210
Similar to before we can visualize the dendrogram:
hc3 <- agnes(df, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
For divisive hierarchical clustering, use the R function diana() from the cluster package. diana() works similarly to agnes(); however, there is no method argument to provide.
# compute divisive hierarchical clustering
hc4 <- diana(df)
# Divise coefficient; amount of clustering structure found
hc4$dc
## [1] 0.8514345
# plot dendrogram
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")
In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are.
Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing those two observations are first fused. We cannot use the proximity of two observations along the horizontal axis as a criterion of their similarity.
The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k in k-means clustering. In order to identify sub-groups (i.e. clusters), we can cut the dendrogram with cutree():
# Ward's method
hc5 <- hclust(d, method = "ward.D2" )
# Cut tree into 4 groups
sub_grp <- cutree(hc5, k = 4)
# Number of members in each cluster
table(sub_grp)
## sub_grp
## 1 2 3 4
## 7 12 19 12
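cutree() can also cut at an explicit height on the vertical axis rather than at a target number of groups (the height of 6 here is purely illustrative):
# cut the tree at height h instead of requesting k groups
sub_grp_h <- cutree(hc5, h = 6)
table(sub_grp_h)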
We can also use the cutree() output to add the cluster each observation belongs to back to our original data.
USArrests %>%
mutate(cluster = sub_grp) %>%
head
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 2
## Arizona 8.1 294 80 31.0 2
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 2
## Colorado 7.9 204 78 38.7 2
It’s also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 4, border = 2:5)
We can also use the fviz_cluster() function from the factoextra
package to visualize the result in a scatter plot.
fviz_cluster(list(data = df, cluster = sub_grp))
To use cutree() with agnes() and diana() you can perform the following:
# Cut agnes() tree into 4 groups
hc_a <- agnes(df, method = "ward")
cutree(as.hclust(hc_a), k = 4)
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
# Cut diana() tree into 4 groups
hc_d <- diana(df)
cutree(as.hclust(hc_d), k = 4)
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
We can also compare two dendrograms. Here we compare hierarchical clustering with complete linkage versus Ward’s method. The function tanglegram() plots two dendrograms, side by side, with their labels connected by lines.
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "complete")
hc2 <- hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
tanglegram(dend1, dend2)
The output displays “unique” nodes, with a combination of labels/items not present in the other tree, highlighted with dashed lines. The quality of the alignment of the two trees can be measured using the function entanglement().
Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement); a lower entanglement coefficient corresponds to a better alignment. The output of tanglegram() can be customized using many other options as follows:
dend_list <- dendlist(dend1, dend2)
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
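Entanglement can often be reduced by reordering one tree against the other before plotting, e.g. with dendextend's untangle() (the method name follows its documentation; a sketch):
# search branch rotations that reduce entanglement, then re-plot
dend_list %>%
  untangle(method = "step1side") %>%
  tanglegram(common_subtrees_color_branches = TRUE)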
To perform the elbow method in hierarchical clustering, we just need to change the second argument of fviz_nbclust() to FUN = hcut.
fviz_nbclust(df, FUN = hcut, method = "wss")
fviz_nbclust(df, FUN = hcut, method = "silhouette")
#### Gap Statistic Method
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
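As a closing note, the hcut() helper passed to the functions above can also be called directly; it wraps hclust()/agnes()/diana() and cuts the tree in one step (a sketch; arguments per factoextra's documentation):
# hierarchical clustering cut into 4 groups with Ward linkage
hc_final <- hcut(df, k = 4, hc_method = "ward.D2")
fviz_dend(hc_final, rect = TRUE, cex = 0.6)   # dendrogram with cluster boxes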