Cluster Analysis for Beginners

Anna Shirokanova

19 02 2019

Why use cluster analysis?

Clustering algorithms (CA) are used to split a dataset into several groups, called ‘clusters’, so that the objects in the same group are as similar as possible and the objects in different groups are as dissimilar as possible.

K Means Clustering

1 Choose k initial centroids (e.g., k randomly picked data points).

2 Assign each data point to the cluster whose centroid is closest.

3 Calculate the new centroid of each cluster.

Steps 2 and 3 are repeated until the assignments stop changing, as in the sketch below.
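A minimal hand-rolled sketch of these steps on synthetic data (an illustration only; in practice use kmeans(), and note this naive version assumes no cluster ever becomes empty):

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2), # group around (0, 0)
           matrix(rnorm(40, mean = 3), ncol = 2)) # group around (3, 3)
k <- 2
centroids <- x[sample(nrow(x), k), ] # step 1: k random points as centroids
for (iter in 1:10) {
  # step 2: assign each point to the nearest centroid (squared Euclidean)
  d2 <- sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))
  cl <- max.col(-d2)
  # step 3: recompute each centroid as the mean of its cluster
  centroids <- apply(x, 2, function(col) tapply(col, cl, mean))
}
table(cl) # cluster sizes after convergence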

Example: Those irises must be clustered!

Let’s begin with a dataset with metric variables only.

Useful sources: https://www.stat.berkeley.edu/~s133/Cluster2a.html https://www.r-bloggers.com/k-means-clustering-in-r/

library(datasets)
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
library(psych)
desc <- describeBy(iris, iris$Species, mat = TRUE) 
desc[1:6, 1:6]
##               item     group1 vars  n  mean        sd
## Sepal.Length1    1     setosa    1 50 5.006 0.3524897
## Sepal.Length2    2 versicolor    1 50 5.936 0.5161711
## Sepal.Length3    3  virginica    1 50 6.588 0.6358796
## Sepal.Width1     4     setosa    2 50 3.428 0.3790644
## Sepal.Width2     5 versicolor    2 50 2.770 0.3137983
## Sepal.Width3     6  virginica    2 50 2.974 0.3224966
tapply(iris$Petal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      1.462      4.260      5.552
tapply(iris$Petal.Width, iris$Species, mean)
##     setosa versicolor  virginica 
##      0.246      1.326      2.026

Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
  geom_point() +
  theme_classic()

PARTITIONING: k means

set.seed(20) # to ensure reproducibility (without it, the random starting assignment differs between runs)
irisCluster <- kmeans(iris[, 3:4], 
                      3, # we know there are 3 species
                      nstart = 20 # R will try 20 different random starting assignments 
                      # and then select the one with the lowest within cluster variation.
)

Why did we use only these two variables [, 3:4]? Because they differentiate between the species best, as the scatterplot matrix shows:

library(lattice)
splom(~ iris, groups = iris$Species, auto.key = TRUE)

Tabulate the clustering results against the true species:

table(irisCluster$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48         4
##   3      0          2        46

We can use the ‘elbow method’ to determine the optimal k empirically.

To do this, we run several models with k = 1, …, 10 and compare the total within-cluster sum of squares (the summed squared distances from each observation to its cluster centroid) in each case. The smaller this indicator, the higher the intra-cluster similarity. It always decreases as k grows, however, so the optimal number of clusters is where the ‘elbow’ occurs, i.e. the k beyond which adding further clusters no longer reduces the within sum of squares substantially.

library(purrr)
tot_within <- map_dbl(1:10, function(k){
  model <- kmeans(x = iris[,3:4], centers = k)
  model$tot.withinss
})
scree_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_within
)
ggplot(scree_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  theme_classic()

According to our scree plot (elbow plot), the optimal k is 2 or 3.
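An alternative criterion is the average silhouette width (higher is better), which we will meet again in the PAM section below. A quick sketch for our 3-cluster k-means solution:

library(cluster)
sil <- silhouette(irisCluster$cluster, dist(iris[, 3:4])) # one silhouette value per observation
summary(sil)$avg.width # average silhouette width for k = 3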

The cluster package allows us to represent the cluster solution in two dimensions (with the aid of principal component analysis, PCA):

library(cluster)
clusplot(iris, irisCluster$cluster, main = '2D representation of the Cluster solution',
         color = TRUE, shade = TRUE,
         labels = 2, # all points and ellipses are labelled in the plot
         lines = 0) # no lines between the centres of ellipses

This plot shows that we can easily separate cluster 1 from clusters 2 and 3, but the latter two overlap. Therefore, not every observation can be classified correctly.

HIERARCHICAL: agglomerative nesting (agnes)

agnes (AGglomerative NESting) builds the hierarchy bottom-up: every observation starts in its own cluster, and the two closest clusters are merged step by step.

iris.use <- subset(iris, select = -Species)
d <- dist(iris.use, method = 'euclidean') 

The choice of metric may have a large impact depending on what data you have.

See an example here:

[Figure: three clustering methods applied to target-shaped data.]

What distance to choose?

The right choice depends on the measurement type of the variables.

distance metric - description

euclidean - the usual straight-line distance between two vectors (interval variables)
manhattan - the absolute (city-block) distance between two vectors
canberra - for non-negative values (e.g., counts)
minkowski - a generalisation of euclidean and manhattan; can be used for both ordinal and quantitative variables
gower - the weighted mean of the contributions of each variable (for mixed types)
binary - Jaccard distance for dichotomous variables
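To see how much the choice matters, compare several metrics on the same (made-up) pair of points:

x <- rbind(c(0, 0), c(3, 4)) # two points in 2-D
dist(x, method = "euclidean") # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan") # |3| + |4| = 7
dist(x, method = "minkowski", p = 3) # (3^3 + 4^3)^(1/3), approx. 4.5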

More about distances: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling

You can produce a plot with distances between observations and a dendrogram:

heatmap(as.matrix(d), symm = T,
        distfun = function(x) as.dist(x)) 

How many clusters can you see here, 2 or 3? The large squares help you decide how many clusters to retain.

Let’s try various clustering methods based on Ward’s algorithm, average linkage, and complete linkage:

1 Get the cluster solution

library(cluster)
h.fit <- agnes(d, method = "ward") # Ward: each merge minimises the increase in total within-cluster variance
h2.fit <- agnes(d, method = "average") # average linkage: mean distance between all pairs of observations across two clusters
h3.fit <- agnes(d, method = "complete") # complete linkage: maximal distance between observations across two clusters

For the brief comparison of methods, see: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf

2 In this case when we know the species, we can compare methods by their classification quality:

Recall: https://en.wikipedia.org/wiki/Confusion_matrix

table(cutree(h.fit, 3), iris$Species) # 16 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         49        15
##   3      0          1        35
table(cutree(h2.fit, 3), iris$Species) # 14 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         50        14
##   3      0          0        36
table(cutree(h3.fit, 3), iris$Species) # bad! even the best labelling leaves 23 + 1 = 24 incorrect
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         23        49
##   3      0         27         1
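A quick way to count these errors is to label each cluster with its majority species and treat everything else as misclassified. A small helper (the function name is my own):

misclassified <- function(tab) sum(tab) - sum(apply(tab, 1, max)) # errors under majority labelling
misclassified(table(cutree(h.fit, 3), iris$Species)) # 16
misclassified(table(cutree(h2.fit, 3), iris$Species)) # 14
misclassified(table(cutree(h3.fit, 3), iris$Species)) # 24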

In our case, the average linkage method (2nd table) seems to work best, as it produces the smallest number of misclassified observations. Let’s visualise the result!

Dendrograms to visualize results

cl <- cutree(h2.fit, k = 3)
library(dplyr)
iris <- mutate(iris, cluster = cl) # added cluster membership as a column to the data frame
library(magrittr)
iris %>%
  group_by(cluster) %>% 
  summarise_all(mean) # mean values of all variables by cluster (Species is a factor, so its mean is NA)
## # A tibble: 3 x 6
##   cluster Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##     <int>        <dbl>       <dbl>        <dbl>       <dbl>   <dbl>
## 1       1         5.01        3.43         1.46       0.246      NA
## 2       2         5.93        2.76         4.41       1.44       NA
## 3       3         6.85        3.08         5.79       2.10       NA
library(psych)
describeBy(iris, iris$cluster) # for categorical variables
## 
##  Descriptive statistics by group 
## group: 1
##              vars  n mean   sd median trimmed  mad min max range skew
## Sepal.Length    1 50 5.01 0.35    5.0    5.00 0.30 4.3 5.8   1.5 0.11
## Sepal.Width     2 50 3.43 0.38    3.4    3.42 0.37 2.3 4.4   2.1 0.04
## Petal.Length    3 50 1.46 0.17    1.5    1.46 0.15 1.0 1.9   0.9 0.10
## Petal.Width     4 50 0.25 0.11    0.2    0.24 0.00 0.1 0.6   0.5 1.18
## Species*        5 50 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN
## cluster         6 50 1.00 0.00    1.0    1.00 0.00 1.0 1.0   0.0  NaN
##              kurtosis   se
## Sepal.Length    -0.45 0.05
## Sepal.Width      0.60 0.05
## Petal.Length     0.65 0.02
## Petal.Width      1.26 0.01
## Species*          NaN 0.00
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 2
##              vars  n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 64 5.93 0.49    5.9    5.93 0.44 4.9 7.0   2.1 -0.01
## Sepal.Width     2 64 2.76 0.30    2.8    2.77 0.30 2.0 3.4   1.4 -0.32
## Petal.Length    3 64 4.41 0.51    4.5    4.45 0.59 3.0 5.1   2.1 -0.62
## Petal.Width     4 64 1.44 0.29    1.4    1.42 0.22 1.0 2.4   1.4  0.64
## Species*        5 64 2.22 0.42    2.0    2.15 0.00 2.0 3.0   1.0  1.33
## cluster         6 64 2.00 0.00    2.0    2.00 0.00 2.0 2.0   0.0   NaN
##              kurtosis   se
## Sepal.Length    -0.36 0.06
## Sepal.Width     -0.40 0.04
## Petal.Length    -0.27 0.06
## Petal.Width      0.36 0.04
## Species*        -0.24 0.05
## cluster           NaN 0.00
## -------------------------------------------------------- 
## group: 3
##              vars  n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 36 6.85 0.51    6.7    6.83 0.44 6.1 7.9   1.8  0.54
## Sepal.Width     2 36 3.08 0.30    3.0    3.06 0.30 2.5 3.8   1.3  0.49
## Petal.Length    3 36 5.79 0.46    5.7    5.75 0.44 5.1 6.9   1.8  0.68
## Petal.Width     4 36 2.10 0.26    2.1    2.11 0.30 1.4 2.5   1.1 -0.49
## Species*        5 36 3.00 0.00    3.0    3.00 0.00 3.0 3.0   0.0   NaN
## cluster         6 36 3.00 0.00    3.0    3.00 0.00 3.0 3.0   0.0   NaN
##              kurtosis   se
## Sepal.Length    -0.98 0.08
## Sepal.Width      0.23 0.05
## Petal.Length    -0.29 0.08
## Petal.Width     -0.32 0.04
## Species*          NaN 0.00
## cluster           NaN 0.00
library(dendextend)
dend <- as.dendrogram(h2.fit)
dend_col <- color_branches(dend, k = 3)
plot(dend_col)

In this way you have cut the dendrogram to obtain three clusters.

A simpler way is to draw boxes around the clusters. The ‘banner plot’ below is rarely used, so you can ignore it for now. A long white line between the red bars indicates that a cluster is well separated from the rest of the clusters.

plot(h2.fit)

rect.hclust(h2.fit, k = 3, border = "green")

See: https://datascience.stackexchange.com/questions/18860/how-to-interpret-agglomerative-coefficient-agnes-function

Alternative functions for hierarchical clustering:

H.fit <- hclust(d, method = "ward.D2") # agnes(*, method = "ward") <=> hclust(*, "ward.D2")
plot(H.fit) # display dendrogram
rect.hclust(H.fit, k = 3, border = "green") # dendrogram with borders around 3 clusters

Another possibility is to cut the tree at a given height. The height at which two clusters merge in a dendrogram reflects their dissimilarity under the chosen distance and clustering method. Here, the dendrogram results from a matrix of Euclidean distances and Ward’s clustering, so the height of each merge reflects the increase in the error sum of squares (the quantity Ward’s method minimises). Cutting the dendrogram at a height of 10 keeps separate all clusters that would only merge above that height.

cl2 <- cutree(H.fit, h = 10)
dend <- as.dendrogram(H.fit)
dend_col <- color_branches(dend, h = 10)
plot(dend_col)
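You can check where such a cut falls by inspecting the merge heights directly:

tail(sort(H.fit$height), 5) # the five largest merge heights in the tree
table(cl2) # cluster sizes produced by the cut at h = 10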

A horizontal dendrogram:

labs <- paste("iris_", 1:150, sep = "") # new labels - optional!
par(mar = c(3, 1, 1, 5))
dend <- set(as.dendrogram(H.fit), "labels", labs) # attach the new labels in leaf order (dendextend)
plot(dend, horiz = TRUE)

Better dendrograms are possible: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels

Clustering Mixed Data Types in R

Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.

See: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/

For illustration, we will use the publicly available College dataset from the ISLR package, which contains various statistics on US colleges from 1995 (N = 777). To highlight the challenge of handling mixed data types, both categorical and continuous variables will be used; they are listed below:

Continuous variables: Acceptance rate, Out of state students’ tuition, Number of new students enrolled

Categorical variables: Whether a college is public/private, Whether a college is elite, defined as having more than 50% of new students who graduated in the top 10% of their high school class

set.seed(1588) # for reproducibility

library(ISLR) # for college dataset
library(Rtsne) # for t-SNE plot
library(tidyverse) # for %>%

1. Clean the data:

college_clean <- College %>%
  mutate(name = row.names(.), # make college names a variable
         accept_rate = Accept / Apps, # share of applicants accepted
         isElite = cut(Top10perc, # label colleges with many top students as ‘elite’
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll,
         Grad.Rate, Private, isElite)
glimpse(college_clean)
## Observations: 777
## Variables: 7
## $ name        <chr> "Abilene Christian University", "Adelphi University"…
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.756476…
## $ Outstate    <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868…
## $ Enroll      <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472…
## $ Grad.Rate   <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, …
## $ Private     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
## $ isElite     <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, N…
2. Then we have to use a distance metric that can handle mixed data types.

Create a distance object (taking out the name of the school in the 1st column):

library(cluster)
gower_dist <- daisy(college_clean[, -1], # this is a distances object
                    metric = "gower",
                    type = list(logratio = 3))
summary(gower_dist)
## 301476 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735 
## Metric :  mixed ;  Types = I, I, I, I, N, N 
## Number of objects : 777

Check the attributes to ensure the correct methods are being used (I = interval, N = nominal). Note that despite logratio being called for one variable, its type remains coded as “I”.

3. We can print out the most similar and the most dissimilar pair in the data to see if the distances make sense.

gower_mat <- as.matrix(gower_dist) # this is a 777 x 777 matrix

Most similar pair of colleges:

college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
##                            name accept_rate Outstate Enroll Grad.Rate
## 682 University of St. Thomas MN   0.8784638    11712    828        89
## 284     John Carroll University   0.8711276    11700    820        89
##     Private   isElite
## 682     Yes Not Elite
## 284     Yes Not Elite

Most dissimilar pair of colleges:

college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
##                                        name accept_rate Outstate Enroll
## 673 University of Sci. and Arts of Oklahoma   0.9824561     3687    208
## 251                      Harvard University   0.1561486    18485   1606
##     Grad.Rate Private   isElite
## 673        43      No Not Elite
## 251       100     Yes     Elite

Choose the algorithm.

We will use partitioning around medoids (PAM) as it is one of the options for handling a custom distance matrix.

Q: What is PAM?

A: PAM is an iterative algorithm similar to k-means, but each cluster centre is a ‘medoid’: the actual observation in the cluster whose average dissimilarity to all other members of that cluster is minimal. It works well with n < 10,000 observations.
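To illustrate what a medoid is, here is a toy sketch that finds the medoid of the first ten irises by hand (not part of the PAM fit below):

m <- as.matrix(dist(iris[1:10, 3:4])) # pairwise distances among 10 observations
which.min(rowSums(m)) # the medoid: the observation with the smallest total distance to the others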

Q: How many clusters should be retained in the solution?

A: Look at the silhouette width. This is an internal validation metric measuring how similar an observation is to its own cluster compared to its closest neighbouring cluster: for observation i, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of i’s own cluster and b(i) is the mean distance to the members of the nearest other cluster. The metric ranges from -1 to 1, and higher values are better.

4. Calculate the average silhouette width for many values of k using PAM:

sil_width <- c(NA) # no silhouette is defined for k = 1

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

Plot silhouette width (higher is better):

plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt='n',
     ylab = "Silhouette Width")
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

Conclusion: select 3 clusters, as this solution has the highest average silhouette width.

5. Interpret the solution:

We can interpret the clusters by running summary() on each cluster:

pam_fit <- pam(gower_dist, diss = TRUE, k = 3)
pam_results <- college_clean %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
##   accept_rate        Outstate         Enroll         Grad.Rate     
##  Min.   :0.3283   Min.   : 2340   Min.   :  35.0   Min.   : 15.00  
##  1st Qu.:0.7225   1st Qu.: 8842   1st Qu.: 194.8   1st Qu.: 56.00  
##  Median :0.8004   Median :10905   Median : 308.0   Median : 67.50  
##  Mean   :0.7820   Mean   :11200   Mean   : 418.6   Mean   : 66.97  
##  3rd Qu.:0.8581   3rd Qu.:13240   3rd Qu.: 484.8   3rd Qu.: 78.25  
##  Max.   :1.0000   Max.   :21700   Max.   :4615.0   Max.   :118.00  
##  Private        isElite       cluster 
##  No :  0   Not Elite:500   Min.   :1  
##  Yes:500   Elite    :  0   1st Qu.:1  
##                            Median :1  
##                            Mean   :1  
##                            3rd Qu.:1  
##                            Max.   :1  
## 
## [[2]]
##   accept_rate        Outstate         Enroll         Grad.Rate     
##  Min.   :0.1545   Min.   : 5224   Min.   : 137.0   Min.   : 54.00  
##  1st Qu.:0.4135   1st Qu.:13850   1st Qu.: 391.0   1st Qu.: 77.00  
##  Median :0.5329   Median :17238   Median : 601.0   Median : 89.00  
##  Mean   :0.5392   Mean   :16225   Mean   : 882.5   Mean   : 84.78  
##  3rd Qu.:0.6988   3rd Qu.:18590   3rd Qu.:1191.0   3rd Qu.: 94.00  
##  Max.   :0.9605   Max.   :20100   Max.   :4893.0   Max.   :100.00  
##  Private       isElite      cluster 
##  No : 4   Not Elite: 0   Min.   :2  
##  Yes:65   Elite    :69   1st Qu.:2  
##                          Median :2  
##                          Mean   :2  
##                          3rd Qu.:2  
##                          Max.   :2  
## 
## [[3]]
##   accept_rate        Outstate         Enroll       Grad.Rate     
##  Min.   :0.3746   Min.   : 2580   Min.   : 153   Min.   : 10.00  
##  1st Qu.:0.6423   1st Qu.: 5295   1st Qu.: 694   1st Qu.: 46.00  
##  Median :0.7458   Median : 6598   Median :1302   Median : 54.50  
##  Mean   :0.7315   Mean   : 6698   Mean   :1615   Mean   : 55.42  
##  3rd Qu.:0.8368   3rd Qu.: 7748   3rd Qu.:2184   3rd Qu.: 65.00  
##  Max.   :1.0000   Max.   :15516   Max.   :6392   Max.   :100.00  
##  Private        isElite       cluster 
##  No :208   Not Elite:199   Min.   :3  
##  Yes:  0   Elite    :  9   1st Qu.:3  
##                            Median :3  
##                            Mean   :3  
##                            3rd Qu.:3  
##                            Max.   :3
6. Interpretation

Hint: medoids serve as exemplars of each cluster

college_clean[pam_fit$medoids, ]
##                              name accept_rate Outstate Enroll Grad.Rate
## 492         Saint Francis College   0.7877629    10880    284        69
## 38                Barnard College   0.5616987    17926    531        91
## 234 Grand Valley State University   0.7525653     6108   1561        57
##     Private   isElite
## 492     Yes Not Elite
## 38      Yes     Elite
## 234      No Not Elite
7. Visualisation for delivery of results:

To visualise many variables in a lower-dimensional space, use t-distributed stochastic neighbor embedding, or t-SNE. It tries to preserve local structure, so clusters remain visible in 2D or 3D.

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = college_clean$name)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

Have you noticed the small group?

It is made up of the larger, more competitive public schools. The group is not large enough to warrant an additional cluster according to silhouette width, but these 13 schools clearly have distinct characteristics: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705

tsne_data %>%
  filter(X > -10 & X < 0,   #be careful here, cluster coordinates may differ
         Y > -30 & Y < -25) %>% #careful
  left_join(college_clean, by = "name") %>%
  collect %>%
  .[["name"]]
##  [1] "College of William and Mary"                
##  [2] "Georgia Institute of Technology"            
##  [3] "SUNY at Binghamton"                         
##  [4] "SUNY College at Geneseo"                    
##  [5] "Trenton State College"                      
##  [6] "University of California at Berkeley"       
##  [7] "University of California at Irvine"         
##  [8] "University of Florida"                      
##  [9] "University of Illinois - Urbana"            
## [10] "University of Michigan at Ann Arbor"        
## [11] "University of Minnesota at Morris"          
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"

Home task:

try classifying US states by crime types: data(USArrests)

data(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
describe(USArrests)
##          vars  n   mean    sd median trimmed    mad  min   max range  skew
## Murder      1 50   7.79  4.36   7.25    7.53   5.41  0.8  17.4  16.6  0.37
## Assault     2 50 170.76 83.34 159.00  168.47 110.45 45.0 337.0 292.0  0.22
## UrbanPop    3 50  65.54 14.47  66.00   65.88  17.79 32.0  91.0  59.0 -0.21
## Rape        4 50  21.23  9.37  20.10   20.36   8.60  7.3  46.0  38.7  0.75
##          kurtosis    se
## Murder      -0.95  0.62
## Assault     -1.15 11.79
## UrbanPop    -0.87  2.05
## Rape         0.08  1.32
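Hint: the variables are on very different scales (the sd of Assault is about 83 against about 4 for Murder), so Assault would dominate any Euclidean distance. Standardise the variables first; a minimal sketch to start from:

arrests_scaled <- scale(USArrests) # zero mean, unit variance per column
d_usa <- dist(arrests_scaled, method = "euclidean")
usa_fit <- hclust(d_usa, method = "ward.D2") # Ward clustering on the scaled data
plot(usa_fit) # dendrogram of the 50 states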

This is it! Have fun with clusters and mind the distances between observations.