Anna Shirokanova
19 02 2019
Clustering algorithms are used to split a dataset into several groups, called ‘clusters’, so that the objects in the same cluster are as similar as possible and the objects in different clusters are as dissimilar as possible.
We will cover two approaches:
1 k-means clustering, a partitioning method that splits a dataset into a set of k clusters.
2 hierarchical clustering, an alternative to k-means that identifies clusters using a pairwise distance matrix between observations as the clustering criterion.
Starting from k initial centroids, the k-means algorithm repeats two steps until the cluster assignments stop changing (a hand-written sketch of one pass follows below):
1 Reassign each data point to the cluster whose centroid is closest.
2 Recalculate the centroid of each cluster.
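The following minimal sketch spells out these two steps on a small toy matrix; the object names (x, centers, assignment) and the toy data are illustrative only, and in practice you would simply call kmeans(), as we do below.
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 toy observations, 2 variables
centers <- x[sample(nrow(x), 2), ] # start from 2 randomly chosen observations as centroids
for (i in 1:10) {                  # a fixed number of passes keeps the sketch short
  # Step 1: reassign each point to the closest centroid
  d <- as.matrix(dist(rbind(centers, x)))[-(1:2), 1:2]
  assignment <- apply(d, 1, which.min)
  # Step 2: recalculate the centroid of each cluster
  # (ignoring the empty-cluster edge case for simplicity)
  centers <- apply(x, 2, function(col) tapply(col, assignment, mean))
}
assignment # final cluster membership of the toy points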
Let’s begin with a dataset with metric variables only.
Useful sources: https://www.stat.berkeley.edu/~s133/Cluster2a.html https://www.r-bloggers.com/k-means-clustering-in-r/
library(datasets)
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
library(psych)
desc <- describeBy(iris, iris$Species, mat = TRUE)
desc[1:6, 1:6]
## item group1 vars n mean sd
## Sepal.Length1 1 setosa 1 50 5.006 0.3524897
## Sepal.Length2 2 versicolor 1 50 5.936 0.5161711
## Sepal.Length3 3 virginica 1 50 6.588 0.6358796
## Sepal.Width1 4 setosa 2 50 3.428 0.3790644
## Sepal.Width2 5 versicolor 2 50 2.770 0.3137983
## Sepal.Width3 6 virginica 2 50 2.974 0.3224966
tapply(iris$Petal.Length, iris$Species, mean)
## setosa versicolor virginica
## 1.462 4.260 5.552
tapply(iris$Petal.Width, iris$Species, mean)
## setosa versicolor virginica
## 0.246 1.326 2.026
Petal.Length and Petal.Width are similar within the same species but vary considerably between different species:
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
geom_point() +
theme_classic()

set.seed(20) # for reproducibility (otherwise the random starting assignment changes between runs)
irisCluster <- kmeans(iris[, 3:4],
                      3, # we know there are 3 species
                      nstart = 20 # R will try 20 different random starting assignments
                      # and then select the one with the lowest within-cluster variation
                      )

Why did we use only these two variables [, 3:4]? Because they differentiate between the species best, as the scatterplot matrix below shows:
library(lattice)
splom(~ iris, groups = iris$Species, auto.key = TRUE)

Cross-tabulate the clustering result against the actual species:
table(irisCluster$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 4
## 3 0 2 46
We can use the ‘elbow method’ to determine the optimal k empirically.
To do this, we run several models with k = 1:10 and in each case record the total within-cluster sum of squares (the sum of squared distances from each observation to its cluster centroid). The smaller this indicator, the more compact the clusters (= high intra-cluster similarity). The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. the k after which adding further clusters no longer reduces the within-cluster sum of squares substantially.
library(purrr)
tot_within <- map_dbl(1:10, function(k){
model <- kmeans(x = iris[,3:4], centers = k)
model$tot.withinss
})
scree_df <- data.frame(
k = 1:10,
tot_withinss = tot_within
)
ggplot(scree_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10) +
theme_classic()

According to our scree plot (elbow plot), the optimal k is 2 or 3.
The cluster package allows us to represent (with the aid of principal component analysis, PCA) the cluster solution in two dimensions:
library(cluster)
clusplot(iris, irisCluster$cluster, main = '2D representation of the Cluster solution',
color = TRUE, shade = TRUE,
labels = 2, # all points and ellipses are labelled in the plot
lines = 0) # no lines between the centres of ellipses

This plot shows that we can easily differentiate cluster 1 from clusters 2 and 3, but the latter two overlap. Therefore, not every observation can be classified correctly.
iris.use <- subset(iris, select = -Species)
d <- dist(iris.use, method = 'euclidean')

The choice of distance metric may have a large impact, depending on the data you have.
See an example (figure): three clustering methods applied to the same target-shaped data.
The right choice of distance depends on the measurement level of the variables:
| distance | metric description |
|---|---|
| euclidean | usual distance between two vectors (interval variables) |
| manhattan | absolute distance between two vectors |
| canberra | for non-negative values (e.g., counts) |
| minkowski | can be used for both ordinal and quantitative variables |
| gower | the weighted mean of the contributions of each variable (for mixed types) |
| binary | Jaccard distance for dichotomous variables |
More about distances: http://www.sthda.com/english/wiki/clarifying-distance-measures-unsupervised-machine-learning#distances-and-scaling
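To make the table concrete, here is a small illustrative sketch of how these metrics can be computed in R; the five-row subset of iris and the Minkowski power p = 3 are arbitrary choices for demonstration only.
library(cluster) # for daisy() and the Gower distance

x <- iris[1:5, 3:4] # five observations, two metric variables
dist(x, method = "euclidean")
dist(x, method = "manhattan")
dist(x, method = "minkowski", p = 3) # p = 2 would reproduce the Euclidean case

# Gower distance handles mixed variable types, so the Species factor can be included:
daisy(iris[1:5, 3:5], metric = "gower")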
You can produce a plot with distances between observations and a dendrogram:
heatmap(as.matrix(d), symm = T,
distfun = function(x) as.dist(x))

How many clusters can you see here, 2 or 3? The large uniform blocks along the diagonal are there to help you decide how many clusters to retain.
1 Get the cluster solution
library(cluster)
h.fit <- agnes(d, method = "ward") # Ward: merge the pair of clusters that gives the smallest increase in within-cluster variance
h2.fit <- agnes(d, method = "average") # average linkage: mean distance between all pairs of observations in the two clusters
h3.fit <- agnes(d, method = "complete") # complete linkage: maximal distance between observations in the two clusters

For a brief comparison of the methods, see: https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf
2 Since in this case we know the true species, we can compare the methods by their classification quality:
Recall: https://en.wikipedia.org/wiki/Confusion_matrix
table(cutree(h.fit, 3), iris$Species) # 16 incorrect
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 49 15
## 3 0 1 35
table(cutree(h2.fit, 3), iris$Species) # 14 incorrect
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 50 14
## 3 0 0 36
table(cutree(h3.fit, 3), iris$Species) # much worse: clusters 2 and 3 are largely mixed (at best 23 + 1 = 24 incorrect)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
In our case, the average linkage method (second table) seems to work best, as it produces the smallest number of misclassified observations. Let’s visualise the result!
cl <- cutree(h2.fit, k = 3)
library(dplyr)
iris <- mutate(iris, cluster = cl) # added cluster membership as a column to the data frame
library(magrittr)
iris %>%
group_by(cluster) %>%
summarise_all(funs(mean(.))) # mean values of all variables by cluster (Species is a factor, hence NA below)
## # A tibble: 3 x 6
## cluster Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.01 3.43 1.46 0.246 NA
## 2 2 5.93 2.76 4.41 1.44 NA
## 3 3 6.85 3.08 5.79 2.10 NA
library(psych)
describeBy(iris, iris$cluster) # also summarises categorical variables (marked with *)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 50 5.01 0.35 5.0 5.00 0.30 4.3 5.8 1.5 0.11
## Sepal.Width 2 50 3.43 0.38 3.4 3.42 0.37 2.3 4.4 2.1 0.04
## Petal.Length 3 50 1.46 0.17 1.5 1.46 0.15 1.0 1.9 0.9 0.10
## Petal.Width 4 50 0.25 0.11 0.2 0.24 0.00 0.1 0.6 0.5 1.18
## Species* 5 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN
## cluster 6 50 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN
## kurtosis se
## Sepal.Length -0.45 0.05
## Sepal.Width 0.60 0.05
## Petal.Length 0.65 0.02
## Petal.Width 1.26 0.01
## Species* NaN 0.00
## cluster NaN 0.00
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 64 5.93 0.49 5.9 5.93 0.44 4.9 7.0 2.1 -0.01
## Sepal.Width 2 64 2.76 0.30 2.8 2.77 0.30 2.0 3.4 1.4 -0.32
## Petal.Length 3 64 4.41 0.51 4.5 4.45 0.59 3.0 5.1 2.1 -0.62
## Petal.Width 4 64 1.44 0.29 1.4 1.42 0.22 1.0 2.4 1.4 0.64
## Species* 5 64 2.22 0.42 2.0 2.15 0.00 2.0 3.0 1.0 1.33
## cluster 6 64 2.00 0.00 2.0 2.00 0.00 2.0 2.0 0.0 NaN
## kurtosis se
## Sepal.Length -0.36 0.06
## Sepal.Width -0.40 0.04
## Petal.Length -0.27 0.06
## Petal.Width 0.36 0.04
## Species* -0.24 0.05
## cluster NaN 0.00
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 36 6.85 0.51 6.7 6.83 0.44 6.1 7.9 1.8 0.54
## Sepal.Width 2 36 3.08 0.30 3.0 3.06 0.30 2.5 3.8 1.3 0.49
## Petal.Length 3 36 5.79 0.46 5.7 5.75 0.44 5.1 6.9 1.8 0.68
## Petal.Width 4 36 2.10 0.26 2.1 2.11 0.30 1.4 2.5 1.1 -0.49
## Species* 5 36 3.00 0.00 3.0 3.00 0.00 3.0 3.0 0.0 NaN
## cluster 6 36 3.00 0.00 3.0 3.00 0.00 3.0 3.0 0.0 NaN
## kurtosis se
## Sepal.Length -0.98 0.08
## Sepal.Width 0.23 0.05
## Petal.Length -0.29 0.08
## Petal.Width -0.32 0.04
## Species* NaN 0.00
## cluster NaN 0.00
library(dendextend)
dend <- as.dendrogram(h2.fit)
dend_col <- color_branches(dend, k = 3)
plot(dend_col)

In this way you have cut the dendrogram to obtain three clusters.
A simpler way is to draw boxes around the clusters. The ‘banner plot’ below is rarely used, so you can ignore it for now. The white line between the red lines indicates that a cluster is well separated from the rest of the clusters.
plot(h2.fit)
rect.hclust(h2.fit, k = 3, border = "green")

Alternative functions for hierarchical clustering:
H.fit <- hclust(d, method = "ward.D2") # agnes(*, method = "ward") <=> hclust(*, "ward.D2")
plot(H.fit) # display the dendrogram
rect.hclust(H.fit, k = 3, border = "green") # dendrogram with borders around 3 clusters

Another possibility is to cut the dendrogram at a given height. The height in a dendrogram reflects the combination of linkage method and distance. Here, the dendrogram comes from a Euclidean distance matrix and Ward’s clustering, so cutting it at a height of 10 means that, within the resulting clusters, all merges occurred below that value of Ward’s criterion (which minimises the increase in the error sum of squares).
cl2 <- cutree(H.fit, h = 10)
dend <- as.dendrogram(H.fit)
dend_col <- color_branches(dend, h = 10)
plot(dend_col)

A horizontal dendrogram:
labs = paste("iris_", 1:150, sep = "") # new labels - optional!
par(mar = c(3,1,1,5))
plot(as.dendrogram(H.fit), horiz = T)

Better dendrograms are possible: http://stackoverflow.com/questions/14118033/horizontal-dendrogram-in-r-with-labels
Clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest.
See: https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
For illustration, we will use the publicly available College dataset from the ISLR package, which contains various statistics on US colleges from 1995 (N = 777). To highlight the challenge of handling mixed data types, both categorical and continuous variables will be used; they are listed below:
Continuous variables: Acceptance rate, Out of state students’ tuition, Number of new students enrolled
Categorical variables: Whether a college is public/private, Whether a college is elite, defined as having more than 50% of new students who graduated in the top 10% of their high school class
set.seed(1588) # for reproducibility
library(ISLR) # for college dataset
library(Rtsne) # for t-SNE plot
library(tidyverse) # for %>%

college_clean <- College %>%
  mutate(name = row.names(.),         # make college names a variable
         accept_rate = Accept / Apps, # share of applicants accepted
         isElite = cut(Top10perc,     # label colleges with many top students as "Elite"
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll,
         Grad.Rate, Private, isElite)
glimpse(college_clean)
## Observations: 777
## Variables: 7
## $ name <chr> "Abilene Christian University", "Adelphi University"…
## $ accept_rate <dbl> 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.756476…
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868…
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472…
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74, …
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
## $ isElite <fct> Not Elite, Not Elite, Not Elite, Elite, Not Elite, N…
Create a distance object (taking out the name of the school in the 1st column):
library(cluster)
gower_dist <- daisy(college_clean[, -1], # this is a distances object
metric = "gower",
type = list(logratio = 3))
summary(gower_dist)
## 301476 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00186 0.10344 0.23587 0.23145 0.32714 0.77735
## Metric : mixed ; Types = I, I, I, I, N, N
## Number of objects : 777
Check the attributes to ensure that the correct methods are being used (I = interval, N = nominal). Note that despite logratio being requested for the third variable, its type is still reported as "I".
gower_mat <- as.matrix(gower_dist) # this is a 777 x 777 matrix

Most similar pair of colleges:
college_clean[
which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
arr.ind = TRUE)[1, ], ]
## name accept_rate Outstate Enroll Grad.Rate
## 682 University of St. Thomas MN 0.8784638 11712 828 89
## 284 John Carroll University 0.8711276 11700 820 89
## Private isElite
## 682 Yes Not Elite
## 284 Yes Not Elite
Most dissimilar pair of colleges:
college_clean[
which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
arr.ind = TRUE)[1, ], ]
## name accept_rate Outstate Enroll
## 673 University of Sci. and Arts of Oklahoma 0.9824561 3687 208
## 251 Harvard University 0.1561486 18485 1606
## Grad.Rate Private isElite
## 673 43 No Not Elite
## 251 100 Yes Elite
Choose the algorithm.
We will use partitioning around medoids (PAM) as it is one of the options for handling a custom distance matrix.
Q: What is PAM?
A: PAM is an iterative algorithm similar to k-means, but each cluster centre is a ‘medoid’: the observation within the cluster that has the lowest average distance to all other members of that cluster. It works well with n < 10,000 observations.
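A minimal sketch of this difference from k-means, reusing the iris petal variables from the first part purely for illustration (the object name pam_iris is a placeholder): a medoid is an actual row of the data, whereas a k-means centroid usually is not.
library(cluster)

pam_iris <- pam(iris[, 3:4], k = 3) # PAM on the two petal variables used earlier
pam_iris$medoids # three actual observations that serve as cluster centres
kmeans(iris[, 3:4], 3)$centers # k-means centroids are averages, not actual observations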
Q: How many clusters should be retained in the solution?
A: Look at the silhouette width. This is an internal validation metric which is an aggregated measure of how similar an observation is to its own cluster compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}

Plot the silhouette width (higher is better):
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width")
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

Conclusion: select 3 clusters, as this solution has the highest average silhouette width.
We can interpret the clusters by running summary() on each of them:
pam_fit <- pam(gower_dist, diss = TRUE, k = 3)
pam_results <- college_clean %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## accept_rate Outstate Enroll Grad.Rate
## Min. :0.3283 Min. : 2340 Min. : 35.0 Min. : 15.00
## 1st Qu.:0.7225 1st Qu.: 8842 1st Qu.: 194.8 1st Qu.: 56.00
## Median :0.8004 Median :10905 Median : 308.0 Median : 67.50
## Mean :0.7820 Mean :11200 Mean : 418.6 Mean : 66.97
## 3rd Qu.:0.8581 3rd Qu.:13240 3rd Qu.: 484.8 3rd Qu.: 78.25
## Max. :1.0000 Max. :21700 Max. :4615.0 Max. :118.00
## Private isElite cluster
## No : 0 Not Elite:500 Min. :1
## Yes:500 Elite : 0 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
## [[2]]
## accept_rate Outstate Enroll Grad.Rate
## Min. :0.1545 Min. : 5224 Min. : 137.0 Min. : 54.00
## 1st Qu.:0.4135 1st Qu.:13850 1st Qu.: 391.0 1st Qu.: 77.00
## Median :0.5329 Median :17238 Median : 601.0 Median : 89.00
## Mean :0.5392 Mean :16225 Mean : 882.5 Mean : 84.78
## 3rd Qu.:0.6988 3rd Qu.:18590 3rd Qu.:1191.0 3rd Qu.: 94.00
## Max. :0.9605 Max. :20100 Max. :4893.0 Max. :100.00
## Private isElite cluster
## No : 4 Not Elite: 0 Min. :2
## Yes:65 Elite :69 1st Qu.:2
## Median :2
## Mean :2
## 3rd Qu.:2
## Max. :2
##
## [[3]]
## accept_rate Outstate Enroll Grad.Rate
## Min. :0.3746 Min. : 2580 Min. : 153 Min. : 10.00
## 1st Qu.:0.6423 1st Qu.: 5295 1st Qu.: 694 1st Qu.: 46.00
## Median :0.7458 Median : 6598 Median :1302 Median : 54.50
## Mean :0.7315 Mean : 6698 Mean :1615 Mean : 55.42
## 3rd Qu.:0.8368 3rd Qu.: 7748 3rd Qu.:2184 3rd Qu.: 65.00
## Max. :1.0000 Max. :15516 Max. :6392 Max. :100.00
## Private isElite cluster
## No :208 Not Elite:199 Min. :3
## Yes: 0 Elite : 9 1st Qu.:3
## Median :3
## Mean :3
## 3rd Qu.:3
## Max. :3
It seems as though Cluster 1 is mainly Private/Not Elite with medium levels of out of state tuition and smaller levels of enrollment.
Cluster 2, on the other hand, is mainly Private/Elite with lower levels of acceptance rates, high levels of out of state tuition, and high graduation rates.
Finally, cluster 3 is mainly Public/Not Elite with the lowest levels of tuition, largest levels of enrollment, and lowest graduation rate.
Hint: the medoids serve as exemplars of their clusters:
college_clean[pam_fit$medoids, ]
## name accept_rate Outstate Enroll Grad.Rate
## 492 Saint Francis College 0.7877629 10880 284 69
## 38 Barnard College 0.5616987 17926 531 91
## 234 Grand Valley State University 0.7525653 6108 1561 57
## Private isElite
## 492 Yes Not Elite
## 38 Yes Elite
## 234 No Not Elite
To visualize many variables in a lower-dimensional space, use t-distributed stochastic neighbor embedding, or t-SNE. It tries to preserve local structure so as to make clusters visible in 2D or 3D.
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = college_clean$name)
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))

Have you noticed the small group?
It is made up of larger, more competitive public schools. The group is not big enough to warrant an additional cluster according to the silhouette width, but these 13 schools certainly have distinct characteristics: http://colleges.usnews.rankingsandreviews.com/best-colleges/william-and-mary-3705
tsne_data %>%
  filter(X > -10 & X < 0, # careful: t-SNE coordinates vary between runs,
         Y > -30 & Y < -25) %>% # so adjust these ranges to your own plot
  left_join(college_clean, by = "name") %>%
  collect %>%
  .[["name"]]

## [1] "College of William and Mary"
## [2] "Georgia Institute of Technology"
## [3] "SUNY at Binghamton"
## [4] "SUNY College at Geneseo"
## [5] "Trenton State College"
## [6] "University of California at Berkeley"
## [7] "University of California at Irvine"
## [8] "University of Florida"
## [9] "University of Illinois - Urbana"
## [10] "University of Michigan at Ann Arbor"
## [11] "University of Minnesota at Morris"
## [12] "University of North Carolina at Chapel Hill"
## [13] "University of Virginia"
Now try classifying US states by crime type using the USArrests dataset:
data(USArrests)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
describe(USArrests)
## vars n mean sd median trimmed mad min max range skew
## Murder 1 50 7.79 4.36 7.25 7.53 5.41 0.8 17.4 16.6 0.37
## Assault 2 50 170.76 83.34 159.00 168.47 110.45 45.0 337.0 292.0 0.22
## UrbanPop 3 50 65.54 14.47 66.00 65.88 17.79 32.0 91.0 59.0 -0.21
## Rape 4 50 21.23 9.37 20.10 20.36 8.60 7.3 46.0 38.7 0.75
## kurtosis se
## Murder -0.95 0.62
## Assault -1.15 11.79
## UrbanPop -0.87 2.05
## Rape 0.08 1.32
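One possible starting point, sketched under the assumption that you standardise the variables first (Assault is on a much larger scale than the rest); the object names and the choice of k = 4 are placeholders, not a recommended solution.
arrests_scaled <- scale(USArrests) # standardise: the variables have very different scales

d_arrests <- dist(arrests_scaled, method = "euclidean")
fit_arrests <- hclust(d_arrests, method = "ward.D2")
plot(fit_arrests) # inspect the dendrogram to choose the number of clusters

set.seed(20)
km_arrests <- kmeans(arrests_scaled, centers = 4, nstart = 20)
table(cutree(fit_arrests, k = 4), km_arrests$cluster) # compare the two solutions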
More practice: DataCamp Cluster Analysis in R course https://www.datacamp.com/courses/cluster-analysis-in-r
More exercises: clustering European countries by protein consumption: https://rstudio-pubs-static.s3.amazonaws.com/33876_1d7794d9a86647ca90c4f182df93f0e8.html
More methods: density-based clustering in R (DBSCAN) 1 paper https://cran.r-project.org/web/packages/dbscan/vignettes/dbscan.pdf 2 script http://www.sthda.com/english/wiki/wiki.php?id_contents=7940
More resources: https://www.r-bloggers.com/hybrid-hierarchical-k-means-clustering-for-optimizing-clustering-outputs-unsupervised-machine-learning/