Clustering analysis is used to find groups of similar objects in a dataset. There are two main categories of clustering:
These clustering methods can be computed using the R packages stats (for k-means) and cluster (for pam, clara and fanny), but the workflow require multiple steps and multiple lines of R codes. In this chapter, we provide some easy-to-use functions for enhancing the workflow of clustering analyses and we implemented ggplot2 method for visualizing the results.
The following R packages are required in this chapter: * factoextra for enhanced clustering analyses and data visualization * cluster for computing the standard PAM, CLARA, FANNY, AGNES and DIANA clustering
1.factoextra can be installed as follow:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
2.Install cluster:
install.packages("cluster")
3.Load required packages:
library(factoextra)
library(cluster)The built-in R dataset USArrests is used:
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
This section describes two functions:
# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]## Alabama Alaska Arizona Arkansas California Colorado
## Alabama 0.00 0.71 1.45 0.09 1.87 1.69
## Alaska 0.71 0.00 0.83 0.37 0.81 0.52
## Arizona 1.45 0.83 0.00 1.18 0.29 0.60
## Arkansas 0.09 0.37 1.18 0.00 1.59 1.37
## California 1.87 0.81 0.29 1.59 0.00 0.11
## Colorado 1.69 0.52 0.60 1.37 0.11 0.00
# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)The ordered dissimilarity matrix image (ODI) displays the clustering tendency of the dataset. Similar objects are close to one another. Red color corresponds to small distance and blue color indicates big distance between observation.
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
# Compute dissimilarity matrix
res.dist <- dist(df, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize
plot(res.hc, cex = 1, col="blue")In this chapter, we provide the function eclust() [in factoextra] which provides several advantages:
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)
It includes also:
In this section we’ll show some examples for enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.
# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)# Silhouette plot
fviz_silhouette(res.km)## cluster size ave.sil.width
## 1 1 8 0.39
## 2 2 16 0.34
## 3 3 13 0.37
## 4 4 13 0.27
# Optimal number of clusters using gap statistics
res.km$nbclust## [1] 4
# Print result
res.km## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 2 2 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 3 4 2 3
## Kansas Kentucky Louisiana Maine Maryland
## 2 3 1 3 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 4 3 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 4 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 3 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 4 2 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 16.212213 11.952463 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data" "gap_stat"
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogam fviz_silhouette(res.hc) # silhouette plot## cluster size ave.sil.width
## 1 1 19 0.26
## 2 2 19 0.28
## 3 3 12 0.43
fviz_cluster(res.hc) # scatter plotIt’s also possible to specify the number of clusters as follow:
eclust(df, "kmeans", k = 4)## K-means clustering with 4 clusters of sizes 13, 16, 13, 8
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 3 0.6950701 1.0394414 0.7226370 1.27693964
## 4 1.4118898 0.8743346 -0.8145211 0.01927104
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 4 3 3 4 3
## Colorado Connecticut Delaware Florida Georgia
## 3 2 2 3 4
## Hawaii Idaho Illinois Indiana Iowa
## 2 1 3 2 1
## Kansas Kentucky Louisiana Maine Maryland
## 2 1 4 1 3
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 3 1 4 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 3 1 2
## New Mexico New York North Carolina North Dakota Ohio
## 3 3 4 1 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 4
## South Dakota Tennessee Texas Utah Vermont
## 1 4 3 2 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 11.952463 16.212213 19.922437 8.316061
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "clust_plot" "silinfo" "nbclust"
## [13] "data"
This analysis has been performed using R software (R version 3.3.2 )