load libraries needed
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
load and prepare dataset
df <- USArrests
remove missing values
df <- na.omit(df)
scale each variable to have a mean of 0 and a standard deviation of 1
df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
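As a quick check on what `scale()` does here, the following minimal sketch standardizes the columns by hand (subtract the column mean, divide by the column standard deviation) and compares the result with `scale()`. The names `raw`, `manual`, `scaled` and `max_diff` are illustrative and not part of the original walkthrough.

```r
# USArrests ships with base R; convert to a matrix for plain arithmetic
raw <- as.matrix(na.omit(USArrests))

# z-score by hand: subtract each column's mean, then divide by its sd
manual <- sweep(raw, 2, colMeans(raw), "-")
manual <- sweep(manual, 2, apply(raw, 2, sd), "/")

# the same transformation via scale()
scaled <- scale(raw)

# both approaches should agree up to floating-point error
max_diff <- max(abs(manual - scaled))
print(max_diff)

# after scaling, column means are ~0 and standard deviations are 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

Standardizing matters for distance-based clustering because `Assault` is measured on a much larger numeric range than the other variables and would otherwise dominate the Euclidean distances.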
**Start clustering**
Since we don’t know beforehand which method will produce the best clusters, we can write a short function to perform hierarchical clustering using several different methods.
define linkage methods
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
function to compute agglomerative coefficient
ac <- function(x) {
  agnes(df, method = x)$ac
}
calculate agglomerative coefficient for each clustering linkage method
sapply(m, ac)
##   average    single  complete      ward
## 0.7379371 0.6276128 0.8531583 0.9346210
We can see that Ward’s minimum variance method produces the highest agglomerative coefficient (0.93), so we’ll use it as the method for our final hierarchical clustering. Values closer to 1 suggest a stronger clustering structure.
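The comparison of linkage methods can also be sketched with a helper that takes the data as an explicit argument instead of reading `df` from the global environment. The names `ac_for`, `methods`, `coefs` and `best_method` are hypothetical; only `agnes()` and its `$ac` component come from the `cluster` package.

```r
library(cluster)

df <- scale(na.omit(USArrests))

# hypothetical helper: same idea as ac(), but the data is passed in explicitly
ac_for <- function(data, method) {
  agnes(data, method = method)$ac
}

methods <- c(average = "average", single = "single",
             complete = "complete", ward = "ward")

# vapply() guarantees exactly one numeric value per linkage method
coefs <- vapply(methods, function(m) ac_for(df, m), numeric(1))

# rank the methods and pick the one with the largest coefficient
print(sort(coefs, decreasing = TRUE))
best_method <- names(which.max(coefs))
print(best_method)
```

Selecting the method programmatically with `which.max()` avoids reading the winner off the printed output by eye.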
perform hierarchical clustering using Ward’s minimum variance and produce dendrogram
clust <- agnes(df, method = 'ward')
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")
As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.
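To make the fusing process concrete, the sketch below inspects the merge history directly. It uses base R's `hclust()` with `"ward.D2"` as a stand-in for `agnes(..., method = "ward")`, which is an assumption made to keep the example dependency-free; `hc` and `first_pair` are illustrative names.

```r
df <- scale(na.omit(USArrests))
hc <- hclust(dist(df), method = "ward.D2")

# each row of hc$merge is one fusion; negative entries are single observations,
# so the first row always joins two individual states
first_pair <- rownames(df)[-hc$merge[1, ]]
print(first_pair)

# fusion heights increase as we move up the tree
print(head(hc$height))
print(tail(hc$height))
```

The first row of `hc$merge` gives the two most similar states, and the last entries of `hc$height` correspond to the final, most dissimilar fusions near the top of the dendrogram.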
Determine the optimal number of clusters
calculate the gap statistic for each number of clusters
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
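Besides reading the plot, a candidate number of clusters can be extracted numerically. The sketch below is a hedged variant that replaces `hcut` with a small hypothetical wrapper, `hclust_cut`, built on base R's `hclust()` and `cutree()`, and then applies `cluster::maxSE()` with the `"globalmax"` rule. The plotting function may apply a different selection rule, so the two need not always agree.

```r
library(cluster)

df <- scale(na.omit(USArrests))

# hypothetical stand-in for hcut: clusGap() needs a function of (x, k)
# that returns a list with a component named "cluster"
hclust_cut <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "ward.D2"), k = k))
}

set.seed(123)  # the reference data sets are simulated, so fix the seed
gap <- clusGap(df, FUN = hclust_cut, K.max = 10, B = 50)

# one row per candidate k, with the gap value and its simulation standard error
print(gap$Tab[, c("gap", "SE.sim")])

# "globalmax" picks the k with the largest gap statistic
k_best <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "globalmax")
print(k_best)
```

Other `maxSE()` rules, such as `"firstSEmax"`, favor smaller values of k when the gap curve is nearly flat, so it is worth comparing a couple of rules before settling on a number.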
Apply Cluster Labels to original dataset
Compute distance matrix
d <- dist(df, method = "euclidean")
perform hierarchical clustering using Ward’s method
final_clust <- hclust(d, method = "ward.D2")
cut the dendrogram into 4 clusters
groups <- cutree(final_clust, k=4)
table(groups)
## groups
##  1  2  3  4
##  7 12 19 12
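The counts above say how large each cluster is but not which states it contains. A minimal sketch for listing the members, assuming the same scaling, distance and linkage choices as in this walkthrough (`members` is an illustrative name):

```r
df <- scale(na.omit(USArrests))
groups <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)

# cutree() returns a named integer vector, so the state names can be
# split by cluster label
members <- split(names(groups), groups)

# cluster sizes, matching table(groups)
print(lengths(members))

# states assigned to cluster 1
print(members[["1"]])
```

`cutree()` numbers clusters in order of first appearance in the data, so cluster 1 is the one that contains the first row, Alabama.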
append cluster labels to the original data
final_data <- cbind(USArrests, clusters = groups)
head(final_data)
find the mean values for each cluster
aggregate(final_data, by = list(clusters = final_data$clusters),
          mean)
The mean number of murder arrests per 100,000 residents among the states in cluster 1 is 14.67.
The mean number of assault arrests per 100,000 residents among the states in cluster 1 is 251.28.
The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
The mean number of rape arrests per 100,000 residents among the states in cluster 1 is 21.68.
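The per-cluster means can be recomputed with the formula interface of `aggregate()`, which keeps `clusters` as the grouping column instead of averaging it alongside the other variables. This is a sketch under the same assumptions as the steps above; `cluster_means` is an illustrative name.

```r
df <- scale(na.omit(USArrests))
groups <- cutree(hclust(dist(df), method = "ward.D2"), k = 4)
final_data <- cbind(USArrests, clusters = groups)

# the formula interface averages every other column within each cluster
cluster_means <- aggregate(. ~ clusters, data = final_data, FUN = mean)

# one row per cluster, in the original (unscaled) units
print(round(cluster_means, 2))
```

Because the means are computed on `USArrests` rather than on the scaled matrix, they stay in interpretable units (arrests per 100,000 residents and percent urban population).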