In this R Markdown session, I use the built-in “USArrests” dataset to perform hierarchical and k-means clustering. Clustering groups observations that share similar characteristics, which helps reveal trends and patterns within the data.
The “USArrests” data set ships with R in the built-in datasets package. The data contains 1973 arrest statistics for the 50 US states, with arrest rates expressed per 100,000 residents. As such, the data does not reflect today’s trends and serves mainly as tutorial and practice material for statistical analysis.
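The data contains the following variables: Murder, Assault, and Rape (arrests per 100,000 residents for each offense) and UrbanPop (percentage of the population living in urban areas).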
The data can be loaded and viewed by the following code:
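#Loads the packages used below: dplyr for %>% and mutate(), factoextra for fviz_nbclust() and fviz_cluster().
library(dplyr)
library(factoextra)
#Loads built-in "USArrests" data set.
data("USArrests")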
#View Data.
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Before beginning the cluster analysis, the data is inspected for its structure (column names and dimensions) and for any missing values or duplicate rows that could affect the analysis.
#Column names.
colnames(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
#Number of Columns.
ncol(USArrests)
## [1] 4
#Number of Rows.
nrow(USArrests)
## [1] 50
#Check if there are missing values.
sum(is.na(USArrests))
## [1] 0
#Flags each row as an exact duplicate of an earlier row (TRUE) or not (FALSE).
duplicates <- USArrests %>% duplicated()
#Counts the flags in a table: non-duplicates fall under 'FALSE', duplicates under 'TRUE'.
duplicates_count <- duplicates%>%table()
duplicates_count
## .
## FALSE
## 50
The data appears clean, with no missing values or duplicates.
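The variables are measured on different scales (Assault ranges in the hundreds, while Murder stays below 20), so a distance-based clustering would otherwise be dominated by the largest variable. To prevent this, each column is standardized to mean 0 and standard deviation 1, and the result is saved for future use.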
#Scales/standardizes the data.
USdata <- scale(USArrests)
#Views the scaled data.
head(USdata)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
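To make the standardization step concrete, here is a minimal sketch that reproduces it by hand: each column has its mean subtracted and is then divided by its standard deviation. The names `col_means`, `col_sds`, and `manual_scaled` are my own and only serve this illustration.

```r
data("USArrests")

# Column means and standard deviations of the raw data.
col_means <- colMeans(USArrests)
col_sds <- apply(USArrests, 2, sd)

# Subtract each column's mean, then divide by each column's standard deviation.
centered <- sweep(as.matrix(USArrests), 2, col_means, "-")
manual_scaled <- sweep(centered, 2, col_sds, "/")

# Compare with scale(); attributes differ, so only the values are compared.
scaled <- scale(USArrests)
all.equal(manual_scaled, scaled, check.attributes = FALSE)

# Each scaled column should have a mean of (numerically) 0 and a standard deviation of 1.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

If the comparison returns TRUE, the hand-computed matrix and the output of `scale` agree, which is a quick way to confirm what the function does.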
Now that the data is set, the clustering analysis can begin.
Hierarchical clustering, as implemented by hclust, is agglomerative: it starts with every data point as its own cluster and merges the two closest clusters at each iteration until a single cluster remains. In the end, a dendrogram is created, which shows the clustering in a tree-based representation. There are different methods of measuring the distance between clusters. I will use average linkage and complete linkage clustering. Average linkage uses the average of all pairwise distances between the members of two clusters, while complete linkage uses the largest pairwise distance.
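The difference between the two linkage rules can be sketched on a small example. The two groups of states below are chosen arbitrarily for illustration and are not clusters produced by the analysis; `group_a`, `group_b`, and `between` are names I made up for this sketch.

```r
data("USArrests")
USdata <- scale(USArrests)

# Two small illustrative groups of states (an arbitrary choice for this example).
group_a <- c("Alabama", "Georgia")
group_b <- c("Iowa", "Maine", "Vermont")

# Euclidean distances between every member of group A and every member of group B.
d <- as.matrix(dist(USdata))
between <- d[group_a, group_b]

complete_link <- max(between)   # complete linkage: the farthest pair
average_link <- mean(between)   # average linkage: the mean over all pairs
c(complete = complete_link, average = average_link)
```

Because complete linkage takes the maximum of the same pairwise distances that average linkage averages, its value can never be smaller than the average linkage value.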
The hierarchical clustering models are run:
#Complete Linkage Clustering Method
hcluster_com <- hclust(dist(USdata), method = "complete")
plot(hcluster_com, main = "Complete Linkage Dendrogram")
#Average Linkage Clustering Method
hcluster_ave <- hclust(dist(USdata), method = "average")
plot(hcluster_ave, main = "Average Linkage Dendrogram")
A border is drawn around each of the 4 cluster groups in both the complete linkage and average linkage dendrograms.
#Complete Linkage Clustering Method with borders around each cluster.
plot(hcluster_com, cex = 0.6, main = "Complete Linkage Dendrogram")
rect.hclust(hcluster_com, k = 4, border = 1:4)
#Average Linkage Clustering Method with borders around each cluster.
plot(hcluster_ave, cex = 0.6, main = "Average Linkage Dendrogram")
rect.hclust(hcluster_ave, k = 4, border = 1:4)
From the dendrograms, 4 clusters appeared to be the best choice for the groupings. Another way to determine the optimal number of clusters is the elbow method, which plots the total within-cluster sum of squares against the number of clusters. The optimal number is where the curve bends (the elbow), since adding more clusters beyond that point yields only small improvements.
#Creates a plot showing the elbow method for determining the optimal number of clusters.
fviz_nbclust(USdata, FUN = hcut, method = "wss")
From the graph, it appears that 4 clusters is a reasonable choice for the hierarchical clustering model.
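As far as I know, hcut uses Ward linkage by default unless told otherwise, so the elbow plot may not describe the complete linkage model exactly. The following sketch computes the same quantity, the total within-cluster sum of squares, directly from the complete linkage tree using base R only. The helper `total_wss` is a hypothetical function written for this illustration.

```r
data("USArrests")
USdata <- scale(USArrests)
hcluster_com <- hclust(dist(USdata), method = "complete")

# Total within-cluster sum of squares for a given vector of cluster labels:
# within each cluster, center the columns and sum the squared deviations.
total_wss <- function(x, labels) {
  sum(sapply(split(as.data.frame(x), labels), function(grp) {
    sum(scale(grp, center = TRUE, scale = FALSE)^2)
  }))
}

# Cut the complete linkage tree into 1 to 10 clusters and record the total for each.
ks <- 1:10
wss <- sapply(ks, function(k) total_wss(USdata, cutree(hcluster_com, k = k)))

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Since cutting a tree into more groups only ever splits existing clusters, the curve cannot rise as k increases; the elbow is the point where it flattens out.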
To see the number of observations in each cluster, the cutree function is used.
#Identifies the sub-group for each observation among 4 clusters.
clust_grp <- cutree(hcluster_com, k = 4)
#Creates a table to show the number of observations in each of the 4 clusters.
table(clust_grp)
## clust_grp
## 1 2 3 4
## 8 11 21 10
The cutree function also assigns each observation to its cluster. From this, a cluster column can be added to the original table to display each state’s cluster group.
#Mutates the original table to display the cluster group for each state.
USArrests %>% mutate(cluster = clust_grp)%>%head()
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 1
## Arizona 8.1 294 80 31.0 2
## Arkansas 8.8 190 50 19.5 3
## California 9.0 276 91 40.6 2
## Colorado 7.9 204 78 38.7 2
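Once each state carries a cluster label, the clusters can be described by their average values. The sketch below uses base R's `aggregate` on the original, unscaled data so the means stay in their natural units; `cluster_profile` is a name I chose for this example.

```r
data("USArrests")
hcluster_com <- hclust(dist(scale(USArrests)), method = "complete")
clust_grp <- cutree(hcluster_com, k = 4)

# Mean of each original (unscaled) variable within each of the 4 clusters.
cluster_profile <- aggregate(USArrests, by = list(cluster = clust_grp), FUN = mean)
cluster_profile
```

Each row of the result summarizes one cluster, which makes it easier to say what distinguishes the groups than reading the dendrogram alone.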
Another form of clustering is k-means clustering. K-means clustering starts from k initial centroids, assigns each data point to its nearest centroid, recomputes each centroid as the mean of its assigned points, and repeats until the assignments stop changing.
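The assign-and-update loop can be written out in a few lines. This is only a sketch of the basic (Lloyd-style) iteration under simplifying assumptions: a single random start, an arbitrary seed, and no handling of empty clusters. The kmeans function itself uses a more refined algorithm (Hartigan-Wong) by default, so its results may differ.

```r
data("USArrests")
USdata <- scale(USArrests)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
k <- 2

# Start from k randomly chosen states as the initial centroids.
centers <- USdata[sample(nrow(USdata), k), , drop = FALSE]

for (i in 1:50) {
  # Assignment step: squared distance from every state to every centroid,
  # then pick the nearest centroid for each state.
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(USdata, 2, centers[j, ])^2))
  labels <- max.col(-d2, ties.method = "first")

  # Update step: each centroid becomes the mean of the states assigned to it.
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(USdata[labels == j, , drop = FALSE])
  }))

  # Stop once the centroids no longer move.
  if (all(abs(new_centers - centers) < 1e-12)) break
  centers <- new_centers
}

table(labels)
```

A single random start can settle on a poorer solution, which is why running kmeans with several starts and keeping the best one is the safer choice in practice.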
A k-means cluster model is run for grouping with 2, 3, 4, and 5 clusters:
#K-means clustering with 2, 3, 4 and 5 centers
kmeans.cluster_2 <- kmeans(USdata, centers = 2, nstart = 20)
kmeans.cluster_2
## K-means clustering with 2 clusters of sizes 20, 30
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.004934 1.0138274 0.1975853 0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 46.74796 56.11445
## (between_SS / total_SS = 47.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
kmeans.cluster_3 <- kmeans(USdata, centers = 3, nstart = 20)
kmeans.cluster_3
## K-means clustering with 3 clusters of sizes 13, 20, 17
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.9667633
## 2 1.0049340 1.0138274 0.1975853 0.8469650
## 3 -0.4469795 -0.3465138 0.4788049 -0.2571398
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 1 2 3 1
## Kansas Kentucky Louisiana Maine Maryland
## 3 1 2 1 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 1 2 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 2 1 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 2 1 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 2 3 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 11.95246 46.74796 19.62285
## (between_SS / total_SS = 60.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
kmeans.cluster_4 <- kmeans(USdata, centers = 4, nstart = 20)
kmeans.cluster_4
## K-means clustering with 4 clusters of sizes 13, 16, 8, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 3 1.4118898 0.8743346 -0.8145211 0.01927104
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 3 4 4 3 4
## Colorado Connecticut Delaware Florida Georgia
## 4 2 2 4 3
## Hawaii Idaho Illinois Indiana Iowa
## 2 1 4 2 1
## Kansas Kentucky Louisiana Maine Maryland
## 2 1 3 1 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 4 1 3 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 4 1 2
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 3 1 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 3
## South Dakota Tennessee Texas Utah Vermont
## 1 3 4 2 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 11.952463 16.212213 8.316061 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
kmeans.cluster_5 <- kmeans(USdata, centers = 5, nstart = 20)
kmeans.cluster_5
## K-means clustering with 5 clusters of sizes 11, 7, 12, 10, 10
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 -0.1642225 -0.3658283 -0.2822467 -0.11697538
## 2 1.5803956 0.9662584 -0.7775109 0.04844071
## 3 0.7298036 1.1188219 0.7571799 1.32135653
## 4 -1.1727674 -1.2078573 -1.0045069 -1.10202608
## 5 -0.6286291 -0.4086988 0.9506200 -0.38883734
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 3 3 1 3
## Colorado Connecticut Delaware Florida Georgia
## 3 5 5 3 2
## Hawaii Idaho Illinois Indiana Iowa
## 5 4 3 1 4
## Kansas Kentucky Louisiana Maine Maryland
## 1 1 2 4 3
## Massachusetts Michigan Minnesota Mississippi Missouri
## 5 3 4 2 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 3 4 5
## New Mexico New York North Carolina North Dakota Ohio
## 3 3 2 4 5
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 5 5 2
## South Dakota Tennessee Texas Utah Vermont
## 4 2 3 5 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 5 4 4 1
##
## Within cluster sum of squares by cluster:
## [1] 7.788275 6.128432 18.257332 7.443899 9.326266
## (between_SS / total_SS = 75.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#View the cluster identification for each observation
kmeans.cluster_2$cluster
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
kmeans.cluster_3$cluster
## Alabama Alaska Arizona Arkansas California
## 2 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 1 2 3 1
## Kansas Kentucky Louisiana Maine Maryland
## 3 1 2 1 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 1 2 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 2 1 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 2 1 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 2 3 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 1 1 3
kmeans.cluster_4$cluster
## Alabama Alaska Arizona Arkansas California
## 3 4 4 3 4
## Colorado Connecticut Delaware Florida Georgia
## 4 2 2 4 3
## Hawaii Idaho Illinois Indiana Iowa
## 2 1 4 2 1
## Kansas Kentucky Louisiana Maine Maryland
## 2 1 3 1 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 4 1 3 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 4 1 2
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 3 1 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 3
## South Dakota Tennessee Texas Utah Vermont
## 1 3 4 2 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 1 1 2
kmeans.cluster_5$cluster
## Alabama Alaska Arizona Arkansas California
## 2 3 3 1 3
## Colorado Connecticut Delaware Florida Georgia
## 3 5 5 3 2
## Hawaii Idaho Illinois Indiana Iowa
## 5 4 3 1 4
## Kansas Kentucky Louisiana Maine Maryland
## 1 1 2 4 3
## Massachusetts Michigan Minnesota Mississippi Missouri
## 5 3 4 2 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 3 4 5
## New Mexico New York North Carolina North Dakota Ohio
## 3 3 2 4 5
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 5 5 2
## South Dakota Tennessee Texas Utah Vermont
## 4 2 3 5 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 5 4 4 1
#Creates cluster plot for 2, 3, 4, and 5 centers.
fviz_cluster(kmeans.cluster_2, data = USdata)
fviz_cluster(kmeans.cluster_3, data = USdata)
fviz_cluster(kmeans.cluster_4, data = USdata)
fviz_cluster(kmeans.cluster_5, data = USdata)
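The printed summaries report between_SS / total_SS for each fit, which is the share of the total variation explained by the clustering. As a small sketch, the same ratio can be pulled from the `betweenss` and `totss` components listed under "Available components" and collected side by side. The seed is an arbitrary choice of mine, and `ss_ratio` is a hypothetical name.

```r
data("USArrests")
USdata <- scale(USArrests)

set.seed(123)  # arbitrary seed so the random starts are repeatable

# Share of total variation explained by the clustering, for 2 to 5 centers.
ss_ratio <- sapply(2:5, function(k) {
  km <- kmeans(USdata, centers = k, nstart = 20)
  km$betweenss / km$totss
})
names(ss_ratio) <- 2:5
round(100 * ss_ratio, 1)
```

This ratio tends to grow as more centers are added, so a high value alone does not identify the best number of clusters; it is the size of each gain that matters.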
From the summaries and visuals, it’s hard to determine the optimal number of clusters to use. Two different plots can be used to determine the optimal number of clusters: the silhouette method and the elbow method.
#Creates plots showing the silhouette and elbow methods for determining the optimal number of centers.
fviz_nbclust(USdata, FUN = kmeans, method = "silhouette")
fviz_nbclust(USdata, FUN = kmeans, method = "wss")
From the methods plotted, it appears the optimal number of clusters is 2. This differs from the hierarchical clustering method, which had 4 as the optimal number of clusters.
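For readers who want the numbers behind the silhouette plot, the average silhouette width can be computed with the `silhouette` function from the cluster package, which normally ships with R as a recommended package. This is a sketch under my own assumptions: an arbitrary seed and a range of 2 to 6 centers.

```r
library(cluster)  # provides silhouette(); usually installed alongside R

data("USArrests")
USdata <- scale(USArrests)
d <- dist(USdata)

set.seed(123)  # arbitrary seed so the random starts are repeatable

# Average silhouette width for k-means fits with 2 to 6 centers.
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(USdata, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

The number of centers with the largest average width is the one the silhouette method favors; values closer to 1 indicate observations that sit well inside their own cluster.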
This R Markdown session demonstrated hierarchical clustering and k-means clustering using the built-in “USArrests” data set. For hierarchical clustering, the dendrograms and the elbow plot suggested 4 clusters, while the k-means plots suggested 2. It is up to the user to determine the number of clusters to use. I would use 4 clusters: in the k-means summaries, between_SS / total_SS rises from 47.5 % with 2 clusters to 71.2 % with 4, which, together with the cluster plot, favors the finer grouping.
Thank you for viewing this R Markdown session.