In this R Markdown session, I use the built-in “USArrests” data set to perform hierarchical and k-means clustering. Clustering groups observations that share similar characteristics and helps reveal trends and patterns within the data.

About the Data

The “USArrests” data set ships with base R. It contains arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of each state’s population living in urban areas. Because the data is from 1973, it says little about today’s trends and is used here as practice material for statistical analysis.

The data contains the following variables:

Murder: Murder arrests (per 100,000)
Assault: Assault arrests (per 100,000)
UrbanPop: Percent urban population
Rape: Rape arrests (per 100,000)

The data can be loaded and viewed by the following code:

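#Loads the packages used below: dplyr for %>% and mutate(), factoextra for fviz_nbclust() and fviz_cluster().
library(dplyr)
library(factoextra)
#Loads built-in "USArrests" data set.
data("USArrests")
#View Data.
head(USArrests)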

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Data Cleaning

Before beginning the cluster analysis, the structure of the data is inspected and checked for missing values or duplicate rows that could affect the analysis.

#Column names.
colnames(USArrests)

## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

#Number of Columns.
ncol(USArrests)

## [1] 4

#Number of Rows.
nrow(USArrests)

## [1] 50

#Check if there are missing values.
sum(is.na(USArrests))

## [1] 0

#Flags each row as TRUE if it is an exact duplicate of an earlier row, and FALSE otherwise.
duplicates <- USArrests%>%duplicated()
#Counts the flags in a table. Rows that are not duplicates are counted under 'FALSE'; duplicate rows are counted under 'TRUE'.
duplicates_count <- duplicates%>%table()
duplicates_count

## .
## FALSE 
##    50

The data appears clean, with no missing values or duplicates.

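The variables are measured on very different scales (Assault runs into the hundreds, while Murder stays below 20), so Assault would dominate any distance calculation on the raw values. To avoid this, each column is standardized to mean 0 and standard deviation 1, and the result is saved for future use.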

#Scales/standardizes the data.
USdata <- scale(USArrests)
#Views the scaled data.
head(USdata)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
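
As a quick check on what scale() does, the sketch below recomputes the standardization by hand in base R: each column has its mean subtracted and is then divided by its standard deviation. The names X and manual are introduced here only for illustration.

```r
# Load the data and standardize it, as in the text above.
data("USArrests")
USdata <- scale(USArrests)

# Manual z-scores: subtract each column mean, then divide by each column sd.
X <- as.matrix(USArrests)
manual <- sweep(X, 2, colMeans(X), "-")
manual <- sweep(manual, 2, apply(X, 2, sd), "/")

# The two results should agree up to floating-point error.
all.equal(manual, USdata, check.attributes = FALSE)

# After scaling, every column has mean (approximately) 0 and standard deviation 1.
round(colMeans(USdata), 10)
apply(USdata, 2, sd)
```

Because every column now has the same spread, each variable contributes comparably to the Euclidean distances used by the clustering methods below.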

Now that the data is set, the clustering analysis can begin.

Hierarchical Clustering

Hierarchical clustering, as performed by R’s hclust function, is agglomerative: it starts with every data point in its own cluster and merges the two closest clusters at each iteration until a single cluster remains. In the end, a dendrogram is created, which shows the clustering in a tree-based representation. There are different methods (linkages) for measuring the distance between two clusters. I will use average linkage and complete linkage clustering. Average linkage uses the average of all pairwise distances between the points of two clusters, while complete linkage uses the largest pairwise distance.
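
To make the difference between the two linkages concrete, here is a small worked example on made-up one-dimensional points (the values in cluster_a and cluster_b are hypothetical and have nothing to do with USArrests). It computes both linkage distances by hand and confirms them against the height of the final merge reported by hclust.

```r
# Two tiny, hypothetical clusters of 1-D points (illustrative values only).
cluster_a <- c(1, 2)
cluster_b <- c(6, 9)

# All pairwise distances between a point in A and a point in B.
pairwise <- abs(outer(cluster_a, cluster_b, "-"))

# Complete linkage: the largest pairwise distance.
complete_dist <- max(pairwise)   # 8

# Average linkage: the mean of all pairwise distances.
average_dist <- mean(pairwise)   # 6

# hclust reports the same values as the height of its last merge.
pts <- c(cluster_a, cluster_b)
tail(hclust(dist(pts), method = "complete")$height, 1)  # 8
tail(hclust(dist(pts), method = "average")$height, 1)   # 6
```

Complete linkage is driven by the two most distant points, so it tends to produce compact clusters, while average linkage is less sensitive to a single far-away point.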

The hierarchical clustering models are run:

#Complete Linkage Clustering Method
hcluster_com <- hclust(dist(USdata), method = "complete")
plot(hcluster_com, main = "Complete Linkage Dendrogram")

#Average Linkage Clustering Method
hcluster_ave <- hclust(dist(USdata), method = "average")
plot(hcluster_ave, main = "Average Linkage Dendrogram")

A border is placed around each cluster group for both the average linkage and complete linkage clusterings.

#Complete Linkage Clustering Method with borders around each cluster.
plot(hcluster_com, cex = 0.6, main = "Complete Linkage Dendrogram")
rect.hclust(hcluster_com, k = 4, border = 1:4)

#Average Linkage Clustering Method with borders around each cluster.
plot(hcluster_ave, cex = 0.6, main = "Average Linkage Dendrogram")
rect.hclust(hcluster_ave, k = 4, border = 1:4)
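
The dendrograms show the two linkages visually. A cross-tabulation is one way to see how far they agree numerically. The sketch below uses base R only; grp_com and grp_ave are names introduced here for illustration. Cluster numbers are arbitrary labels, so agreement shows up as large counts concentrated in a few cells rather than on the diagonal.

```r
# Recreate the scaled data and the distance matrix.
data("USArrests")
USdata <- scale(USArrests)
d <- dist(USdata)

# Cut both trees into 4 groups.
grp_com <- cutree(hclust(d, method = "complete"), k = 4)
grp_ave <- cutree(hclust(d, method = "average"), k = 4)

# Rows: complete-linkage cluster; columns: average-linkage cluster.
agreement <- table(complete = grp_com, average = grp_ave)
agreement
```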

From the dendrograms, 4 clusters appears to be a reasonable choice for the groupings. Another way to determine the optimal number of clusters is the elbow method, which plots the total within-cluster sum of squares against the number of clusters. The optimal number of clusters is where the elbow (bend) of the curve is located, beyond which adding clusters yields little further reduction.

#Creates an elbow plot to help determine the optimal number of clusters.
fviz_nbclust(USdata, FUN = hcut, method = "wss")

From the graph, it appears that 4 is the optimal number of clusters for the hierarchical clustering model.
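
For readers who want to see what the elbow plot is built from, the following base R sketch computes the total within-cluster sum of squares by hand for 1 to 10 clusters cut from the complete linkage tree. The helper total_wss is a hypothetical name introduced here. Note the assumption: hcut inside fviz_nbclust may use a different linkage by default, so this curve need not match the factoextra plot exactly.

```r
# Recreate the scaled data and the complete linkage tree.
data("USArrests")
USdata <- scale(USArrests)
hc <- hclust(dist(USdata), method = "complete")

# Total within-cluster sum of squares for a given cluster assignment:
# for each cluster, center its rows on the cluster mean and sum the squares.
total_wss <- function(x, groups) {
  sum(sapply(split(as.data.frame(x), groups), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

# Evaluate k = 1, ..., 10 and plot the curve.
k_values <- 1:10
wss <- sapply(k_values, function(k) total_wss(USdata, cutree(hc, k = k)))
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve can only fall as k grows, because each cut splits an existing cluster. The elbow is the point where the drop flattens out.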

To see the number of observations in each cluster, the cutree function is used.

#Identifies the sub-group for each observation among 4 clusters.
clust_grp <- cutree(hcluster_com, k = 4)
#Creates a table to show the number of observations in each of the 4 clusters.
table(clust_grp)

## clust_grp
##  1  2  3  4 
##  8 11 21 10

The cutree function also assigns each observation its cluster. With this, a cluster column can be added to the original table so that each state is displayed with its cluster group.

#Mutates the original table to display the cluster group for each state. 
USArrests %>% mutate(cluster = clust_grp)%>%head()

##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       1
## Alaska       10.0     263       48 44.5       1
## Arizona       8.1     294       80 31.0       2
## Arkansas      8.8     190       50 19.5       3
## California    9.0     276       91 40.6       2
## Colorado      7.9     204       78 38.7       2
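
Once each state carries a cluster label, a natural next step is to profile the clusters. The sketch below is a base R alternative to the dplyr pipeline above: it attaches the labels with cbind and averages the original (unscaled) variables per cluster with aggregate. The names labelled and profile are introduced here for illustration.

```r
# Recreate the 4-cluster assignment from the complete linkage tree.
data("USArrests")
USdata <- scale(USArrests)
clust_grp <- cutree(hclust(dist(USdata), method = "complete"), k = 4)

# Attach the cluster label to the original table (base R equivalent of mutate).
labelled <- cbind(USArrests, cluster = clust_grp)
head(labelled)

# Mean of each original variable within each cluster.
profile <- aggregate(USArrests, by = list(cluster = clust_grp), FUN = mean)
profile
```

Reading the rows of profile side by side shows which clusters combine high arrest rates with high or low urban population.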

K-Means Clustering

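Another form of clustering is k-means clustering. K-means clustering starts from k centroids, allocates each data point to the nearest centroid, then recomputes each centroid as the mean of its assigned points, repeating these two steps until the assignments stop changing.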
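
The assign-and-update idea can be sketched in a few lines of base R. This is a minimal illustration of the classic Lloyd iteration with k = 2, not what kmeans() runs internally (its default is the Hartigan-Wong algorithm). The seed, the iteration cap of 50, and the variable names are all assumptions made for this sketch.

```r
# Scaled data, as used throughout this session.
data("USArrests")
USdata <- scale(USArrests)

set.seed(1)  # assumed seed, only so the sketch is reproducible
k <- 2

# Start from k randomly chosen states as the initial centroids.
centers <- USdata[sample(nrow(USdata), k), , drop = FALSE]

for (iter in 1:50) {
  # Assignment step: squared Euclidean distance from every state to every centroid,
  # then pick the nearest centroid for each state.
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(USdata, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")

  # Update step: each centroid becomes the mean of the states assigned to it.
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(USdata[cluster == j, , drop = FALSE])
  }))

  # Stop once the centroids no longer move.
  if (max(abs(new_centers - centers)) < 1e-12) break
  centers <- new_centers
}

table(cluster)
```

A single run like this can stop in a poor local optimum, which is why the kmeans() calls below use nstart = 20 to try several random starts.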

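K-means models are fitted with 2, 3, 4, and 5 clusters. The nstart = 20 argument runs each model from 20 random starting configurations and keeps the best result. Because the starts are random, the cluster numbering can differ from one run to the next: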

#K-means clustering with 2, 3, 4 and 5 centers
kmeans.cluster_2 <- kmeans(USdata, centers = 2, nstart = 20)
kmeans.cluster_2

## K-means clustering with 2 clusters of sizes 20, 30
## 
## Cluster means:
##      Murder    Assault   UrbanPop       Rape
## 1  1.004934  1.0138274  0.1975853  0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## 
## Within cluster sum of squares by cluster:
## [1] 46.74796 56.11445
##  (between_SS / total_SS =  47.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

kmeans.cluster_3 <- kmeans(USdata, centers = 3, nstart = 20)
kmeans.cluster_3

## K-means clustering with 3 clusters of sizes 13, 20, 17
## 
## Cluster means:
##       Murder    Assault   UrbanPop       Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.9667633
## 2  1.0049340  1.0138274  0.1975853  0.8469650
## 3 -0.4469795 -0.3465138  0.4788049 -0.2571398
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              1              2              3              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              1              2              1              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              1              2              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              2              1              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              2              1              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              2              2              3              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              1              1              3 
## 
## Within cluster sum of squares by cluster:
## [1] 11.95246 46.74796 19.62285
##  (between_SS / total_SS =  60.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

kmeans.cluster_4 <- kmeans(USdata, centers = 4, nstart = 20)
kmeans.cluster_4

## K-means clustering with 4 clusters of sizes 13, 16, 8, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 2 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 3  1.4118898  0.8743346 -0.8145211  0.01927104
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              3              4              4              3              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              3 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              4              2              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              3              1              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              1              3              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              4              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              3              1              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              3 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              3              4              2              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              1              2 
## 
## Within cluster sum of squares by cluster:
## [1] 11.952463 16.212213  8.316061 19.922437
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

kmeans.cluster_5 <- kmeans(USdata, centers = 5, nstart = 20)
kmeans.cluster_5

## K-means clustering with 5 clusters of sizes 11, 7, 12, 10, 10
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1 -0.1642225 -0.3658283 -0.2822467 -0.11697538
## 2  1.5803956  0.9662584 -0.7775109  0.04844071
## 3  0.7298036  1.1188219  0.7571799  1.32135653
## 4 -1.1727674 -1.2078573 -1.0045069 -1.10202608
## 5 -0.6286291 -0.4086988  0.9506200 -0.38883734
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              3              3              1              3 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              3              5              5              3              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              5              4              3              1              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              1              1              2              4              3 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              5              3              4              2              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              3              4              5 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              3              3              2              4              5 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              1              5              5              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              2              3              5              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              1              5              4              4              1 
## 
## Within cluster sum of squares by cluster:
## [1]  7.788275  6.128432 18.257332  7.443899  9.326266
##  (between_SS / total_SS =  75.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

#View the cluster identification for each observation
kmeans.cluster_2$cluster

##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2

kmeans.cluster_3$cluster

##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              2              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              1              2              3              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              1              2              1              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              1              2              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              2              1              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              2              1              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              2              2              3              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              1              1              3

kmeans.cluster_4$cluster

##        Alabama         Alaska        Arizona       Arkansas     California 
##              3              4              4              3              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              3 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              1              4              2              1 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              1              3              1              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              1              3              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              4              1              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              3              1              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              3 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              1              3              4              2              1 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              1              1              2

kmeans.cluster_5$cluster

##        Alabama         Alaska        Arizona       Arkansas     California 
##              2              3              3              1              3 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              3              5              5              3              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              5              4              3              1              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              1              1              2              4              3 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              5              3              4              2              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              1              1              3              4              5 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              3              3              2              4              5 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              1              5              5              2 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              2              3              5              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              1              5              4              4              1

#Creates cluster plot for 2, 3, 4, and 5 centers. 
fviz_cluster(kmeans.cluster_2, data = USdata)

fviz_cluster(kmeans.cluster_3, data = USdata)

fviz_cluster(kmeans.cluster_4, data = USdata)

fviz_cluster(kmeans.cluster_5, data = USdata)
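
The four printed summaries are long, so it can help to put the key numbers side by side. The sketch below refits the models in a loop and collects the total within-cluster sum of squares and the between_SS / total_SS ratio for each k. The seed and the names fits and summary_tbl are assumptions made for this illustration.

```r
# Scaled data, as used throughout this session.
data("USArrests")
USdata <- scale(USArrests)

set.seed(123)  # assumed seed, only so the sketch is reproducible
k_values <- 2:5

# Fit one k-means model per value of k.
fits <- lapply(k_values, function(k) kmeans(USdata, centers = k, nstart = 20))

# Collect the headline numbers from each fit.
summary_tbl <- data.frame(
  k = k_values,
  tot_withinss = sapply(fits, function(f) f$tot.withinss),
  between_over_total = sapply(fits, function(f) f$betweenss / f$totss)
)
summary_tbl
```

The between_over_total column corresponds to the "between_SS / total_SS" line in the printed output of each model.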

From the summaries and visuals, it’s hard to determine the optimal number of clusters to use. Two different plots can help: the silhouette method, which looks for the number of clusters with the highest average silhouette width, and the elbow method, which looks for the bend in the within-cluster sum of squares curve.

#Creates silhouette and elbow plots to help determine the optimal number of clusters.
fviz_nbclust(USdata, FUN = kmeans, method = "silhouette")

fviz_nbclust(USdata, FUN = kmeans, method = "wss")

From the methods plotted, it appears the optimal number of clusters is 2. This differs from the hierarchical clustering method, which had 4 as the optimal number of clusters.
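
The silhouette criterion can also be computed directly. The sketch below assumes the cluster package is available (it is a recommended package bundled with most R installations) and uses its silhouette function to get the average silhouette width for k = 2 to 6. The seed and the name avg_sil are assumptions for this illustration, and the values need not match the factoextra plot exactly.

```r
library(cluster)  # provides silhouette()

# Scaled data and pairwise distances.
data("USArrests")
USdata <- scale(USArrests)
d <- dist(USdata)

set.seed(123)  # assumed seed, only so the sketch is reproducible
k_values <- 2:6

# Average silhouette width for each k: values closer to 1 mean better-separated clusters.
avg_sil <- sapply(k_values, function(k) {
  km <- kmeans(USdata, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- k_values
avg_sil

# The k with the highest average silhouette width.
k_values[which.max(avg_sil)]
```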

Conclusion

This R Markdown session demonstrated hierarchical clustering and k-means clustering using the built-in “USArrests” data set. For hierarchical clustering, the dendrograms and the elbow plot pointed to 4 clusters, while the silhouette and elbow plots for k-means pointed to 2. The number of clusters to use is ultimately up to the analyst. I would use 4 clusters, based on the sum of squares values (between_SS / total_SS rises from 47.5 % with 2 clusters to 71.2 % with 4) and the cluster plots.

Thank you for viewing this R Markdown session.

Clustering of US Arrests Data Set

Travis Gubbe

November 22, 2022
