A centroid-based clustering analysis of U.S. crime rates and urban population data supported two models, one with two clusters and one with four. When k=2, the clusters divide mostly between the North and the South. The Southern states, Cluster 1, had higher crime rates than the Northern states, Cluster 2. When k=4, the clusters separate by geographic region. Cluster 1 is primarily the northern states of the Midwest, Cluster 3 is mostly the Southeast, and Cluster 4 is mostly the Southwest. States in Cluster 2 are spread sporadically throughout the United States. Crime rates in Clusters 1 and 2 are closer to each other than to those in Clusters 3 and 4, which are higher. The cluster averages of crime rates and urban population from the two k-means models are used to classify states. This classification allows U.S. Government analysts to dedicate resources to help states combat crime. In the data, the groups of states are differentiated by their crime rates: each cluster had different levels of rape, assault, and murder. The guidelines for classifying states in the future are determined by the attribute averages of each cluster. Current crime rates within the states are affected by world events, new laws, and a myriad of other external factors. We recommend repeating this analysis on data from years after 1973, and every five years going forward, to validate the k-means model. We also recommend analyzing the states within each cluster in more detail, for example by examining how each state voted in the most recent election, its poverty rate, and its education levels. This additional information could reveal new trends in crime rates and the growth of urban populations.
Individual states have investigated possible causes of crime within their own borders as an aid to curbing it. U.S. Government analysts are interested in whether states share similarities in their crime rates. The objective of this analysis is to determine whether states can be grouped by their urban population and crime statistics. Such a grouping could inform legislative action aimed at lowering crime rates.
The critical variables measured for each state are the rape, murder, and assault rates and the urban population. The data were collected in 1973 from all fifty states. The crime rates are measured in arrests per 100,000 residents, and the urban population is measured as the percentage of the population living in urban areas. The dataset is built into R and named USArrests. There is one observation per state, for a total of fifty observations.
This analysis revolves around a k-means cluster analysis of the crime statistics and urban populations of the fifty states. The model uses the murder, assault, rape, and urban population statistics to calculate the Euclidean distance from each state to the cluster mean, or center of the cluster. These distances, together with the silhouette and gap statistics, aid in determining the number of clusters. Stability tests for the k-means model then assess how well each candidate number of clusters fits.
Before performing the cluster analysis, we calculated correlation coefficients, shown as ellipses in a correlation plot, which confirm positive linear relationships among the crime statistics and urban population. A missingness map shows that no data are missing from the dataset. A distance matrix shows the Euclidean distance between each pair of states, computed from the scaled statistics.
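To make the distance-to-center idea concrete, the sketch below fits a two-cluster k-means model in base R and recomputes the squared Euclidean distance from each state to its assigned center. It is only an illustration: the seed and the object names (`x`, `km`, `sq_dist`) are our own choices and are not part of the analysis that follows.

```r
# Standardize the four variables so each contributes equally to the distance
x <- scale(USArrests)

set.seed(123)  # arbitrary seed, chosen only so this sketch is repeatable
km <- kmeans(x, centers = 2, nstart = 25)

# Squared Euclidean distance from each state to its assigned cluster center
assigned_centers <- km$centers[km$cluster, , drop = FALSE]
sq_dist <- rowSums((x - assigned_centers)^2)

# Summing those distances within each cluster reproduces the
# within-cluster sum of squares that kmeans() reports
wss_by_hand <- as.numeric(tapply(sq_dist, km$cluster, sum))
all.equal(wss_by_hand, km$withinss)
```

The final comparison checks that the hand-computed sums match `km$withinss`, the quantity k-means tries to make small.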
library(cluster)
library(tidyverse)
library(factoextra)
library(ClusterR)
library(Amelia)
library(gridExtra)
library(clValid)
library(corrplot)
library(maps)
library(cowplot)
df_raw<-USArrests
res<-cor(df_raw)
corrplot(res,type="upper",method = "ellipse")
missmap(df_raw,col=c('yellow','black'),y.at=1,y.labels='',legend=TRUE)
df_scale<-as.data.frame(scale(df_raw))
# get_dist() uses Euclidean distance by default
distance <- get_dist(df_scale)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))+
ggtitle("Euclidean Distance Between States")
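The distance matrix plotted above comes from `get_dist()` called with its defaults, which to our knowledge means Euclidean distance. The base R sketch below shows what that computation amounts to and, for contrast, how a Kendall correlation distance would be formed instead. The object names and the choice of Alabama and Alaska as a spot check are ours.

```r
x <- scale(USArrests)

# Pairwise Euclidean distances between states on the scaled variables
d_mat <- as.matrix(dist(x, method = "euclidean"))

# Spot-check one pair by hand
by_hand <- sqrt(sum((x["Alabama", ] - x["Alaska", ])^2))
all.equal(unname(d_mat["Alabama", "Alaska"]), by_hand)

# A Kendall correlation distance is a different quantity:
# one minus the rank correlation between two states' profiles
kendall_mat <- 1 - cor(t(x), method = "kendall")
```

With only four variables per state, a rank correlation between two states is very coarse, which is one reason Euclidean distance is the more natural choice here.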
Histograms and box plots display the distribution of the data and identify possible outliers.
Below are histograms of the measurements for each state. The data are approximately normally distributed, with possible outliers.
df_scale %>%
gather(Attributes, value, 1:4) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_histogram(colour="black", bins = 15, show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="State Crime Statistics in 1973 - Histograms")+
theme_classic()
The following box plots identify outliers within each measurement. Only two outliers are identified, both in the rape statistics, and they lie relatively near the rest of the data. The identified outliers are retained since the dataset is small.
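The outliers drawn on a box plot follow the usual 1.5 × IQR rule, so the states behind them can be listed directly. The sketch below is one way to do this in base R with `boxplot.stats()`. It works on the raw data, which flags the same points as the scaled data because scaling is a linear transformation. The `outliers` object is our own name.

```r
# For each variable, list the states that fall outside the box plot whiskers
outliers <- lapply(USArrests, function(v) {
  out <- boxplot.stats(v)$out          # values beyond 1.5 * IQR from the hinges
  rownames(USArrests)[v %in% out]
})
outliers
```

Listing the states by name makes it easier to judge whether an outlier reflects a data problem or a genuinely unusual state.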
df_scale %>%
gather(Attributes, values) %>%
ggplot(aes(x=Attributes, y=values, fill=Attributes)) +
geom_boxplot(show.legend=FALSE) +
labs(x="Values", y=" ",
title="State Crime Statistics in 1973 - Boxplot")+
theme_bw() +
theme(axis.title.y=element_blank(),
axis.title.x=element_blank()) +
coord_flip()
Cluster analysis is an unsupervised method: it has no target variable and returns a model based only on the structure of the data. The main goal here is to understand the patterns in each state’s crime rates and urban population. The specific model analyzed is centroid-based clustering, which divides the data into a set number of clusters while minimizing the within-cluster sum of squared distances to each cluster center. The resulting clusters are then validated with four stability measures: the average proportion of non-overlap (APN), average distance (AD), average distance between means (ADM), and figure of merit (FOM).
We first determine the optimal k, or number of clusters, by calculating the silhouette and gap statistics for each candidate number of clusters. The silhouette measures how similar an observation is to its own cluster compared to the other clusters. In the first plot, k=2 and k=5 are local maxima. However, when plotted, the k=5 clusters overlap, so k=5 is not a good choice for the number of clusters.
opt <- Optimal_Clusters_KMeans(df_raw, max_clusters = 5, max_iters = 25,
plot_clusters = TRUE,criterion = 'silhouette')
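As a cross-check on the silhouette plot, the average silhouette width can also be computed with `silhouette()` from the cluster package. The sketch below is an illustration under our own assumptions: it clusters the scaled data, whereas the call above passes `df_raw`, and it uses an arbitrary seed, so the values need not match the plot exactly.

```r
library(cluster)  # provides silhouette()

x <- scale(USArrests)
d <- dist(x)      # Euclidean distances between states

set.seed(123)     # arbitrary seed for this sketch
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  # Average silhouette width across all fifty states for this k
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 2:5)
avg_sil
```

Values closer to 1 indicate states that sit well inside their own cluster, and values near 0 indicate states on the boundary between two clusters.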
We also calculate the gap statistic for each number of clusters. The gap statistic compares the within-cluster variation of the data with the variation expected under a reference distribution that has no cluster structure. The second plot shows the gap statistic for each k. The error bars on each point show the standard error of the gap estimate. The gap statistic is largest at k=4. We carry k=2, k=3, and k=4 forward as candidate numbers of clusters since none of their clusters overlap when plotted.
gap_stat <- clusGap(df_scale, FUN = kmeans, nstart = 25, K.max = 5, B = 50)
fviz_gap_stat(gap_stat,maxSE = list(method = "globalmax", SE.factor = 1))
Now we look at the stability of the clusters, using the APN, AD, ADM, and FOM. Each measure compares the clustering of the full data with the clustering obtained after removing one variable at a time. The APN measures the average proportion of observations not placed in the same cluster in both cases. AD measures the average distance between observations placed in the same cluster in both cases. ADM measures the average distance between cluster centers for observations placed in the same cluster in both cases. Finally, FOM measures the average within-cluster variance of the removed variable. Smaller values indicate more stable clusters. Two clusters are the most stable by APN and ADM, with values that are small and close to zero. Five clusters score best on AD and FOM, but when plotted the clusters overlap, so k=5 is not a good candidate for the number of clusters.
stabil<-clValid(df_scale,nClust=2:5,clMethods="kmeans",validation = "stability")
optimalScores(stabil)
## Score Method Clusters
## APN 0.05333333 kmeans 2
## AD 1.43110654 kmeans 5
## ADM 0.16520029 kmeans 2
## FOM 0.65343054 kmeans 5
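To show what the APN is measuring, the sketch below computes it by hand for k=2 in base R: it reclusters the data with each variable removed in turn and records, for every state, the share of its original cluster that no longer accompanies it. This is our own simplified reading of the measure, with an arbitrary seed, so it is not guaranteed to reproduce the clValid score above to the last digit.

```r
x <- scale(USArrests)
k <- 2

set.seed(123)  # arbitrary seed for this sketch
full <- kmeans(x, centers = k, nstart = 25)$cluster

# One entry per removed variable: mean proportion of non-overlap across states
non_overlap <- sapply(seq_len(ncol(x)), function(j) {
  reduced <- kmeans(x[, -j], centers = k, nstart = 25)$cluster
  per_state <- sapply(seq_len(nrow(x)), function(i) {
    same_full    <- full == full[i]        # state i's cluster on the full data
    same_reduced <- reduced == reduced[i]  # state i's cluster without variable j
    1 - sum(same_full & same_reduced) / sum(same_full)
  })
  mean(per_state)
})
names(non_overlap) <- colnames(x)

apn <- mean(non_overlap)
apn
```

Inspecting `non_overlap` by variable also shows which variable the clustering depends on most, since removing it moves the most states.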
Below are the cluster plots discussed in the previous paragraphs. When five clusters are chosen, clusters 3 and 4 overlap. There is no overlap for the other numbers of clusters.
# df_scale is already standardized, so it is passed to kmeans() directly
k2 <- kmeans(df_scale, centers = 2, nstart = 25)
p2 <- fviz_cluster(k2, geom = "point", data = df_scale, palette = "Set1",
main = "Cluster of States, k=2")
k3 <- kmeans(df_scale, centers = 3, nstart = 25)
p3 <- fviz_cluster(k3, geom = "point", data = df_scale, palette = "Set1",
main = "Cluster of States, k=3")
k4 <- kmeans(df_scale, centers = 4, nstart = 25)
p4 <- fviz_cluster(k4, geom = "point", data = df_scale, palette = "Set1",
main = "Cluster of States, k=4")
k5 <- kmeans(df_scale, centers = 5, nstart = 25)
p5 <- fviz_cluster(k5, geom = "point", data = df_scale, palette = "Set1",
main = "Cluster of States, k=5")
# Arrange the four cluster plots in a 2 x 2 grid
grid.arrange(p2, p3, p4, p5, ncol = 2)
Below are the cluster means for k=2 through k=4, the numbers of clusters that were most stable and showed no overlap between clusters when plotted.
USArrests %>%
mutate(Cluster = k2$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 2 x 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 12.2 255. 68.4 29.2
## 2 2 4.87 114. 63.6 15.9
USArrests %>%
mutate(Cluster = k3$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 3 x 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 12.2 255. 68.4 29.2
## 2 2 3.6 78.5 52.1 12.2
## 3 3 5.84 142. 72.5 18.8
USArrests %>%
mutate(Cluster = k4$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 4 x 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.6 78.5 52.1 12.2
## 2 2 5.66 139. 73.9 18.8
## 3 3 13.9 244. 53.8 21.4
## 4 4 10.8 257. 76 33.2
Below are maps of the United States in which each state is colored by its cluster. For k=2, the clusters divide mostly between the North and the South. The Southern states, Cluster 1, had higher crime rates than the Northern states, Cluster 2. For k=3, Cluster 1 is the same as Cluster 1 from k=2, and Cluster 2 from k=2 is split into two groups. Cluster 3 has slightly higher crime rates than Cluster 2. Finally, for k=4, the clusters separate by geographic region. Cluster 1 is primarily the northern states of the Midwest, Cluster 3 is mostly the Southeast, and Cluster 4 is mostly the Southwest. States in Cluster 2 are spread sporadically throughout the United States. Crime rates in Clusters 1 and 2 are close in value. Crime rates in Clusters 3 and 4 are also close in value, but higher than those in Clusters 1 and 2.
clst2<-as.data.frame(k2$cluster)
clst3<-as.data.frame(k3$cluster)
clst4<-as.data.frame(k4$cluster)
crimes2 <- data.frame(state = tolower(rownames(clst2)),clst2)
colnames(crimes2)<-c("state","2 Clusters")
crimesm2 <- tidyr::gather(crimes2, variable, value, -state)
crimes3 <- data.frame(state = tolower(rownames(clst3)),clst3)
colnames(crimes3)<-c("state","3 Clusters")
crimesm3 <- tidyr::gather(crimes3, variable, value, -state)
crimes4 <- data.frame(state = tolower(rownames(clst4)),clst4)
colnames(crimes4)<-c("state","4 Clusters")
crimesm4 <- tidyr::gather(crimes4, variable, value, -state)
states_map <- map_data("state")
g2<-ggplot(crimesm2, aes(map_id = state)) +
geom_map(aes(fill = factor(value)), map = states_map) +
scale_fill_brewer(palette="Set1")+
expand_limits(x = states_map$long, y = states_map$lat) +
borders("state",colour="black")+
labs(fill="Cluster",title="Cluster for Each State, k=2")+
theme_void()
g3<-ggplot(crimesm3, aes(map_id = state)) +
geom_map(aes(fill = factor(value)), map = states_map) +
scale_fill_brewer(palette="Set1")+
expand_limits(x = states_map$long, y = states_map$lat) +
borders("state",colour="black")+
labs(fill="Cluster",title="Cluster for Each State, k=3")+
theme_void()
g4<-ggplot(crimesm4, aes(map_id = state)) +
geom_map(aes(fill = factor(value)), map = states_map) +
scale_fill_brewer(palette="Set1")+
expand_limits(x = states_map$long, y = states_map$lat) +
borders("state",colour="black")+
labs(fill="Cluster",title="Cluster for Each State, k=4")+
theme_void()
plot_grid(g2,g3,g4,align = "v", nrow = 2, ncol=2)
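The way one solution splits into the next can be checked with a cross-tabulation of the cluster assignments. The sketch below refits k=2 and k=3 under our own names (`k2_fit`, `k3_fit`) and an arbitrary seed. Because k-means cluster numbers are arbitrary labels, the numbering may differ from the tables above even when the groups are the same.

```r
x <- scale(USArrests)

set.seed(123)  # arbitrary seed for this sketch
k2_fit <- kmeans(x, centers = 2, nstart = 25)
k3_fit <- kmeans(x, centers = 3, nstart = 25)

# Rows are k=2 clusters, columns are k=3 clusters, cells are counts of states
nesting <- table(k2 = k2_fit$cluster, k3 = k3_fit$cluster)
nesting
```

If a k=3 cluster sits entirely within one row, that cluster is nested inside the corresponding k=2 cluster. A column with counts in both rows means some states changed groups.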
Based on the stability analyses and the cluster plots for the various k values, the best models for analyzing U.S. crime statistics are the centroid-based k-means models with k=2 and k=4. The k=2 model was the most stable by APN and ADM and was also supported by the silhouette values. The k=4 model was preferred by the gap statistic. Below are the sizes of each cluster of states for k=2 and k=4.
clst2_name<-c("Cluster 1","Cluster 2")
cbind(clst2_name,k2$size)
## clst2_name
## [1,] "Cluster 1" "20"
## [2,] "Cluster 2" "30"
clst4_name<-c("Cluster 1","Cluster 2","Cluster 3","Cluster 4")
cbind(clst4_name,k4$size)
## clst4_name
## [1,] "Cluster 1" "13"
## [2,] "Cluster 2" "16"
## [3,] "Cluster 3" "8"
## [4,] "Cluster 4" "13"
The centroid-based clustering analysis of U.S. crime rates and urban population data supported two models, one with two clusters and one with four. When k=2, the clusters divide mostly between the North and the South. The Southern states, Cluster 1, had higher crime rates than the Northern states, Cluster 2. When k=4, the clusters separate by geographic region. Cluster 1 is primarily the northern states of the Midwest, Cluster 3 is mostly the Southeast, and Cluster 4 is mostly the Southwest. States in Cluster 2 are spread sporadically throughout the United States. Crime rates in Clusters 1 and 2 are closer to each other than to those in Clusters 3 and 4, which are higher. Below are the means for each cluster for k=2 and k=4.
USArrests %>%
mutate(Cluster = k2$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 2 x 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 12.2 255. 68.4 29.2
## 2 2 4.87 114. 63.6 15.9
USArrests %>%
mutate(Cluster = k4$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 4 x 5
## Cluster Murder Assault UrbanPop Rape
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 3.6 78.5 52.1 12.2
## 2 2 5.66 139. 73.9 18.8
## 3 3 13.9 244. 53.8 21.4
## 4 4 10.8 257. 76 33.2
The above crime rate and urban population averages for the two models are used to classify states. This classification allows U.S. Government analysts to dedicate resources to help states combat crime. In the data, the groups of states are differentiated by their crime rates: each cluster had different levels of rape, assault, and murder. The guidelines for classifying states in the future are determined by the attribute averages of each cluster.
Current crime rates within the states are affected by world events, new laws, and a myriad of other external factors. We recommend repeating this analysis on data from years after 1973, and every five years going forward, to validate the k-means model. We also recommend analyzing the states within each cluster in more detail, for example by examining how each state voted in the most recent election, its poverty rate, and its education levels. This additional information could reveal new trends in crime rates and the growth of urban populations.
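One way to apply the cluster averages as a classification rule is to assign a new observation to the nearest cluster center. The sketch below illustrates this for the k=4 model. The `new_state` values are invented purely for illustration, and the seed and object names are our own, so this is a sketch of the idea and not a validated procedure.

```r
x <- scale(USArrests)

set.seed(123)  # arbitrary seed for this sketch
km <- kmeans(x, centers = 4, nstart = 25)

# Hypothetical observation in the original units (values are made up)
new_state <- c(Murder = 8, Assault = 200, UrbanPop = 70, Rape = 25)

# Standardize it with the 1973 means and standard deviations
new_scaled <- (new_state - attr(x, "scaled:center")) / attr(x, "scaled:scale")

# Squared Euclidean distance from the new observation to each cluster center
d2 <- rowSums(sweep(km$centers, 2, new_scaled)^2)

# The nearest center gives the assigned cluster
assigned <- unname(which.min(d2))
assigned
```

The new observation is standardized with the 1973 means and standard deviations so that it is measured on the same scale as the centers it is compared against.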
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.1.1 maps_3.4.0 corrplot_0.92 clValid_0.7
## [5] gridExtra_2.3 Amelia_1.8.0 Rcpp_1.0.8 ClusterR_1.2.6
## [9] gtools_3.9.2 factoextra_1.0.7 forcats_0.5.1 stringr_1.4.0
## [13] dplyr_1.0.8 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [17] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1 cluster_2.1.2
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.2 lubridate_1.8.0 RColorBrewer_1.1-2 httr_1.4.2
## [5] tools_4.1.1 backports_1.4.1 bslib_0.3.1 utf8_1.2.2
## [9] R6_2.5.1 DBI_1.1.2 colorspace_2.0-3 withr_2.5.0
## [13] tidyselect_1.1.2 compiler_4.1.1 cli_3.2.0 rvest_1.0.2
## [17] xml2_1.3.3 labeling_0.4.2 sass_0.4.0 scales_1.1.1
## [21] digest_0.6.29 foreign_0.8-82 rmarkdown_2.12 pkgconfig_2.0.3
## [25] htmltools_0.5.2 dbplyr_2.1.1 fastmap_1.1.0 highr_0.9
## [29] rlang_1.0.2 readxl_1.3.1 rstudioapi_0.13 jquerylib_0.1.4
## [33] farver_2.1.0 generics_0.1.2 jsonlite_1.8.0 car_3.0-12
## [37] magrittr_2.0.2 munsell_0.5.0 fansi_1.0.2 abind_1.4-5
## [41] lifecycle_1.0.1 stringi_1.7.6 yaml_2.3.5 carData_3.0-5
## [45] plyr_1.8.6 grid_4.1.1 ggrepel_0.9.1 crayon_1.5.0
## [49] haven_2.4.3 hms_1.1.1 knitr_1.37 pillar_1.7.0
## [53] ggpubr_0.4.0 ggsignif_0.6.3 reshape2_1.4.4 reprex_2.0.1
## [57] glue_1.6.2 evaluate_0.15 modelr_0.1.8 vctrs_0.3.8
## [61] tzdb_0.2.0 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [65] xfun_0.30 broom_0.7.12 rstatix_0.7.0 class_7.3-20
## [69] gmp_0.6-4 ellipsis_0.3.2