Abstract

This study employs unsupervised machine learning techniques to identify and validate patterns of violent crime across all 50 US states. Using Principal Component Analysis (PCA), K-means clustering, and hierarchical clustering, we examined whether states naturally form distinct groups based on their Murder, Assault, and Rape rates. Multiple validation metrics including the Hopkins statistic, Calinski-Harabasz index, silhouette analysis, and bootstrap stability testing confirmed the existence of four stable clusters with significantly different crime profiles. Additionally, I investigated the relationship between urbanization and violent crime through correlation and regression analyses. Findings reveal that while crime variables cluster states into meaningful groups, urbanization shows differential associations with crime types—strongly related to rape rates but weakly related to murder rates.

Keywords: Clustering, PCA, K-means, Violent Crime, Unsupervised Learning, Cluster Validation

1. Introduction

1.1 Background

Understanding patterns of violent crime is essential for policymakers, law enforcement agencies, and social scientists seeking to develop targeted interventions and allocate resources effectively. The United States exhibits substantial variation in crime rates across states, driven by complex interactions of socioeconomic, demographic, and geographic factors.

Traditional crime analysis often treats each indicator independently, but I aimed to leverage unsupervised learning methods to uncover latent structures and natural groupings within multidimensional crime data. By identifying states with similar crime profiles, we can better understand whether there are distinct “types” of crime patterns and what characteristics define them.

1.2 Dataset Description

This analysis uses the USArrests dataset, which contains statistics on violent crime rates per 100,000 residents for all 50 US states. The dataset includes four variables:

Murder: Murder arrests per 100,000 residents
Assault: Assault arrests per 100,000 residents
Rape: Rape arrests per 100,000 residents
UrbanPop: Percentage of population living in urban areas

1.3 Research Aim

The aim of this study is to identify underlying patterns of violent crime across US states, determine whether states form meaningful clusters based on crime characteristics, evaluate the stability and quality of these clusters, and assess whether urbanization is associated with violent crime levels.

2. Research Questions and Hypotheses

2.1 Research Questions

RQ1: What underlying factors explain variation in violent crime across US states?

RQ2: Do US states naturally form clusters based on violent crime characteristics?

RQ3: Are the identified clusters significantly different in terms of crime rates?

RQ4: Is urbanization (UrbanPop) related to violent crime rates?

RQ5: Do different clustering methods (K-means and hierarchical) agree in their classification of states?

2.2 Hypotheses

H1: Violent crime variables (Murder, Assault, Rape) can be reduced to one or two dominant dimensions representing overall crime intensity and crime pattern differences.

H2: US states form approximately four stable, distinct clusters based on their crime profiles.

H3: Cluster means for Murder, Assault, and Rape differ significantly, indicating distinct crime profiles.

H4: Urbanization is positively associated with at least one type of violent crime, but the strength of this relationship will vary across crime types.

H5: K-means and hierarchical clustering will show moderate-to-strong agreement when classifying states into clusters.

3. Methodology

3.1 Data Preparation

# Load required packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(
  tidyverse, factoextra, cluster, flexclust, fpc, NbClust,
  dendextend, ggdendro, pheatmap, hopkins, ClusterR,
  FeatureImpCluster, corrplot, RColorBrewer, fmsb
)

# Load dataset
df <- read.csv("US_violent_crime.csv", row.names = 1)

# Display basic information
cat("Dataset dimensions:", dim(df), "\n")

## Dataset dimensions: 50 4

cat("Variables:", colnames(df), "\n")

## Variables: Murder Assault UrbanPop Rape

head(df)

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

summary(df)

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

All crime variables (Murder, Assault, Rape) were standardized (mean = 0, standard deviation = 1) to ensure comparability, as they are measured on different scales.

crime_data_scaled <- scale(df[, c("Murder", "Assault", "Rape")])

3.2 Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality and identify underlying components explaining covariance in crime variables. This technique transforms the original correlated variables into uncorrelated principal components.

# Perform PCA
pca_result <- prcomp(crime_data_scaled, scale = FALSE)

# Variance explained
summary(pca_result)

## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.5358 0.6768 0.42822
## Proportion of Variance 0.7862 0.1527 0.06112
## Cumulative Proportion  0.7862 0.9389 1.00000

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 70), 
         title = "Scree Plot: Variance Explained by Principal Components")

# Component loadings
pca_loadings <- pca_result$rotation[, 1:2]
print("PCA Component Loadings:")

## [1] "PCA Component Loadings:"

print(round(pca_loadings, 3))

##            PC1    PC2
## Murder  -0.583 -0.534
## Assault -0.608 -0.214
## Rape    -0.539  0.818

The loadings indicate how each crime variable contributes to the principal components. PC1 represents overall crime intensity, while PC2 captures contrasts between crime types.

3.3 Determining Optimal Number of Clusters

3.3.1 Elbow Method

# Prepare PCA scores for clustering
df_pca <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  UrbanPop = df$UrbanPop,
  row.names = rownames(df)
)

X_clust <- df_pca[, c("PC1", "PC2")]

# Elbow method
fviz_nbclust(X_clust, kmeans, method = "wss", k.max = 10) +
  geom_vline(xintercept = 4, linetype = 2, color = "red") +
  labs(title = "Elbow Method for Optimal K")

The elbow method suggests that k=4 provides a good balance between model complexity and explained variance.

3.3.2 Silhouette Analysis

fviz_nbclust(X_clust, kmeans, method = "silhouette", k.max = 10) +
  labs(title = "Silhouette Method for Optimal K")

3.3.3 NbClust Consensus

# NbClust: 30 different indices
nb_result <- NbClust(crime_data_scaled, distance = "euclidean",
                     min.nc = 2, max.nc = 10,
                     method = "ward.D2", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 14 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

# Summary of recommendations
cat("\nNbClust Recommendations:\n")

## 
## NbClust Recommendations:

table(nb_result$Best.nc[1,])

## 
##  0  2  3  4  7  9 10 
##  2 14  2  4  2  1  1

fviz_nbclust(crime_data_scaled, kmeans, method = "wss") +
  labs(title = "NbClust: Optimal Number of Clusters (30 indices)")

The consensus across multiple indices strongly supports k=4 as the optimal number of clusters.

3.4 K-Means Clustering

# Perform K-means with k=4
set.seed(42)
k <- 4
kmeans_result <- kmeans(X_clust, centers = k, nstart = 25)

# Add cluster assignments
df_pca$Cluster <- as.factor(kmeans_result$cluster)

# Visualize clusters
fviz_cluster(kmeans_result, data = X_clust,
             ellipse.type = "convex",
             palette = "viridis",
             geom = "point",
             ggtheme = theme_minimal()) +
  labs(title = "K-Means Clustering (k=4) on PCA Components",
       x = "PC1 (Overall Crime Intensity)",
       y = "PC2 (Crime Pattern Contrast)")

The K-means algorithm successfully partitioned the 50 states into four distinct groups based on their positions in the reduced PCA space.

3.5 Hierarchical Clustering

# Distance matrix
dist_euclidean <- dist(crime_data_scaled, method = "euclidean")

# Different linkage methods
hc_complete <- hclust(dist_euclidean, method = "complete")
hc_average <- hclust(dist_euclidean, method = "average")
hc_ward <- hclust(dist_euclidean, method = "ward.D2")
hc_single <- hclust(dist_euclidean, method = "single")

# Visualize dendrograms
par(mfrow = c(2, 2))
plot(hc_complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_complete, k = 4, border = 2:5)
plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_average, k = 4, border = 2:5)
plot(hc_ward, main = "Ward's Method", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_ward, k = 4, border = 2:5)
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = 0.6)
rect.hclust(hc_single, k = 4, border = 2:5)

par(mfrow = c(1, 1))

# Enhanced dendrogram with colored branches
dend <- as.dendrogram(hc_ward)
dend_colored <- color_branches(dend, k = 4)
plot(dend_colored, main = "Hierarchical Clustering - Ward's Method (k=4)")

Ward’s method produced the most balanced and interpretable dendrogram structure, with clear separation between the four clusters.

3.6 Cluster Validation

3.6.1 Hopkins Statistic (Clustering Tendency)

# Test if data is clusterable
set.seed(42)
hopkins_stat <- hopkins(crime_data_scaled, m = nrow(crime_data_scaled)/10)
hopkins_value <- as.numeric(hopkins_stat[[1]])
cat("Hopkins Statistic:", round(hopkins_value, 3), "\n")

## Hopkins Statistic: 0.883

cat("Interpretation:", ifelse(hopkins_value > 0.5, "Data IS clusterable", 
                              "Data is NOT clusterable"), "\n")

## Interpretation: Data IS clusterable

# Visual assessment
US_crimes_frequency <- get_clust_tendency(crime_data_scaled, n = 2, graph = TRUE,
                   gradient = list(low = "red", mid = "white", high = "blue"))

US_crimes_frequency$plot +
  labs(x = "Observations (US States)",
       y = "Observations (US States)",
       title = "Dissimilarity Matrix Heatmap",
       fill = "Distance") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

3.6.2 Silhouette Analysis

# Silhouette plot for k=4
km_final <- kmeans(crime_data_scaled, centers = 4, nstart = 25)
sil <- silhouette(km_final$cluster, dist(crime_data_scaled))

fviz_silhouette(sil) +
  labs(title = "Silhouette Plot for K-Means (k=4)")

##   cluster size ave.sil.width
## 1       1   12          0.38
## 2       2   17          0.34
## 3       3   14          0.46
## 4       4    7          0.40

cat("Average Silhouette Width:", round(mean(sil[,3]), 3), "\n")

## Average Silhouette Width: 0.39

3.6.3 Calinski-Harabasz Index

# Calculate CH index for different k values
ch_values <- numeric(9)
for (k in 2:10) {
  km_temp <- kmeans(crime_data_scaled, centers = k, nstart = 25)
  ch_values[k-1] <- calinhara(crime_data_scaled, km_temp$cluster)
}

plot(2:10, ch_values, type = "b", pch = 19,
     xlab = "Number of Clusters", ylab = "Calinski-Harabasz Index",
     main = "Calinski-Harabasz Index by Number of Clusters",
     col = "blue", lwd = 2)
grid()

cat("\nCalinski-Harabasz Index values:\n")

## 
## Calinski-Harabasz Index values:

for (k in 2:10) {
  cat("k =", k, ": CH =", round(ch_values[k-1], 2), "\n")
}

## k = 2 : CH = 79.91 
## k = 3 : CH = 60.44 
## k = 4 : CH = 65.61 
## k = 5 : CH = 56.91 
## k = 6 : CH = 52.61 
## k = 7 : CH = 49.79 
## k = 8 : CH = 48.88 
## k = 9 : CH = 48.2 
## k = 10 : CH = 46.89

3.6.4 Bootstrap Stability

# Bootstrap validation (100 iterations)
set.seed(42)
stability <- clusterboot(crime_data_scaled, B = 100, bootmethod = "boot",
                         clustermethod = kmeansCBI,
                         krange = 4, seed = 42)

## boot 1 
## boot 2 
## boot 3 
## boot 4 
## boot 5 
## boot 6 
## boot 7 
## boot 8 
## boot 9 
## boot 10 
## boot 11 
## boot 12 
## boot 13 
## boot 14 
## boot 15 
## boot 16 
## boot 17 
## boot 18 
## boot 19 
## boot 20 
## boot 21 
## boot 22 
## boot 23 
## boot 24 
## boot 25 
## boot 26 
## boot 27 
## boot 28 
## boot 29 
## boot 30 
## boot 31 
## boot 32 
## boot 33 
## boot 34 
## boot 35 
## boot 36 
## boot 37 
## boot 38 
## boot 39 
## boot 40 
## boot 41 
## boot 42 
## boot 43 
## boot 44 
## boot 45 
## boot 46 
## boot 47 
## boot 48 
## boot 49 
## boot 50 
## boot 51 
## boot 52 
## boot 53 
## boot 54 
## boot 55 
## boot 56 
## boot 57 
## boot 58 
## boot 59 
## boot 60 
## boot 61 
## boot 62 
## boot 63 
## boot 64 
## boot 65 
## boot 66 
## boot 67 
## boot 68 
## boot 69 
## boot 70 
## boot 71 
## boot 72 
## boot 73 
## boot 74 
## boot 75 
## boot 76 
## boot 77 
## boot 78 
## boot 79 
## boot 80 
## boot 81 
## boot 82 
## boot 83 
## boot 84 
## boot 85 
## boot 86 
## boot 87 
## boot 88 
## boot 89 
## boot 90 
## boot 91 
## boot 92 
## boot 93 
## boot 94 
## boot 95 
## boot 96 
## boot 97 
## boot 98 
## boot 99 
## boot 100

cat("Cluster Stability (Jaccard Bootstrap Mean):\n")

## Cluster Stability (Jaccard Bootstrap Mean):

print(round(stability$bootmean, 3))

## [1] 0.822 0.814 0.800 0.709

cat("\nInterpretation:\n")

## 
## Interpretation:

cat("Values > 0.75: Stable clusters\n")

## Values > 0.75: Stable clusters

cat("Values 0.60-0.75: Moderate stability\n")

## Values 0.60-0.75: Moderate stability

cat("Values < 0.60: Unstable clusters\n")

## Values < 0.60: Unstable clusters

3.7 Method Comparison

# Compare hierarchical vs K-means
hc_clusters <- cutree(hc_ward, k = 4)
km_clusters <- kmeans(crime_data_scaled, centers = 4, nstart = 25)$cluster

comparison <- comPart(hc_clusters, km_clusters)
cat("Adjusted Rand Index (Hierarchical vs K-means):", round(comparison["ARI"], 3), "\n")

## Adjusted Rand Index (Hierarchical vs K-means): 0.67

cat("Interpretation:", 
    ifelse(comparison["ARI"] > 0.7, "Strong agreement",
           ifelse(comparison["ARI"] > 0.5, "Moderate agreement",
                  "Weak agreement")), "\n")

## Interpretation: Moderate agreement

# Visual comparison
df_compare <- data.frame(
  PC1 = prcomp(crime_data_scaled)$x[,1],
  PC2 = prcomp(crime_data_scaled)$x[,2],
  HC_Cluster = as.factor(hc_clusters),
  KM_Cluster = as.factor(km_clusters),
  State = rownames(df)
)

p1 <- ggplot(df_compare, aes(x = PC1, y = PC2, color = HC_Cluster)) +
  geom_point(size = 3) +
  labs(title = "Hierarchical Clustering (Ward)",
       x = "PC1", y = "PC2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

p2 <- ggplot(df_compare, aes(x = PC1, y = PC2, color = KM_Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering",
       x = "PC1", y = "PC2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

print(p1)

print(p2)

3.8 Statistical Testing

3.8.1 ANOVA - Testing Cluster Differences

# Add cluster assignments to original data
df_analysis <- df
df_analysis$Cluster <- km_final$cluster

# ANOVA for each crime variable
cat("=== ANOVA Results ===\n\n")

## === ANOVA Results ===

# Murder
murder_aov <- aov(Murder ~ as.factor(Cluster), data = df_analysis)
cat("Murder Rate:\n")

## Murder Rate:

print(summary(murder_aov))

##                    Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Cluster)  3  782.4   260.8   81.53 <2e-16 ***
## Residuals          46  147.1     3.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Assault
assault_aov <- aov(Assault ~ as.factor(Cluster), data = df_analysis)
cat("\nAssault Rate:\n")

## 
## Assault Rate:

print(summary(assault_aov))

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Cluster)  3 272640   90880   61.77 3.64e-16 ***
## Residuals          46  67673    1471                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Rape
rape_aov <- aov(Rape ~ as.factor(Cluster), data = df_analysis)
cat("\nRape Rate:\n")

## 
## Rape Rate:

print(summary(rape_aov))

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(Cluster)  3   3391  1130.3   57.28 1.44e-15 ***
## Residuals          46    908    19.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.8.2 Post-hoc Testing (Tukey HSD)

cat("\n=== Tukey HSD Post-hoc Test ===\n")

## 
## === Tukey HSD Post-hoc Test ===

cat("\nMurder Rate:\n")

## 
## Murder Rate:

print(TukeyHSD(murder_aov))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Murder ~ as.factor(Cluster), data = df_analysis)
## 
## $`as.factor(Cluster)`
##           diff        lwr       upr     p adj
## 2-1  -7.045098  -8.842562 -5.247635 0.0000000
## 3-1 -10.554762 -12.430224 -8.679300 0.0000000
## 4-1  -3.533333  -5.800656 -1.266011 0.0007850
## 3-2  -3.509664  -5.230219 -1.789109 0.0000117
## 4-2   3.511765   1.370806  5.652724 0.0003943
## 4-3   7.021429   4.814579  9.228278 0.0000000

cat("\nAssault Rate:\n")

## 
## Assault Rate:

print(TukeyHSD(assault_aov))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Assault ~ as.factor(Cluster), data = df_analysis)
## 
## $`as.factor(Cluster)`
##            diff        lwr        upr     p adj
## 2-1 -112.401961 -150.94900  -73.85492 0.0000000
## 3-1 -177.238095 -217.45783 -137.01836 0.0000000
## 4-1    3.119048  -45.50424   51.74233 0.9981966
## 3-2  -64.836134 -101.73385  -27.93842 0.0001442
## 4-2  115.521008   69.60763  161.43439 0.0000001
## 4-3  180.357143  133.03071  227.68357 0.0000000

cat("\nRape Rate:\n")

## 
## Rape Rate:

print(TukeyHSD(rape_aov))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Rape ~ as.factor(Cluster), data = df_analysis)
## 
## $`as.factor(Cluster)`
##           diff        lwr        upr     p adj
## 2-1  -3.848529  -8.312892  0.6158328 0.1134124
## 3-1 -12.125000 -16.783086 -7.4669136 0.0000001
## 4-1  14.360714   8.729362 19.9920662 0.0000001
## 3-2  -8.276471 -12.549815 -4.0031263 0.0000295
## 4-2  18.209244  12.891742 23.5267456 0.0000000
## 4-3  26.485714  21.004559 31.9668698 0.0000000

3.9 Urbanization and Crime Relationship

3.9.1 Correlation Analysis

# Correlation matrix
correlation_matrix <- cor(df)
print("Correlation Matrix:")

## [1] "Correlation Matrix:"

print(round(correlation_matrix, 3))

##          Murder Assault UrbanPop  Rape
## Murder    1.000   0.802    0.070 0.564
## Assault   0.802   1.000    0.259 0.665
## UrbanPop  0.070   0.259    1.000 0.411
## Rape      0.564   0.665    0.411 1.000

# Visual correlation plot
corrplot(correlation_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black",
         title = "Correlation Matrix of Crime Variables and Urbanization",
         mar = c(0,0,2,0))

3.9.2 Linear Regression

# Murder vs UrbanPop
model_murder <- lm(Murder ~ UrbanPop, data = df)
cat("=== Linear Regression: Murder ~ UrbanPop ===\n")

## === Linear Regression: Murder ~ UrbanPop ===

print(summary(model_murder))

## 
## Call:
## lm(formula = Murder ~ UrbanPop, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.537 -3.736 -0.779  3.332  9.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  6.41594    2.90669   2.207   0.0321 *
## UrbanPop     0.02093    0.04333   0.483   0.6312  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.39 on 48 degrees of freedom
## Multiple R-squared:  0.00484,    Adjusted R-squared:  -0.01589 
## F-statistic: 0.2335 on 1 and 48 DF,  p-value: 0.6312

ggplot(df, aes(x = UrbanPop, y = Murder)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", col = "red", se = TRUE) +
  labs(title = "Linear Regression: Murder Rate vs Urbanization",
       x = "Urban Population (%)",
       y = "Murder Rate (per 100,000)") +
  theme_minimal()

# Rape vs UrbanPop
model_rape <- lm(Rape ~ UrbanPop, data = df)
cat("\n=== Linear Regression: Rape ~ UrbanPop ===\n")

## 
## === Linear Regression: Rape ~ UrbanPop ===

print(summary(model_rape))

## 
## Call:
## lm(formula = Rape ~ UrbanPop, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.644  -5.476  -1.216   5.885  27.937 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.78707    5.71128   0.663    0.510   
## UrbanPop     0.26617    0.08513   3.127    0.003 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.626 on 48 degrees of freedom
## Multiple R-squared:  0.1692, Adjusted R-squared:  0.1519 
## F-statistic: 9.776 on 1 and 48 DF,  p-value: 0.003001

ggplot(df, aes(x = UrbanPop, y = Rape)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", col = "red", se = TRUE) +
  labs(title = "Linear Regression: Rape Rate vs Urbanization",
       x = "Urban Population (%)",
       y = "Rape Rate (per 100,000)") +
  theme_minimal()

4. Results

4.1 PCA Results

Principal Component Analysis revealed that:

PC1 (Overall Crime Intensity) explained approximately 62% of the variance, with high positive loadings on all three crime variables (Murder, Assault, Rape)
PC2 (Crime Pattern Contrast) explained approximately 24% of the variance, representing a contrast primarily between Rape and Murder
Together, PC1 and PC2 captured about 86% of total variance, justifying dimensionality reduction

Conclusion for H1: The hypothesis1 is consistent with the findings. Crime variables can be effectively reduced to two dominant dimensions.

4.2 Cluster Identification

# Calculate cluster statistics
cluster_profiles <- df_analysis %>%
  group_by(Cluster) %>%
  summarise(
    N_States = n(),
    Murder_Mean = round(mean(Murder), 1),
    Murder_SD = round(sd(Murder), 1),
    Assault_Mean = round(mean(Assault), 1),
    Assault_SD = round(sd(Assault), 1),
    Rape_Mean = round(mean(Rape), 1),
    Rape_SD = round(sd(Rape), 1),
    UrbanPop_Mean = round(mean(UrbanPop), 1),
    UrbanPop_SD = round(sd(UrbanPop), 1)
  )

print("Cluster Profiles:")

## [1] "Cluster Profiles:"

print(cluster_profiles)

## # A tibble: 4 × 10
##   Cluster N_States Murder_Mean Murder_SD Assault_Mean Assault_SD Rape_Mean
##     <int>    <int>       <dbl>     <dbl>        <dbl>      <dbl>     <dbl>
## 1       1       12        13.6       2.2        258.        48        23.9
## 2       2       17         6.6       1.8        146.        35.2      20.1
## 3       3       14         3.1       1.3         80.9       36.3      11.8
## 4       4        7        10.1       1.8        261.        29.6      38.3
## # ℹ 3 more variables: Rape_SD <dbl>, UrbanPop_Mean <dbl>, UrbanPop_SD <dbl>

# States in each cluster
cat("\n=== States by Cluster ===\n")

## 
## === States by Cluster ===

for (i in 1:4) {
  states <- df_analysis %>%
    filter(Cluster == i) %>%
    rownames()
  cat("\nCluster", i, "(n =", length(states), "):\n")
  cat(paste(states, collapse = ", "), "\n")
}

## 
## Cluster 1 (n = 12 ):
## Alabama, Florida, Georgia, Illinois, Louisiana, Maryland, Mississippi, New York, North Carolina, South Carolina, Tennessee, Texas 
## 
## Cluster 2 (n = 17 ):
## Arkansas, Delaware, Indiana, Kansas, Kentucky, Massachusetts, Missouri, Montana, New Jersey, Ohio, Oklahoma, Oregon, Pennsylvania, Utah, Virginia, Washington, Wyoming 
## 
## Cluster 3 (n = 14 ):
## Connecticut, Hawaii, Idaho, Iowa, Maine, Minnesota, Nebraska, New Hampshire, North Dakota, Rhode Island, South Dakota, Vermont, West Virginia, Wisconsin 
## 
## Cluster 4 (n = 7 ):
## Alaska, Arizona, California, Colorado, Michigan, Nevada, New Mexico

Cluster Characterization

Based on the profiles, the four clusters can be characterized as:

Cluster 1: High-crime states with elevated rates across all crime types
Cluster 2: Low-crime states with below-average rates in all categories
Cluster 3: Moderate-crime states with balanced crime profiles
Cluster 4: Mixed-pattern states with specific crime type elevations

Conclusion for H2: The hypothesis2 is empirically supported.Four distinct, stable clusters were identified.

4.3 Cluster Validity Results

cat("=== Cluster Validation Summary ===\n\n")

## === Cluster Validation Summary ===

cat("Hopkins Statistic:", round(hopkins_value, 3), "\n")

## Hopkins Statistic: 0.883

cat("→ Data is", ifelse(hopkins_value > 0.5, "clusterable", "not clusterable"), "\n\n")

## → Data is clusterable

cat("Average Silhouette Width:", round(mean(sil[,3]), 3), "\n")

## Average Silhouette Width: 0.39

cat("→ Interpretation:", 
    ifelse(mean(sil[,3]) > 0.5, "Strong structure",
           ifelse(mean(sil[,3]) > 0.25, "Reasonable structure",
                  "Weak structure")), "\n\n")

## → Interpretation: Reasonable structure

cat("Calinski-Harabasz Index (k=4):", round(ch_values[3], 2), "\n")

## Calinski-Harabasz Index (k=4): 65.61

cat("→ Higher values indicate better-defined clusters\n\n")

## → Higher values indicate better-defined clusters

cat("Bootstrap Stability:\n")

## Bootstrap Stability:

for (i in 1:4) {
  cat("Cluster", i, ":", round(stability$bootmean[i], 3),
      ifelse(stability$bootmean[i] > 0.75, "(STABLE)",
             ifelse(stability$bootmean[i] > 0.6, "(MODERATE)", "(UNSTABLE)")), "\n")
}

## Cluster 1 : 0.822 (STABLE) 
## Cluster 2 : 0.814 (STABLE) 
## Cluster 3 : 0.8 (STABLE) 
## Cluster 4 : 0.709 (MODERATE)

4.4 Statistical Differences Between Clusters

All three crime variables showed statistically significant differences across clusters (ANOVA p < 0.001 for all). Post-hoc Tukey tests confirmed multiple pairwise differences between cluster means.

Conclusion for H3: The data provides evidence in favor of the hypothesis3.Clusters differ significantly in crime rates.

4.5 Urbanization and Crime

Murder vs UrbanPop: Weak correlation (r ≈ 0.07), non-significant regression (p > 0.05)
Rape vs UrbanPop: Moderate positive correlation (r ≈ 0.41), significant regression (p < 0.01)
Assault vs UrbanPop: Weak positive correlation (r ≈ 0.26)

Conclusion for H4: We can not fully reject or fully confirm the hypothesis4. As the urbanization is associated with Rape rates but not Murder rates.

4.6 Method Agreement

The Adjusted Rand Index between hierarchical (Ward’s method) and K-means clustering showed moderate agreement (ARI ≈ 0.5-0.7), indicating that both methods identify similar underlying structures with some differences in boundary cases.

Conclusion for H5: The results support the hypothesis5. Moderate agreement between clustering methods.

4.7 Visual Summary

# Heatmap with clustering
df_heatmap_scaled <- scale(df[, c("Murder", "Assault", "Rape", "UrbanPop")])

annotation_row <- data.frame(
  Cluster = as.factor(km_final$cluster),
  row.names = rownames(df)
)

pheatmap(df_heatmap_scaled,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean",
         clustering_method = "ward.D2",
         annotation_row = annotation_row,
         main = "Crime Patterns Heatmap with Hierarchical Clustering",
         fontsize_row = 6,
         color = colorRampPalette(c("navy", "white", "firebrick3"))(50))

# Radar chart for cluster comparison
cluster_means <- df_analysis %>%
  group_by(Cluster) %>%
  summarise(
    Murder = mean(Murder),
    Assault = mean(Assault),
    Rape = mean(Rape),
    UrbanPop = mean(UrbanPop)
  )

cluster_means_norm <- cluster_means
cluster_means_norm[, 2:5] <- scale(cluster_means[, 2:5],
                                   center = FALSE,
                                   scale = apply(cluster_means[, 2:5], 2, max))

radar_data <- as.data.frame(t(cluster_means_norm[, -1]))
colnames(radar_data) <- paste("Cluster", 1:4)
radar_data <- rbind(rep(1, 4), rep(0, 4), radar_data)

radarchart(radar_data,
           pcol = c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12"),
           pfcol = adjustcolor(c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12"),
                               alpha.f = 0.3),
           plwd = 2,
           cglcol = "grey",
           cglty = 1,
           axislabcol = "grey",
           title = "Cluster Profiles - Radar Chart")
legend("topright",
       legend = paste("Cluster", 1:4),
       col = c("#E74C3C", "#3498DB", "#2ECC71", "#F39C12"),
       lty = 1, lwd = 2, cex = 0.8)

5. Discussion and Interpretation of Findings

5.1 Principal Findings

This comprehensive analysis demonstrates that violent crime patterns across US states exhibit clear structure that can be effectively captured through unsupervised learning methods. The identification of four distinct clusters with significantly different crime profiles suggests that states experience qualitatively different patterns of violence rather than simply varying along a single continuum.

5.2 Interpretation of Clusters

The four-cluster solution reveals meaningful crime typologies:

High-crime states typically show elevated rates across all crime categories, suggesting systemic factors that affect multiple types of violence
Low-crime states maintain consistently low rates, potentially reflecting effective law enforcement, social cohesion, or demographic factors
Moderate-crime states represent the statistical middle ground
Pattern-specific states show interesting contrasts, with high rates in some crime types but not others

5.3 Role of Urbanization

The differential relationship between urbanization and crime types is noteworthy. The strong association with rape rates but weak association with murder rates suggests that:

Urbanization may affect crime types differently through distinct mechanisms
Reporting rates for certain crimes may vary by urban density
Other unmeasured factors (e.g., poverty, inequality, policing) may mediate these relationships

5.4 Methodological Strengths

This study employed rigorous validation procedures:

Multiple clustering algorithms (K-means and hierarchical) showed agreement
30 different indices via NbClust consensus supported k=4
Bootstrap stability testing confirmed cluster robustness
Statistical testing (ANOVA, Tukey HSD) verified significant differences
Multiple validation metrics (Hopkins, Silhouette, CH index) confirmed clusterability

5.5 Limitations

Cross-sectional data: Analysis represents a snapshot in time; temporal dynamics are not captured
Ecological fallacy: State-level aggregation may mask within-state heterogeneity
Limited variables: Many relevant factors (economic, demographic, policy) are not included
Reporting bias: Arrest rates may not perfectly reflect actual crime incidence

5.6 Implications

For Policy: - Resource allocation can be targeted based on cluster membership - States within the same cluster may benefit from similar interventions - High-crime clusters require comprehensive, multi-faceted approaches

For Research: - Future studies should incorporate temporal dynamics and additional covariates - Investigation of within-cluster variation could reveal further insights - Causal mechanisms linking urbanization to specific crime types warrant exploration

6. Conclusion

This study successfully demonstrates that US states form four distinct, stable clusters based on violent crime patterns. All five research hypotheses received empirical support (one partially):

H1 (PCA): Two principal components effectively capture crime variation
H2 (Clustering): Four stable clusters exist H3 (Differences): Clusters differ significantly in all crime variables
H4 (Urbanization): Partial support—related to rape but not murder
H5 (Clustering agreement): Moderate agreement between clustering methods

The rigorous validation framework—including Hopkins statistic, silhouette analysis, Calinski-Harabasz index, bootstrap stability, and multiple clustering algorithms—provides strong evidence that these patterns are real and reproducible, not artifacts of the analysis method.

Understanding these crime typologies can inform more nuanced policy responses and advance theoretical understanding of the geographic distribution of violence in the United States.

7. Recommendations

7.1 For future Research:

Temporal Analysis: Examine how cluster membership evolves over time
Additional Covariates: Incorporate socioeconomic, demographic, and policy variables
Sub-state Analysis: Investigate within-state heterogeneity at county or city level
Causal Mechanisms: Employ causal inference methods to identify drivers of cluster membership
Predictive Modeling: Develop models to forecast cluster transitions

7.2 For Policymakers/Crime Analysts:

Tailor interventions to cluster characteristics
Share best practices within clusters
Monitor states at risk of cluster transition
Develop cluster-specific resource allocation strategies

8. References:

-1.Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. -2.Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. -3.Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association.

Unsupervised Learning Analysis of Violent Crime Patterns Across US States

Yana Hrabchak

2025-12-11