Research Question: “How can the LAPD identify unique crime ‘fingerprints’ by integrating spatial hotspots, temporal patterns, and victim demographics to transition from reactive to proactive policing?”
This analysis aims to move from broad patrolling to Precision Policing. By segmenting crimes into specific “Risk Fingerprints,” we provide the LAPD with actionable intelligence to deploy specialized units where they are most needed.
This dataset contains 50,000 crime incident reports from the Los Angeles Police Department (LAPD) covering January 2020 to present. Each record includes detailed information about crime types, locations, victim demographics, and case outcomes.
To keep the dimensionality reduction (PCA) effective, this study focuses on four key numerical features: TIME OCC, Vict Age, LAT, and LON.
To ensure high-quality analytical results, we filter out invalid records (Vict Age = 0, LAT/LON = 0). A single random sample of 10,000 records is then drawn once and reused, so that every model is fitted to the same observations.
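The code in this report uses `crime_sample` and `crime_scaled` without showing how they are built. The sketch below is one plausible way to implement the filtering, sampling, and standardization described above. The helper name `prepare_sample`, its arguments, and the toy data frame are hypothetical and are not taken from the original analysis.

```r
library(dplyr)

# Hypothetical helper: drops invalid rows, draws one fixed random sample,
# and standardizes the four numeric features used by the PCA.
prepare_sample <- function(df, n = 10000, seed = 123) {
  features <- c("TIME OCC", "Vict Age", "LAT", "LON")
  valid <- df %>%
    filter(`Vict Age` != 0, LAT != 0, LON != 0)
  set.seed(seed)  # fixed seed so every model sees the same rows
  sampled <- valid %>% slice_sample(n = min(n, nrow(valid)))
  scaled <- scale(sampled[, features])  # mean 0, standard deviation 1 per column
  list(sample = sampled, scaled = scaled)
}

# Toy stand-in for the real file (values are made up for illustration)
toy <- tibble(
  `TIME OCC` = c(1350, 2210, 845, 30, 1715, 1200),
  `Vict Age` = c(34, 0, 61, 27, 45, 19),
  LAT = c(34.05, 34.10, 0, 33.99, 34.20, 34.01),
  LON = c(-118.25, -118.30, 0, -118.45, -118.44, -118.29)
)
toy_prep <- prepare_sample(toy, n = 3)
```

On the real data, the same helper could be called as `prepare_sample(data, n = 10000)` once the CSV has been loaded, with the two list elements assigned to `crime_sample` and `crime_scaled`.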
library(dplyr)
library(readr)
library(ggplot2)
library(factoextra)
library(cluster)
library(stringr)
library(kableExtra)
library(gridExtra)
data <- read_csv("Crime_Data_from_2020_to_Present.csv")
# Label Cleaning Function
clean_labels <- function(x) {
  x <- str_replace_all(x, "ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT", "Aggravated Assault")
  x <- str_replace_all(x, "THEFT-GRAND.*", "Grand Theft")
  x <- str_replace_all(x, "BURGLARY FROM VEHICLE", "Vehicle Burglary")
  x <- str_replace_all(x, "BATTERY - SIMPLE ASSAULT", "Simple Battery")
  x <- str_trunc(x, 25)
  return(x)
}
PCA identifies the uncorrelated “Principal Components” that drive the variance in Los Angeles crime data.
The Scree Plot displays the percentage of variance captured by each component.
# crime_scaled is assumed to hold the already standardized features,
# so prcomp() is told not to center or scale again
pca_model <- prcomp(crime_scaled, center = FALSE, scale. = FALSE)
fviz_eig(pca_model, addlabels = TRUE, barfill = "#2E86C1", barcolor = "#2E86C1")
The Scree Plot reveals that the first two components (PC1 and PC2) capture the vast majority of the variance in the dataset. There is a clear “elbow” after the second component, suggesting that while LA crime is complex, it can be effectively categorized into two primary strategic dimensions: Spatial Orientation and Socio-Temporal Behavior. By focusing on these two dimensions, the LAPD can reduce noise and focus on the core drivers of public safety.
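The percentages drawn in a scree plot can also be read directly from a `prcomp` object, which makes it easy to check how much variance the first two components really hold. The sketch below uses a synthetic matrix (`toy_scaled`) as a stand-in for the standardized crime features, so its numbers say nothing about the LAPD data.

```r
set.seed(123)

# Synthetic stand-in for the standardized feature matrix (not the LAPD data)
toy_scaled <- scale(matrix(rnorm(200 * 4), ncol = 4))
toy_pca <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)

# Each component's variance is its squared standard deviation
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Percentage of variance per component and running total
round(100 * cbind(var_explained, cum_explained), 1)
```

Applying the same two lines to `pca_model$sdev` would give the exact share of variance behind the "first two components" claim.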
The Loading Matrix reveals the “weight” of each original variable within the components.
pca_loadings <- as.data.frame(pca_model$rotation)
pca_loadings %>% kable() %>% kable_styling(full_width = TRUE, bootstrap_options = "striped")
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| TIME OCC | -0.0290747 | 0.8369274 | -0.5462383 | -0.0181937 |
| Vict Age | -0.1521750 | -0.5439782 | -0.8237374 | -0.0488578 |
| LAT | -0.6958388 | 0.0500773 | 0.1371855 | -0.7031932 |
| LON | 0.7012886 | -0.0336535 | -0.0652726 | -0.7090848 |
PC1 (Geospatial Dimension): Driven by high loadings on LAT (-0.70) and LON (0.70). This axis essentially maps the city’s geography. Because the two loadings have opposite signs, PC1 scores rise as latitude falls and longitude rises, so the component orders incidents along a single geographic axis; here it differentiates crimes occurring in the Western coastal districts (like Pacific) from those in the Inland/Central areas.
PC2 (Socio-Temporal Dimension): Dominated by TIME OCC (0.84) and Vict Age (-0.54). The inverse relationship between these two variables is critical: it suggests that as the day progresses into the late evening, the victim profile shifts toward a younger demographic, which aligns with our findings in the high-friction Central area.
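To make the loadings concrete, a component score is just the loading-weighted sum of a record's standardized values. The sketch below copies the PC1 and PC2 loadings from the table above and applies them to a hypothetical standardized record (`z`), one standard deviation later than average and one standard deviation younger than average, at an average location.

```r
# Loadings copied from the loading matrix (rows: TIME OCC, Vict Age, LAT, LON)
loadings <- matrix(
  c(-0.0290747,  0.8369274,
    -0.1521750, -0.5439782,
    -0.6958388,  0.0500773,
     0.7012886, -0.0336535),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("TIME OCC", "Vict Age", "LAT", "LON"), c("PC1", "PC2"))
)

# Hypothetical standardized record (illustrative values, not a real incident)
z <- c(`TIME OCC` = 1, `Vict Age` = -1, LAT = 0, LON = 0)

# Score on each component = sum over variables of (standardized value * loading)
scores <- drop(z %*% loadings)
round(scores, 3)
# PC1 = 0.123, PC2 = 1.381
```

The late, young record lands high on PC2 and close to zero on PC1, which is the pattern the Socio-Temporal interpretation relies on.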
This plot shows which variables contribute most significantly to the first two dimensions.
fviz_pca_var(pca_model, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
The contribution plot confirms the dominance of geographic variables. LAT and LON show the longest vectors, indicating they are the strongest drivers of the first dimension (PC1). Interestingly, TIME OCC and Vict Age contribute significantly to the second dimension (PC2) but are oriented in nearly opposite directions. This is consistent with an “Inverse Demographic-Temporal Relationship”: certain crimes involving younger victims are highly time-dependent (nightlife/evening), while crimes involving older victims (fraud/identity theft) follow a different, less time-volatile pattern.
set.seed(123)
pca_scores <- pca_model$x[, 1:2]
p1 <- fviz_nbclust(pca_scores, kmeans, method = "wss") + labs(subtitle = "Elbow Method")
p2 <- fviz_nbclust(pca_scores, kmeans, method = "silhouette") + labs(subtitle = "Silhouette Method")
grid.arrange(p1, p2, ncol = 2)
Both the Elbow Method and the Silhouette Score converge on k=3 as the optimal number of clusters. The Elbow plot shows a significant drop in total within-cluster sum of squares (WSS) up to 3 clusters, after which the gain becomes marginal. The Silhouette plot confirms that at k=3, the clusters achieve the best balance between internal cohesion and external separation, providing a clear mandate for the three-profile policing strategy.
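The two diagnostics can also be computed by hand, which helps when the exact values behind the plots are needed. The sketch below is a minimal illustration on synthetic two-dimensional scores (`toy_scores`, a stand-in for `pca_scores`); it computes total WSS and the average silhouette width for several values of k using `kmeans()` and `cluster::silhouette()`.

```r
library(cluster)

set.seed(123)
# Synthetic 2-D scores with three separated groups (stand-in for pca_scores)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

ks <- 2:6                 # silhouette width is undefined for k = 1
d <- dist(toy_scores)     # pairwise Euclidean distances, reused for every k

# Elbow criterion: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(toy_scores, centers = k, nstart = 25)$tot.withinss
})

# Silhouette criterion: mean silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(toy_scores, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3))
```

The k with the highest `avg_sil`, read together with the point where `wss` stops dropping sharply, is the value the two plots are pointing at.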
set.seed(123)
km_res <- kmeans(pca_scores, centers = 3, nstart = 25)
set.seed(123)
pam_res <- pam(pca_scores, k = 3)
p3 <- fviz_cluster(km_res, data = pca_scores, geom = "point", ellipse.type = "convex",
                   palette = "jco", main = "K-means Profile Clusters")
p4 <- fviz_cluster(pam_res, palette = "Set2", main = "PAM Robustness Check")
grid.arrange(p3, p4, ncol = 2)
The K-means and PAM visualizations display three well-defined “Crime Ellipses.” Cluster 1 (Pacific) and Cluster 2 (Van Nuys) are separated primarily along the horizontal axis (PC1 - Geography), reflecting the vast physical distance between these divisions. Cluster 3 (Central) is separated along the vertical axis (PC2 - Behavior), highlighting that the “Urban Friction” in the city center is driven more by victim age and time of day than by pure geography alone.
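A visual comparison can be backed by a number. The sketch below is one simple way to quantify how closely the PAM solution matches the K-means solution, shown on synthetic scores rather than the crime data. The overlap measure used here (matching each K-means cluster to its best-overlapping PAM cluster) is an illustrative choice, not a method used in the original analysis.

```r
library(cluster)

set.seed(123)
# Synthetic 2-D scores with three separated groups (stand-in for pca_scores)
toy_scores <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

km_toy  <- kmeans(toy_scores, centers = 3, nstart = 25)
pam_toy <- pam(toy_scores, k = 3)

# Cross-tabulate the two assignments; cluster numbers are arbitrary labels
tab <- table(kmeans = km_toy$cluster, pam = pam_toy$clustering)

# Share of points falling in the best-matching PAM cluster of each K-means cluster
agreement <- sum(apply(tab, 1, max)) / sum(tab)
tab
agreement
```

An agreement close to 1 would support reading PAM as a robustness check; a markedly lower value would suggest the profiles depend on the clustering algorithm.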
crime_results <- crime_sample %>%
  mutate(Cluster = as.factor(km_res$cluster)) %>%
  mutate(`Crm Cd Desc` = clean_labels(`Crm Cd Desc`))
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
summary_table <- crime_results %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Age = round(mean(`Vict Age`), 1),
    Avg_Time = round(mean(`TIME OCC`), 0),
    Main_Area = get_mode(`AREA NAME`),
    Top_Crime = get_mode(`Crm Cd Desc`),
    Volume = n()
  )
summary_table %>% kable() %>% kable_styling(full_width = TRUE, bootstrap_options = "striped")
| Cluster | Avg_Age | Avg_Time | Main_Area | Top_Crime | Volume |
|---|---|---|---|---|---|
| 1 | 46.1 | 754 | Pacific | THEFT OF IDENTITY | 3126 |
| 2 | 41.9 | 1378 | Van Nuys | THEFT OF IDENTITY | 2964 |
| 3 | 32.6 | 1743 | Central | Simple Battery | 3910 |
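One caveat when reading Avg_Time: the values look like 24-hour HHMM numbers, and averaging such numbers directly can give results that are not valid clock times (1378 in the table is an example, since there is no minute 78). The sketch below assumes TIME OCC is stored as an HHMM number and converts it to minutes since midnight before averaging. The helper names are hypothetical.

```r
# Assumption: TIME OCC is a 24-hour HHMM number, e.g. 1350 means 13:50
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + (x %% 100)

# Format minutes since midnight back into an "HH:MM" string
minutes_to_clock <- function(m) {
  m <- as.integer(round(m))
  sprintf("%02d:%02d", m %/% 60L, m %% 60L)
}

# Average in minutes, then convert back to clock notation
mean_clock_time <- function(x) minutes_to_clock(mean(hhmm_to_minutes(x)))

times <- c(1350, 1410)
mean(times)             # 1380, which is not a valid clock time
mean_clock_time(times)  # "14:00"
```

This simple average still treats the day as a straight line, so incidents clustered around midnight would need a circular mean instead.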
By mapping the clusters back to the original data, we identify three distinct “Crime Fingerprints” that dictate specific tactical responses:
Cluster 1: Senior Identity Vulnerability (The Pacific Profile)
Cluster 2: Evening Identity Risk (The Van Nuys Profile)
Cluster 3: Young Urban Friction (The Central Profile)
The core objective of this study was to transition the LAPD from Reactive Response to Data-Driven Proactive Policing. By integrating PCA and K-means Clustering, we have successfully decoded the “DNA” of LA crime into three actionable segments.
By adopting this Fingerprinting Model, the LAPD can move beyond “chasing sirens.” This data-driven roadmap allows for Precision Policing, ensuring the right resources are deployed to the right demographic at the right time.