1. Introduction

Research Question: “How can the LAPD identify unique crime ‘fingerprints’ by integrating spatial hotspots, temporal patterns, and victim demographics to transition from reactive to proactive policing?”

This analysis aims to move from broad patrolling to Precision Policing. By segmenting crimes into specific “Risk Fingerprints,” we provide the LAPD with actionable intelligence to deploy specialized units where they are most needed.

2. Data Source and Overview

2.1 Data Source

This dataset contains 50,000 crime incident reports from the Los Angeles Police Department (LAPD) covering January 2020 to 2025. Each record includes detailed information about crime types, locations, victim demographics, and case outcomes.

Primary Source: data.lacity.org
Dataset Curator: Hammad Zafar (2025)
Access Date: December 2025
Kaggle Link: Crime Data Set

2.2 Data Overview

To ensure the effectiveness of Dimensionality Reduction (PCA), this study focuses on a subset of key numerical features:

Temporal: Time of occurrence.
Demographic: Victim’s age.
Geospatial: Latitude (LAT) and Longitude (LON) of the crime scene.

3. Data Preparation

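To ensure high-quality analytical results, we filter out invalid records (Vict Age ≤ 0, LAT/LON = 0) and rows with missing values in any of the four analysis variables. A single seeded random sample of 10,000 records is then drawn and reused by every model, so that the PCA and clustering results all describe the same incidents.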

library(dplyr)
library(readr)
library(ggplot2)
library(factoextra)
library(cluster)
library(stringr)
library(kableExtra)
library(gridExtra)

data <- read_csv("Crime_Data_from_2020_to_Present.csv")

# Label Cleaning Function
clean_labels <- function(x) {
  x <- str_replace_all(x, "ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT", "Aggravated Assault")
  x <- str_replace_all(x, "THEFT-GRAND.*", "Grand Theft")
  x <- str_replace_all(x, "BURGLARY FROM VEHICLE", "Vehicle Burglary")
  x <- str_replace_all(x, "BATTERY - SIMPLE ASSAULT", "Simple Battery")
  x <- str_trunc(x, 25)
  return(x)
}

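set.seed(123)  # fixed seed so the 10,000-row sample is reproducible
crime_sample <- data %>%
  # drop placeholder values: non-positive ages and zero coordinates
  filter(`Vict Age` > 0 & LAT != 0 & LON != 0) %>%
  # drop rows with missing values in any analysis variable
  filter(!is.na(`TIME OCC`), !is.na(`Vict Age`), !is.na(LAT), !is.na(LON)) %>%
  sample_n(10000)

# keep the four numeric features and standardize them (mean 0, sd 1)
crime_numeric <- crime_sample %>% select(`TIME OCC`, `Vict Age`, LAT, LON)
crime_scaled <- scale(crime_numeric)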
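
As a quick check on the standardization step, the sketch below uses synthetic data (the column names and distributions are illustrative assumptions, not the LAPD records). It shows that scale() yields columns with mean 0 and standard deviation 1, and that running prcomp() on pre-scaled data gives the same component variances as letting prcomp() do the scaling itself.

```r
# Synthetic stand-in for the four numeric features (illustrative values only)
set.seed(1)
toy <- data.frame(
  time_occ = sample(0:23, 200, replace = TRUE) * 100 + sample(0:59, 200, replace = TRUE),
  vict_age = sample(18:80, 200, replace = TRUE),
  lat      = rnorm(200, mean = 34.05, sd = 0.10),
  lon      = rnorm(200, mean = -118.30, sd = 0.10)
)

toy_scaled <- scale(toy)  # center each column to mean 0, scale to sd 1

# After scaling, column means are ~0 and standard deviations are 1
col_means <- colMeans(toy_scaled)
col_sds   <- apply(toy_scaled, 2, sd)

# Two equivalent routes to the same PCA
pca_prescaled <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)
pca_internal  <- prcomp(toy, center = TRUE, scale. = TRUE)
same_variances <- isTRUE(all.equal(pca_prescaled$sdev, pca_internal$sdev))
```

Either route is valid; pre-scaling has the advantage that the same standardized matrix can be reused by later steps.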

4. Dimensionality Reduction (PCA)

PCA identifies the uncorrelated “Principal Components” that drive the variance in Los Angeles crime data.

4.1 Scree Plot

The Scree Plot displays the percentage of variance captured by each component.

# crime_scaled is already centered and scaled, so both steps are switched off here
pca_model <- prcomp(crime_scaled, center = FALSE, scale. = FALSE)
fviz_eig(pca_model, addlabels = TRUE, barfill = "#2E86C1", barcolor = "#2E86C1")

The Scree Plot reveals that the first two components (PC1 and PC2) capture the vast majority of the variance in the dataset. There is a clear “elbow” after the second component, suggesting that while LA crime is complex, it can be effectively categorized into two primary strategic dimensions: Spatial Orientation and Socio-Temporal Behavior. By focusing on these two dimensions, the LAPD can reduce noise and focus on the core drivers of public safety.
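
The percentages on a scree plot come from the component standard deviations stored in the prcomp object. A minimal sketch on synthetic data (illustrative only, not the LAPD sample) of how to read them off numerically:

```r
set.seed(2)
# Synthetic data with two correlated columns, standing in for scaled features
x1 <- rnorm(300)
toy <- cbind(a = x1, b = x1 + rnorm(300, sd = 0.5), c = rnorm(300), d = rnorm(300))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

eigenvalues   <- pca$sdev^2                      # variance of each component
var_explained <- eigenvalues / sum(eigenvalues)  # proportion per component
cum_explained <- cumsum(var_explained)           # running total

round(100 * rbind(var_explained, cum_explained), 1)
```

Applied to the report's pca_model, cum_explained[2] would give the share of variance retained by PC1 and PC2 together, which is the number behind the first two bars of the scree plot.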

4.2 Loading Matrix Analysis

The Loading Matrix reveals the “weight” of each original variable within the components.

pca_loadings <- as.data.frame(pca_model$rotation)
pca_loadings %>% kable() %>% kable_styling(full_width = TRUE, bootstrap_options = "striped")

Variable	PC1	PC2	PC3	PC4
TIME OCC	-0.0290747	0.8369274	-0.5462383	-0.0181937
Vict Age	-0.1521750	-0.5439782	-0.8237374	-0.0488578
LAT	-0.6958388	0.0500773	0.1371855	-0.7031932
LON	0.7012886	-0.0336535	-0.0652726	-0.7090848

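PC1 (Geospatial Dimension): Driven by high loadings on LAT (-0.70) and LON (0.70), this axis essentially maps the city’s geography. Because the two loadings have opposite signs, PC1 is a contrast between latitude and longitude: scores rise toward the south and east and fall toward the north and west. PC1 therefore separates divisions along a northwest-to-southeast axis, which is what distinguishes the Pacific and Van Nuys profiles later in the analysis.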

PC2 (Socio-Temporal Dimension): Dominated by TIME OCC (0.84) and Vict Age (-0.54). The opposite signs of these two loadings are the key feature: along this component, later times of day go together with younger victims, which aligns with our findings in the high-friction Central area.
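
To make the reading of the loadings concrete, the sketch below re-enters the reported loading matrix, checks that its columns are orthonormal (as a PCA rotation matrix should be), and projects a few hypothetical standardized records onto PC1 and PC2. The records are invented for illustration; only the loadings come from the table.

```r
# Loadings as reported in the loading matrix (rows: variables, columns: components)
loadings <- matrix(
  c(-0.0290747,  0.8369274, -0.5462383, -0.0181937,
    -0.1521750, -0.5439782, -0.8237374, -0.0488578,
    -0.6958388,  0.0500773,  0.1371855, -0.7031932,
     0.7012886, -0.0336535, -0.0652726, -0.7090848),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("TIME OCC", "Vict Age", "LAT", "LON"),
                  c("PC1", "PC2", "PC3", "PC4"))
)

# Orthonormal columns: t(L) %*% L should equal the identity up to rounding
max_dev <- max(abs(crossprod(loadings) - diag(4)))

# Hypothetical standardized records (z-scores), one per row
records <- rbind(
  north_one_sd = c(0, 0, 1, 0),   # 1 sd north of the mean latitude
  east_one_sd  = c(0, 0, 0, 1),   # 1 sd east of the mean longitude
  late_young   = c(1, -1, 0, 0)   # 1 sd later in the day, 1 sd younger
)

# A component score is the weighted sum of z-scores, with loadings as weights
scores <- records %*% loadings[, c("PC1", "PC2")]
round(scores, 2)
```

A record further north scores lower on PC1 and a record further east scores higher, while a later, younger-victim record scores high on PC2 and barely moves on PC1.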

4.3 Variable Contribution Plot

This plot shows which variables contribute most significantly to the first two dimensions.

fviz_pca_var(pca_model, col.var = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

The contribution plot confirms the dominance of geographic variables. LAT and LON show the longest vectors, indicating they are the strongest drivers of the first dimension (PC1). TIME OCC and Vict Age, by contrast, contribute mainly to the second dimension (PC2) and are oriented in nearly opposite directions. This supports an “Inverse Demographic-Temporal Relationship”: certain crimes involving younger victims are highly time-dependent (nightlife/evening), while crimes involving older victims (fraud/identity theft) follow a different, less time-volatile pattern.
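
Within a single component, a variable's contribution can be expressed as its squared loading in percent, since the squared loadings of a component sum to 1. A small sketch using the PC1 and PC2 loadings reported in Section 4.2 (the object names are illustrative):

```r
vars <- c("TIME OCC", "Vict Age", "LAT", "LON")
pc1  <- c(-0.0290747, -0.1521750, -0.6958388,  0.7012886)
pc2  <- c( 0.8369274, -0.5439782,  0.0500773, -0.0336535)

# Per-component contribution (%) = 100 * loading^2
contrib <- data.frame(
  variable    = vars,
  PC1_percent = round(100 * pc1^2, 1),
  PC2_percent = round(100 * pc2^2, 1)
)
contrib
```

On these figures LAT and LON together account for roughly 98% of PC1, while TIME OCC accounts for about 70% of PC2 and Vict Age for about 30%.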

5. Cluster Analysis: Decoding Crime Fingerprints

5.1 Optimal K Selection: Elbow & Silhouette

set.seed(123)
pca_scores <- pca_model$x[, 1:2]

p1 <- fviz_nbclust(pca_scores, kmeans, method = "wss") + labs(subtitle = "Elbow Method")
p2 <- fviz_nbclust(pca_scores, kmeans, method = "silhouette") + labs(subtitle = "Silhouette Method")
grid.arrange(p1, p2, ncol=2)

Both the Elbow Method and the Silhouette Score converge on k=3 as the optimal number of clusters. The Elbow plot shows a significant drop in total within-cluster sum of squares (WSS) up to 3 clusters, after which the gain becomes marginal. The Silhouette plot confirms that at k=3, the clusters achieve the best balance between internal cohesion and external separation, providing a clear mandate for the three-profile policing strategy.
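
The two criteria can also be computed directly, which helps when the plots are ambiguous. The sketch below does so on three synthetic, well-separated blobs (an illustrative assumption, not the crime data), using base kmeans() and silhouette() from the cluster package.

```r
library(cluster)  # silhouette()

set.seed(3)
# Three well-separated synthetic blobs in two dimensions (illustrative only)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))
d <- dist(toy)

ks <- 2:6
# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)
# Silhouette method: average silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3))
```

On real data the silhouette maximum is usually far less pronounced than on these toy blobs, so the numeric table is a useful complement to the two plots.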

5.2 Robustness Check: K-means vs. PAM

set.seed(123)
km_res <- kmeans(pca_scores, centers = 3, nstart = 25)

set.seed(123)
pam_res <- pam(pca_scores, k = 3)

p3 <- fviz_cluster(km_res, data = pca_scores, geom = "point", ellipse.type = "convex", 
                   palette = "jco", main = "K-means Profile Clusters")
p4 <- fviz_cluster(pam_res, palette = "Set2", main = "PAM Robustness Check")
grid.arrange(p3, p4, ncol=2)

The K-means and PAM visualizations display three well-defined cluster regions, each drawn as a convex hull. Cluster 1 (Pacific) and Cluster 2 (Van Nuys) are separated primarily along the horizontal axis (PC1, Geography), reflecting the physical distance between these divisions. Cluster 3 (Central) is separated along the vertical axis (PC2, Behavior), highlighting that the “Urban Friction” in the city center is driven more by victim age and time of day than by geography alone.
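
Visual similarity between the two panels can be backed by a numeric agreement check. A minimal sketch on synthetic blobs (illustrative data, not the crime sample) that cross-tabulates the two label sets and measures how well they match:

```r
library(cluster)  # pam()

set.seed(4)
# Three well-separated synthetic blobs (illustrative only)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

km <- kmeans(toy, centers = 3, nstart = 25)
pm <- pam(toy, k = 3)

# Cluster numbers are arbitrary, so match each K-means cluster
# to the PAM cluster it overlaps most and count the matched points
xtab <- table(kmeans = km$cluster, pam = pm$clustering)
agreement <- sum(apply(xtab, 1, max)) / sum(xtab)
```

With the report's objects the same check would start from table(km_res$cluster, pam_res$clustering); an agreement close to 1 indicates that both algorithms recover essentially the same partition.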

6. Crime Fingerprint Profiling

crime_results <- crime_sample %>%
  mutate(Cluster = as.factor(km_res$cluster)) %>%
  mutate(`Crm Cd Desc` = clean_labels(`Crm Cd Desc`))

get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

summary_table <- crime_results %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Age   = round(mean(`Vict Age`), 1),
    Avg_Time  = round(mean(`TIME OCC`), 0),
    Main_Area = get_mode(`AREA NAME`),
    Top_Crime = get_mode(`Crm Cd Desc`),
    Volume    = n()
  )

summary_table %>% kable() %>% kable_styling(full_width = TRUE, bootstrap_options = "striped")

Cluster	Avg_Age	Avg_Time	Main_Area	Top_Crime	Volume
1	46.1	754	Pacific	THEFT OF IDENTITY	3126
2	41.9	1378	Van Nuys	THEFT OF IDENTITY	2964
3	32.6	1743	Central	Simple Battery	3910
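
Avg_Time in the table is the mean of the raw TIME OCC codes. Assuming these are 24-hour HHMM readings (which the 5:40 PM reading of 1743 implies), averaging the codes directly can land on values that are not valid clock times, such as 1378. One hedged alternative is to convert to minutes since midnight before averaging; hhmm_to_minutes and minutes_to_hhmm below are hypothetical helpers, not part of the original analysis.

```r
# Convert HHMM codes (e.g. 1743) to minutes since midnight (e.g. 1063)
hhmm_to_minutes <- function(hhmm) {
  hhmm <- as.integer(hhmm)
  (hhmm %/% 100L) * 60L + (hhmm %% 100L)
}

# Convert minutes since midnight back to an "HH:MM" string
minutes_to_hhmm <- function(mins) {
  mins <- round(mins)
  sprintf("%02d:%02d", as.integer(mins %/% 60), as.integer(mins %% 60))
}

times <- c(959, 1001)                        # 09:59 and 10:01
naive_mean <- mean(times)                    # 980: not a meaningful clock time
clock_mean <- minutes_to_hhmm(mean(hhmm_to_minutes(times)))  # "10:00"
```

Even in minutes, an arithmetic mean ignores the wrap-around at midnight, so a cluster of late-night incidents would still need a circular statistic or a median to be summarized faithfully.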

Deep Dive into Cluster Characteristics

By mapping the clusters back to the original data, we identify three distinct “Crime Fingerprints” that dictate specific tactical responses:

Cluster 1: Senior Identity Vulnerability (The Pacific Profile)

Fingerprint: This cluster features the highest average victim age (46.1) and is concentrated in the Pacific area. The primary threat is Identity Theft.
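Interpretation: This is not a “street-level” crime profile but a sophisticated, predatory one whose victims skew older than in the other two clusters; note that a mean age of 46.1 describes a mature population overall, with seniors as one part of it.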
Action: Deployment of Cyber-Crime and Fraud Prevention units to the Pacific division to conduct senior-focused digital safety workshops.

Cluster 2: Afternoon Identity Risk (The Van Nuys Profile)

Fingerprint: Victims are middle-aged (Avg: 41.9), with crimes centered in the early afternoon (Avg Time: 1378, around 2 PM; because the value is a mean of HHMM codes, it is not itself a valid clock reading).
Interpretation: This cluster represents a “Working-Age” risk profile. Identity theft here likely occurs during business transactions or afternoon digital activity.
Action: Increased surveillance of commercial and business hubs in Van Nuys, focusing on secure transaction awareness.

Cluster 3: Young Urban Friction (The Central Profile)

Fingerprint: This is the most “physical” cluster. It involves the youngest victim demographic (Avg: 32.6) and focuses on Simple Battery.
Interpretation: Concentrated in the high-density Central area, this cluster reflects physical altercations in public spaces or transit hubs, with incidents centered in the early evening (Avg Time: 1743, around 5:40 PM).
Action: High-visibility foot patrols in Central Los Angeles during the evening rush hour to de-escalate physical conflicts and provide a deterrent for simple battery incidents.

7. Conclusion

Objective Alignment

The core objective of this study was to transition the LAPD from Reactive Response to Data-Driven Proactive Policing. By integrating PCA and K-means Clustering, we have successfully decoded the “DNA” of LA crime into three actionable segments.

Strategic Recommendations

Specialized Patrols: Deploy high-visibility foot patrols in Central specifically during the 17:00–18:00 window to mitigate “Urban Friction” (Battery).
Targeted Prevention: Reallocate resources in Pacific and Van Nuys toward Cyber-Crime and Fraud workshops rather than traditional physical surveillance.
Resource Optimization: Prioritize de-escalation training for the Central division, the most common area in Cluster 3, which is the largest cluster in the sample (3,910 of 10,000 records) and the one dominated by Simple Battery.

Closing Thought

By adopting this Fingerprinting Model, the LAPD can move beyond “chasing sirens.” This data-driven roadmap allows for Precision Policing, ensuring the right resources are deployed to the right demographic at the right time.

Strategic Crime Fingerprinting: Integrating PCA and Cluster Analysis

Thi Yen Nhi Pham

2026-02-20