Code
library(tidyverse)
library(lubridate)
library(scales)
library(ggplot2)
library(broom)
library(knitr)
library(kableExtra)
library(patchwork)Crime in the United States is a multifaceted social phenomenon shaped by geography, economics, seasonality, and policing policy. This project analyzes monthly crime statistics drawn from the FBI’s Real-Time Crime Index (RTCI), a dataset that aggregates voluntary submissions from local law enforcement agencies across the country. The data spans January 2018 through early 2026, covering crimes across seven categories: murder, rape, robbery, aggravated assault, burglary, theft, and motor vehicle theft.
Research Questions:
The dataset was sourced from the AH Datalytics Real-Time Crime Index (source link), which compiles FBI NIBRS and UCR agency-level monthly reports into a unified CSV format updated regularly.
library(tidyverse)
library(lubridate)
library(scales)
library(ggplot2)
library(broom)
library(knitr)
library(kableExtra)
library(patchwork)crime_raw <- read_csv("C:/Users/amanu/Downloads/final_sample.csv", show_col_types = FALSE)
# Preview structure
glimpse(crime_raw)Rows: 74,413
Columns: 34
$ Month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ Year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 201…
$ Date <chr> "January 2018", "February 2018", "March…
$ Agency <chr> "Agencies of < 100K", "Agencies of < 10…
$ State <chr> "Nationwide", "Nationwide", "Nationwide…
$ Region <chr> "Other", "Other", "Other", "Other", "Ot…
$ Agency_State <chr> "Agencies of < 100K, Nationwide", "Agen…
$ Murder <dbl> 63, 24, 41, 45, 44, 37, 45, 53, 45, 46,…
$ Rape <dbl> 566, 470, 545, 513, 632, 609, 599, 626,…
$ Robbery <dbl> 977, 760, 791, 868, 834, 848, 906, 920,…
$ `Aggravated Assault` <dbl> 2293, 2053, 2393, 2492, 2750, 2876, 285…
$ Burglary <dbl> 4796, 4145, 4514, 4467, 4938, 4695, 519…
$ Theft <dbl> 23337, 19892, 22067, 22221, 24310, 2399…
$ `Motor Vehicle Theft` <dbl> 2872, 2452, 2565, 2450, 2708, 2590, 281…
$ `Violent Crime` <dbl> 3899, 3307, 3770, 3918, 4260, 4370, 440…
$ `Property Crime` <dbl> 31005, 26489, 29146, 29138, 31956, 3128…
$ Murder_mvs_12mo <dbl> 537, 511, 516, 518, 518, 509, 506, 524,…
$ Burglary_mvs_12mo <dbl> 63155, 62661, 62399, 61737, 61108, 6045…
$ Rape_mvs_12mo <dbl> 6289, 6330, 6371, 6323, 6369, 6415, 646…
$ Robbery_mvs_12mo <dbl> 11752, 11658, 11561, 11500, 11302, 1116…
$ `Aggravated Assault_mvs_12mo` <dbl> 29808, 29830, 29808, 29906, 30027, 3027…
$ `Motor Vehicle Theft_mvs_12mo` <dbl> 32905, 32894, 32938, 32719, 32656, 3245…
$ Theft_mvs_12mo <dbl> 298448, 296650, 295528, 293932, 292343,…
$ `Violent Crime_mvs_12mo` <dbl> 48386, 48329, 48256, 48247, 48216, 4836…
$ `Property Crime_mvs_12mo` <dbl> 394508, 392205, 390865, 388388, 386107,…
$ Source.Link <chr> "https://ah-datalytics.github.io/rtci/l…
$ Source.Type <chr> "Aggregate", "Aggregate", "Aggregate", …
$ Source.Method <chr> "All agencies with complete data throug…
$ FBI.Population.Covered <dbl> 16311785, 16311785, 16311785, 16311785,…
$ Number.of.Agencies <dbl> 235, 235, 235, 235, 235, 235, 235, 235,…
$ Latitude <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Longitude <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Comment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `Last Updated` <chr> "2026-03-23 11:37:36 EDT", "2026-03-23 …
The raw dataset contains 74413 rows and 34 columns, representing monthly crime observations from agencies across 48 states and the District of Columbia.
# Count missing values per column
missing_summary <- crime_raw |>
summarise(across(everything(), ~sum(is.na(.)))) |>
pivot_longer(everything(), names_to = "Column", values_to = "Missing") |>
filter(Missing > 0) |>
arrange(desc(Missing))
kable(missing_summary, caption = "Columns with Missing Values") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Column | Missing |
|---|---|
| Latitude | 74413 |
| Longitude | 74413 |
| Comment | 70666 |
| Violent Crime | 21160 |
| Property Crime | 21160 |
| Violent Crime_mvs_12mo | 21160 |
| Property Crime_mvs_12mo | 21160 |
| Rape_mvs_12mo | 1654 |
| Rape | 1541 |
| Aggravated Assault_mvs_12mo | 1171 |
| Aggravated Assault | 1058 |
| Burglary_mvs_12mo | 790 |
| Theft_mvs_12mo | 781 |
| Motor Vehicle Theft_mvs_12mo | 705 |
| Robbery_mvs_12mo | 694 |
| Burglary | 677 |
| Theft | 669 |
| Robbery | 582 |
| Motor Vehicle Theft | 582 |
| Murder_mvs_12mo | 218 |
| Murder | 105 |
crime_clean <- crime_raw |>
# 1. Parse Date column into proper date type
mutate(
Date_parsed = mdy(Date),
Year = as.integer(Year),
Month = as.integer(Month)
) |>
# 2. Remove rows where both Violent Crime and Property Crime are NA
# (these are structurally incomplete agency records)
filter(!(is.na(`Violent Crime`) & is.na(`Property Crime`))) |>
# 3. Drop columns with 100% missingness (Latitude, Longitude)
select(-Latitude, -Longitude) |>
# 4. Remove the Comment column (>95% missing, no analytical value)
select(-Comment) |>
# 5. Standardize Region: relabel "Other" to "Nationwide/Aggregate"
mutate(
Region = if_else(Region == "Other", "Nationwide/Aggregate", Region),
Region = factor(Region, levels = c("Northeast", "South", "Midwest", "West", "Nationwide/Aggregate"))
) |>
# 6. Rename columns with spaces for easier handling
rename(
Aggravated_Assault = `Aggravated Assault`,
Motor_Vehicle_Theft = `Motor Vehicle Theft`,
Violent_Crime = `Violent Crime`,
Property_Crime = `Property Crime`
)
cat("Rows after cleaning:", nrow(crime_clean), "\n")Rows after cleaning: 53253
cat("Columns after cleaning:", ncol(crime_clean), "\n")Columns after cleaning: 32
# Full-sample national aggregate rows (one per month-agency combination for national rollup)
national <- crime_clean |>
filter(Agency == "Full Sample", !is.na(Violent_Crime))
# Agency-level data (exclude aggregate rows), with complete violent crime
agency_data <- crime_clean |>
filter(Agency != "Full Sample",
!is.na(Violent_Crime),
Region != "Nationwide/Aggregate")
cat("National aggregate rows:", nrow(national), "\n")National aggregate rows: 4559
cat("Agency-level rows:", nrow(agency_data), "\n")Agency-level rows: 48306
annual_national <- national |>
group_by(Year) |>
summarise(
Total_Violent = sum(Violent_Crime, na.rm = TRUE),
Total_Property = sum(Property_Crime, na.rm = TRUE),
Total_Murder = sum(Murder, na.rm = TRUE),
.groups = "drop"
) |>
filter(Year < 2026) # 2026 is partial year
kable(annual_national,
col.names = c("Year", "Total Violent Crime", "Total Property Crime", "Total Murders"),
caption = "Annual National Crime Totals (Full Sample Agencies)",
format.args = list(big.mark = ",")) |>
kable_styling(bootstrap_options = c("striped", "hover"))| Year | Total Violent Crime | Total Property Crime | Total Murders |
|---|---|---|---|
| 2,018 | 1,109,962 | 5,491,626 | 15,576 |
| 2,019 | 1,101,358 | 5,392,292 | 15,970 |
| 2,020 | 1,168,586 | 5,049,744 | 21,272 |
| 2,021 | 1,192,344 | 5,116,386 | 22,382 |
| 2,022 | 1,210,802 | 5,520,632 | 21,600 |
| 2,023 | 1,198,842 | 5,528,252 | 19,302 |
| 2,024 | 1,154,780 | 5,107,256 | 16,494 |
| 2,025 | 1,032,358 | 4,481,350 | 13,574 |
monthly_avg <- national |>
group_by(Month) |>
summarise(
Avg_Violent = mean(Violent_Crime, na.rm = TRUE),
Avg_Property = mean(Property_Crime, na.rm = TRUE),
.groups = "drop"
) |>
mutate(Month_Name = month(Month, label = TRUE, abbr = TRUE))
monthly_avg |>
kable(col.names = c("Month #", "Avg Violent Crime", "Avg Property Crime", "Month"),
caption = "Average Monthly Crime Counts Across All Years",
digits = 0,
format.args = list(big.mark = ",")) |>
kable_styling(bootstrap_options = c("striped", "hover"))| Month # | Avg Violent Crime | Avg Property Crime | Month |
|---|---|---|---|
| 1 | 1,868 | 9,053 | Jan |
| 2 | 1,701 | 8,289 | Feb |
| 3 | 1,921 | 8,844 | Mar |
| 4 | 1,953 | 8,689 | Apr |
| 5 | 2,193 | 9,346 | May |
| 6 | 2,185 | 9,295 | Jun |
| 7 | 2,271 | 9,767 | Jul |
| 8 | 2,205 | 9,750 | Aug |
| 9 | 2,108 | 9,327 | Sep |
| 10 | 2,112 | 9,700 | Oct |
| 11 | 1,915 | 9,094 | Nov |
| 12 | 1,914 | 9,442 | Dec |
We model Violent Crime as a function of its three sub-components: Murder, Robbery, and Aggravated Assault — plus Month as a seasonal control.
Because Violent Crime is defined as the sum of these three components plus rape, including all four would create perfect multicollinearity. We therefore use Murder, Robbery, and Aggravated Assault as predictors and use Month as a seasonal nuisance variable.
reg_data <- national |>
select(Violent_Crime, Murder, Robbery, Aggravated_Assault, Month, Year) |>
filter(
!is.na(Violent_Crime),
!is.na(Murder),
!is.na(Robbery),
!is.na(Aggravated_Assault)
)
cat("Regression dataset rows:", nrow(reg_data), "\n")Regression dataset rows: 4559
model <- lm(Violent_Crime ~ Murder + Robbery + Aggravated_Assault + Month,
data = reg_data)
summary(model)
Call:
lm(formula = Violent_Crime ~ Murder + Robbery + Aggravated_Assault +
Month, data = reg_data)
Residuals:
Min 1Q Median 3Q Max
-695.68 -12.77 1.67 15.34 733.43
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.332458 2.285138 6.272 3.89e-10 ***
Murder 0.451636 0.054950 8.219 2.65e-16 ***
Robbery 1.146789 0.003121 367.451 < 2e-16 ***
Aggravated_Assault 1.085604 0.001660 654.152 < 2e-16 ***
Month -2.064781 0.309521 -6.671 2.85e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 72.69 on 4554 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 1.057e+07 on 4 and 4554 DF, p-value: < 2.2e-16
Based on the fitted coefficients:
\[ \hat{Violent\_Crime} = \beta_0 + \beta_1 \cdot Murder + \beta_2 \cdot Robbery + \beta_3 \cdot Aggravated\_Assault + \beta_4 \cdot Month + \varepsilon \] \[ \hat{Violent\_Crime} = 14.33 + 0.4516 \cdot Murder + 1.1468 \cdot Robbery + 1.0856 \cdot Aggravated\_Assault - 2.0648 \cdot Month \]
coefs <- coef(model)
cat(sprintf(
"Violent_Crime = %.2f + %.4f·Murder + %.4f·Robbery + %.4f·Aggravated_Assault + %.4f·Month\n",
coefs["(Intercept)"], coefs["Murder"], coefs["Robbery"],
coefs["Aggravated_Assault"], coefs["Month"]
))Violent_Crime = 14.33 + 0.4516·Murder + 1.1468·Robbery + 1.0856·Aggravated_Assault + -2.0648·Month
model_stats <- glance(model)
kable(
model_stats |> select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs),
col.names = c("R²", "Adj. R²", "RMSE", "F-Statistic", "p-value", "df", "Observations"),
caption = "Multiple Linear Regression Model Summary",
digits = 4
) |>
kable_styling(bootstrap_options = c("striped", "hover"))| R² | Adj. R² | RMSE | F-Statistic | p-value | df | Observations |
|---|---|---|---|---|---|---|
| 0.9999 | 0.9999 | 72.6944 | 10568530 | 0 | 4 | 4559 |
Interpretation:
tidy(model) |>
kable(
col.names = c("Term", "Estimate", "Std. Error", "t-Statistic", "p-value"),
caption = "Regression Coefficients",
digits = 4
) |>
kable_styling(bootstrap_options = c("striped", "hover")) |>
row_spec(which(tidy(model)$p.value < 0.05), bold = TRUE, color = "white", background = "#2c7bb6")| Term | Estimate | Std. Error | t-Statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 14.3325 | 2.2851 | 6.2720 | 0 |
| Murder | 0.4516 | 0.0549 | 8.2191 | 0 |
| Robbery | 1.1468 | 0.0031 | 367.4506 | 0 |
| Aggravated_Assault | 1.0856 | 0.0017 | 654.1516 | 0 |
| Month | -2.0648 | 0.3095 | -6.6709 | 0 |
par(mfrow = c(2, 2))
plot(model, which = 1:4)par(mfrow = c(1, 1))Diagnostic Interpretation:
# Annual regional property crime totals
regional_annual <- agency_data |>
filter(Year <= 2025) |>
group_by(Year, Region) |>
summarise(
Total_Property_Crime = sum(Property_Crime, na.rm = TRUE),
Total_Violent_Crime = sum(Violent_Crime, na.rm = TRUE),
.groups = "drop"
)
# Custom color palette — NOT ggplot defaults
region_colors <- c(
"Northeast" = "#1a6b8a",
"South" = "#e05c2a",
"Midwest" = "#4caf72",
"West" = "#9b59b6"
)
ggplot(regional_annual, aes(x = Year, y = Total_Property_Crime / 1000,
color = Region, group = Region)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3, aes(shape = Region)) +
# Shade the COVID era
annotate("rect", xmin = 2020, xmax = 2021.5,
ymin = -Inf, ymax = Inf,
alpha = 0.08, fill = "#888888") +
annotate("text", x = 2020.75, y = Inf, vjust = 1.5,
label = "COVID-19\nPeriod", size = 3, color = "#555555") +
scale_color_manual(values = region_colors,
name = "U.S. Census Region") +
scale_shape_manual(values = c(16, 17, 15, 18),
name = "U.S. Census Region") +
scale_x_continuous(breaks = 2018:2025) +
scale_y_continuous(labels = label_comma(suffix = "K")) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(color = "#555555", size = 11),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text.x = element_text(angle = 30, hjust = 1)
) +
labs(
title = "Annual Property Crime Totals by U.S. Census Region (2018–2025)",
subtitle = "Reporting agencies with complete monthly data · Counts in thousands",
x = "Year",
y = "Total Property Crimes (Thousands)",
caption = "Source: FBI Real-Time Crime Index via AH Datalytics · github.com/ah-datalytics/rtci"
)heatmap_data <- national |>
filter(Year >= 2018, Year <= 2025) |>
group_by(Year, Month) |>
summarise(Total_Violent = sum(Violent_Crime, na.rm = TRUE), .groups = "drop") |>
mutate(Month_Name = factor(month.abb[Month], levels = month.abb))
ggplot(heatmap_data, aes(x = Month_Name, y = factor(Year), fill = Total_Violent)) +
geom_tile(color = "white", linewidth = 0.4) +
scale_fill_gradient2(
low = "#f7f7f7",
mid = "#74add1",
high = "#d73027",
midpoint = median(heatmap_data$Total_Violent, na.rm = TRUE),
labels = label_comma(),
name = "Violent\nCrimes"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
axis.text = element_text(size = 10),
panel.grid = element_blank()
) +
labs(
title = "Monthly Violent Crime Heatmap: National Aggregate (2018–2025)",
subtitle = "Darker red = more violent crimes; blue = fewer",
x = "Month",
y = "Year",
caption = "Source: FBI Real-Time Crime Index via AH Datalytics · Full-Sample national aggregate rows only."
)set.seed(42)
scatter_data <- agency_data |>
filter(
Year <= 2025,
Violent_Crime > 0, Property_Crime > 0,
Violent_Crime < 5000, Property_Crime < 30000
) |>
slice_sample(n = 3000)
ggplot(scatter_data, aes(x = Violent_Crime, y = Property_Crime,
color = Region, alpha = 0.6)) +
geom_point(size = 1.2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.1, alpha = 1) +
scale_color_manual(values = region_colors, name = "U.S. Census Region") +
scale_x_continuous(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
guides(alpha = "none") +
facet_wrap(~Region, ncol = 2, scales = "free") +
theme_light(base_size = 12) +
theme(
strip.background = element_rect(fill = "#2c3e50"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(face = "bold")
) +
labs(
title = "Violent Crime vs. Property Crime by Region (Sample of Agency-Month Observations)",
subtitle = "Linear trend lines fitted per region",
x = "Monthly Violent Crime Count",
y = "Monthly Property Crime Count",
caption = "Source: FBI RTCI via AH Datalytics · Random sample of 3,000 observations · Extreme outliers removed for readability."
)The raw dataset (final_sample__1_.csv) contained 74,413 rows and 34 columns, spanning monthly crime observations from January 2018 through early 2026 across approximately 693 unique law enforcement agencies in 48 states.
Key cleaning decisions made:
1. Handling missing values. Two columns — Latitude and Longitude — were entirely null (100% missing) across all 74,413 rows and were dropped entirely. The Comment column was over 95% null with no consistent values, so it was also removed. Violent Crime and Property Crime were missing in approximately 21,160 rows; inspection revealed these corresponded to agencies that only reported individual crime subcategories without providing totals. Rather than imputing these as sums (which would require assuming no unreported subcategories), rows missing both composite scores were dropped for the regression and visualization requiring these fields, while the subcategory-level data were retained for other analyses.
2. Date parsing. The Date column was stored as a character string (e.g., "January 2018"). This was parsed into a proper R date object using lubridate::mdy() for accurate time-series ordering. The Year and Month columns were cast to integers for grouping operations.
3. Agency segmentation. The dataset contains two conceptually distinct row types: individual agency submissions and a special Agency == "Full Sample" aggregate that combines all agencies with complete data in a given month. These were separated into two subsets — national for the aggregate and agency_data for individual agencies — to avoid double-counting in visualizations.
4. Region recoding. The Region variable contained a catch-all value "Other", applied to rows associated with nationwide or cross-regional aggregates. This was relabeled "Nationwide/Aggregate" and converted to a factor with logical level ordering to improve clarity in plots.
5. Column renaming. Columns with spaces (e.g., Aggravated Assault, Violent Crime) were renamed using underscores to facilitate R’s formula syntax in regression modeling.
After cleaning, the dataset retained 53,253 rows for analysis (agency-level data with complete crime counts).
Regional Property Crime Trends (2018–2025): The South consistently reports the highest total property crime counts of any region, followed closely by the West. This reflects both higher populations in Southern states and the large geographic footprint of agencies in the dataset from California and Texas. A striking observation is the sharp drop in property crime across all regions in 2020, coinciding with the COVID-19 pandemic — lockdowns reduced retail activity and public movement, which likely suppressed crimes like theft and burglary. A partial rebound is visible through 2021–2022, though totals have not returned to pre-pandemic levels in the Northeast and Midwest as of 2025.
Seasonal Heatmap: The heatmap reveals a clear summer peak in violent crime in most years, with July and August consistently among the highest-crime months nationally. The year 2020 shows notably lighter colors in the spring months (March–May), again reflecting pandemic disruption, before returning to typical summer peaks in June–August 2020.
Violent vs. Property Crime Scatter: Across all regions, there is a strong positive correlation between violent and property crime counts — unsurprising, since high-crime jurisdictions tend to see elevated rates across all categories. The Midwest shows a slightly steeper regression slope than the Northeast, suggesting that in Midwestern agencies, incremental increases in violent crime are associated with larger simultaneous increases in property crime.
Several analytical directions were explored but could not be fully completed within the scope of this project:
Agency-level regression with fixed effects: A more rigorous model would include agency-level fixed effects (or random effects) to control for the fact that agencies vary enormously in size, reporting completeness, and local crime environment. A multilevel model (lme4::lmer) was started but not completed due to convergence challenges with the unbalanced panel structure.
Per-capita normalization: Ideally, crime counts would be divided by FBI.Population.Covered to produce per-capita rates, enabling fairer comparisons between large metropolitan agencies and small-town departments. Population coverage was available but varied across time for the same agency, complicating the normalization.
Interactive visualization: A plotly or leaflet map showing per-state crime rates would have been informative, but Latitude and Longitude were entirely missing, and the State field would require joining with a geographic shapefile — a task deferred for future work.
Time-series forecasting: Given the monthly structure of the data, an ARIMA or ETS model predicting future violent crime counts would be a natural extension, especially given the visible seasonal patterns in the heatmap.
Dataset source: AH Datalytics Real-Time Crime Index — https://realtimecrimeindex.com/]