Which 15 states have the highest frequency of arrests, and is there a predominant gender among those apprehended?
This project explores enforcement trends across various U.S. states using a data set containing information on individual arrests. My project focuses on identifying geographic hot spots and demographic trends using variables such as “apprehension_state” and “gender”.
The data set is a comprehensive record of law enforcement actions, consisting of 28 variables and 713,464 cases. For this study, I am focusing on the geographic location of the arrest and the demographic classification of the apprehended individuals. Understanding these patterns is crucial for assessing regional resource allocation and the demographic impacts of enforcement policies.
Dataset Source:
https://deportationdata.org/data/processed/ice.html
Data Analysis
In this section, I perform a bivariate categorical analysis to examine the frequency of arrests by location and gender. I will generate a ranked bar chart for the top 15 states and a frequency plot for gender distribution to address the research question.
# Loading necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) # Required for comma formatting
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(readxl)
# Importing the dataset
arrest_data <- read_xlsx("arrests-latest.xlsx", col_types = "text")
# Viewing dimensions and structure
dim(arrest_data)
## [1] 713464 28
str(arrest_data)
## tibble [713,464 × 28] (S3: tbl_df/tbl/data.frame)
## $ apprehension_date : chr [1:713464] "44835" "44835" "44835" "44835" ...
## $ apprehension_type : chr [1:713464] NA NA NA NA ...
## $ apprehension_state : chr [1:713464] "TEXAS" NA NA "FLORIDA" ...
## $ apprehension_aor : chr [1:713464] "San Antonio Area of Responsibility" "Phoenix Area of Responsibility" "Phoenix Area of Responsibility" "Miami Area of Responsibility" ...
## $ final_program : chr [1:713464] "Non-Detained Docket Control" "Detained Docket Control" "Detained Docket Control" "287G Program" ...
## $ arresting_agency : chr [1:713464] "ICE" "ICE" "ICE" "ICE" ...
## $ apprehension_method : chr [1:713464] "ERO Reprocessed Arrest" "ERO Reprocessed Arrest" "ERO Reprocessed Arrest" "287(g) Program" ...
## $ apprehension_criminality : chr [1:713464] "3 Other Immigration Violator" "1 Convicted Criminal" "3 Other Immigration Violator" "2 Pending Criminal Charges" ...
## $ case_status : chr [1:713464] "A-Proceedings Terminated" "8-Excluded/Removed - Inadmissibility" "8-Excluded/Removed - Inadmissibility" "8-Excluded/Removed - Inadmissibility" ...
## $ case_category : chr [1:713464] "[8B] Excludable / Inadmissible - Under Adjudication by IJ" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" ...
## $ departure_country : chr [1:713464] NA "PERU" "COLOMBIA" "HONDURAS" ...
## $ final_order_yes_no : chr [1:713464] "NO" "YES" "YES" "YES" ...
## $ birth_year : chr [1:713464] "1974" "1970" "1984" "1996" ...
## $ citizenship_country : chr [1:713464] "CUBA" "PERU" "COLOMBIA" "HONDURAS" ...
## $ gender : chr [1:713464] "Male" "Female" "Female" "Male" ...
## $ departed_date : chr [1:713464] NA "45105" "44959" "44904" ...
## $ final_order_date : chr [1:713464] NA "45077" "44944" "44862" ...
## $ apprehension_site_landmark: chr [1:713464] "WALKINS AT SAN ANTONIO" NA NA "JAC GENERAL AREA, NON-SPECIFIC" ...
## $ operation : chr [1:713464] "-" NA NA "HQ Tracking of Processing done by 287(g) authorized State & Local LEA's." ...
## $ toa_current_duty_site : chr [1:713464] "ERO - San Antonio, TX Field Office" "ELOY, AZ, SERVICE PROCESSING CENTER (DOCKET CONTROL OFFICE)" "ELOY, AZ, SERVICE PROCESSING CENTER (DOCKET CONTROL OFFICE)" "Jacksonville Sheriff's Office" ...
## $ case_criminality : chr [1:713464] "3 Other Immigration Violator" "1 Convicted Criminal" "3 Other Immigration Violator" "2 Pending Criminal Charges" ...
## $ case_threat_level : chr [1:713464] "NA" "1" "NA" "NA" ...
## $ unique_identifier : chr [1:713464] "bad3911f91e15e59572fe96a226c1fdbcf82d799" "dafdb5d0e565d238b9093e99c42e305ac95ce2b4" "1db208374a6ce32dd03f1b6d876c9f692b756185" "50198f8883c9b118a1153af301c65bfef03bb058" ...
## $ apprehension_date_time : chr [1:713464] "44835.04861111111" "44835.0747337963" "44835.11998842593" "44835.12847222222" ...
## $ duplicate_likely : chr [1:713464] "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ file_original : chr [1:713464] "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" ...
## $ sheet_original : chr [1:713464] "FY2023" "FY2023" "FY2023" "FY2023" ...
## $ row_original : chr [1:713464] "124239" "145764" "19899" "53227" ...
glimpse(arrest_data)
## Rows: 713,464
## Columns: 28
## $ apprehension_date <chr> "44835", "44835", "44835", "44835", "44835"…
## $ apprehension_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ apprehension_state <chr> "TEXAS", NA, NA, "FLORIDA", "TEXAS", "TEXAS…
## $ apprehension_aor <chr> "San Antonio Area of Responsibility", "Phoe…
## $ final_program <chr> "Non-Detained Docket Control", "Detained Do…
## $ arresting_agency <chr> "ICE", "ICE", "ICE", "ICE", "ICE", "ICE", "…
## $ apprehension_method <chr> "ERO Reprocessed Arrest", "ERO Reprocessed …
## $ apprehension_criminality <chr> "3 Other Immigration Violator", "1 Convicte…
## $ case_status <chr> "A-Proceedings Terminated", "8-Excluded/Rem…
## $ case_category <chr> "[8B] Excludable / Inadmissible - Under Adj…
## $ departure_country <chr> NA, "PERU", "COLOMBIA", "HONDURAS", NA, NA,…
## $ final_order_yes_no <chr> "NO", "YES", "YES", "YES", "NO", "NO", "NO"…
## $ birth_year <chr> "1974", "1970", "1984", "1996", "1972", "19…
## $ citizenship_country <chr> "CUBA", "PERU", "COLOMBIA", "HONDURAS", "CU…
## $ gender <chr> "Male", "Female", "Female", "Male", "Female…
## $ departed_date <chr> NA, "45105", "44959", "44904", NA, NA, NA, …
## $ final_order_date <chr> NA, "45077", "44944", "44862", NA, NA, NA, …
## $ apprehension_site_landmark <chr> "WALKINS AT SAN ANTONIO", NA, NA, "JAC GENE…
## $ operation <chr> "-", NA, NA, "HQ Tracking of Processing don…
## $ toa_current_duty_site <chr> "ERO - San Antonio, TX Field Office", "ELOY…
## $ case_criminality <chr> "3 Other Immigration Violator", "1 Convicte…
## $ case_threat_level <chr> "NA", "1", "NA", "NA", "NA", "3", "NA", "3"…
## $ unique_identifier <chr> "bad3911f91e15e59572fe96a226c1fdbcf82d799",…
## $ apprehension_date_time <chr> "44835.04861111111", "44835.0747337963", "4…
## $ duplicate_likely <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"…
## $ file_original <chr> "2026-ICLI-00005_Arrests_FY23_20260311_Reda…
## $ sheet_original <chr> "FY2023", "FY2023", "FY2023", "FY2023", "FY…
## $ row_original <chr> "124239", "145764", "19899", "53227", "8809…
summary(arrest_data)
## apprehension_date apprehension_type apprehension_state apprehension_aor
## Length:713464 Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## final_program arresting_agency apprehension_method
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## apprehension_criminality case_status case_category
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## departure_country final_order_yes_no birth_year citizenship_country
## Length:713464 Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## gender departed_date final_order_date
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## apprehension_site_landmark operation toa_current_duty_site
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## case_criminality case_threat_level unique_identifier
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## apprehension_date_time duplicate_likely file_original
## Length:713464 Length:713464 Length:713464
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## sheet_original row_original
## Length:713464 Length:713464
## Class :character Class :character
## Mode :character Mode :character
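Because every column was imported as text, the date columns (for example apprehension_date, shown above as "44835") hold Excel serial day numbers and not readable dates. The sketch below shows one way such values could be converted in base R. It assumes the workbook uses Excel's default 1900 date system, where the origin is 1899-12-30; the vector serial_text is a hypothetical stand-in for the real column.

```r
# Hypothetical sample of text values as they appear in apprehension_date
serial_text <- c("44835", "45105", NA)

# Assumption: Excel 1900 date system, so day 0 corresponds to 1899-12-30
apprehension_dates <- as.Date(as.numeric(serial_text), origin = "1899-12-30")

print(apprehension_dates)
# [1] "2022-10-01" "2023-06-28" NA
```

The first value falls on 1 October 2022, which is consistent with the FY2023 sheet name in the sheet_original column.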
Data Manipulation and Cleaning
I used four dplyr functions (filter, mutate, group_by, and summarize) to clean and summarize the data. I removed missing values and literal “NA” strings, standardized text capitalization, and converted state names into two-letter abbreviations to keep the visualizations readable.
## Data manipulation: drop missing values, standardize text, add state codes
arrest_data <- arrest_data %>%
  # Remove rows with a true NA in either variable
  filter(!is.na(apprehension_state), !is.na(gender)) %>%
  # Standardize capitalization and trim surrounding whitespace
  mutate(apprehension_state = str_trim(str_to_title(as.character(apprehension_state))),
         gender = str_trim(str_to_title(as.character(gender)))) %>%
  # Remove literal "NA" or "Unknown" strings (title-cased by now) and empty strings
  filter(apprehension_state != "Na", apprehension_state != "",
         gender != "Na", gender != "Unknown", gender != "") %>%
  # Convert state names to two-letter abbreviations; keep the original label when there is no match
  mutate(state_code = state.abb[match(apprehension_state, state.name)]) %>%
  mutate(state_code = ifelse(is.na(state_code), apprehension_state, state_code)) %>%
  # Store gender as a factor and drop unused levels
  mutate(gender = fct_drop(as.factor(gender)))
# Summary for Top 15 plot
state_top15 <- arrest_data %>%
  group_by(state_code) %>%
  summarize(total_arrests = n(), .groups = 'drop') %>%
  slice_max(total_arrests, n = 15) %>%
  arrange(desc(total_arrests))
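The abbreviation step relies on R's built-in state.name and state.abb vectors, which cover only the 50 states. The short sketch below, run on hypothetical labels, illustrates how a label with no match falls through unchanged. That may be why the chi-squared output later reports df = 62, which for a two-column table implies 63 distinct location labels.

```r
# Hypothetical labels; the last two are not among the 50 entries of state.name
labels <- c("Texas", "Florida", "Puerto Rico", "Guam")

# match() returns NA for labels that are not U.S. state names
codes <- state.abb[match(labels, state.name)]

# Fall back to the original label when no abbreviation exists
state_code <- ifelse(is.na(codes), labels, codes)
print(state_code)
# [1] "TX"          "FL"          "Puerto Rico" "Guam"

# Listing the unmatched labels helps decide whether to keep or exclude them
unmatched <- labels[is.na(codes)]
print(unmatched)
```

Inspecting the unmatched labels before plotting is a cheap check that long names will not crowd the axis of the state chart.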
Visualizations
# Plot 1: Top 15 States with Comma Formatting
ggplot(state_top15, aes(x = reorder(state_code, -total_arrests), y = total_arrests)) +
  geom_col(fill = "steelblue", width = 0.7) +  # geom_col() is equivalent to geom_bar(stat = "identity")
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Top 15 States with Most Arrests", x = "State Code", y = "Total Arrests")
Plot 1: Uses a bar chart to rank states. I used scale_y_continuous(labels = scales::comma) to make the large arrest numbers (like 150,000) easier to read.
# Plot 2: Gender Distribution with modern after_stat(count)
ggplot(arrest_data, aes(x = gender, fill = gender)) +
  geom_bar(width = 0.6) +
  geom_text(stat = 'count', aes(label = scales::comma(after_stat(count))), vjust = -0.5) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Gender Distribution of Apprehensions", x = "Gender", y = "Total Count")
Plot 2: Compares the two gender categories. I added geom_text() with after_stat(count) to place the exact count directly above each bar, which makes the predominance of one group over the other clear and reduces clutter.
Statistical Analysis
I conducted a Pearson’s Chi-Squared Test of Independence. This test was used to determine if the relationship between two categorical variables (State and Gender) is statistically significant or just due to random chance.
Hypotheses:
H(0): There is no association between the state of apprehension and gender (Independence).
H(a): There is a significant association between the state of apprehension and gender (Dependence).
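To make the hypotheses concrete, the sketch below computes the Pearson statistic by hand on a small made-up table and compares it with chisq.test(). Under H(0), the expected count in each cell is (row total × column total) / n, and the degrees of freedom are (rows − 1) × (columns − 1). The counts and the location names A, B, and C are hypothetical and are not taken from the arrest data.

```r
# Hypothetical 3 x 2 table of counts (location by gender)
observed <- matrix(c(20, 80,
                     35, 65,
                     50, 50),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(state = c("A", "B", "C"),
                                   gender = c("Female", "Male")))

n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# The built-in test should agree (no continuity correction applies beyond 2 x 2)
builtin <- chisq.test(observed)
print(c(by_hand = x_squared, builtin = unname(builtin$statistic)))
print(c(df = df, p_value = p_value))
```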
# 1. Creating the table
contingency_table <- table(arrest_data$apprehension_state, arrest_data$gender)
# 2. Performing the test without simulation (default)
chi_test_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
# 3. Viewing the results
print(chi_test_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 13981, df = 62, p-value < 2.2e-16
Interpretation
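The test returned X-squared = 13981 with 62 degrees of freedom and a p-value < 2.2e-16. Because the p-value is far below the 0.05 significance level, I reject the null hypothesis H(0). This indicates that the gender distribution is not uniform across states, that is, apprehension state and gender are statistically associated. R also warned that the chi-squared approximation may be incorrect, which typically means some cells have small expected counts, so the result should be read with that caveat.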
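The warning printed by chisq.test() above is usually triggered when some expected cell counts are small (a common rule of thumb is below 5). The sketch below shows, on made-up counts, how one might locate such cells and obtain a Monte Carlo p-value with the simulate.p.value argument. The table toy_counts and its row names are hypothetical, and whether simulation is the right remedy for the arrest data is a judgment call.

```r
# Hypothetical counts: location "C" is very sparse
toy_counts <- matrix(c(300, 900,
                       200, 700,
                         1,   3),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(state = c("A", "B", "C"),
                                     gender = c("Female", "Male")))

# Run the default test quietly so the expected counts can be inspected
result <- suppressWarnings(chisq.test(toy_counts))
small_cells <- result$expected < 5

# Rows with at least one small expected count
sparse_rows <- names(which(rowSums(small_cells) > 0))
print(result$expected)
print(sparse_rows)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)  # for reproducibility of the simulation
simulated <- chisq.test(toy_counts, simulate.p.value = TRUE, B = 2000)
print(simulated$p.value)
```

An alternative would be to pool or exclude very sparse locations before testing, which changes the question slightly and should be reported if done.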
Practical Meaning
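In practical terms, the share of men and women among those arrested differs from state to state. Because the sample is very large, even modest differences produce a tiny p-value, so this result shows that an association exists, not how strong it is.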
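One way to see how the gender split varies by location, and to gauge the strength of the association, is to look at within-location proportions alongside an effect size such as Cramér's V. The sketch below uses made-up counts. The object names are hypothetical, and Cramér's V is a suggested addition here, not part of the analysis above.

```r
# Hypothetical counts (location by gender)
toy_table <- matrix(c(30, 70,
                      45, 55,
                      60, 40),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(state = c("A", "B", "C"),
                                    gender = c("Female", "Male")))

# Share of each gender within each location (each row sums to 1)
gender_share <- prop.table(toy_table, margin = 1)
print(round(gender_share, 2))

# Cramér's V = sqrt(X^2 / (n * (min(rows, columns) - 1))), ranging from 0 to 1
test <- chisq.test(toy_table)
n <- sum(toy_table)
k <- min(nrow(toy_table), ncol(toy_table)) - 1
cramers_v <- sqrt(unname(test$statistic) / (n * k))
print(cramers_v)
```

Applied to the real contingency table, the same two steps would show which locations depart most from the overall gender split and whether the association is weak or strong despite the very small p-value.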
Conclusion and Future Directions
The analysis identifies Texas (TX) as the primary hotspot with 153,526 arrests, and Males as the predominant group with 516,022 cases. These findings highlight a massive concentration of enforcement in specific border regions and a significant gender imbalance in apprehension data.
In future research, I would incorporate the case_threat_level variable to investigate whether the high volume of male arrests in states like Texas corresponds to specific security risk categories, or whether policy changes are needed for more balanced resource allocation.