Data 101 Project 2

Introduction

Research question: Which 15 states have the highest frequency of arrests, and is there a predominant gender among those apprehended?

This project explores enforcement trends across various U.S. states using a data set containing information on individual arrests. My project focuses on identifying geographic hot spots and demographic trends using variables such as “apprehension_state” and “gender”.

The data set is a comprehensive record of law enforcement actions, consisting of 28 variables and 713,464 cases. For this study, I am focusing on the geographic location of the arrest and the demographic classification of the apprehended individuals. Understanding these patterns is crucial for understanding regional resource allocation and the demographic impacts of enforcement policies.

Dataset Source:

https://deportationdata.org/data/processed/ice.html

Data Analysis

In this section, I perform a bivariate categorical analysis to examine the frequency of arrests by location and gender. I will generate a ranked bar chart for the top 15 states and a frequency plot for gender distribution to address the research question.

# Loading necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales) # Required for comma formatting

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(readxl)
# Importing the dataset
arrest_data <- read_xlsx("arrests-latest.xlsx", col_types = "text")

# Viewing dimensions and structure
dim(arrest_data)

## [1] 713464     28

str(arrest_data)

## tibble [713,464 × 28] (S3: tbl_df/tbl/data.frame)
##  $ apprehension_date         : chr [1:713464] "44835" "44835" "44835" "44835" ...
##  $ apprehension_type         : chr [1:713464] NA NA NA NA ...
##  $ apprehension_state        : chr [1:713464] "TEXAS" NA NA "FLORIDA" ...
##  $ apprehension_aor          : chr [1:713464] "San Antonio Area of Responsibility" "Phoenix Area of Responsibility" "Phoenix Area of Responsibility" "Miami Area of Responsibility" ...
##  $ final_program             : chr [1:713464] "Non-Detained Docket Control" "Detained Docket Control" "Detained Docket Control" "287G Program" ...
##  $ arresting_agency          : chr [1:713464] "ICE" "ICE" "ICE" "ICE" ...
##  $ apprehension_method       : chr [1:713464] "ERO Reprocessed Arrest" "ERO Reprocessed Arrest" "ERO Reprocessed Arrest" "287(g) Program" ...
##  $ apprehension_criminality  : chr [1:713464] "3 Other Immigration Violator" "1 Convicted Criminal" "3 Other Immigration Violator" "2 Pending Criminal Charges" ...
##  $ case_status               : chr [1:713464] "A-Proceedings Terminated" "8-Excluded/Removed - Inadmissibility" "8-Excluded/Removed - Inadmissibility" "8-Excluded/Removed - Inadmissibility" ...
##  $ case_category             : chr [1:713464] "[8B] Excludable / Inadmissible - Under Adjudication by IJ" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" "[8C] Excludable / Inadmissible - Administrative Final Order Issued" ...
##  $ departure_country         : chr [1:713464] NA "PERU" "COLOMBIA" "HONDURAS" ...
##  $ final_order_yes_no        : chr [1:713464] "NO" "YES" "YES" "YES" ...
##  $ birth_year                : chr [1:713464] "1974" "1970" "1984" "1996" ...
##  $ citizenship_country       : chr [1:713464] "CUBA" "PERU" "COLOMBIA" "HONDURAS" ...
##  $ gender                    : chr [1:713464] "Male" "Female" "Female" "Male" ...
##  $ departed_date             : chr [1:713464] NA "45105" "44959" "44904" ...
##  $ final_order_date          : chr [1:713464] NA "45077" "44944" "44862" ...
##  $ apprehension_site_landmark: chr [1:713464] "WALKINS AT SAN ANTONIO" NA NA "JAC GENERAL AREA, NON-SPECIFIC" ...
##  $ operation                 : chr [1:713464] "-" NA NA "HQ Tracking of Processing done by 287(g) authorized State & Local LEA's." ...
##  $ toa_current_duty_site     : chr [1:713464] "ERO - San Antonio, TX Field Office" "ELOY, AZ, SERVICE PROCESSING CENTER (DOCKET CONTROL OFFICE)" "ELOY, AZ, SERVICE PROCESSING CENTER (DOCKET CONTROL OFFICE)" "Jacksonville Sheriff's Office" ...
##  $ case_criminality          : chr [1:713464] "3 Other Immigration Violator" "1 Convicted Criminal" "3 Other Immigration Violator" "2 Pending Criminal Charges" ...
##  $ case_threat_level         : chr [1:713464] "NA" "1" "NA" "NA" ...
##  $ unique_identifier         : chr [1:713464] "bad3911f91e15e59572fe96a226c1fdbcf82d799" "dafdb5d0e565d238b9093e99c42e305ac95ce2b4" "1db208374a6ce32dd03f1b6d876c9f692b756185" "50198f8883c9b118a1153af301c65bfef03bb058" ...
##  $ apprehension_date_time    : chr [1:713464] "44835.04861111111" "44835.0747337963" "44835.11998842593" "44835.12847222222" ...
##  $ duplicate_likely          : chr [1:713464] "FALSE" "FALSE" "FALSE" "FALSE" ...
##  $ file_original             : chr [1:713464] "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" "2026-ICLI-00005_Arrests_FY23_20260311_Redacted.xlsx" ...
##  $ sheet_original            : chr [1:713464] "FY2023" "FY2023" "FY2023" "FY2023" ...
##  $ row_original              : chr [1:713464] "124239" "145764" "19899" "53227" ...

glimpse(arrest_data)

## Rows: 713,464
## Columns: 28
## $ apprehension_date          <chr> "44835", "44835", "44835", "44835", "44835"…
## $ apprehension_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ apprehension_state         <chr> "TEXAS", NA, NA, "FLORIDA", "TEXAS", "TEXAS…
## $ apprehension_aor           <chr> "San Antonio Area of Responsibility", "Phoe…
## $ final_program              <chr> "Non-Detained Docket Control", "Detained Do…
## $ arresting_agency           <chr> "ICE", "ICE", "ICE", "ICE", "ICE", "ICE", "…
## $ apprehension_method        <chr> "ERO Reprocessed Arrest", "ERO Reprocessed …
## $ apprehension_criminality   <chr> "3 Other Immigration Violator", "1 Convicte…
## $ case_status                <chr> "A-Proceedings Terminated", "8-Excluded/Rem…
## $ case_category              <chr> "[8B] Excludable / Inadmissible - Under Adj…
## $ departure_country          <chr> NA, "PERU", "COLOMBIA", "HONDURAS", NA, NA,…
## $ final_order_yes_no         <chr> "NO", "YES", "YES", "YES", "NO", "NO", "NO"…
## $ birth_year                 <chr> "1974", "1970", "1984", "1996", "1972", "19…
## $ citizenship_country        <chr> "CUBA", "PERU", "COLOMBIA", "HONDURAS", "CU…
## $ gender                     <chr> "Male", "Female", "Female", "Male", "Female…
## $ departed_date              <chr> NA, "45105", "44959", "44904", NA, NA, NA, …
## $ final_order_date           <chr> NA, "45077", "44944", "44862", NA, NA, NA, …
## $ apprehension_site_landmark <chr> "WALKINS AT SAN ANTONIO", NA, NA, "JAC GENE…
## $ operation                  <chr> "-", NA, NA, "HQ Tracking of Processing don…
## $ toa_current_duty_site      <chr> "ERO - San Antonio, TX Field Office", "ELOY…
## $ case_criminality           <chr> "3 Other Immigration Violator", "1 Convicte…
## $ case_threat_level          <chr> "NA", "1", "NA", "NA", "NA", "3", "NA", "3"…
## $ unique_identifier          <chr> "bad3911f91e15e59572fe96a226c1fdbcf82d799",…
## $ apprehension_date_time     <chr> "44835.04861111111", "44835.0747337963", "4…
## $ duplicate_likely           <chr> "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"…
## $ file_original              <chr> "2026-ICLI-00005_Arrests_FY23_20260311_Reda…
## $ sheet_original             <chr> "FY2023", "FY2023", "FY2023", "FY2023", "FY…
## $ row_original               <chr> "124239", "145764", "19899", "53227", "8809…

summary(arrest_data)

##  apprehension_date  apprehension_type  apprehension_state apprehension_aor  
##  Length:713464      Length:713464      Length:713464      Length:713464     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  final_program      arresting_agency   apprehension_method
##  Length:713464      Length:713464      Length:713464      
##  Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   
##  apprehension_criminality case_status        case_category     
##  Length:713464            Length:713464      Length:713464     
##  Class :character         Class :character   Class :character  
##  Mode  :character         Mode  :character   Mode  :character  
##  departure_country  final_order_yes_no  birth_year        citizenship_country
##  Length:713464      Length:713464      Length:713464      Length:713464      
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##     gender          departed_date      final_order_date  
##  Length:713464      Length:713464      Length:713464     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  apprehension_site_landmark  operation         toa_current_duty_site
##  Length:713464              Length:713464      Length:713464        
##  Class :character           Class :character   Class :character     
##  Mode  :character           Mode  :character   Mode  :character     
##  case_criminality   case_threat_level  unique_identifier 
##  Length:713464      Length:713464      Length:713464     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  apprehension_date_time duplicate_likely   file_original     
##  Length:713464          Length:713464      Length:713464     
##  Class :character       Class :character   Class :character  
##  Mode  :character       Mode  :character   Mode  :character  
##  sheet_original     row_original      
##  Length:713464      Length:713464     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Data Manipulation and Cleaning

I used four dplyr functions (filter, mutate, group_by, and summarize) to clean and summarize the data. I removed true missing values and literal “NA” strings, standardized text capitalization, and converted state names into two-letter abbreviations to ensure the visualizations were readable.

## Data Manipulation (cleaning to remove all NAs) 
arrest_data <- arrest_data %>%
  # Removing true NAs, then trimming whitespace and standardizing capitalization
  filter(!is.na(apprehension_state), !is.na(gender)) %>% 
  mutate(apprehension_state = str_trim(str_to_title(as.character(apprehension_state))),
         gender = str_trim(str_to_title(as.character(gender)))) %>% 
  # Removing "NA" or "Unknown" strings
  filter(apprehension_state != "Na", apprehension_state != "",
         gender != "Na", gender != "Unknown", gender != "") %>% 
  # Converting to Acronyms and dropping unused factor levels
  mutate(state_code = state.abb[match(apprehension_state, state.name)]) %>% 
  mutate(state_code = ifelse(is.na(state_code), apprehension_state, state_code)) %>%
  mutate(gender = fct_drop(as.factor(gender)))

# Summary for Top 15 plot
state_top15 <- arrest_data %>%
  group_by(state_code) %>% 
  summarize(total_arrests = n(), .groups = 'drop') %>% 
  slice_max(total_arrests, n = 15) %>% 
  arrange(desc(total_arrests))
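
The cleaning steps above can be traced on a few rows. The sketch below is only an illustration: the six rows are made up (they are not taken from the arrests file), the helper `to_title` is a hypothetical base R stand-in for `str_trim(str_to_title(x))`, and base R is used so the example runs without the tidyverse.

```r
# Toy data (made up for illustration, not taken from the arrests file)
toy <- data.frame(
  apprehension_state = c("TEXAS", " florida ", NA, "NA", "GUAM", "TEXAS"),
  gender             = c("Male", "Female", "Male", "Unknown", "Female", "NA"),
  stringsAsFactors   = FALSE
)

# Step 1: drop true missing values first
toy <- toy[!is.na(toy$apprehension_state) & !is.na(toy$gender), ]

# Step 2: trim whitespace and convert to title case
# (hypothetical helper standing in for str_trim(str_to_title(x)))
to_title <- function(x) {
  gsub("\\b([a-z])", "\\U\\1", tolower(trimws(x)), perl = TRUE)
}
toy$apprehension_state <- to_title(toy$apprehension_state)
toy$gender             <- to_title(toy$gender)

# Step 3: drop literal "NA" strings (now "Na"), blanks, and "Unknown"
bad <- c("Na", "", "Unknown")
toy <- toy[!(toy$apprehension_state %in% bad) & !(toy$gender %in% bad), ]

# Step 4: map full names to two-letter codes; names that are not in
# state.name (territories, for example) keep their full name
code <- state.abb[match(toy$apprehension_state, state.name)]
toy$state_code <- ifelse(is.na(code), toy$apprehension_state, code)

print(toy)
```

Three rows survive: the true NA, the literal "NA" state, and the "NA" gender rows are dropped, and "Guam" keeps its full name because it is not one of the 50 entries in `state.name`.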

Visualizations

# Plot 1: Top 15 States with Comma Formatting
ggplot(state_top15, aes(x = reorder(state_code, -total_arrests), y = total_arrests)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.7) +
  scale_y_continuous(labels = scales::comma) + 
  theme_minimal() +
  labs(title = "Top 15 States with Most Arrests", x = "State Code", y = "Total Arrests")

Plot 1 uses a bar chart to rank the top 15 states by total arrests. I used scale_y_continuous(labels = scales::comma) to make the large arrest numbers (such as 150,000) easier to read.

# Plot 2: Gender Distribution with modern after_stat(count)
ggplot(arrest_data, aes(x = gender, fill = gender)) +
  geom_bar(width = 0.6) +
  geom_text(stat = 'count', aes(label = scales::comma(after_stat(count))), vjust = -0.5) + 
  scale_y_continuous(labels = scales::comma) + 
  theme_minimal() +
  labs(title = "Gender Distribution of Apprehensions", x = "Gender", y = "Total Count")

Plot 2 compares the two gender categories. I added geom_text() with after_stat(count) to place the exact count directly above each bar. This made the predominance of one group over the other clearer and eliminated clutter.

C. Statistical Analysis

I conducted a Pearson’s Chi-Squared Test of Independence. This test was used to determine if the relationship between two categorical variables (State and Gender) is statistically significant or just due to random chance.

Hypotheses:

H(0): There is no association between the state of apprehension and gender (Independence).

H(a): There is a significant association between the state of apprehension and gender (Dependence).

# 1. Creating the table
contingency_table <- table(arrest_data$apprehension_state, arrest_data$gender)

# 2. Performing the test without simulation (default)
chi_test_result <- chisq.test(contingency_table)

## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect

# 3. Viewing the results
print(chi_test_result)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 13981, df = 62, p-value < 2.2e-16

Interpretation

The test resulted in an X-squared value of 13981 and a p-value < 2.2e-16. Since the p-value is far below the 0.05 significance level, I reject the null hypothesis H(0). This indicates that the gender distribution is not uniform across states, suggesting a statistically significant dependency between location and gender.
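
The warning "Chi-squared approximation may be incorrect" usually appears when some cells of the table have small expected counts. The sketch below shows one way to check for that and two common responses. The counts are made up, the group labels A, B, and C are placeholders, and the cutoff of 50 arrests per row is an arbitrary assumption for illustration.

```r
# Toy contingency table (made-up counts): the sparse row C triggers the warning
toy_table <- matrix(
  c(900, 300,
    450, 160,
      3,   1),
  ncol = 2, byrow = TRUE,
  dimnames = list(state = c("A", "B", "C"), gender = c("Male", "Female"))
)

# Expected counts under independence; cells below 5 are the usual concern
expected    <- suppressWarnings(chisq.test(toy_table))$expected
small_cells <- sum(expected < 5)
print(round(expected, 2))
print(small_cells)

# Option 1: Monte Carlo p-value, which does not rely on the approximation
set.seed(101)
sim_test <- chisq.test(toy_table, simulate.p.value = TRUE, B = 2000)
print(sim_test$p.value)

# Option 2: keep only rows with enough observations (cutoff is an assumption)
dense_table <- toy_table[rowSums(toy_table) >= 50, ]
dense_test  <- chisq.test(dense_table)
print(dense_test)
```

Restricting the table to the top 15 states, as done later in this report, is one version of the second option.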

Practical Meaning

This result suggests that the gender split is not the same in every state, given the wide variation observed across the data set. Because such an extreme p-value seemed off, I looked into other methods to get more concrete results.

Re-Running the Chi-Squared Test to Be More Specific to My Question

# Defining variables to be used in test
arrest_data_top15 <- arrest_data %>%
  filter(state_code %in% state_top15$state_code)


# 1. Creating the table using only the Top 15 states
contingency_table <- table(arrest_data_top15$state_code, arrest_data_top15$gender)

# 2. Performing the test
# Note: Using state_code (acronyms) makes the resulting table much cleaner
chi_test_result <- chisq.test(contingency_table)

# 3. Viewing the results
print(chi_test_result)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 9338.1, df = 14, p-value < 2.2e-16

Visualization of the Data Tested by the Updated Chi-Squared Test

# Joining the total_arrests count back to the filtered data for ordering
arrest_data_top15 <- arrest_data_top15 %>%
  left_join(state_top15, by = "state_code")

# Re-running the plot
ggplot(arrest_data_top15, aes(x = reorder(state_code, -total_arrests), fill = gender)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(title = "Gender Proportions within Top 15 States",
       subtitle = "Visualizing the Dependency Tested by Chi-Square",
       x = "State Code", 
       y = "Percentage of Total Arrests",
       fill = "Gender")
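
The percentages behind a stacked proportion chart like this one can also be read off as a table. A minimal sketch, assuming a table laid out like table(state_code, gender). The counts are made up and S1, S2, and S3 are placeholder labels, not real states.

```r
# Toy counts (made up), laid out like table(state_code, gender)
toy_table <- matrix(
  c(800, 200,
    450,  50,
    140,  60),
  ncol = 2, byrow = TRUE,
  dimnames = list(state_code = c("S1", "S2", "S3"), gender = c("Male", "Female"))
)

# margin = 1 divides each row by its row total, so every row sums to 1
gender_share <- prop.table(toy_table, margin = 1)
print(round(100 * gender_share, 1))
```

Each row of the result corresponds to one bar in the chart, which makes it easier to quote exact percentages in the write-up.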

Cramer’s V

library(lsr)

# Calculate Cramer's V
effect_size <- cramersV(contingency_table)
print(paste("Cramer's V:", round(effect_size, 4)))

## [1] "Cramer's V: 0.1442"

Cramer’s V measures the strength of the association between two categorical variables on a scale from 0 to 1. For my project, I am specifically measuring the association between state and gender.
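
Cramer’s V can also be computed by hand from the chi-squared statistic using the standard formula V = sqrt(X² / (n × (k − 1))), where n is the total count and k is the smaller of the number of rows and columns. The sketch below applies that formula in base R to made-up counts. It is meant to show where the number comes from, not to replace the cramersV() call above.

```r
# Toy counts (made up): three groups by two categories
toy_table <- matrix(
  c(800, 200,
    450,  50,
    140,  60),
  ncol = 2, byrow = TRUE
)

test <- chisq.test(toy_table)
chi2 <- unname(test$statistic)   # Pearson X-squared
n    <- sum(toy_table)           # total number of observations
k    <- min(dim(toy_table))      # smaller of (rows, columns)

# Standard formula: V = sqrt(X^2 / (n * (k - 1)))
cramers_v <- sqrt(chi2 / (n * (k - 1)))
print(round(cramers_v, 4))
```

Because n sits in the denominator, V does not grow just because the data set is large, unlike the X-squared statistic itself.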

Interpretation of my result of 0.1442:

1. The “Real-World” Strength

While my p-value said the relationship was “significant,” my Cramer’s V tells us the relationship is small / weak based on the table below.

0.00 – 0.10: Negligible / Very Weak
0.10 – 0.30: Small / Weak
0.30 – 0.50: Moderate
0.50 – 1.00: Strong

2. What this means for my project

Even though the gender distribution is “statistically different” between states like Texas and California, the difference isn’t massive. While there is a mathematical dependency between location and gender, the dominance of male arrests is fairly consistent across the country. The state you are in has only a small impact on the gender of those being arrested.

3. Why did the p-value lie to me?

With 600,000+ rows, my Chi-Square test was “hypersensitive.” It will find a significant result even if the difference is minute. Cramer’s V is the reality check that tells us the difference, while real, is relatively minor.
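
The effect of sample size can be seen by running the same test on one pattern at two sizes. In the sketch below the counts are made up and `summarize_test` is a hypothetical helper. The second table is the first multiplied by 1,000, so the proportions are identical and only the number of rows differs.

```r
# Same 52/48 versus 48/52 split at two sample sizes (made-up counts)
small_table <- matrix(c(52, 48, 48, 52), ncol = 2, byrow = TRUE)
large_table <- small_table * 1000

# Hypothetical helper returning the p-value and Cramer's V for a table
summarize_test <- function(tab) {
  # correct = FALSE turns off the Yates correction so both sizes are comparable
  test <- chisq.test(tab, correct = FALSE)
  v    <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
  c(p_value = test$p.value, cramers_v = v)
}

small_result <- summarize_test(small_table)
large_result <- summarize_test(large_table)
print(rbind(small = small_result, large = large_result))
```

With 200 observations the split is not significant, while with 200,000 observations the same split produces a tiny p-value. Cramer’s V is the same in both cases, which is why it is the more useful measure of how large the difference actually is.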

D. Conclusion and Future Directions

Based on just the Chi-Squared test (because that is what we have covered so far in class), the analysis identifies Texas (TX) as the primary hot spot with 153,526 arrests, and males as the predominant group with 516,022 cases. These findings highlight a massive concentration of enforcement in specific border regions and a significant gender imbalance in the apprehension data. After computing Cramer’s V, which showed that the state you are in has only a small impact on the gender of those being arrested, I am less inclined to say that there is a true association between state and gender, and the answer to my question would be “no”.

In future research, I would incorporate the case_threat_level variable to investigate whether the high volume of male arrests in states like Texas corresponds to specific security risk categories, or whether policy changes are needed for more balanced resource allocation. Since my Cramer’s V result shows only a weak effect, I might have to go back to the drawing board and choose a question that yields more conclusive results.


Arinze Ugbah
