This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
Loading my libraries
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyverse)
Warning: package ‘ggplot2’ was built under R version 4.3.3── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.4
✔ ggplot2 3.5.2 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.0── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(readxl)
library(readr)
library(stringr)
library(janitor)
Warning: package ‘janitor’ was built under R version 4.3.3
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
Here I’m importing the data, and I’m skipping 6 because the first 6 rows of the excel file have summary information, but the data itself doesn’t actually start. Then I’m using Summary to check what my time range is, looking specifically at the Apprehension column.
AdminArrest <- read_excel("/Users/eleanorprickettmorgan/Desktop/DeportationData/AdminArrest.xlsx", skip = 6)
View(AdminArrest)
summary(AdminArrest)
Apprehension Date Apprehension State Apprehension County Apprehension AOR Final Program
Min. :2023-09-01 00:00:00.00 Length:291722 Mode:logical Length:291722 Length:291722
1st Qu.:2024-04-19 06:28:15.00 Class :character NA's:291722 Class :character Class :character
Median :2024-12-18 17:00:00.00 Mode :character Mode :character Mode :character
Mean :2024-10-23 16:33:28.67
3rd Qu.:2025-05-07 10:15:40.50
Max. :2025-07-29 01:14:44.00
Final Program Group Apprehension Method Apprehension Criminality Case Status Case Category
Length:291722 Length:291722 Length:291722 Length:291722 Length:291722
Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character
Departed Date Departure Country Final Order Yes No Final Order Date
Min. :1923-09-14 00:00:00.00 Length:291722 Length:291722 Min. :1967-01-06 00:00:00.00
1st Qu.:2024-07-05 00:00:00.00 Class :character Class :character 1st Qu.:2019-06-05 00:00:00.00
Median :2025-02-10 00:00:00.00 Mode :character Mode :character Median :2024-04-16 00:00:00.00
Mean :2024-12-05 05:22:06.14 Mean :2021-03-03 06:16:09.34
3rd Qu.:2025-05-23 00:00:00.00 3rd Qu.:2025-02-27 00:00:00.00
Max. :2025-07-28 00:00:00.00 Max. :2025-07-29 00:00:00.00
NA's :123300 NA's :113703
Birth Date Birth Year Citizenship Country Gender Apprehension Site Landmark Alien File Number
Length:291722 Min. :1934 Length:291722 Length:291722 Length:291722 Length:291722
Class :character 1st Qu.:1983 Class :character Class :character Class :character Class :character
Mode :character Median :1991 Mode :character Mode :character Mode :character Mode :character
Mean :1990
3rd Qu.:1998
Max. :2025
EID Case ID EID Subject ID Unique Identifier
Length:291722 Length:291722 Length:291722
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
To check for duplicates I’m going to look at the unique identifier column.
AdminArrest %>%
group_by(`Unique Identifier`) %>%
filter(n()==2)
NA
I get back 27,214 rows where the Unique identifier repeated itself. But I can tell that there are certain instances where the same identifier appears twice with a different apprehension date. So I’m going to try again looking at Apprehension Date in addition to the Unique Identifier column.
AdminArrest %>%
group_by(`Unique Identifier`, `Apprehension Date`) %>%
filter(n()==2)
When I filter by both of these things I get 3,304 rows. That suggests that maybe certain individuals were apprehended twice, or have been in the system multiple times over the course of many years, this is closer to the true number of duplicates. So now I’m going to remove those potential duplicates using the distinct function.
cleaned_AdminArrest <- AdminArrest %>%
distinct(`Unique Identifier`,`Apprehension Date`, .keep_all = TRUE)
View(cleaned_AdminArrest)
When I run this I now have 289,927 entries instead of the original 291722. I would probably still ask an expert who works with this data if there are other ways I might not be catching duplicates but this makes the most sense to me.
To check for consistency issues, I’m going to look at the column names dealing primarily with characters. So for example for states I know that I want 50 ideally, and I would check by doing the following:
cleaned_AdminArrest %>%
group_by(`Apprehension State`) %>%
tally()
Here there are a couple consistency issues, some of the things included in States are three locations labeled “ARMED FORCES - EUROPE, ARMED FORCES - THE AMERICAS, ARMED SERVICES - THE PACIFIC.” I have no idea what differentiates armed forces vs. services, nor am I clear if these are arrests that took place on military bases or as a part of Military operations. Worth noting that we have 62 rows, instead of 50, which is partially because of territories under US control (i.e. Puerto Rico, Federated States of Micronesia, Guam, etc.) District of Columbia is also it’s own entry, and it’s not a state. There are also two lone entries in areas outside of US control, one just labeled “MEXICO,” the other labeled “TAMAULIPAS,” which is a territory in Mexico. Those seem like individual data entry issues or miscategorizatons. Aside from that, the states themselves are spelled in consistent ways.
Now I’m going to perform a similar test on the AOR column for consistency. From the ICE.gov website I’m expecting 25 AOR’s
cleaned_AdminArrest %>%
group_by(`Apprehension AOR`) %>%
tally()
NA
Here I got 27 rows back, one corresponded to N/A, with 5223 entries not assigned to a specific geography. There’s also a mysterious HQ Area of Responsibility, which is not listed online with only 50 entries. I would again have to ask someone who works with this data regularly what this means.
Now I’ll do the same with the final program column:
cleaned_AdminArrest %>%
group_by(`Final Program`) %>%
tally()
Here the most obvious issue which populates first is what’s the difference between “287G Program” and “287g Task Force.” Because my project is in California which is (allegedly) not participating in 287g, I’m not worried about that. I would want to double check with an expert that some of these things do not refer to the same practical categories. I am concerned about the 487 Juveniles listed here.
Because this is ICE data, I would assume everything in the final program group column is ICE, but I’m just double checking.
cleaned_AdminArrest %>%
group_by(`Final Program Group`) %>%
tally()
That produced just one row, so that’s consistent.
Now I’m checking for apprehension method:
cleaned_AdminArrest %>%
group_by(`Apprehension Method`) %>%
tally()
In terms of how things are literally spelled yes there is consistency. In terms of what the data is saying here, I do have a lot of questions. For example only one person is listed as arrested in the “Presented During Inspection” category, which feels like a contradiction with the courthouse and field office check-ins having so many individuals grabbed by ICE. However, those might be a different category of enforcement that’s included in another data set (i.e. detentions). This is probably my flawed understanding of the legal system here, but still something I want to note.
Now I’m going to check for apprehension criminality:
cleaned_AdminArrest %>%
group_by(`Apprehension Criminality`) %>%
tally()
Here I get back only three rows, which is consistent with my understanding of folks are categorized in these situations.
Now I’ll check for case status:
cleaned_AdminArrest %>%
group_by(`Case Status`) %>%
tally()
Here I get back 14 rows with categories listed. The first chunk are numbered starting at 0, but notably missing 1 and 2. I don’t know what those categories could be. There are 4462 who are not in a category at all, which is confusing because one category is literally just for active cases. There are also some acronyms I don’t understand, “9-VR Witnessed” I assume is voluntary removal witnessed, how is that different than voluntary departure? I’m concerned also about the 41 deaths listed. I’d also like to understand the difference between “L-Legalization - Permanent Residence Granted” and “Z-SAW - Permanent Residence Granted.”
Now I’ll look at Case Category:
cleaned_AdminArrest %>%
group_by(`Case Category`) %>%
tally()
Here there are no glaring consistency issues except the 4412 uncategorized cases.
Now I’ll check for Departure Country:
cleaned_AdminArrest %>%
group_by(`Departure Country`) %>%
tally()
I got back 193 rows which is in line with the number of countries in the world (depending on who you ask, but at least in line with the number of countries the US recognizes). Clicking through I didn’t see any super weird duplicates. One row is N/A but from looking at the other columns in conjunction, I think a departure country is only entered when the departure happens. So the active cases don’t have a departure country listed.
Now I’ll check final order Yes or No (hopefully 2 rows)
cleaned_AdminArrest %>%
group_by(`Final Order Yes No`) %>%
tally()
Here we get back Yes, No, or Blank (4412 rows).
Now I’ll look at citizenship country
cleaned_AdminArrest %>%
group_by(`Citizenship Country`) %>%
tally()
My citizenship countries outnumber my departure countries by 3, I would want to compare those and see where I have additional citizenship but not departure. Notably no one here is uncategorized.
Now I’m checking for gender which I’m expecting 2 rows for:
cleaned_AdminArrest %>%
group_by(Gender) %>%
tally()
We get back three rows: Female, Male, and Unknown. Important to note that unknown is written in, and isn’t just a blank.
Now I’m checking for Apprehension Site Landmark. This is where I’m expecting the most inconsistencies.
cleaned_AdminArrest %>%
group_by(`Apprehension Site Landmark`) %>%
tally()
Here I get back 500 rows, and there are major differences in capitalization and usage of Dashes. Some include the state they were in, others do not. Some also includde the program they were in (297(g), CAP, etc.). For the purposes of my project I’d probably take the California subset and clean that, but this would likely be the most onerous part of working with this data.
Now to check for missing data, I’m going to use/refer to the original unclean data “AdminArrest”:
MissingData <-colSums(is.na(AdminArrest))
print(MissingData)
Apprehension Date Apprehension State Apprehension County Apprehension AOR
0 56709 291722 5976
Final Program Final Program Group Apprehension Method Apprehension Criminality
0 0 0 0
Case Status Case Category Departed Date Departure Country
4450 4450 123300 123341
Final Order Yes No Final Order Date Birth Date Birth Year
4450 113703 0 0
Citizenship Country Gender Apprehension Site Landmark Alien File Number
0 0 6264 2471
EID Case ID EID Subject ID Unique Identifier
4450 0 2473
I want to mutate the apprehension date column so that I can filter later by year and month, and even potentially the time of day an administrative arrest took place.
new_AdminArrest <-cleaned_AdminArrest%>%
mutate(ApprehensionYear = year(`Apprehension Date`) )%>%
mutate(ApprehensionMonth = month(`Apprehension Date`))%>%
mutate(ApprehensionDay = day(`Apprehension Date`))%>%
mutate(ApprehensionHour = hour(`Apprehension Date`))
View(new_AdminArrest)
Now I want to look at just administrative arrests in California, so I’m going to filter by the apprehension state column.
CA_AdminArrest <- new_AdminArrest %>%
filter(`Apprehension State` == "CALIFORNIA")
View(CA_AdminArrest)
I want to filter that further to just the Administrative arrests in the SF Area of Responsibility, because even though that’s actually a lot larger than the geographic scope my project is looking at, it at least includes my area the Bay Area.
SFAOR_AdminArrest <- CA_AdminArrest %>%
filter(`Apprehension AOR` == "San Francisco Area of Responsibility")
View(SFAOR_AdminArrest)
ANALYSIS SECTION:
A really basic comparison I might make to start is comparing the number of Administrative Arrests by year.
SFAOR_AdminArrest %>%
group_by(ApprehensionYear) %>%
tally()
But the table I get here actually isn’t that informative given that I only have data from three years. But maybe I just want to see the change in time by month of just this new administration (2025), So I’d likely do the following:
SFAOR_AdminArrestbyMonth <-SFAOR_AdminArrest %>%
filter(ApprehensionYear == "2025") %>%
group_by(ApprehensionMonth) %>%
tally()
View(SFAOR_AdminArrestbyMonth)
Just looking at the table I see there’s an increase in June which held in July, but it might be helpful to visualize it, so I’m going to look at plot it out:
ggplot(data=SFAOR_AdminArrestbyMonth) +
geom_line(aes(x=ApprehensionMonth, y=n)) +
xlim(1,8) +
ylim(0,600) +
labs( x = "Month", y = "Number of Arrests")
ggtitle("Administrative Arrests by Month")
$title
[1] "Administrative Arrests by Month"
attr(,"class")
[1] "labels"
Anecdotally, I’ve also heard this thing from a number of advocates that ICE actions (arrests) tend to happen in the early morning. I want to check if that’s true, so I’m going to look at the ApprehensionHour column that I mutated out of the original Apprehension Date column. It’s on a 24 hour clock (no am/pm). I’m going to use the CA_AdminArrest data because according to advocates this represents ICE policy so it should be true across the state.
CA_AdminArrest_byHour <- CA_AdminArrest %>%
group_by(ApprehensionHour) %>%
tally()
View(CA_AdminArrest_byHour)
To see that visually I’m going to chart it.
ggplot(data=CA_AdminArrest_byHour, aes(x = ApprehensionHour, y = n)) +
geom_bar(stat="identity", width=0.5, fill= "steelblue") +
labs( x = "Hour of the Day", y = "Number of Arrests") +
ggtitle("Administrative Arrests by Hour of the Day from Sept 2023 through July 2025") +
scale_x_continuous(n.breaks=23)
This shows that the number of arrests peaks betwen 9 and 10 am, and if the actual arrests are taking place at that hour we can infer that ICE/DHS had to be active before then. The majority of the activity seems to happen between 7am and 1pm.
Another thing I’m interested in is under the current administration, what are the top countries of origin for folks being arrested. I’m interested both statewide and for just the San Francisco AOR, but I’l start with just California
CaliforniaCitizenshipCountry2025 <- CA_AdminArrest %>%
filter(ApprehensionYear == "2025") %>%
group_by(`Citizenship Country`) %>%
tally()
View(CaliforniaCitizenshipCountry2025)
When I do this and sort by values my top 10 citizenship countries are Mexico, Guatemala, El Salvador, Colombia, Honduras, India, Venezuela, China, Nicaragua and Peru. To compare this to the SFAOR I’m doing a similar analysis
SFAORCitizenshipCountry2025 <- SFAOR_AdminArrest %>%
filter(ApprehensionYear == "2025") %>%
group_by(`Citizenship Country`) %>%
tally()
View(SFAORCitizenshipCountry2025)
When I sort this table by the count, I get slightly different results, but there is a lot of overlap. In the San Francisco Area of Responsibility the top 10 citizenship countries for arrested folks are Mexico, India, Guatemala, El Salvador, Colombia, Honduras, Peru, Nicaragua, Venezuela, and China. Mexico notably outpaces the rest of these places in number of arrests.
Something else I’m interested in is the “criminality” of those arrested in the San Francisco Area of responsibility. Again I want to look just in the context of the current administration. Those who are arrested are assigned to one of three categories of criminality, but I’m interested in the proportion that those categories show up in arrests.
SFAORArrestCriminality2025 <- SFAOR_AdminArrest %>%
filter(ApprehensionYear == "2025") %>%
group_by(`Apprehension Criminality`) %>%
tally()
View(SFAORArrestCriminality2025)
For those arrested the majority are convicted criminals, according to the data, but it’s notable that a sizable chunk are categorized under “other immigration violation.” Arrests are supposed to represent individuals have a real case against them, so the convicted criminal category should represented the majority. And these arrests are not the totality of deportations, if you look at the deportation data set in addition to this arrests data, you would find the majority of folks coming in to contact with ICE/DHS do not have any criminal convictions. It’s also worth noting that the convicted criminal category doesn’t break down further into what crimes folks were convicted of. Other reporting has confirmed that there are many folks who end up arrested where the basis of their conviction is a traffic violation. I talked to one lawyer in the Bay Area who confirmed they had a client where a prior conviction came in the form of a citation for failing to pay for BART. Are those crimes the same as murder? No, but under this categorization they’re flattened into one category together.