This project analyzes the Florida voter file from 2016 to 2026 to determine whether registrations with missing race have increased and whether the share of Hispanic registrants recorded with missing race has increased.
The research question is simple:
Have missing-race registrations among Hispanics gone up during the Trump administration?
The data come from a surname-matched Florida voter file that includes variables such as voter ID, county, gender, race, birth date, registration date, party affiliation, voter status, and surname-based predicted race probabilities.
Because the file combines self-reported race, registration dates, and surname-based race predictions, it is well suited to this project.
This project is important because the Trump administration has come under heavy scrutiny for its handling of immigration enforcement, especially as it affects Hispanic people. If the data show a change after 2024, that could indicate that Hispanic people are afraid to put their race on government documents.
It could also lead polls to incorrect conclusions, because counts of Hispanic registrants will be too low if more Hispanic people leave race missing.
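First, I loaded the surname-matched Florida voter file and dropped irrelevant variables.
# Packages used throughout the analysis (assumed to be installed)
library(data.table)  # fread()
library(janitor)     # clean_names()
library(tidyverse)   # dplyr, tidyr, tibble, stringr, ggplot2
library(lubridate)   # mdy(), floor_date(), interval(), time_length()
surname_path <- "C:/Users/xande/OneDrive/Desktop/Data Science Class/VOTER FILES FLORIDA/fl_vr_surname/fl_vr_surname.csv"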
months_to_keep <- 48
election_dates <- tibble(
election_label = c("2016 Election", "2020 Election", "2024 Election"),
election_date = as.Date(c("2016-11-08", "2020-11-03", "2024-11-05"))
)
vf <- fread(file = surname_path) %>%
clean_names()
vf <- vf %>%
select(
voter_id,
county_code,
name_last,
name_suffix,
name_first,
name_middle,
gender,
race,
birth_date,
registration_date,
party_affiliation,
voter_status,
county,
surname,
pred_whi,
pred_bla,
pred_his,
pred_asi,
pred_oth
)
Next, I created a subset containing only the missing-race records for quicker analysis, along with a monthly summary of all new registrations since November 2016.
vf_missing_only <- vf %>%
mutate(
race = str_trim(as.character(race))
) %>%
filter(is.na(race) | race == "" | race == "9")
vf2 <- vf %>%
mutate(
race = str_trim(as.character(race)),
registration_date = suppressWarnings(mdy(as.character(registration_date)))
) %>%
mutate(
hispanic = case_when(
race == "4" ~ 1,
race %in% c("1", "2", "3", "5", "6", "7") ~ 0,
race %in% c("9", "", NA) ~ NA_real_,
TRUE ~ NA_real_
),
missing_ethnicity = if_else(is.na(hispanic), 1, 0),
registration_month = floor_date(registration_date, "month")
) %>%
filter(!is.na(registration_date), !is.na(registration_month)) %>%
filter(registration_date >= as.Date("2016-11-01"))
state_month <- vf2 %>%
group_by(registration_month) %>%
summarise(
total_new_regs = n(),
hispanic_new_regs = sum(hispanic == 1, na.rm = TRUE),
known_race_regs = sum(!is.na(hispanic), na.rm = TRUE),
missing_race_regs = sum(is.na(hispanic), na.rm = TRUE),
pct_hispanic_among_all = hispanic_new_regs / total_new_regs,
pct_hispanic_among_known = if_else(
known_race_regs > 0,
hispanic_new_regs / known_race_regs,
NA_real_
),
pct_missing_ethnicity = missing_race_regs / total_new_regs,
.groups = "drop"
)
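The two Hispanic percentages above differ only in their denominator. As a minimal sketch, the base R snippet below applies the same coding rule to ten made-up race codes (the `toy_race` vector is hypothetical and is not taken from the voter file):

```r
# Hypothetical race codes for ten new registrations in one month.
# "4" = Hispanic, "9" or "" = missing, other listed codes = known non-Hispanic.
toy_race <- c("4", "5", "5", "9", "4", "3", "", "5", "4", "5")

# Same rule as the case_when() above:
# 1 = Hispanic, 0 = known non-Hispanic, NA = missing or unrecognized.
hispanic <- ifelse(
  toy_race == "4", 1,
  ifelse(toy_race %in% c("1", "2", "3", "5", "6", "7"), 0, NA)
)

total_new_regs    <- length(hispanic)                   # 10
hispanic_new_regs <- sum(hispanic == 1, na.rm = TRUE)   # 3
known_race_regs   <- sum(!is.na(hispanic))              # 8

pct_hispanic_among_all   <- hispanic_new_regs / total_new_regs   # 0.300
pct_hispanic_among_known <- hispanic_new_regs / known_race_regs  # 0.375
pct_missing_ethnicity    <- mean(is.na(hispanic))                # 0.200
```

The gap between the two Hispanic percentages widens as the missing share grows, which is why both are reported.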
To make the elections easier to compare, I re-indexed each registration month as the number of months since each presidential election and kept months 0 through 48.
comparison_df <- state_month %>%
tidyr::crossing(election_dates) %>%
mutate(
election_month = floor_date(election_date, "month"),
months_since_election =
(year(registration_month) - year(election_month)) * 12 +
(month(registration_month) - month(election_month))
) %>%
filter(months_since_election >= 0, months_since_election <= months_to_keep)
coverage_table <- comparison_df %>%
group_by(election_label) %>%
summarise(
first_month = min(months_since_election, na.rm = TRUE),
last_month = max(months_since_election, na.rm = TRUE),
number_of_months = n(),
.groups = "drop"
)
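The month index is plain calendar arithmetic: twelve times the difference in years plus the difference in months. Here is a small base R sketch of the same calculation; the helper name `months_between` is made up for this illustration and does not appear in the analysis code:

```r
# Whole-month difference between two dates, ignoring the day of the month.
# `months_between` is a hypothetical helper used only for illustration.
months_between <- function(later, earlier) {
  later   <- as.POSIXlt(later)
  earlier <- as.POSIXlt(earlier)
  (later$year - earlier$year) * 12 + (later$mon - earlier$mon)
}

election_2024 <- as.Date("2024-11-05")

same_month  <- months_between(as.Date("2024-11-20"), election_2024)  # 0
april_2025  <- months_between(as.Date("2025-04-01"), election_2024)  # 5
january_26  <- months_between(as.Date("2026-01-15"), election_2024)  # 14
```

Because the comparison is made at the month level, month 0 also includes registrations made in the first days of November, before election day.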
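This table shows, for each election, the average monthly percentage of new registrations that were Hispanic (out of all registrations and out of those with a known race) and the average monthly percentage with missing race, over the post-election months that all three elections have in common.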
max_common_month <- comparison_df %>%
filter(election_label == "2024 Election") %>%
summarise(max_month = max(months_since_election, na.rm = TRUE)) %>%
pull(max_month)
summary_table <- comparison_df %>%
filter(months_since_election <= max_common_month) %>%
mutate(Election = str_remove(election_label, " Election")) %>%
group_by(Election) %>%
summarise(
`Average % Hispanic (All Registrations)` = round(mean(pct_hispanic_among_all, na.rm = TRUE) * 100, 2),
`Average % Hispanic (Known Race Only)` = round(mean(pct_hispanic_among_known, na.rm = TRUE) * 100, 2),
`Average % Missing Race/Ethnicity` = round(mean(pct_missing_ethnicity, na.rm = TRUE) * 100, 2),
.groups = "drop"
)
knitr::kable(
summary_table,
caption = paste("Average Registration Patterns, Months 0 to", max_common_month, "After Each Election"),
align = c("l", "c", "c", "c")
)
| Election | Average % Hispanic (All Registrations) | Average % Hispanic (Known Race Only) | Average % Missing Race/Ethnicity |
|---|---|---|---|
| 2016 | 20.73 | 21.17 | 2.03 |
| 2020 | 18.13 | 18.51 | 2.05 |
| 2024 | 19.42 | 20.53 | 5.41 |
This graph shows, for each of the 48 months after each election, the percentage of new registrants whose race was recorded as missing.
g4 <- ggplot(
comparison_df,
aes(x = months_since_election, y = pct_missing_ethnicity, color = election_label, group = election_label)
) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
labs(
title = "Missing Race/Ethnicity Share After Each Presidential Election",
x = "Months Since Election",
y = "Missing Share",
color = "Election"
) +
facet_wrap(~ election_label) +
theme_minimal()
g4
This graph shows that in the 14 months following the 2024 presidential election, the percentage of new registrants with missing race is substantially higher than after 2016 and 2020, which are very similar to each other.
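To put a number on that gap, the averages from the summary table above can be compared directly. This is a small arithmetic sketch using the table's rounded values, not a statistical test:

```r
# Average % missing race/ethnicity, copied from the summary table above.
pct_missing <- c(`2016` = 2.03, `2020` = 2.05, `2024` = 5.41)

# Absolute gap in percentage points between the 2024 and 2020 periods.
gap_points <- pct_missing[["2024"]] - pct_missing[["2020"]]   # 3.36

# Relative size of the 2024 average compared with 2020.
ratio_2024_to_2020 <- pct_missing[["2024"]] / pct_missing[["2020"]]  # about 2.64
```

In other words, the average missing share after the 2024 election is roughly 2.6 times the share after the 2020 election.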
I then used the surname-matched race predictions to estimate the racial composition of missing-race registrants.
vf_missing_pred <- vf_missing_only %>%
mutate(
registration_date = suppressWarnings(mdy(as.character(registration_date))),
birth_date = suppressWarnings(mdy(as.character(birth_date))),
registration_month = floor_date(registration_date, "month"),
pred_whi = suppressWarnings(as.numeric(pred_whi)),
pred_bla = suppressWarnings(as.numeric(pred_bla)),
pred_his = suppressWarnings(as.numeric(pred_his)),
pred_asi = suppressWarnings(as.numeric(pred_asi)),
pred_oth = suppressWarnings(as.numeric(pred_oth)),
gender = str_trim(as.character(gender)),
party_affiliation = str_trim(as.character(party_affiliation)),
voter_status = str_trim(as.character(voter_status)),
age_at_registration = as.numeric(time_length(interval(birth_date, registration_date), "years"))
) %>%
filter(!is.na(registration_date), !is.na(registration_month)) %>%
filter(registration_date >= as.Date("2016-11-01"))
max_pred <- max(
c(vf_missing_pred$pred_whi, vf_missing_pred$pred_bla, vf_missing_pred$pred_his,
vf_missing_pred$pred_asi, vf_missing_pred$pred_oth),
na.rm = TRUE
)
if (is.finite(max_pred) && max_pred > 1) {
vf_missing_pred <- vf_missing_pred %>%
mutate(
pred_whi = pred_whi / 100,
pred_bla = pred_bla / 100,
pred_his = pred_his / 100,
pred_asi = pred_asi / 100,
pred_oth = pred_oth / 100
)
}
missing_pred_month <- vf_missing_pred %>%
group_by(registration_month) %>%
summarise(
avg_pred_whi = mean(pred_whi, na.rm = TRUE),
avg_pred_bla = mean(pred_bla, na.rm = TRUE),
avg_pred_his = mean(pred_his, na.rm = TRUE),
avg_pred_asi = mean(pred_asi, na.rm = TRUE),
avg_pred_oth = mean(pred_oth, na.rm = TRUE),
n_missing = n(),
.groups = "drop"
)
missing_pred_comparison <- missing_pred_month %>%
tidyr::crossing(election_dates) %>%
mutate(
election_month = floor_date(election_date, "month"),
months_since_election =
(year(registration_month) - year(election_month)) * 12 +
(month(registration_month) - month(election_month))
) %>%
filter(months_since_election >= 0, months_since_election <= months_to_keep)
missing_pred_long <- missing_pred_comparison %>%
pivot_longer(
cols = c(avg_pred_whi, avg_pred_bla, avg_pred_his, avg_pred_asi, avg_pred_oth),
names_to = "predicted_race",
values_to = "avg_share"
) %>%
mutate(
predicted_race = recode(
predicted_race,
avg_pred_whi = "Predicted White",
avg_pred_bla = "Predicted Black",
avg_pred_his = "Predicted Hispanic",
avg_pred_asi = "Predicted Asian",
avg_pred_oth = "Predicted Other"
)
)
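Averaging the prediction columns treats each probability as that registrant's expected contribution to a group, so the monthly mean is the expected share of the group. The sketch below uses five made-up probabilities (`toy_pred_his` is hypothetical) and assumes they arrive on a 0 to 100 scale, to mirror the rescaling check above:

```r
# Hypothetical surname-based Hispanic probabilities for five missing-race
# registrants, stored as percentages (0-100) rather than proportions (0-1).
toy_pred_his <- c(92, 4, 75, 10, 88)

# Same check as above: if any value exceeds 1, treat the column as percentages.
if (max(toy_pred_his, na.rm = TRUE) > 1) {
  toy_pred_his <- toy_pred_his / 100
}

# The mean probability is the expected share of Hispanic registrants.
avg_pred_his <- mean(toy_pred_his, na.rm = TRUE)   # 0.538

# A hard cutoff at 0.5 gives a different answer (3 of 5 = 0.6),
# because it discards how confident each prediction is.
share_over_half <- mean(toy_pred_his > 0.5)
```

Averaging probabilities rather than classifying each person keeps uncertain surnames from being counted entirely as one group.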
This table shows the average predicted racial composition of registrations with missing race across the post-election months that all three election periods have in common.
max_common_month <- comparison_df %>%
filter(election_label == "2024 Election") %>%
summarise(max_month = max(months_since_election, na.rm = TRUE)) %>%
pull(max_month)
predicted_race_summary <- missing_pred_comparison %>%
filter(months_since_election <= max_common_month) %>%
mutate(Election = str_remove(election_label, " Election")) %>%
group_by(Election) %>%
summarise(
`Average % Predicted White` = round(mean(avg_pred_whi, na.rm = TRUE) * 100, 2),
`Average % Predicted Hispanic` = round(mean(avg_pred_his, na.rm = TRUE) * 100, 2),
`Average % Predicted Black` = round(mean(avg_pred_bla, na.rm = TRUE) * 100, 2),
`Average % Predicted Asian` = round(mean(avg_pred_asi, na.rm = TRUE) * 100, 2),
`Average % Predicted Other` = round(mean(avg_pred_oth, na.rm = TRUE) * 100, 2),
.groups = "drop"
)
knitr::kable(
predicted_race_summary,
caption = paste("Average Predicted Race of Missing-Race Registrants After Each Election, Months 0 to", max_common_month),
align = c("l", "c", "c", "c", "c", "c")
)
| Election | Average % Predicted White | Average % Predicted Hispanic | Average % Predicted Black | Average % Predicted Asian | Average % Predicted Other |
|---|---|---|---|---|---|
| 2016 | 42.18 | 31.75 | 15.46 | 6.15 | 4.46 |
| 2020 | 49.45 | 23.74 | 17.16 | 4.76 | 4.89 |
| 2024 | 33.30 | 42.10 | 15.50 | 4.99 | 4.10 |
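As a rough check on how the two summary tables fit together, the missing share can be multiplied by the predicted Hispanic share among missing-race registrants. This is a back-of-the-envelope sketch only: it assumes that the product of two averages of monthly values approximates the joint share, which is not exactly true.

```r
# Values copied from the two summary tables above (percentages).
pct_missing       <- c(`2016` = 2.03,  `2020` = 2.05,  `2024` = 5.41)
pct_pred_hispanic <- c(`2016` = 31.75, `2020` = 23.74, `2024` = 42.10)

# Approximate percentage of all new registrations that are both
# missing race and predicted Hispanic.
# Assumption: product of averages is used as a stand-in for the joint share.
approx_missing_hispanic <- pct_missing * pct_pred_hispanic / 100
round(approx_missing_hispanic, 2)
```

Under that approximation, about 2.3% of new registrations after the 2024 election are missing-race registrants predicted to be Hispanic, compared with roughly 0.5% to 0.6% after the earlier elections.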
To see whether the changes were driven specifically by Hispanic registrants, I plotted the average surname-based predicted race shares of missing-race registrations for each month.
g5a <- ggplot(
missing_pred_long,
aes(x = months_since_election, y = avg_share, color = predicted_race, group = predicted_race)
) +
geom_line(linewidth = 1) +
geom_point(size = 1.8) +
scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
labs(
title = "Average Predicted Race Shares Within Missing-Race Registrations",
subtitle = "Based on surname-matched prediction columns",
x = "Months Since Election",
y = "Average Predicted Share",
color = "Predicted Race"
) +
facet_wrap(~ election_label) +
theme_minimal()
g5a
I also added vertical lines at the months when major ICE activity took place in Florida, to check visually whether the changes line up with those events.
event_markers_2024 <- tibble(
election_label = "2024 Election",
months_since_election = c(5, 13),
event_label = c("Operation Tidal Wave", "Pinellas Park / Tampa")
)
g5 <- ggplot(
missing_pred_long,
aes(
x = months_since_election,
y = avg_share,
color = predicted_race,
group = predicted_race
)
) +
geom_line(linewidth = 1) +
geom_point(size = 1.8) +
geom_vline(
data = event_markers_2024,
aes(xintercept = months_since_election),
inherit.aes = FALSE,
linetype = "dashed"
) +
geom_text(
data = event_markers_2024,
aes(
x = months_since_election,
y = 0.99 * max(missing_pred_long$avg_share, na.rm = TRUE),
label = event_label
),
inherit.aes = FALSE,
angle = 90,
vjust = -0.3,
hjust = 1,
size = 3
) +
scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
labs(
title = "Average Predicted Race Shares Within Missing-Race Registrations",
subtitle = "Based on surname-matched prediction columns, with Florida ICE event markers",
x = "Months Since Election",
y = "Average Predicted Share",
color = "Predicted Race"
) +
facet_wrap(~ election_label) +
theme_minimal(base_size = 13)
g5
These graphs show that after the 2024 election the predicted Hispanic share of missing-race registrants rose to the highest levels of the decade, and that the peaks roughly line up with major ICE activity in Florida.
This table shows which months of the decade had the highest percentage of missing-race registrants predicted to be Hispanic.
top_10_pred_hispanic <- missing_pred_comparison %>%
mutate(Election = str_remove(election_label, " Election")) %>%
select(
Election,
`Month After Election` = months_since_election,
`Registration Month` = registration_month,
`Average % Predicted Hispanic` = avg_pred_his
) %>%
arrange(desc(`Average % Predicted Hispanic`)) %>%
slice_head(n = 10) %>%
mutate(
`Registration Month` = format(`Registration Month`, "%B %Y"),
`Average % Predicted Hispanic` = round(`Average % Predicted Hispanic` * 100, 2)
)
knitr::kable(
top_10_pred_hispanic,
caption = "Top 10 Months by Percent of Missing-Race Registrants Predicted Hispanic",
align = c("l", "c", "l", "c")
)
| Election | Month After Election | Registration Month | Average % Predicted Hispanic |
|---|---|---|---|
| 2024 | 5 | April 2025 | 55.30 |
| 2024 | 6 | May 2025 | 53.89 |
| 2024 | 1 | December 2024 | 53.33 |
| 2024 | 2 | January 2025 | 51.17 |
| 2024 | 3 | February 2025 | 46.60 |
| 2024 | 8 | July 2025 | 44.48 |
| 2024 | 7 | June 2025 | 43.88 |
| 2016 | 18 | May 2018 | 41.36 |
| 2016 | 6 | May 2017 | 40.30 |
| 2024 | 9 | August 2025 | 40.08 |
missing_age_month <- vf_missing_pred %>%
filter(!is.na(age_at_registration), age_at_registration >= 0, age_at_registration <= 120) %>%
group_by(registration_month) %>%
summarise(
avg_age = mean(age_at_registration, na.rm = TRUE),
.groups = "drop"
) %>%
tidyr::crossing(election_dates) %>%
mutate(
election_month = floor_date(election_date, "month"),
months_since_election =
(year(registration_month) - year(election_month)) * 12 +
(month(registration_month) - month(election_month))
) %>%
filter(months_since_election >= 0, months_since_election <= months_to_keep)
g6 <- ggplot(
missing_age_month,
aes(x = months_since_election, y = avg_age, color = election_label, group = election_label)
) +
geom_line(linewidth = 1) +
geom_point(size = 1.8) +
scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
labs(
title = "Average Age Within Missing-Race Registrations",
subtitle = "Age at registration",
x = "Months Since Election",
y = "Average Age",
color = "Election"
) +
theme_minimal()
g6
This graph shows that the average age of missing-race registrants at registration stayed roughly the same across the months. That is evidence against the hypothesis that the removal of inactive voters took out Hispanics who registered with missing race at a higher rate than other people.
The two largest challenges I had were:
Loading the full voter file, because it is so large. I ended up making a separate data set containing only the missing-race entries to save time.
Going through the voter file to account for changes made over the years to the voter registration form itself, such as changes to the race codes.
My findings show that the percentage of missing-race registrants was substantially higher after the 2024 election than after the 2016 or 2020 elections, that the average age of these registrants did not change across elections, and that the share of missing-race registrants predicted to be Hispanic went up substantially.
There are several ways this project could be expanded.
One way is to look at how many predicted Hispanics register as other races and see whether that has changed. The registrants could also be contacted to find out whether they meant to leave race missing. The results could be compared with census data to see whether something else explains the pattern. A multinomial logit model could be used to test the findings more formally. Finally, surname-matched voter files from each election could be compared to test the removed-records hypothesis directly.
This project set out to see whether the Trump administration's handling of immigration has changed the percentage of missing-race registrants who are predicted to be Hispanic. The data suggest that this is very likely the case.