1 Project Overview

1.1 Research Question

This project analyzes the Florida voter file from 2016 to 2026 to determine whether registrations with missing race have increased and whether a growing share of the people registering with missing race are Hispanic.

The research question is simple:

Have missing race registrations among Hispanics gone up during the Trump administration?

1.2 Data Source

This data comes from a surname-matched Florida voter file that includes variables such as voter ID, county, gender, race, birth date, registration date, party affiliation, voter status, and surname-matched race prediction percentages.

Because Florida’s voter file includes all of these variables, it is appropriate to use in this project.

1.3 Why This Project Matters

This project is important because the Trump administration has come under heavy scrutiny for its handling of immigration enforcement, especially where Hispanic people are concerned. If the data shows a change after 2024, it could suggest that Hispanic people are afraid to put their race on government documents.

It could also lead polls to incorrect conclusions, because counts of Hispanic registrants will be off if more Hispanic people register with missing race.

2 Data Cleaning and Preparation

First, I loaded the surname-matched Florida voter file and dropped the variables I did not need.

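# Packages the code below relies on (the original setup chunk is not shown here).
library(data.table)  # fread()
library(janitor)     # clean_names()
library(tidyverse)   # dplyr, tidyr, stringr, tibble, ggplot2
library(lubridate)   # mdy(), floor_date(), year(), month()

surname_path <- "C:/Users/xande/OneDrive/Desktop/Data Science Class/VOTER FILES FLORIDA/fl_vr_surname/fl_vr_surname.csv"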

months_to_keep <- 48

election_dates <- tibble(
  election_label = c("2016 Election", "2020 Election", "2024 Election"),
  election_date  = as.Date(c("2016-11-08", "2020-11-03", "2024-11-05"))
)

vf <- fread(file = surname_path) %>%
  clean_names()

vf <- vf %>%
  select(
    voter_id,
    county_code,
    name_last,
    name_suffix,
    name_first,
    name_middle,
    gender,
    race,
    birth_date,
    registration_date,
    party_affiliation,
    voter_status,
    county,
    surname,
    pred_whi,
    pred_bla,
    pred_his,
    pred_asi,
    pred_oth
  )

Next, I built two data sets: one containing only the missing race records, for quicker analysis, and one that summarizes all new registrations by month.

vf_missing_only <- vf %>%
  mutate(
    race = str_trim(as.character(race))
  ) %>%
  filter(is.na(race) | race == "" | race == "9")

vf2 <- vf %>%
  mutate(
    race = str_trim(as.character(race)),
    registration_date = suppressWarnings(mdy(as.character(registration_date)))
  ) %>%
  mutate(
    hispanic = case_when(
      race == "4" ~ 1,
      race %in% c("1", "2", "3", "5", "6", "7") ~ 0,
      race %in% c("9", "", NA) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    missing_ethnicity = if_else(is.na(hispanic), 1, 0),
    registration_month = floor_date(registration_date, "month")
  ) %>%
  filter(!is.na(registration_date), !is.na(registration_month)) %>%
  filter(registration_date >= as.Date("2016-11-01"))

state_month <- vf2 %>%
  group_by(registration_month) %>%
  summarise(
    total_new_regs = n(),
    hispanic_new_regs = sum(hispanic == 1, na.rm = TRUE),
    known_race_regs = sum(!is.na(hispanic), na.rm = TRUE),
    missing_race_regs = sum(is.na(hispanic), na.rm = TRUE),
    pct_hispanic_among_all = hispanic_new_regs / total_new_regs,
    pct_hispanic_among_known = if_else(
      known_race_regs > 0,
      hispanic_new_regs / known_race_regs,
      NA_real_
    ),
    pct_missing_ethnicity = missing_race_regs / total_new_regs,
    .groups = "drop"
  )
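As a sanity check on the coding rule above, the following sketch applies the same logic to a handful of made-up records using only base R. The toy data frame and its values are hypothetical; only the race codes ("4" for Hispanic, "9", blank, or NA for missing) follow the code above.

```r
# Hypothetical toy records; the race codes follow the rule used above.
toy <- data.frame(
  race = c("4", "5", "1", "9", "", NA, "4", "4"),
  registration_month = as.Date(c(
    "2025-01-01", "2025-01-01", "2025-01-01", "2025-01-01",
    "2025-02-01", "2025-02-01", "2025-02-01", "2025-02-01"
  )),
  stringsAsFactors = FALSE
)

# 1 = Hispanic, 0 = another known race, NA = missing or unrecognized code.
toy$hispanic <- ifelse(
  toy$race %in% "4", 1,
  ifelse(toy$race %in% c("1", "2", "3", "5", "6", "7"), 0, NA)
)

# Group by month using a plain "YYYY-MM" key.
month_key <- format(toy$registration_month, "%Y-%m")

# Share of each month's registrations with missing race.
pct_missing <- tapply(is.na(toy$hispanic), month_key, mean)

# Share Hispanic among registrations with a known race.
pct_hispanic_known <- tapply(toy$hispanic, month_key, mean, na.rm = TRUE)

pct_missing
pct_hispanic_known
```

In this toy data, one of four January records and two of four February records are missing race, which is the same quantity that pct_missing_ethnicity tracks in state_month.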

To make the elections easier to compare, I re-indexed each registration month as the number of months since each election and kept the 48 months that follow it.

comparison_df <- state_month %>%
  tidyr::crossing(election_dates) %>%
  mutate(
    election_month = floor_date(election_date, "month"),
    months_since_election =
      (year(registration_month) - year(election_month)) * 12 +
      (month(registration_month) - month(election_month))
  ) %>%
  filter(months_since_election >= 0, months_since_election <= months_to_keep)

coverage_table <- comparison_df %>%
  group_by(election_label) %>%
  summarise(
    first_month = min(months_since_election, na.rm = TRUE),
    last_month = max(months_since_election, na.rm = TRUE),
    number_of_months = n(),
    .groups = "drop"
  )
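The month-alignment arithmetic above can be checked by hand. This is a minimal base R sketch of the same calculation; the helper name months_since is made up for illustration and is not part of the project code.

```r
# Whole calendar months between an election's month and a registration month.
# months_since() is a hypothetical helper that mirrors the year/month
# arithmetic used for months_since_election above.
months_since <- function(registration_month, election_date) {
  reg <- as.POSIXlt(registration_month)
  ele <- as.POSIXlt(election_date)
  (reg$year - ele$year) * 12 + (reg$mon - ele$mon)
}

election_2016 <- as.Date("2016-11-08")
election_2024 <- as.Date("2024-11-05")

months_since(as.Date("2024-11-01"), election_2024)  # the election month itself
months_since(as.Date("2025-04-01"), election_2024)  # April 2025
months_since(as.Date("2020-11-01"), election_2016)  # November 2020, seen from 2016
```

One consequence worth keeping in mind: with months_to_keep set to 48, November 2020 is month 48 of the 2016 window and also month 0 of the 2020 window, so that month appears in both series.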

3 Percent of New Registrants Entering Missing Race

3.1 Missing Race Summary Table

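This table shows, for the months after each election, the average monthly percentage of new registrants who were Hispanic and the average monthly percentage with missing race. Only the months that all three elections have in common are used, which is limited by how far the 2024 data goes.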

max_common_month <- comparison_df %>%
  filter(election_label == "2024 Election") %>%
  summarise(max_month = max(months_since_election, na.rm = TRUE)) %>%
  pull(max_month)

summary_table <- comparison_df %>%
  filter(months_since_election <= max_common_month) %>%
  mutate(Election = str_remove(election_label, " Election")) %>%
  group_by(Election) %>%
  summarise(
    `Average % Hispanic (All Registrations)` = round(mean(pct_hispanic_among_all, na.rm = TRUE) * 100, 2),
    `Average % Hispanic (Known Race Only)` = round(mean(pct_hispanic_among_known, na.rm = TRUE) * 100, 2),
    `Average % Missing Race/Ethnicity` = round(mean(pct_missing_ethnicity, na.rm = TRUE) * 100, 2),
    .groups = "drop"
  )

knitr::kable(
  summary_table,
  caption = paste("Average Registration Patterns, Months 0 to", max_common_month, "After Each Election"),
  align = c("l", "c", "c", "c")
)

Average Registration Patterns, Months 0 to 14 After Each Election
Election | Average % Hispanic (All Registrations) | Average % Hispanic (Known Race Only) | Average % Missing Race/Ethnicity
2016 | 20.73 | 21.17 | 2.03
2020 | 18.13 | 18.51 | 2.05
2024 | 19.42 | 20.53 | 5.41
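Note that the summary code averages the monthly percentages, so every month counts equally no matter how many people registered in it. The sketch below uses made-up counts (not from the voter file) to show how that can differ from a pooled percentage in which every registration counts equally.

```r
# Hypothetical monthly counts, chosen only to make the difference visible.
toy <- data.frame(
  total_new_regs    = c(1000, 9000),
  missing_race_regs = c(100, 180)
)

# Mean of monthly shares: each month weighted equally
# (the approach used for the summary table).
mean_of_monthly_shares <- mean(toy$missing_race_regs / toy$total_new_regs)

# Pooled share: each registration weighted equally.
pooled_share <- sum(toy$missing_race_regs) / sum(toy$total_new_regs)

mean_of_monthly_shares
pooled_share
```

Neither number is wrong, but they answer slightly different questions, so it helps to say which one a table reports.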

3.2 Graph 1: Percent of new registrants that entered missing race

This graph shows, for each month up to 48 months after each election, the percent of new registrants that registered as missing race.

g4 <- ggplot(
  comparison_df,
  aes(x = months_since_election, y = pct_missing_ethnicity, color = election_label, group = election_label)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(
    title = "Missing Race/Ethnicity Share After Each Presidential Election",
    x = "Months Since Election",
    y = "Missing Share",
    color = "Election"
  ) +
  facet_wrap(~ election_label) +
  theme_minimal()

g4

3.3 Interpretation

This graph shows that in the 14 months following the 2024 presidential election, the percentage of new registrants entering missing race is substantially higher than in the same months after the 2016 and 2020 elections, which are very similar to each other.
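One way to check whether a gap like this is larger than ordinary noise would be a two-sample test of proportions. The sketch below uses base R's prop.test() with made-up counts chosen only to resemble the rounded shares in the summary table; the real counts would have to come from comparison_df, and the test treats registrations as independent, which is itself an assumption.

```r
# Hypothetical counts: missing-race registrations out of all new registrations.
missing_counts <- c(2050, 5410)       # assumed numerators, 2020 then 2024
total_counts   <- c(100000, 100000)   # assumed denominators, 2020 then 2024

test_result <- prop.test(x = missing_counts, n = total_counts)

test_result$estimate  # the two sample proportions
test_result$p.value   # small values suggest the gap is not just noise
```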

4 Predicted Race Within Missing-Race Registrations

I then used the surname-matched race predictions to estimate the racial composition of missing race registrants.

vf_missing_pred <- vf_missing_only %>%
  mutate(
    registration_date = suppressWarnings(mdy(as.character(registration_date))),
    birth_date = suppressWarnings(mdy(as.character(birth_date))),
    registration_month = floor_date(registration_date, "month"),
    pred_whi = suppressWarnings(as.numeric(pred_whi)),
    pred_bla = suppressWarnings(as.numeric(pred_bla)),
    pred_his = suppressWarnings(as.numeric(pred_his)),
    pred_asi = suppressWarnings(as.numeric(pred_asi)),
    pred_oth = suppressWarnings(as.numeric(pred_oth)),
    gender = str_trim(as.character(gender)),
    party_affiliation = str_trim(as.character(party_affiliation)),
    voter_status = str_trim(as.character(voter_status)),
    age_at_registration = as.numeric(time_length(interval(birth_date, registration_date), "years"))
  ) %>%
  filter(!is.na(registration_date), !is.na(registration_month)) %>%
  filter(registration_date >= as.Date("2016-11-01"))

max_pred <- max(
  c(vf_missing_pred$pred_whi, vf_missing_pred$pred_bla, vf_missing_pred$pred_his,
    vf_missing_pred$pred_asi, vf_missing_pred$pred_oth),
  na.rm = TRUE
)

if (is.finite(max_pred) && max_pred > 1) {
  vf_missing_pred <- vf_missing_pred %>%
    mutate(
      pred_whi = pred_whi / 100,
      pred_bla = pred_bla / 100,
      pred_his = pred_his / 100,
      pred_asi = pred_asi / 100,
      pred_oth = pred_oth / 100
    )
}

missing_pred_month <- vf_missing_pred %>%
  group_by(registration_month) %>%
  summarise(
    avg_pred_whi = mean(pred_whi, na.rm = TRUE),
    avg_pred_bla = mean(pred_bla, na.rm = TRUE),
    avg_pred_his = mean(pred_his, na.rm = TRUE),
    avg_pred_asi = mean(pred_asi, na.rm = TRUE),
    avg_pred_oth = mean(pred_oth, na.rm = TRUE),
    n_missing = n(),
    .groups = "drop"
  )

missing_pred_comparison <- missing_pred_month %>%
  tidyr::crossing(election_dates) %>%
  mutate(
    election_month = floor_date(election_date, "month"),
    months_since_election =
      (year(registration_month) - year(election_month)) * 12 +
      (month(registration_month) - month(election_month))
  ) %>%
  filter(months_since_election >= 0, months_since_election <= months_to_keep)

missing_pred_long <- missing_pred_comparison %>%
  pivot_longer(
    cols = c(avg_pred_whi, avg_pred_bla, avg_pred_his, avg_pred_asi, avg_pred_oth),
    names_to = "predicted_race",
    values_to = "avg_share"
  ) %>%
  mutate(
    predicted_race = recode(
      predicted_race,
      avg_pred_whi = "Predicted White",
      avg_pred_bla = "Predicted Black",
      avg_pred_his = "Predicted Hispanic",
      avg_pred_asi = "Predicted Asian",
      avg_pred_oth = "Predicted Other"
    )
  )
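The averaging step above treats each surname probability as a partial count: the mean of pred_his across a group of missing race records is an estimate of the share of that group that is Hispanic. A minimal sketch with made-up probabilities (the values are hypothetical, only the column names match the voter file):

```r
# Hypothetical surname-based probabilities for four missing-race records.
toy_pred <- data.frame(
  pred_whi = c(0.10, 0.70, 0.05, 0.55),
  pred_bla = c(0.05, 0.20, 0.03, 0.30),
  pred_his = c(0.80, 0.05, 0.90, 0.05),
  pred_asi = c(0.03, 0.03, 0.01, 0.05),
  pred_oth = c(0.02, 0.02, 0.01, 0.05)
)

# Each row should sum to 1 if the columns are proportions rather than percents.
row_totals <- rowSums(toy_pred)
stopifnot(all(abs(row_totals - 1) < 1e-8))

# Column means give the expected racial composition of the group.
expected_share <- colMeans(toy_pred)
round(expected_share * 100, 1)
```

Checking the row totals is also a quick way to confirm whether the prediction columns are stored as 0 to 1 proportions or 0 to 100 percentages before rescaling them.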

4.1 Predicted Race Summary Table

This table shows the average predicted racial composition of registrations with missing race across the post-election months that all three election periods have in common.

max_common_month <- comparison_df %>%
  filter(election_label == "2024 Election") %>%
  summarise(max_month = max(months_since_election, na.rm = TRUE)) %>%
  pull(max_month)

predicted_race_summary <- missing_pred_comparison %>%
  filter(months_since_election <= max_common_month) %>%
  mutate(Election = str_remove(election_label, " Election")) %>%
  group_by(Election) %>%
  summarise(
    `Average % Predicted White` = round(mean(avg_pred_whi, na.rm = TRUE) * 100, 2),
    `Average % Predicted Hispanic` = round(mean(avg_pred_his, na.rm = TRUE) * 100, 2),
    `Average % Predicted Black` = round(mean(avg_pred_bla, na.rm = TRUE) * 100, 2),
    `Average % Predicted Asian` = round(mean(avg_pred_asi, na.rm = TRUE) * 100, 2),
    `Average % Predicted Other` = round(mean(avg_pred_oth, na.rm = TRUE) * 100, 2),
    .groups = "drop"
  )

knitr::kable(
  predicted_race_summary,
  caption = paste("Average Predicted Race of Missing Race Registrants After Each Election, Months 0 to", max_common_month),
  align = c("l", "c", "c", "c", "c", "c")
)

Average Predicted Race of Missing Race Registrants After Each Election, Months 0 to 14
Election | Average % Predicted White | Average % Predicted Hispanic | Average % Predicted Black | Average % Predicted Asian | Average % Predicted Other
2016 | 42.18 | 31.75 | 15.46 | 6.15 | 4.46
2020 | 49.45 | 23.74 | 17.16 | 4.76 | 4.89
2024 | 33.30 | 42.10 | 15.50 | 4.99 | 4.10

4.2 Graph 2: Average Predicted Race of missing race registrants

To see whether the change was driven specifically by Hispanic registrants, I used the surname-matched race probabilities to estimate the average racial composition of missing race registrations in each month.

g5a <- ggplot(
  missing_pred_long,
  aes(x = months_since_election, y = avg_share, color = predicted_race, group = predicted_race)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.8) +
  scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(
    title = "Average Predicted Race Shares Within Missing-Race Registrations",
    subtitle = "Based on surname-matched prediction columns",
    x = "Months Since Election",
    y = "Average Predicted Share",
    color = "Predicted Race"
  ) +
  facet_wrap(~ election_label) +
  theme_minimal()

g5a

4.3 Graph 2.1: Average Predicted Race of missing race registrants with ICE event markers

I also added vertical lines at the months when major ICE activity took place in Florida, to see whether the changes line up with those events.

event_markers_2024 <- tibble(
  election_label = "2024 Election",
  months_since_election = c(5, 13),
  event_label = c("Operation Tidal Wave", "Pinellas Park / Tampa")
)

g5 <- ggplot(
  missing_pred_long,
  aes(
    x = months_since_election,
    y = avg_share,
    color = predicted_race,
    group = predicted_race
  )
) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.8) +
  geom_vline(
    data = event_markers_2024,
    aes(xintercept = months_since_election),
    inherit.aes = FALSE,
    linetype = "dashed"
  ) +
  geom_text(
    data = event_markers_2024,
    aes(
      x = months_since_election,
      y = 0.99 * max(missing_pred_long$avg_share, na.rm = TRUE),
      label = event_label
    ),
    inherit.aes = FALSE,
    angle = 90,
    vjust = -0.3,
    hjust = 1,
    size = 3
  ) +
  scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(
    title = "Average Predicted Race Shares Within Missing-Race Registrations",
    subtitle = "Based on surname-matched prediction columns, with Florida ICE event markers",
    x = "Months Since Election",
    y = "Average Predicted Share",
    color = "Predicted Race"
  ) +
  facet_wrap(~ election_label) +
  theme_minimal(base_size = 13)

g5

4.4 Interpretation

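These graphs show that after the 2024 election the predicted Hispanic share of missing race registrants rose to its highest levels of the decade. The highest month, April 2025 (month 5), lines up with the Operation Tidal Wave marker, although a visual match like this does not by itself show that ICE activity caused the change.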

4.5 Highest Hispanic Months Summary Table

This table shows which months of the decade had the highest percentage of missing race registrants predicted to be Hispanic.

top_10_pred_hispanic <- missing_pred_comparison %>%
  mutate(Election = str_remove(election_label, " Election")) %>%
  select(
    Election,
    `Month After Election` = months_since_election,
    `Registration Month` = registration_month,
    `Average % Predicted Hispanic` = avg_pred_his
  ) %>%
  arrange(desc(`Average % Predicted Hispanic`)) %>%
  slice_head(n = 10) %>%
  mutate(
    `Registration Month` = format(`Registration Month`, "%B %Y"),
    `Average % Predicted Hispanic` = round(`Average % Predicted Hispanic` * 100, 2)
  )

knitr::kable(
  top_10_pred_hispanic,
  caption = "Top 10 Months by Percent of Missing Race Registrants Predicted Hispanic",
  align = c("l", "c", "l", "c")
)

Top 10 Months by Percent of Missing Race Registrants Predicted Hispanic
Election | Month After Election | Registration Month | Average % Predicted Hispanic
2024 | 5 | April 2025 | 55.30
2024 | 6 | May 2025 | 53.89
2024 | 1 | December 2024 | 53.33
2024 | 2 | January 2025 | 51.17
2024 | 3 | February 2025 | 46.60
2024 | 8 | July 2025 | 44.48
2024 | 7 | June 2025 | 43.88
2016 | 18 | May 2018 | 41.36
2016 | 6 | May 2017 | 40.30
2024 | 9 | August 2025 | 40.08

5 Data Removal Hypothesis

5.1 Graph 3: Average Age of missing race registrants

missing_age_month <- vf_missing_pred %>%
  filter(!is.na(age_at_registration), age_at_registration >= 0, age_at_registration <= 120) %>%
  group_by(registration_month) %>%
  summarise(
    avg_age = mean(age_at_registration, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  tidyr::crossing(election_dates) %>%
  mutate(
    election_month = floor_date(election_date, "month"),
    months_since_election =
      (year(registration_month) - year(election_month)) * 12 +
      (month(registration_month) - month(election_month))
  ) %>%
  filter(months_since_election >= 0, months_since_election <= months_to_keep)

g6 <- ggplot(
  missing_age_month,
  aes(x = months_since_election, y = avg_age, color = election_label, group = election_label)
) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.8) +
  scale_x_continuous(breaks = seq(0, months_to_keep, by = 6)) +
  labs(
    title = "Average Age Within Missing-Race Registrations",
    subtitle = "Age at registration",
    x = "Months Since Election",
    y = "Average Age",
    color = "Election"
  ) +
  theme_minimal()

g6

5.2 Interpretation

This graph shows that the average age of missing race registrants stayed roughly the same across the months. That weighs against the hypothesis that the removal of inactive voters took out Hispanics who registered as missing race at a higher rate than other people.

6 Challenges

The two largest challenges I had were:

Loading the full voter file, because it is so large. I ended up making a separate data set containing only the missing race entries to save time.

Going through the voter file to account for changes made over the years to the voter registration form itself, which affect things like race codes.

7 Findings

My findings show that the percent of missing race registrants was much higher after the 2024 election than after the 2016 or 2020 elections (5.41% versus about 2%), that the average age of missing race registrants did not change noticeably across elections, and that the predicted Hispanic share of missing race registrants rose sharply (42.10% after 2024 versus 31.75% after 2016 and 23.74% after 2020).

8 Future Research

There are several ways this project could be expanded.

One way is to look at how many predicted Hispanics register as other races and see whether that has changed. Another is to contact the registrants and ask whether they meant to leave race blank. The results could also be compared with census data to see whether there is another explanation for the change, and a multinomial logit model could be used to back up the findings. Finally, surname-matched voter files from each election could be compared to rule out the removed records hypothesis more fully.
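The first idea could be sketched as follows. The records are made up, and the 0.5 cutoff for calling a registrant "predicted Hispanic" is an assumption chosen for illustration, not something taken from this project.

```r
# Hypothetical records: recorded race code and surname-based Hispanic probability.
toy <- data.frame(
  race     = c("4", "5", "4", "9", "5", "3", "4", "9"),
  pred_his = c(0.92, 0.81, 0.75, 0.88, 0.10, 0.05, 0.60, 0.20),
  stringsAsFactors = FALSE
)

cutoff <- 0.5  # assumed threshold for "predicted Hispanic"
likely_hispanic <- toy[toy$pred_his >= cutoff, ]

# How did the predicted-Hispanic registrants actually record their race?
recorded <- ifelse(
  likely_hispanic$race == "4", "Recorded Hispanic",
  ifelse(likely_hispanic$race == "9", "Missing race", "Recorded other race")
)

shares <- prop.table(table(recorded))
shares
```

Computing these shares separately for each registration month would show whether predicted Hispanics are shifting toward other race codes, toward missing race, or neither.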

9 Conclusion

This project set out to see whether the Trump administration’s handling of immigration has changed the percentage of missing race registrants who are predicted to be Hispanic. The data strongly suggests that it has.