Class Project

Voter Turnout of Naturalized (Foreign Born) Citizens and U.S.-Born Citizens

Research Question: How does voter turnout differ between Naturalized (Foreign Born) citizens and U.S.-Born Citizens?

Introduction.

Naturalized citizens must complete a lengthy naturalization process, which includes passing a civics test, on their path to United States citizenship. U.S.-born citizens, in contrast, may have taken civics courses during their schooling. These different experiences could influence each group's voter turnout.

In this project, I compare voter turnout between naturalized (foreign-born) citizens and U.S.-born citizens across several elections.

Using the North Carolina Voter Registration Data and Voter History Data, I examine the November 2024 general election, the November 2023 municipal election, and the November 2022 midterm election.

This analysis focuses on Wake County, one of the North Carolina counties with the largest foreign-born populations.

Null Hypothesis: There is no difference in voter turnout rates between naturalized and U.S.-born citizens.

Alternative Hypothesis: Naturalized citizens have lower voter turnout rates than U.S.-born citizens.
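The one-sided alternative above corresponds to a two-sample test of proportions. The sketch below shows one way such a test could be run in base R with `prop.test()`. The counts are hypothetical placeholders chosen for illustration, not results from this project.

```r
# Hypothetical counts for illustration only (not from the NC voter files)
voted      <- c(naturalized = 550, us_born = 620)    # voters who cast a ballot
registered <- c(naturalized = 1000, us_born = 1000)  # registered voters per group

# H0: equal turnout; H1: naturalized turnout < U.S.-born turnout.
# With two groups, alternative = "less" tests whether the first
# proportion is smaller than the second.
turnout_test <- prop.test(x = voted, n = registered, alternative = "less")

turnout_test$estimate  # sample turnout proportion for each group
turnout_test$p.value   # small values are evidence against H0
```

The order of the two groups matters here: swapping them would require `alternative = "greater"` to express the same hypothesis.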

Getting Started

Load Necessary Libraries.

library(tidycensus) 
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2) 
library(dplyr)

Loading the Datasets

WAKE_VOTER_registration <- read_tsv("~/Downloads/ncvoter92.zip")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 968240 Columns: 67
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (43): county_desc, voter_reg_num, ncid, last_name, first_name, middle_na...
dbl  (9): county_id, zip_code, mail_zipcode, birth_year, age_at_year_end, nc...
lgl (15): mail_addr3, mail_addr4, full_phone_number, township_abbrv, townshi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

WAKE_VOTER_HIST <- read_tsv("~/Downloads/ncvhis92 (1).zip")

Rows: 4063003 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (13): county_desc, voter_reg_num, election_lbl, election_desc, voting_me...
dbl  (2): county_id, voted_county_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filtering the Datasets

# Keep only registrations that are not removed, inactive, denied, or confidential
WAKE_VOTER_registration <- WAKE_VOTER_registration |>
  filter(
    !grepl("REMOVED|INACTIVE|DENIED", voter_status_desc),  # drop by voter status
    !grepl("Y", confidential_ind)                          # drop confidential records
  )

All voters categorized or labeled as “Removed,” “Inactive,” “Denied,” or “Confidential” were excluded. This ensures the study focuses only on voters labeled as active.

FB_VOTER_Registr <- WAKE_VOTER_registration  |> 
  filter(is.na(birth_state))

This final filter limits the dataset to voters whose Birth State field is blank (recorded as NA). For now, we assume these individuals are foreign-born, naturalized citizens, although a blank field could also mean that the information was simply never provided.
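Because the foreign-born group is defined entirely by a missing `birth_state`, it may help to look at how that field is distributed before relying on it. The following is a minimal sketch on a toy table: the rows are invented, and only the column names `voter_reg_num` and `birth_state` come from the registration file.

```r
library(dplyr)

# Toy stand-in for the registration file (values are invented)
toy_registration <- tibble(
  voter_reg_num = c("001", "002", "003", "004", "005"),
  birth_state   = c("NC", NA, "NY", NA, "VA")
)

# Number and share of records with no recorded birth state
missing_summary <- toy_registration |>
  summarise(
    n_missing   = sum(is.na(birth_state)),
    pct_missing = mean(is.na(birth_state)) * 100
  )
missing_summary

# Frequency of each recorded value, with NA kept as its own row
birth_state_counts <- toy_registration |>
  count(birth_state, sort = TRUE)
birth_state_counts
```

On the real file, any value that is not a U.S. state abbreviation would also show up in this tally and could be reviewed separately.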

Merging the Datasets

FBVOTER_REGISTR_DATA <- FB_VOTER_Registr |>
  left_join(WAKE_VOTER_HIST, by = "voter_reg_num")

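This step merges the Voter History file with the filtered Voter Registration data using voter_reg_num, which serves as a unique identifier for each voter in Wake County, North Carolina. Because the history file contains one row for every election a voter participated in, the merged dataset has one row per voter per election rather than one row per voter.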
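The effect of this kind of join on row counts can be illustrated with a small example. The sketch below uses invented toy tables whose column names mirror the NC files; it checks that the key is unique on the registration side and shows how a one-to-many `left_join()` adds rows.

```r
library(dplyr)

# Toy tables (invented values); column names mirror the NC files
toy_reg <- tibble(
  voter_reg_num = c("001", "002", "003"),
  birth_state   = c("NC", NA, "VA")
)
toy_hist <- tibble(
  voter_reg_num = c("001", "001", "002"),
  election_lbl  = c("11/08/2022", "11/05/2024", "11/05/2024")
)

# Confirm the key is unique on the registration side before joining
stopifnot(!anyDuplicated(toy_reg$voter_reg_num))

toy_merged <- toy_reg |>
  left_join(toy_hist, by = "voter_reg_num")

nrow(toy_reg)     # 3 voters
nrow(toy_merged)  # 4 rows: voter 001 appears twice, voter 003 once with NA history

# The number of distinct voters is unchanged by the join
n_distinct(toy_merged$voter_reg_num)
```

Any summary computed with `n()` on the merged table therefore counts voter-election records, while `n_distinct(voter_reg_num)` counts voters.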

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA  |>
  mutate(county_id = coalesce(county_id.x, county_id.y)) |>
  select(-county_id.x, -county_id.y) 

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |>
  mutate(county_desc = coalesce(county_desc.x, county_desc.y)) |>
  select(-county_desc.x, -county_desc.y) 

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |>
  mutate(ncid = coalesce(ncid.x, ncid.y)) |>
  select(-ncid.x, -ncid.y)

Merging the datasets created duplicate copies of the columns that appear in both files (county_id, county_desc, and ncid), so the steps above combine each pair into a single column and drop the duplicates.
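An alternative that avoids the `.x`/`.y` cleanup is to drop the overlapping columns from the history table before joining. The sketch below uses invented toy data and assumes that `county_id` and `ncid` carry the same information in both files, as the `coalesce()` calls above imply.

```r
library(dplyr)

# Toy tables (invented values) that share columns beyond the join key
toy_reg <- tibble(
  voter_reg_num = c("001", "002"),
  county_id     = c(92, 92),
  ncid          = c("AA1", "AA2")
)
toy_hist <- tibble(
  voter_reg_num = c("001", "001", "002"),
  county_id     = c(92, 92, 92),
  ncid          = c("AA1", "AA1", "AA2"),
  election_lbl  = c("11/08/2022", "11/05/2024", "11/05/2024")
)

# Columns present in both tables, other than the join key
shared_cols <- setdiff(intersect(names(toy_reg), names(toy_hist)), "voter_reg_num")

# Remove them from the history table so the join adds no suffixed duplicates
toy_merged <- toy_reg |>
  left_join(select(toy_hist, -all_of(shared_cols)), by = "voter_reg_num")

names(toy_merged)  # no county_id.x / county_id.y columns
```

With this approach the shared columns always come from the registration table, which is complete for every voter kept by the left join.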

Now, I will repeat the same process for the U.S.-born citizens.

USBORN_VOTER <- WAKE_VOTER_registration |> 
  filter(!is.na(birth_state))   

USBORN_VOTER_Registr <- USBORN_VOTER |>
  left_join(WAKE_VOTER_HIST, by = "voter_reg_num")  

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(county_id = coalesce(county_id.x, county_id.y)) |>
  select(-county_id.x, -county_id.y) 

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(county_desc = coalesce(county_desc.x, county_desc.y)) |>
  select(-county_desc.x, -county_desc.y) 


USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(ncid = coalesce(ncid.x, ncid.y)) |>
  select(-ncid.x, -ncid.y)

Changing the Date Format

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |> 
  mutate(election_lbl = as.Date(election_lbl, format = "%m/%d/%Y")) |> 
  mutate(registr_dt =as.Date(registr_dt, format = "%m/%d/%Y")) 

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |> 
  mutate(election_lbl = as.Date(election_lbl, format = "%m/%d/%Y")) |> 
  mutate(registr_dt =as.Date(registr_dt, format = "%m/%d/%Y"))

Examining which days in 2024 had the most voter registrations

naturalized_2024 <- FBVOTER_REGISTR_DATA |>
  filter(registr_dt >= as.Date("2024-01-01") &
           registr_dt <= as.Date("2024-12-31")) 

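# Count each voter once per registration date. The merged data has one row
# per voter per election, so n() would count a voter once for every history row.
daily_counts <- naturalized_2024 |>
  group_by(registr_dt) |>
  summarise(registrations = n_distinct(voter_reg_num))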

library(lubridate)

daily_counts$month <- month(daily_counts$registr_dt, label = TRUE)

ggplot(daily_counts, aes(x = registr_dt, y = registrations)) +
  geom_line() +
  facet_wrap(~ month, scales = "free_x") +
  labs(
    title = "Daily Voter Registrations in 2024 by Month",
    x = "Date",
    y = "Registrations"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This was done to see whether naturalized citizens could be picked out, among voters with Birth State recorded as NA, by when they registered. The plot shows that registration date alone is not enough to identify them. Although there are noticeable peaks in March, October, and November, these may simply reflect voter registration deadlines.

Checking Whether Both Datasets Are Balanced

FBVOTER_REGISTR_DATA %>% summarise(mean_age = mean(age_at_year_end, na.rm = TRUE))

# A tibble: 1 × 1
  mean_age
     <dbl>
1     45.1

USBORN_VOTER_Registr %>% summarise(mean_age = mean(age_at_year_end, na.rm = TRUE))

# A tibble: 1 × 1
  mean_age
     <dbl>
1     56.3

FBVOTER_REGISTR_DATA %>% count(gender_code) %>% mutate(percent = n / sum(n))

# A tibble: 3 × 3
  gender_code      n percent
  <chr>        <int>   <dbl>
1 F           172800   0.341
2 M           148303   0.293
3 U           185676   0.366

USBORN_VOTER_Registr %>% count(gender_code) %>% mutate(percent = n / sum(n))

# A tibble: 3 × 3
  gender_code       n percent
  <chr>         <int>   <dbl>
1 F           1724517 0.546  
2 M           1407733 0.446  
3 U             25167 0.00797

FBVOTER_REGISTR_DATA %>% count(party_cd) %>% mutate(percent = n / sum(n))

# A tibble: 5 × 3
  party_cd      n  percent
  <chr>     <int>    <dbl>
1 DEM      178157 0.352   
2 GRE         470 0.000927
3 LIB        2897 0.00572 
4 REP       84009 0.166   
5 UNA      241246 0.476

USBORN_VOTER_Registr %>% count(party_cd) %>% mutate(percent = n / sum(n))

# A tibble: 5 × 3
  party_cd       n  percent
  <chr>      <int>    <dbl>
1 DEM      1201821 0.381   
2 GRE          871 0.000276
3 LIB        10458 0.00331 
4 REP       755584 0.239   
5 UNA      1188683 0.376

FBVOTER_REGISTR_DATA %>% count(race_code) %>% mutate(percent = n / sum(n))

# A tibble: 8 × 3
  race_code      n  percent
  <chr>      <int>    <dbl>
1 A          28666 0.0566  
2 B          69400 0.137   
3 I           1555 0.00307 
4 M           3642 0.00719 
5 O          25084 0.0495  
6 P             69 0.000136
7 U         150949 0.298   
8 W         227414 0.449

USBORN_VOTER_Registr  %>% count(race_code) %>% mutate(percent = n / sum(n))

# A tibble: 8 × 3
  race_code       n   percent
  <chr>       <int>     <dbl>
1 A           87615 0.0277   
2 B          549489 0.174    
3 I            6547 0.00207  
4 M           14541 0.00461  
5 O          108508 0.0344   
6 P             152 0.0000481
7 U           89114 0.0282   
8 W         2301451 0.729

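After calculating the mean age and the gender, party, and race distributions, it is clear that the foreign-born/naturalized and U.S.-born datasets are not balanced. The foreign-born group is younger on average (45.1 vs. 56.3 years), has a much larger share of unknown gender (36.6% vs. 0.8%) and unknown race (29.8% vs. 2.8%), and has a smaller share of registered Republicans (16.6% vs. 23.9%). Because these figures are computed on the merged data, which has one row per voter per election, voters with longer voting histories are counted more than once. These differences suggest that any comparison between the groups, especially of voter turnout, may be confounded by demographics unless matching is applied.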
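Since the merged tables have one row per voter per election, one way to summarise demographics per voter is to collapse to one row per `voter_reg_num` first. The sketch below uses invented toy data to show how the row-level and voter-level means can differ.

```r
library(dplyr)

# Toy merged data: voter 001 has three history rows, voter 002 has one
toy_merged <- tibble(
  voter_reg_num   = c("001", "001", "001", "002"),
  age_at_year_end = c(60, 60, 60, 30),
  gender_code     = c("F", "F", "F", "M")
)

# Row-level mean weights voter 001 three times: (60 * 3 + 30) / 4 = 52.5
row_level_mean <- mean(toy_merged$age_at_year_end)

# Voter-level: keep one row per voter before summarising
toy_voters <- toy_merged |>
  distinct(voter_reg_num, .keep_all = TRUE)

# Each voter counts once: (60 + 30) / 2 = 45
voter_level_mean <- mean(toy_voters$age_at_year_end)

# Gender shares among distinct voters
toy_voters |>
  count(gender_code) |>
  mutate(percent = n / sum(n))
```

The same `distinct()` step could be applied before the `count()` calls above if voter-level percentages are wanted.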

Matching for Balanced Data Between Both Groups.

# Add a 0/1 treatment indicator to each dataset for matching
Foreign <- FBVOTER_REGISTR_DATA |> mutate(group = 1)  # 1 = foreign-born (birth_state is NA)
USborn  <- USBORN_VOTER_Registr |> mutate(group = 0)  # 0 = U.S.-born

# Stack both groups into one table
all_data <- bind_rows(Foreign, USborn)

library(MatchIt)   


set.seed(123)  # for reproducibility

foreign_sample <- all_data |> filter(group == 1) |> sample_n(5000)
usborn_sample  <- all_data |> filter(group == 0) |> sample_n(5000)

all_data_small <- bind_rows(foreign_sample, usborn_sample)  



match_model <- matchit(
  group ~ age_at_year_end + gender_code + race_code + party_cd,
  data = all_data_small,
  method = "nearest",
  ratio = 1,
  caliper = 0.2
) 
matched_data <- match.data(match_model)
summary(match_model)


Call:
matchit(formula = group ~ age_at_year_end + gender_code + race_code + 
    party_cd, data = all_data_small, method = "nearest", caliper = 0.2, 
    ratio = 1)

Summary of Balance for All Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance               0.6509        0.3491          1.0776     3.1482
age_at_year_end       45.3310       56.1714         -0.6359     1.0312
gender_codeF           0.3434        0.5496         -0.4342          .
gender_codeM           0.2998        0.4426         -0.3117          .
gender_codeU           0.3568        0.0078          0.7285          .
race_codeA             0.0560        0.0266          0.1279          .
race_codeB             0.1460        0.1848         -0.1099          .
race_codeI             0.0026        0.0014          0.0236          .
race_codeM             0.0086        0.0048          0.0412          .
race_codeO             0.0500        0.0360          0.0642          .
race_codeP             0.0002        0.0000          0.0141          .
race_codeU             0.2864        0.0296          0.5680          .
race_codeW             0.4502        0.7168         -0.5359          .
party_cdDEM            0.3592        0.3798         -0.0429          .
party_cdGRE            0.0014        0.0000          0.0374          .
party_cdLIB            0.0054        0.0030          0.0327          .
party_cdREP            0.1610        0.2384         -0.2106          .
party_cdUNA            0.4730        0.3788          0.1887          .
                eCDF Mean eCDF Max
distance           0.3258   0.4982
age_at_year_end    0.1290   0.2938
gender_codeF       0.2062   0.2062
gender_codeM       0.1428   0.1428
gender_codeU       0.3490   0.3490
race_codeA         0.0294   0.0294
race_codeB         0.0388   0.0388
race_codeI         0.0012   0.0012
race_codeM         0.0038   0.0038
race_codeO         0.0140   0.0140
race_codeP         0.0002   0.0002
race_codeU         0.2568   0.2568
race_codeW         0.2666   0.2666
party_cdDEM        0.0206   0.0206
party_cdGRE        0.0014   0.0014
party_cdLIB        0.0024   0.0024
party_cdREP        0.0774   0.0774
party_cdUNA        0.0942   0.0942

Summary of Balance for Matched Data:
                Means Treated Means Control Std. Mean Diff. Var. Ratio
distance               0.4578        0.4263          0.1123     1.2290
age_at_year_end       47.1173       48.5082         -0.0816     1.1897
gender_codeF           0.5534        0.5339          0.0411          .
gender_codeM           0.4285        0.4528         -0.0530          .
gender_codeU           0.0181        0.0133          0.0100          .
race_codeA             0.0626        0.0451          0.0759          .
race_codeB             0.1895        0.1970         -0.0213          .
race_codeI             0.0027        0.0024          0.0067          .
race_codeM             0.0092        0.0075          0.0185          .
race_codeO             0.0694        0.0557          0.0628          .
race_codeP             0.0003        0.0000          0.0242          .
race_codeU             0.0834        0.0506          0.0726          .
race_codeW             0.5828        0.6416         -0.1182          .
party_cdDEM            0.3663        0.3718         -0.0114          .
party_cdGRE            0.0024        0.0000          0.0640          .
party_cdLIB            0.0065        0.0051          0.0187          .
party_cdREP            0.1782        0.1785         -0.0009          .
party_cdUNA            0.4466        0.4446          0.0041          .
                eCDF Mean eCDF Max Std. Pair Dist.
distance           0.0382   0.1419          0.1123
age_at_year_end    0.0234   0.0978          0.5102
gender_codeF       0.0195   0.0195          0.7109
gender_codeM       0.0243   0.0243          0.7412
gender_codeU       0.0048   0.0048          0.0186
race_codeA         0.0174   0.0174          0.3793
race_codeB         0.0075   0.0075          0.5927
race_codeI         0.0003   0.0003          0.1007
race_codeM         0.0017   0.0017          0.1667
race_codeO         0.0137   0.0137          0.5021
race_codeP         0.0003   0.0003          0.0242
race_codeU         0.0328   0.0328          0.1997
race_codeW         0.0588   0.0588          0.5499
party_cdDEM        0.0055   0.0055          0.7057
party_cdGRE        0.0024   0.0024          0.0640
party_cdLIB        0.0014   0.0014          0.1587
party_cdREP        0.0003   0.0003          0.4141
party_cdUNA        0.0021   0.0021          0.7343

Sample Sizes:
          Control Treated
All          5000    5000
Matched      2924    2924
Unmatched    2076    2076
Discarded       0       0

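To make matching possible, I added a group indicator to each dataset (1 = foreign-born, 0 = U.S.-born). Because the combined dataset is very large, I drew a random sample of 5,000 records from each group and matched them one-to-one on age, gender, race, and party affiliation using nearest-neighbor matching with a caliper of 0.2. Matching retained 2,924 pairs, and the absolute standardized mean differences for all covariates fell to about 0.12 or below, so the matched groups are much better balanced than the original samples.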
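The "Std. Mean Diff." column in the summary above can be checked by hand. For the `gender_codeU` row, the printed values are consistent with dividing the difference in group proportions by the standard deviation of the treated group in the full (pre-matching) sample. The sketch below reproduces that arithmetic. The helper `smd_binary()` is a hypothetical name written for this illustration, not a MatchIt function, and the standardisation rule is inferred from the printed numbers rather than taken from the package source.

```r
# Hypothetical helper: standardised mean difference for a 0/1 covariate.
#   p_treated, p_control: proportion of 1s in each group
#   p_ref: treated-group proportion used to compute the standard deviation
smd_binary <- function(p_treated, p_control, p_ref = p_treated) {
  (p_treated - p_control) / sqrt(p_ref * (1 - p_ref))
}

# gender_codeU before matching (means copied from "Summary of Balance for All Data")
smd_before <- smd_binary(0.3568, 0.0078)

# gender_codeU after matching (means copied from "Summary of Balance for Matched Data"),
# still scaled by the pre-matching treated proportion
smd_after <- smd_binary(0.0181, 0.0133, p_ref = 0.3568)

# Compare with the 0.7285 and 0.0100 printed in the summary
round(c(before = smd_before, after = smd_after), 4)
```

Because the inputs are the rounded means from the printed table, the results agree with the summary only up to rounding.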

Results for Elections

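# Keep only rows for the three elections of interest
selected_elections <- matched_data %>%
  filter(election_lbl %in% as.Date(c("2024-11-05", "2022-11-08", "2023-11-07")))

# Numerator: matched rows per group for each election
turnout_counts <- selected_elections %>%
  group_by(group, election_lbl) %>%
  summarise(voters = n(), .groups = "drop")

# Denominator: all matched rows per group. matched_data has one row per
# voter per election, so this counts records, not unique registered voters.
total_per_group <- matched_data %>%
  group_by(group) %>%
  summarise(total_registered = n())

# Share of each group's matched rows that fall in each election
turnout_summary <- turnout_counts %>%
  left_join(total_per_group, by = "group") %>%
  mutate(turnout_pct = voters / total_registered * 100)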

# 2024 General
general_2024 <- turnout_summary %>%
  filter(election_lbl == as.Date("2024-11-05"))

ggplot(general_2024, aes(x = factor(group), y = turnout_pct, fill = factor(group))) +
  geom_col() +
  labs(
    title = "Voter Turnout: 2024 General Election",
    x = "Group (0=US-born, 1=Foreign)",
    y = "Turnout (%)"
  ) +
  theme_minimal()

# 2022 Midterm
midterm_2022 <- turnout_summary %>%
  filter(election_lbl == as.Date("2022-11-08")) 
ggplot(midterm_2022, aes(x = factor(group), y = turnout_pct, fill = factor(group))) +
  geom_col() +
  labs(
    title = "Voter Turnout: 2022 Midterm Election",
    x = "Group (0=US-born, 1=Foreign)",
    y = "Turnout (%)"
  ) +
  theme_minimal()

# 2023 Municipal
municipal_2023 <- turnout_summary %>%
  filter(election_lbl == as.Date("2023-11-07"))

ggplot(municipal_2023, aes(x = factor(group), y = turnout_pct, fill = factor(group))) +
  geom_col() +
  labs(
    title = "Voter Turnout: 2023 Municipal Election",
    x = "Group (0=US-born, 1=Foreign)",
    y = "Turnout (%)"
  ) +
  theme_minimal()

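Based on these plots, foreign-born (naturalized) citizens in this matched sample show higher turnout than U.S.-born citizens in the 2024 general election, while U.S.-born citizens show higher turnout in the 2022 midterm and 2023 municipal elections. These percentages are shares of matched voter-election records rather than of unique registered voters, and no significance test was run, so the comparison is descriptive.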
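A turnout rate is usually defined per voter: the number of registered voters who cast a ballot in an election divided by the number of registered voters. One way to compute that from a merged table with one row per voter per election is sketched below on invented toy data. The column names `group`, `voter_reg_num`, and `election_lbl` follow those used above; `target_election` is a name introduced for this example.

```r
library(dplyr)

# Toy merged data (invented): voter 003 has no voting history (NA election)
toy_merged <- tibble(
  voter_reg_num = c("001", "001", "002", "003", "004", "004"),
  group         = c(1, 1, 1, 0, 0, 0),
  election_lbl  = as.Date(c("2022-11-08", "2024-11-05", "2024-11-05",
                            NA, "2022-11-08", "2024-11-05"))
)

target_election <- as.Date("2024-11-05")

# One row per voter, flagged TRUE if any history row matches the election
voter_turnout <- toy_merged |>
  group_by(group, voter_reg_num) |>
  summarise(voted = any(election_lbl == target_election, na.rm = TRUE),
            .groups = "drop")

# Turnout = voters who voted / distinct registered voters, per group
turnout_by_group <- voter_turnout |>
  group_by(group) |>
  summarise(
    registered  = n(),
    voted       = sum(voted),
    turnout_pct = voted / registered * 100
  )
turnout_by_group
```

A fuller version would probably also restrict the denominator to voters whose `registr_dt` falls on or before the election date, since voters who registered later could not have participated.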

Gender

turnout_gender <- matched_data %>%
  filter(
    election_lbl %in% as.Date(c("2024-11-05", "2023-11-07", "2022-11-08")),
    gender_code %in% c("F", "M")  # Only include Female and Male
  ) %>%
  group_by(group, election_lbl, gender_code) %>%
  summarise(voters = n(), .groups = "drop") %>%
  left_join(
    matched_data %>%
      filter(gender_code %in% c("F", "M")) %>%  # Match total registered
      group_by(group, gender_code) %>%
      summarise(total_registered = n(), .groups = "drop"),
    by = c("group", "gender_code")
  ) %>%
  mutate(turnout_pct = voters / total_registered * 100)
ggplot(turnout_gender, aes(x = gender_code, y = turnout_pct, fill = factor(group))) +
  geom_col(position = "dodge") +
  facet_wrap(~ election_lbl) +
  labs(
    title = "Voter Turnout by Gender (F & M only) and Group",
    x = "Gender",
    y = "Turnout (%)",
    fill = "Group (0 = US-born, 1 = Foreign)"
  ) +
  theme_minimal()

Based on these results, both female and male foreign-born citizens show higher turnout than their U.S.-born counterparts in the 2024 general election. In the midterm and municipal elections the opposite is true: U.S.-born citizens of both genders show higher turnout, which is consistent with the overall results above.

Age

# Five-year age bins: [18,23), [23,28), ..., [83,88); ages 88 and over become NA
age_breaks <- seq(18, 90, by = 5)

matched_with_age <- matched_data %>%
  mutate(age_group = cut(age_at_year_end, breaks = age_breaks, right = FALSE))

turnout_age_2024 <- matched_with_age %>%
  filter(election_lbl == as.Date("2024-11-05")) %>%
  group_by(group, age_group) %>%
  summarise(voters = n(), .groups = "drop") %>%
  left_join(
    matched_with_age %>%
      group_by(group, age_group) %>%
      summarise(total_registered = n(), .groups = "drop"),
    by = c("group", "age_group")
  ) %>%
  mutate(turnout_pct = voters / total_registered * 100)

ggplot(turnout_age_2024, aes(x = age_group, y = turnout_pct, fill = factor(group))) +
  geom_col(position = "dodge") +
  labs(
    title = "Voter Turnout by Age Group (2024 General Election)",
    x = "Age Group",
    y = "Turnout (%)",
    fill = "Group (0 = US-born, 1 = Foreign)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on these results, turnout in the 2024 general election is higher in the foreign-born/naturalized group than in the U.S.-born group across all age groups, which may indicate greater civic engagement. The gap is largest among the youngest voters and among those aged 58 to 63.
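The age bins used above depend on how `cut()` treats its break points. The short base R sketch below, using a handful of made-up ages, shows that the breaks `seq(18, 90, by = 5)` stop at 88, so voters aged 88 or older fall outside every bin, and shows one possible way to keep them by adding an open-ended final bin.

```r
ages <- c(18, 22, 23, 60, 88, 95)

# Breaks used above: 18, 23, ..., 88. With right = FALSE each bin is [a, b),
# so ages of 88 and above fall outside the last bin and become NA.
bins_original <- cut(ages, breaks = seq(18, 90, by = 5), right = FALSE)
bins_original

# Adding Inf as a final break keeps the oldest voters in an open-ended bin
bins_open <- cut(ages, breaks = c(seq(18, 88, by = 5), Inf), right = FALSE)
bins_open
```

Rows with an `NA` age group still appear in grouped summaries, so they would show up as an extra "NA" bar in the plot unless they are filtered out or given their own bin.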

Review

After conducting this study, I learned that foreign-born/naturalized citizens had higher voter turnout than U.S.-born citizens in the 2024 general election. I would like to examine earlier years to determine whether this is a consistent trend. In the 2022 midterm and 2023 municipal elections, by contrast, U.S.-born citizens had higher turnout and appear to be more civically engaged. The gender breakdown shows the same pattern for both women and men. Regarding age, the foreign-born advantage in the 2024 general election is largest among the youngest voters and those aged 58 to 63.

Things I Would Do Differently

I would examine more previous years to see if these trends persist. Additionally, I would like more accurate information to confirm whether voters with Birth State recorded as NA are truly foreign-born/naturalized citizens. Finally, I would conduct the study on a much larger sample, as I focused only on Wake County and had to reduce the data to a sample of 5,000 records per group because of its size.