Final Class Project

Voter Turnout of Naturalized (Foreign Born) Citizens and U.S.-Born Citizens

Research Question: How does voter turnout differ between Naturalized (Foreign Born) citizens and U.S.-Born Citizens?

Introduction.

Naturalized citizens must undergo a longer process and take a citizenship class as part of their path toward United States citizenship. In contrast, U.S.-born citizens may have taken civic courses during their education. Each group, however, has a different experience, which could influence their voter turnout.

On the question of who is likely to register and vote, scholars have found that “the odds of registering among naturalized citizens are 36 percent lower and the odds of voting are 26 percent lower than those of native-born citizens. This may be because naturalized citizens in general have not developed strong ties within their communities or do not relate as well as the native born to the issues or candidates” (Bass and Casper 504).

In this project, I will compare voter turnout between naturalized (foreign-born) citizens and U.S.-born citizens across various elections. The focus is: How does voter turnout differ between naturalized (foreign-born) citizens and U.S.-born citizens?

Using the North Carolina Voter Registration Data and Voter History Data, I will examine the 2024 election, the November 2023 election, and the 2018 Midterm election.

This analysis will focus on Wake County, as it is one of the counties with the largest foreign-born populations.

I will also focus on voters who registered on selected dates: March 5, 2024; June 28 to July 5, 2024; and September 14 to September 23, 2024. According to U.S. Citizenship and Immigration Services, many naturalization ceremonies took place around these dates in 2024.

Null Hypothesis: There is no difference in voter turnout rates between naturalized and U.S.-born citizens.

Alternative Hypothesis: Naturalized citizens have lower voter turnout rates than U.S.-born citizens.

Getting Started

Load the Necessary Libraries.

library(tidycensus) 
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) 
library(dplyr) 
library(lubridate) 
library(knitr) 
library(scales) 

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(MatchIt)

Collecting Data-sets

WAKE_VOTER_registration <- read_tsv("~/Downloads/ncvoter92.zip") 
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 968240 Columns: 67
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (43): county_desc, voter_reg_num, ncid, last_name, first_name, middle_na...
dbl  (9): county_id, zip_code, mail_zipcode, birth_year, age_at_year_end, nc...
lgl (15): mail_addr3, mail_addr4, full_phone_number, township_abbrv, townshi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
WAKE_VOTER_HIST <- read_tsv("~/Downloads/ncvhis92 (1).zip")  
Rows: 4063003 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (13): county_desc, voter_reg_num, election_lbl, election_desc, voting_me...
dbl  (2): county_id, voted_county_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filter the Data-sets

WAKE_VOTER_registration <- WAKE_VOTER_registration |>
  filter(!grepl("REMOVED", voter_status_desc)) 
WAKE_VOTER_registration <- WAKE_VOTER_registration |>
  filter(!grepl("INACTIVE", voter_status_desc))  
WAKE_VOTER_registration <- WAKE_VOTER_registration |> 
  filter(!grepl("DENIED", voter_status_desc))   
WAKE_VOTER_registration <- WAKE_VOTER_registration |>
  filter(!grepl("Y", confidential_ind))   
All voters categorized or labeled as “Removed,” “Inactive,” “Denied,” or “Confidential” were excluded. This ensures the study focuses only on voters labeled as active.
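The four filters above could also be written as a single step. The following is a minimal sketch on a made-up toy table (the rows are invented; only the column names voter_status_desc and confidential_ind come from the registration file), showing one filter() call with a combined pattern:

```r
library(dplyr)

# Toy stand-in for the registration file (made-up rows, real column names)
toy_registration <- tibble::tibble(
  voter_reg_num     = c("001", "002", "003", "004", "005"),
  voter_status_desc = c("ACTIVE", "INACTIVE", "REMOVED", "DENIED", "ACTIVE"),
  confidential_ind  = c("N", "N", "N", "N", "Y")
)

# One filter() call with several conditions keeps a row only if all are TRUE
active_only <- toy_registration |>
  filter(
    !grepl("REMOVED|INACTIVE|DENIED", voter_status_desc),
    !grepl("Y", confidential_ind)
  )

active_only$voter_reg_num
```

Note that grepl() returns FALSE for NA, so rows with a missing status would be kept by both the original four filters and this combined version.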
FB_VOTER_Registr <- WAKE_VOTER_registration  |> 
  filter(is.na(birth_state))   
This final filter limits the data-set to voters whose Birth State field is blank (recorded as NA). For now, we will assume these individuals are foreign-born and naturalized.

Merging Data-sets

FBVOTER_REGISTR_DATA <- FB_VOTER_Registr |>
  left_join(WAKE_VOTER_HIST, by = "voter_reg_num")  
This step merges the Voter History File with the filtered Voter Registration Data. The two data-sets are joined on voter_reg_num, which serves as a unique identifier for each voter in Wake County, North Carolina.
FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA  |>
  mutate(county_id = coalesce(county_id.x, county_id.y)) |>
  select(-county_id.x, -county_id.y) 

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |>
  mutate(county_desc = coalesce(county_desc.x, county_desc.y)) |>
  select(-county_desc.x, -county_desc.y) 

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |>
  mutate(ncid = coalesce(ncid.x, ncid.y)) |>
  select(-ncid.x, -ncid.y)  
Merging these data-sets created duplicate copies of the columns that appear in both files (with .x and .y suffixes), which we need to eliminate.
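An alternative that avoids the duplicates in the first place is to drop the shared non-key columns from the history side before joining. This is only a sketch on made-up toy tables (toy_registration and toy_history are hypothetical stand-ins), and it keeps the registration-side values instead of coalescing the two copies:

```r
library(dplyr)

# Toy stand-ins (made-up rows) that share three non-key columns
toy_registration <- tibble::tibble(
  voter_reg_num = c("001", "002"),
  county_id     = c(92, 92),
  county_desc   = "WAKE",
  ncid          = c("AA1", "AA2")
)
toy_history <- tibble::tibble(
  voter_reg_num = c("001", "001", "002"),
  county_id     = 92,
  county_desc   = "WAKE",
  ncid          = c("AA1", "AA1", "AA2"),
  election_lbl  = c("11/05/2024", "11/06/2018", "11/05/2024")
)

# Drop the overlapping columns from the history side so no .x/.y pairs appear
toy_joined <- toy_registration |>
  left_join(
    toy_history |> select(-county_id, -county_desc, -ncid),
    by = "voter_reg_num"
  )

names(toy_joined)
```

The result differs from the coalesce() approach only where the registration-side value is NA, since coalesce() would then fall back to the history-side value.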
Now, I will repeat the same process for the U.S.-born citizens.
USBORN_VOTER <- WAKE_VOTER_registration |> 
  filter(!is.na(birth_state))   

USBORN_VOTER_Registr <- USBORN_VOTER |>
  left_join(WAKE_VOTER_HIST, by = "voter_reg_num")  

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(county_id = coalesce(county_id.x, county_id.y)) |>
  select(-county_id.x, -county_id.y) 

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(county_desc = coalesce(county_desc.x, county_desc.y)) |>
  select(-county_desc.x, -county_desc.y) 


USBORN_VOTER_Registr <- USBORN_VOTER_Registr |>
  mutate(ncid = coalesce(ncid.x, ncid.y)) |>
  select(-ncid.x, -ncid.y) 

Change Date Format

FBVOTER_REGISTR_DATA <- FBVOTER_REGISTR_DATA |> 
  mutate(election_lbl = as.Date(election_lbl, format = "%m/%d/%Y")) |> 
  mutate(registr_dt =as.Date(registr_dt, format = "%m/%d/%Y")) 

USBORN_VOTER_Registr <- USBORN_VOTER_Registr |> 
  mutate(election_lbl = as.Date(election_lbl, format = "%m/%d/%Y")) |> 
  mutate(registr_dt =as.Date(registr_dt, format = "%m/%d/%Y"))  

Vital Step: Determine which days in 2024 had the most voter registrations, and whether those days can help identify which individuals with NA in the Birth State variable are naturalized citizens.

naturalized_2024 <- FBVOTER_REGISTR_DATA |>
  filter(registr_dt >= as.Date("2024-01-01") &
           registr_dt <= as.Date("2024-12-31")) 

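# Count each voter once per date; the joined data has one row per election record
daily_counts <- naturalized_2024 |>
  group_by(registr_dt) |>
  summarise(registrations = n_distinct(voter_reg_num))  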

daily_counts$month <- month(daily_counts$registr_dt, label = TRUE)

ggplot(daily_counts, aes(x = registr_dt, y = registrations)) +
  geom_line() +
  facet_wrap(~ month, scales = "free_x") +
  labs(
    title = "Daily Voter Registrations in 2024 by Month",
    x = "Date",
    y = "Registrations"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
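Besides reading the peaks off the plot, the busiest registration days can be ranked directly. The sketch below uses a made-up toy table with the same column names; because the joined data holds one row per election record, it counts distinct voter_reg_num values so that no voter is counted more than once:

```r
library(dplyr)

# Toy joined data (made-up): one row per voter per election record
toy_joined <- tibble::tibble(
  voter_reg_num = c("001", "001", "002", "003", "004"),
  registr_dt    = as.Date(c("2024-03-05", "2024-03-05", "2024-03-05",
                            "2024-07-01", "2024-09-17"))
)

# Count each voter once per registration date, then keep the busiest date
top_days <- toy_joined |>
  group_by(registr_dt) |>
  summarise(registrations = n_distinct(voter_reg_num), .groups = "drop") |>
  slice_max(registrations, n = 1)

top_days
```

In this toy table, voter 001 appears twice on March 5, so a plain row count would report three registrations that day instead of two.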

Selecting Dates To Determine Naturalized Citizens

According to USCIS, many naturalization ceremonies take place in late June and early July, around Independence Day, and during the week of September 14 to September 23, around Citizenship Week and September 17, which is both Citizenship Day and Constitution Day. The graph also showed a high number of registrations in March, especially on March 5, 2024. I will select these dates and compare voter turnout between U.S.-born and foreign-born/naturalized citizens who registered on them.
# Define the date ranges
ceremony_dates <- as.Date(c("2024-03-05"))  # single date
independence_dates <- seq(as.Date("2024-06-28"), as.Date("2024-07-05"), by = "day")
constitution_dates <- seq(as.Date("2024-09-14"), as.Date("2024-09-23"), by = "day")

# Combine all dates into a single vector
selected_dates <- c(ceremony_dates, independence_dates, constitution_dates)

# Filter datasets
FBVOTER_filtered <- FBVOTER_REGISTR_DATA |>
  filter(registr_dt %in% selected_dates)

USBORN_filtered <- USBORN_VOTER_Registr |>
  filter(registr_dt %in% selected_dates)  

# Create new variable
FBVOTER_filtered <- FBVOTER_filtered %>% mutate(group = "Foreign")
USBORN_filtered <- USBORN_filtered %>% mutate(group = "USborn")

# Combine into one dataset
all_filtered <- bind_rows(FBVOTER_filtered, USBORN_filtered)

Seeing if both Data-sets are Balanced

With respect to age, the two groups are almost balanced.
# A tibble: 3 × 3
  Statistic Foreign USborn
  <chr>     <chr>   <chr> 
1 F         38.9%   57.3% 
2 M         28.8%   42.6% 
3 U         32.3%   0.1%  
This shows a gender imbalance between the two data-sets: 32.3% of the foreign-born group has an unknown gender code, compared with 0.1% of the U.S.-born group. An imbalance like this makes the turnout comparison less reliable.
# A tibble: 8 × 3
  Race                         Foreign USborn
  <chr>                        <chr>   <chr> 
1 Asian                        3.3%    2.7%  
2 Black                        15.4%   22.9% 
3 Middle Eastern               0.8%    0.9%  
4 Native American / Indigenous 0.3%    0.1%  
5 Other                        5%      7.7%  
6 Pacific Islander             0%      <NA>  
7 Unknown                      24.4%   0.8%  
8 White                        50.8%   64.8% 
Race is also unbalanced: 24.4% of the foreign-born group has an unknown race, compared with 0.8% of the U.S.-born group. To address this, I will try to balance the two groups by matching.
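The code behind the balance tables is not shown here. One way such a table could be built is sketched below on a made-up toy table (toy_all is a hypothetical stand-in for all_filtered, reduced to one row per voter): compute each category's share within its group, then spread the groups into columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for all_filtered (made-up rows): one row per voter
toy_all <- tibble::tibble(
  group       = c("Foreign", "Foreign", "Foreign", "Foreign", "USborn", "USborn"),
  gender_code = c("F", "M", "U", "U", "F", "M")
)

# Share of each gender code within each group, one column per group
gender_balance <- toy_all |>
  count(group, gender_code) |>
  group_by(group) |>
  mutate(pct = n / sum(n) * 100) |>
  ungroup() |>
  select(-n) |>
  pivot_wider(names_from = group, values_from = pct)

gender_balance
```

A category that never occurs in one group comes out as NA, as with the Pacific Islander row in the race table; passing values_fill = 0 to pivot_wider() would show it as 0 instead.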

Matching

Warning: Fewer control units than treated units in some `exact` strata; not all
treated units will get a match.

Call:
matchit(formula = group ~ 1, data = all_data_select, method = "nearest", 
    exact = c("gender_code", "race_code", "age_group"), ratio = 1)

Summary of Balance for All Data:
               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance              0.8407        0.8407         -0.0000    14.4283    0.0000
gender_codeF          0.3888        0.5728         -0.3774          .    0.1840
gender_codeM          0.2878        0.4261         -0.3055          .    0.1383
gender_codeU          0.3234        0.0011          0.6890          .    0.3223
race_codeA            0.0330        0.0274          0.0312          .    0.0056
race_codeB            0.1537        0.2290         -0.2090          .    0.0754
race_codeI            0.0028        0.0006          0.0419          .    0.0022
race_codeM            0.0083        0.0090         -0.0075          .    0.0007
race_codeO            0.0504        0.0773         -0.1228          .    0.0269
race_codeP            0.0002        0.0000          0.0146          .    0.0002
race_codeU            0.2437        0.0084          0.5481          .    0.2353
race_codeW            0.5080        0.6484         -0.2809          .    0.1404
age_group18-29        0.3819        0.3471          0.0715          .    0.0348
age_group30-44        0.3249        0.3331         -0.0176          .    0.0082
age_group45-59        0.1445        0.1383          0.0177          .    0.0062
age_group60+          0.1487        0.1814         -0.0921          .    0.0327
               eCDF Max
distance         0.0000
gender_codeF     0.1840
gender_codeM     0.1383
gender_codeU     0.3223
race_codeA       0.0056
race_codeB       0.0754
race_codeI       0.0022
race_codeM       0.0007
race_codeO       0.0269
race_codeP       0.0002
race_codeU       0.2353
race_codeW       0.1404
age_group18-29   0.0348
age_group30-44   0.0082
age_group45-59   0.0062
age_group60+     0.0327

Summary of Balance for Matched Data:
               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance              0.8407        0.8407               0          1         0
gender_codeF          0.5744        0.5744               0          .         0
gender_codeM          0.4245        0.4245               0          .         0
gender_codeU          0.0011        0.0011               0          .         0
race_codeA            0.0275        0.0275               0          .         0
race_codeB            0.2296        0.2296               0          .         0
race_codeI            0.0006        0.0006               0          .         0
race_codeM            0.0062        0.0062               0          .         0
race_codeO            0.0775        0.0775               0          .         0
race_codeP            0.0000        0.0000               0          .         0
race_codeU            0.0084        0.0084               0          .         0
race_codeW            0.6502        0.6502               0          .         0
age_group18-29        0.3481        0.3481               0          .         0
age_group30-44        0.3341        0.3341               0          .         0
age_group45-59        0.1387        0.1387               0          .         0
age_group60+          0.1791        0.1791               0          .         0
               eCDF Max Std. Pair Dist.
distance              0               0
gender_codeF          0               0
gender_codeM          0               0
gender_codeU          0               0
race_codeA            0               0
race_codeB            0               0
race_codeI            0               0
race_codeM            0               0
race_codeO            0               0
race_codeP            0               0
race_codeU            0               0
race_codeW            0               0
age_group18-29        0               0
age_group30-44        0               0
age_group45-59        0               0
age_group60+          0               0

Sample Sizes:
          Control Treated
All          1786    9424
Matched      1781    1781
Unmatched       5    7643
Discarded       0       0
Summary of Balance After Matching
Variable        Mean Treated  Mean Control  Std. Mean Difference
distance           0.8406780     0.8406780                     0
gender_codeF       0.5743964     0.5743964                     0
gender_codeM       0.4244806     0.4244806                     0
gender_codeU       0.0011230     0.0011230                     0
race_codeA         0.0275126     0.0275126                     0
race_codeB         0.2296463     0.2296463                     0
race_codeI         0.0005615     0.0005615                     0
race_codeM         0.0061763     0.0061763                     0
race_codeO         0.0774846     0.0774846                     0
race_codeP         0.0000000     0.0000000                     0
race_codeU         0.0084222     0.0084222                     0
race_codeW         0.6501965     0.6501965                     0
age_group18-29     0.3481190     0.3481190                     0
age_group30-44     0.3340820     0.3340820                     0
age_group45-59     0.1386861     0.1386861                     0
age_group60+       0.1791129     0.1791129                     0
                 Variable Means Treated Means Control Std. Mean Diff.
1                    <NA>  0.8406779661  0.8406779661               0
2          Gender: Female  0.5743964065  0.5743964065               0
3            Gender: Male  0.4244806289  0.4244806289               0
4         Gender: Unknown  0.0011229646  0.0011229646               0
5             Race: Asian  0.0275126334  0.0275126334               0
6             Race: Black  0.2296462661  0.2296462661               0
7        Race: Indigenous  0.0005614823  0.0005614823               0
8    Race: Middle Eastern  0.0061763054  0.0061763054               0
9             Race: Other  0.0774845592  0.0774845592               0
10 Race: Pacific Islander  0.0000000000  0.0000000000               0
11          Race: Unknown  0.0084222347  0.0084222347               0
12            Race: White  0.6501965188  0.6501965188               0
13             Age: 18-29  0.3481190343  0.3481190343               0
14             Age: 30-44  0.3340819764  0.3340819764               0
15             Age: 45-59  0.1386861314  0.1386861314               0
16               Age: 60+  0.1791128579  0.1791128579               0
After matching, the two groups are balanced on gender, race, and age group. To achieve this balance, I needed to create an age-group variable (18-29, 30-44, 45-59, and 60+), which reduced errors and imbalance in the data.
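The matching code itself is not displayed, only its output. The sketch below reproduces the shape of the matchit() call from that output on simulated data; the data frame toy_voters and its values are invented, and coding group as 1 = foreign-born and 0 = U.S.-born is an assumption.

```r
library(MatchIt)

set.seed(1)  # reproducible toy data

# Simulated voters (made-up); group is assumed to be 1 = foreign-born, 0 = U.S.-born
n <- 300
toy_voters <- data.frame(
  group       = rbinom(n, 1, 0.7),
  gender_code = factor(sample(c("F", "M"), n, replace = TRUE)),
  age_group   = factor(sample(c("18-29", "30-44", "45-59", "60+"), n, replace = TRUE))
)

# 1:1 nearest-neighbor matching, restricted to pairs that agree exactly
# on gender and age group (same call shape as in the output above)
m_out <- matchit(
  group ~ 1,
  data   = toy_voters,
  method = "nearest",
  exact  = c("gender_code", "age_group"),
  ratio  = 1
)

# Keep only the matched rows
toy_matched <- match.data(m_out)

# After exact 1:1 matching, each gender/age cell holds equal counts per group
table(toy_matched$group, toy_matched$gender_code, toy_matched$age_group)
```

Because every treated unit is paired with a control unit from the same gender and age cell, the matched groups have identical covariate distributions, which is why the standardized mean differences above are all 0.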

Results Using Only Individuals Who Registered on the Selected Dates

selected_elections <- matched_data %>%
  filter(election_lbl %in% as.Date(c("2024-11-05", "2018-11-06", "2023-11-07"))) 

turnout_counts <- selected_elections %>%
  group_by(group, election_lbl) %>%
  summarise(voters = n(), .groups = "drop")  

total_per_group <- matched_data %>%
  group_by(group) %>%
  summarise(total_registered = n())

turnout_summary <- turnout_counts %>%
  left_join(total_per_group, by = "group") %>%
  mutate(turnout_pct = voters / total_registered * 100)   

# 2024 General
general_2024 <- turnout_summary %>%
  filter(election_lbl == as.Date("2024-11-05"))
# Convert group to descriptive labels
general_2024$group_label <- ifelse(general_2024$group == 1, "Foreign", "US-born")

ggplot(general_2024, aes(x = group_label, y = turnout_pct, fill = group_label)) +
  geom_col(width = 0.6, show.legend = FALSE) +  # remove legend since labels are clear
  geom_text(aes(label = paste0(round(turnout_pct,1), "%")), 
            vjust = -0.5, size = 5) +  # add labels above bars
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # give space for labels
  scale_fill_manual(values = c("US-born" = "#1f78b4", "Foreign" = "#33a02c")) +  # nicer colors
  labs(
    title = "Voter Turnout: 2024 General Election",
    x = "",
    y = "Turnout (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  ) 

This result indicates that foreign-born/naturalized citizens had a lower voter turnout percentage than U.S.-born citizens in the 2024 general election. However, note that this is based on a limited sample.
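One caveat about the turnout percentage: if matched_data holds one row per voter per election record, n() counts rows and not people. The sketch below, on a made-up toy table, counts distinct voters in both the numerator and the denominator. It assumes the matched data still carries voter_reg_num, which is not confirmed by the code shown.

```r
library(dplyr)

# Toy matched data (made-up): one row per voter per election record
toy_matched <- tibble::tibble(
  voter_reg_num = c("001", "001", "002", "003", "003", "004"),
  group         = c(1, 1, 1, 0, 0, 0),
  election_lbl  = as.Date(c("2024-11-05", "2018-11-06", "2023-11-07",
                            "2024-11-05", "2023-11-07", "2024-11-05"))
)

# Denominator: distinct registered voters per group
registered <- toy_matched |>
  group_by(group) |>
  summarise(total_registered = n_distinct(voter_reg_num), .groups = "drop")

# Numerator: distinct voters per group with a 2024 general election record
turnout_2024 <- toy_matched |>
  filter(election_lbl == as.Date("2024-11-05")) |>
  group_by(group) |>
  summarise(voters = n_distinct(voter_reg_num), .groups = "drop") |>
  left_join(registered, by = "group") |>
  mutate(turnout_pct = voters / total_registered * 100)

turnout_2024
```

In this toy table each group has two voters, so turnout is 50% for group 1 and 100% for group 0; dividing by row counts would instead give about 33% and 67%.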
# 2018 Midterm Election

midterm_2018 <- turnout_summary %>%
  filter(election_lbl == as.Date("2018-11-06")) 


midterm_2018$group_label <- ifelse(midterm_2018$group == 1, "Foreign", "US-born")

ggplot(midterm_2018, aes(x = group_label, y = turnout_pct, fill = group_label)) +
  geom_col(width = 0.6, show.legend = FALSE) +                # nicer bar width, remove legend
  geom_text(aes(label = paste0(round(turnout_pct,1), "%")),   # add % labels above bars
            vjust = -0.5, size = 5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # space above bars for labels
  scale_fill_manual(values = c("US-born" = "#1f78b4", "Foreign" = "#33a02c")) + # distinct colors
  labs(
    title = "Voter Turnout: 2018 Midterm Election",
    x = "",
    y = "Turnout (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

# 2023 Municipal 

municipal_2023 <- turnout_summary %>%
  filter(election_lbl == as.Date("2023-11-07"))

municipal_2023$group_label <- ifelse(municipal_2023$group == 1, "Foreign", "US-born")

ggplot(municipal_2023, aes(x = group_label, y = turnout_pct, fill = group_label)) +
  geom_col(width = 0.6, show.legend = FALSE) +                    # nicer bar width, remove legend
  geom_text(aes(label = paste0(round(turnout_pct, 1), "%")),      # add % labels on top
            vjust = -0.5, size = 5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +      # give space for labels
  scale_fill_manual(values = c("US-born" = "#1f78b4", "Foreign" = "#33a02c")) + # distinct colors
  labs(
    title = "Voter Turnout: 2023 Municipal Election",
    x = "",
    y = "Turnout (%)"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

However, in the 2018 midterm election and the 2023 municipal (local) election, U.S.-born citizens had lower voter turnout than naturalized citizens. This may suggest that U.S.-born citizens are less aware of, or participate less in, these elections. Again, though, this comes from a limited sample, and the difference between the groups is small.
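Whether a difference between the groups is larger than chance would produce can be checked against the hypotheses stated in the introduction. A minimal sketch using base R's prop.test() follows; the counts are hypothetical and chosen only to illustrate the call, they are not results from this study.

```r
# Hypothetical counts for illustration only; these are NOT results from this study
voted      <- c(foreign = 620, us_born = 700)    # voters who cast a ballot
registered <- c(foreign = 1000, us_born = 1000)  # registered voters in each group

# One-sided two-sample test of proportions:
# H0: turnout is equal; H1: turnout in the first group (foreign) is lower
test_result <- prop.test(x = voted, n = registered, alternative = "less")

test_result$estimate  # observed turnout proportions
test_result$p.value   # small values count as evidence against H0
```

The one-sided alternative mirrors the Alternative Hypothesis above. For an election where the naturalized group shows the higher turnout, the same test would return a large p-value and the null hypothesis would not be rejected.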

Gender Analysis

turnout_gender <- matched_data %>%
  filter(
    election_lbl %in% as.Date(c("2024-11-05", "2018-11-06", "2023-11-07")),
    gender_code %in% c("F", "M")  # Only include Female and Male
  ) %>%
  group_by(group, election_lbl, gender_code) %>%
  summarise(voters = n(), .groups = "drop") %>%
  left_join(
    matched_data %>%
      filter(gender_code %in% c("F", "M")) %>%  # Match total registered
      group_by(group, gender_code) %>%
      summarise(total_registered = n(), .groups = "drop"),
    by = c("group", "gender_code")
  ) %>%
  mutate(turnout_pct = voters / total_registered * 100) 

turnout_gender_fm <- turnout_gender |> 
  filter(gender_code %in% c("F", "M"))

# Add descriptive group labels
turnout_gender_fm$group_label <- ifelse(turnout_gender_fm$group == 1, "Foreign", "US-born")

ggplot(turnout_gender_fm, aes(x = gender_code, y = turnout_pct, fill = group_label)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  geom_text(aes(label = paste0(round(turnout_pct,1), "%")), 
            position = position_dodge(width = 0.8), 
            vjust = -0.5, size = 3.5) +
  facet_wrap(~ election_lbl, scales = "free_y") +
  scale_fill_manual(values = c("US-born" = "#1f78b4", "Foreign" = "#33a02c")) +
  labs(
    title = "Voter Turnout by Gender and Group",
    x = "Gender",
    y = "Turnout (%)",
    fill = "Group"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    axis.text.x = element_text(face = "bold"),
    strip.text = element_text(face = "bold"),  # facet labels bold
    plot.title = element_text(face = "bold", hjust = 0.5)
  ) 

These results are consistent with the previous graphs. Looking at gender, U.S.-born women had a higher voter turnout percentage than U.S.-born men in the 2018 and 2023 elections, but not in the 2024 election. Among foreign-born (naturalized) citizens, women likewise had higher voter turnout than men in 2018 and 2023, but not in 2024.

Age Analysis

turnout_age_2024 <- matched_data %>% 
  filter(election_lbl == as.Date("2024-11-05")) %>%  
  mutate(age_group = cut(age_at_year_end, breaks = seq(18, 90, by = 5),  
  right = FALSE)) %>%  
  group_by(group, age_group) %>%  
  summarise(voters = n(), .groups = "drop") %>%  
  left_join( matched_data %>%  
  mutate(age_group = cut(age_at_year_end, breaks = seq(18, 90, by = 5),  
  right = FALSE)) %>%  
    group_by(group, age_group) %>%  
    summarise(total_registered = n(), .groups = "drop"),  
  by = c("group", "age_group") ) %>%  
  mutate(turnout_pct = voters / total_registered * 100)   

ggplot(turnout_age_2024,
       aes(x = age_group,
           y = turnout_pct,
           fill = factor(group,
                         labels = c("US-born", "Foreign-born")))) +
  geom_col(position = position_dodge(width = 0.8),
           width = 0.7) +
  labs(
    title = "Voter Turnout by Age Group",
    subtitle = "2024 General Election",
    x = "Age Group",
    y = "Turnout (%)",
    fill = "Citizenship Status"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

This shows that in the youngest age group (18-23), foreign-born/naturalized citizens had higher voter turnout than U.S.-born citizens. In most of the other age groups, however, turnout was higher among U.S.-born citizens, which may suggest greater civic engagement in that group.
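To make the age bins in the plot easier to read, the short sketch below applies the same cut() call to a few made-up example ages. It shows where the interval edges fall and which ages end up without a bin.

```r
# Same breaks as in the age analysis: 18, 23, 28, ..., 88
age_breaks <- seq(18, 90, by = 5)

# Example ages (made-up) to show where the interval edges fall
example_ages <- c(18, 22, 23, 87, 88, 90)

# right = FALSE makes intervals closed on the left and open on the right
age_bins <- cut(example_ages, breaks = age_breaks, right = FALSE)

as.character(age_bins)
# "[18,23)" "[18,23)" "[23,28)" "[83,88)" NA NA
```

Two things follow from this: the first bin covers ages 18 through 22, and because the last break is 88, voters aged 88 or older fall outside every interval and are grouped under NA.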

Review

After conducting this study, I learned that in the 2024 general election, U.S.-born citizens had higher voter turnout than foreign-born/naturalized citizens. However, I would like to examine previous years to determine if this is a consistent trend. In contrast, in the 2018 midterm and 2023 municipal elections, naturalized citizens had slightly higher turnout than U.S.-born citizens. The gender analysis also shows which gender in each group had higher voter turnout in each election. Regarding age, in the youngest age group, foreign-born/naturalized citizens had higher turnout than U.S.-born citizens in the 2024 general election.

Things I would do differently: