Introduction

This project focuses on North Carolina voters from 2020 to 2022 and beyond. It studies how being turned away from voting at the polls affects future voting behavior.

To start, I loaded both the voter registration and voter history files from North Carolina and selected the variables needed for this project from each data set. Note that “ncid” is kept from both the registration and history files because it is the key that will later be used to merge the two data sets into one for testing. Both of these steps are shown below.

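# dplyr provides select(), left_join(), and mutate(); ggplot2 provides the plots
library(dplyr)
library(ggplot2)

# Both NCSBE files are tab-delimited text with a header row
voter_registration <- read.delim("/Users/ajcoots/Downloads/ncvoter_Statewide.txt", sep = "\t", header = TRUE)
voter_history <- read.delim("/Users/ajcoots/Downloads/ncvhis_Statewide.txt", sep = "\t", header = TRUE)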

voter_registration <- voter_registration |>
  select(county_id, county_desc, voter_reg_num, ncid, voter_status_desc, registr_dt, race_code, ethnic_code, party_cd, gender_code, age_at_year_end)
voter_history <- voter_history |> 
  select(county_id, county_desc, voter_reg_num, election_lbl, voting_method, voted_party_cd, ncid)

Hypothesis

Voters who were able to successfully register and vote on the weekend before Election Day are more likely to participate in future elections than voters who registered on Election Day but were turned away from voting due to North Carolina’s current voting laws.

Merged Data

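To merge the two data sets, I used left_join with the “ncid” variable mentioned earlier as the key. Every registration row is kept, and each one is matched to all of that voter’s history rows, so a voter with several elections on record appears once per election.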

merged_data <- voter_registration |>
  left_join(voter_history, by = "ncid")
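As a minimal sketch of what this join does to the row count, the toy example below uses two small made-up data frames in place of the NCSBE files. The names registration, history, and merged, and all of the values, are hypothetical and only illustrate the behavior of left_join.

```r
library(dplyr)

# Toy stand-ins for the two NCSBE files (all values are made up)
registration <- data.frame(
  ncid = c("AA1", "AA2", "AA3"),
  age_at_year_end = c(34, 51, 19)
)
history <- data.frame(
  ncid = c("AA1", "AA1", "AA2"),
  election_lbl = c("11/03/2020", "11/08/2022", "11/03/2020")
)

merged <- registration |>
  left_join(history, by = "ncid")

# AA1 appears twice (two elections); AA3 appears once with NA (no history rows)
nrow(merged)
n_distinct(merged$ncid)
```

The merged frame has more rows than there are voters, which matters later because anything computed per row is computed per voter-election pair rather than per voter.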

Election Day Registrants by Age

This visualization shows the age distribution of those who registered on Election Day. It uses a data frame named “election_day_registration_2020”, which keeps only voters who registered on Election Day in 2020 and who have the “ACTIVE” voter status.

ggplot(election_day_registration_2020, aes(x = age_at_year_end)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Election Day Registrants by Age",
       x = "Age at Year End",
       y = "Number of Voters") +
  theme_minimal()
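The code that builds “election_day_registration_2020” is not shown, so the following is only a sketch of how such a filter might look. It assumes that registr_dt is stored as text in MM/DD/YYYY format and uses a small made-up data frame; the toy_ names and the registr_date column are hypothetical. The 2020 General Election was held on November 3, 2020.

```r
library(dplyr)

# Toy registration rows; real values would come from the registration file
toy_registration <- data.frame(
  ncid = c("AA1", "AA2", "AA3", "AA4"),
  voter_status_desc = c("ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE"),
  registr_dt = c("11/03/2020", "10/15/2020", "11/03/2020", "11/03/2020"),
  age_at_year_end = c(22, 40, 67, 35)
)

election_day_2020 <- as.Date("2020-11-03")

# Keep ACTIVE voters whose registration date is Election Day 2020
toy_election_day_registration_2020 <- toy_registration |>
  mutate(registr_date = as.Date(registr_dt, format = "%m/%d/%Y")) |>
  filter(registr_date == election_day_2020, voter_status_desc == "ACTIVE")

toy_election_day_registration_2020$ncid
```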

Challenge: Dates

I ran into a challenge when trying to assign binary values based on the dates in the data set. The NCSBE data stores each date as a string, so comparing dates to one another (such as determining whether an election was before or after 2022) did not work until the strings were converted to Date objects. The data set used below, “test_data”, is the merged_data frame with one column added through mutate to classify voters as “Early Voter” or “Election Day Registrant”.

test_data$election_lbl <- as.Date(test_data$election_lbl, format = "%m/%d/%Y")
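To illustrate why the conversion is needed, here is a small base R sketch. The two labels are made-up examples in the MM/DD/YYYY format that the conversion above expects, and the name raw_labels is hypothetical.

```r
# Two election labels as they might appear in the raw file (MM/DD/YYYY text)
raw_labels <- c("11/03/2020", "03/05/2024")

# Compared as text, the 2020 label sorts after the 2024 one
# because the comparison starts with the month digits
raw_labels[1] > raw_labels[2]

# After parsing, the comparison follows the calendar
parsed <- as.Date(raw_labels, format = "%m/%d/%Y")
parsed[1] > parsed[2]

# Date objects can also be checked against a cutoff such as the 2022 Midterm
parsed >= as.Date("2022-11-08")
```

The first comparison returns TRUE even though the 2020 date is earlier, while the parsed comparison returns FALSE, which is the calendar-correct answer.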

Binaries

To run the logistic model, I created a binary outcome indicating whether a voting record falls on or after the 2022 Midterm. Rows were assigned 1 if the election took place on 2022-11-08 or later, and 0 otherwise, which includes rows from the 2020 General and rows with no election date. The same date conversion was used.

test_data <- test_data |>
  mutate(election_period_2022 = case_when(
    as.Date(election_lbl, format = "%m/%d/%Y") >= as.Date("2022-11-08") ~ "On or After 2022-11-08",
    TRUE ~ "Haven't Voted on or After 2022-11-08"
  ))

test_data <- test_data |>
  mutate(election_period_2022_binary = case_when(
    election_period_2022 == "On or After 2022-11-08" ~ 1,
    election_period_2022 == "Haven't Voted on or After 2022-11-08" ~ 0,
    TRUE ~ NA_real_ 
  ))
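As a sketch of how the catch-all branch behaves, the toy example below collapses the two steps into a single case_when. The data frame toy, the cutoff variable, and the column name voted_2022_or_later are hypothetical. The point it shows is that a missing election date does not satisfy the first condition, so it falls through to the TRUE branch and is coded 0.

```r
library(dplyr)

cutoff <- as.Date("2022-11-08")

# Made-up election dates, including one missing value
toy <- data.frame(
  election_lbl = as.Date(c("2020-11-03", "2022-11-08", "2024-03-05", NA))
)

toy <- toy |>
  mutate(voted_2022_or_later = case_when(
    election_lbl >= cutoff ~ 1,  # on or after the 2022 Midterm
    TRUE ~ 0                     # everything else, including NA dates
  ))

toy$voted_2022_or_later
```

Under this logic, as in the two-step version above, the final NA branch is never reached, because every row already matches one of the two labels.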

Logistic Model

The next step in the project is running the logistic regression model that analyzes the association between registration experience and future voting behavior. The first line of code converts voter_type to a factor with “Early Voter” as the reference level and “Election Day Registrant” as the comparison level. The remaining lines fit the model and display a summary of the results.

test_data$voter_type <- factor(test_data$voter_type, levels = c("Early Voter", "Election Day Registrant"))

log_model <- glm(election_period_2022_binary ~ voter_type, data = test_data, family = binomial)
summary(log_model)
## 
## Call:
## glm(formula = election_period_2022_binary ~ voter_type, family = binomial, 
##     data = test_data)
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.46736    0.01909 -76.881  < 2e-16 ***
## voter_typeElection Day Registrant -0.08902    0.02684  -3.317  0.00091 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35471  on 37563  degrees of freedom
## Residual deviance: 35460  on 37562  degrees of freedom
## AIC: 35464
## 
## Number of Fisher Scoring iterations: 4

Model Results

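With a p value of 0.00091 (below 0.001) and a coefficient of -0.08902, the model shows a statistically significant negative association between being an Election Day Registrant and later voting. The coefficient is on the log-odds scale and is measured relative to Early Voters, the reference level. This means that someone who is able to register on Election Day but is turned away from voting is less likely to vote in future elections than someone who successfully voted the weekend prior during the Early Voting period. The drop in deviance is small (35471 to 35460), so voter type on its own explains little of the variation in the outcome.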
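To put the coefficient on a more readable scale, the sketch below converts the two estimates from the summary output into an odds ratio and into predicted probabilities. The numbers are copied from the output above; the variable names are hypothetical, and this is a back-of-the-envelope calculation rather than a replacement for predict() on the fitted model.

```r
# Coefficients copied from the summary(log_model) output
intercept <- -1.46736
edr_coef  <- -0.08902

# Odds ratio for Election Day Registrants relative to Early Voters
odds_ratio <- exp(edr_coef)

# Predicted probability that the outcome equals 1 for each group
# plogis() is the inverse logit: 1 / (1 + exp(-x))
p_early <- plogis(intercept)
p_edr   <- plogis(intercept + edr_coef)

round(c(odds_ratio = odds_ratio, p_early = p_early, p_edr = p_edr), 3)
```

The odds ratio comes out at roughly 0.915, and the predicted probabilities are roughly 0.187 for Early Voters and 0.174 for Election Day Registrants, so the gap is statistically significant but only a little over one percentage point.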

Demographic Variables

I also wanted to study the effect that demographic information might have on future voting behavior. This demographics model tests the effect that age, race, and gender have on voting in 2022 and beyond.

demographics_model <- glm(
  election_period_2022_binary ~ age_at_year_end + race_code + gender_code,
  family = binomial,
  data = test_data
)
summary(demographics_model)
## 
## Call:
## glm(formula = election_period_2022_binary ~ age_at_year_end + 
##     race_code + gender_code, family = binomial, data = test_data)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.726547   0.120703 -22.589  < 2e-16 ***
## age_at_year_end  0.019950   0.000887  22.491  < 2e-16 ***
## race_codeB      -0.134902   0.118122  -1.142   0.2534    
## race_codeI       0.055608   0.175552   0.317   0.7514    
## race_codeM       0.023635   0.195723   0.121   0.9039    
## race_codeO       0.039966   0.136269   0.293   0.7693    
## race_codeP       0.444580   0.514627   0.864   0.3876    
## race_codeU       0.294066   0.122800   2.395   0.0166 *  
## race_codeW       0.548094   0.114001   4.808 1.53e-06 ***
## gender_codeM     0.064123   0.030005   2.137   0.0326 *  
## gender_codeU    -0.053166   0.058493  -0.909   0.3634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35471  on 37563  degrees of freedom
## Residual deviance: 34523  on 37553  degrees of freedom
## AIC: 34545
## 
## Number of Fisher Scoring iterations: 4

Demographics Model Results

This model indicates that age is a positive factor in future voting behavior. The age coefficient is 0.019950 with a p value of < 2e-16. Because the model is logistic, this coefficient is a change in log-odds, not in probability: each additional year of age multiplies the odds of voting in future elections by about 1.02, an increase of roughly 2%.
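As a quick check on the scale of the age effect, this sketch exponentiates the age coefficient copied from the output above. The ten-year comparison is an illustrative choice and the variable names are hypothetical.

```r
# Age coefficient copied from the summary(demographics_model) output
age_coef <- 0.019950

# Multiplicative change in the odds for one additional year of age
odds_per_year <- exp(age_coef)

# Multiplicative change in the odds across a ten-year age gap
odds_per_decade <- exp(10 * age_coef)

round(c(per_year = odds_per_year, per_decade = odds_per_decade), 3)
```

The change in probability implied by these odds ratios depends on the starting probability, which is why the coefficient cannot be read directly as a probability increase.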

These results also show that two race categories differ significantly from the reference category, which is the race code omitted from the coefficient table: undesignated-race voters (p = 0.0166) and white voters (p = 1.53e-06). Both coefficients are positive, meaning these voters are more likely than the reference group to have voted in the 2022 Midterm or a later election. The remaining race codes do not differ significantly from the reference group.

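Finally, gender also played a role in future behavior. Men showed a weak positive association (coefficient 0.064123, p value 0.0326) relative to the reference category, which under R’s default alphabetical factor ordering is women (code F). Because women serve as the baseline, the output above contains no separate coefficient for them; the weak negative estimate for women (-0.07361, with a p value of 0.0128) is not part of this output.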

Additional Visualizations

To create additional visualizations, I calculated the share of voting participation within each race group and created a bar chart to better understand the data.

race_summary <- test_data |> 
  group_by(race_code, election_period_2022_binary) |> 
  summarize(count = n(), .groups = "drop") |>     
  # Regroup by race so each proportion is a share of that race group, not of all rows
  group_by(race_code) |> 
  mutate(proportion = count / sum(count)) |> 
  ungroup()

race_summary$election_period_2022_binary <- factor(race_summary$election_period_2022_binary, 
  levels = c(0, 1), 
  labels = c("Did Not Vote", "Voted"))
ggplot(race_summary, aes(x = race_code, y = proportion, fill = election_period_2022_binary)) + 
    geom_bar(stat = "identity", position = "dodge") + 
    labs(title = "Voting Participation Rate by Race", 
         x = "Race Code", 
         y = "Proportion", 
         fill = "Voted in 2022+") + 
    theme_minimal()

Conclusion

This project has shown an association between a voter’s registration experience (whether or not they were able to successfully cast a ballot) and how likely they are to vote in future elections. Voters who were able to register on Election Day but could not complete a ballot were less likely to vote in later elections than Early Voters, as seen in the comparison between 2020 and 2022.

The Future of this Project

Completing this project has highlighted the importance of deepening my understanding of data science in order to fully analyze and interpret very large data sets, such as the North Carolina voter history data. I’m not 100% comfortable with all of the more complex code or with the logistic models used. I’m in my second year at UF and hope to take a research methods course to strengthen my skills in this field because it’s one of my interests within political science.