Introduction:

This analysis examines the relationship between education level and voting patterns in the 2016 U.S. Presidential Election, drawing on data from the ANES 2016 Time Series Study. A central challenge in survey-based research is missing data, particularly in key variables such as presidential vote choice. To address this, the analysis applies multiple imputation via the Amelia package, following the practice recommended by Acock (2005) of comparing results before and after imputation to assess the bias that missing values may introduce.

library(Amelia)
library(clarify)
library(modelsummary)
library(readxl)
library(dplyr)
library(tidyr)
library(nnet)
library(ggplot2)
library(kableExtra)
library(knitr)
library(patchwork)
library(tinytable)
# Adjust this path to the local project directory
setwd("C:/Users/Yung Cho/Documents/GitHub/Yung_QC")

# Read the survey data once; the copy is the input to the imputation step
data16 <- read_xlsx("ANES 2016.xlsx", sheet = 2)

data16_imp <- data16

Methodology:

The imputation process uses observed relationships between education, religious affiliation, age group, and self-perceived economic status to estimate missing values in the vote choice variable. A comparison of pre- and post-imputation vote distributions reveals differences, which suggests that the data are unlikely to be missing completely at random (MCAR). This underscores the importance of imputation: simply omitting cases with missing values could distort findings and reduce statistical power.
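
A more direct check on the MCAR assumption is to see whether the rate of missing vote choice varies with an observed covariate. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and is not the ANES file; only the column names are borrowed from this analysis). Under MCAR the missingness rate should be roughly the same at every education level.

```r
# Hypothetical toy data: NOT the ANES file, only the column names match
toy <- data.frame(
  education = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  vote16    = c(NA, NA, NA, 1, NA, NA, 2, 1, NA, 1, 2, 1, 1, 2, 1, 3)
)

# Share of respondents with a missing vote choice at each education level
missing_by_edu <- tapply(is.na(toy$vote16), toy$education, mean)
print(round(missing_by_edu, 2))

# Exact test of independence between missingness and education.
# A small p-value is evidence against MCAR; it cannot confirm MAR.
miss_table <- table(missing = is.na(toy$vote16), education = toy$education)
mcar_test <- fisher.test(miss_table)
print(mcar_test$p.value)
```

On real data the same two steps would be run on `data16`, once for each covariate used in the imputation model.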

vote16_breakdown <- data16 %>%
  count(vote16) %>%
  mutate(percentage = n / sum(n) * 100)

table1 <- vote16_breakdown
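set.seed(2016)  # arbitrary seed so the imputations are reproducible

# Variables listed in idvars are left out of the imputation model, so the
# predictors are declared by type instead. Treating religious as nominal and
# the other three as ordinal is an assumption about how they are coded.
amelia_results <- amelia(data16_imp, m = 20,
                         noms = c("vote16", "religious"),
                         ords = c("education", "age.group", "feel.poor"))

# Only the first of the 20 imputed datasets is carried forward
data16_imp_imputed <- amelia_results$imputations[[1]]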

vote16_breakdown_imp <- data16_imp_imputed %>%
  count(vote16) %>%
  mutate(percentage = n / sum(n) * 100)

table2 <- vote16_breakdown_imp
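
Pre- and Post-Imputation

Pre-Imputation

| vote16  | n    | Percentage (%) |
|---------|------|----------------|
| Clinton | 1290 | 30.20          |
| Trump   | 1178 | 27.58          |
| Other   | 195  | 4.57           |
| NA      | 1608 | 37.65          |

Post-Imputation

| vote16  | n    | Percentage (%) |
|---------|------|----------------|
| Clinton | 2050 | 48.00          |
| Trump   | 1790 | 41.91          |
| Other   | 431  | 10.09          |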

Comparing the tables:

Comparing the vote distribution before and after imputation helps gauge the bias that missing data could introduce. Vote choice is missing for 37.6% of respondents (1,608 of 4,271). The differences between the two tables suggest that the data are not missing completely at random (MCAR). Following King et al. (2001), the analysis instead assumes the data are missing at random (MAR) and uses other variables in the dataset to predict and impute the missing values.
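
Because the pre-imputation percentages keep the missing category in the denominator, part of the gap between the two tables is mechanical. A like-for-like comparison rescales the observed counts to valid responses only. The sketch below does this with the counts reported in the tables; the object names are illustrative.

```r
# Counts copied from the pre- and post-imputation tables
pre  <- c(Clinton = 1290, Trump = 1178, Other = 195)  # observed votes, NA dropped
post <- c(Clinton = 2050, Trump = 1790, Other = 431)  # first imputed dataset

comparison <- data.frame(
  candidate    = names(pre),
  observed_pct = round(100 * pre / sum(pre), 1),
  imputed_pct  = round(100 * post / sum(post), 1),
  row.names    = NULL
)

# Percentage-point change attributable to the imputed cases
comparison$change <- comparison$imputed_pct - comparison$observed_pct
print(comparison)
```

On this basis the share for Other rises from about 7.3% to 10.1% and the share for Trump falls from about 44.2% to 41.9%, while Clinton's share is nearly unchanged.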

Although imputation generates estimates to fill in missing data, it cannot fully recover the lost information. It is nonetheless generally preferable to discarding incomplete cases, and in most settings it yields results at least as reliable as those from a complete-case analysis.

Detailed Analysis of the Graph and Models:

Imputation is only the prelude to analysis. Once imputation is completed, each imputed dataset is analyzed as if it were complete, and the results are then combined, as described by Honaker, King, and Blackwell (2011). Following imputation, the analysis employs multinomial logistic regression to model the probability of voting for Hillary Clinton, Donald Trump, or a third-party candidate as a function of education level. To quantify uncertainty in these predicted probabilities, simulation-based inference is conducted using the clarify package, generating 1,000 draws from the estimated model.
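
The combining step is usually done with Rubin's rules: the pooled estimate is the mean across imputed datasets, and its variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The following is a minimal base-R sketch under that assumption. `pool_rubin` and its inputs are hypothetical and are not part of this analysis; Amelia's `mi.meld()` offers a packaged version of the same calculation.

```r
# Combine one coefficient across m imputed datasets with Rubin's rules.
# `estimates` and `std_errors` hold one value per imputed dataset.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  pooled_estimate  <- mean(estimates)
  within_variance  <- mean(std_errors^2)  # average sampling variance
  between_variance <- var(estimates)      # variance across imputations
  total_variance   <- within_variance + (1 + 1 / m) * between_variance
  c(estimate = pooled_estimate, std_error = sqrt(total_variance))
}

# Made-up values for illustration (not results from this analysis)
estimates  <- c(0.50, 0.55, 0.45, 0.52, 0.48)
std_errors <- c(0.10, 0.11, 0.10, 0.09, 0.10)

pooled <- pool_rubin(estimates, std_errors)
print(round(pooled, 3))
```

The pooled standard error is never smaller than the average within-imputation standard error, which is how the extra uncertainty from imputation enters the final result.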

# Recode the numeric survey codes into labelled categories
data16_imp_imputed <- data16_imp_imputed %>%
  mutate(
    education_cat = case_when(
      education == 0 ~ "Less than High School",
      education == 1 ~ "High School",
      education == 2 ~ "College Degree",
      education == 3 ~ "Graduate School",
      TRUE ~ NA_character_
    ),
    vote16 = case_when(
      vote16 == 1 ~ "Clinton",
      vote16 == 2 ~ "Trump",
      vote16 == 3 ~ "Other",
      TRUE ~ NA_character_
    )
  )

# Order the education levels from lowest to highest for modeling and plotting
data16_imp_imputed$education_cat <- factor(
  data16_imp_imputed$education_cat,
  levels = c("Less than High School", "High School", "College Degree", "Graduate School")
)

model <- multinom(vote16 ~ education_cat, data = data16_imp_imputed)
## # weights:  15 (8 variable)
## initial  value 4643.834144 
## iter  10 value 3996.212155
## final  value 3962.424340 
## converged
simulated_model <- clarify::sim(model, n = 1000)

sim_data <- expand.grid(
  education_cat = levels(data16_imp_imputed$education_cat)
)

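# predict() on a multinom fit reads its internal weights, so overwriting
# `$coefficients` has no effect. clarify::sim_apply() instead passes FUN a
# copy of the fit that carries each simulated coefficient vector.
predicted_sims <- clarify::sim_apply(
  simulated_model,
  FUN = function(fit) {
    probs <- predict(fit, newdata = sim_data, type = "probs")
    # Flatten column by column: education varies fastest within candidate
    setNames(
      as.vector(probs),
      paste(rep(colnames(probs), each = nrow(probs)),
            rep(as.character(sim_data$education_cat), times = ncol(probs)),
            sep = "|")
    )
  },
  verbose = FALSE
)

# Average each predicted probability over the 1,000 simulations
predicted_means <- colMeans(unclass(predicted_sims))

# Rebuild the education-by-candidate grid in the same order as the flattening
candidates <- colnames(predict(model, newdata = sim_data, type = "probs"))
predicted_means_long <- expand.grid(
  education_cat = sim_data$education_cat,
  candidate = candidates
)
predicted_means_long$predicted_probability <- as.vector(predicted_means)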

plot2 <- ggplot(predicted_means_long, aes(x = education_cat, y = predicted_probability, color = candidate)) +
  geom_point(size = 2) +
  geom_line(aes(group = candidate)) +
  scale_color_manual(
    # Outcome levels sort as Clinton, Other, Trump, so labels follow that order
    labels = c("Clinton", "Other", "Trump"),
    values = c("red", "blue", "green")
  ) +
  theme_minimal() +
  labs(
    title = "",
    y = "Predicted Probability",
    x = "After Imputation"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## # weights:  20 (12 variable)
## initial  value 5859.866264 
## iter  10 value 5215.852107
## iter  20 value 5035.968133
## iter  20 value 5035.968123
## iter  20 value 5035.968123
## final  value 5035.968123 
## converged

Modeling Approach:

The primary model is a multinomial logistic regression, where the outcome variable is presidential vote choice (Clinton, Trump, or Other) and the predictor is education, categorized as “Less than High School,” “High School,” “College Degree,” and “Graduate School.”
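
In a multinomial logit, one outcome serves as the baseline with its linear predictor fixed at zero, and each predicted probability is the exponentiated linear predictor divided by the sum across outcomes. The sketch below shows that calculation in base R. The coefficient values in `beta` are made up for illustration and are not estimates from this model; the baseline outcome (Clinton) and reference education level are likewise assumptions.

```r
# Hypothetical coefficients: rows are the non-baseline outcomes, columns are
# the intercept and three education dummies (reference: lowest education)
beta <- rbind(
  Other = c(-1.8, 0.1,  0.0, -0.2),
  Trump = c(-0.1, 0.2, -0.3, -0.9)
)

# Design matrix: one row per education level, dummy-coded with an intercept
X <- cbind(1, rbind(c(0, 0, 0), c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)))

# Linear predictors; the baseline outcome is fixed at 0
eta <- cbind(Clinton = 0, X %*% t(beta))

# Softmax: exponentiate, then normalise each row to sum to 1
probs <- exp(eta) / rowSums(exp(eta))
print(round(probs, 3))
```

Each row of `probs` corresponds to one education level and sums to 1 across the three outcomes.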

The model is fit to the first of the 20 imputed datasets, so cases with previously missing vote data are included and more of the available information is used.

Simulation and Visualization:

The clarify package is used to simulate 1,000 sets of model coefficients, allowing for the estimation of predicted probabilities and their uncertainty.
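
The idea behind this kind of simulation can be sketched without any package: draw coefficient vectors from a multivariate normal centred on the estimates, compute the quantity of interest for each draw, and summarise the draws. The example below uses a binary logit with made-up estimates (`beta_hat`, `vcov_hat`) to keep it short. It is a simplification of the multinomial case and is not the clarify implementation.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical binary-logit estimates: intercept and one slope
beta_hat <- c(-0.2, 0.6)
vcov_hat <- matrix(c( 0.010, -0.004,
                     -0.004,  0.020), nrow = 2)

# Draw 1,000 coefficient vectors from N(beta_hat, vcov_hat) via Cholesky
n_sims <- 1000
z <- matrix(rnorm(n_sims * 2), nrow = n_sims)
draws <- sweep(z %*% chol(vcov_hat), 2, beta_hat, "+")

# Predicted probability at x = 1 for every draw
p_sims <- plogis(draws[, 1] + draws[, 2] * 1)

# Summarise: point estimate plus a 95% percentile interval
p_hat <- plogis(sum(beta_hat))
interval <- quantile(p_sims, c(0.025, 0.975))
print(round(c(estimate = p_hat, interval), 3))
```

Reporting the percentile interval alongside the mean is what conveys the uncertainty; averaging the draws alone recovers little more than the point estimate.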

Predicted probabilities for each candidate are calculated across the four education categories and visualized in a line and point plot, with color coding for each candidate.

Graph Interpretation:

The graph displays the average predicted probability of voting for each candidate as education level increases. Clinton’s predicted probability tends to rise with higher education, peaking among those with graduate degrees. Trump’s predicted probability is highest among those with a high school education and declines with higher educational attainment. Third-party candidate support remains relatively low and stable across education levels.

Conclusion:

This analysis demonstrates a clear association between educational attainment and presidential vote choice in the 2016 U.S. election. After addressing missing data through multiple imputation, the findings indicate that higher educational attainment is associated with an increased likelihood of voting for Hillary Clinton and a decreased likelihood of voting for Donald Trump. The use of simulation-based inference provides robust estimates of uncertainty. The results underscore the importance of rigorous missing data handling in survey-based research.

Acock, Alan C. 2005. “Working with Missing Values.” Journal of Marriage and Family 67 (4): 1012–28.
Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47.
King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation.” American Political Science Review 95 (1): 49–69.