1 Introduction

Every four years, American citizens express their political beliefs by electing their leaders in the presidential election. Before the votes are cast, candidates and their campaign managers try to reach as many eligible voters as possible to present their plans for office and their opinions on important national issues. To build a successful and effective campaign, campaign staff collect and analyze various types of data about eligible voters through surveys (online or in person) and demographic studies. The information they collect includes voters’ race, gender, political ideology and so on. Using mathematical models, data analysts on the campaign team can determine the relationship between these voter characteristics and how voters choose among the candidates. By understanding this relationship thoroughly, campaign managers can allocate resources effectively to influence voters and increase their candidate’s likelihood of winning the election.

The 2020 United States presidential election took place during a global pandemic that affected people around the world in numerous ways. Despite the difficulties caused by the pandemic, including health protocols to prevent the spread of the coronavirus, voter turnout increased by seven percentage points from the previous presidential election in 2016, according to Igielnik, Keeter, and Hartig (2021). In total, 66% of adult citizens voted in this presidential election.

2 Literature Review

Logistic regression models have been used in past research to analyze voter behavior and related topics. Rusch, Lee, Hornik, Jank and Zeileis (2013) used logistic regression trees with different regressors and partitioning variables to predict the likelihood that an individual turns out to vote or supports a candidate, using data on eligible voters in Ohio. They compared these logistic regression tree models on prediction accuracy, using classification accuracy and ROC curves. Hellberg and Syrén (2019) used logistic regression to determine which variables affect voter turnout in Germany, using data from the World Values Survey from 2013. They compared the models using ROC curves, confusion matrices and goodness-of-fit statistics such as R-squared, the Akaike information criterion (AIC) and p-values from the Hosmer and Lemeshow test, and concluded that income and the importance a person places on living in a democratic country have a positive relationship with voter turnout. As this past research on voter behavior suggests, logistic regression models are useful and efficient tools for analyzing and predicting voter behavior because they help determine how different variables relate to how voters vote in an election.

3 Data Description and Preliminary Data Analysis

The data analyzed in this paper are survey data from the Democracy Fund Voter Study Group. The survey was conducted twice in 2020, once before and once after the 2020 presidential election. The respondents were selected by stratified sampling. The original data contain 5900 observations. After data cleaning and filtering, 4062 observations remain for the analysis and model fitting in this paper. The independent variables included in the logistic regression model are race (four categories: Asian, Black, Hispanic and white), gender (two categories: female and male) and political ideology (three categories: conservative, moderate and liberal). The dependent variable is the choice of candidate in the 2020 United States presidential election (two categories: Donald Trump and Joe Biden). The bar charts and tables below are the visualization tools I used to understand the statistical relationship between the independent variables and the dependent variable before model fitting.

# Loading necessary R packages
library(ggplot2)
library(dplyr)
library(ggpubr)
library(kableExtra)
library(factoextra)
# Importing the data and extracting the necessary variables for the project
voterdata_original <- read.csv("voter_2020.csv")
voterdata <- voterdata_original %>% select(ideo5_2020Nov, presvote_2020Nov, race_2020Sep, gender_2020Sep)
# renaming variables
voterdata <- voterdata %>%  mutate(voted_for_president = ifelse(presvote_2020Nov==1,"Donald Trump",
                            ifelse(presvote_2020Nov==2,"Joe Biden",NA)),
                            
                            race = ifelse(race_2020Sep==1,"White",
                                   ifelse(race_2020Sep==2,"Black",
                                   ifelse(race_2020Sep==3, "Hispanic",
                                   ifelse(race_2020Sep==4, "Asian",NA)))),
                            
                            gender = ifelse(gender_2020Sep == 1, "Male",
                                     ifelse(gender_2020Sep == 2, "Female", NA)),
                            
                            political_ideology = ifelse(ideo5_2020Nov == 1, "Liberal", #very liberal
                                                 ifelse(ideo5_2020Nov == 2, "Liberal",
                                                 ifelse(ideo5_2020Nov == 3, "Moderate",        
                                                 ifelse(ideo5_2020Nov == 4, "Conservative",
                                                 ifelse(ideo5_2020Nov == 5, "Conservative", #very conservative
                                                 NA)))))
                            )
# filtering out the data
voterdata <- voterdata %>%
  filter(voted_for_president %in% c("Donald Trump", "Joe Biden"), race %in% c("White","Black", "Hispanic", "Asian")) %>%
  select(voted_for_president, race, gender, political_ideology)
# removing NA's
voterdata_adjusted <- na.omit(voterdata)

3.1 Choice of Candidates vs. Race of Voters

First, I look at the relationship between a voter’s race and their choice of candidate. In the data set, there are 112 voters who are Asian, 422 voters who are Black, 409 voters who are Hispanic and 3119 voters who are white. Table 1 and Figure 1 below summarize how voters in the data set voted for the presidential candidates by race.

# Table 1  
race <- c("Asian", "Black", "Hispanic", "White")
percentage_by_race_trump <- c(2, 2, 9, 87)
percentage_by_race_biden <- c(3, 16, 11, 70)
table_race<- rbind(race, percentage_by_race_trump, percentage_by_race_biden)
rownames(table_race) <- c("Race", "% of Voters who Voted for Trump", "% of Voters who Voted for Biden")
kbl(table_race, booktabs = TRUE) %>% kable_styling(latex_options = c("HOLD_position", "striped")) %>% footnote(general = "Table 1: Choice of Candidates vs. Race", general_title = "")
Race                             Asian  Black  Hispanic  White
% of Voters who Voted for Trump      2      2         9     87
% of Voters who Voted for Biden      3     16        11     70
Table 1: Choice of Candidates vs. Race
# Figure 1 (Stacked bar chart)
ggplot(voterdata_adjusted) + geom_bar(aes(x = voted_for_president, fill = race)) + ggtitle("Choice of Candidates by Race") + xlab("Presidential Candidate Choice") + ylab("Number of Voters") + theme(plot.title = element_text(hjust = 0.5))
Figure 1: Stacked bar chart of the voter’s choice of candidates by race


Based on Table 1 and Figure 1 above, 2% of the people who voted for Donald Trump and 3% of the people who voted for Joe Biden are Asian; 2% and 16%, respectively, are Black; 9% and 11% are Hispanic; and 87% and 70% are white. There is a notable difference in the share of minority voters among each candidate’s supporters: minority voters make up 13% of the people who voted for Donald Trump and 30% of the people who voted for Joe Biden.
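The percentages in Table 1 are typed in by hand. As a minimal sketch of how such row percentages could instead be computed from the data, the block below uses base R on a small made-up data frame. The data frame `toy` and its nine rows are assumptions for illustration only, not the survey data, and the same call could in principle be applied to voterdata_adjusted.

```r
# Hypothetical toy data standing in for voterdata_adjusted (not the survey data)
toy <- data.frame(
  voted_for_president = c("Donald Trump", "Donald Trump", "Donald Trump", "Donald Trump",
                          "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"),
  race = c("White", "White", "White", "Hispanic",
           "White", "White", "Black", "Black", "Asian")
)

# Cross-tabulate candidate choice (rows) by race (columns)
counts <- table(toy$voted_for_president, toy$race)

# margin = 1 divides each row by its row total, so every row sums to 100%:
# the racial composition of each candidate's voters, as in Table 1
pct_by_candidate <- round(100 * prop.table(counts, margin = 1))
pct_by_candidate
```

Using margin = 2 instead would give the share of each racial group that voted for each candidate, which answers a different question from the one in Table 1.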

3.2 Choice of Candidate vs. Gender of Voters

Now I look at the relationship between the voter’s choice of candidates and the gender of the voter. In the data set, there are 2064 female voters and 1998 male voters. Table 2 and Figure 2 below summarize how voters in the data set voted for the presidential candidates by gender.

# Table 2
gender <- c("Female", "Male")
percentage_by_gender_trump <- c(44, 56)
percentage_by_gender_biden <- c(56, 44)
table_gender<- rbind(gender, percentage_by_gender_trump, percentage_by_gender_biden)
rownames(table_gender) <- c("Gender", "% of Voters who Voted for Trump", "% of Voters who Voted for Biden")
kbl(table_gender, booktabs = TRUE) %>% kable_styling(latex_options = c("HOLD_position", "striped")) %>% footnote(general = "Table 2: Choice of Candidates vs. Gender", general_title = "")
Gender                           Female  Male
% of Voters who Voted for Trump      44    56
% of Voters who Voted for Biden      56    44
Table 2: Choice of Candidates vs. Gender
# Figure 2
ggplot(voterdata_adjusted) + geom_bar(aes(x = voted_for_president, fill = gender)) + ggtitle("Choice of Candidates by Gender") + xlab("Presidential Candidate Choice") + ylab("Number of Voters") + theme(plot.title = element_text(hjust = 0.5))
Figure 2: Stacked bar chart of the voter’s choice of candidates by gender


Based on Table 2 and Figure 2 above, the majority of the people who voted for Donald Trump were male (56%), whereas the majority of the people who voted for Joe Biden were female (56%).

3.3 Choice of Candidates vs Political Ideologies

Finally, I look at the relationship between the voter’s choice of candidates and the political ideology of the voter. In the data set, 1269 voters are conservative, 1354 voters are liberal and 1439 voters are moderate. Table 3 and Figure 3 below summarize how voters in the data set voted for the presidential candidates by political ideology.

# Table 3
PI <- c("Conservative", "Liberal", "Moderate")
percentage_by_PI_trump <- c(72, 2, 26)
percentage_by_PI_biden <- c(4, 54, 42)
table_PI<- rbind(PI, percentage_by_PI_trump, percentage_by_PI_biden)
rownames(table_PI) <- c("Political Ideologies", "% of Voters who Voted for Trump", "% of Voters who Voted for Biden")
kbl(table_PI, booktabs = TRUE) %>% kable_styling(latex_options = c("HOLD_position", "striped")) %>% footnote(general = "Table 3: Choice of Candidates vs. Political Ideologies", general_title = "")
Political Ideologies             Conservative  Liberal  Moderate
% of Voters who Voted for Trump            72        2        26
% of Voters who Voted for Biden             4       54        42
Table 3: Choice of Candidates vs. Political Ideologies
# Figure 3
ggplot(voterdata_adjusted) + geom_bar(aes(x = voted_for_president, fill = political_ideology)) + ggtitle("Choice of Candidates by Political Ideologies") + xlab("Presidential Candidate Choice") + ylab("Number of Voters") + theme(plot.title = element_text(hjust = 0.5))
Figure 3: Stacked bar chart of the voter’s choice of candidates by political ideologies


Based on Table 3 and Figure 3 above, political ideology seems to play a role in the choice of candidate: the majority of the people who voted for Donald Trump are conservative (72%) and the majority of the people who voted for Joe Biden are liberal (54%).

4 Method

A logistic regression model is a binary outcome model in which the dependent variable, which takes one of two values (0 and 1), is explained by a set of independent variables. The typical setup of a logistic regression is the following equation.

\(F(x^{\prime}\beta) = \frac{e^{x^{\prime}\beta}}{1+e^{x^{\prime}\beta}} = \frac{\exp(x^{\prime}\beta)}{1+\exp(x^{\prime}\beta)}, \tag{1}\)

where \(F(x^{\prime}\beta)\) is the cumulative distribution function of the logistic distribution, \(x'\) is the vector of independent variables and \(\beta\) is the vector of coefficients for the independent variables.

The coefficients in a logistic regression model are estimated by the method of maximum likelihood. Each coefficient represents the change in the log-odds of the dependent variable being equal to 1 that is associated with a one-unit change in the corresponding independent variable. The log-odds can be obtained from the equation for the logistic regression discussed previously.

\(\text{log-odds} = \ln\frac{p}{1-p} = x'\beta, \tag{2}\) where p = the probability the dependent variable equals 1.
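The step from the logistic function in (1) to the log-odds can be written out explicitly. The short derivation below uses only the symbols already defined, with \(p = F(x'\beta)\).

```latex
\begin{align*}
p &= \frac{e^{x'\beta}}{1+e^{x'\beta}}, &
1-p &= \frac{1}{1+e^{x'\beta}}, \\
\frac{p}{1-p} &= e^{x'\beta}, &
\ln\frac{p}{1-p} &= x'\beta .
\end{align*}
```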

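The coefficients are on the log-odds scale, so to interpret the results from logistic regression, we exponentiate them. Exponentiating the log-odds gives the odds, and exponentiating a single coefficient gives the odds ratio for the corresponding independent variable.

\(\text{odds} = \frac{p}{1-p} = \exp(x'\beta), \qquad \text{odds ratio for } x_{j} = \exp(\beta_{j}), \tag{3}\)

The odds are the probability that the dependent variable is equal to 1 relative to the probability that it is equal to 0. In this paper, the odds are the probability that a person voted for Joe Biden relative to the probability that the person voted for Donald Trump. The odds ratio \(\exp(\beta_{j})\) is the factor by which these odds are multiplied when the dummy variable \(x_{j}\) changes from 0 (the base category) to 1, holding the other independent variables fixed.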

The average marginal effect is another way of interpreting the results from logistic regression. It is computed by taking the partial derivative of the cumulative distribution function with respect to an independent variable, evaluating the derivative at the coefficient estimates for each observation and taking the mean of the individual marginal effects over all \(n\) observations.

\(\text{Average marginal effect of } x_{j} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial p_{i}}{\partial x_{ij}} = \frac{1}{n}\sum_{i=1}^{n} f(x_{i}^{\prime}\beta)\,\beta_{j}, \tag{4}\)

where \(f(x^{\prime}\beta) = F(x^{\prime}\beta)\,(1 - F(x^{\prime}\beta))\) is the density of the logistic distribution, that is, the derivative of \(F\). The average marginal effect is the average of the individual marginal effects and is expressed relative to the base category.
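The derivative of the logistic cumulative distribution function that enters each individual marginal effect can be worked out in a few steps. Writing \(z = x'\beta\) for the linear index (a shorthand introduced here only for this derivation):

```latex
\begin{align*}
F(z) &= \frac{e^{z}}{1+e^{z}}, \\
f(z) = \frac{dF(z)}{dz} &= \frac{e^{z}(1+e^{z}) - e^{z}e^{z}}{(1+e^{z})^{2}}
      = \frac{e^{z}}{(1+e^{z})^{2}} = F(z)\,\bigl(1-F(z)\bigr), \\
\frac{\partial F(x'\beta)}{\partial x_{j}} &= f(x'\beta)\,\beta_{j}.
\end{align*}
```

Since \(F(z)\,(1-F(z))\) is at most 1/4, an individual marginal effect can never exceed a quarter of the coefficient in absolute value. In R, this density is evaluated by dlogis().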

As discussed in the previous section, the independent variables are race, gender and political ideologies of the voters and the dependent variable is their choice of candidates in the 2020 United States presidential election.

5 Results

Before fitting the logistic regression model, I converted the dependent and independent variables into factor variables. Using the function levels(), I confirmed that the reference category for the independent variables is a white, conservative, female voter and that the reference level of the dependent variable is a vote for Donald Trump. (For a voter in the reference category, all of the dummy variables created from the independent variables are equal to 0.)

# converting the variables into factor variables with an explicit level order
# (the first level of each factor is its reference category)
voterdata_adjusted <- voterdata_adjusted %>%
  mutate(
    voted_for_president_fct = factor(voted_for_president, levels = c("Donald Trump", "Joe Biden")),
    race_fct = factor(race, levels = c("White", "Black", "Hispanic", "Asian")),
    gender_fct = factor(gender, levels = c("Female", "Male")),
    PI_fct = factor(political_ideology, levels = c("Conservative", "Liberal", "Moderate"))
  )
# checking the reference group for these new variables
levels(voterdata_adjusted$voted_for_president_fct)
## [1] "Donald Trump" "Joe Biden"
levels(voterdata_adjusted$race_fct)
## [1] "White"    "Black"    "Hispanic" "Asian"
levels(voterdata_adjusted$gender_fct)
## [1] "Female" "Male"
levels(voterdata_adjusted$PI_fct)
## [1] "Conservative" "Liberal"      "Moderate"

The reference group for each variable is its first level: people who voted for Donald Trump for voted_for_president_fct, white voters for race_fct, female voters for gender_fct and conservative voters for PI_fct.
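The reference category is set by the order of the factor levels, which is why the levels are listed explicitly. The sketch below illustrates this with a small made-up vector. The vector `race` and the choice of "Black" as an alternative reference are assumptions for illustration, while factor(), levels() and relevel() are base R functions.

```r
# Hypothetical toy vector standing in for voterdata_adjusted$race
race <- c("White", "Black", "Hispanic", "Asian", "White", "Black")

# Without explicit levels, factor() sorts the levels alphabetically,
# so "Asian" would become the reference category
race_default <- factor(race)
levels(race_default)

# Listing the levels explicitly puts "White" first, as in the paper
race_fct <- factor(race, levels = c("White", "Black", "Hispanic", "Asian"))
levels(race_fct)

# relevel() moves one level to the front of an existing factor,
# changing the reference category without retyping all levels
race_black_ref <- relevel(race_fct, ref = "Black")
levels(race_black_ref)
```

Changing the reference category changes the coefficient estimates and their interpretation, but not the fitted probabilities of the model.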

Now I fit the logistic regression model with the function glm().

# fitting the logistic regression model
Logit <- glm(voted_for_president_fct ~ race_fct + gender_fct + PI_fct, data = voterdata_adjusted, family = "binomial")
summary(Logit)
## 
## Call:
## glm(formula = voted_for_president_fct ~ race_fct + gender_fct + 
##     PI_fct, family = "binomial", data = voterdata_adjusted)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4629  -0.3484   0.2303   0.2959   2.3799  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.54040    0.12328 -20.607  < 2e-16 ***
## race_fctBlack     2.37717    0.23392  10.162  < 2e-16 ***
## race_fctHispanic  0.53973    0.16400   3.291 0.000998 ***
## race_fctAsian     0.76995    0.31149   2.472 0.013444 *  
## gender_fctMale   -0.23087    0.09969  -2.316 0.020560 *  
## PI_fctLiberal     6.15668    0.20964  29.369  < 2e-16 ***
## PI_fctModerate    3.26963    0.12291  26.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5462.8  on 4061  degrees of freedom
## Residual deviance: 2615.6  on 4055  degrees of freedom
## AIC: 2629.6
## 
## Number of Fisher Scoring iterations: 6

Based on the signs of the coefficients, Black, Hispanic and Asian voters are more likely to vote for Joe Biden than white voters, male voters are less likely to vote for Joe Biden than female voters, and liberal and moderate voters are more likely to vote for Joe Biden than conservative voters. The coefficient estimates are on the log-odds scale, so their signs give the direction of each relationship but their magnitudes are hard to interpret directly. Therefore, odds ratios are calculated by using (3) for each coefficient.

# Exponentiate each coefficient to obtain the odds ratios
round(exp(Logit$coefficients),2)
##      (Intercept)    race_fctBlack race_fctHispanic    race_fctAsian 
##             0.08            10.77             1.72             2.16 
##   gender_fctMale    PI_fctLiberal   PI_fctModerate 
##             0.79           471.86            26.30

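Based on the odds ratios from the output above, and holding the other independent variables fixed, the odds of voting for Joe Biden rather than Donald Trump are 10.77 times as high for Black voters, 1.72 times as high for Hispanic voters and 2.16 times as high for Asian voters as for white voters. The odds for male voters are 0.79 times those for female voters, that is, about 21% lower. The odds for liberal voters are 471.86 times, and for moderate voters 26.30 times, those for conservative voters. The exponentiated intercept, 0.08, is the odds of voting for Joe Biden for a voter in the reference category (a white, conservative, female voter).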
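To connect the coefficients to probabilities, the sketch below plugs the estimates reported in the summary(Logit) output into the logistic function for two voter profiles. The coefficient values are copied from that output; the vector `b`, its element names and the second profile (a Black, moderate, male voter) are assumptions chosen for illustration.

```r
# Coefficient estimates copied from the summary(Logit) output above
b <- c(intercept = -2.54040, black = 2.37717, hispanic = 0.53973,
       asian = 0.76995, male = -0.23087, liberal = 6.15668, moderate = 3.26963)

# Reference profile (white, conservative, female): only the intercept applies.
# plogis() is the logistic CDF, exp(z) / (1 + exp(z))
p_reference <- plogis(b[["intercept"]])

# Hypothetical profile: Black, moderate, male voter
eta_profile <- b[["intercept"]] + b[["black"]] + b[["male"]] + b[["moderate"]]
p_profile <- plogis(eta_profile)

# Odds for the reference profile and for a Black voter who is otherwise identical
odds_reference <- p_reference / (1 - p_reference)
odds_black <- exp(b[["intercept"]] + b[["black"]])

# The ratio of the two odds equals exp() of the coefficient for Black voters
odds_ratio_black <- odds_black / odds_reference

round(c(p_reference = p_reference, p_profile = p_profile,
        odds_ratio_black = odds_ratio_black), 3)
```

The ratio of the two odds does not depend on which gender or ideology is held fixed, which is why a single odds ratio per coefficient is enough to summarize the model.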

Using the coefficient estimates, the average marginal effects of the independent variables are computed using (4). The output below shows the average marginal effect for each independent variable.

# average marginal effect
LogitScalar <- mean(dlogis(predict(Logit, type = "link")))
round(LogitScalar * coef(Logit),3)
##      (Intercept)    race_fctBlack race_fctHispanic    race_fctAsian 
##           -0.252            0.236            0.054            0.076 
##   gender_fctMale    PI_fctLiberal   PI_fctModerate 
##           -0.023            0.611            0.325

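Based on the average marginal effects from the output above, the probability of voting for Joe Biden is on average about 23.6 percentage points higher for Black voters, 5.4 percentage points higher for Hispanic voters and 7.6 percentage points higher for Asian voters than for white voters. It is about 2.3 percentage points lower for male voters than for female voters, and about 61.1 and 32.5 percentage points higher for liberal and moderate voters, respectively, than for conservative voters. The value reported for the intercept is only the scaled intercept and has no marginal-effect interpretation.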
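Because every independent variable in this model is a dummy variable, a common alternative to the derivative-based calculation is the average discrete change: the mean difference in predicted probability when a dummy is switched from 0 to 1 for every observation. The sketch below compares the two on simulated data. The variables `male` and `liberal`, the coefficients used to simulate the outcome and the sample size are all assumptions for illustration, not the survey data or the paper's results.

```r
# Simulated toy data (hypothetical, for illustration only)
set.seed(1)
n <- 500
male <- rbinom(n, 1, 0.5)
liberal <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, plogis(-1 - 0.3 * male + 2 * liberal))
toy <- data.frame(y, male, liberal)

fit <- glm(y ~ male + liberal, data = toy, family = binomial)

# Derivative-based average marginal effect, as in equation (4):
# mean logistic density at the linear predictor times the coefficient
ame_derivative <- unname(mean(dlogis(predict(fit, type = "link"))) * coef(fit)["male"])

# Average discrete change: predict with male set to 1 and to 0
# for every observation, then average the difference
p_male1 <- predict(fit, newdata = transform(toy, male = 1), type = "response")
p_male0 <- predict(fit, newdata = transform(toy, male = 0), type = "response")
ame_discrete <- mean(p_male1 - p_male0)

c(derivative = ame_derivative, discrete = ame_discrete)
```

The two quantities always share the sign of the coefficient. They tend to be close when the coefficient is small and can differ more for large coefficients, such as those on political ideology in this paper.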

To evaluate the performance of the logistic regression model, its prediction accuracy was calculated using the confusion matrix. The output below is the confusion matrix, which shows the number of correctly predicted cases as well as the number of cases that were not predicted correctly. A case is predicted as 1 (Joe Biden) when its fitted probability is at least 0.5 and as 0 (Donald Trump) otherwise.

# confusion matrix from the logistic regression model
table(true = voterdata_adjusted$voted_for_president, pred = round(fitted(Logit)))
##               pred
## true              0    1
##   Donald Trump 1164  455
##   Joe Biden     105 2338

The accuracy is calculated by dividing the number of correctly predicted cases (the true positives and true negatives on the diagonal of the confusion matrix) by the total number of cases:

\(\text{Accuracy} = \frac{2338 + 1164}{4062} = \frac{3502}{4062} = 0.862\)

An accuracy of 0.862 means that the logistic regression model correctly classified about 86% of the cases in the data used to fit it.
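The accuracy calculation can be reproduced directly from the counts in the confusion matrix. The sketch below re-enters those four counts by hand; the object names `conf`, `accuracy` and `baseline` are assumptions, and the baseline (always predicting the more common outcome) is added here only as a point of comparison.

```r
# Counts copied from the confusion matrix above (rows: true vote, columns: prediction)
conf <- matrix(c(1164, 455,
                 105, 2338),
               nrow = 2, byrow = TRUE,
               dimnames = list(true = c("Donald Trump", "Joe Biden"),
                               pred = c("0", "1")))

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(conf)) / sum(conf)

# Baseline: share of the more common outcome, i.e. the accuracy of
# always predicting "Joe Biden" without using any regressors
baseline <- max(rowSums(conf)) / sum(conf)

round(c(accuracy = accuracy, baseline = baseline), 3)
```

Comparing the two numbers shows how much the regressors add beyond simply guessing the more common outcome for every voter.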

6 Conclusion

The logistic regression model helped us investigate the relationship between voters’ race, gender and political ideology and their choice of candidate in the 2020 presidential election. Based on their p-values, the coefficient estimates for all of the independent variables are statistically significant at the 5% level. This suggests that voters’ race, gender and political ideology are associated with how they voted in the 2020 United States presidential election. Although the coefficient estimates themselves do not give clear information about how the independent variables affect the dependent variable, the odds ratios and average marginal effects allowed us to see how the variables in the model are related in terms of odds and probabilities. In future research on voter behavior, more variables, such as voters’ views on particular political issues, could be added to the logistic regression model alongside the variables discussed in this paper. Moreover, adding more categories to the dependent variable would allow a richer analysis of voter behavior, although this would require a multinomial logistic regression model and appropriate data.

7 Appendix

7.1 Robustness Check

One may suggest the use of a probit model over a logistic regression model because a probit model is also a binary outcome model that could have been used to analyze the effects of the voters’ race, gender and political ideologies on their choice of candidate in the presidential election. The output below summarizes the results from the probit regression model.

# Robustness check
# the probit model
Probit <- glm(voted_for_president_fct ~ race_fct + gender_fct + PI_fct, data = voterdata_adjusted, family = binomial(link = "probit"))
summary(Probit)
## 
## Call:
## glm(formula = voted_for_president_fct ~ race_fct + gender_fct + 
##     PI_fct, family = binomial(link = "probit"), data = voterdata_adjusted)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7463  -0.3400   0.2334   0.3170   2.3999  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.44312    0.06206 -23.254  < 2e-16 ***
## race_fctBlack     1.19388    0.11960   9.982  < 2e-16 ***
## race_fctHispanic  0.29627    0.09016   3.286  0.00102 ** 
## race_fctAsian     0.38077    0.16913   2.251  0.02436 *  
## gender_fctMale   -0.14480    0.05505  -2.630  0.00853 ** 
## PI_fctLiberal     3.37187    0.09165  36.790  < 2e-16 ***
## PI_fctModerate    1.90374    0.06371  29.881  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5462.8  on 4061  degrees of freedom
## Residual deviance: 2622.4  on 4055  degrees of freedom
## AIC: 2636.4
## 
## Number of Fisher Scoring iterations: 6
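# average marginal effect
# Scale factor: mean density evaluated at the linear predictor.
# dlogis() reproduces the output shown below; the density that corresponds to
# the probit link is dnorm(), which changes the magnitudes but not the signs.
ProbitScalar <- mean(dlogis(predict(Probit, type = "link")))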
ProbitScalar * coef(Probit)
##      (Intercept)    race_fctBlack race_fctHispanic    race_fctAsian 
##      -0.23318475       0.19291133       0.04787294       0.06152684 
##   gender_fctMale    PI_fctLiberal   PI_fctModerate 
##      -0.02339810       0.54483787       0.30761226

As seen from the results above, the signs of the coefficient estimates and average marginal effects are the same as those of the logistic regression model discussed in this paper. Because all of the p-values of the coefficient estimates are less than 0.05, the coefficient estimates are statistically significant at the 5% level. The probit model therefore supports the same conclusion: voters’ race, gender and political ideology are associated with how they voted in the 2020 United States presidential election. I have also computed the prediction accuracy of this probit model using the confusion matrix. The output below is the confusion matrix for the probit model.

# confusion matrix for the probit model
table(true = voterdata_adjusted$voted_for_president, pred = round(fitted(Probit)))
##               pred
## true              0    1
##   Donald Trump 1164  455
##   Joe Biden     105 2338

Based on the output above and the confusion matrix for the logistic regression model, the logistic and probit models have the same prediction accuracy: both correctly predicted 86.2% of the cases in the data. To compare the models on another goodness-of-fit statistic, I extracted the Akaike information criterion (AIC) for both models from the regression results. The AIC is an estimator of prediction error, and the model with the lower AIC value is preferred. The AIC of the logistic regression model is 2629.6 and the AIC of the probit model is 2636.4. The logistic regression model therefore fits the data slightly better by this criterion, and it was used for the analysis of the survey data in this paper.
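The two AIC values can be checked against the residual deviances in the regression output. For a 0/1 dependent variable the residual deviance equals \(-2\ln\hat{L}\), because the log-likelihood of the saturated model is zero, and both models estimate \(k = 7\) coefficients. The symbols \(k\) and \(\hat{L}\) are introduced here only for this check.

```latex
\begin{align*}
\text{AIC} &= 2k - 2\ln\hat{L} = \text{residual deviance} + 2k, \\
\text{AIC}_{\text{logit}} &= 2615.6 + 2(7) = 2629.6, \\
\text{AIC}_{\text{probit}} &= 2622.4 + 2(7) = 2636.4, \\
\text{AIC}_{\text{probit}} - \text{AIC}_{\text{logit}} &= 6.8 .
\end{align*}
```

Because both models have the same number of coefficients, the difference in AIC is simply the difference in residual deviance.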

8 References

Democracy Fund Voter Study Group. Views of the Electorate Research Survey, December 2021. Washington, D.C. https://www.voterstudygroup.org/

Democracy Fund Voter Study Group. (2021). Views of the Electorate Research Survey [Data set]. https://www.voterstudygroup.org/downloads/voter-survey?key=493046

Hellberg, F., & Syrén, E. (2019). To vote, or not to vote? Understanding voter turnout patterns: Constructing, interpreting and comparing logistic regression models measuring voter turnout in German federal elections. Retrieved May 25, 2022, from http://www.diva-portal.se/smash/get/diva2:1389411/FULLTEXT01.pdf

Igielnik, R., Keeter, S., & Hartig, H. (2021). Behind Biden’s 2020 victory. Retrieved May 25, 2022, from https://www.pewresearch.org/politics/2021/06/30/behind-bidens-2020-victory/

Rusch, T., Lee, I., Hornik, K., Jank, W., & Zeileis, A. (2013). Influencing Elections with Statistics: Targeting Voters with Logistic Regression Trees. The Annals of Applied Statistics, 7(3), 1612–1639.