DATA 606 Final Project

Josh Iden

Abstract

This project looks at responses from a FiveThirtyEight.com sponsored survey conducted between Sept. 15-25 2020 amongst non-voters to questions regarding their attitudes about elections and whether there is a relationship between education level and voter attitudes based on those responses. Of the 33 questions posed in the survey, we have selected the 3 questions that focus specifically on attitudes towards voting. We then determine if a logistic regression model is appropriate to estimate the relationship between education level and the survey response attitudes towards voting and if a response score is able to predict a college degree and posit a null hypothesis that there is no relationship between the responses and having a college degree.

Part 1 - Introduction

This project focuses on the question “Is education level related to voter attitudes amongst non-voters?” using data collected for the FiveThirtyEight.com report “Why Many Americans Don’t Vote”

Part 2 - Data

Let’s take a look at the data.

# loading dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
df = read.csv(url)
dim(df)
## [1] 5836  119

The dataset contains 5,836 observations across 119 columns (questions). For the purpose of this analysis, we will subset the data down to the desired data: Education, and Questions 5, 16, and 20.

These questions are as follows:

Q5: Does it matter who wins elections? 1 = Yes, 2 = No

Q16: In general, how easy or difficult is it to vote in elections? 1 = Very Easy, 4 = Very Difficult

Q20: Are you currently registered to vote? 1 = Yes, 2 = No

Subsetting the Data

cols = c("educ","Q5","Q16","Q20")
df.sub = df[cols]
kable(head(df.sub))
educ Q5 Q16 Q20
College 1 1 1
College 1 2 1
College 1 1 1
Some college 1 4 1
High school or less 1 1 1
High school or less 2 -1 2

Summary Statistics

Let’s take a look at the summary statistics:

summary(df.sub)
##      educ                 Q5             Q16             Q20        
##  Length:5836        Min.   :-1.00   Min.   :-1.00   Min.   :-1.000  
##  Class :character   1st Qu.: 1.00   1st Qu.: 1.00   1st Qu.: 1.000  
##  Mode  :character   Median : 1.00   Median : 1.00   Median : 1.000  
##                     Mean   : 1.16   Mean   : 1.62   Mean   : 1.076  
##                     3rd Qu.: 1.00   3rd Qu.: 2.00   3rd Qu.: 1.000  
##                     Max.   : 2.00   Max.   : 4.00   Max.   : 2.000
sum(is.na(df.sub))
## [1] 0

Replacing Negative Values

The first thing that pops out are the minimum values of -1, since the minimum response value is 1. Let’s see how many of those there are.

df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> count()
##    n
## 1 82

There are 82 rows with -1 entries.

df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> head()
##                  educ Q5 Q16 Q20
## 1 High school or less  2  -1   2
## 2        Some college  1  -1   2
## 3        Some college  1  -1   1
## 4 High school or less  1   2  -1
## 5 High school or less -1  -1  -1
## 6 High school or less  2   2  -1

We can see that these negative values do not otherwise affect other responses. This is likely a data entry error. To rectify this, I will impute a value of 1 for every -1 value.

df.sub <- df.sub |>
  mutate(Q5 = as.numeric(abs(Q5)),
            Q16 = as.numeric(abs(Q16)),
            Q20 = as.numeric(abs(Q20)))

summary(df.sub)
##      educ                 Q5            Q16            Q20       
##  Length:5836        Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  Class :character   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:1.000  
##  Mode  :character   Median :1.00   Median :1.00   Median :1.000  
##                     Mean   :1.17   Mean   :1.64   Mean   :1.083  
##                     3rd Qu.:1.00   3rd Qu.:2.00   3rd Qu.:1.000  
##                     Max.   :2.00   Max.   :4.00   Max.   :2.000

We can see this didn’t significantly affect the summary statistics.

Part 3 - Exploratory data analysis

Distribution of Education

Let’s take a look at the distribution of education.

ggplot(df.sub, aes(x=educ)) +
  geom_bar() + 
  geom_text(aes(label = after_stat(count)), 
            stat="count", vjust=2,
            color="white") +
  labs(x = "education")

Education Frequency Table

kable(prop.table(table(df.sub$educ)), col.names=c("Level","Freq"))
Level Freq
College 0.3992461
High school or less 0.3077450
Some college 0.2930089

We can see from the survey responses, college graduates accounted for just under 40% of non-voters amongst respondents, while those hadn’t completed college accounted for just over 60% of respondents.

Average Answer by Education Level

Taking a look at average answer by education level,

df.sub |> 
  group_by(educ) |>
  summarize(Q5 = mean(Q5),
            Q16 = mean(Q16),
            Q20 = mean(Q20))
## # A tibble: 3 × 4
##   educ                   Q5   Q16   Q20
##   <chr>               <dbl> <dbl> <dbl>
## 1 College              1.13  1.63  1.02
## 2 High school or less  1.22  1.68  1.17
## 3 Some college         1.17  1.61  1.07

We can see that responders with an education level of “High School or less” has the highest response scores – with higher scores representing more negative attitudes towards voting. Somewhat surprisingly, respondents with “some college” tend to view the difficulty involved with voting (Q16) more favorably than the other groups.

Transforming the Data

As question 16 is rated on a scale of 1-4, 1 being “very easy”, 2 being “easy”, 3 = “difficult”, 4 = “very difficult”, we dichotomize the scores to 1 and 2 in order to visualize and model the relationship. We also dichotomize the education variable, with “yes” representing “has a college degree”, and “no” representing “does not have a college degree”.

college <- df.sub |>
  mutate(Q16 = as.numeric(ifelse(Q16 < 3, 1, 2)),
         Q5 = factor(Q5),
         Q16 = factor(Q16),
         Q20 = factor(Q20),
         college = factor(ifelse(educ == "College", "yes","no"))) |>
  select(5, 2:4)
kable(head(college))
college Q5 Q16 Q20
yes 1 1 1
yes 1 1 1
yes 1 1 1
no 1 2 1
no 1 1 1
no 2 1 2

Let’s take a look at the average answer by education now,

college |> 
  group_by(college) |>
  summarize(Q5 = mean(as.numeric(Q5)),
            Q16 = mean(as.numeric(Q16)),
            Q20 = mean(as.numeric(Q20)))
## # A tibble: 2 × 4
##   college    Q5   Q16   Q20
##   <fct>   <dbl> <dbl> <dbl>
## 1 no       1.20  1.15  1.12
## 2 yes      1.13  1.15  1.02

We can see there’s almost no difference in the answers for Question 16.

Now, let’s take a look at the distribution of responses by education level.

Distribution of Ratings by Dichotomized Education

Let’s look at a scatter plot of the data,

college |>
  pivot_longer(cols = c(2:4), names_to = "question", values_to = "rating") |>
  mutate(question = factor(question, levels=c("Q5","Q16","Q20"))) |>
  ggplot(aes(x=college, y=rating, color=question)) +
  geom_jitter() +
  ylim("1","2") +
  labs(title = "Distribution of Ratings")

We can see when we try to visualize this data as a scatterplot, the binomial versus categorical variables do not provide much insight into a correlation and there’s certainly no linear relationship.

So instead we model the data as a barplot. Here we can see that the proportions of responses for college graduates versus non college graduates for questions 5 & 20 is more apparent than for question 16, which looks like a relatively comparable proportion.

college |>
  pivot_longer(cols = c(2:4), names_to = "question", values_to = "rating") |>
  ggplot(aes(x=rating, fill=college)) +
  facet_grid(~factor(question, levels=c('Q5','Q16','Q20'))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Ratings") +
  xlim("1","2") +
  scale_fill_manual(values=c("red","blue"))

We can see by the binomial ratings that a linear regression model would not be appropriate for this data. A logistic model, however, will allow us to observe whether a relationship exists.

Part 4 - Inference

Examining the Variables

The dependent varibale for this analysis is education level, yes = has college degree, no = no college degree.

The independent variables are the binomial question scores.

Let’s look at a frequency table for college degrees:

kable(prop.table(table(college$college)), col.names=c("Level","Freq"))
Level Freq
no 0.6007539
yes 0.3992461

Split is about 60/40.

Now let’s estimate the population proportion to see if this holds.

college |>
  specify(response = college, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.387    0.412

This tells us that 95% of the samples of the population of non-voters have mean college degree proportions between 38.7% and 41.1%

Let’s examine the relationship between college degree and response scores.

Multiple Logistic Regression Model

First, let’s see if there’s a relationship between the variables.

We build the logistic model using the glm() function and passing “binomial” into the family argument.

college_mod.full <- glm(college ~ ., data = college, family=binomial(link="logit"))
summary(college_mod.full)
## 
## Call:
## glm(formula = college ~ ., family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0979  -1.0629  -0.9465   1.2964   2.1364  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.27547    0.03145  -8.758  < 2e-16 ***
## Q52         -0.29528    0.07844  -3.764 0.000167 ***
## Q162         0.08559    0.07688   1.113 0.265586    
## Q202        -1.60368    0.14532 -11.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7634.7  on 5832  degrees of freedom
## AIC: 7642.7
## 
## Number of Fisher Scoring iterations: 4

We can see that question 16 (perceived ease of voting) does not have significance in this model, so using backwards stepwise regression, we remove the variable from the model and re-run.

college_mod.v2 <- glm(college ~ Q5 + Q20, data = college, family=binomial(link="logit"))
summary(college_mod.v2)
## 
## Call:
## glm(formula = college ~ Q5 + Q20, family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0677  -1.0677  -0.9519   1.2912   2.1263  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.26364    0.02958  -8.912  < 2e-16 ***
## Q52         -0.29310    0.07840  -3.738 0.000185 ***
## Q202        -1.59365    0.14500 -10.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7635.9  on 5833  degrees of freedom
## AIC: 7641.9
## 
## Number of Fisher Scoring iterations: 4

We see both questions maintain similar significance to the full model.

So we see there is significance, but can the model predict the probability of a college degree based on the response score?

Making Predictions

First we split the data into training and test sets:

set.seed(123)
indices <- sample.split(Y = college$college, SplitRatio=0.8)
college.train <- college[indices,]
college.test<- college[!indices,]

Next, we make predictions using the testing set. First we build a model using the training set.

training.mod <- glm(college ~ Q5 + Q20, data=college.train, family=binomial(link="logit"))

We make predictions for the testing set using the training model, passing “response” into the type argument, and estimate the probability the test model is able to predict a college degree. We know from our earlier frequency table that the proportion of college degrees is about 40%, hence the probability of observing a college degree is 40%, we set the threshold for predictions to 0.4.

preds <- predict(training.mod, newdata = college.test, type = "response")
predicted <- table(college_degree = college.test$college, predicted_value = preds>0.4 )
my_mat = as.matrix(predicted)
my_mat
##               predicted_value
## college_degree FALSE TRUE
##            no    181  520
##            yes    65  401

Given accuracy = $TP + TN / TP + TN + FP + FN = $

TN <- my_mat[1] # true negative (no/false)
FP <- my_mat[2] # false positive (yes/false)
FN <- my_mat[3] # false negative (no/true)
TP <- my_mat[4] # true positive (yes/true)

accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
## [1] 0.4987147

We see the model only predicts with about 50% accuracy.

Using the ROC package, we can plot the trade off between sensitivity (True Positive) and specificity (True Negative):

p = prediction(preds, college.test$college)
prf <-performance(p, measure ="tpr", x.measure ="fpr")
plot(prf)

The nearly straight line confirms the model offers nearly no predictive value.

Lastly, let’s take a look at the Area Under the Curve (AUC)

auc <- performance(p, measure = "auc")
auc <- auc@y.values[[1]]

auc
## [1] 0.5655685

AUC measures separability. 0 = reciprocation, 1 = separability. 0.5 = no class separation capacity.

And we can see from the score here, this is precisely what we get. The model offers no separation capacity to distinguish between the categories.

Part 5 - Conclusion

The conclusion here is that although there is a relationship between education and some voter responses to specific survey questions, it’s clear that the model is not sufficient to observe or predict the strength of that relationship, so we fail to reject the null hypothesis. Some of the challenges involved in observing this relationship is the quality of the questions and the binomial and discrete response scores.