Abstract
This project examines responses to a FiveThirtyEight.com-sponsored survey of non-voters, conducted September 15-25, 2020, that asked about their attitudes toward elections, and asks whether those attitudes are related to education level. Of the 33 questions posed in the survey, we select the three that focus specifically on attitudes towards voting. We then determine whether a logistic regression model is appropriate for estimating the relationship between education level and these responses, and whether the response scores can predict having a college degree, positing the null hypothesis that there is no relationship between the responses and having a college degree.
Part 1 - Introduction
This project focuses on the question “Is education level related to voter attitudes amongst non-voters?” using data collected for the FiveThirtyEight.com report “Why Many Americans Don’t Vote”.
Part 2 - Data
- 5,836 respondents
- Collected by Ipsos for FiveThirtyEight.
- The study is observational, relying on data taken from a random sample.
- https://github.com/fivethirtyeight/data/blob/master/non-voters/README.md
Let’s take a look at the data.
# loading dataset
= "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
url = read.csv(url)
df dim(df)
## [1] 5836 119
The dataset contains 5,836 observations across 119 columns. For the purpose of this analysis, we will subset the data to the variables of interest: education level (educ) and Questions 5, 16, and 20.
These questions are as follows:
Q5: Does it matter who wins elections? 1 = Yes, 2 = No
Q16: In general, how easy or difficult is it to vote in elections? 1 = Very Easy, 4 = Very Difficult
Q20: Are you currently registered to vote? 1 = Yes, 2 = No
Subsetting the Data
= c("educ","Q5","Q16","Q20")
cols = df[cols]
df.sub kable(head(df.sub))
educ | Q5 | Q16 | Q20 |
---|---|---|---|
College | 1 | 1 | 1 |
College | 1 | 2 | 1 |
College | 1 | 1 | 1 |
Some college | 1 | 4 | 1 |
High school or less | 1 | 1 | 1 |
High school or less | 2 | -1 | 2 |
Summary Statistics
Let’s take a look at the summary statistics:
summary(df.sub)
## educ Q5 Q16 Q20
## Length:5836 Min. :-1.00 Min. :-1.00 Min. :-1.000
## Class :character 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 1.000
## Mode :character Median : 1.00 Median : 1.00 Median : 1.000
## Mean : 1.16 Mean : 1.62 Mean : 1.076
## 3rd Qu.: 1.00 3rd Qu.: 2.00 3rd Qu.: 1.000
## Max. : 2.00 Max. : 4.00 Max. : 2.000
sum(is.na(df.sub))
## [1] 0
Replacing Negative Values
The first thing that stands out is the minimum value of -1, since the lowest valid response value is 1. Let’s see how many of those there are.
df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> count()
## n
## 1 82
There are 82 rows with -1 entries.
df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> head()
## educ Q5 Q16 Q20
## 1 High school or less 2 -1 2
## 2 Some college 1 -1 2
## 3 Some college 1 -1 1
## 4 High school or less 1 2 -1
## 5 High school or less -1 -1 -1
## 6 High school or less 2 2 -1
These -1 values do not follow any obvious pattern in the other responses and are likely a data-entry artifact rather than meaningful answers. To rectify this, we impute a value of 1 for every -1 value by taking the absolute value.
df.sub <- df.sub |>
  mutate(Q5 = as.numeric(abs(Q5)),
         Q16 = as.numeric(abs(Q16)),
         Q20 = as.numeric(abs(Q20)))
summary(df.sub)
## educ Q5 Q16 Q20
## Length:5836 Min. :1.00 Min. :1.00 Min. :1.000
## Class :character 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.000
## Mode :character Median :1.00 Median :1.00 Median :1.000
## Mean :1.17 Mean :1.64 Mean :1.083
## 3rd Qu.:1.00 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :2.00 Max. :4.00 Max. :2.000
We can see this didn’t significantly affect the summary statistics.
Part 3 - Exploratory data analysis
Distribution of Education
Let’s take a look at the distribution of education.
ggplot(df.sub, aes(x=educ)) +
geom_bar() +
geom_text(aes(label = after_stat(count)),
stat="count", vjust=2,
color="white") +
labs(x = "education")
Education Frequency Table
kable(prop.table(table(df.sub$educ)), col.names=c("Level","Freq"))
Level | Freq |
---|---|
College | 0.3992461 |
High school or less | 0.3077450 |
Some college | 0.2930089 |
We can see from the survey responses that college graduates accounted for just under 40% of respondents, while those who hadn’t completed college accounted for just over 60%.
Average Answer by Education Level
Taking a look at average answer by education level,
df.sub |>
  group_by(educ) |>
  summarize(Q5 = mean(Q5),
            Q16 = mean(Q16),
            Q20 = mean(Q20))
## # A tibble: 3 × 4
## educ Q5 Q16 Q20
## <chr> <dbl> <dbl> <dbl>
## 1 College 1.13 1.63 1.02
## 2 High school or less 1.22 1.68 1.17
## 3 Some college 1.17 1.61 1.07
We can see that respondents with an education level of “High school or less” have the highest response scores, with higher scores representing more negative attitudes towards voting. Somewhat surprisingly, respondents with “Some college” rate the difficulty of voting (Q16) slightly more favorably than the other groups.
Transforming the Data
Question 16 is rated on a scale of 1-4 (1 = “very easy”, 2 = “easy”, 3 = “difficult”, 4 = “very difficult”), so we dichotomize its scores: responses of 1-2 become 1 (easy) and responses of 3-4 become 2 (difficult), in order to visualize and model the relationship. We also dichotomize the education variable, with “yes” representing “has a college degree” and “no” representing “does not have a college degree”.
college <- df.sub |>
  mutate(Q16 = as.numeric(ifelse(Q16 < 3, 1, 2)),
         Q5 = factor(Q5),
         Q16 = factor(Q16),
         Q20 = factor(Q20),
         college = factor(ifelse(educ == "College", "yes", "no"))) |>
  select(5, 2:4)
kable(head(college))
college | Q5 | Q16 | Q20 |
---|---|---|---|
yes | 1 | 1 | 1 |
yes | 1 | 1 | 1 |
yes | 1 | 1 | 1 |
no | 1 | 2 | 1 |
no | 1 | 1 | 1 |
no | 2 | 1 | 2 |
Let’s take a look at the average answer by education now,
college |>
  group_by(college) |>
  summarize(Q5 = mean(as.numeric(Q5)),
            Q16 = mean(as.numeric(Q16)),
            Q20 = mean(as.numeric(Q20)))
## # A tibble: 2 × 4
## college Q5 Q16 Q20
## <fct> <dbl> <dbl> <dbl>
## 1 no 1.20 1.15 1.12
## 2 yes 1.13 1.15 1.02
We can see there’s almost no difference in the answers for Question 16.
Now, let’s take a look at the distribution of responses by education level.
Distribution of Ratings by Dichotomized Education
Let’s look at a scatter plot of the data,
college |>
  pivot_longer(cols = c(2:4), names_to = "question", values_to = "rating") |>
  mutate(question = factor(question, levels = c("Q5","Q16","Q20"))) |>
  ggplot(aes(x = college, y = rating, color = question)) +
  geom_jitter() +
  ylim("1","2") +
  labs(title = "Distribution of Ratings")
When we visualize the data as a scatterplot, the binary responses plotted against a categorical predictor provide little insight into any correlation, and there is certainly no linear relationship to be seen.
So instead we plot the data as a barplot. Here the difference in response proportions between college graduates and non-graduates is more apparent for Questions 5 and 20 than for Question 16, where the proportions look roughly comparable.
college |>
  pivot_longer(cols = c(2:4), names_to = "question", values_to = "rating") |>
  ggplot(aes(x = rating, fill = college)) +
  facet_grid(~factor(question, levels = c('Q5','Q16','Q20'))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Ratings") +
  xlim("1","2") +
  scale_fill_manual(values = c("red","blue"))
Because the ratings are binary, a linear regression model would not be appropriate for these data. A logistic model, however, will allow us to test whether a relationship exists.
Part 4 - Inference
Examining the Variables
The dependent variable for this analysis is education level: yes = has a college degree, no = no college degree.
The independent variables are the binomial question scores.
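For reference, the logistic model fit below relates the log-odds of holding a college degree to the question responses. With the factor coding used here (the Q52, Q162, and Q202 terms in the model output are indicators for a response of 2), the full model is

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{Q5}_2 + \beta_2\,\mathrm{Q16}_2 + \beta_3\,\mathrm{Q20}_2,$$

where $p$ is the probability that a respondent holds a college degree.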
Let’s look at a frequency table for college degrees:
kable(prop.table(table(college$college)), col.names=c("Level","Freq"))
Level | Freq |
---|---|
no | 0.6007539 |
yes | 0.3992461 |
Split is about 60/40.
Now let’s estimate the population proportion to see if this holds.
college |>
  specify(response = college, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.387 0.412
This tells us we can be 95% confident that the true proportion of non-voters with a college degree lies between 38.7% and 41.2%.
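As a quick cross-check (an added step, not part of the original bootstrap workflow), a standard large-sample proportion interval should give essentially the same result:

# Cross-check: large-sample 95% CI for the proportion of college graduates
prop.test(sum(college$college == "yes"), nrow(college))$conf.int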
Let’s examine the relationship between college degree and response scores.
Multiple Logistic Regression Model
First, let’s see if there’s a relationship between the variables.
We build the logistic model using the glm() function and passing “binomial” into the family argument.
college_mod.full <- glm(college ~ ., data = college, family = binomial(link = "logit"))
summary(college_mod.full)
##
## Call:
## glm(formula = college ~ ., family = binomial(link = "logit"),
## data = college)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0979 -1.0629 -0.9465 1.2964 2.1364
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.27547 0.03145 -8.758 < 2e-16 ***
## Q52 -0.29528 0.07844 -3.764 0.000167 ***
## Q162 0.08559 0.07688 1.113 0.265586
## Q202 -1.60368 0.14532 -11.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7851.8 on 5835 degrees of freedom
## Residual deviance: 7634.7 on 5832 degrees of freedom
## AIC: 7642.7
##
## Number of Fisher Scoring iterations: 4
We can see that Question 16 (perceived ease of voting) is not significant in this model, so, following a backward-elimination approach, we remove that variable from the model and re-run.
college_mod.v2 <- glm(college ~ Q5 + Q20, data = college, family = binomial(link = "logit"))
summary(college_mod.v2)
##
## Call:
## glm(formula = college ~ Q5 + Q20, family = binomial(link = "logit"),
## data = college)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0677 -1.0677 -0.9519 1.2912 2.1263
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26364 0.02958 -8.912 < 2e-16 ***
## Q52 -0.29310 0.07840 -3.738 0.000185 ***
## Q202 -1.59365 0.14500 -10.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7851.8 on 5835 degrees of freedom
## Residual deviance: 7635.9 on 5833 degrees of freedom
## AIC: 7641.9
##
## Number of Fisher Scoring iterations: 4
We see both questions maintain similar significance to the full model.
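As a sanity check on the manual removal above (an added step, not part of the original workflow), R’s built-in step() function performs backward elimination automatically by AIC and should arrive at the same reduced model here, since dropping Q16 is the only removal that lowers the AIC:

# Automated backward elimination by AIC (cross-check of the manual step above)
step(college_mod.full, direction = "backward")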
So we see there is significance, but can the model predict the probability of a college degree based on the response score?
Making Predictions
First we split the data into training and test sets:
set.seed(123)
indices <- sample.split(Y = college$college, SplitRatio = 0.8) # sample.split() is from the caTools package
college.train <- college[indices, ]
college.test <- college[!indices, ]
Next, we make predictions using the testing set. First we build a model using the training set.
training.mod <- glm(college ~ Q5 + Q20, data = college.train, family = binomial(link = "logit"))
We then make predictions for the test set using the training model, passing “response” into the type argument so that predict() returns estimated probabilities of having a college degree. From the earlier frequency table, the proportion of college degrees is about 40%, so we set the classification threshold to 0.4: predicted probabilities above 0.4 are classified as a college degree.
preds <- predict(training.mod, newdata = college.test, type = "response")
predicted <- table(college_degree = college.test$college, predicted_value = preds > 0.4)
my_mat <- as.matrix(predicted)
my_mat
## predicted_value
## college_degree FALSE TRUE
## no 181 520
## yes 65 401
Given accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$:
TN <- my_mat[1] # true negative  (actual no,  predicted FALSE)
FN <- my_mat[2] # false negative (actual yes, predicted FALSE)
FP <- my_mat[3] # false positive (actual no,  predicted TRUE)
TP <- my_mat[4] # true positive  (actual yes, predicted TRUE)

accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
## [1] 0.4987147
We see the model only predicts with about 50% accuracy.
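To put that figure in context (an added comparison, not in the original write-up), a trivial classifier that always predicts “no college degree” would be correct about 60% of the time, the majority-class proportion, so the fitted model actually does worse than guessing the majority class:

# Majority-class baseline: accuracy of always predicting "no"
mean(college.test$college == "no")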
Using the ROCR package, we can plot the ROC curve, which shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity):
p <- prediction(preds, college.test$college)
prf <- performance(p, measure = "tpr", x.measure = "fpr")
plot(prf)
The nearly straight line confirms that the model offers little predictive value.
Lastly, let’s take a look at the Area Under the Curve (AUC)
auc <- performance(p, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.5655685
AUC measures how well the model separates the two classes: a value of 1 means perfect separation, 0 means the predictions are perfectly reversed, and 0.5 means the model has no capacity to separate the classes.
The score here, about 0.57, is very close to that last case: the model has almost no capacity to distinguish between the two categories.
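As a final sanity check (an added snippet, not part of the original analysis), the AUC can also be computed directly as the probability that a randomly chosen college graduate receives a higher predicted probability than a randomly chosen non-graduate, counting ties as one half; this should reproduce the value reported by performance():

# AUC as P(predicted probability for a random "yes" exceeds that for a random "no")
pos <- preds[college.test$college == "yes"]
neg <- preds[college.test$college == "no"]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")) # ties count as 1/2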
Part 5 - Conclusion
The conclusion is that although the regression coefficients suggest a relationship between education and responses to specific survey questions, the model cannot predict a college degree from those responses with any useful accuracy, so we fail to reject the null hypothesis that there is no relationship. Some of the challenges in observing such a relationship are the quality of the questions and the coarse, dichotomized response scores.