DATA 606 Final Project
Josh Iden
THE DATA
- 5,836 Observations
- Observational Study
- Random Sampling of Non-Voters
Questions
- Does It Matter Who Wins Elections? (1=Yes, 2=No)
- How Easy Is It To Vote? (1=Easy, 4=Difficult)
- Are You Currently Registerd To Vote? (1=Yes, 2=No)
Loading The Data
# loading dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
df = read.csv(url)
dim(df)
## [1] 5836 119
Subsetting The Data
cols = c("educ","Q5","Q16","Q20")
df.sub = df[cols]
kable(head(df.sub))
College |
1 |
1 |
1 |
College |
1 |
2 |
1 |
College |
1 |
1 |
1 |
Some college |
1 |
4 |
1 |
High school or less |
1 |
1 |
1 |
High school or less |
2 |
-1 |
2 |
Summary Statistics
## educ Q5 Q16 Q20
## Length:5836 Min. :-1.00 Min. :-1.00 Min. :-1.000
## Class :character 1st Qu.: 1.00 1st Qu.: 1.00 1st Qu.: 1.000
## Mode :character Median : 1.00 Median : 1.00 Median : 1.000
## Mean : 1.16 Mean : 1.62 Mean : 1.076
## 3rd Qu.: 1.00 3rd Qu.: 2.00 3rd Qu.: 1.000
## Max. : 2.00 Max. : 4.00 Max. : 2.000
Replacing Negative Values
df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> count()
## n
## 1 82
Replacing Negative Values
df.sub <- df.sub |>
mutate(Q5 = as.numeric(abs(Q5)),
Q16 = as.numeric(abs(Q16)),
Q20 = as.numeric(abs(Q20)))
summary(df.sub)
## educ Q5 Q16 Q20
## Length:5836 Min. :1.00 Min. :1.00 Min. :1.000
## Class :character 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:1.000
## Mode :character Median :1.00 Median :1.00 Median :1.000
## Mean :1.17 Mean :1.64 Mean :1.083
## 3rd Qu.:1.00 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :2.00 Max. :4.00 Max. :2.000
EXPLORATORY DATA ANALYSIS
Distribution of Education
![]()
Distribution of Education
College |
0.3992461 |
High school or less |
0.3077450 |
Some college |
0.2930089 |
college <- df.sub |>
mutate(Q16 = as.numeric(ifelse(Q16 < 3, 1, 2)),
Q5 = factor(Q5),
Q16 = factor(Q16),
Q20 = factor(Q20),
college = factor(ifelse(educ == "College", "yes","no"))) |>
select(5, 2:4)
kable(head(college))
yes |
1 |
1 |
1 |
yes |
1 |
1 |
1 |
yes |
1 |
1 |
1 |
no |
1 |
2 |
1 |
no |
1 |
1 |
1 |
no |
2 |
1 |
2 |
Distribution of Responses
![]()
Distribution of Responses
![]()
Examining The Variables
- Dependent = eduction
- Independent = ratings
Frequency Table
no |
0.6007539 |
yes |
0.3992461 |
Estimating the Population
Proportion
college |>
specify(response = college, success = "yes") |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "prop") |>
get_ci(level = 0.95)
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.386 0.412
Multiple Logistic Regression
Model
college_mod.full <- glm(college ~ ., data = college, family=binomial(link="logit"))
summary(college_mod.full)
##
## Call:
## glm(formula = college ~ ., family = binomial(link = "logit"),
## data = college)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0979 -1.0629 -0.9465 1.2964 2.1364
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.27547 0.03145 -8.758 < 2e-16 ***
## Q52 -0.29528 0.07844 -3.764 0.000167 ***
## Q162 0.08559 0.07688 1.113 0.265586
## Q202 -1.60368 0.14532 -11.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7851.8 on 5835 degrees of freedom
## Residual deviance: 7634.7 on 5832 degrees of freedom
## AIC: 7642.7
##
## Number of Fisher Scoring iterations: 4
Backwards Stepwise
Regression
college_mod.v2 <- glm(college ~ Q5 + Q20, data = college, family=binomial(link="logit"))
summary(college_mod.v2)
##
## Call:
## glm(formula = college ~ Q5 + Q20, family = binomial(link = "logit"),
## data = college)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0677 -1.0677 -0.9519 1.2912 2.1263
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.26364 0.02958 -8.912 < 2e-16 ***
## Q52 -0.29310 0.07840 -3.738 0.000185 ***
## Q202 -1.59365 0.14500 -10.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7851.8 on 5835 degrees of freedom
## Residual deviance: 7635.9 on 5833 degrees of freedom
## AIC: 7641.9
##
## Number of Fisher Scoring iterations: 4
Splitting the Data
set.seed(123)
indices <- sample.split(Y = college$college, SplitRatio=0.8)
college.train <- college[indices,]
college.test<- college[!indices,]
Training the Model
training.mod <- glm(college ~ Q5 + Q20, data=college.train, family=binomial(link="logit"))
Making Predictions
preds <- predict(training.mod, newdata = college.test, type = "response")
predicted <- table(college_degree = college.test$college, predicted_value = preds>0.4 )
predicted
## predicted_value
## college_degree FALSE TRUE
## no 181 520
## yes 65 401
Measuring Accuracy
- Accuracy = TP + TN / TP + TN + FP + FN
my_mat <- as.matrix(predicted)
TN <- my_mat[1] # true negative (no/false)
FP <- my_mat[2] # false positive (yes/false)
FN <- my_mat[3] # false negative (no/true)
TP <- my_mat[4] # true positive (yes/true)
accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
## [1] 0.4987147
Plotting the ROC
- Receiver Operator Characteristic (ROC)
- Plots True Positive Rate against False Positive Rate
- Curves closer to top corner indicate better performance
p <- prediction(preds, college.test$college)
prf <-performance(p, measure ="tpr", x.measure ="fpr")
plot(prf)
![]()
Measuring the Area Under
the Curve (AUC)
- Measures separability
- Models capability to distinguish between classes (Yes/No)
- 0 to 1 scale
Measuring the Area
Under the Curve (AUC)
auc <- performance(p, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.5655685