DATA 606 Final Project

Josh Iden

ABSTRACT

THE DATA

  • 5,836 Observations
  • Observational Study
  • Random Sampling of Non-Voters

Questions

  • Does It Matter Who Wins Elections? (1=Yes, 2=No)
  • How Easy Is It To Vote? (1=Easy, 4=Difficult)
  • Are You Currently Registerd To Vote? (1=Yes, 2=No)

Loading The Data

# loading dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
df = read.csv(url)
dim(df)
## [1] 5836  119

Subsetting The Data

cols = c("educ","Q5","Q16","Q20")
df.sub = df[cols]
kable(head(df.sub))
educ Q5 Q16 Q20
College 1 1 1
College 1 2 1
College 1 1 1
Some college 1 4 1
High school or less 1 1 1
High school or less 2 -1 2

Summary Statistics

summary(df.sub)
##      educ                 Q5             Q16             Q20        
##  Length:5836        Min.   :-1.00   Min.   :-1.00   Min.   :-1.000  
##  Class :character   1st Qu.: 1.00   1st Qu.: 1.00   1st Qu.: 1.000  
##  Mode  :character   Median : 1.00   Median : 1.00   Median : 1.000  
##                     Mean   : 1.16   Mean   : 1.62   Mean   : 1.076  
##                     3rd Qu.: 1.00   3rd Qu.: 2.00   3rd Qu.: 1.000  
##                     Max.   : 2.00   Max.   : 4.00   Max.   : 2.000

Missing Values

sum(is.na(df.sub))
## [1] 0

Replacing Negative Values

df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> count()
##    n
## 1 82

Replacing Negative Values

df.sub <- df.sub |>
  mutate(Q5 = as.numeric(abs(Q5)),
            Q16 = as.numeric(abs(Q16)),
            Q20 = as.numeric(abs(Q20)))

summary(df.sub)
##      educ                 Q5            Q16            Q20       
##  Length:5836        Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  Class :character   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:1.000  
##  Mode  :character   Median :1.00   Median :1.00   Median :1.000  
##                     Mean   :1.17   Mean   :1.64   Mean   :1.083  
##                     3rd Qu.:1.00   3rd Qu.:2.00   3rd Qu.:1.000  
##                     Max.   :2.00   Max.   :4.00   Max.   :2.000

EXPLORATORY DATA ANALYSIS

Distribution of Education

Distribution of Education

Level Freq
College 0.3992461
High school or less 0.3077450
Some college 0.2930089

Transforming the Data

college <- df.sub |>
  mutate(Q16 = as.numeric(ifelse(Q16 < 3, 1, 2)),
         Q5 = factor(Q5),
         Q16 = factor(Q16),
         Q20 = factor(Q20),
         college = factor(ifelse(educ == "College", "yes","no"))) |>
  select(5, 2:4)
kable(head(college))
college Q5 Q16 Q20
yes 1 1 1
yes 1 1 1
yes 1 1 1
no 1 2 1
no 1 1 1
no 2 1 2

Distribution of Responses

Distribution of Responses

INFERENCE

Examining The Variables

  • Dependent = eduction
  • Independent = ratings

Frequency Table

Level Freq
no 0.6007539
yes 0.3992461

Estimating the Population Proportion

college |>
  specify(response = college, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.386    0.412

Multiple Logistic Regression Model

college_mod.full <- glm(college ~ ., data = college, family=binomial(link="logit"))
summary(college_mod.full)
## 
## Call:
## glm(formula = college ~ ., family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0979  -1.0629  -0.9465   1.2964   2.1364  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.27547    0.03145  -8.758  < 2e-16 ***
## Q52         -0.29528    0.07844  -3.764 0.000167 ***
## Q162         0.08559    0.07688   1.113 0.265586    
## Q202        -1.60368    0.14532 -11.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7634.7  on 5832  degrees of freedom
## AIC: 7642.7
## 
## Number of Fisher Scoring iterations: 4

Backwards Stepwise Regression

college_mod.v2 <- glm(college ~ Q5 + Q20, data = college, family=binomial(link="logit"))
summary(college_mod.v2)
## 
## Call:
## glm(formula = college ~ Q5 + Q20, family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0677  -1.0677  -0.9519   1.2912   2.1263  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.26364    0.02958  -8.912  < 2e-16 ***
## Q52         -0.29310    0.07840  -3.738 0.000185 ***
## Q202        -1.59365    0.14500 -10.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7635.9  on 5833  degrees of freedom
## AIC: 7641.9
## 
## Number of Fisher Scoring iterations: 4

Making Predictions

Splitting the Data

set.seed(123)
indices <- sample.split(Y = college$college, SplitRatio=0.8)
college.train <- college[indices,]
college.test<- college[!indices,]

Training the Model

training.mod <- glm(college ~ Q5 + Q20, data=college.train, family=binomial(link="logit"))

Making Predictions

preds <- predict(training.mod, newdata = college.test, type = "response")
predicted <- table(college_degree = college.test$college, predicted_value = preds>0.4 )
predicted
##               predicted_value
## college_degree FALSE TRUE
##            no    181  520
##            yes    65  401

Measuring Accuracy

  • Accuracy = TP + TN / TP + TN + FP + FN
my_mat <- as.matrix(predicted)
TN <- my_mat[1] # true negative (no/false)
FP <- my_mat[2] # false positive (yes/false)
FN <- my_mat[3] # false negative (no/true)
TP <- my_mat[4] # true positive (yes/true)

accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
## [1] 0.4987147

Plotting the ROC

  • Receiver Operator Characteristic (ROC)
  • Plots True Positive Rate against False Positive Rate
  • Curves closer to top corner indicate better performance
p <- prediction(preds, college.test$college)
prf <-performance(p, measure ="tpr", x.measure ="fpr")
plot(prf)

Measuring the Area Under the Curve (AUC)

  • Measures separability
  • Models capability to distinguish between classes (Yes/No)
  • 0 to 1 scale

Measuring the Area Under the Curve (AUC)

auc <- performance(p, measure = "auc")
auc <- auc@y.values[[1]]

auc
## [1] 0.5655685

CONCLUSION

THANK YOU