DATA 606 Final Project

Josh Iden

ABSTRACT

THE DATA

5,836 Observations
Observational Study
Random Sampling of Non-Voters

Questions

Does It Matter Who Wins Elections? (1=Yes, 2=No)
How Easy Is It To Vote? (1=Easy, 4=Difficult)
Are You Currently Registerd To Vote? (1=Yes, 2=No)

Loading The Data

# loading dataset
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv"
df = read.csv(url)
dim(df)

## [1] 5836  119

Subsetting The Data

cols = c("educ","Q5","Q16","Q20")
df.sub = df[cols]
kable(head(df.sub))

educ	Q5	Q16	Q20
College	1	1	1
College	1	2	1
College	1	1	1
Some college	1	4	1
High school or less	1	1	1
High school or less	2	-1	2

Summary Statistics

summary(df.sub)

##      educ                 Q5             Q16             Q20        
##  Length:5836        Min.   :-1.00   Min.   :-1.00   Min.   :-1.000  
##  Class :character   1st Qu.: 1.00   1st Qu.: 1.00   1st Qu.: 1.000  
##  Mode  :character   Median : 1.00   Median : 1.00   Median : 1.000  
##                     Mean   : 1.16   Mean   : 1.62   Mean   : 1.076  
##                     3rd Qu.: 1.00   3rd Qu.: 2.00   3rd Qu.: 1.000  
##                     Max.   : 2.00   Max.   : 4.00   Max.   : 2.000

Missing Values

sum(is.na(df.sub))

## [1] 0

Replacing Negative Values

df.sub |> filter(Q5 == -1 | Q16 == -1 | Q20 == -1) |> count()

##    n
## 1 82

Replacing Negative Values

df.sub <- df.sub |>
  mutate(Q5 = as.numeric(abs(Q5)),
            Q16 = as.numeric(abs(Q16)),
            Q20 = as.numeric(abs(Q20)))

summary(df.sub)

##      educ                 Q5            Q16            Q20       
##  Length:5836        Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  Class :character   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:1.000  
##  Mode  :character   Median :1.00   Median :1.00   Median :1.000  
##                     Mean   :1.17   Mean   :1.64   Mean   :1.083  
##                     3rd Qu.:1.00   3rd Qu.:2.00   3rd Qu.:1.000  
##                     Max.   :2.00   Max.   :4.00   Max.   :2.000

EXPLORATORY DATA ANALYSIS

Distribution of Education

Level	Freq
College	0.3992461
High school or less	0.3077450
Some college	0.2930089

Transforming the Data

college <- df.sub |>
  mutate(Q16 = as.numeric(ifelse(Q16 < 3, 1, 2)),
         Q5 = factor(Q5),
         Q16 = factor(Q16),
         Q20 = factor(Q20),
         college = factor(ifelse(educ == "College", "yes","no"))) |>
  select(5, 2:4)
kable(head(college))

college	Q5	Q16	Q20
yes	1	1	1
yes	1	1	1
yes	1	1	1
no	1	2	1
no	1	1	1
no	2	1	2

Distribution of Responses

INFERENCE

Examining The Variables

Dependent = eduction
Independent = ratings

Frequency Table

Level	Freq
no	0.6007539
yes	0.3992461

Estimating the Population Proportion

college |>
  specify(response = college, success = "yes") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = 0.95)

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.386    0.412

Multiple Logistic Regression Model

college_mod.full <- glm(college ~ ., data = college, family=binomial(link="logit"))
summary(college_mod.full)

## 
## Call:
## glm(formula = college ~ ., family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0979  -1.0629  -0.9465   1.2964   2.1364  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.27547    0.03145  -8.758  < 2e-16 ***
## Q52         -0.29528    0.07844  -3.764 0.000167 ***
## Q162         0.08559    0.07688   1.113 0.265586    
## Q202        -1.60368    0.14532 -11.036  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7634.7  on 5832  degrees of freedom
## AIC: 7642.7
## 
## Number of Fisher Scoring iterations: 4

Backwards Stepwise Regression

college_mod.v2 <- glm(college ~ Q5 + Q20, data = college, family=binomial(link="logit"))
summary(college_mod.v2)

## 
## Call:
## glm(formula = college ~ Q5 + Q20, family = binomial(link = "logit"), 
##     data = college)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0677  -1.0677  -0.9519   1.2912   2.1263  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.26364    0.02958  -8.912  < 2e-16 ***
## Q52         -0.29310    0.07840  -3.738 0.000185 ***
## Q202        -1.59365    0.14500 -10.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7851.8  on 5835  degrees of freedom
## Residual deviance: 7635.9  on 5833  degrees of freedom
## AIC: 7641.9
## 
## Number of Fisher Scoring iterations: 4

Making Predictions

Splitting the Data

set.seed(123)
indices <- sample.split(Y = college$college, SplitRatio=0.8)
college.train <- college[indices,]
college.test<- college[!indices,]

Training the Model

training.mod <- glm(college ~ Q5 + Q20, data=college.train, family=binomial(link="logit"))

Making Predictions

preds <- predict(training.mod, newdata = college.test, type = "response")
predicted <- table(college_degree = college.test$college, predicted_value = preds>0.4 )
predicted

##               predicted_value
## college_degree FALSE TRUE
##            no    181  520
##            yes    65  401

Measuring Accuracy

Accuracy = TP + TN / TP + TN + FP + FN

my_mat <- as.matrix(predicted)
TN <- my_mat[1] # true negative (no/false)
FP <- my_mat[2] # false positive (yes/false)
FN <- my_mat[3] # false negative (no/true)
TP <- my_mat[4] # true positive (yes/true)

accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy

## [1] 0.4987147

Plotting the ROC

Receiver Operator Characteristic (ROC)
Plots True Positive Rate against False Positive Rate
Curves closer to top corner indicate better performance

p <- prediction(preds, college.test$college)
prf <-performance(p, measure ="tpr", x.measure ="fpr")
plot(prf)

Measuring the Area Under the Curve (AUC)

Measures separability
Models capability to distinguish between classes (Yes/No)
0 to 1 scale

Measuring the Area Under the Curve (AUC)

auc <- performance(p, measure = "auc")
auc <- auc@y.values[[1]]

auc

## [1] 0.5655685

DATA 606 Final Project

Josh Iden

ABSTRACT

THE DATA

Questions

Loading The Data

Subsetting The Data

Summary Statistics

Missing Values

Replacing Negative Values

Replacing Negative Values

EXPLORATORY DATA ANALYSIS

Distribution of Education

Distribution of Education

Transforming the Data

Distribution of Responses

Distribution of Responses

INFERENCE

Examining The Variables

Frequency Table

Estimating the Population Proportion

Multiple Logistic Regression Model

Backwards Stepwise Regression

Making Predictions

Splitting the Data

Training the Model

Making Predictions

Measuring Accuracy

Plotting the ROC

Measuring the Area Under the Curve (AUC)

Measuring the Area Under the Curve (AUC)

CONCLUSION

THANK YOU