head(data)
## # A tibble: 6 x 28
## YEAR STATEFIP METRO AGE SEX RACE MARST VETSTAT CITIZEN HISPAN LABFORCE
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016 1 2 70 1 100 1 2 1 0 2
## 2 2016 1 2 63 2 100 1 1 1 0 1
## 3 2016 1 2 59 1 200 4 1 1 0 2
## 4 2016 1 2 79 2 200 6 1 1 0 1
## 5 2016 1 2 57 2 200 6 1 1 0 1
## 6 2016 1 2 80 1 100 1 2 4 0 1
## # … with 17 more variables: EDUC99 <chr>, EDCYC <chr>, EDDIPGED <chr>,
## # EDHGCGED <chr>, SCHLCOLL <chr>, VOWHYNOT <chr>, VOYNOTREG <chr>,
## # VOTEHOW <chr>, VOTEWHEN <chr>, VOREGHOW <chr>, VOTED <chr>, VOREG <chr>,
## # VOSUPPWT <dbl>, RACESIMPLE <chr>, HISPSIMPLE <chr>, EDUSIMPLE <chr>,
## # RACEHISP <dbl>
Categorical response variable: VOTED [1 = did not vote; 2 = voted]
Categorical predictor: SEX [1 = male; 2 = female]
Numeric predictor: AGE [in years]
modAge <- glm(as.factor(VOTED) ~ as.numeric(AGE), data = data, family = "binomial")
summary(modAge)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE), family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0019 -1.3424 0.6834 0.8249 1.0391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0752300 0.0228422 -3.293 0.00099 ***
## as.numeric(AGE) 0.0227560 0.0004632 49.124 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 90061 on 79818 degrees of freedom
## AIC: 90065
##
## Number of Fisher Scoring iterations: 4
# note: VOTED and AGE are read in as character columns, so we convert VOTED to a factor and AGE to numeric inside the formula; otherwise glm() throws an error
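An alternative (a minimal sketch, assuming dplyr is attached, which it is for the pipes used below) is to recode the two columns once so the conversions live in a single place; we keep the explicit conversions in the model calls below for transparency:

# one-time recode: VOTED as a factor, AGE as a number
data <- data %>%
  mutate(VOTED = as.factor(VOTED),
         AGE = as.numeric(AGE))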
graphAge <- ggplot(data = data) +
  geom_bar(aes(x = AGE, fill = VOTED), position = "fill") +
  theme_classic() +
  ggtitle("Votes by Age") +
  xlab("Age (years)") +
  labs(fill = "Voted (1 = no, 2 = yes)") +
  scale_fill_manual(values = c("indianred2", "skyblue2"))
graphAge
This relationship does appear to be significant: the p-value on AGE is far below .05, and the graphic shows a clear upward trend; as age increases, the proportion voting increases.
Because our binary categorical predictor SEX has only two levels, glm() creates the needed dummy variable (SEX2) automatically, so we did not need to write dummy code by hand.
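A quick way to see the coding glm() uses (sketch, base R only):

contrasts(as.factor(data$SEX)) # one dummy column: 0 for level 1 (male), 1 for level 2 (female)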
modSex <- glm(as.factor(VOTED) ~ SEX, data = data, family = "binomial")
summary(modSex)
##
## Call:
## glm(formula = as.factor(VOTED) ~ SEX, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6661 -1.5825 0.7578 0.8206 0.8206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.91549 0.01141 80.21 <2e-16 ***
## SEX2 0.18527 0.01602 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 92447 on 79818 degrees of freedom
## AIC: 92451
##
## Number of Fisher Scoring iterations: 4
graphSex <- ggplot(data = data) +
  geom_bar(aes(x = SEX, fill = VOTED), position = "fill") +
  theme_classic() +
  ggtitle("Votes by Sex") +
  xlab("Sex (1 = male, 2 = female)") +
  labs(fill = "Voted (1 = no, 2 = yes)") +
  scale_fill_manual(values = c("indianred2", "skyblue2"))
graphSex
Sex, as a categorical variable, does not provide much contrast in voter turnout: the two levels are quantitatively similar. The relationship is still statistically significant, however, with a p-value below 2e-16. About 71% of males and 75% of females in this dataset voted in 2016, and the graphic shows the proportion voting is slightly higher for females than for males.
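Those proportions can be checked directly against the data (a sketch, assuming dplyr is attached):

data %>%
  group_by(SEX) %>%
  summarise(turnout = mean(VOTED == "2")) # share voting within each sex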
modAS <- glm(as.factor(VOTED) ~ as.numeric(AGE) + SEX, data = data, family = "binomial")
summary(modAS)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE) + SEX, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0352 -1.3253 0.6868 0.8234 1.0731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1574058 0.0242170 -6.50 8.04e-11 ***
## as.numeric(AGE) 0.0226607 0.0004638 48.86 < 2e-16 ***
## SEX2 0.1674585 0.0162791 10.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 89955 on 79817 degrees of freedom
## AIC: 89961
##
## Number of Fisher Scoring iterations: 4
Logistic regression equation for SEX = female: \[p = \frac{1}{1+e^{-(-0.1574 + 0.1675 + 0.0227x)}} = \frac{1}{1+e^{-(0.0101 + 0.0227x)}}\]

Logistic regression equation for SEX = male: \[p = \frac{1}{1+e^{-(-0.1574 + 0.0227x)}}\]
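As a sanity check on these equations, the fitted probability at a given age can be computed with base R's plogis(), the inverse logit (age 50 here is just an illustrative value):

plogis(-0.1574 + 0.1675 + 0.0227 * 50) # female at age 50: about 0.76
plogis(-0.1574 + 0.0227 * 50) # male at age 50: about 0.73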
modASint <- glm(as.factor(VOTED) ~ as.numeric(AGE) * SEX, data = data, family = "binomial")
summary(modASint)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE) * SEX, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0444 -1.2759 0.6851 0.8166 1.1274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3753079 0.0328591 -11.422 <2e-16 ***
## as.numeric(AGE) 0.0274462 0.0006790 40.424 <2e-16 ***
## SEX2 0.5880463 0.0458578 12.823 <2e-16 ***
## as.numeric(AGE):SEX2 -0.0091299 0.0009305 -9.812 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 89858 on 79816 degrees of freedom
## AIC: 89866
##
## Number of Fisher Scoring iterations: 4
Logistic regression equation for SEX = female: \[p = \frac{1}{1+e^{-((0.0274 - 0.0091)x + (-0.3753 + 0.5880))}} = \frac{1}{1+e^{-(0.0183x + 0.2127)}}\]

Logistic regression equation for SEX = male: \[p = \frac{1}{1+e^{-(0.0274x - 0.3753)}}\]
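The negative interaction term means age has a weaker effect for females. Comparing fitted probabilities at two illustrative ages with plogis() makes this concrete (sketch):

plogis(0.0183 * c(30, 70) + 0.2127) # female at ages 30 and 70: about 0.68 and 0.82
plogis(0.0274 * c(30, 70) - 0.3753) # male at ages 30 and 70: about 0.61 and 0.82
# females are more likely to vote at younger ages, but the sex gap closes at older ages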
Compare models by calculating error rates from confusion matrices:
# note: I followed the ISLR textbook (pp. 157-158) for this approach because I ran into errors when trying the example we did in class
probAge <- predict(modAge, newdata = data, type = "response")
head(probAge)
## 1 2 3 4 5 6
## 0.8201978 0.7954989 0.7802926 0.8484495 0.7723908 0.8513524
contrasts(as.factor(data$VOTED))
## 2
## 1 0
## 2 1
predAge <- rep("1", nrow(data)) # start everyone at "1" (did not vote); nrow(data) is 79820
head(predAge)
## [1] "1" "1" "1" "1" "1" "1"
predAge[probAge > .5] <- "2" # predict "voted" when the fitted probability exceeds .5
head(predAge)
## [1] "2" "2" "2" "2" "2" "2"
table(predAge, data$VOTED) # confusion matrix for modAge
##
## predAge 1 2
## 2 21287 58533
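Since the model predicts "voted" for every observation, the training error rate is simply the share of non-voters; a quick check (sketch):

mean(predAge != data$VOTED) # 21287 / 79820, about 0.267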
# the method from the email spam example in class
# (note: compare the fitted probabilities probAge, not the character vector predAge)
cmAge <- data.frame(Voted = data$VOTED, predVoted = probAge > .5) %>%
  group_by(Voted, predVoted) %>%
  summarise(n = n())
## `summarise()` regrouping output by 'Voted' (override with `.groups` argument)
cmAge #confusion matrix for modAge
## # A tibble: 2 x 3
## # Groups: Voted [2]
## Voted predVoted n
## <fct> <lgl> <int>
## 1 1 TRUE 21287
## 2 2 TRUE 58533
## both methods agree: the model predicts that everyone votes, so only one predicted row/column appears. This is expected rather than an error: about 73% of the sample voted, and the fitted probability from modAge stays above .5 across the whole observed age range.
# ISLR book method
probSex <- predict(modSex, newdata = data, type = "response")
head(probSex)
## 1 2 3 4 5 6
## 0.7141223 0.7504027 0.7141223 0.7504027 0.7504027 0.7141223
predSex <- rep("1", nrow(data)) # start everyone at "1" (did not vote)
head(predSex)
## [1] "1" "1" "1" "1" "1" "1"
predSex[probSex > .5] <- "2" # predict "voted" when the fitted probability exceeds .5
head(predSex)
## [1] "2" "2" "2" "2" "2" "2"
table(predSex, data$VOTED) # confusion matrix for modSex
##
## predSex 1 2
## 2 21287 58533
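The same error-rate check for modSex (sketch):

mean(predSex != data$VOTED) # also 21287 / 79820, about 0.267 -- identical to modAge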
# class example method (reusing probSex instead of overwriting predSex with probabilities)
cmSex <- data.frame(Voted = data$VOTED, predVoted = probSex > .5) %>%
  group_by(Voted, predVoted) %>%
  summarise(n = n())
## `summarise()` regrouping output by 'Voted' (override with `.groups` argument)
cmSex #confusion matrix for modSex
## # A tibble: 2 x 3
## # Groups: Voted [2]
## Voted predVoted n
## <fct> <lgl> <int>
## 1 1 TRUE 21287
## 2 2 TRUE 58533
## same pattern as with modAge: the fitted probabilities (0.714 for males, 0.750 for females) are always above .5, so the model again predicts that everyone votes and produces the same confusion matrix.
[To do: compute confusion matrices and error rates for modAS and modASint as well.]
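One way to avoid repeating the steps above (a minimal sketch; errRate is a hypothetical helper name) is to wrap them in a function:

# hypothetical helper: training error rate for a fitted logistic model,
# predicting "2" (voted) whenever the fitted probability exceeds .5
errRate <- function(model, truth) {
  probs <- predict(model, type = "response")
  preds <- ifelse(probs > .5, "2", "1")
  mean(preds != truth)
}
errRate(modAS, data$VOTED)
errRate(modASint, data$VOTED)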
Modeling VOTED as a function of SEX tells us that females and males vote at comparable rates, although the bar graph shows females are very slightly more likely to vote. Both predictors are statistically significant, but age tells the more interesting story: as age increases, the proportion voting increases. The glm with both predictors and their interaction (AIC 89866) has a lower AIC than the glm with the predictors alone (AIC 89961), making the interaction model the better fit for these data. [NOTE: This is what we have so far for the conclusion; we will update this section once we have finished the confusion matrices.]
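For reference, the AIC comparison above can be checked in a single call (a sketch, assuming all four fitted models are still in the workspace):

AIC(modAge, modSex, modAS, modASint) # 90065, 92451, 89961, 89866 -- modASint is lowest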