head(data)
## # A tibble: 6 x 28
## YEAR STATEFIP METRO AGE SEX RACE MARST VETSTAT CITIZEN HISPAN LABFORCE
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2016 1 2 70 1 100 1 2 1 0 2
## 2 2016 1 2 63 2 100 1 1 1 0 1
## 3 2016 1 2 59 1 200 4 1 1 0 2
## 4 2016 1 2 79 2 200 6 1 1 0 1
## 5 2016 1 2 57 2 200 6 1 1 0 1
## 6 2016 1 2 80 1 100 1 2 4 0 1
## # … with 17 more variables: EDUC99 <chr>, EDCYC <chr>, EDDIPGED <chr>,
## # EDHGCGED <chr>, SCHLCOLL <chr>, VOWHYNOT <chr>, VOYNOTREG <chr>,
## # VOTEHOW <chr>, VOTEWHEN <chr>, VOREGHOW <chr>, VOTED <chr>, VOREG <chr>,
## # VOSUPPWT <dbl>, RACESIMPLE <chr>, HISPSIMPLE <chr>, EDUSIMPLE <chr>,
## # RACEHISP <dbl>
Categorical response variable: VOTED [1 = did not vote; 2 = voted]
Categorical predictor: SEX [1 = male; 2 = female]
Numeric predictor: AGE [in years]
modAge <- glm(as.factor(VOTED) ~ as.numeric(AGE), data = data, family = "binomial")
summary(modAge)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE), family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0019 -1.3424 0.6834 0.8249 1.0391
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0752300 0.0228422 -3.293 0.00099 ***
## as.numeric(AGE) 0.0227560 0.0004632 49.124 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 90061 on 79818 degrees of freedom
## AIC: 90065
##
## Number of Fisher Scoring iterations: 4
# note: VOTED and AGE are read in as character columns, so we convert VOTED to a factor and AGE to numeric inside the formula; otherwise glm() throws an error
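An alternative (a minimal sketch, assuming dplyr is attached, which it is for the pipes used below) is to recode the two columns once so the conversions live in a single place; we keep the explicit conversions in the model calls below for transparency:

# one-time recode: VOTED as a factor, AGE as a number
data <- data %>%
  mutate(VOTED = as.factor(VOTED),
         AGE = as.numeric(AGE))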
graphAge <- ggplot(data = data) +
  geom_bar(aes(x = AGE, fill = VOTED), position = "fill") +
  theme_classic() +
  ggtitle("Votes by Age") +
  xlab("Age (years)") +
  labs(fill = "Voted (1 = no, 2 = yes)") +
  scale_fill_manual(values = c("indianred2", "skyblue2"))
graphAge
This relationship does appear to be significant: the p-value on AGE is far below .05, and the graphic shows a clear upward trend; as age increases, the proportion voting increases.
Because our binary categorical predictor SEX has only two levels, glm() creates the needed dummy variable (SEX2) automatically, so we did not need to write dummy code by hand.
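A quick way to see the coding glm() uses (sketch, base R only):

contrasts(as.factor(data$SEX)) # one dummy column: 0 for level 1 (male), 1 for level 2 (female)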
modSex <- glm(as.factor(VOTED) ~ SEX, data = data, family = "binomial")
summary(modSex)
##
## Call:
## glm(formula = as.factor(VOTED) ~ SEX, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6661 -1.5825 0.7578 0.8206 0.8206
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.91549 0.01141 80.21 <2e-16 ***
## SEX2 0.18527 0.01602 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 92447 on 79818 degrees of freedom
## AIC: 92451
##
## Number of Fisher Scoring iterations: 4
graphSex <- ggplot(data = data) +
  geom_bar(aes(x = SEX, fill = VOTED), position = "fill") +
  theme_classic() +
  ggtitle("Votes by Sex") +
  xlab("Sex (1 = male, 2 = female)") +
  labs(fill = "Voted (1 = no, 2 = yes)") +
  scale_fill_manual(values = c("indianred2", "skyblue2"))
graphSex
Sex, as a categorical variable, does not provide much contrast in voter turnout: the two levels are quantitatively similar. The relationship is still statistically significant, however, with a p-value below 2e-16. About 71% of males and 75% of females in this dataset voted in 2016, and the graphic shows the proportion voting is slightly higher for females than for males.
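Those proportions can be checked directly against the data (a sketch, assuming dplyr is attached):

data %>%
  group_by(SEX) %>%
  summarise(turnout = mean(VOTED == "2")) # share voting within each sex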
modAS <- glm(as.factor(VOTED) ~ as.numeric(AGE) + SEX, data = data, family = "binomial")
summary(modAS)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE) + SEX, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0352 -1.3253 0.6868 0.8234 1.0731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1574058 0.0242170 -6.50 8.04e-11 ***
## as.numeric(AGE) 0.0226607 0.0004638 48.86 < 2e-16 ***
## SEX2 0.1674585 0.0162791 10.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 89955 on 79817 degrees of freedom
## AIC: 89961
##
## Number of Fisher Scoring iterations: 4
Logistic regression equation for SEX = female: \[p = \frac{1}{1+e^{-(-0.1574 + 0.1675 + 0.0227x)}} = \frac{1}{1+e^{-(0.0101 + 0.0227x)}}\]

Logistic regression equation for SEX = male: \[p = \frac{1}{1+e^{-(-0.1574 + 0.0227x)}}\]
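As a sanity check on these equations, the fitted probability at a given age can be computed with base R's plogis(), the inverse logit (age 50 here is just an illustrative value):

plogis(-0.1574 + 0.1675 + 0.0227 * 50) # female at age 50: about 0.76
plogis(-0.1574 + 0.0227 * 50) # male at age 50: about 0.73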
modASint <- glm(as.factor(VOTED) ~ as.numeric(AGE) * SEX, data = data, family = "binomial")
summary(modASint)
##
## Call:
## glm(formula = as.factor(VOTED) ~ as.numeric(AGE) * SEX, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0444 -1.2759 0.6851 0.8166 1.1274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3753079 0.0328591 -11.422 <2e-16 ***
## as.numeric(AGE) 0.0274462 0.0006790 40.424 <2e-16 ***
## SEX2 0.5880463 0.0458578 12.823 <2e-16 ***
## as.numeric(AGE):SEX2 -0.0091299 0.0009305 -9.812 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 92581 on 79819 degrees of freedom
## Residual deviance: 89858 on 79816 degrees of freedom
## AIC: 89866
##
## Number of Fisher Scoring iterations: 4
Logistic regression equation for SEX = female: \[p = \frac{1}{1+e^{-((0.0274 - 0.0091)x + (-0.3753 + 0.5880))}} = \frac{1}{1+e^{-(0.0183x + 0.2127)}}\]

Logistic regression equation for SEX = male: \[p = \frac{1}{1+e^{-(0.0274x - 0.3753)}}\]
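The negative interaction term means age has a weaker effect for females. Comparing fitted probabilities at two illustrative ages with plogis() makes this concrete (sketch):

plogis(0.0183 * c(30, 70) + 0.2127) # female at ages 30 and 70: about 0.68 and 0.82
plogis(0.0274 * c(30, 70) - 0.3753) # male at ages 30 and 70: about 0.61 and 0.82
# females are more likely to vote at younger ages, but the sex gap closes at older ages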
Compare models by calculating error rates from confusion matrices:
# note: I followed the ISLR textbook (pp. 157-158) for this approach because I ran into errors when trying the example we did in class
probAge <- predict(modAge, newdata = data, type = "response")
head(probAge)
## 1 2 3 4 5 6
## 0.8201978 0.7954989 0.7802926 0.8484495 0.7723908 0.8513524
contrasts(as.factor(data$VOTED))
## 2
## 1 0
## 2 1
predAge <- rep("1", nrow(data)) # start everyone at "1" (did not vote); nrow(data) is 79820
head(predAge)
## [1] "1" "1" "1" "1" "1" "1"
predAge[probAge > .5] <- "2" # predict "voted" when the fitted probability exceeds .5
head(predAge)
## [1] "2" "2" "2" "2" "2" "2"
table(predAge, data$VOTED) # confusion matrix for modAge
##
## predAge 1 2
## 2 21287 58533
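Since the model predicts "voted" for every observation, the training error rate is simply the share of non-voters; a quick check (sketch):

mean(predAge != data$VOTED) # 21287 / 79820, about 0.267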
# the method from the email spam example in class
# (note: compare the fitted probabilities probAge, not the character vector predAge)
cmAge <- data.frame(Voted = data$VOTED, predVoted = probAge > .5) %>%
  group_by(Voted, predVoted) %>%
  summarise(n = n())
## `summarise()` regrouping output by 'Voted' (override with `.groups` argument)
cmAge #confusion matrix for modAge
## # A tibble: 2 x 3
## # Groups: Voted [2]
## Voted predVoted n
## <fct> <lgl> <int>
## 1 1 TRUE 21287
## 2 2 TRUE 58533
## both methods agree: the model predicts that everyone votes, so only one predicted row/column appears. This is expected rather than an error: about 73% of the sample voted, and the fitted probability from modAge stays above .5 across the whole observed age range.
# ISLR book method
probSex <- predict(modSex, newdata = data, type = "response")
head(probSex)
## 1 2 3 4 5 6
## 0.7141223 0.7504027 0.7141223 0.7504027 0.7504027 0.7141223
predSex <- rep("1", nrow(data)) # start everyone at "1" (did not vote)
head(predSex)
## [1] "1" "1" "1" "1" "1" "1"
predSex[probSex > .5] <- "2" # predict "voted" when the fitted probability exceeds .5
head(predSex)
## [1] "2" "2" "2" "2" "2" "2"
table(predSex, data$VOTED) # confusion matrix for modSex
##
## predSex 1 2
## 2 21287 58533
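The same error-rate check for modSex (sketch):

mean(predSex != data$VOTED) # also 21287 / 79820, about 0.267 -- identical to modAge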
# class example method (reusing probSex instead of overwriting predSex with probabilities)
cmSex <- data.frame(Voted = data$VOTED, predVoted = probSex > .5) %>%
  group_by(Voted, predVoted) %>%
  summarise(n = n())
## `summarise()` regrouping output by 'Voted' (override with `.groups` argument)
cmSex #confusion matrix for modSex
## # A tibble: 2 x 3
## # Groups: Voted [2]
## Voted predVoted n
## <fct> <lgl> <int>
## 1 1 TRUE 21287
## 2 2 TRUE 58533
## same pattern as with modAge: the fitted probabilities (0.714 for males, 0.750 for females) are always above .5, so the model again predicts that everyone votes and produces the same confusion matrix.
[To do: compute confusion matrices and error rates for modAS and modASint as well.]
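One way to avoid repeating the steps above (a minimal sketch; errRate is a hypothetical helper name) is to wrap them in a function:

# hypothetical helper: training error rate for a fitted logistic model,
# predicting "2" (voted) whenever the fitted probability exceeds .5
errRate <- function(model, truth) {
  probs <- predict(model, type = "response")
  preds <- ifelse(probs > .5, "2", "1")
  mean(preds != truth)
}
errRate(modAS, data$VOTED)
errRate(modASint, data$VOTED)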
Modeling VOTED as a function of SEX tells us that females and males vote at comparable rates, although the bar graph shows females are very slightly more likely to vote. Both predictors are statistically significant, but age tells the more interesting story: as age increases, the proportion voting increases. The glm with both predictors and their interaction (AIC 89866) has a lower AIC than the glm with the predictors alone (AIC 89961), making the interaction model the better fit for these data. [NOTE: This is what we have so far for the conclusion; we will update this section once we have finished the confusion matrices.]
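For reference, the AIC comparison above can be checked in a single call (a sketch, assuming all four fitted models are still in the workspace):

AIC(modAge, modSex, modAS, modASint) # 90065, 92451, 89961, 89866 -- modASint is lowest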