We will first get familiar with the data. Load the CSV file gerber.csv into R. What proportion of people in this dataset voted in this election?
108696/(235388+108696)
[1] 0.3158996
Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?
tapply(gerber$voting, gerber$civicduty, mean)
0 1
0.3160698 0.3145377
tapply(gerber$voting, gerber$hawthorne, mean)
0 1
0.3150909 0.3223746
tapply(gerber$voting, gerber$self, mean)
0 1
0.3122446 0.3451515
tapply(gerber$voting, gerber$neighbors, mean)
0 1
0.3081505 0.3779482
print("Neighbors")
[1] "Neighbors"
Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.
LogModel = glm(voting ~ civicduty + hawthorne + self + neighbors, data = gerber, family = "binomial")
summary(LogModel)
Call:
glm(formula = voting ~ civicduty + hawthorne + self + neighbors,
family = "binomial", data = gerber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9744 -0.8691 -0.8389 1.4586 1.5590
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
civicduty 0.084368 0.012100 6.972 3.12e-12 ***
hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
self 0.222937 0.011867 18.786 < 2e-16 ***
neighbors 0.365092 0.011679 31.260 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428090 on 344079 degrees of freedom
AIC: 428100
Number of Fisher Scoring iterations: 4
print("civicduty, hawthorne, self, neighbors")
[1] "civicduty, hawthorne, self, neighbors"
Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)
predictLog = predict(LogModel, type="response")
table(gerber$voting, predictLog > 0.3)
FALSE TRUE
0 134513 100875
1 56730 51966
(134513+51966)/(134513+100875+56730+51966)
[1] 0.5419578
Using a threshold of 0.5, what is the accuracy of the logistic regression model?
predictLog = predict(LogModel, type="response")
table(gerber$voting, predictLog > 0.5)
FALSE
0 235388
1 108696
(235388)/(235388+108696)
[1] 0.6841004
Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?
library(ROCR)
Loading required package: gplots
Attaching package: 愼㸱愼㸵gplots愼㸱愼㸶
The following object is masked from 愼㸱愼㸵package:stats愼㸱愼㸶:
lowess
ROCRpred = prediction(predictLog, gerber$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.5308461
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
install.packages("rpart.plot")
Error in install.packages : Updating loaded packages
library(rpart.plot)
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
prp(CARTmodel)
print("No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on.")
[1] "No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on."
Now build the tree using the command:
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel2)
Using only the CART tree plot, determine what fraction (a number between 0 and 1) of “Civic Duty” people voted:
print(0.31)
Make a new tree that includes the “sex” variable, again with cp = 0.0. Notice that sex appears as a split that is of secondary importance to the treatment group.
In the control group, which gender is more likely to vote?
In the “Civic Duty” group, which gender is more likely to vote?
CARTmodel3 = rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)
prp(CARTmodel3)
print("men")
[1] "men"
print("men")
[1] "men"
CARTcontrol = rpart(voting ~ control, data=gerber, cp=0.0)
CARTsex = rpart(voting ~ control + sex, data=gerber, cp=0.0)
prp(CARTcontrol, digits=6)
print(0.34-0.296638)
[1] 0.043362
Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups):
Going back to logistic regression now, create a model using “sex” and “control”. Interpret the coefficient for “sex”:
LogModelSex = glm(voting ~ control + sex, data=gerber, family="binomial")
summary(LogModel)
Call:
glm(formula = voting ~ civicduty + hawthorne + self + neighbors,
family = "binomial", data = gerber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9744 -0.8691 -0.8389 1.4586 1.5590
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
civicduty 0.084368 0.012100 6.972 3.12e-12 ***
hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
self 0.222937 0.011867 18.786 < 2e-16 ***
neighbors 0.365092 0.011679 31.260 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428090 on 344079 degrees of freedom
AIC: 428100
Number of Fisher Scoring iterations: 4
Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogModelSex, newdata=Possibilities, type="response")
1 2 3 4
0.3462559 0.3024455 0.3337375 0.2908065
0.2908065-0.290456
[1] 0.0003505
print(0.00035)
[1] 0.00035
LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(LogModel2)
Call:
glm(formula = voting ~ sex + control + sex:control, family = "binomial",
data = gerber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9213 -0.9019 -0.8284 1.4573 1.5724
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
sex -0.051888 0.010801 -4.804 1.55e-06 ***
control -0.196553 0.010356 -18.980 < 2e-16 ***
sex:control -0.007259 0.014729 -0.493 0.622
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428442 on 344080 degrees of freedom
AIC: 428450
Number of Fisher Scoring iterations: 4
print("If a person is a woman and in the control group, the chance that she voted goes down.")
[1] "If a person is a woman and in the control group, the chance that she voted goes down."
predict(LogModel2, newdata=Possibilities, type="response")
1 2 3 4
0.3458183 0.3027947 0.3341757 0.2904558
0.2904558-0.290456
[1] -2e-07
print("no")
[1] "no"