§ 1.1 What proportion of people in this dataset voted in this election?
mean(gerber$voting)
[1] 0.3159
§ 1.2 Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?
tapply(gerber$voting, gerber$civicduty, mean)
0 1
0.3161 0.3145
tapply(gerber$voting, gerber$hawthorne, mean)
0 1
0.3151 0.3224
tapply(gerber$voting, gerber$self, mean)
0 1
0.3122 0.3452
tapply(gerber$voting, gerber$neighbors, mean)
0 1
0.3082 0.3779
print("neighbors")
[1] "neighbors"
§ 1.3 Which of the following coefficients are significant in the logistic regression model? Select all that apply.
names(gerber)
[1] "sex" "yob" "voting" "hawthorne" "civicduty" "neighbors" "self"
[8] "control"
model1 = glm(voting ~ ., data = gerber[3:8], family = binomial)
summary(model1)
Call:
glm(formula = voting ~ ., family = binomial, data = gerber[3:8])
Deviance Residuals:
Min 1Q Median 3Q Max
-0.974 -0.869 -0.839 1.459 1.559
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.86336 0.00501 -172.46 < 2e-16 ***
hawthorne 0.12048 0.01204 10.01 < 2e-16 ***
civicduty 0.08437 0.01210 6.97 0.0000000000031 ***
neighbors 0.36509 0.01168 31.26 < 2e-16 ***
self 0.22294 0.01187 18.79 < 2e-16 ***
control NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428090 on 344079 degrees of freedom
AIC: 428100
Number of Fisher Scoring iterations: 4
print("hawthorne, civicduty, neighbors, self")
[1] "hawthorne, civicduty, neighbors, self"
§ 1.4 Using a threshold of 0.3, what is the accuracy of the logistic regression model?
pred0 = predict(model1, type = "response")  # fitted probabilities; note predict() takes 'newdata', not 'data'
table(gerber$voting, pred0 > 0.3)
FALSE TRUE
0 134513 100875
1 56730 51966
(134513+51966)/nrow(gerber)
[1] 0.542
§ 1.5 Using a threshold of 0.5, what is the accuracy of the logistic regression model?
pred0 = predict(model1, type = "response")
table(gerber$voting, pred0 > 0.5)
FALSE
0 235388
1 108696
235388/nrow(gerber)
[1] 0.6841
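With a 0.5 threshold nobody is predicted to vote, which is why the table above has only a FALSE column. A small hypothetical helper makes the threshold comparison reusable:
accuracy_at = function(pred, actual, t) mean((pred > t) == (actual == 1))
accuracy_at(pred0, gerber$voting, 0.3)  # 0.542, as above
accuracy_at(pred0, gerber$voting, 0.5)  # 0.6841, as above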
§ 1.6 Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?
library(ROCR)
rocpred = prediction(pred0, gerber$voting)
auc0 = as.numeric(performance(rocpred, "auc")@y.values)
auc0  # about 0.53, barely better than random guessing
print("Even though all of the variables are significant, this is a weak predictive model.")
[1] "Even though all of the variables are significant, this is a weak predictive model."
§ 2.1 Plot the tree. What happens, and if relevant, why?
library(rpart)
library(rpart.plot)
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
rpart.plot(CARTmodel)
print("No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on.")
[1] "No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on."
§ 2.2 Rebuild the tree, this time setting cp = 0.0 to force the complete tree to be built. Then plot the tree. What do you observe about the order of the splits?
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
rpart.plot(CARTmodel2)
print("Neighbor is the first split, civic duty is the last.")
[1] "Neighbor is the first split, civic duty is the last."
§ 2.3 Using only the CART tree plot, determine what fraction (a number between 0 and 1) of “Civic Duty” people voted:
print("0.31")
[1] "0.31"
§ 2.4 Make a new tree that includes the “sex” variable, again with cp = 0.0. Notice that sex appears as a split that is of secondary importance to the treatment group.
In the control group, which gender is more likely to vote?
CARTmodel3 = rpart(voting ~ sex + civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
rpart.plot(CARTmodel3)
print("male")
[1] "male"
In the “Civic Duty” group, which gender is more likely to vote?
print("male")
[1] "male"
§ 3.1 In the “control” only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group?
CARTmodel4 = rpart(voting ~ control, data=gerber, cp=0.0)
CARTmodel5 = rpart(voting ~ control + sex, data=gerber, cp=0.0)
rpart.plot(CARTmodel4, digits=6)
abs(0.296638 - 0.34)  # leaf probabilities read from the plot: control = 0.296638, any treatment = 0.34
[1] 0.04336
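Rather than reading six digits off the plot, the two leaf means can be pulled from the fitted tree (a sketch):
sort(unique(predict(CARTmodel4)))  # control leaf vs. treatment leaf (0.296638 and 0.34)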
§ 3.2 Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups)
rpart.plot(CARTmodel5, digits=6)
print("They are affected about the same (change in probability within 0.001 of each other).")
[1] "They are affected about the same (change in probability within 0.001 of each other)."
§ 3.3 Going back to logistic regression now, create a model using “sex” and “control”. Interpret the coefficient for “sex”
model2 = glm(voting ~ sex + control, data = gerber, family = binomial)
summary(model2)
Call:
glm(formula = voting ~ sex + control, family = binomial, data = gerber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.922 -0.901 -0.829 1.456 1.572
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.63554 0.00651 -97.6 < 2e-16 ***
sex -0.05579 0.00734 -7.6 0.00000000000003 ***
control -0.20014 0.00736 -27.2 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428443 on 344081 degrees of freedom
AIC: 428449
Number of Fisher Scoring iterations: 4
print("Coefficient is negative, reflecting that women are less likely to vote")
[1] "Coefficient is negative, reflecting that women are less likely to vote"
§ 3.4 What is the absolute difference between the tree and the logistic regression for the (Woman, Control) case? Give an answer with five numbers after the decimal point.
Possibilities = data.frame(sex = c(0,0,1,1), control = c(0,1,0,1))
predict(model2, newdata = Possibilities, type = "response")
1 2 3 4
0.3463 0.3024 0.3337 0.2908
abs(0.290456 - 0.2908)
[1] 0.000344
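To see the five decimal places the question asks for, round the predictions explicitly (digits is not an argument of predict(), which is why it was dropped above):
round(predict(model2, newdata = Possibilities, type = "response"), 5)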
§ 3.5 How do you interpret the coefficient for the new variable in isolation? That is, how does it relate to the dependent variable?
LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(LogModel2)
Call:
glm(formula = voting ~ sex + control + sex:control, family = "binomial",
data = gerber)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.921 -0.902 -0.828 1.457 1.573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.63747 0.00760 -83.84 < 2e-16 ***
sex -0.05189 0.01080 -4.80 0.0000016 ***
control -0.19655 0.01036 -18.98 < 2e-16 ***
sex:control -0.00726 0.01473 -0.49 0.62
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 429238 on 344083 degrees of freedom
Residual deviance: 428442 on 344080 degrees of freedom
AIC: 428450
Number of Fisher Scoring iterations: 4
print("If a person is a woman and in the control group, the chance that she voted goes down.")
[1] "If a person is a woman and in the control group, the chance that she voted goes down."
§ 3.6 what is the difference between the logistic regression model and the CART model for the (Woman, Control) case?
options(digits = 7)
predict(LogModel2, newdata=Possibilities, type="response")
1 2 3 4
0.3458183 0.3027947 0.3341757 0.2904558
abs(0.290456 - 0.2904558)
[1] 0.0000002
§ 3.7 Should we always include all possible interaction terms of the independent variables when building a logistic regression model?
print("No, We should not use all possible interaction terms in a logistic regression model due to overfitting. ")
[1] "No, We should not use all possible interaction terms in a logistic regression model due to overfitting. "