In August 2006, three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large-scale field experiment in Michigan, USA, to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:
Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that ... "the casting of a single vote is of no significance where there is a multitude of electors"? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters' sense of civic duty has long been a leading explanation of vote turnout...
In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.

The data
The researchers randomly assigned about 344,000 voters to groups: about 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.
"Civic Duty" (variable **civicduty**) group members were sent a letter that simply said "DO YOUR CIVIC DUTY - VOTE!"
"Hawthorne Effect" (variable **hawthorne**) group members were sent a letter that had the "Civic Duty" message plus the additional message "YOU ARE BEING STUDIED" and they were informed that their voting behavior would be examined by means of public records.
"Self" (variable **self**) group members received the "Civic Duty" message as well as the recent voting record of everyone in that household and a message stating that another message would be sent after the election with updated records.
"Neighbors" (variable **neighbors**) group members were given the same message as that for the "Self" group, except the message not only had the household voting records but also that of neighbors - maximizing social pressure.
"Control" (variable **control**) group members were not sent anything, and represented the typical voting situation.
Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).
We will first get familiar with the data. Load the CSV file gerber.csv into R.
# Load dataset (the knitr package provides kable for the tables below)
library(knitr)
gerber = read.csv("gerber.csv")

# Proportion of people who voted
z = table(gerber$voting)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 235388 |
| 1 | 108696 |
# Compute proportion
z[2]/sum(z)
## 1
## 0.3158996

Proportion = 0.3159, so about 31.6% of the people in the dataset voted.
# Tabulate voting across all treatment groups
z = table(gerber$voting, gerber$hawthorne)
kable(z)

| voting | hawthorne = 0 | hawthorne = 1 |
|---|---|---|
| 0 | 209500 | 25888 |
| 1 | 96380 | 12316 |
sum(diag(z))/sum(z)
## [1] 0.6446565
z = table(gerber$voting, gerber$civicduty)
kable(z)

| voting | civicduty = 0 | civicduty = 1 |
|---|---|---|
| 0 | 209191 | 26197 |
| 1 | 96675 | 12021 |
sum(diag(z))/sum(z)
## [1] 0.6429012
z = table(gerber$voting, gerber$neighbors)
kable(z)

| voting | neighbors = 0 | neighbors = 1 |
|---|---|---|
| 0 | 211625 | 23763 |
| 1 | 94258 | 14438 |
sum(diag(z))/sum(z)
## [1] 0.6569995
z = table(gerber$voting, gerber$self)
kable(z)

| voting | self = 0 | self = 1 |
|---|---|---|
| 0 | 210361 | 25027 |
| 1 | 95505 | 13191 |
sum(diag(z))/sum(z)
## [1] 0.6497018

The "Neighbors" group had the highest proportion of people voting.
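A more direct comparison is the voting rate within each treatment group. The sketch below (not part of the original output) computes it with tapply; the approximate values in the comments follow from the counts in the tables above, e.g. 14438 / (14438 + 23763) ≈ 0.378 for the "Neighbors" group.

# Proportion voting within each treatment group (the element named "1" is the group itself)
tapply(gerber$voting, gerber$civicduty, mean)   # civicduty = 1: ~0.314
tapply(gerber$voting, gerber$hawthorne, mean)   # hawthorne = 1: ~0.322
tapply(gerber$voting, gerber$self, mean)        # self = 1:      ~0.345
tapply(gerber$voting, gerber$neighbors, mean)   # neighbors = 1: ~0.378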
# Logistic Regression
LogModel = glm(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, family="binomial")

# Output summary
summary(LogModel)
##
## Call:
## glm(formula = voting ~ civicduty + hawthorne + self + neighbors,
## family = "binomial", data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4

All four treatment variables (Civic Duty, Hawthorne Effect, Self, and Neighbors) are highly significant.
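As an aside (not required by the original problem), exponentiating the coefficients turns them into odds ratios; for example, exp(0.365092) ≈ 1.44, so being in the Neighbors group multiplies the odds of voting by roughly 1.44 relative to the control group.

# Odds ratios for the treatment effects (sketch; the values follow from the summary above)
exp(coef(LogModel))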
# Make predictions
predictLog = predict(LogModel, type="response")
z = table(gerber$voting, predictLog > 0.3)
kable(z)

| voting | FALSE | TRUE |
|---|---|---|
| 0 | 134513 | 100875 |
| 1 | 56730 | 51966 |
# Compute accuracy
sum(diag(z))/sum(z)
## [1] 0.5419578

Accuracy = 0.542 with a threshold of 0.3.
# Make predictions
predictLog = predict(LogModel, type="response")
z = table(gerber$voting, predictLog > 0.5)
kable(z)

| voting | FALSE |
|---|---|
| 0 | 235388 |
| 1 | 108696 |
# Compute accuracy
sum(diag(z))/sum(z)
## [1] 0.6841004

Accuracy = 0.684 with a threshold of 0.5. Note that no prediction exceeds 0.5, so at this threshold the model simply predicts that nobody votes.
# Compute AUC
library(ROCR)
ROCRpred = prediction(predictLog, gerber$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.5308461

Even though all of our variables are significant, our model does not improve over the baseline model of just predicting that someone will not vote, and the AUC is low. So while the treatment groups do make a difference, this is a weak predictive model.
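For reference, the baseline accuracy of always predicting "did not vote" can be checked directly; as the sketch below shows, it matches the 0.5-threshold accuracy above.

# Baseline: always predict that a person does not vote
table(gerber$voting)[1] / nrow(gerber)   # 235388 / 344084 ~ 0.684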
We will now try out trees. Build a CART tree for voting using all data and the same four treatment variables we used before. Don’t set the option method=“class” - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting. If we used method=‘class’, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.
# CART Model (requires the rpart and rpart.plot packages)
library(rpart)
library(rpart.plot)
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)

# Plot Tree
prp(CARTmodel)

There are no splits in the tree, because none of the variables has a large enough effect (relative to rpart's default complexity parameter) to be split on.
# CART Model
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)

# Plot Tree
prp(CARTmodel2)

We saw in Problem 1 that the highest fraction of voters was in the Neighbors group, followed by the Self group, followed by the Hawthorne group, and lastly the Civic Duty group. And we see here that the tree detects this trend.
You can find the voting fraction for the Civic Duty group by reading the tree: the people in that group correspond to the bottom-right split, whose leaf has the value 0.31.
# CART Model3
CARTmodel3 = rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)
# Plot tree
prp(CARTmodel3)

For the control group, which corresponds to the bottom left, sex = 0 (male) corresponds to a higher voting percentage.
For the civic duty group, which corresponds to the bottom right, sex = 0 (male) corresponds to a higher voting percentage.
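As a rough check (a sketch, not part of the original output), these sex-specific voting rates can be computed directly by subsetting:

# Voting rate by sex within the Control group (roughly 0.303 for men, 0.290 for women)
tapply(gerber$voting[gerber$control == 1], gerber$sex[gerber$control == 1], mean)
# Voting rate by sex within the Civic Duty group (men slightly higher here as well)
tapply(gerber$voting[gerber$civicduty == 1], gerber$sex[gerber$civicduty == 1], mean)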
We know trees can handle “nonlinear” relationships, e.g. “in the ‘Civic Duty’ group and female”, but as we will see in the next few questions, it is possible to do the same for logistic regression. First, let’s explore what trees can tell us some more.
Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.
# CART Control
CARTcontrol = rpart(voting ~ control, data=gerber, cp=0.0)
CARTsex = rpart(voting ~ control + sex, data=gerber, cp=0.0)
# Plot tree
prp(CARTcontrol, digits=6)

The split says that if control = 1, predict 0.296638, and if control = 0, predict 0.34 (i.e., 0.340000). The absolute difference between these is 0.043362.
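These leaf values are simply group means, so they can be checked directly. A small sketch (not part of the original output):

# Mean voting rate by control-group membership
tapply(gerber$voting, gerber$control, mean)   # control = 0: ~0.340, control = 1: ~0.2966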
# Plot Tree
prp(CARTsex, digits=6)

The first split says that if control = 1, go left. Then, if sex = 1 (female), predict 0.290456, and if sex = 0 (male), predict 0.302795. On the other side of the tree, where control = 0, if sex = 1 (female), predict 0.334176, and if sex = 0 (male), predict 0.345818. So for women, not being in the control group increases the fraction voting by 0.04372. For men, not being in the control group increases the fraction voting by 0.04302. So men and women are affected about the same.
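Again, these four leaf values are just cell means, which can be reproduced with a two-way tapply (a sketch, not part of the original output):

# Mean voting rate for every (control, sex) combination
tapply(gerber$voting, list(control = gerber$control, sex = gerber$sex), mean)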
# Logistic Regression
LogModelSex = glm(voting ~ control + sex, data=gerber, family="binomial")
# Output summary
summary(LogModelSex)
##
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4

If you look at the summary of the model, you can see that the coefficient for the "sex" variable is -0.055791. This means that women are less likely to vote, since women are coded as 1 in the sex variable, and a negative coefficient means that larger values are associated with a lower probability of voting.
The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). However, logistic regression on the "sex" and "control" variables considers these variables additively rather than jointly (there is no interaction term), and therefore could not reproduce all four percentages exactly.
We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where “LogModelSex” is the name of your logistic regression model that uses both control and sex):
# Predict
Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogModelSex, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
prp(CARTsex)

The CART tree predicts 0.290456 for the (Woman, Control) case, and the logistic regression model predicts 0.2908065. So the absolute difference, to five decimal places, is 0.00035.
So the difference is not too big for this dataset, but it is there. We’re going to add a new term to our logistic regression now, that is the combination of the “sex” and “control” variables - so if this new variable is 1, that means the person is a woman AND in the control group. We can do that with the following command:
# Logistic Regression
LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
# Output summary
summary(LogModel2)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 0.00000155 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4

This coefficient is negative, which means that a value of 1 in this variable decreases the chance of voting. The variable takes the value 1 if the person is a woman and in the control group. Note, however, that the interaction term is not statistically significant (p = 0.622).
# Make predictions
predict(LogModel2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558

The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.
We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.
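For illustration only (a sketch, not part of the original analysis), R's formula interface would generate an interaction of sex with every treatment variable like this, which shows how quickly the number of terms grows:

# Illustration only: interacting sex with each treatment variable adds four extra terms
LogModelAll = glm(voting ~ (civicduty + hawthorne + self + neighbors) * sex, data=gerber, family="binomial")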