library(tidyverse)
library(ROCR) # receiver operator characteristics
library(rpart) # CART model
library(rpart.plot)
Assignment for MITx The Analytics Edge
In August 2006 three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large-scale field experiment in Michigan, USA to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:
Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that … “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of vote turnout…
In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.
The researchers randomly assigned about 344,000 voters to different groups: about 191,000 voters formed a "control" group, and the rest were assigned to one of four "treatment" groups. These five groups correspond to five binary variables in the dataset.
Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).
data <- read.csv("gerber.csv")
str(data)
## 'data.frame': 344084 obs. of 8 variables:
## $ sex : int 0 1 1 1 0 1 0 0 1 0 ...
## $ yob : int 1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
## $ voting : int 0 0 1 1 1 1 1 0 0 0 ...
## $ hawthorne: int 0 0 1 1 1 0 0 0 0 0 ...
## $ civicduty: int 1 1 0 0 0 0 0 0 0 0 ...
## $ neighbors: int 0 0 0 0 0 0 0 0 0 0 ...
## $ self : int 0 0 0 0 0 0 0 0 0 0 ...
## $ control : int 0 0 0 0 0 1 1 1 1 1 ...
summary(data)
## sex yob voting hawthorne
## Min. :0.0000 Min. :1900 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1947 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :1956 Median :0.0000 Median :0.000
## Mean :0.4993 Mean :1956 Mean :0.3159 Mean :0.111
## 3rd Qu.:1.0000 3rd Qu.:1965 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1986 Max. :1.0000 Max. :1.000
## civicduty neighbors self control
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1111 Mean :0.111 Mean :0.1111 Mean :0.5558
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
Proportion of people in this dataset who voted: 0.3158996
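That proportion is just the mean of the 0/1 voting variable; a quick check:
mean(data$voting)
## [1] 0.3158996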
Is there anyone in the dataset with more than one treatment?
data %>%
mutate(two_treatments = (hawthorne + civicduty + neighbors + self) > 1) %>%
select(c("two_treatments")) %>%
summary()
## two_treatments
## Mode :logical
## FALSE:344084
No, there isn’t anyone with more than one treatment.
The first table shows the number of people who voted (1) and did not vote (0) in each treatment group.
treatment_groups <- data %>%
  mutate(treatment = case_when(hawthorne == 1 ~ "hawthorne",
                               civicduty == 1 ~ "civicduty",
                               neighbors == 1 ~ "neighbors",
                               self == 1 ~ "self",
                               TRUE ~ "Control")) %>%
  select(c("voting", "treatment"))
table(treatment_groups$treatment, treatment_groups$voting)
##
## 0 1
## civicduty 26197 12021
## Control 134513 56730
## hawthorne 25888 12316
## neighbors 23763 14438
## self 25027 13191
treatment_groups %>%
group_by(treatment) %>%
summarise(avg = mean(voting))
## # A tibble: 5 x 2
## treatment avg
## <chr> <dbl>
## 1 civicduty 0.315
## 2 Control 0.297
## 3 hawthorne 0.322
## 4 neighbors 0.378
## 5 self 0.345
logmodel <- glm(voting ~ civicduty + hawthorne + neighbors + self,
data = data, family = "binomial")
# note that this model uses the entire dataset
summary(logmodel)
##
## Call:
## glm(formula = voting ~ civicduty + hawthorne + neighbors + self,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4
cat("Accuracy of logistic regression model of predicting\nthe voting intention (with threshold 0.3):\n",
sum((predict(logmodel, type = "response")>.3)==data$voting)/nrow(data))
## Accuracy of logistic regression model of predicting
## the voting intention (with threshold 0.3):
## 0.5419578
cat("\nAccuracy of logistic regression model of predicting\nthe voting intention (with threshold 0.5):\n",
sum((predict(logmodel, type = "response")>.5)==data$voting)/nrow(data))
##
## Accuracy of logistic regression model of predicting
## the voting intention (with threshold 0.5):
## 0.6841004
Accuracy of the 'baseline' model that always predicts 'did not vote' (i.e. the proportion of people who did not vote): 0.6841004. With a threshold of 0.5 the logistic regression model only matches this baseline, and with a threshold of 0.3 it does considerably worse, so the model has little predictive value at either threshold.
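For reference, the baseline accuracy is simply the proportion of non-voters:
1 - mean(data$voting)
## [1] 0.6841004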
predicted_vote <- predict(logmodel, type = "response")
pred_ROCR <- prediction(predicted_vote, data$voting)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))
paste('Area under Curve :', signif(auc_ROCR@y.values[[1]]))
## [1] "Area under Curve : 0.530846"
The area under the curve (AUC) is barely above 0.5, so this is a poor predictive model at any threshold.
We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting.
If we used method=‘class’, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.
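For comparison, the classification-tree version would be fit with method = "class"; this is only a sketch (the name CARTclass is illustrative and this model is not used below):
# Classification tree on the same predictors. Since no group's voting
# probability exceeds 50%, it would not be expected to split either.
CARTclass <- rpart(as.factor(voting) ~ civicduty + hawthorne + self + neighbors,
                   data = data, method = "class")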
CARTmodel <- rpart(voting ~ civicduty + hawthorne + self + neighbors,
data = data)
prp(CARTmodel)
There is just one leaf! There are no splits in the tree, because none of the variables has a large enough effect to be split on at the default complexity parameter.
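A quick way to confirm that the fitted tree contains only the root node (a minimal check against CARTmodel above):
nrow(CARTmodel$frame)  # 1 row: the root node only, so no splits were made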
Now set cp = 0.0 to force the complete tree to be built.
CARTmodel2 <- rpart(voting ~ civicduty + hawthorne + self + neighbors,
data = data, cp=0.0)
prp(CARTmodel2)
From this tree it can be seen that about 0.31 of the "Civic Duty" group voted. (Look at the final split on civicduty = 0, abbreviated to civicdut in the plot; the 'yes' branch goes to the left.)
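The same leaf value can be read off with predict() rather than from the plot (a quick check; the one-row newdata frame is illustrative):
predict(CARTmodel2,
        newdata = data.frame(civicduty = 1, hawthorne = 0, self = 0, neighbors = 0))
# ~0.31: the predicted voting fraction for the Civic Duty group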
The same model, with sex added as another predictor.
CARTmodel2sex <- rpart(voting ~ civicduty + hawthorne + self + neighbors + sex,
data = data, cp=0.0)
prp(CARTmodel2sex)
CARTmodelControl <- rpart(voting ~ control,
data = data, cp = 0.0)
prp(CARTmodelControl, digits = 6) # six digits of precision in display
CARTmodelControlsex <- rpart(voting ~ control + sex,
data = data, cp = 0.0)
prp(CARTmodelControlsex, digits = 6)
The difference in voting as the result of not being in the control group is about the same for men and women (less than 0.001 absolute difference).
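Those leaf values can also be pulled out with predict() instead of reading them from the plots (a sketch; the combos data frame is illustrative):
# All four sex/control combinations and their predicted voting fractions
combos <- expand.grid(control = c(0, 1), sex = c(0, 1))
cbind(combos, pred = predict(CARTmodelControlsex, newdata = combos))
Within each sex, the difference between the control = 0 and control = 1 predictions can then be compared directly, which is the quantity discussed above.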
logitcontrolsex <- glm(voting ~ control + sex,
data = data, family = "binomial")
summary(logitcontrolsex)
##
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4
The coefficient for the “sex” variable is -0.055791. This means that women are less likely to vote, since women have a larger value in the sex variable, and a negative coefficient means that larger values are predictive of 0.
Our logistic regression on the "sex" and "control" variables considers these variables separately (additively), not jointly, and therefore did not do as well as the CART model, which effectively gives each sex/control combination its own predicted probability.
We can quantify this precisely.
Possibilities <- data.frame(sex = c(0, 0, 1, 1), control = c(0, 1, 0, 1))
predict(logitcontrolsex, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ).
For the (Woman, Control) case there is not a large difference between the logistic regression prediction (0.2908065) and the CART prediction (0.290456), but it is there.
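The size of that gap can be computed directly from the two predictions quoted above:
abs(0.2908065 - 0.290456)
## [1] 0.0003505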
We're going to add a new term to our logistic regression now: the interaction of the "sex" and "control" variables. If this new variable is 1, it means the person is a woman AND in the control group.
logmodel2 <- glm(voting ~ sex + control + sex:control, data = data, family = "binomial")
summary(logmodel2)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 1.55e-06 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4
If a person is a woman and in the control group, the chance that she voted goes down.
(The sex:control coefficient is negative, so a value of 1 in this variable decreases the chance of voting. The variable takes the value 1 when the person is a woman and in the control group.)
predict(logmodel2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558
The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.
Trees can capture nonlinear relationships that logistic regression cannot, but we can sometimes get around this by adding interaction variables that combine two variables.
We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.
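As a sketch only (this model is not fit here), the fully interacted formula would look like the following; the * expands to the main effects plus every treatment-by-sex interaction:
# Main effects plus every treatment:sex interaction; this doubles the number of
# treatment-related coefficients and, on smaller datasets, invites overfitting.
glm(voting ~ (civicduty + hawthorne + neighbors + self) * sex,
    data = data, family = "binomial")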