library(tidyverse)
library(ROCR) # receiver operating characteristic (ROC) curves
library(rpart) # CART model
library(rpart.plot)

Assignment for MITx: The Analytics Edge

Understanding why people vote

In August 2006, three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large-scale field experiment in Michigan, USA, to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:

Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that … “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of vote turnout…

In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.

The data

The researchers randomly assigned about 344,000 voters to groups: about 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.

  1. “Civic Duty” (variable civicduty) group members were sent a letter that simply said “DO YOUR CIVIC DUTY - VOTE!”
  2. “Hawthorne Effect” (variable hawthorne) group members were sent a letter that had the “Civic Duty” message plus the additional message “YOU ARE BEING STUDIED” and they were informed that their voting behavior would be examined by means of public records.
  3. “Self” (variable self) group members received the “Civic Duty” message as well as the recent voting record of everyone in that household and a message stating that another message would be sent after the election with updated records.
  4. “Neighbors” (variable neighbors) group members were given the same message as that for the “Self” group, except the message not only had the household voting records but also that of neighbors - maximizing social pressure.
  5. “Control” (variable control) group members were not sent anything, and represented the typical voting situation.

Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).

data <- read.csv("gerber.csv")
str(data)
## 'data.frame':    344084 obs. of  8 variables:
##  $ sex      : int  0 1 1 1 0 1 0 0 1 0 ...
##  $ yob      : int  1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
##  $ voting   : int  0 0 1 1 1 1 1 0 0 0 ...
##  $ hawthorne: int  0 0 1 1 1 0 0 0 0 0 ...
##  $ civicduty: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ neighbors: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ self     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ control  : int  0 0 0 0 0 1 1 1 1 1 ...
summary(data)
##       sex              yob           voting         hawthorne    
##  Min.   :0.0000   Min.   :1900   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1947   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :1956   Median :0.0000   Median :0.000  
##  Mean   :0.4993   Mean   :1956   Mean   :0.3159   Mean   :0.111  
##  3rd Qu.:1.0000   3rd Qu.:1965   3rd Qu.:1.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1986   Max.   :1.0000   Max.   :1.000  
##    civicduty        neighbors          self           control      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.1111   Mean   :0.111   Mean   :0.1111   Mean   :0.5558  
##  3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000

Proportion of people in this dataset who voted: 0.3158996
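
This is just the mean of the binary voting variable:

mean(data$voting)
## [1] 0.3158996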

Voting and the different treatment groups

Is there anyone in the dataset with more than one treatment?

data %>%
  mutate(two_treatments = (hawthorne + civicduty + neighbors + self) > 1) %>%
  select(two_treatments) %>%
  summary()
##  two_treatments 
##  Mode :logical  
##  FALSE:344084

No, there isn’t anyone with more than one treatment.
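
As a further check, each person should belong to exactly one of the five groups, so the five indicator columns should sum to 1 in every row (a quick sketch; if the groups partition the data, this table shows a single value of 1 for all 344,084 rows):

table(rowSums(data[, c("hawthorne", "civicduty", "neighbors", "self", "control")]))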

The first table shows the number who voted (1) and didn’t vote (0) in each treatment group; the second shows the proportion who voted in each group.

treatment_groups <- data %>%
  mutate(treatment = case_when(hawthorne == 1 ~ "hawthorne",
                               civicduty == 1 ~ "civicduty",
                               neighbors == 1 ~ "neighbors",
                               self == 1 ~ "self",
                               TRUE ~ "Control")) %>%
  select(voting, treatment)

table(treatment_groups$treatment, treatment_groups$voting)
##            
##                  0      1
##   civicduty  26197  12021
##   Control   134513  56730
##   hawthorne  25888  12316
##   neighbors  23763  14438
##   self       25027  13191
treatment_groups %>%
  group_by(treatment) %>%
  summarise(avg = mean(voting))
## # A tibble: 5 x 2
##   treatment   avg
##   <chr>     <dbl>
## 1 civicduty 0.315
## 2 Control   0.297
## 3 hawthorne 0.322
## 4 neighbors 0.378
## 5 self      0.345
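
The “Neighbors” group has the highest voting fraction. To see each treatment’s lift over the control group, subtract the control proportion (a small sketch built on the table above):

treatment_groups %>%
  group_by(treatment) %>%
  summarise(avg = mean(voting)) %>%
  mutate(lift_over_control = avg - avg[treatment == "Control"])
# "neighbors" shows the largest lift, about 0.081 over control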

Logistic regression model

logmodel <- glm(voting ~ civicduty + hawthorne + neighbors + self,
                data = data, family = "binomial")
# note that this model uses the entire dataset
summary(logmodel)
## 
## Call:
## glm(formula = voting ~ civicduty + hawthorne + neighbors + self, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9744  -0.8691  -0.8389   1.4586   1.5590  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -0.863358   0.005006 -172.459  < 2e-16 ***
## civicduty    0.084368   0.012100    6.972 3.12e-12 ***
## hawthorne    0.120477   0.012037   10.009  < 2e-16 ***
## neighbors    0.365092   0.011679   31.260  < 2e-16 ***
## self         0.222937   0.011867   18.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428090  on 344079  degrees of freedom
## AIC: 428100
## 
## Number of Fisher Scoring iterations: 4
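
All four treatment coefficients are positive and statistically significant, with “Neighbors” the largest. Exponentiating converts log-odds to odds ratios; for example, exp(0.365092) ≈ 1.44, so the “Neighbors” letter raises the odds of voting by about 44%:

exp(coef(logmodel)[-1])  # odds ratios for the four treatment letters
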
cat("Accuracy of logistic regression model of predicting\nthe voting intention (with threshold 0.3):\n",
      sum((predict(logmodel, type = "response")>.3)==data$voting)/nrow(data))
## Accuracy of logistic regression model of predicting
## the voting intention (with threshold 0.3):
##  0.5419578
cat("\nAccuracy of logistic regression model of predicting\nthe voting intention (with threshold 0.5):\n",
      sum((predict(logmodel, type = "response")>.5)==data$voting)/nrow(data))
## 
## Accuracy of logistic regression model of predicting
## the voting intention (with threshold 0.5):
##  0.6841004

The accuracy of the ‘baseline’ model, which predicts that nobody votes, equals the proportion of people who did not vote: 0.6841004. With threshold 0.5 the logistic regression model exactly matches the baseline accuracy, and with threshold 0.3 it does worse, so the model adds little predictive value at either threshold.
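
The baseline accuracy is just the proportion of non-voters:

1 - mean(data$voting)
## [1] 0.6841004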

predicted_vote <- predict(logmodel, type = "response")

pred_ROCR <- prediction(predicted_vote, data$voting)
auc_ROCR <- performance(pred_ROCR, measure = 'auc')
plot(performance(pred_ROCR, measure = 'tpr', x.measure = 'fpr'), colorize = TRUE,
     print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

paste('Area under Curve :', signif(auc_ROCR@y.values[[1]]))
## [1] "Area under Curve : 0.530846"

The area under the curve (AUC) is barely more than 0.5, so this is a poor predictive model at any threshold.

CART model

We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting.

If we used method = "class", CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting below 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have a probability of voting less than 50%.
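
For contrast, the classification version would be fit as below (a sketch, not used further); since every group’s voting probability is below 50%, every leaf would predict “did not vote”:

# classification tree: predicts the majority class in each leaf
CARTclass <- rpart(voting ~ civicduty + hawthorne + self + neighbors,
                   data = data, method = "class")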

CARTmodel <- rpart(voting ~ civicduty + hawthorne + self + neighbors,
                  data = data)
prp(CARTmodel)

There is just one leaf! There are no splits in the tree, because none of the variables has a large enough effect on voting for rpart to split on it at the default complexity parameter.

Now set cp = 0.0 to force the complete tree to be built.

CARTmodel2 <- rpart(voting ~ civicduty + hawthorne + self + neighbors,
                   data = data, cp=0.0)
prp(CARTmodel2)

From this tree we can see that 0.31 (i.e. 31%) of the “Civic Duty” group voted. (Follow the ‘yes’ branch to the left of the civicdut = 0 split; prp abbreviates the variable name.)
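
We can confirm this directly from the data:

mean(data$voting[data$civicduty == 1])  # ≈ 0.3145, displayed as 0.31 in the tree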

The same model, adding sex as another predictor.

CARTmodel2sex <- rpart(voting ~ civicduty + hawthorne + self + neighbors + sex,
                      data = data, cp=0.0)
prp(CARTmodel2sex)

Interaction terms

CARTmodelControl <- rpart(voting ~ control,
                          data = data, cp = 0.0)
prp(CARTmodelControl, digits = 6) # six digits of precision in display

CARTmodelControlsex <- rpart(voting ~ control + sex,
                             data = data, cp = 0.0)
prp(CARTmodelControlsex, digits = 6)

The effect on voting of not being in the control group is about the same for men and women (the difference is less than 0.001 in absolute terms).
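
Rather than reading the leaves off the plot, we can also predict on the four sex/control combinations (a sketch; a regression tree returns the mean voting rate of each leaf):

combos <- data.frame(control = c(0, 1, 0, 1), sex = c(0, 0, 1, 1))
predict(CARTmodelControlsex, newdata = combos)
# the (Woman, Control) leaf is 0.290456, used for comparison below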

Logistic regression and interaction terms

logitcontrolsex <- glm(voting ~ control + sex,
                       data = data, family = "binomial")
summary(logitcontrolsex)
## 
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.635538   0.006511 -97.616  < 2e-16 ***
## control     -0.200142   0.007364 -27.179  < 2e-16 ***
## sex         -0.055791   0.007343  -7.597 3.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428443  on 344081  degrees of freedom
## AIC: 428449
## 
## Number of Fisher Scoring iterations: 4

The coefficient for the “sex” variable is -0.055791. This means that women are less likely to vote: women are coded as 1 in the sex variable, and a negative coefficient means that larger values are predictive of the outcome 0 (not voting).
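
In odds terms:

exp(coef(logitcontrolsex)["sex"])  # ≈ 0.946, i.e. about 5.4% lower odds of voting for women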

Our logistic regression on the “sex” and “control” variables considers these variables separately, not jointly, and therefore does not do as well as the CART model.

We can quantify this precisely.

Possibilities <- data.frame(sex = c(0, 0, 1, 1), control = c(0, 1, 0, 1))
predict(logitcontrolsex, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065

The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ).

For the (Woman, Control) case there is not a large difference between the logistic regression model (0.2908065) and the CART model (0.290456), but it is there.
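
The gap is easy to quantify:

0.2908065 - 0.290456  # logistic minus CART for (Woman, Control): about 3.5e-04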

We now add a new term to our logistic regression: the interaction of the “sex” and “control” variables. If this new variable is 1, the person is a woman AND in the control group.

logmodel2 <- glm(voting ~ sex + control + sex:control, data = data, family = "binomial")
summary(logmodel2)
## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.637471   0.007603 -83.843  < 2e-16 ***
## sex         -0.051888   0.010801  -4.804 1.55e-06 ***
## control     -0.196553   0.010356 -18.980  < 2e-16 ***
## sex:control -0.007259   0.014729  -0.493    0.622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428442  on 344080  degrees of freedom
## AIC: 428450
## 
## Number of Fisher Scoring iterations: 4

If a person is a woman and in the control group, the chance that she voted goes down.

(The sex:control coefficient is negative, so a value of 1 in this variable decreases the chance of voting, and this variable takes the value 1 exactly when the person is a woman and in the control group. Note, though, that the coefficient is small and not statistically significant (p ≈ 0.62).)

predict(logmodel2, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558

The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.

Trees can capture nonlinear relationships that logistic regression cannot, but we can sometimes get around this by adding interaction variables that combine two variables.

We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.
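
For instance, interacting every treatment with sex (a sketch of that larger model, with a hypothetical name; not something we would fit blindly) turns four treatment terms into eight:

logmodel_all <- glm(voting ~ (civicduty + hawthorne + neighbors + self) * sex,
                    data = data, family = "binomial")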