Source: Analytics Edge – Unit 4 Homework

Techniques involved: logistic regression, CART, regression tree, interpretation, interaction term in logistic regression, logistic regression vs CART

In August 2006, three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large-scale field experiment in Michigan, USA to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:

Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that … “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of vote turnout…

In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.

THE DATA

The researchers randomly assigned about 344,000 voters to groups: about 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.

“Civic Duty” (variable civicduty) group members were sent a letter that simply said “DO YOUR CIVIC DUTY - VOTE!”

“Hawthorne Effect” (variable hawthorne) group members were sent a letter that had the “Civic Duty” message plus the additional message “YOU ARE BEING STUDIED”, and they were informed that their voting behavior would be examined by means of public records.

“Self” (variable self) group members received the “Civic Duty” message as well as the recent voting record of everyone in that household and a message stating that another message would be sent after the election with updated records.

“Neighbors” (variable neighbors) group members were given the same message as the “Self” group, except that the message showed not only the household’s voting records but also those of neighbors, maximizing social pressure.

“Control” (variable control) group members were not sent anything and represent the typical voting situation.

Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if the person voted, 0 otherwise).

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_4_Trees")
gerber <- read.csv("gerber.csv")
str(gerber)
## 'data.frame':    344084 obs. of  8 variables:
##  $ sex      : int  0 1 1 1 0 1 0 0 1 0 ...
##  $ yob      : int  1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
##  $ voting   : int  0 0 1 1 1 1 1 0 0 0 ...
##  $ hawthorne: int  0 0 1 1 1 0 0 0 0 0 ...
##  $ civicduty: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ neighbors: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ self     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ control  : int  0 0 0 0 0 1 1 1 1 1 ...
table(gerber$voting)
## 
##      0      1 
## 235388 108696

108696 of the 344084 people voted, i.e. about 31.6%.
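Since voting is coded 0/1, the same fraction can be read directly off the mean:

mean(gerber$voting)
## [1] 0.3158996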

Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?

tapply(gerber$voting, gerber$civicduty, mean)
##         0         1 
## 0.3160698 0.3145377
tapply(gerber$voting, gerber$hawthorne, mean)
##         0         1 
## 0.3150909 0.3223746
tapply(gerber$voting, gerber$neighbors, mean)
##         0         1 
## 0.3081505 0.3779482
tapply(gerber$voting, gerber$self, mean)
##         0         1 
## 0.3122446 0.3451515

The “Neighbors” group had the largest fraction of voters, about 37.8%, followed by “Self” (34.5%), “Hawthorne” (32.2%), and “Civic Duty” (31.5%).
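As a side note, the four tapply calls can be collapsed into one by deriving a single group label from the mutually exclusive dummies. This is just a sketch; group is a helper vector introduced here, not a column in the original dataset.

# collapse the five mutually exclusive dummies into one label per voter
group <- with(gerber, ifelse(civicduty == 1, "Civic Duty",
              ifelse(hawthorne == 1, "Hawthorne",
              ifelse(self == 1, "Self",
              ifelse(neighbors == 1, "Neighbors", "Control")))))
tapply(gerber$voting, group, mean)  # all five group means in one call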

Build a logistic regression model for voting using the four treatment group variables as the independent variables:

logReg <- glm(voting ~ civicduty + hawthorne + self + neighbors, data = gerber, family = binomial)
summary(logReg)
## 
## Call:
## glm(formula = voting ~ civicduty + hawthorne + self + neighbors, 
##     family = binomial, data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9744  -0.8691  -0.8389   1.4586   1.5590  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -0.863358   0.005006 -172.459  < 2e-16 ***
## civicduty    0.084368   0.012100    6.972 3.12e-12 ***
## hawthorne    0.120477   0.012037   10.009  < 2e-16 ***
## self         0.222937   0.011867   18.786  < 2e-16 ***
## neighbors    0.365092   0.011679   31.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428090  on 344079  degrees of freedom
## AIC: 428100
## 
## Number of Fisher Scoring iterations: 4
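All four treatment coefficients are positive and highly significant. To make their sizes easier to read, one can exponentiate them into odds ratios (a standard interpretation step, not part of the original assignment):

exp(coef(logReg))
# e.g. exp(0.365092) is roughly 1.44: receiving the "Neighbors" letter
# multiplies the odds of voting by about 1.44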

Using a threshold of 0.3, what is the accuracy of the logistic regression model?

predict_logReg <- predict(logReg, type = "response")
table(gerber$voting, predict_logReg >= 0.3)
##    
##      FALSE   TRUE
##   0 134513 100875
##   1  56730  51966

The accuracy is (134513 + 51966) / (134513 + 51966 + 100875 + 56730) ≈ 0.542.
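The same number can be computed without reading counts off the table; comparing the thresholded predictions to the 0/1 outcome and taking the mean gives the accuracy directly:

mean((predict_logReg >= 0.3) == gerber$voting)
## [1] 0.5419578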

Using a threshold of 0.5, what is the accuracy of the logistic regression model?

table(gerber$voting, predict_logReg >= 0.5)
##    
##      FALSE
##   0 235388
##   1 108696
Since no predicted probability reaches 0.5, the table has only a FALSE column: the model classifies everyone as a non-voter.

235388/nrow(gerber)
## [1] 0.6841004

The accuracy is 235388 / 344084 ≈ 0.684, which is exactly the fraction of non-voters.
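The direct computation confirms that at this threshold the model is no better than the baseline that predicts “did not vote” for everyone:

mean((predict_logReg >= 0.5) == gerber$voting)  # model accuracy at threshold 0.5
## [1] 0.6841004
mean(gerber$voting == 0)                        # baseline: always predict "did not vote"
## [1] 0.6841004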

Build a CART tree for voting using all the data and the same four treatment variables. Don’t set the option method="class": we are actually going to create a regression tree here.

If we used method="class", CART would only split if one side of a split had a probability of voting above 50% and the other side a probability below 50% (since only then would the predicted classes differ). With a regression tree, CART will split even if both sides have a probability of voting below 50%.

library(rpart)
library(rpart.plot)
CARTmodel <- rpart(voting ~ civicduty + hawthorne + self + neighbors, data = gerber)

If you plot the tree with prp(CARTmodel), you should see just one leaf! There are no splits in the tree, because none of the variables has a large enough effect on voting to justify a split.

prp(CARTmodel)
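The single-leaf structure can also be confirmed without plotting: rpart stores one row per node in the frame component of the fitted object, so a tree with no splits has exactly one row.

nrow(CARTmodel$frame)
## [1] 1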

Setting the complexity parameter cp to 0.0 removes the penalty on additional splits and forces the complete tree to be built:

CARTmodel2 <- rpart(voting ~ civicduty + hawthorne + self + neighbors, data = gerber, cp = 0.0)
prp(CARTmodel2)

Using only the CART tree plot, determine what fraction (a number between 0 and 1) of “Civic Duty” people voted. You can read this off the tree: the “Civic Duty” people correspond to the bottom-right leaf, which has the value 0.31.
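If you would rather not read the plot, the exact leaf value can be recovered with predict on a hypothetical “Civic Duty” observation:

predict(CARTmodel2, newdata = data.frame(civicduty = 1, hawthorne = 0,
                                         self = 0, neighbors = 0))
# should match the civic-duty group mean computed earlier, 0.3145377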

Now introduce a new variable, sex:

CARTmodel3 <- rpart(voting ~ sex + civicduty + hawthorne + self + neighbors, data = gerber, cp = 0.0)
prp(CARTmodel3)

In the “Civic Duty” group, which gender is more likely to vote? The civic-duty group corresponds to the bottom right of the tree, where sex = 0 (male) has the higher voting percentage.

In the control group, which gender is more likely to vote? The control group corresponds to the bottom left of the tree, where again sex = 0 (male) has the higher voting percentage.
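The same predict trick verifies both readings at once. Assuming CARTmodel3 from above, the two rows below are a male and a female “Civic Duty” member:

predict(CARTmodel3, newdata = data.frame(sex = c(0, 1), civicduty = 1,
                                         hawthorne = 0, self = 0, neighbors = 0))
# row 1 (sex = 0, male) should show the higher voting fraction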

Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.

controlReg <- rpart(voting ~ control, data = gerber, cp = 0.0)
controlSexReg <- rpart(voting ~ control + sex, data = gerber, cp = 0.0)
prp(controlReg, digits=6)

prp(controlSexReg, digits = 6)
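To read the leaf values at full precision rather than squinting at the plots, predict can again be evaluated at each distinct covariate pattern:

predict(controlReg, newdata = data.frame(control = c(0, 1)))
# voting fractions for the not-control and control groups
predict(controlSexReg, newdata = data.frame(control = c(0, 0, 1, 1),
                                            sex = c(0, 1, 0, 1)))
# voting fractions for the four (control, sex) combinations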

Going back to logistic regression now, create a model using “sex” and “control”:

ControlSex_logReg <- glm(voting ~ sex + control, data = gerber, family = binomial)
summary(ControlSex_logReg)
## 
## Call:
## glm(formula = voting ~ sex + control, family = binomial, data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.635538   0.006511 -97.616  < 2e-16 ***
## sex         -0.055791   0.007343  -7.597 3.02e-14 ***
## control     -0.200142   0.007364 -27.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428443  on 344081  degrees of freedom
## AIC: 428449
## 
## Number of Fisher Scoring iterations: 4

Note that the coefficient on sex is negative, meaning that women (sex = 1) are slightly less likely to vote. The regression tree calculated the percentage voting exactly for each of the four possibilities: (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). Logistic regression attempts to do the same, but it cannot do as well because, without an interaction term, it cannot model the joint effect of being a woman and being in the control group exactly.

We can quantify this precisely.

Possibilities <- data.frame(sex = c(0,0,1,1), control = c(0,1,0,1))

Evaluate the logistic regression on these four cases using the predict function:

predict(ControlSex_logReg, newdata = Possibilities, type = "response")
##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065

The four values correspond to the four possibilities in the order stated above: (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). What is the absolute difference between the tree and the logistic regression for the (Woman, Control) case?

(Woman, Control) = 0.2908065 in the logistic regression

(Woman, Control) = 0.290456 in the tree (controlSexReg)

The absolute difference is |0.2908065 - 0.290456| ≈ 0.00035.
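The whole comparison can also be done in one shot; treePred and logPred are helper names introduced here:

treePred <- predict(controlSexReg, newdata = Possibilities)
logPred <- predict(ControlSex_logReg, newdata = Possibilities, type = "response")
abs(treePred - logPred)  # the 4th entry, (Woman, Control), is about 0.00035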

So the difference is not too big for this dataset, but it is there. We’re going to add a new term to our logistic regression now, that is the combination of the “sex” and “control” variables - so if this new variable is 1, that means the person is a woman AND in the control group.

ControlSex_logReg2 <- glm(voting ~ sex + control + sex:control, data = gerber, family = binomial)
summary(ControlSex_logReg2)
## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = binomial, 
##     data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.637471   0.007603 -83.843  < 2e-16 ***
## sex         -0.051888   0.010801  -4.804 1.55e-06 ***
## control     -0.196553   0.010356 -18.980  < 2e-16 ***
## sex:control -0.007259   0.014729  -0.493    0.622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428442  on 344080  degrees of freedom
## AIC: 428450
## 
## Number of Fisher Scoring iterations: 4

Note that the interaction coefficient sex:control is small (-0.007259) and not statistically significant (p = 0.622). Run the same prediction code as before to calculate the average for each group:

predict(ControlSex_logReg2, newdata = Possibilities, type = "response")
##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558

What is the difference between the logistic regression model and the CART model for the (Woman, Control) case now? The logistic regression gives 0.2904558 and the tree gives 0.290456, a difference of about 2 × 10^-7, essentially zero. With the interaction term, the logistic regression can fit all four group averages exactly, just like the tree.
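Repeating the comparison with the interaction model (reusing treePred from above) shows the gap essentially vanish:

logPred2 <- predict(ControlSex_logReg2, newdata = Possibilities, type = "response")
abs(treePred - logPred2)  # the (Woman, Control) entry is about 2e-07, zero to rounding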

This example shows that trees can capture nonlinear relationships that logistic regression cannot, but that we can sometimes work around this by adding variables that are combinations of two other variables. Should we always include all possible interaction terms of the independent variables when building a logistic regression model? No: the number of interaction terms grows rapidly with the number of variables, and including them all invites overfitting.