The data

The researchers randomly assigned about 344,000 voters to groups: roughly 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.

Additional variables record each voter’s sex, year of birth (yob), and whether the person voted (voting):

# Packages used throughout: dplyr (glimpse, pipe), rpart, rpart.plot (prp), ROCR (AUC)
library(dplyr); library(rpart); library(rpart.plot); library(ROCR)
gerber <- read.csv("data/gerber.csv")
glimpse(gerber)
## Observations: 344,084
## Variables: 8
## $ sex       <int> 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0...
## $ yob       <int> 1941, 1947, 1982, 1950, 1951, 1959, 1956, 1981, 1968...
## $ voting    <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0...
## $ hawthorne <int> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0...
## $ civicduty <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ neighbors <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ self      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
## $ control   <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0...
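
A quick check of the group sizes described above (output omitted):

# Each voter belongs to exactly one of these five groups
colSums(gerber[, c("control", "civicduty", "hawthorne", "neighbors", "self")])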

Exploration

What proportion of people in this dataset voted in this election?

mean(gerber$voting)
## [1] 0.3158996

Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?

gerber %>% filter(voting == 1) %>%
  summarise(civicduty = sum(civicduty), hawthorne = sum(hawthorne),
            neighbors = sum(neighbors), self = sum(self))
##   civicduty hawthorne neighbors  self
## 1     12021     12316     14438 13191
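
The counts above point to the Neighbors group; because the four treatment groups are roughly the same size, the raw counts and the percentages give the same ranking. To be explicit, the voting rate within each group can be computed directly (output omitted):

# Proportion of each treatment group that voted
gerber %>% summarise(civicduty = sum(voting * civicduty) / sum(civicduty),
                     hawthorne = sum(voting * hawthorne) / sum(hawthorne),
                     neighbors = sum(voting * neighbors) / sum(neighbors),
                     self      = sum(voting * self)      / sum(self))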

Logistic Regression

Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.

logmodel <- glm(data = gerber, voting ~ civicduty + hawthorne + self + neighbors)
summary(logmodel)
## 
## Call:
## glm(formula = voting ~ civicduty + hawthorne + self + neighbors, 
##     data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3780  -0.3145  -0.2966   0.6549   0.7034  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.296638   0.001061 279.525  < 2e-16 ***
## civicduty   0.017899   0.002600   6.884 5.85e-12 ***
## hawthorne   0.025736   0.002601   9.896  < 2e-16 ***
## self        0.048513   0.002600  18.657  < 2e-16 ***
## neighbors   0.081310   0.002601  31.263  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2153766)
## 
##     Null deviance: 74359  on 344083  degrees of freedom
## Residual deviance: 74107  on 344079  degrees of freedom
## AIC: 448180
## 
## Number of Fisher Scoring iterations: 2
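
Note that the glm call above omits family = "binomial", so R falls back on its default Gaussian family; that is why the summary reports t values and a Gaussian dispersion parameter rather than z values. Because every voter falls in exactly one of the five groups, this linear fit and a true logistic fit reproduce the same five group voting rates, so the significance conclusions and the accuracy and AUC computed below come out the same either way. For reference, the actual logistic regression call would be:

# Same model with a binomial family; coefficients are then on the log-odds scale
logmodel <- glm(voting ~ civicduty + hawthorne + self + neighbors,
                data = gerber, family = "binomial")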

Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)

predictLog = predict(logmodel, type = "response")

# Accuracy of the model at a given probability threshold. If no prediction
# exceeds the threshold, the confusion table has a single column and the
# accuracy is simply the proportion of non-voters.
accuracy <- function(threshold = 0.3) {
  res <- table(gerber$voting, predictLog > threshold)
  if (ncol(res) == 2) {
    (res[1, 1] + res[2, 2]) / sum(res)
  } else {
    res[1, 1] / sum(res)
  }
}
accuracy()
## [1] 0.5419578

Using a threshold of 0.5, what is the accuracy of the logistic regression model?

accuracy(0.5)
## [1] 0.6841004

Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?

# Compute the AUC with the ROCR package
ROCRpred = prediction(predictLog, gerber$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.5308461
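
For comparison, the baseline model that predicts “did not vote” for everyone has accuracy equal to the proportion of non-voters, 1 - 0.3158996 = 0.6841004 - exactly the accuracy we obtained with the 0.5 threshold:

# Baseline accuracy: always predict voting = 0
1 - mean(gerber$voting)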

Even though all of our variables are significant, the model does not improve on the baseline of simply predicting that no one will vote, and the AUC is low. So while the treatment groups do make a difference, this is a weak predictive model.

Trees

We will now try out trees. Build a CART tree for voting using all the data and the same four treatment variables we used before. Don’t set the option method="class" - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting. If we used method="class", CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.

CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
prp(CARTmodel)

There are no splits in the tree, because none of the variables makes a large enough difference to be split on. Let us try again with a lower complexity parameter:

CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel2)

We saw earlier that the highest fraction of voters was in the Neighbors group, followed by the Self group, the Hawthorne group, and lastly the Civic Duty group. The tree detects exactly this ordering.

Make a new tree that includes the sex variable, again with cp = 0.0. Notice that sex appears as a split that is of secondary importance to the treatment group.

In the control group, which gender is more likely to vote?

CARTmodel3 = rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)
prp(CARTmodel3)
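
To read the answer off the data rather than the plot, the voting rate by sex within the control group can be computed directly (output omitted; sex = 0 is male and sex = 1 is female in this dataset):

# Voting rate by sex within the control group
gerber %>% filter(control == 1) %>%
  group_by(sex) %>%
  summarise(voting_rate = mean(voting))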

Interaction Terms

We know trees can handle “nonlinear” relationships, e.g. “in the ‘Civic Duty’ group and female”, but as we will see in the next few questions, it is possible to do the same with logistic regression. First, let’s explore a little more of what trees can tell us.

Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.

In the “control”-only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group and being in a different group? You can use the absolute value function to get the answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument “digits = 6” to the prp command to get a more precise estimate.

CARTcontrol = rpart(voting ~ control, data=gerber, cp=0.0)
CARTsex = rpart(voting ~ control + sex, data=gerber, cp=0.0)
prp(CARTcontrol, digits=6)
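
Rather than reading the two leaf values off the plot, the difference can also be computed directly from the tree (output omitted; controlPred is just a throwaway name):

# Predicted voting probability for the control group (control = 1) vs. a treatment group (control = 0)
controlPred <- predict(CARTcontrol, newdata = data.frame(control = c(1, 0)))
abs(controlPred[1] - controlPred[2])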

Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups):

prp(CARTsex, digits=6)

Men and women are affected by about the same amount.

Going back to logistic regression now, create a model using sex and control. Interpret the coefficient for “sex”:

LogModelSex = glm(voting ~ control + sex, data=gerber, family="binomial")
summary(LogModelSex)
## 
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.635538   0.006511 -97.616  < 2e-16 ***
## control     -0.200142   0.007364 -27.179  < 2e-16 ***
## sex         -0.055791   0.007343  -7.597 3.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428443  on 344081  degrees of freedom
## AIC: 428449
## 
## Number of Fisher Scoring iterations: 4

If you look at the summary of the model, you can see that the coefficient for the sex variable is -0.055791. This means that women are less likely to vote: women have the larger value in the sex variable, and a negative coefficient means that larger values are associated with not voting (an outcome of 0).
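
On the odds scale, this corresponds to roughly a 5% reduction in the odds of voting for women relative to men (a quick back-of-the-envelope check):

# Odds ratio for sex = 1 (women) vs. sex = 0 (men), holding control fixed
exp(-0.055791)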

The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). Logistic regression has attempted to do the same, although it wasn’t able to do quite as well because it can’t represent the joint effect of being a woman and being in the control group.

We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where “LogModelSex” is the name of your logistic regression model that uses both control and sex):

Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogModelSex, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065

The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ).
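
The regression tree’s predictions for the same four cases can be pulled out with predict as well, which makes the comparison below straightforward (output omitted):

# CART predictions for the four (sex, control) possibilities, in the same order
predict(CARTsex, newdata = Possibilities)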

What is the absolute difference between the tree’s prediction and the logistic regression’s prediction for the (Woman, Control) case? Give an answer with five numbers after the decimal point.

round(abs(0.290456 - 0.2908065), 5)
## [1] 0.00035

The CART tree predicts 0.290456 for the (Woman, Control) case, and the logistic regression model predicts 0.2908065. So the absolute difference, to five decimal places, is 0.00035.

So the difference is not too big for this dataset, but it is there. We’re now going to add a new term to our logistic regression: the interaction of the “sex” and “control” variables. If this new variable is 1, the person is a woman AND in the control group. We can do that with the following command:

LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(LogModel2)
## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial", 
##     data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.637471   0.007603 -83.843  < 2e-16 ***
## sex         -0.051888   0.010801  -4.804 1.55e-06 ***
## control     -0.196553   0.010356 -18.980  < 2e-16 ***
## sex:control -0.007259   0.014729  -0.493    0.622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428442  on 344080  degrees of freedom
## AIC: 428450
## 
## Number of Fisher Scoring iterations: 4

This coefficient is negative, so a value of 1 in this variable decreases the chance of voting. The variable takes the value 1 when the person is a woman and in the control group, although note that the interaction term is not significant (p = 0.622).
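
For the (Woman, Control) case the intercept and all three coefficients apply, so the prediction reported below can be reproduced by hand from the summary (up to rounding of the printed coefficients):

# Inverse-logit of the linear predictor for sex = 1, control = 1
plogis(-0.637471 - 0.051888 - 0.196553 - 0.007259)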

Now what is the difference between the logistic regression model and the CART model for the (Woman, Control) case?

predict(LogModel2, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558

The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.
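
The remaining gap between the two quoted predictions can be checked directly:

# Difference between the interaction model and the CART tree for (Woman, Control)
abs(0.2904558 - 0.290456)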

This example has shown that trees can capture nonlinear relationships that logistic regression cannot, but that we can sometimes get around this by adding variables that are combinations of two other variables. Should we always include all possible interaction terms of the independent variables when building a logistic regression model?

Ans.: We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.
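
For illustration only, a model with every treatment-by-sex interaction can be written compactly with the * operator - this is exactly the kind of model the answer above warns against fitting by default (the model name below is arbitrary):

# Four treatment indicators, sex, and all four treatment:sex interactions
LogModelAllInteractions <- glm(voting ~ (civicduty + hawthorne + self + neighbors) * sex,
                               data = gerber, family = "binomial")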