We will first get familiar with the data. Load the CSV file gerber.csv into R. What proportion of people in this dataset voted in this election?
gerber= read.csv("data/gerber.csv")
table(gerber$voting)
##
## 0 1
## 235388 108696
108696/nrow(gerber)
## [1] 0.3158996
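Since voting is coded 0/1, the same proportion can also be obtained in one step with mean() (an equivalent check):
mean(gerber$voting)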
Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?
tapply(gerber$voting, gerber$hawthorne, mean)
## 0 1
## 0.3150909 0.3223746
tapply(gerber$voting, gerber$civicduty, mean)
## 0 1
## 0.3160698 0.3145377
tapply(gerber$voting, gerber$self, mean)
## 0 1
## 0.3122446 0.3451515
tapply(gerber$voting, gerber$neighbors, mean)
## 0 1
## 0.3081505 0.3779482
# Neighbors (voting rate 0.378, the highest of the four treatment groups)
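The four group means can also be collected in a single pass (a small convenience sketch, not part of the original solution):
sapply(c("civicduty", "hawthorne", "self", "neighbors"),
       function(v) mean(gerber$voting[gerber[[v]] == 1]))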
Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.
reg = glm(voting ~ hawthorne + civicduty + self + neighbors, data = gerber, family = "binomial")
summary(reg)
##
## Call:
## glm(formula = voting ~ hawthorne + civicduty + self + neighbors,
## family = "binomial", data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4
# All four are significant: hawthorne, civicduty, self, and neighbors
Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)
log_pre = predict(reg, type = "response")  # no newdata needed since we predict on the data the model was built on
table(gerber$voting, log_pre>=0.3)
##
## FALSE TRUE
## 0 134513 100875
## 1 56730 51966
(134513+51966)/nrow(gerber)
## [1] 0.5419578
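A small helper makes it easy to recompute accuracy at any threshold from the same predictions (accuracy_at is an illustrative name, not from the original solution):
# fraction of observations where the thresholded prediction agrees with voting
accuracy_at = function(threshold) mean((log_pre >= threshold) == gerber$voting)
accuracy_at(0.3)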
Using a threshold of 0.5, what is the accuracy of the logistic regression model?
table(gerber$voting, log_pre>=0.5)
##
## FALSE
## 0 235388
## 1 108696
235388/nrow(gerber)
## [1] 0.6841004
Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?
library(ROCR)
ROCRpred = prediction(log_pre, gerber$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
#Even though all of the variables are significant, this is a weak predictive model.
Plot the tree. What happens, and if relevant, why?
#install.packages("rpart.plot")
library(caTools)
library(rpart); library(rpart.plot)
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
rpart.plot(CARTmodel)
#No variables are used (the tree is only a root node): none of the variables has a big enough effect on voting to be split on.
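One optional way to confirm this is to print the complexity table: with rpart's default cp of 0.01, no split improves the fit enough to be kept.
printcp(CARTmodel)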
Now rebuild the tree with cp = 0.0 to force the complete tree to be built. Then plot the tree. What do you observe about the order of the splits?
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
rpart.plot(CARTmodel2)
#Neighbors is the first split; Civic Duty is the last.
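As an optional check that does not rely on the plot, the variable used at the root split can be read straight from the fitted object:
CARTmodel2$frame$var[1]  # splitting variable at the root node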
Using only the CART tree plot, determine what fraction (a number between 0 and 1) of “Civic Duty” people voted:
0.31
## [1] 0.31
In the control group, which gender is more likely to vote?
CARTmodel3 = rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)
rpart.plot(CARTmodel3)
# male: 0.38, female: 0.37, so men are more likely to vote in the control group
In the “Civic Duty” group, which gender is more likely to vote?
# male: 0.32, female: 0.31, so again men are more likely to vote
In the “control” only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group? You can use the absolute value function to get answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument “digits = 6” to the prp command to get a more accurate estimate.
control_tree= rpart(voting~control, data=gerber, cp=0.0)
control_tree2= rpart(voting~control+sex, data=gerber, cp=0.0)
rpart.plot(control_tree, digits=6)
0.32-0.296638
## [1] 0.023362
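Rather than reading the two leaf values off the plot, the tree's predicted probabilities can also be extracted directly (an optional cross-check):
predict(control_tree, newdata = data.frame(control = c(0, 1)))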
Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups):
rpart.plot(control_tree2, digits=6)
0.345818-0.302795
## [1] 0.043023
0.334176-0.290456
## [1] 0.04372
#They are affected about the same (the change in probability for men and women is within 0.001 of each other).
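These plot-based estimates can be cross-checked against the raw data with a two-way table of voting rates by sex and control:
tapply(gerber$voting, list(sex = gerber$sex, control = gerber$control), mean)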
Going back to logistic regression now, create a model using “sex” and “control”. Interpret the coefficient for “sex”:
log_reg2 = glm(voting ~ control + sex, data = gerber, family = "binomial")
summary(log_reg2)
##
## Call:
## glm(formula = voting ~ control + sex, family = "binomial", data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4
#The coefficient for sex is negative, reflecting that women are less likely to vote (equivalently, men are more likely to vote).
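To attach a magnitude to this effect, the coefficient can be exponentiated into an odds ratio (a standard transformation, added here as an aside):
exp(coef(log_reg2)["sex"])  # odds of voting for women relative to men, holding control fixed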
The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ). What is the absolute difference between the tree and the logistic regression for the (Woman, Control) case? Give an answer with five numbers after the decimal point.
Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(log_reg2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
0.2908065 - 0.290456 # logistic regression prediction minus the tree prediction for the (Woman, Control) case
## [1] 0.0003505
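The tree-side value used above was read off the plot with digits = 6; the predictions for all four cases can also be pulled from the tree directly (an optional check):
predict(control_tree2, newdata = Possibilities)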
Now create a logistic regression model using sex, control, and their interaction term sex:control. How should the interaction coefficient be interpreted?
log_reg3 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(log_reg3)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 1.55e-06 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4
#If a person is a woman and in the control group, the chance that she voted goes down; this extra drop is what the added sex:control interaction variable captures.
Now what is the difference between the logistic regression model and the CART model for the (Woman, Control) case? Again, give your answer with five numbers after the decimal point.
predict(log_reg3, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558
0.2904558 - 0.290456 # close to 0: with the interaction term, the logistic regression essentially matches the tree
## [1] -2e-07
This example has shown that trees can capture nonlinear relationships that logistic regression cannot, but that we can sometimes get around this by using variables that are the combination of two variables. Should we always include all possible interaction terms of the independent variables when building a logistic regression model?
#Trees can capture nonlinear relationships that logistic regression cannot, but regression can sometimes work around this by creating new variables such as interaction terms.
#However, we should not always include every possible interaction term: doing so leads to overfitting.
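For reference, R's formula syntax generates interaction terms compactly: sex * control is shorthand for sex + control + sex:control, and the ^2 operator expands to every pairwise interaction, which is exactly the kind of model the overfitting warning is about. The glm call below is equivalent to log_reg3; the last line is shown only as formula syntax, not fitted:
# Equivalent to the log_reg3 specification above:
glm(voting ~ sex * control, data = gerber, family = "binomial")
# All main effects plus every pairwise interaction of these predictors:
voting ~ (civicduty + hawthorne + self + neighbors + sex)^2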