Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen’s duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of voter turnout.
The researchers randomly assigned about 344,000 voters to groups: roughly 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.
Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).
gerber<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\gerber.csv", header = TRUE)
First few rows of the data
head(gerber)
## sex yob voting hawthorne civicduty neighbors self control
## 1 0 1941 0 0 1 0 0 0
## 2 1 1947 0 0 1 0 0 0
## 3 1 1982 1 1 0 0 0 0
## 4 1 1950 1 1 0 0 0 0
## 5 0 1951 1 1 0 0 0 0
## 6 1 1959 1 0 0 0 0 1
Exploring the gerber data
summary(gerber)
## sex yob voting hawthorne
## Min. :0.0000 Min. :1900 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1947 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :1956 Median :0.0000 Median :0.000
## Mean :0.4993 Mean :1956 Mean :0.3159 Mean :0.111
## 3rd Qu.:1.0000 3rd Qu.:1965 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1986 Max. :1.0000 Max. :1.000
## civicduty neighbors self control
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1111 Mean :0.111 Mean :0.1111 Mean :0.5558
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
str(gerber)
## 'data.frame': 344084 obs. of 8 variables:
## $ sex : int 0 1 1 1 0 1 0 0 1 0 ...
## $ yob : int 1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
## $ voting : int 0 0 1 1 1 1 1 0 0 0 ...
## $ hawthorne: int 0 0 1 1 1 0 0 0 0 0 ...
## $ civicduty: int 1 1 0 0 0 0 0 0 0 0 ...
## $ neighbors: int 0 0 0 0 0 0 0 0 0 0 ...
## $ self : int 0 0 0 0 0 0 0 0 0 0 ...
## $ control : int 0 0 0 0 0 1 1 1 1 1 ...
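As a quick check on the group sizes described at the start (one large control group and four smaller treatment groups), we can sum each binary indicator column. A minimal sketch using the gerber data frame loaded above:
# Number of voters in each treatment group and in the control group
colSums(gerber[, c("civicduty", "hawthorne", "self", "neighbors", "control")])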
Let’s find the accuracy of the baseline model, which simply predicts the most frequent outcome (not voting) for everyone
table(gerber$voting)
##
## 0 1
## 235388 108696
accuracy<-235388/nrow(gerber)
accuracy
## [1] 0.6841004
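Equivalently, the baseline accuracy is just the proportion of people who did not vote; a one-line sketch:
# Proportion of non-voters, i.e. accuracy of always predicting "did not vote"
mean(gerber$voting == 0)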
Now, because the dependent variable (voting) is binary, we will fit a logistic regression model to the data, using the four treatment variables as predictors
logireg<-glm(voting~hawthorne+civicduty+neighbors+self, data = gerber, family = binomial)
summary(logireg)
##
## Call:
## glm(formula = voting ~ hawthorne + civicduty + neighbors + self,
## family = binomial, data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4
Predicting voting probabilities for each voter from the logistic regression model
pred<-predict(logireg, type = "response")
Accuracy for a threshold of 0.5
tab<-table(gerber$voting, pred>0.5)
accuracy<-sum(diag(tab))/nrow(gerber)
accuracy
## [1] 0.6841004
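The accuracy at this threshold is identical to the baseline, which suggests the model never assigns anyone a probability above 0.5. A quick check on the fitted values computed above (a sketch):
# Largest fitted probability; if it is below 0.5 the model never predicts voting
max(pred)
# How many predictions fall on each side of the threshold
table(pred > 0.5)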
Finding the AUC of the ROC curve
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
ROCRpredTest = prediction(pred, gerber$voting)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
## [1] 0.5308461
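An AUC of about 0.53 is only slightly better than random guessing. To visualize this, we can also plot the ROC curve with ROCR, reusing the prediction object created above (a sketch):
# True positive rate vs. false positive rate, colored by threshold
ROCRperf = performance(ROCRpredTest, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE)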
We will now try out trees. Let’s build a CART tree for voting using all the data and the same four treatment variables we used before. Don’t set the option method=“class”: we are actually going to create a regression tree here, because we are interested in the fraction of people who vote, or the probability of voting. We would like CART to split our groups if they have different probabilities of voting. If we used method=“class”, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability below 50% (since the predicted outcomes would be different). With regression trees, however, CART will split even if both groups have a probability of voting below 50%.
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
prp(CARTmodel)
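For comparison, the classification-tree version discussed above could be fit by converting voting to a factor and setting method=“class”; as argued, it would only split where the predicted class changes, so with these treatment variables it is unlikely to split at all (a sketch):
# Classification tree: splits only when the predicted class differs across a split
CARTclass = rpart(as.factor(voting) ~ civicduty + hawthorne + self + neighbors,
                  data = gerber, method = "class")
prp(CARTclass)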
Now build the tree with the complexity parameter set to cp = 0.0, so the tree can grow without the default complexity penalty
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel2)
Now build the tree including sex
CARTmodel3 = rpart(voting ~ sex+civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel3)
Now build the tree using only the control variable
CARTmodel4 = rpart(voting ~ control, data=gerber, cp=0.0)
prp(CARTmodel4, digits = 6)
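The estimated voting fractions for the control and non-control groups can also be read directly from the fitted tree object rather than from the plot; yval holds each node’s mean of voting (a sketch):
# var = splitting variable, n = observations in the node, yval = mean voting rate
CARTmodel4$frame[, c("var", "n", "yval")]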
Now build the tree using sex and control
CARTmodel5 = rpart(voting ~ sex+control, data=gerber, cp=0.0)
prp(CARTmodel5, digits = 6)
Logistic regression model using sex and control
logireg1<-glm(voting~sex+control, data = gerber, family = binomial)
summary(logireg1)
##
## Call:
## glm(formula = voting ~ sex + control, family = binomial, data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4
Create the following data frame (it contains all possible combinations of sex and control), and evaluate the logistic regression using the predict function
Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(logireg1, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
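The same Possibilities data frame can be passed to the sex-and-control tree built earlier (CARTmodel5), so the tree’s estimated voting fractions can be placed next to the logistic regression predictions (a sketch):
# Regression-tree estimates for the four sex/control combinations
predict(CARTmodel5, newdata = Possibilities)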
So the difference is not very big for this dataset, but it is there. We will now add a new term to the logistic regression: the interaction of the “sex” and “control” variables. If this new variable is 1, the person is a woman AND in the control group.
Logireg2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(Logireg2)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 1.55e-06 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4
Run the same prediction as before to compute the estimated probability of voting for each group
predict(Logireg2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558
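To see how close the interaction model now is to the tree, we can look at the difference between the two sets of predictions; a very small difference for the woman-and-control case would mean the logistic regression with the interaction essentially reproduces the tree (a sketch):
# Difference between interaction-model and tree predictions for each combination
predict(Logireg2, newdata = Possibilities, type = "response") -
  predict(CARTmodel5, newdata = Possibilities)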
The predictions barely change: for a woman in the control group the estimate moves from about 0.2908 to about 0.2905, and the interaction term itself is not statistically significant (p = 0.622). Adding the interaction therefore gains us very little and risks overfitting to this particular dataset. The broader takeaway is that regression trees pick up interactions and nonlinearities like this automatically, which is something plain logistic or linear regression cannot do unless we add the corresponding terms ourselves.