Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen’s duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of voter turnout.
The researchers randomly assigned about 344,000 voters to groups: roughly 191,000 voters formed a “control” group, and the rest were assigned to one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.
Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).
gerber<-read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\gerber.csv", header = TRUE)
First few rows of the data
head(gerber)
## sex yob voting hawthorne civicduty neighbors self control
## 1 0 1941 0 0 1 0 0 0
## 2 1 1947 0 0 1 0 0 0
## 3 1 1982 1 1 0 0 0 0
## 4 1 1950 1 1 0 0 0 0
## 5 0 1951 1 1 0 0 0 0
## 6 1 1959 1 0 0 0 0 1
Exploring the gerber data
summary(gerber)
## sex yob voting hawthorne
## Min. :0.0000 Min. :1900 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1947 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :1956 Median :0.0000 Median :0.000
## Mean :0.4993 Mean :1956 Mean :0.3159 Mean :0.111
## 3rd Qu.:1.0000 3rd Qu.:1965 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1986 Max. :1.0000 Max. :1.000
## civicduty neighbors self control
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1111 Mean :0.111 Mean :0.1111 Mean :0.5558
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
str(gerber)
## 'data.frame': 344084 obs. of 8 variables:
## $ sex : int 0 1 1 1 0 1 0 0 1 0 ...
## $ yob : int 1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
## $ voting : int 0 0 1 1 1 1 1 0 0 0 ...
## $ hawthorne: int 0 0 1 1 1 0 0 0 0 0 ...
## $ civicduty: int 1 1 0 0 0 0 0 0 0 0 ...
## $ neighbors: int 0 0 0 0 0 0 0 0 0 0 ...
## $ self : int 0 0 0 0 0 0 0 0 0 0 ...
## $ control : int 0 0 0 0 0 1 1 1 1 1 ...
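As a quick check on the group sizes described at the start (one large control group and four smaller treatment groups), we can sum each binary indicator column. A minimal sketch using the gerber data frame loaded above:
# Number of voters in each treatment group and in the control group
colSums(gerber[, c("civicduty", "hawthorne", "self", "neighbors", "control")])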
Let’s find the accuracy of the baseline model, which simply predicts the most frequent outcome (not voting) for everyone
table(gerber$voting)
##
## 0 1
## 235388 108696
accuracy<-235388/nrow(gerber)
accuracy
## [1] 0.6841004
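Equivalently, the baseline accuracy is just the proportion of people who did not vote; a one-line sketch:
# Proportion of non-voters, i.e. accuracy of always predicting "did not vote"
mean(gerber$voting == 0)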
Now, because the dependent variable (voting) is binary, we will fit a logistic regression model to the data, using the four treatment variables as predictors
logireg<-glm(voting~hawthorne+civicduty+neighbors+self, data = gerber, family = binomial)
summary(logireg)
##
## Call:
## glm(formula = voting ~ hawthorne + civicduty + neighbors + self,
## family = binomial, data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4
Predicting voting probabilities for each voter from the logistic regression model
pred<-predict(logireg, type = "response")
Accuracy for a threshold of 0.5
tab<-table(gerber$voting, pred>0.5)
accuracy<-sum(diag(tab))/nrow(gerber)
accuracy
## [1] 0.6841004
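The accuracy at this threshold is identical to the baseline, which suggests the model never assigns anyone a probability above 0.5. A quick check on the fitted values computed above (a sketch):
# Largest fitted probability; if it is below 0.5 the model never predicts voting
max(pred)
# How many predictions fall on each side of the threshold
table(pred > 0.5)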
Finding the AUC of the ROC curve
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
ROCRpredTest = prediction(pred, gerber$voting)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
## [1] 0.5308461
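An AUC of about 0.53 is only slightly better than random guessing. To visualize this, we can also plot the ROC curve with ROCR, reusing the prediction object created above (a sketch):
# True positive rate vs. false positive rate, colored by threshold
ROCRperf = performance(ROCRpredTest, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE)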
We will now try out trees. Let’s build a CART tree for voting using all the data and the same four treatment variables we used before. Don’t set the option method=“class”: we are actually going to create a regression tree here, because we are interested in the fraction of people who vote, or the probability of voting. We would like CART to split our groups if they have different probabilities of voting. If we used method=“class”, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability below 50% (since the predicted outcomes would be different). With regression trees, however, CART will split even if both groups have a probability of voting below 50%.
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
prp(CARTmodel)
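For comparison, the classification-tree version discussed above could be fit by converting voting to a factor and setting method=“class”; as argued, it would only split where the predicted class changes, so with these treatment variables it is unlikely to split at all (a sketch):
# Classification tree: splits only when the predicted class differs across a split
CARTclass = rpart(as.factor(voting) ~ civicduty + hawthorne + self + neighbors,
                  data = gerber, method = "class")
prp(CARTclass)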
Now build the tree with the complexity parameter set to cp = 0.0, so the tree can grow without the default complexity penalty
CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel2)
Now build the tree including sex
CARTmodel3 = rpart(voting ~ sex+civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
prp(CARTmodel3)
Now build the tree using only the control variable
CARTmodel4 = rpart(voting ~ control, data=gerber, cp=0.0)
prp(CARTmodel4, digits = 6)
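The estimated voting fractions for the control and non-control groups can also be read directly from the fitted tree object rather than from the plot; yval holds each node’s mean of voting (a sketch):
# var = splitting variable, n = observations in the node, yval = mean voting rate
CARTmodel4$frame[, c("var", "n", "yval")]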
Now build the tree using sex and control
CARTmodel5 = rpart(voting ~ sex+control, data=gerber, cp=0.0)
prp(CARTmodel5, digits = 6)
Logistic regression model using sex and control
logireg1<-glm(voting~sex+control, data = gerber, family = binomial)
summary(logireg1)
##
## Call:
## glm(formula = voting ~ sex + control, family = binomial, data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4
Create the following data frame (it contains all possible combinations of sex and control), and evaluate the logistic regression using the predict function
Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(logireg1, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
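The same Possibilities data frame can be passed to the sex-and-control tree built earlier (CARTmodel5), so the tree’s estimated voting fractions can be placed next to the logistic regression predictions (a sketch):
# Regression-tree estimates for the four sex/control combinations
predict(CARTmodel5, newdata = Possibilities)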
So the difference is not very big for this dataset, but it is there. We will now add a new term to the logistic regression: the interaction of the “sex” and “control” variables. If this new variable is 1, the person is a woman AND in the control group.
Logireg2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(Logireg2)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = gerber)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 1.55e-06 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4
Run the same prediction as before to compute the estimated probability of voting for each group
predict(Logireg2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558
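To see how close the interaction model now is to the tree, we can look at the difference between the two sets of predictions; a very small difference for the woman-and-control case would mean the logistic regression with the interaction essentially reproduces the tree (a sketch):
# Difference between interaction-model and tree predictions for each combination
predict(Logireg2, newdata = Possibilities, type = "response") -
  predict(CARTmodel5, newdata = Possibilities)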
The predictions barely change: for a woman in the control group the estimate moves from about 0.2908 to about 0.2905, and the interaction term itself is not statistically significant (p = 0.622). Adding the interaction therefore gains us very little and risks overfitting to this particular dataset. The broader takeaway is that regression trees pick up interactions and nonlinearities like this automatically, which is something plain logistic or linear regression cannot do unless we add the corresponding terms ourselves.