Understanding why people vote?

In August 2006 three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large scale field experiment in Michigan, USA to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:

Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that ... "the casting of a single vote is of no significance where there is a multitude of electors"? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters' sense of civic duty has long been a leading explanation of vote turnout...

We will first get familiar with the data. Load the CSV file gerber.csv into R. What proportion of people in this dataset voted in this election?

vote <- read.csv("gerber.csv")
str(vote)

## 'data.frame':    344084 obs. of  8 variables:
##  $ sex      : int  0 1 1 1 0 1 0 0 1 0 ...
##  $ yob      : int  1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
##  $ voting   : int  0 0 1 1 1 1 1 0 0 0 ...
##  $ hawthorne: int  0 0 1 1 1 0 0 0 0 0 ...
##  $ civicduty: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ neighbors: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ self     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ control  : int  0 0 0 0 0 1 1 1 1 1 ...

summary(vote)

##       sex              yob           voting         hawthorne    
##  Min.   :0.0000   Min.   :1900   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1947   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.0000   Median :1956   Median :0.0000   Median :0.000  
##  Mean   :0.4993   Mean   :1956   Mean   :0.3159   Mean   :0.111  
##  3rd Qu.:1.0000   3rd Qu.:1965   3rd Qu.:1.0000   3rd Qu.:0.000  
##  Max.   :1.0000   Max.   :1986   Max.   :1.0000   Max.   :1.000  
##    civicduty        neighbors          self           control      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.1111   Mean   :0.111   Mean   :0.1111   Mean   :0.5558  
##  3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000

table(vote$voting)/nrow(vote)

## 
##         0         1 
## 0.6841004 0.3158996

Finding the largest voter group

Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?

table(vote$voting, vote$hawthorne)/ nrow(vote)

##    
##              0          1
##   0 0.60886295 0.07523744
##   1 0.28010602 0.03579359

table(vote$voting, vote$civicduty)/ nrow(vote)

##    
##              0          1
##   0 0.60796492 0.07613548
##   1 0.28096337 0.03493624

table(vote$voting, vote$neighbors)/ nrow(vote)

##    
##              0          1
##   0 0.61503877 0.06906162
##   1 0.27393892 0.04196068

table(vote$voting, vote$self)/ nrow(vote)

##    
##              0          1
##   0 0.61136525 0.07273515
##   1 0.27756304 0.03833657

table(vote$voting, vote$control)/ nrow(vote)

##    
##             0         1
##   0 0.2931697 0.3909307
##   1 0.1510271 0.1648725

Exploration and logistic regression

Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.

logVote <- glm(voting ~ civicduty + hawthorne + self + neighbors, data = vote, family = binomial)
summary(logVote)

## 
## Call:
## glm(formula = voting ~ civicduty + hawthorne + self + neighbors, 
##     family = binomial, data = vote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9744  -0.8691  -0.8389   1.4586   1.5590  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -0.863358   0.005006 -172.459  < 2e-16 ***
## civicduty    0.084368   0.012100    6.972 3.12e-12 ***
## hawthorne    0.120477   0.012037   10.009  < 2e-16 ***
## self         0.222937   0.011867   18.786  < 2e-16 ***
## neighbors    0.365092   0.011679   31.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428090  on 344079  degrees of freedom
## AIC: 428100
## 
## Number of Fisher Scoring iterations: 4

Prediction

Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)

predicted <- predict(logVote, type = 'response')
# confusion matrix
table(vote$voting, predicted > 0.3)

##    
##      FALSE   TRUE
##   0 134513 100875
##   1  56730  51966

(134513+51966)/nrow(vote)

## [1] 0.5419578

# with higher threshold
table(vote$voting, predicted > 0.5)

##    
##      FALSE
##   0 235388
##   1 108696

(235388)/nrow(vote)

## [1] 0.6841004

Area under Curve with ROCR

Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?

library(ROCR)
ROCRpred <- prediction(predicted, vote$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.5308461

Problem 2: TREES

We will now try out trees. Build a CART tree for voting using all data and the same four treatment variables we used before. Don’t set the option method=“class” - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting. If we used method=‘class’, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.

Leave all the parameters at their default values. You can use the following command in R to build the tree:

library(rpart)
library(rpart.plot)
CARTmodel <- rpart(voting ~ civicduty + hawthorne + self + neighbors, data=vote)
prp(CARTmodel)

# repeat with other model, prev result was bad

CARTmodel2 <-  rpart(voting ~ civicduty + hawthorne + self + neighbors, data=vote, cp=0.0)
prp(CARTmodel2)

# Another model with sex

CARTmodel3 <-  rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=vote, cp=0.0)
prp(CARTmodel3)

# Interaction terms

We know trees can handle “nonlinear” relationships, e.g. “in the ‘Civic Duty’ group and female”, but as we will see in the next few questions, it is possible to do the same for logistic regression. First, let’s explore what trees can tell us some more.

Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.

In the “control” only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group? You can use the absolute value function to get answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument “digits = 6” to the prp command to get a more accurate estimate.

CARTmodel4 <- rpart(voting ~ control, data=vote, cp = 0.0)
prp(CARTmodel4, digits = 6)

abs(0.296638 - 0.34)

## [1] 0.043362

CARTmodel5 <- rpart(voting ~ control + sex, data=vote, cp = 0.0)
prp(CARTmodel5, digits = 6)

# Back to regression model
LogReg2 <- glm(voting ~ control + sex, data = vote, family = binomial)
summary(LogReg2)

## 
## Call:
## glm(formula = voting ~ control + sex, family = binomial, data = vote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.635538   0.006511 -97.616  < 2e-16 ***
## control     -0.200142   0.007364 -27.179  < 2e-16 ***
## sex         -0.055791   0.007343  -7.597 3.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428443  on 344081  degrees of freedom
## AIC: 428449
## 
## Number of Fisher Scoring iterations: 4

Interaction terms 2

The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). However, logistic regression on the “sex” and “control” variables considers these variables separately, not jointly, and therefore did not do as well.

We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where “LogModelSex” is the name of your logistic regression model that uses both control and sex):

Possibilities <-  data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogReg2, newdata=Possibilities, type="response")

##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065

# the diff
abs(0.2908065 - 0.290456)

## [1] 0.0003505

Interaction terms 3

So the difference is not too big for this dataset, but it is there. We’re going to add a new term to our logistic regression now, that is the combination of the “sex” and “control” variables - so if this new variable is 1, that means the person is a woman AND in the control group. We can do that with the following command:

# combination of sex and control together
LogModel2 <-  glm(voting ~ sex + control + sex:control, data=vote, family="binomial")
summary(LogModel2)

## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial", 
##     data = vote)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.637471   0.007603 -83.843  < 2e-16 ***
## sex         -0.051888   0.010801  -4.804 1.55e-06 ***
## control     -0.196553   0.010356 -18.980  < 2e-16 ***
## sex:control -0.007259   0.014729  -0.493    0.622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428442  on 344080  degrees of freedom
## AIC: 428450
## 
## Number of Fisher Scoring iterations: 4

predicted_2 <- predict(LogModel2, newdata=Possibilities, type="response")
predicted_2

##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558

abs(0.2904558 - 0.290456)

## [1] 2e-07

Letter Recognition

Let’s warm up by attempting to predict just whether a letter is B or not. To begin, load the file letters_ABPR.csv into R, and call it letters. Then, create a new variable isB in the dataframe, which takes the value “TRUE” if the observation corresponds to the letter B, and “FALSE” if it does not. You can do this by typing the following command into your R console:

letters$isB = as.factor(letters$letter == “B”)

Now split the data set into a training and testing set, putting 50% of the data in the training set. Set the seed to 1000 before making the split. The first argument to sample.split should be the dependent variable “letters$isB”. Remember that TRUE values from sample.split should go in the training set.

Before building models, let’s consider a baseline method that always predicts the most frequent outcome, which is “not B”. What is the accuracy of this baseline method on the test set?

library(caTools)
letters <- read.csv("letters_ABPR.csv")
str(letters)

## 'data.frame':    3116 obs. of  17 variables:
##  $ letter   : chr  "B" "A" "R" "B" ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...

letters$isB = as.factor(letters$letter == "B")
str(letters)

## 'data.frame':    3116 obs. of  18 variables:
##  $ letter   : chr  "B" "A" "R" "B" ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...
##  $ isB      : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...

set.seed(1000)
split_letters <- sample.split(letters$isB, SplitRatio = 0.5)

train_letters <- subset(letters, split_letters == TRUE)
test_letters <- subset(letters, split_letters == FALSE)

table(test_letters$isB)/nrow(test_letters)

## 
##    FALSE     TRUE 
## 0.754172 0.245828

Predicting B or not B

Now build a classification tree to predict whether a letter is a B or not, using the training set to build your model. Remember to remove the variable “letter” out of the model, as this is related to what we are trying to predict! To just remove one variable, you can either write out the other variables, or remember what we did in the Billboards problem in Week 3, and use the following notation:

CARTb = rpart(isB ~ . - letter, data=train, method=“class”)

We are just using the default parameters in our CART model, so we don’t need to add the minbucket or cp arguments at all. We also added the argument method=“class” since this is a classification problem.

What is the accuracy of the CART model on the test set? (Use type=“class” when making predictions on the test set.)

CARTb <-  rpart(isB ~ . - letter, data=train_letters, method="class")
pred_cartB <- predict(CARTb, newdata = test_letters, type = 'class')

table(pred_cartB, test_letters$isB)

##           
## pred_cartB FALSE TRUE
##      FALSE  1118   43
##      TRUE     57  340

(1118+340)/nrow(test_letters)

## [1] 0.9358151

Random Forest to predict model

Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!). Use the default settings for ntree and nodesize (don’t include these arguments at all). Right before building the model, set the seed to 1000. (NOTE: You might get a slightly different answer on this problem, even if you set the random seed. This has to do with your operating system and the implementation of the random forest algorithm.)

What is the accuracy of the model on the test set?

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

RForestB <- randomForest(isB ~ . - letter, data=train_letters)
Pred_RforestB <- predict(RForestB, newdata = test_letters)
table(Pred_RforestB, test_letters$isB)

##              
## Pred_RforestB FALSE TRUE
##         FALSE  1164    9
##         TRUE     11  374

(1164+374)/nrow(test_letters)

## [1] 0.987163

Predicting letters A B P R

Predicting the letters A, B, P, R 0.0/2.0 points (graded)

Let us now move on to the problem that we were originally interested in, which is to predict whether or not a letter is one of the four letters A, B, P or R.

As we saw in the D2Hawkeye lecture, building a multiclass classification CART model in R is no harder than building the models for binary classification problems. Fortunately, building a random forest model is just as easy.

The variable in our data frame which we will be trying to predict is “letter”. Start by converting letter in the original data set (letters) to a factor by running the following command in R:

letters$letter = as.factor( letters$letter )

Now, generate new training and testing sets of the letters data frame using letters$letter as the first input to the sample.split function. Before splitting, set your seed to 2000. Again put 50% of the data in the training set. (Why do we need to split the data again? Remember that sample.split balances the outcome variable in the training and testing sets. With a new outcome variable, we want to re-generate our split.)

In a multiclass classification problem, a simple baseline model is to predict the most frequent class of all of the options.

What is the baseline accuracy on the testing set?

library(kableExtra)
letters$letter <- as.factor(letters$letter) 
str(letters)

## 'data.frame':    3116 obs. of  18 variables:
##  $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...
##  $ isB      : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...

set.seed(2000)
split_2 <- sample.split(letters$letter, SplitRatio = 0.5)

train_2 <- subset(letters, split_2 == TRUE)

test_2 <- subset(letters, split_2 == FALSE)
str(test_2)

## 'data.frame':    1558 obs. of  18 variables:
##  $ letter   : Factor w/ 4 levels "A","B","P","R": 2 4 1 3 1 4 1 1 1 2 ...
##  $ xbox     : int  5 2 3 8 3 6 2 2 2 4 ...
##  $ ybox     : int  9 6 7 14 8 9 1 4 8 2 ...
##  $ width    : int  7 4 5 7 5 5 4 4 4 5 ...
##  $ height   : int  7 4 5 8 6 4 2 3 6 4 ...
##  $ onpix    : int  10 3 3 4 3 3 1 2 2 4 ...
##  $ xbar     : int  9 6 12 5 9 10 8 10 12 7 ...
##  $ ybar     : int  8 7 2 10 2 6 1 2 2 7 ...
##  $ x2bar    : int  4 5 3 6 2 5 2 2 4 5 ...
##  $ y2bar    : int  4 5 2 3 3 5 2 2 3 6 ...
##  $ xybar    : int  6 6 10 12 8 10 7 9 11 7 ...
##  $ x2ybar   : int  8 5 2 5 2 2 2 2 2 6 ...
##  $ xy2bar   : int  6 7 9 4 8 8 8 9 10 6 ...
##  $ xedge    : int  6 3 2 4 2 6 2 2 3 2 ...
##  $ xedgeycor: int  11 7 6 10 6 6 5 7 6 8 ...
##  $ yedge    : int  8 5 3 4 3 4 2 3 3 7 ...
##  $ yedgexcor: int  7 8 8 8 7 9 7 9 9 10 ...
##  $ isB      : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 2 ...

#Accuracy of baseline
table(test_2$letter)

## 
##   A   B   P   R 
## 395 383 401 379

sum(401)/nrow(test_2)

## [1] 0.2573813

y <- table(test_2$letter)

kable(y)

Var1	Freq
A	395
B	383
P	401
R	379

max(diag(y))/sum(y)

## [1] 0.2573813

CART model for letters

Now build a classification tree to predict “letter”, using the training set to build your model. You should use all of the other variables as independent variables, except “isB”, since it is related to what we are trying to predict! Just use the default parameters in your CART model. Add the argument method=“class” since this is a classification problem. Even though we have multiple classes here, nothing changes in how we build the model from the binary case.

What is the test set accuracy of your CART model? Use the argument type=“class” when making predictions.

(HINT: When you are computing the test set accuracy using the confusion matrix, you want to add everything on the main diagonal and divide by the total number of observations in the test set, which can be computed with nrow(test), where test is the name of your test set).

CARTc <- rpart(letter ~ . - isB, data = train_2)
predicted3 <- predict(CARTc, newdata = test_2, type = 'class')
table(predicted3, test_2$letter)

##           
## predicted3   A   B   P   R
##          A 348   8   2  10
##          B   4 318  21  24
##          P   0  12 363   5
##          R  43  45  15 340

(354+309+362+352)/nrow(test_2)

## [1] 0.8838254

While using randomForest

Now build a random forest model on the training data, using the same independent variables as in the previous problem – again, don’t forget to remove the isB variable. Just use the default parameter values for ntree and nodesize (you don’t need to include these arguments at all). Set the seed to 1000 right before building your model. (Remember that you might get a slightly different result even if you set the random seed.)

What is the test set accuracy of your random forest model?

RanFor2 <- randomForest(letter ~ . - isB, data = train_2)
predRF_2 <- predict(RanFor2, newdata = test_2)
table(predRF_2, test_2$letter)

##         
## predRF_2   A   B   P   R
##        A 391   0   0   0
##        B   0 379   4  12
##        P   3   2 394   0
##        R   1   2   3 367

z <- table(predRF_2, test_2$letter)
acc_RF <- sum(diag(z))/sum(z)
acc_RF

## [1] 0.9826701

Predicting Earnings from Census Data

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns – in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.

census <- read.csv("census.csv")
census$over50k <- ifelse(census$over50k == " <=50K", 0, 1)
str(census)

## 'data.frame':    31978 obs. of  13 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : chr  " State-gov" " Self-emp-not-inc" " Private" " Private" ...
##  $ education    : chr  " Bachelors" " Bachelors" " HS-grad" " 11th" ...
##  $ maritalstatus: chr  " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
##  $ occupation   : chr  " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
##  $ relationship : chr  " Not-in-family" " Husband" " Not-in-family" " Husband" ...
##  $ race         : chr  " White" " White" " White" " Black" ...
##  $ sex          : chr  " Male" " Male" " Male" " Male" ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ over50k      : num  0 0 0 0 0 0 0 1 1 1 ...

summary(census)

##       age         workclass          education         maritalstatus     
##  Min.   :17.00   Length:31978       Length:31978       Length:31978      
##  1st Qu.:28.00   Class :character   Class :character   Class :character  
##  Median :37.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :38.58                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :90.00                                                           
##   occupation        relationship           race               sex           
##  Length:31978       Length:31978       Length:31978       Length:31978      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   capitalgain     capitalloss       hoursperweek   nativecountry     
##  Min.   :    0   Min.   :   0.00   Min.   : 1.00   Length:31978      
##  1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00   Class :character  
##  Median :    0   Median :   0.00   Median :40.00   Mode  :character  
##  Mean   : 1064   Mean   :  86.74   Mean   :40.42                     
##  3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00                     
##  Max.   :99999   Max.   :4356.00   Max.   :99.00                     
##     over50k      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2406  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

# Splitting the data
set.seed(2000)
split_census <- sample.split(census$over50k, SplitRatio = 0.6)

train_census <- subset(census, split_census == TRUE)

test_census <- subset(census, split_census == FALSE)

# Logistic regression 

logCensus <- glm(over50k ~ ., family = "binomial", data = train_census)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logCensus)

## 
## Call:
## glm(formula = over50k ~ ., family = "binomial", data = train_census)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1065  -0.5037  -0.1804  -0.0008   3.3383  
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                              -8.658e+00  1.379e+00  -6.279 3.41e-10
## age                                       2.548e-02  2.139e-03  11.916  < 2e-16
## workclass Federal-gov                     1.105e+00  2.014e-01   5.489 4.03e-08
## workclass Local-gov                       3.675e-01  1.821e-01   2.018 0.043641
## workclass Never-worked                   -1.283e+01  8.453e+02  -0.015 0.987885
## workclass Private                         6.012e-01  1.626e-01   3.698 0.000218
## workclass Self-emp-inc                    7.575e-01  1.950e-01   3.884 0.000103
## workclass Self-emp-not-inc                1.855e-01  1.774e-01   1.046 0.295646
## workclass State-gov                       4.012e-01  1.961e-01   2.046 0.040728
## workclass Without-pay                    -1.395e+01  6.597e+02  -0.021 0.983134
## education 11th                            2.225e-01  2.867e-01   0.776 0.437738
## education 12th                            6.380e-01  3.597e-01   1.774 0.076064
## education 1st-4th                        -7.075e-01  7.760e-01  -0.912 0.361897
## education 5th-6th                        -3.170e-01  4.880e-01  -0.650 0.516008
## education 7th-8th                        -3.498e-01  3.126e-01  -1.119 0.263152
## education 9th                            -1.258e-01  3.539e-01  -0.355 0.722228
## education Assoc-acdm                      1.602e+00  2.427e-01   6.601 4.10e-11
## education Assoc-voc                       1.541e+00  2.368e-01   6.506 7.74e-11
## education Bachelors                       2.177e+00  2.218e-01   9.817  < 2e-16
## education Doctorate                       2.761e+00  2.893e-01   9.544  < 2e-16
## education HS-grad                         1.006e+00  2.169e-01   4.638 3.52e-06
## education Masters                         2.421e+00  2.353e-01  10.289  < 2e-16
## education Preschool                      -2.237e+01  6.864e+02  -0.033 0.973996
## education Prof-school                     2.938e+00  2.753e-01  10.672  < 2e-16
## education Some-college                    1.365e+00  2.195e-01   6.219 5.00e-10
## maritalstatus Married-AF-spouse           2.540e+00  7.145e-01   3.555 0.000378
## maritalstatus Married-civ-spouse          2.458e+00  3.573e-01   6.880 6.00e-12
## maritalstatus Married-spouse-absent      -9.486e-02  3.204e-01  -0.296 0.767155
## maritalstatus Never-married              -4.515e-01  1.139e-01  -3.962 7.42e-05
## maritalstatus Separated                   3.609e-02  1.984e-01   0.182 0.855672
## maritalstatus Widowed                     1.858e-01  1.962e-01   0.947 0.343449
## occupation Adm-clerical                   9.470e-02  1.288e-01   0.735 0.462064
## occupation Armed-Forces                  -1.008e+00  1.487e+00  -0.677 0.498170
## occupation Craft-repair                   2.174e-01  1.109e-01   1.960 0.049972
## occupation Exec-managerial                9.400e-01  1.138e-01   8.257  < 2e-16
## occupation Farming-fishing               -1.068e+00  1.908e-01  -5.599 2.15e-08
## occupation Handlers-cleaners             -6.237e-01  1.946e-01  -3.204 0.001353
## occupation Machine-op-inspct             -1.862e-01  1.376e-01  -1.353 0.176061
## occupation Other-service                 -8.183e-01  1.641e-01  -4.987 6.14e-07
## occupation Priv-house-serv               -1.297e+01  2.267e+02  -0.057 0.954385
## occupation Prof-specialty                 6.331e-01  1.222e-01   5.180 2.22e-07
## occupation Protective-serv                6.267e-01  1.710e-01   3.664 0.000248
## occupation Sales                          3.276e-01  1.175e-01   2.789 0.005282
## occupation Tech-support                   6.173e-01  1.533e-01   4.028 5.63e-05
## occupation Transport-moving                      NA         NA      NA       NA
## relationship Not-in-family                7.881e-01  3.530e-01   2.233 0.025562
## relationship Other-relative              -2.194e-01  3.137e-01  -0.699 0.484263
## relationship Own-child                   -7.489e-01  3.507e-01  -2.136 0.032716
## relationship Unmarried                    7.041e-01  3.720e-01   1.893 0.058392
## relationship Wife                         1.324e+00  1.331e-01   9.942  < 2e-16
## race Asian-Pac-Islander                   4.830e-01  3.548e-01   1.361 0.173504
## race Black                                3.644e-01  2.882e-01   1.265 0.206001
## race Other                                2.204e-01  4.513e-01   0.488 0.625263
## race White                                4.108e-01  2.737e-01   1.501 0.133356
## sex Male                                  7.729e-01  1.024e-01   7.545 4.52e-14
## capitalgain                               3.280e-04  1.372e-05  23.904  < 2e-16
## capitalloss                               6.445e-04  4.854e-05  13.277  < 2e-16
## hoursperweek                              2.897e-02  2.101e-03  13.791  < 2e-16
## nativecountry Canada                      2.593e-01  1.308e+00   0.198 0.842879
## nativecountry China                      -9.695e-01  1.327e+00  -0.730 0.465157
## nativecountry Columbia                   -1.954e+00  1.526e+00  -1.280 0.200470
## nativecountry Cuba                        5.735e-02  1.323e+00   0.043 0.965432
## nativecountry Dominican-Republic         -1.435e+01  3.092e+02  -0.046 0.962972
## nativecountry Ecuador                    -3.550e-02  1.477e+00  -0.024 0.980829
## nativecountry El-Salvador                -6.095e-01  1.395e+00  -0.437 0.662181
## nativecountry England                    -6.707e-02  1.327e+00  -0.051 0.959686
## nativecountry France                      5.301e-01  1.419e+00   0.374 0.708642
## nativecountry Germany                     5.474e-02  1.306e+00   0.042 0.966572
## nativecountry Greece                     -2.646e+00  1.714e+00  -1.544 0.122527
## nativecountry Guatemala                  -1.293e+01  3.345e+02  -0.039 0.969180
## nativecountry Haiti                      -9.221e-01  1.615e+00  -0.571 0.568105
## nativecountry Holand-Netherlands         -1.282e+01  2.400e+03  -0.005 0.995736
## nativecountry Honduras                   -9.584e-01  3.412e+00  -0.281 0.778775
## nativecountry Hong                       -2.362e-01  1.492e+00  -0.158 0.874155
## nativecountry Hungary                     1.412e-01  1.555e+00   0.091 0.927653
## nativecountry India                      -8.218e-01  1.314e+00  -0.625 0.531661
## nativecountry Iran                       -3.299e-02  1.366e+00  -0.024 0.980736
## nativecountry Ireland                     1.579e-01  1.473e+00   0.107 0.914628
## nativecountry Italy                       6.100e-01  1.333e+00   0.458 0.647194
## nativecountry Jamaica                    -2.279e-01  1.387e+00  -0.164 0.869467
## nativecountry Japan                       5.072e-01  1.375e+00   0.369 0.712179
## nativecountry Laos                       -6.831e-01  1.661e+00  -0.411 0.680866
## nativecountry Mexico                     -9.182e-01  1.303e+00  -0.705 0.481103
## nativecountry Nicaragua                  -1.987e-01  1.507e+00  -0.132 0.895132
## nativecountry Outlying-US(Guam-USVI-etc) -1.373e+01  8.502e+02  -0.016 0.987115
## nativecountry Peru                       -9.660e-01  1.678e+00  -0.576 0.564797
## nativecountry Philippines                 4.393e-02  1.281e+00   0.034 0.972640
## nativecountry Poland                      2.410e-01  1.383e+00   0.174 0.861624
## nativecountry Portugal                    7.276e-01  1.477e+00   0.493 0.622327
## nativecountry Puerto-Rico                -5.769e-01  1.357e+00  -0.425 0.670837
## nativecountry Scotland                   -1.188e+00  1.719e+00  -0.691 0.489616
## nativecountry South                      -8.183e-01  1.341e+00  -0.610 0.541809
## nativecountry Taiwan                     -2.590e-01  1.350e+00  -0.192 0.847878
## nativecountry Thailand                   -1.693e+00  1.737e+00  -0.975 0.329678
## nativecountry Trinadad&Tobago            -1.346e+00  1.721e+00  -0.782 0.434105
## nativecountry United-States              -8.594e-02  1.269e+00  -0.068 0.946020
## nativecountry Vietnam                    -1.008e+00  1.523e+00  -0.662 0.507799
## nativecountry Yugoslavia                  1.402e+00  1.648e+00   0.851 0.394874
##                                             
## (Intercept)                              ***
## age                                      ***
## workclass Federal-gov                    ***
## workclass Local-gov                      *  
## workclass Never-worked                      
## workclass Private                        ***
## workclass Self-emp-inc                   ***
## workclass Self-emp-not-inc                  
## workclass State-gov                      *  
## workclass Without-pay                       
## education 11th                              
## education 12th                           .  
## education 1st-4th                           
## education 5th-6th                           
## education 7th-8th                           
## education 9th                               
## education Assoc-acdm                     ***
## education Assoc-voc                      ***
## education Bachelors                      ***
## education Doctorate                      ***
## education HS-grad                        ***
## education Masters                        ***
## education Preschool                         
## education Prof-school                    ***
## education Some-college                   ***
## maritalstatus Married-AF-spouse          ***
## maritalstatus Married-civ-spouse         ***
## maritalstatus Married-spouse-absent         
## maritalstatus Never-married              ***
## maritalstatus Separated                     
## maritalstatus Widowed                       
## occupation Adm-clerical                     
## occupation Armed-Forces                     
## occupation Craft-repair                  *  
## occupation Exec-managerial               ***
## occupation Farming-fishing               ***
## occupation Handlers-cleaners             ** 
## occupation Machine-op-inspct                
## occupation Other-service                 ***
## occupation Priv-house-serv                  
## occupation Prof-specialty                ***
## occupation Protective-serv               ***
## occupation Sales                         ** 
## occupation Tech-support                  ***
## occupation Transport-moving                 
## relationship Not-in-family               *  
## relationship Other-relative                 
## relationship Own-child                   *  
## relationship Unmarried                   .  
## relationship Wife                        ***
## race Asian-Pac-Islander                     
## race Black                                  
## race Other                                  
## race White                                  
## sex Male                                 ***
## capitalgain                              ***
## capitalloss                              ***
## hoursperweek                             ***
## nativecountry Canada                        
## nativecountry China                         
## nativecountry Columbia                      
## nativecountry Cuba                          
## nativecountry Dominican-Republic            
## nativecountry Ecuador                       
## nativecountry El-Salvador                   
## nativecountry England                       
## nativecountry France                        
## nativecountry Germany                       
## nativecountry Greece                        
## nativecountry Guatemala                     
## nativecountry Haiti                         
## nativecountry Holand-Netherlands            
## nativecountry Honduras                      
## nativecountry Hong                          
## nativecountry Hungary                       
## nativecountry India                         
## nativecountry Iran                          
## nativecountry Ireland                       
## nativecountry Italy                         
## nativecountry Jamaica                       
## nativecountry Japan                         
## nativecountry Laos                          
## nativecountry Mexico                        
## nativecountry Nicaragua                     
## nativecountry Outlying-US(Guam-USVI-etc)    
## nativecountry Peru                          
## nativecountry Philippines                   
## nativecountry Poland                        
## nativecountry Portugal                      
## nativecountry Puerto-Rico                   
## nativecountry Scotland                      
## nativecountry South                         
## nativecountry Taiwan                        
## nativecountry Thailand                      
## nativecountry Trinadad&Tobago               
## nativecountry United-States                 
## nativecountry Vietnam                       
## nativecountry Yugoslavia                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21175  on 19186  degrees of freedom
## Residual deviance: 12104  on 19090  degrees of freedom
## AIC: 12298
## 
## Number of Fisher Scoring iterations: 15

Logistic Regression Model

What is the accuracy of the model on the testing set? Use a threshold of 0.5. (You might see a warning message when you make predictions on the test set - you can safely ignore it.)

predLogEarnings <- predict(logCensus, newdata = test_census, type = "response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

table(test_census$over50k, predLogEarnings >= 0.5)

##    
##     FALSE TRUE
##   0  9051  662
##   1  1190 1888

(9051+1888)/nrow(test_census)

## [1] 0.8552107

# Baseline accuracy
table(test_census$over50k)/nrow(test_census)

## 
##         0         1 
## 0.7593621 0.2406379

AUC with ROCR Package

What is the area-under-the-curve (AUC) for this model on the test set?

library(ROCR)

ROCRpred <- prediction(predLogEarnings, test_census$over50k)
as.numeric(performance(ROCRpred, "auc")@y.values)

## [1] 0.9061598

perf1 <- performance(ROCRpred, "tpr", "fpr")
plot(perf1)

Start of CART Model

We have just seen how the logistic regression model for this data achieves a high accuracy. Moreover, the significances of the variables give us a way to gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others, especially due to the large number of factor variables in this problem.

Let us now build a classification tree to predict “over50k”. Use the training set to build the model, and all of the other variables as independent variables. Use the default parameters, so don’t set a value for minbucket or cp. Remember to specify method=“class” as an argument to rpart, since this is a classification problem. After you are done building the model, plot the resulting tree.

How many splits does the tree have in total?

censusTree <- rpart(over50k ~ ., data = train_census, method = "class")
prp(censusTree)

# Accuracy

predTreeCensus <- predict(censusTree, newdata = test_census, type = "class")
table(predTreeCensus, test_census$over50k)

##               
## predTreeCensus    0    1
##              0 9243 1482
##              1  470 1596

(9243+1596)/nrow(test_census)

## [1] 0.8473927

ROC curve

Let us now consider the ROC curve and AUC for the CART model on the test set. You will need to get predicted probabilities for the observations in the test set to build the ROC curve and compute the AUC. Remember that you can do this by removing the type=“class” argument when making predictions, and taking the second column of the resulting object.

Plot the ROC curve for the CART model you have estimated. Observe that compared to the logistic regression ROC curve, the CART ROC curve is less smooth than the logistic regression ROC curve. Which of the following explanations for this behavior is most correct? (HINT: Think about what the ROC curve is plotting and what changing the threshold does.)

predROC2 <- predict(censusTree, newdata = test_census)
ROCpred2 <- prediction(predROC2[,2], test_census$over50k)
as.numeric(performance(ROCpred2, "auc")@y.values)

## [1] 0.8470256

# Lets now plot the curve
perf <- performance(ROCpred2, "tpr", "fpr")
plot(perf)

Random Forest Model

Before building a random forest model, we’ll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model since random forests is much more computationally intensive than CART or Logistic Regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, that contains 2000 randomly selected obervations from the original training set. Do this by running the following commands in your R console (assuming your training set is called “train”):

set.seed(1)

trainSmall = train[sample(nrow(train), 2000), ]

set.seed(1)

trainSmall <-  train_census[sample(nrow(train_census), 2000), ]
str(trainSmall)

## 'data.frame':    2000 obs. of  13 variables:
##  $ age          : int  40 23 47 32 49 30 56 24 60 31 ...
##  $ workclass    : chr  " Private" " Private" " Private" " Private" ...
##  $ education    : chr  " Bachelors" " 11th" " Bachelors" " Some-college" ...
##  $ maritalstatus: chr  " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" " Never-married" ...
##  $ occupation   : chr  " Exec-managerial" " Transport-moving" " Exec-managerial" " Tech-support" ...
##  $ relationship : chr  " Husband" " Husband" " Husband" " Not-in-family" ...
##  $ race         : chr  " Asian-Pac-Islander" " White" " White" " Black" ...
##  $ sex          : chr  " Male" " Male" " Male" " Male" ...
##  $ capitalgain  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 50 38 38 70 40 40 65 50 40 ...
##  $ nativecountry: chr  " China" " United-States" " United-States" " United-States" ...
##  $ over50k      : num  1 0 1 0 1 1 0 0 1 0 ...

trainSmall$over50k <- as.factor(trainSmall$over50k)

RF_census <- randomForest(over50k ~ ., data = trainSmall)
predRCcensus <- predict(RF_census, newdata = test_census)
table(predRCcensus, test_census$over50k)

##             
## predRCcensus    0    1
##            0 9146 1538
##            1  567 1540

(9146+1611)/nrow(test_census)

## [1] 0.8409819

Random Forest Model 2

As we discussed in lecture, random forest models work by building a large collection of trees. As a result, we lose some of the interpretability that comes with CART in terms of seeing how predictions are made and which variables are important. However, we can still compute metrics that give us insight into which variables are important.

One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split. To view this metric, run the following lines of R code (replace “MODEL” with the name of your random forest model):

vu <- varUsed(RF_census, count=TRUE)

vusorted <-  sort(vu, decreasing = FALSE, index.return = TRUE)

dotchart(vusorted$x, names(RF_census$forest$xlevels[vusorted$ix]))

Random Forest 3

A different metric we can look at is related to “impurity”, which measures how homogenous each bucket or leaf of the tree is. In each tree in the forest, whenever we select a variable and perform a split, the impurity is decreased. Therefore, one way to measure the importance of a variable is to average the reduction in impurity, taken over all the times that variable is selected for splitting in all of the trees in the forest. To compute this metric, run the following command in R (replace “MODEL” with the name of your random forest model):

varImpPlot(MODEL)

Which one of the following variables is the most important in terms of mean reduction in impurity?

varImpPlot(RF_census)

set cp by CROSS VALIDATION

We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.

Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. Do this by using the train function. Set the seed beforehand to 2. Test cp values from 0.002 to 0.1 in 0.002 increments, by using the following command:

cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))

Also, remember to use the entire training set “train” when building this model. The train function might take some time to run.

Which value of cp does the train function recommend?

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid( .cp = seq(0.002,0.1,0.002)) 
str(train_census)

## 'data.frame':    19187 obs. of  13 variables:
##  $ age          : int  39 38 53 37 31 42 25 32 38 54 ...
##  $ workclass    : chr  " State-gov" " Private" " Private" " Private" ...
##  $ education    : chr  " Bachelors" " HS-grad" " 11th" " Masters" ...
##  $ maritalstatus: chr  " Never-married" " Divorced" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation   : chr  " Adm-clerical" " Handlers-cleaners" " Handlers-cleaners" " Exec-managerial" ...
##  $ relationship : chr  " Not-in-family" " Not-in-family" " Husband" " Wife" ...
##  $ race         : chr  " White" " White" " Black" " White" ...
##  $ sex          : chr  " Male" " Male" " Male" " Female" ...
##  $ capitalgain  : int  2174 0 0 0 14084 5178 0 0 0 0 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 40 40 40 50 40 35 40 50 20 ...
##  $ nativecountry: chr  " United-States" " United-States" " United-States" " United-States" ...
##  $ over50k      : num  0 0 0 0 1 1 0 0 0 0 ...

train_census$over50k <- as.factor(train_census$over50k)


train(over50k ~ ., data = train_census, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)

## CART 
## 
## 19187 samples
##    12 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17268, 17268, 17268, 17269, 17269, 17269, ... 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa     
##   0.002  0.8520870  0.55975128
##   0.004  0.8492205  0.55821110
##   0.006  0.8458331  0.54275902
##   0.008  0.8446863  0.54226352
##   0.010  0.8448426  0.54338388
##   0.012  0.8448426  0.54338388
##   0.014  0.8448426  0.54338388
##   0.016  0.8429661  0.53426816
##   0.018  0.8417155  0.52616378
##   0.020  0.8398912  0.51464080
##   0.022  0.8387966  0.50559250
##   0.024  0.8387966  0.50559250
##   0.026  0.8387966  0.50559250
##   0.028  0.8387966  0.50559250
##   0.030  0.8387966  0.50559250
##   0.032  0.8381188  0.50037344
##   0.034  0.8354091  0.48221719
##   0.036  0.8324904  0.46493345
##   0.038  0.8265489  0.43853941
##   0.040  0.8248811  0.43137353
##   0.042  0.8248811  0.43137353
##   0.044  0.8248811  0.43137353
##   0.046  0.8248811  0.43137353
##   0.048  0.8248811  0.43137353
##   0.050  0.8195128  0.39351430
##   0.052  0.8181579  0.38123605
##   0.054  0.8131548  0.32749986
##   0.056  0.8118517  0.30765346
##   0.058  0.8118517  0.30765346
##   0.060  0.8118517  0.30765346
##   0.062  0.8118517  0.30765346
##   0.064  0.8118517  0.30765346
##   0.066  0.8077861  0.28466355
##   0.068  0.7996551  0.23718533
##   0.070  0.7958511  0.21454287
##   0.072  0.7958511  0.21454287
##   0.074  0.7958511  0.21454287
##   0.076  0.7757344  0.09711952
##   0.078  0.7593684  0.00000000
##   0.080  0.7593684  0.00000000
##   0.082  0.7593684  0.00000000
##   0.084  0.7593684  0.00000000
##   0.086  0.7593684  0.00000000
##   0.088  0.7593684  0.00000000
##   0.090  0.7593684  0.00000000
##   0.092  0.7593684  0.00000000
##   0.094  0.7593684  0.00000000
##   0.096  0.7593684  0.00000000
##   0.098  0.7593684  0.00000000
##   0.100  0.7593684  0.00000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002.

Using the selected cp

Fit a CART model to the training data using this value of cp. What is the prediction accuracy on the test set?

CART_upd <- rpart(over50k ~ ., data = train_census, method = "class", cp = 0.002)
pred_upt <- predict(CART_upd, newdata = test_census, type = "class")
table(pred_upt, test_census$over50k)

##         
## pred_upt    0    1
##        0 9178 1240
##        1  535 1838

(9178+1838)/nrow(test_census)

## [1] 0.8612306

prp(CART_upd)

MITx_4_Trees

Raka

10/19/2020