In August 2006 three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large scale field experiment in Michigan, USA to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:
Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that ... "the casting of a single vote is of no significance where there is a multitude of electors"? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters' sense of civic duty has long been a leading explanation of vote turnout...
We will first get familiar with the data. Load the CSV file gerber.csv into R. What proportion of people in this dataset voted in this election?
vote <- read.csv("gerber.csv")
str(vote)
## 'data.frame': 344084 obs. of 8 variables:
## $ sex : int 0 1 1 1 0 1 0 0 1 0 ...
## $ yob : int 1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
## $ voting : int 0 0 1 1 1 1 1 0 0 0 ...
## $ hawthorne: int 0 0 1 1 1 0 0 0 0 0 ...
## $ civicduty: int 1 1 0 0 0 0 0 0 0 0 ...
## $ neighbors: int 0 0 0 0 0 0 0 0 0 0 ...
## $ self : int 0 0 0 0 0 0 0 0 0 0 ...
## $ control : int 0 0 0 0 0 1 1 1 1 1 ...
summary(vote)
## sex yob voting hawthorne
## Min. :0.0000 Min. :1900 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1947 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :1956 Median :0.0000 Median :0.000
## Mean :0.4993 Mean :1956 Mean :0.3159 Mean :0.111
## 3rd Qu.:1.0000 3rd Qu.:1965 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :1.0000 Max. :1986 Max. :1.0000 Max. :1.000
## civicduty neighbors self control
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1111 Mean :0.111 Mean :0.1111 Mean :0.5558
## 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
table(vote$voting)/nrow(vote)
##
## 0 1
## 0.6841004 0.3158996
Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?
table(vote$voting, vote$hawthorne)/ nrow(vote)
##
## 0 1
## 0 0.60886295 0.07523744
## 1 0.28010602 0.03579359
table(vote$voting, vote$civicduty)/ nrow(vote)
##
## 0 1
## 0 0.60796492 0.07613548
## 1 0.28096337 0.03493624
table(vote$voting, vote$neighbors)/ nrow(vote)
##
## 0 1
## 0 0.61503877 0.06906162
## 1 0.27393892 0.04196068
table(vote$voting, vote$self)/ nrow(vote)
##
## 0 1
## 0 0.61136525 0.07273515
## 1 0.27756304 0.03833657
table(vote$voting, vote$control)/ nrow(vote)
##
## 0 1
## 0 0.2931697 0.3909307
## 1 0.1510271 0.1648725
Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.
logVote <- glm(voting ~ civicduty + hawthorne + self + neighbors, data = vote, family = binomial)
summary(logVote)
##
## Call:
## glm(formula = voting ~ civicduty + hawthorne + self + neighbors,
## family = binomial, data = vote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9744 -0.8691 -0.8389 1.4586 1.5590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.863358 0.005006 -172.459 < 2e-16 ***
## civicduty 0.084368 0.012100 6.972 3.12e-12 ***
## hawthorne 0.120477 0.012037 10.009 < 2e-16 ***
## self 0.222937 0.011867 18.786 < 2e-16 ***
## neighbors 0.365092 0.011679 31.260 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428090 on 344079 degrees of freedom
## AIC: 428100
##
## Number of Fisher Scoring iterations: 4
Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)
predicted <- predict(logVote, type = 'response')
# confusion matrix
table(vote$voting, predicted > 0.3)
##
## FALSE TRUE
## 0 134513 100875
## 1 56730 51966
(134513+51966)/nrow(vote)
## [1] 0.5419578
# with higher threshold
table(vote$voting, predicted > 0.5)
##
## FALSE
## 0 235388
## 1 108696
(235388)/nrow(vote)
## [1] 0.6841004
Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?
library(ROCR)
ROCRpred <- prediction(predicted, vote$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.5308461
We will now try out trees. Build a CART tree for voting using all data and the same four treatment variables we used before. Don’t set the option method=“class” - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting. If we used method=‘class’, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.
Leave all the parameters at their default values. You can use the following command in R to build the tree:
library(rpart)
library(rpart.plot)
CARTmodel <- rpart(voting ~ civicduty + hawthorne + self + neighbors, data=vote)
prp(CARTmodel)
# repeat with other model, prev result was bad
CARTmodel2 <- rpart(voting ~ civicduty + hawthorne + self + neighbors, data=vote, cp=0.0)
prp(CARTmodel2)
# Another model with sex
CARTmodel3 <- rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=vote, cp=0.0)
prp(CARTmodel3)
# Interaction terms
We know trees can handle “nonlinear” relationships, e.g. “in the ‘Civic Duty’ group and female”, but as we will see in the next few questions, it is possible to do the same for logistic regression. First, let’s explore what trees can tell us some more.
Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.
In the “control” only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group? You can use the absolute value function to get answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument “digits = 6” to the prp command to get a more accurate estimate.
CARTmodel4 <- rpart(voting ~ control, data=vote, cp = 0.0)
prp(CARTmodel4, digits = 6)
abs(0.296638 - 0.34)
## [1] 0.043362
CARTmodel5 <- rpart(voting ~ control + sex, data=vote, cp = 0.0)
prp(CARTmodel5, digits = 6)
# Back to regression model
LogReg2 <- glm(voting ~ control + sex, data = vote, family = binomial)
summary(LogReg2)
##
## Call:
## glm(formula = voting ~ control + sex, family = binomial, data = vote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9220 -0.9012 -0.8290 1.4564 1.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.635538 0.006511 -97.616 < 2e-16 ***
## control -0.200142 0.007364 -27.179 < 2e-16 ***
## sex -0.055791 0.007343 -7.597 3.02e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428443 on 344081 degrees of freedom
## AIC: 428449
##
## Number of Fisher Scoring iterations: 4
The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). However, logistic regression on the “sex” and “control” variables considers these variables separately, not jointly, and therefore did not do as well.
We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where “LogModelSex” is the name of your logistic regression model that uses both control and sex):
Possibilities <- data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogReg2, newdata=Possibilities, type="response")
## 1 2 3 4
## 0.3462559 0.3024455 0.3337375 0.2908065
# the diff
abs(0.2908065 - 0.290456)
## [1] 0.0003505
So the difference is not too big for this dataset, but it is there. We’re going to add a new term to our logistic regression now, that is the combination of the “sex” and “control” variables - so if this new variable is 1, that means the person is a woman AND in the control group. We can do that with the following command:
# combination of sex and control together
LogModel2 <- glm(voting ~ sex + control + sex:control, data=vote, family="binomial")
summary(LogModel2)
##
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial",
## data = vote)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9213 -0.9019 -0.8284 1.4573 1.5724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.637471 0.007603 -83.843 < 2e-16 ***
## sex -0.051888 0.010801 -4.804 1.55e-06 ***
## control -0.196553 0.010356 -18.980 < 2e-16 ***
## sex:control -0.007259 0.014729 -0.493 0.622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429238 on 344083 degrees of freedom
## Residual deviance: 428442 on 344080 degrees of freedom
## AIC: 428450
##
## Number of Fisher Scoring iterations: 4
predicted_2 <- predict(LogModel2, newdata=Possibilities, type="response")
predicted_2
## 1 2 3 4
## 0.3458183 0.3027947 0.3341757 0.2904558
abs(0.2904558 - 0.290456)
## [1] 2e-07
Let’s warm up by attempting to predict just whether a letter is B or not. To begin, load the file letters_ABPR.csv into R, and call it letters. Then, create a new variable isB in the dataframe, which takes the value “TRUE” if the observation corresponds to the letter B, and “FALSE” if it does not. You can do this by typing the following command into your R console:
letters\(isB = as.factor(letters\)letter == “B”)
Now split the data set into a training and testing set, putting 50% of the data in the training set. Set the seed to 1000 before making the split. The first argument to sample.split should be the dependent variable “letters$isB”. Remember that TRUE values from sample.split should go in the training set.
Before building models, let’s consider a baseline method that always predicts the most frequent outcome, which is “not B”. What is the accuracy of this baseline method on the test set?
library(caTools)
letters <- read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame': 3116 obs. of 17 variables:
## $ letter : chr "B" "A" "R" "B" ...
## $ xbox : int 4 1 5 5 3 8 2 3 8 6 ...
## $ ybox : int 2 1 9 9 6 10 6 7 14 10 ...
## $ width : int 5 3 5 7 4 8 4 5 7 8 ...
## $ height : int 4 2 7 7 4 6 4 5 8 8 ...
## $ onpix : int 4 1 6 10 2 6 3 3 4 7 ...
## $ xbar : int 8 8 6 9 4 7 6 12 5 8 ...
## $ ybar : int 7 2 11 8 14 7 7 2 10 5 ...
## $ x2bar : int 6 2 7 4 8 3 5 3 6 7 ...
## $ y2bar : int 6 2 3 4 1 5 5 2 3 5 ...
## $ xybar : int 7 8 7 6 11 8 6 10 12 7 ...
## $ x2ybar : int 6 2 3 8 6 4 5 2 5 6 ...
## $ xy2bar : int 6 8 9 6 3 8 7 9 4 6 ...
## $ xedge : int 2 1 2 6 0 6 3 2 4 3 ...
## $ xedgeycor: int 8 6 7 11 10 6 7 6 10 9 ...
## $ yedge : int 7 2 5 8 4 7 5 3 4 8 ...
## $ yedgexcor: int 10 7 11 7 8 7 8 8 8 9 ...
letters$isB = as.factor(letters$letter == "B")
str(letters)
## 'data.frame': 3116 obs. of 18 variables:
## $ letter : chr "B" "A" "R" "B" ...
## $ xbox : int 4 1 5 5 3 8 2 3 8 6 ...
## $ ybox : int 2 1 9 9 6 10 6 7 14 10 ...
## $ width : int 5 3 5 7 4 8 4 5 7 8 ...
## $ height : int 4 2 7 7 4 6 4 5 8 8 ...
## $ onpix : int 4 1 6 10 2 6 3 3 4 7 ...
## $ xbar : int 8 8 6 9 4 7 6 12 5 8 ...
## $ ybar : int 7 2 11 8 14 7 7 2 10 5 ...
## $ x2bar : int 6 2 7 4 8 3 5 3 6 7 ...
## $ y2bar : int 6 2 3 4 1 5 5 2 3 5 ...
## $ xybar : int 7 8 7 6 11 8 6 10 12 7 ...
## $ x2ybar : int 6 2 3 8 6 4 5 2 5 6 ...
## $ xy2bar : int 6 8 9 6 3 8 7 9 4 6 ...
## $ xedge : int 2 1 2 6 0 6 3 2 4 3 ...
## $ xedgeycor: int 8 6 7 11 10 6 7 6 10 9 ...
## $ yedge : int 7 2 5 8 4 7 5 3 4 8 ...
## $ yedgexcor: int 10 7 11 7 8 7 8 8 8 9 ...
## $ isB : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...
set.seed(1000)
split_letters <- sample.split(letters$isB, SplitRatio = 0.5)
train_letters <- subset(letters, split_letters == TRUE)
test_letters <- subset(letters, split_letters == FALSE)
table(test_letters$isB)/nrow(test_letters)
##
## FALSE TRUE
## 0.754172 0.245828
Now build a classification tree to predict whether a letter is a B or not, using the training set to build your model. Remember to remove the variable “letter” out of the model, as this is related to what we are trying to predict! To just remove one variable, you can either write out the other variables, or remember what we did in the Billboards problem in Week 3, and use the following notation:
CARTb = rpart(isB ~ . - letter, data=train, method=“class”)
We are just using the default parameters in our CART model, so we don’t need to add the minbucket or cp arguments at all. We also added the argument method=“class” since this is a classification problem.
What is the accuracy of the CART model on the test set? (Use type=“class” when making predictions on the test set.)
CARTb <- rpart(isB ~ . - letter, data=train_letters, method="class")
pred_cartB <- predict(CARTb, newdata = test_letters, type = 'class')
table(pred_cartB, test_letters$isB)
##
## pred_cartB FALSE TRUE
## FALSE 1118 43
## TRUE 57 340
(1118+340)/nrow(test_letters)
## [1] 0.9358151
Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!). Use the default settings for ntree and nodesize (don’t include these arguments at all). Right before building the model, set the seed to 1000. (NOTE: You might get a slightly different answer on this problem, even if you set the random seed. This has to do with your operating system and the implementation of the random forest algorithm.)
What is the accuracy of the model on the test set?
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
RForestB <- randomForest(isB ~ . - letter, data=train_letters)
Pred_RforestB <- predict(RForestB, newdata = test_letters)
table(Pred_RforestB, test_letters$isB)
##
## Pred_RforestB FALSE TRUE
## FALSE 1164 9
## TRUE 11 374
(1164+374)/nrow(test_letters)
## [1] 0.987163
Predicting the letters A, B, P, R 0.0/2.0 points (graded)
Let us now move on to the problem that we were originally interested in, which is to predict whether or not a letter is one of the four letters A, B, P or R.
As we saw in the D2Hawkeye lecture, building a multiclass classification CART model in R is no harder than building the models for binary classification problems. Fortunately, building a random forest model is just as easy.
The variable in our data frame which we will be trying to predict is “letter”. Start by converting letter in the original data set (letters) to a factor by running the following command in R:
letters\(letter = as.factor( letters\)letter )
Now, generate new training and testing sets of the letters data frame using letters$letter as the first input to the sample.split function. Before splitting, set your seed to 2000. Again put 50% of the data in the training set. (Why do we need to split the data again? Remember that sample.split balances the outcome variable in the training and testing sets. With a new outcome variable, we want to re-generate our split.)
In a multiclass classification problem, a simple baseline model is to predict the most frequent class of all of the options.
What is the baseline accuracy on the testing set?
library(kableExtra)
letters$letter <- as.factor(letters$letter)
str(letters)
## 'data.frame': 3116 obs. of 18 variables:
## $ letter : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
## $ xbox : int 4 1 5 5 3 8 2 3 8 6 ...
## $ ybox : int 2 1 9 9 6 10 6 7 14 10 ...
## $ width : int 5 3 5 7 4 8 4 5 7 8 ...
## $ height : int 4 2 7 7 4 6 4 5 8 8 ...
## $ onpix : int 4 1 6 10 2 6 3 3 4 7 ...
## $ xbar : int 8 8 6 9 4 7 6 12 5 8 ...
## $ ybar : int 7 2 11 8 14 7 7 2 10 5 ...
## $ x2bar : int 6 2 7 4 8 3 5 3 6 7 ...
## $ y2bar : int 6 2 3 4 1 5 5 2 3 5 ...
## $ xybar : int 7 8 7 6 11 8 6 10 12 7 ...
## $ x2ybar : int 6 2 3 8 6 4 5 2 5 6 ...
## $ xy2bar : int 6 8 9 6 3 8 7 9 4 6 ...
## $ xedge : int 2 1 2 6 0 6 3 2 4 3 ...
## $ xedgeycor: int 8 6 7 11 10 6 7 6 10 9 ...
## $ yedge : int 7 2 5 8 4 7 5 3 4 8 ...
## $ yedgexcor: int 10 7 11 7 8 7 8 8 8 9 ...
## $ isB : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 2 1 1 1 1 1 1 ...
set.seed(2000)
split_2 <- sample.split(letters$letter, SplitRatio = 0.5)
train_2 <- subset(letters, split_2 == TRUE)
test_2 <- subset(letters, split_2 == FALSE)
str(test_2)
## 'data.frame': 1558 obs. of 18 variables:
## $ letter : Factor w/ 4 levels "A","B","P","R": 2 4 1 3 1 4 1 1 1 2 ...
## $ xbox : int 5 2 3 8 3 6 2 2 2 4 ...
## $ ybox : int 9 6 7 14 8 9 1 4 8 2 ...
## $ width : int 7 4 5 7 5 5 4 4 4 5 ...
## $ height : int 7 4 5 8 6 4 2 3 6 4 ...
## $ onpix : int 10 3 3 4 3 3 1 2 2 4 ...
## $ xbar : int 9 6 12 5 9 10 8 10 12 7 ...
## $ ybar : int 8 7 2 10 2 6 1 2 2 7 ...
## $ x2bar : int 4 5 3 6 2 5 2 2 4 5 ...
## $ y2bar : int 4 5 2 3 3 5 2 2 3 6 ...
## $ xybar : int 6 6 10 12 8 10 7 9 11 7 ...
## $ x2ybar : int 8 5 2 5 2 2 2 2 2 6 ...
## $ xy2bar : int 6 7 9 4 8 8 8 9 10 6 ...
## $ xedge : int 6 3 2 4 2 6 2 2 3 2 ...
## $ xedgeycor: int 11 7 6 10 6 6 5 7 6 8 ...
## $ yedge : int 8 5 3 4 3 4 2 3 3 7 ...
## $ yedgexcor: int 7 8 8 8 7 9 7 9 9 10 ...
## $ isB : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 2 ...
#Accuracy of baseline
table(test_2$letter)
##
## A B P R
## 395 383 401 379
sum(401)/nrow(test_2)
## [1] 0.2573813
y <- table(test_2$letter)
kable(y)
| Var1 | Freq |
|---|---|
| A | 395 |
| B | 383 |
| P | 401 |
| R | 379 |
max(diag(y))/sum(y)
## [1] 0.2573813
Now build a classification tree to predict “letter”, using the training set to build your model. You should use all of the other variables as independent variables, except “isB”, since it is related to what we are trying to predict! Just use the default parameters in your CART model. Add the argument method=“class” since this is a classification problem. Even though we have multiple classes here, nothing changes in how we build the model from the binary case.
What is the test set accuracy of your CART model? Use the argument type=“class” when making predictions.
(HINT: When you are computing the test set accuracy using the confusion matrix, you want to add everything on the main diagonal and divide by the total number of observations in the test set, which can be computed with nrow(test), where test is the name of your test set).
CARTc <- rpart(letter ~ . - isB, data = train_2)
predicted3 <- predict(CARTc, newdata = test_2, type = 'class')
table(predicted3, test_2$letter)
##
## predicted3 A B P R
## A 348 8 2 10
## B 4 318 21 24
## P 0 12 363 5
## R 43 45 15 340
(354+309+362+352)/nrow(test_2)
## [1] 0.8838254
Now build a random forest model on the training data, using the same independent variables as in the previous problem – again, don’t forget to remove the isB variable. Just use the default parameter values for ntree and nodesize (you don’t need to include these arguments at all). Set the seed to 1000 right before building your model. (Remember that you might get a slightly different result even if you set the random seed.)
What is the test set accuracy of your random forest model?
RanFor2 <- randomForest(letter ~ . - isB, data = train_2)
predRF_2 <- predict(RanFor2, newdata = test_2)
table(predRF_2, test_2$letter)
##
## predRF_2 A B P R
## A 391 0 0 0
## B 0 379 4 12
## P 3 2 394 0
## R 1 2 3 367
z <- table(predRF_2, test_2$letter)
acc_RF <- sum(diag(z))/sum(z)
acc_RF
## [1] 0.9826701
The United States government periodically collects demographic information by conducting a census.
In this problem, we are going to use census information about an individual to predict how much a person earns – in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.
census <- read.csv("census.csv")
census$over50k <- ifelse(census$over50k == " <=50K", 0, 1)
str(census)
## 'data.frame': 31978 obs. of 13 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr " State-gov" " Self-emp-not-inc" " Private" " Private" ...
## $ education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
## $ maritalstatus: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
## $ occupation : chr " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
## $ relationship : chr " Not-in-family" " Husband" " Not-in-family" " Husband" ...
## $ race : chr " White" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ nativecountry: chr " United-States" " United-States" " United-States" " United-States" ...
## $ over50k : num 0 0 0 0 0 0 0 1 1 1 ...
summary(census)
## age workclass education maritalstatus
## Min. :17.00 Length:31978 Length:31978 Length:31978
## 1st Qu.:28.00 Class :character Class :character Class :character
## Median :37.00 Mode :character Mode :character Mode :character
## Mean :38.58
## 3rd Qu.:48.00
## Max. :90.00
## occupation relationship race sex
## Length:31978 Length:31978 Length:31978 Length:31978
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## capitalgain capitalloss hoursperweek nativecountry
## Min. : 0 Min. : 0.00 Min. : 1.00 Length:31978
## 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:40.00 Class :character
## Median : 0 Median : 0.00 Median :40.00 Mode :character
## Mean : 1064 Mean : 86.74 Mean :40.42
## 3rd Qu.: 0 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :99999 Max. :4356.00 Max. :99.00
## over50k
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2406
## 3rd Qu.:0.0000
## Max. :1.0000
# Splitting the data
set.seed(2000)
split_census <- sample.split(census$over50k, SplitRatio = 0.6)
train_census <- subset(census, split_census == TRUE)
test_census <- subset(census, split_census == FALSE)
# Logistic regression
logCensus <- glm(over50k ~ ., family = "binomial", data = train_census)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logCensus)
##
## Call:
## glm(formula = over50k ~ ., family = "binomial", data = train_census)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1065 -0.5037 -0.1804 -0.0008 3.3383
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.658e+00 1.379e+00 -6.279 3.41e-10
## age 2.548e-02 2.139e-03 11.916 < 2e-16
## workclass Federal-gov 1.105e+00 2.014e-01 5.489 4.03e-08
## workclass Local-gov 3.675e-01 1.821e-01 2.018 0.043641
## workclass Never-worked -1.283e+01 8.453e+02 -0.015 0.987885
## workclass Private 6.012e-01 1.626e-01 3.698 0.000218
## workclass Self-emp-inc 7.575e-01 1.950e-01 3.884 0.000103
## workclass Self-emp-not-inc 1.855e-01 1.774e-01 1.046 0.295646
## workclass State-gov 4.012e-01 1.961e-01 2.046 0.040728
## workclass Without-pay -1.395e+01 6.597e+02 -0.021 0.983134
## education 11th 2.225e-01 2.867e-01 0.776 0.437738
## education 12th 6.380e-01 3.597e-01 1.774 0.076064
## education 1st-4th -7.075e-01 7.760e-01 -0.912 0.361897
## education 5th-6th -3.170e-01 4.880e-01 -0.650 0.516008
## education 7th-8th -3.498e-01 3.126e-01 -1.119 0.263152
## education 9th -1.258e-01 3.539e-01 -0.355 0.722228
## education Assoc-acdm 1.602e+00 2.427e-01 6.601 4.10e-11
## education Assoc-voc 1.541e+00 2.368e-01 6.506 7.74e-11
## education Bachelors 2.177e+00 2.218e-01 9.817 < 2e-16
## education Doctorate 2.761e+00 2.893e-01 9.544 < 2e-16
## education HS-grad 1.006e+00 2.169e-01 4.638 3.52e-06
## education Masters 2.421e+00 2.353e-01 10.289 < 2e-16
## education Preschool -2.237e+01 6.864e+02 -0.033 0.973996
## education Prof-school 2.938e+00 2.753e-01 10.672 < 2e-16
## education Some-college 1.365e+00 2.195e-01 6.219 5.00e-10
## maritalstatus Married-AF-spouse 2.540e+00 7.145e-01 3.555 0.000378
## maritalstatus Married-civ-spouse 2.458e+00 3.573e-01 6.880 6.00e-12
## maritalstatus Married-spouse-absent -9.486e-02 3.204e-01 -0.296 0.767155
## maritalstatus Never-married -4.515e-01 1.139e-01 -3.962 7.42e-05
## maritalstatus Separated 3.609e-02 1.984e-01 0.182 0.855672
## maritalstatus Widowed 1.858e-01 1.962e-01 0.947 0.343449
## occupation Adm-clerical 9.470e-02 1.288e-01 0.735 0.462064
## occupation Armed-Forces -1.008e+00 1.487e+00 -0.677 0.498170
## occupation Craft-repair 2.174e-01 1.109e-01 1.960 0.049972
## occupation Exec-managerial 9.400e-01 1.138e-01 8.257 < 2e-16
## occupation Farming-fishing -1.068e+00 1.908e-01 -5.599 2.15e-08
## occupation Handlers-cleaners -6.237e-01 1.946e-01 -3.204 0.001353
## occupation Machine-op-inspct -1.862e-01 1.376e-01 -1.353 0.176061
## occupation Other-service -8.183e-01 1.641e-01 -4.987 6.14e-07
## occupation Priv-house-serv -1.297e+01 2.267e+02 -0.057 0.954385
## occupation Prof-specialty 6.331e-01 1.222e-01 5.180 2.22e-07
## occupation Protective-serv 6.267e-01 1.710e-01 3.664 0.000248
## occupation Sales 3.276e-01 1.175e-01 2.789 0.005282
## occupation Tech-support 6.173e-01 1.533e-01 4.028 5.63e-05
## occupation Transport-moving NA NA NA NA
## relationship Not-in-family 7.881e-01 3.530e-01 2.233 0.025562
## relationship Other-relative -2.194e-01 3.137e-01 -0.699 0.484263
## relationship Own-child -7.489e-01 3.507e-01 -2.136 0.032716
## relationship Unmarried 7.041e-01 3.720e-01 1.893 0.058392
## relationship Wife 1.324e+00 1.331e-01 9.942 < 2e-16
## race Asian-Pac-Islander 4.830e-01 3.548e-01 1.361 0.173504
## race Black 3.644e-01 2.882e-01 1.265 0.206001
## race Other 2.204e-01 4.513e-01 0.488 0.625263
## race White 4.108e-01 2.737e-01 1.501 0.133356
## sex Male 7.729e-01 1.024e-01 7.545 4.52e-14
## capitalgain 3.280e-04 1.372e-05 23.904 < 2e-16
## capitalloss 6.445e-04 4.854e-05 13.277 < 2e-16
## hoursperweek 2.897e-02 2.101e-03 13.791 < 2e-16
## nativecountry Canada 2.593e-01 1.308e+00 0.198 0.842879
## nativecountry China -9.695e-01 1.327e+00 -0.730 0.465157
## nativecountry Columbia -1.954e+00 1.526e+00 -1.280 0.200470
## nativecountry Cuba 5.735e-02 1.323e+00 0.043 0.965432
## nativecountry Dominican-Republic -1.435e+01 3.092e+02 -0.046 0.962972
## nativecountry Ecuador -3.550e-02 1.477e+00 -0.024 0.980829
## nativecountry El-Salvador -6.095e-01 1.395e+00 -0.437 0.662181
## nativecountry England -6.707e-02 1.327e+00 -0.051 0.959686
## nativecountry France 5.301e-01 1.419e+00 0.374 0.708642
## nativecountry Germany 5.474e-02 1.306e+00 0.042 0.966572
## nativecountry Greece -2.646e+00 1.714e+00 -1.544 0.122527
## nativecountry Guatemala -1.293e+01 3.345e+02 -0.039 0.969180
## nativecountry Haiti -9.221e-01 1.615e+00 -0.571 0.568105
## nativecountry Holand-Netherlands -1.282e+01 2.400e+03 -0.005 0.995736
## nativecountry Honduras -9.584e-01 3.412e+00 -0.281 0.778775
## nativecountry Hong -2.362e-01 1.492e+00 -0.158 0.874155
## nativecountry Hungary 1.412e-01 1.555e+00 0.091 0.927653
## nativecountry India -8.218e-01 1.314e+00 -0.625 0.531661
## nativecountry Iran -3.299e-02 1.366e+00 -0.024 0.980736
## nativecountry Ireland 1.579e-01 1.473e+00 0.107 0.914628
## nativecountry Italy 6.100e-01 1.333e+00 0.458 0.647194
## nativecountry Jamaica -2.279e-01 1.387e+00 -0.164 0.869467
## nativecountry Japan 5.072e-01 1.375e+00 0.369 0.712179
## nativecountry Laos -6.831e-01 1.661e+00 -0.411 0.680866
## nativecountry Mexico -9.182e-01 1.303e+00 -0.705 0.481103
## nativecountry Nicaragua -1.987e-01 1.507e+00 -0.132 0.895132
## nativecountry Outlying-US(Guam-USVI-etc) -1.373e+01 8.502e+02 -0.016 0.987115
## nativecountry Peru -9.660e-01 1.678e+00 -0.576 0.564797
## nativecountry Philippines 4.393e-02 1.281e+00 0.034 0.972640
## nativecountry Poland 2.410e-01 1.383e+00 0.174 0.861624
## nativecountry Portugal 7.276e-01 1.477e+00 0.493 0.622327
## nativecountry Puerto-Rico -5.769e-01 1.357e+00 -0.425 0.670837
## nativecountry Scotland -1.188e+00 1.719e+00 -0.691 0.489616
## nativecountry South -8.183e-01 1.341e+00 -0.610 0.541809
## nativecountry Taiwan -2.590e-01 1.350e+00 -0.192 0.847878
## nativecountry Thailand -1.693e+00 1.737e+00 -0.975 0.329678
## nativecountry Trinadad&Tobago -1.346e+00 1.721e+00 -0.782 0.434105
## nativecountry United-States -8.594e-02 1.269e+00 -0.068 0.946020
## nativecountry Vietnam -1.008e+00 1.523e+00 -0.662 0.507799
## nativecountry Yugoslavia 1.402e+00 1.648e+00 0.851 0.394874
##
## (Intercept) ***
## age ***
## workclass Federal-gov ***
## workclass Local-gov *
## workclass Never-worked
## workclass Private ***
## workclass Self-emp-inc ***
## workclass Self-emp-not-inc
## workclass State-gov *
## workclass Without-pay
## education 11th
## education 12th .
## education 1st-4th
## education 5th-6th
## education 7th-8th
## education 9th
## education Assoc-acdm ***
## education Assoc-voc ***
## education Bachelors ***
## education Doctorate ***
## education HS-grad ***
## education Masters ***
## education Preschool
## education Prof-school ***
## education Some-college ***
## maritalstatus Married-AF-spouse ***
## maritalstatus Married-civ-spouse ***
## maritalstatus Married-spouse-absent
## maritalstatus Never-married ***
## maritalstatus Separated
## maritalstatus Widowed
## occupation Adm-clerical
## occupation Armed-Forces
## occupation Craft-repair *
## occupation Exec-managerial ***
## occupation Farming-fishing ***
## occupation Handlers-cleaners **
## occupation Machine-op-inspct
## occupation Other-service ***
## occupation Priv-house-serv
## occupation Prof-specialty ***
## occupation Protective-serv ***
## occupation Sales **
## occupation Tech-support ***
## occupation Transport-moving
## relationship Not-in-family *
## relationship Other-relative
## relationship Own-child *
## relationship Unmarried .
## relationship Wife ***
## race Asian-Pac-Islander
## race Black
## race Other
## race White
## sex Male ***
## capitalgain ***
## capitalloss ***
## hoursperweek ***
## nativecountry Canada
## nativecountry China
## nativecountry Columbia
## nativecountry Cuba
## nativecountry Dominican-Republic
## nativecountry Ecuador
## nativecountry El-Salvador
## nativecountry England
## nativecountry France
## nativecountry Germany
## nativecountry Greece
## nativecountry Guatemala
## nativecountry Haiti
## nativecountry Holand-Netherlands
## nativecountry Honduras
## nativecountry Hong
## nativecountry Hungary
## nativecountry India
## nativecountry Iran
## nativecountry Ireland
## nativecountry Italy
## nativecountry Jamaica
## nativecountry Japan
## nativecountry Laos
## nativecountry Mexico
## nativecountry Nicaragua
## nativecountry Outlying-US(Guam-USVI-etc)
## nativecountry Peru
## nativecountry Philippines
## nativecountry Poland
## nativecountry Portugal
## nativecountry Puerto-Rico
## nativecountry Scotland
## nativecountry South
## nativecountry Taiwan
## nativecountry Thailand
## nativecountry Trinadad&Tobago
## nativecountry United-States
## nativecountry Vietnam
## nativecountry Yugoslavia
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21175 on 19186 degrees of freedom
## Residual deviance: 12104 on 19090 degrees of freedom
## AIC: 12298
##
## Number of Fisher Scoring iterations: 15
What is the accuracy of the model on the testing set? Use a threshold of 0.5. (You might see a warning message when you make predictions on the test set - you can safely ignore it.)
predLogEarnings <- predict(logCensus, newdata = test_census, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
table(test_census$over50k, predLogEarnings >= 0.5)
##
## FALSE TRUE
## 0 9051 662
## 1 1190 1888
(9051+1888)/nrow(test_census)
## [1] 0.8552107
# Baseline accuracy
table(test_census$over50k)/nrow(test_census)
##
## 0 1
## 0.7593621 0.2406379
What is the area-under-the-curve (AUC) for this model on the test set?
library(ROCR)
ROCRpred <- prediction(predLogEarnings, test_census$over50k)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9061598
perf1 <- performance(ROCRpred, "tpr", "fpr")
plot(perf1)
We have just seen how the logistic regression model for this data achieves a high accuracy. Moreover, the significances of the variables give us a way to gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others, especially due to the large number of factor variables in this problem.
Let us now build a classification tree to predict “over50k”. Use the training set to build the model, and all of the other variables as independent variables. Use the default parameters, so don’t set a value for minbucket or cp. Remember to specify method=“class” as an argument to rpart, since this is a classification problem. After you are done building the model, plot the resulting tree.
How many splits does the tree have in total?
censusTree <- rpart(over50k ~ ., data = train_census, method = "class")
prp(censusTree)
# Accuracy
predTreeCensus <- predict(censusTree, newdata = test_census, type = "class")
table(predTreeCensus, test_census$over50k)
##
## predTreeCensus 0 1
## 0 9243 1482
## 1 470 1596
(9243+1596)/nrow(test_census)
## [1] 0.8473927
Let us now consider the ROC curve and AUC for the CART model on the test set. You will need to get predicted probabilities for the observations in the test set to build the ROC curve and compute the AUC. Remember that you can do this by removing the type=“class” argument when making predictions, and taking the second column of the resulting object.
Plot the ROC curve for the CART model you have estimated. Observe that compared to the logistic regression ROC curve, the CART ROC curve is less smooth than the logistic regression ROC curve. Which of the following explanations for this behavior is most correct? (HINT: Think about what the ROC curve is plotting and what changing the threshold does.)
predROC2 <- predict(censusTree, newdata = test_census)
ROCpred2 <- prediction(predROC2[,2], test_census$over50k)
as.numeric(performance(ROCpred2, "auc")@y.values)
## [1] 0.8470256
# Lets now plot the curve
perf <- performance(ROCpred2, "tpr", "fpr")
plot(perf)
Before building a random forest model, we’ll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model since random forests is much more computationally intensive than CART or Logistic Regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, that contains 2000 randomly selected obervations from the original training set. Do this by running the following commands in your R console (assuming your training set is called “train”):
set.seed(1)
trainSmall = train[sample(nrow(train), 2000), ]
set.seed(1)
trainSmall <- train_census[sample(nrow(train_census), 2000), ]
str(trainSmall)
## 'data.frame': 2000 obs. of 13 variables:
## $ age : int 40 23 47 32 49 30 56 24 60 31 ...
## $ workclass : chr " Private" " Private" " Private" " Private" ...
## $ education : chr " Bachelors" " 11th" " Bachelors" " Some-college" ...
## $ maritalstatus: chr " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" " Never-married" ...
## $ occupation : chr " Exec-managerial" " Transport-moving" " Exec-managerial" " Tech-support" ...
## $ relationship : chr " Husband" " Husband" " Husband" " Not-in-family" ...
## $ race : chr " Asian-Pac-Islander" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capitalgain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 50 38 38 70 40 40 65 50 40 ...
## $ nativecountry: chr " China" " United-States" " United-States" " United-States" ...
## $ over50k : num 1 0 1 0 1 1 0 0 1 0 ...
trainSmall$over50k <- as.factor(trainSmall$over50k)
RF_census <- randomForest(over50k ~ ., data = trainSmall)
predRCcensus <- predict(RF_census, newdata = test_census)
table(predRCcensus, test_census$over50k)
##
## predRCcensus 0 1
## 0 9146 1538
## 1 567 1540
(9146+1611)/nrow(test_census)
## [1] 0.8409819
As we discussed in lecture, random forest models work by building a large collection of trees. As a result, we lose some of the interpretability that comes with CART in terms of seeing how predictions are made and which variables are important. However, we can still compute metrics that give us insight into which variables are important.
One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split. To view this metric, run the following lines of R code (replace “MODEL” with the name of your random forest model):
vu <- varUsed(RF_census, count=TRUE)
vusorted <- sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, names(RF_census$forest$xlevels[vusorted$ix]))
A different metric we can look at is related to “impurity”, which measures how homogenous each bucket or leaf of the tree is. In each tree in the forest, whenever we select a variable and perform a split, the impurity is decreased. Therefore, one way to measure the importance of a variable is to average the reduction in impurity, taken over all the times that variable is selected for splitting in all of the trees in the forest. To compute this metric, run the following command in R (replace “MODEL” with the name of your random forest model):
varImpPlot(MODEL)
Which one of the following variables is the most important in terms of mean reduction in impurity?
varImpPlot(RF_census)
We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.
Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. Do this by using the train function. Set the seed beforehand to 2. Test cp values from 0.002 to 0.1 in 0.002 increments, by using the following command:
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))
Also, remember to use the entire training set “train” when building this model. The train function might take some time to run.
Which value of cp does the train function recommend?
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
numFolds <- trainControl(method = "cv", number = 10)
cartGrid <- expand.grid( .cp = seq(0.002,0.1,0.002))
str(train_census)
## 'data.frame': 19187 obs. of 13 variables:
## $ age : int 39 38 53 37 31 42 25 32 38 54 ...
## $ workclass : chr " State-gov" " Private" " Private" " Private" ...
## $ education : chr " Bachelors" " HS-grad" " 11th" " Masters" ...
## $ maritalstatus: chr " Never-married" " Divorced" " Married-civ-spouse" " Married-civ-spouse" ...
## $ occupation : chr " Adm-clerical" " Handlers-cleaners" " Handlers-cleaners" " Exec-managerial" ...
## $ relationship : chr " Not-in-family" " Not-in-family" " Husband" " Wife" ...
## $ race : chr " White" " White" " Black" " White" ...
## $ sex : chr " Male" " Male" " Male" " Female" ...
## $ capitalgain : int 2174 0 0 0 14084 5178 0 0 0 0 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 40 40 40 50 40 35 40 50 20 ...
## $ nativecountry: chr " United-States" " United-States" " United-States" " United-States" ...
## $ over50k : num 0 0 0 0 1 1 0 0 0 0 ...
train_census$over50k <- as.factor(train_census$over50k)
train(over50k ~ ., data = train_census, method = "rpart", trControl = numFolds, tuneGrid = cartGrid)
## CART
##
## 19187 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17268, 17268, 17268, 17269, 17269, 17269, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.002 0.8520870 0.55975128
## 0.004 0.8492205 0.55821110
## 0.006 0.8458331 0.54275902
## 0.008 0.8446863 0.54226352
## 0.010 0.8448426 0.54338388
## 0.012 0.8448426 0.54338388
## 0.014 0.8448426 0.54338388
## 0.016 0.8429661 0.53426816
## 0.018 0.8417155 0.52616378
## 0.020 0.8398912 0.51464080
## 0.022 0.8387966 0.50559250
## 0.024 0.8387966 0.50559250
## 0.026 0.8387966 0.50559250
## 0.028 0.8387966 0.50559250
## 0.030 0.8387966 0.50559250
## 0.032 0.8381188 0.50037344
## 0.034 0.8354091 0.48221719
## 0.036 0.8324904 0.46493345
## 0.038 0.8265489 0.43853941
## 0.040 0.8248811 0.43137353
## 0.042 0.8248811 0.43137353
## 0.044 0.8248811 0.43137353
## 0.046 0.8248811 0.43137353
## 0.048 0.8248811 0.43137353
## 0.050 0.8195128 0.39351430
## 0.052 0.8181579 0.38123605
## 0.054 0.8131548 0.32749986
## 0.056 0.8118517 0.30765346
## 0.058 0.8118517 0.30765346
## 0.060 0.8118517 0.30765346
## 0.062 0.8118517 0.30765346
## 0.064 0.8118517 0.30765346
## 0.066 0.8077861 0.28466355
## 0.068 0.7996551 0.23718533
## 0.070 0.7958511 0.21454287
## 0.072 0.7958511 0.21454287
## 0.074 0.7958511 0.21454287
## 0.076 0.7757344 0.09711952
## 0.078 0.7593684 0.00000000
## 0.080 0.7593684 0.00000000
## 0.082 0.7593684 0.00000000
## 0.084 0.7593684 0.00000000
## 0.086 0.7593684 0.00000000
## 0.088 0.7593684 0.00000000
## 0.090 0.7593684 0.00000000
## 0.092 0.7593684 0.00000000
## 0.094 0.7593684 0.00000000
## 0.096 0.7593684 0.00000000
## 0.098 0.7593684 0.00000000
## 0.100 0.7593684 0.00000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002.
Fit a CART model to the training data using this value of cp. What is the prediction accuracy on the test set?
CART_upd <- rpart(over50k ~ ., data = train_census, method = "class", cp = 0.002)
pred_upt <- predict(CART_upd, newdata = test_census, type = "class")
table(pred_upt, test_census$over50k)
##
## pred_upt 0 1
## 0 9178 1240
## 1 535 1838
(9178+1838)/nrow(test_census)
## [1] 0.8612306
prp(CART_upd)