Predicting Earnings from Census Data

Background Information on the Dataset

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns – in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.

The file census.csv contains 1994 census data for 31,978 individuals in the United States.

The dataset includes the following 13 variables:

  • age = the age of the individual in years
  • workclass = the classification of the individual’s working status (does the person work for the federal government, work for the local government, work without pay, and so on)
  • education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, and so on)
  • maritalstatus = the marital status of the individual
  • occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)
  • relationship = relationship of individual to his/her household
  • race = the individual’s race
  • sex = the individual’s sex
  • capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
  • capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
  • hoursperweek = the number of hours the individual works per week
  • nativecountry = the native country of the individual
  • over50k = whether or not the individual earned more than $50,000 in 1994

A Logistic Regression Model

Let’s begin by building a logistic regression model to predict whether an individual’s earnings are above $50,000 (the variable “over50k”) using all of the other variables as independent variables. First, read the dataset census.csv into R.

# Load the dataset
census = read.csv("census.csv")
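
Before fitting any models, it can help to take a quick look at the data. This is an optional sanity check rather than part of the assignment:

# Optional: inspect the structure of the data and the outcome distribution
str(census)
table(census$over50k)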

Then, split the data randomly into a training set and a testing set, setting the seed to 2000 before creating the split. Split the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.

# Split the data
library(caTools)
set.seed(2000)
spl = sample.split(census$over50k, SplitRatio = 0.6)
train = subset(census, spl==TRUE)
test = subset(census, spl==FALSE)

Next, build a logistic regression model to predict the dependent variable “over50k”, using all of the other variables in the dataset as independent variables. Use the training set to build the model.

# Logistic Regression
censusglm = glm( over50k ~ . , family="binomial", data = train)

Which variables are significant, or have factors that are significant?

# Output summary
summary(censusglm)
## 
## Call:
## glm(formula = over50k ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1065  -0.5037  -0.1804  -0.0008   3.3383  
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate    Std. Error z value Pr(>|z|)    
## (Intercept)                                -8.65806856    1.37887064  -6.279 3.41e-10 ***
## age                                         0.02548377    0.00213862  11.916  < 2e-16 ***
## workclass Federal-gov                       1.10544684    0.20138058   5.489 4.03e-08 ***
## workclass Local-gov                         0.36745908    0.18213399   2.018 0.043641 *  
## workclass Never-worked                    -12.83463553  845.25237025  -0.015 0.987885    
## workclass Private                           0.60116720    0.16257796   3.698 0.000218 ***
## workclass Self-emp-inc                      0.75751201    0.19504818   3.884 0.000103 ***
## workclass Self-emp-not-inc                  0.18550593    0.17737915   1.046 0.295646    
## workclass State-gov                         0.40122758    0.19607576   2.046 0.040728 *  
## workclass Without-pay                     -13.94656116  659.74171825  -0.021 0.983134    
## education 11th                              0.22249968    0.28671982   0.776 0.437738    
## education 12th                              0.63803139    0.35965743   1.774 0.076064 .  
## education 1st-4th                          -0.70752228    0.77599976  -0.912 0.361897    
## education 5th-6th                          -0.31697638    0.48802273  -0.650 0.516008    
## education 7th-8th                          -0.34983906    0.31264328  -1.119 0.263152    
## education 9th                              -0.12582244    0.35394791  -0.355 0.722228    
## education Assoc-acdm                        1.60181447    0.24267843   6.601 4.10e-11 ***
## education Assoc-voc                         1.54077090    0.23683860   6.506 7.74e-11 ***
## education Bachelors                         2.17710549    0.22175847   9.817  < 2e-16 ***
## education Doctorate                         2.76090536    0.28929328   9.544  < 2e-16 ***
## education HS-grad                           1.00595477    0.21689434   4.638 3.52e-06 ***
## education Masters                           2.42099520    0.23530362  10.289  < 2e-16 ***
## education Preschool                       -22.37381585  686.38351401  -0.033 0.973996    
## education Prof-school                       2.93796400    0.27529757  10.672  < 2e-16 ***
## education Some-college                      1.36510102    0.21949622   6.219 5.00e-10 ***
## maritalstatus Married-AF-spouse             2.53981252    0.71446421   3.555 0.000378 ***
## maritalstatus Married-civ-spouse            2.45775343    0.35725463   6.880 6.00e-12 ***
## maritalstatus Married-spouse-absent        -0.09486159    0.32037254  -0.296 0.767155    
## maritalstatus Never-married                -0.45145986    0.11393384  -3.962 7.42e-05 ***
## maritalstatus Separated                     0.03609187    0.19843100   0.182 0.855672    
## maritalstatus Widowed                       0.18583975    0.19616353   0.947 0.343449    
## occupation Adm-clerical                     0.09470363    0.12876933   0.735 0.462064    
## occupation Armed-Forces                    -1.00754573    1.48743317  -0.677 0.498170    
## occupation Craft-repair                     0.21738177    0.11089755   1.960 0.049972 *  
## occupation Exec-managerial                  0.94002392    0.11384465   8.257  < 2e-16 ***
## occupation Farming-fishing                 -1.06829846    0.19079718  -5.599 2.15e-08 ***
## occupation Handlers-cleaners               -0.62368391    0.19463199  -3.204 0.001353 ** 
## occupation Machine-op-inspct               -0.18615509    0.13758876  -1.353 0.176061    
## occupation Other-service                   -0.81834269    0.16410606  -4.987 6.14e-07 ***
## occupation Priv-house-serv                -12.96803654  226.71118695  -0.057 0.954385    
## occupation Prof-specialty                   0.63312756    0.12223334   5.180 2.22e-07 ***
## occupation Protective-serv                  0.62671952    0.17103196   3.664 0.000248 ***
## occupation Sales                            0.32763049    0.11745841   2.789 0.005282 ** 
## occupation Tech-support                     0.61726217    0.15325185   4.028 5.63e-05 ***
## occupation Transport-moving                         NA            NA      NA       NA    
## relationship Not-in-family                  0.78813297    0.35297883   2.233 0.025562 *  
## relationship Other-relative                -0.21941041    0.31368455  -0.699 0.484263    
## relationship Own-child                     -0.74889371    0.35067956  -2.136 0.032716 *  
## relationship Unmarried                      0.70405924    0.37197780   1.893 0.058392 .  
## relationship Wife                           1.32352917    0.13312279   9.942  < 2e-16 ***
## race Asian-Pac-Islander                     0.48295113    0.35484186   1.361 0.173504    
## race Black                                  0.36440913    0.28815292   1.265 0.206001    
## race Other                                  0.22042314    0.45131248   0.488 0.625263    
## race White                                  0.41078057    0.27367168   1.501 0.133356    
## sex Male                                    0.77292572    0.10243965   7.545 4.52e-14 ***
## capitalgain                                 0.00032798    0.00001372  23.904  < 2e-16 ***
## capitalloss                                 0.00064450    0.00004854  13.277  < 2e-16 ***
## hoursperweek                                0.02896865    0.00210057  13.791  < 2e-16 ***
## nativecountry Canada                        0.25929834    1.30818152   0.198 0.842879    
## nativecountry China                        -0.96945665    1.32733033  -0.730 0.465157    
## nativecountry Columbia                     -1.95361879    1.52601139  -1.280 0.200470    
## nativecountry Cuba                          0.05734620    1.32323290   0.043 0.965432    
## nativecountry Dominican-Republic          -14.35418041  309.19185105  -0.046 0.962972    
## nativecountry Ecuador                      -0.03550051    1.47738336  -0.024 0.980829    
## nativecountry El-Salvador                  -0.60945436    1.39493987  -0.437 0.662181    
## nativecountry England                      -0.06706761    1.32683400  -0.051 0.959686    
## nativecountry France                        0.53008784    1.41856078   0.374 0.708642    
## nativecountry Germany                       0.05474290    1.30627868   0.042 0.966572    
## nativecountry Greece                       -2.64627293    1.71362412  -1.544 0.122527    
## nativecountry Guatemala                   -12.92569990  334.54909412  -0.039 0.969180    
## nativecountry Haiti                        -0.92212818    1.61537713  -0.571 0.568105    
## nativecountry Holand-Netherlands          -12.82337048 2399.54508211  -0.005 0.995736    
## nativecountry Honduras                     -0.95841482    3.41174875  -0.281 0.778775    
## nativecountry Hong                         -0.23623081    1.49151299  -0.158 0.874155    
## nativecountry Hungary                       0.14123283    1.55545975   0.091 0.927653    
## nativecountry India                        -0.82182200    1.31392333  -0.625 0.531661    
## nativecountry Iran                         -0.03298582    1.36606650  -0.024 0.980736    
## nativecountry Ireland                       0.15789631    1.47287088   0.107 0.914628    
## nativecountry Italy                         0.61000236    1.33286056   0.458 0.647194    
## nativecountry Jamaica                      -0.22791497    1.38689277  -0.164 0.869467    
## nativecountry Japan                         0.50724320    1.37489895   0.369 0.712179    
## nativecountry Laos                         -0.68309366    1.66088924  -0.411 0.680866    
## nativecountry Mexico                       -0.91817817    1.30324870  -0.705 0.481103    
## nativecountry Nicaragua                    -0.19868163    1.50729853  -0.132 0.895132    
## nativecountry Outlying-US(Guam-USVI-etc)  -13.73047828  850.17734225  -0.016 0.987115    
## nativecountry Peru                         -0.96599945    1.67786523  -0.576 0.564797    
## nativecountry Philippines                   0.04393410    1.28095162   0.034 0.972640    
## nativecountry Poland                        0.24102292    1.38274807   0.174 0.861624    
## nativecountry Portugal                      0.72758113    1.47715725   0.493 0.622327    
## nativecountry Puerto-Rico                  -0.57685949    1.35731803  -0.425 0.670837    
## nativecountry Scotland                     -1.18758848    1.71885316  -0.691 0.489616    
## nativecountry South                        -0.81828503    1.34127636  -0.610 0.541809    
## nativecountry Taiwan                       -0.25901687    1.35026471  -0.192 0.847878    
## nativecountry Thailand                     -1.69321308    1.73705227  -0.975 0.329678    
## nativecountry Trinadad&Tobago              -1.34619400    1.72106413  -0.782 0.434105    
## nativecountry United-States                -0.08593726    1.26927474  -0.068 0.946020    
## nativecountry Vietnam                      -1.00849873    1.52279370  -0.662 0.507799    
## nativecountry Yugoslavia                    1.40179163    1.64759288   0.851 0.394874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21175  on 19186  degrees of freedom
## Residual deviance: 12104  on 19090  degrees of freedom
## AIC: 12298
## 
## Number of Fisher Scoring iterations: 15

All variables except race and nativecountry are significant; each of the others has at least one coefficient or factor level that is significant.

What is the accuracy of the model on the testing set? Use a threshold of 0.5.

# Make predictions
predictTest = predict(censusglm, newdata = test, type = "response")
# Confusion matrix
library(knitr)  # for kable()
z = table(test$over50k, predictTest >= 0.5)
kable(z)
         FALSE   TRUE
<=50K     9051    662
>50K      1190   1888
# Compute accuracy
sum(diag(z))/sum(z)
## [1] 0.8552107

Accuracy = 0.8552

What is the baseline accuracy for the testing set?

z = table(test$over50k)
kable(z)
Var1     Freq
<=50K    9713
>50K     3078
# Compute Accuracy
z[1]/sum(z)
##     <=50K 
## 0.7593621

Accuracy = 0.7594

What is the area-under-the-curve (AUC) for this model on the test set?

# Compute AUC
library(ROCR)
ROCRpred = prediction(predictTest, test$over50k)

as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9061598

AUC = 0.9062
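
Since we will compare this curve with the CART ROC curve later, it can also be useful to plot it now. A minimal sketch, reusing the ROCRpred object created above (perfLog is just an illustrative name):

# Optional: plot the logistic regression ROC curve for later comparison
perfLog = performance(ROCRpred, "tpr", "fpr")
plot(perfLog)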

A CART Model

We have just seen how the logistic regression model for this data achieves a high accuracy. Moreover, the significance of the variables gives us a way to gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others, especially due to the large number of factor variables in this problem.

Let us now build a classification tree to predict “over50k”. Use the training set to build the model, and all of the other variables as independent variables. Use the default parameters, so don’t set a value for minbucket or cp. Remember to specify method=“class” as an argument to rpart, since this is a classification problem. After you are done building the model, plot the resulting tree.

# CART Model
library(rpart)
library(rpart.plot)
censustree = rpart( over50k ~ . , method="class", data = train)

How many splits does the tree have in total?

# Plot tree
prp(censustree)

4 splits.

Which variable does the tree split on at the first level (the very first split of the tree)?

relationship.

Which variables does the tree split on at the second level (immediately after the first split of the tree)?

education and capitalgain.

What is the accuracy of the model on the testing set? Use a threshold of 0.5.

# Make predictions
predictTest = predict(censustree, newdata = test, type = "class")
# Confusion matrix
z = table(test$over50k, predictTest)
kable(z)
          <=50K   >50K
<=50K      9243    470
>50K       1482   1596
# Compute accuracy
sum(diag(z))/sum(z)
## [1] 0.8473927

Accuracy = 0.8474

Plot the ROC curve for the CART model you have estimated. Observe that the CART ROC curve is less smooth than the logistic regression ROC curve. Which of the following explanations for this behavior is most correct?

library(ROCR)
# Predicted probabilities for each class (column 2 is the probability of ">50K")
predictTest = predict(censustree, newdata = test)
ROCRpred = prediction(predictTest[,2], test$over50k)
# Calculate AUC
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.8470256

# Plot ROC curve
perf = performance(ROCRpred, "tpr", "fpr")
plot(perf)

The breakpoints of the curve correspond to the false and true positive rates when the threshold is set to the five possible probability values.
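
We can check this directly: because the tree has only a handful of leaves, its predicted probabilities take on only a few distinct values. A quick sketch using the probability matrix computed above:

# Distinct predicted probabilities (one per leaf of the tree)
sort(unique(predictTest[,2]))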

What is the AUC of the CART model on the test set?

AUC = 0.8470256

A Random Forest Model

Before building a random forest model, we’ll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model, since random forests are much more computationally intensive than CART or logistic regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, containing 2000 randomly selected observations from the original training set. Do this by running the following commands in your R console (assuming your training set is called “train”):

# New Training Set
set.seed(1)
trainSmall = train[sample(nrow(train), 2000), ]

Let us now build a random forest model to predict “over50k”, using the dataset “trainSmall” as the data used to build the model. Set the seed to 1 again right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest “can not handle categorical predictors with more than 32 categories”, re-build the model without the nativecountry variable as one of the independent variables.)

# Random Forest model
library(randomForest)
set.seed(1)
censusrf = randomForest(over50k ~ . , data = trainSmall)

Then, make predictions using this model on the entire test set.

# Make predictions
predictTest = predict(censusrf, newdata=test)

What is the accuracy of the model on the test set, using a threshold of 0.5?

# Confusion matrix
z = table(test$over50k, predictTest)
kable(z)
          <=50K   >50K
<=50K      8843    870
>50K       1029   2049
# compute Accuracy
sum(diag(z))/sum(z)
## [1] 0.8515362

Accuracy = 0.8515

Which of the following variables is the most important in terms of the number of splits?

One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split. To view this metric, run the following lines of R code (replace “MODEL” with the name of your random forest model):

# Metric
vu = varUsed(censusrf, count=TRUE)

vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)

dotchart(vusorted$x, names(censusrf$forest$xlevels[vusorted$ix]))

Age is the most important in terms of the number of splits.

Which one of the following variables is the most important in terms of mean reduction in impurity?

A different metric we can look at is related to “impurity”, which measures how homogeneous each bucket or leaf of the tree is. In each tree in the forest, whenever we select a variable and perform a split, the impurity is decreased. Therefore, one way to measure the importance of a variable is to average the reduction in impurity, taken over all the times that variable is selected for splitting in all of the trees in the forest. To compute this metric, run the following command in R (replace “MODEL” with the name of your random forest model):

# Metric
varImpPlot(censusrf)

Occupation is the most important in terms of mean reduction in impurity.
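
If we want the numeric values behind this plot, the importance() function from the randomForest package returns the mean decrease in Gini impurity for each variable:

# Optional: numeric importance values (MeanDecreaseGini)
importance(censusrf)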

Selecting cp by Cross-Validation

We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.

Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. Do this by using the train function. Set the seed beforehand to 2. Test cp values from 0.002 to 0.1 in 0.002 increments, by using the following command:

library(caret)
set.seed(2)
fitControl = trainControl( method = "cv", number = 10 )
# Select cp
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002)) 
train( over50k ~ . , data = train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid )
## CART 
## 
## 19187 samples
##    12 predictor
##     2 classes: ' <=50K', ' >50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17268, 17268, 17269, 17269, 17269, 17268, ... 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa     
##   0.002  0.8510972  0.55404931
##   0.004  0.8482829  0.55537475
##   0.006  0.8452078  0.53914084
##   0.008  0.8442176  0.53817486
##   0.010  0.8433317  0.53305978
##   0.012  0.8433317  0.53305978
##   0.014  0.8433317  0.53305978
##   0.016  0.8413510  0.52349296
##   0.018  0.8400480  0.51528594
##   0.020  0.8381193  0.50351272
##   0.022  0.8381193  0.50351272
##   0.024  0.8381193  0.50351272
##   0.026  0.8381193  0.50351272
##   0.028  0.8381193  0.50351272
##   0.030  0.8381193  0.50351272
##   0.032  0.8381193  0.50351272
##   0.034  0.8352011  0.48749911
##   0.036  0.8326470  0.47340390
##   0.038  0.8267570  0.44688035
##   0.040  0.8248289  0.43893150
##   0.042  0.8248289  0.43893150
##   0.044  0.8248289  0.43893150
##   0.046  0.8248289  0.43893150
##   0.048  0.8248289  0.43893150
##   0.050  0.8231084  0.42467058
##   0.052  0.8174798  0.37478096
##   0.054  0.8138837  0.33679015
##   0.056  0.8118514  0.30751485
##   0.058  0.8118514  0.30751485
##   0.060  0.8118514  0.30751485
##   0.062  0.8118514  0.30751485
##   0.064  0.8118514  0.30751485
##   0.066  0.8099233  0.29697206
##   0.068  0.7971025  0.22226318
##   0.070  0.7958512  0.21465656
##   0.072  0.7958512  0.21465656
##   0.074  0.7958512  0.21465656
##   0.076  0.7689601  0.05701508
##   0.078  0.7593684  0.00000000
##   0.080  0.7593684  0.00000000
##   0.082  0.7593684  0.00000000
##   0.084  0.7593684  0.00000000
##   0.086  0.7593684  0.00000000
##   0.088  0.7593684  0.00000000
##   0.090  0.7593684  0.00000000
##   0.092  0.7593684  0.00000000
##   0.094  0.7593684  0.00000000
##   0.096  0.7593684  0.00000000
##   0.098  0.7593684  0.00000000
##   0.100  0.7593684  0.00000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002.

Which value of cp does the train function recommend?

cp = 0.002
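
If we store the result of train() in an object, caret can also plot the cross-validated accuracy against cp, which makes this choice easy to see. A sketch that re-runs the cross-validation (so it takes a while); cvResults is just an illustrative name:

# Optional: visualize accuracy across the cp grid
set.seed(2)
cvResults = train(over50k ~ . , data = train, method = "rpart",
                  trControl = fitControl, tuneGrid = cartGrid)
plot(cvResults)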

Fit a CART model to the training data using this value of cp. What is the prediction accuracy on the test set?

# CART model with the selected cp
CARTcv = rpart(over50k ~ . , data = train, method = "class", cp = 0.002)
# Make predictions
predictTest = predict(CARTcv, newdata = test, type = "class")
# Confusion matrix
z = table(test$over50k, predictTest)
z
##         predictTest
##           <=50K  >50K
##    <=50K   9178   535
##    >50K    1240  1838
# Compute accuracy
sum(diag(z))/sum(z)
## [1] 0.8612306

Accuracy = 0.8612

Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one – or should we? Plot the CART tree for this model. How many splits are there?

# Plot tree
prp(CARTcv)

There are 18 splits in the tree. So although the accuracy on the test set is slightly higher, this model is much harder to interpret than the original tree, which had only 4 splits; the gain in accuracy comes at a real cost in simplicity.