§ 1.1 Which variables are significant, or have factors that are significant?
set.seed(2000)
names(census)
[1] "age" "workclass" "education" "maritalstatus" "occupation"
[6] "relationship" "race" "sex" "capitalgain" "capitalloss"
[11] "hoursperweek" "nativecountry" "over50k"
split = sample.split(census$over50k, SplitRatio = 0.6)
train = subset(census, split == TRUE)
test = subset(census, split == FALSE)
train$over50k = as.factor(train$over50k)
test$over50k = as.factor(test$over50k)
model1 = glm(over50k ~ ., data = train, family = binomial)
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model1)
Call:
glm(formula = over50k ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.1065 -0.5037 -0.1804 -0.0008 3.3383
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.658e+00 1.379e+00 -6.279 3.41e-10 ***
age 2.548e-02 2.139e-03 11.916 < 2e-16 ***
workclassFederal-gov 1.105e+00 2.014e-01 5.489 4.03e-08 ***
workclassLocal-gov 3.675e-01 1.821e-01 2.018 0.043641 *
workclassNever-worked -1.283e+01 8.453e+02 -0.015 0.987885
workclassPrivate 6.012e-01 1.626e-01 3.698 0.000218 ***
workclassSelf-emp-inc 7.575e-01 1.950e-01 3.884 0.000103 ***
workclassSelf-emp-not-inc 1.855e-01 1.774e-01 1.046 0.295646
workclassState-gov 4.012e-01 1.961e-01 2.046 0.040728 *
workclassWithout-pay -1.395e+01 6.597e+02 -0.021 0.983134
education11th 2.225e-01 2.867e-01 0.776 0.437738
education12th 6.380e-01 3.597e-01 1.774 0.076064 .
education1st-4th -7.075e-01 7.760e-01 -0.912 0.361897
education5th-6th -3.170e-01 4.880e-01 -0.650 0.516008
education7th-8th -3.498e-01 3.126e-01 -1.119 0.263152
education9th -1.258e-01 3.539e-01 -0.355 0.722228
educationAssoc-acdm 1.602e+00 2.427e-01 6.601 4.10e-11 ***
educationAssoc-voc 1.541e+00 2.368e-01 6.506 7.74e-11 ***
educationBachelors 2.177e+00 2.218e-01 9.817 < 2e-16 ***
educationDoctorate 2.761e+00 2.893e-01 9.544 < 2e-16 ***
educationHS-grad 1.006e+00 2.169e-01 4.638 3.52e-06 ***
educationMasters 2.421e+00 2.353e-01 10.289 < 2e-16 ***
educationPreschool -2.237e+01 6.864e+02 -0.033 0.973996
educationProf-school 2.938e+00 2.753e-01 10.672 < 2e-16 ***
educationSome-college 1.365e+00 2.195e-01 6.219 5.00e-10 ***
maritalstatusMarried-AF-spouse 2.540e+00 7.145e-01 3.555 0.000378 ***
maritalstatusMarried-civ-spouse 2.458e+00 3.573e-01 6.880 6.00e-12 ***
maritalstatusMarried-spouse-absent -9.486e-02 3.204e-01 -0.296 0.767155
maritalstatusNever-married -4.515e-01 1.139e-01 -3.962 7.42e-05 ***
maritalstatusSeparated 3.609e-02 1.984e-01 0.182 0.855672
maritalstatusWidowed 1.858e-01 1.962e-01 0.947 0.343449
occupationAdm-clerical 9.470e-02 1.288e-01 0.735 0.462064
occupationArmed-Forces -1.008e+00 1.487e+00 -0.677 0.498170
occupationCraft-repair 2.174e-01 1.109e-01 1.960 0.049972 *
occupationExec-managerial 9.400e-01 1.138e-01 8.257 < 2e-16 ***
occupationFarming-fishing -1.068e+00 1.908e-01 -5.599 2.15e-08 ***
occupationHandlers-cleaners -6.237e-01 1.946e-01 -3.204 0.001353 **
occupationMachine-op-inspct -1.862e-01 1.376e-01 -1.353 0.176061
occupationOther-service -8.183e-01 1.641e-01 -4.987 6.14e-07 ***
occupationPriv-house-serv -1.297e+01 2.267e+02 -0.057 0.954385
occupationProf-specialty 6.331e-01 1.222e-01 5.180 2.22e-07 ***
occupationProtective-serv 6.267e-01 1.710e-01 3.664 0.000248 ***
occupationSales 3.276e-01 1.175e-01 2.789 0.005282 **
occupationTech-support 6.173e-01 1.533e-01 4.028 5.63e-05 ***
occupationTransport-moving NA NA NA NA
relationshipNot-in-family 7.881e-01 3.530e-01 2.233 0.025562 *
relationshipOther-relative -2.194e-01 3.137e-01 -0.699 0.484263
relationshipOwn-child -7.489e-01 3.507e-01 -2.136 0.032716 *
relationshipUnmarried 7.041e-01 3.720e-01 1.893 0.058392 .
relationshipWife 1.324e+00 1.331e-01 9.942 < 2e-16 ***
raceAsian-Pac-Islander 4.830e-01 3.548e-01 1.361 0.173504
raceBlack 3.644e-01 2.882e-01 1.265 0.206001
raceOther 2.204e-01 4.513e-01 0.488 0.625263
raceWhite 4.108e-01 2.737e-01 1.501 0.133356
sexMale 7.729e-01 1.024e-01 7.545 4.52e-14 ***
capitalgain 3.280e-04 1.372e-05 23.904 < 2e-16 ***
capitalloss 6.445e-04 4.854e-05 13.277 < 2e-16 ***
hoursperweek 2.897e-02 2.101e-03 13.791 < 2e-16 ***
nativecountryCanada 2.593e-01 1.308e+00 0.198 0.842879
nativecountryChina -9.695e-01 1.327e+00 -0.730 0.465157
nativecountryColumbia -1.954e+00 1.526e+00 -1.280 0.200470
nativecountryCuba 5.735e-02 1.323e+00 0.043 0.965432
nativecountryDominican-Republic -1.435e+01 3.092e+02 -0.046 0.962972
nativecountryEcuador -3.550e-02 1.477e+00 -0.024 0.980829
nativecountryEl-Salvador -6.095e-01 1.395e+00 -0.437 0.662181
nativecountryEngland -6.707e-02 1.327e+00 -0.051 0.959686
nativecountryFrance 5.301e-01 1.419e+00 0.374 0.708642
nativecountryGermany 5.474e-02 1.306e+00 0.042 0.966572
nativecountryGreece -2.646e+00 1.714e+00 -1.544 0.122527
nativecountryGuatemala -1.293e+01 3.345e+02 -0.039 0.969180
nativecountryHaiti -9.221e-01 1.615e+00 -0.571 0.568105
nativecountryHoland-Netherlands -1.282e+01 2.400e+03 -0.005 0.995736
nativecountryHonduras -9.584e-01 3.412e+00 -0.281 0.778775
nativecountryHong -2.362e-01 1.492e+00 -0.158 0.874155
nativecountryHungary 1.412e-01 1.555e+00 0.091 0.927653
nativecountryIndia -8.218e-01 1.314e+00 -0.625 0.531661
nativecountryIran -3.299e-02 1.366e+00 -0.024 0.980736
nativecountryIreland 1.579e-01 1.473e+00 0.107 0.914628
nativecountryItaly 6.100e-01 1.333e+00 0.458 0.647194
nativecountryJamaica -2.279e-01 1.387e+00 -0.164 0.869467
nativecountryJapan 5.072e-01 1.375e+00 0.369 0.712179
nativecountryLaos -6.831e-01 1.661e+00 -0.411 0.680866
nativecountryMexico -9.182e-01 1.303e+00 -0.705 0.481103
nativecountryNicaragua -1.987e-01 1.507e+00 -0.132 0.895132
nativecountryOutlying-US(Guam-USVI-etc) -1.373e+01 8.502e+02 -0.016 0.987115
nativecountryPeru -9.660e-01 1.678e+00 -0.576 0.564797
nativecountryPhilippines 4.393e-02 1.281e+00 0.034 0.972640
nativecountryPoland 2.410e-01 1.383e+00 0.174 0.861624
nativecountryPortugal 7.276e-01 1.477e+00 0.493 0.622327
nativecountryPuerto-Rico -5.769e-01 1.357e+00 -0.425 0.670837
nativecountryScotland -1.188e+00 1.719e+00 -0.691 0.489616
nativecountrySouth -8.183e-01 1.341e+00 -0.610 0.541809
nativecountryTaiwan -2.590e-01 1.350e+00 -0.192 0.847878
nativecountryThailand -1.693e+00 1.737e+00 -0.975 0.329678
nativecountryTrinadad&Tobago -1.346e+00 1.721e+00 -0.782 0.434105
nativecountryUnited-States -8.594e-02 1.269e+00 -0.068 0.946020
nativecountryVietnam -1.008e+00 1.523e+00 -0.662 0.507799
nativecountryYugoslavia 1.402e+00 1.648e+00 0.851 0.394874
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 21175 on 19186 degrees of freedom
Residual deviance: 12104 on 19090 degrees of freedom
AIC: 12298
Number of Fisher Scoring iterations: 15
print("age, workclass, education, maritalstatus, occupation, relationship, sex, capitalgain, capitalloss, hoursperweek")
[1] "age, workclass, education, maritalstatus, occupation, relationship, sex, capitalgain, capitalloss, hoursperweek"
§ 1.2 What is the accuracy of the model on the testing set? Use a threshold of 0.5.
pred1 = predict(model1, newdata = test, type = "response")
prediction from a rank-deficient fit may be misleading
table(test$over50k, pred1 > 0.5)
FALSE TRUE
<=50K 9051 662
>50K 1190 1888
(9051+1888)/nrow(test)
[1] 0.8552107
§ 1.3 What is the baseline accuracy for the testing set?
table(test$over50k)
<=50K >50K
9713 3078
9713/nrow(test)
[1] 0.7593621
§ 1.4 What is the area-under-the-curve (AUC) for this model on the test set?
predroc = prediction(pred1, test$over50k)
auc = as.numeric(performance(predroc, "auc")@y.values)
auc
[1] 0.9061598
§ 2.1 How many splits does the tree have in total?
cart1 = rpart(over50k ~., data = train, method = "class")
rpart.plot(cart1)
print("4")
[1] "4"
§ 2.2 Which variable does the tree split on at the first level (the very first split of the tree)?
print("relationship")
[1] "relationship"
§ 2.3 Which variables does the tree split on at the second level (immediately after the first split of the tree)? Select all that apply.
print("education, capitalgain")
[1] "education, capitalgain"
§ 2.4 What is the accuracy of the model on the testing set?
predcart = predict(cart1, newdata = test, type = "class")
table(test$over50k, predcart)
predcart
<=50K >50K
<=50K 9243 470
>50K 1482 1596
(9243+1596)/nrow(test)
[1] 0.8473927
§ 2.5 Which of the following explanations for this behavior is most correct?
print("The probabilities from the CART model take only a handful of values (five, one for each end bucket/leaf of the tree); the changes in the ROC curve correspond to setting the threshold to one of those values.")
[1] "The probabilities from the CART model take only a handful of values (five, one for each end bucket/leaf of the tree); the changes in the ROC curve correspond to setting the threshold to one of those values."
§ 2.6
predcart1 = predict(cart1, newdata = test)
predrocart = prediction(predcart1[ ,2], test$over50k)
auc2 = as.numeric(performance(predrocart, "auc")@y.values)
auc2
[1] 0.8470256
§ 3.1 What is the accuracy of the model on the test set, using a threshold of 0.5?
set.seed(1)
forest1 = randomForest(over50k ~., data = trainSmall)
predforest = predict(forest1, newdata = test)
table(test$over50k, predforest)
predforest
<=50K >50K
<=50K 7377 2336
>50K 413 2665
(7377+2665)/nrow(test)
[1] 0.7850833
§ 3.2 Which of the following variables is the most important in terms of the number of splits?
vu = varUsed(forest1, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, names(forest1$forest$xlevels[vusorted$ix]))
print("age")
[1] "age"
§ 3.3 Which one of the following variables is the most important in terms of mean reduction in impurity?
varImpPlot(forest1)
print("occupation")
[1] "occupation"
§ 4.1 Which value of cp does the train function recommend?
numfolds = trainControl(method = "cv", number = 10)
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))
train( over50k ~ . , data = train, method = "rpart", trControl = numfolds, tuneGrid = cartGrid )
CART
19187 samples
12 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 17268, 17268, 17268, 17269, 17268, 17268, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.002 0.8502111 0.55601532
0.004 0.8471886 0.55415827
0.006 0.8459896 0.54375064
0.008 0.8455206 0.54350297
0.010 0.8442693 0.53756685
0.012 0.8442693 0.53756685
0.014 0.8442693 0.53756685
0.016 0.8433834 0.53309955
0.018 0.8424975 0.52667618
0.020 0.8399959 0.50868798
0.022 0.8393185 0.50362546
0.024 0.8393185 0.50362546
0.026 0.8393185 0.50362546
0.028 0.8393185 0.50362546
0.030 0.8393185 0.50362546
0.032 0.8384326 0.49832772
0.034 0.8373383 0.48968068
0.036 0.8308235 0.45610628
0.038 0.8265495 0.43827197
0.040 0.8247774 0.43064907
0.042 0.8247774 0.43064907
0.044 0.8247774 0.43064907
0.046 0.8247774 0.43064907
0.048 0.8247774 0.43064907
0.050 0.8209723 0.40652195
0.052 0.8162815 0.36828149
0.054 0.8131549 0.32649620
0.056 0.8118518 0.30758669
0.058 0.8118518 0.30758669
0.060 0.8118518 0.30758669
0.062 0.8118518 0.30758669
0.064 0.8118518 0.30758669
0.066 0.8060655 0.27546182
0.068 0.7996560 0.23828903
0.070 0.7958512 0.21460562
0.072 0.7958512 0.21460562
0.074 0.7913697 0.18867561
0.076 0.7725539 0.07784227
0.078 0.7593684 0.00000000
0.080 0.7593684 0.00000000
0.082 0.7593684 0.00000000
0.084 0.7593684 0.00000000
0.086 0.7593684 0.00000000
0.088 0.7593684 0.00000000
0.090 0.7593684 0.00000000
0.092 0.7593684 0.00000000
0.094 0.7593684 0.00000000
0.096 0.7593684 0.00000000
0.098 0.7593684 0.00000000
0.100 0.7593684 0.00000000
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.002.
§ 4.2 What is the prediction accuracy on the test set?
modelfinal = rpart(over50k ~ . , data = train, method = "class", cp = 0.002)
predfinal = predict(modelfinal, newdata = test, type = "class")
table(test$over50k, predfinal)
predfinal
<=50K >50K
<=50K 9178 535
>50K 1240 1838
(9178+1838)/nrow(test)
[1] 0.8612306
§ 4.3 Plot the CART tree for this model. How many splits are there?
prp(modelfinal)
print("18")
[1] "18"