1. (5+5+5=15 pts) Ensembles
In this problem we are going to analyze the same diabetes dataset that we used in HW4 to predict whether or not an individual suffers from diabetes. We have already preprocessed the train and test sets for your convenience (diabetes train-std.csv, diabetes test-std.csv). We will be using a decision tree classifier with and without the meta-methods Bagging and Boosting. You are free to use either Python or R (take your pick) for this problem.
A) Fit a classification tree. Plot the tree, and report the mean error rate (fraction of incorrect labels) on the test data. Report the confusion matrix. You can use rpart in R to fit the decision tree. In Python, use scikit-learn's tree module.
set.seed(2)
library(rpart)
library(caret)
#reading in files
train <- read.csv("diabetes_train.csv", header = TRUE)
test <- read.csv("diabetes_test.csv", header = TRUE)
tree <- rpart(classvariable~., method = "class", data = train)
plot(tree, uniform = TRUE, branch = .6, margin = .05)
text(tree, all = TRUE, use.n = TRUE, cex = 1, col='red')
title("Training Set Classification Tree")
#summary(tree)
# print the complexity-parameter (CP) table; plotcp(tree) plots the cv error
# note: the relative (training) error is lowest at 10 splits, but the
# cross-validated error (xerror) is minimized at a single split
printcp(tree)
##
## Classification tree:
## rpart(formula = classvariable ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] age BMI pedigreefunction plasmacon
##
## Root node error: 154/400 = 0.385
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.279221 0 1.00000 1.00000 0.063194
## 2 0.032468 1 0.72078 0.79870 0.059930
## 3 0.016234 6 0.55195 0.85714 0.061067
## 4 0.012987 8 0.51948 0.80519 0.060064
## 5 0.010000 10 0.49351 0.82468 0.060455
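The CP table invites an optional pruning step: the cross-validated error (xerror) is minimized at nsplit = 1 (xerror = 0.7987), not at the full 10 splits. A minimal sketch of pruning at that CP value, using only objects defined above:
# pick the CP row minimizing cross-validated error, then prune to it
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree, cp = best_cp)
The pruned tree could then be plotted and evaluated exactly as above.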
predict1 <- predict(tree, newdata = test)  # class-probability matrix: col 1 = P(class 0), col 2 = P(class 1)
cutoff <- ifelse(predict1 > .5, 1, 0)
# note: cutoff[,1] flags class 0, so the labels fed to confusionMatrix are
# inverted -- this drives the 24% accuracy below (see the corrected sketch
# after the output); caret also expects predictions first, reference second
confusionMatrix(test$classvariable, cutoff[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 209
## 1 62 50
##
## Accuracy : 0.2409
## 95% CI : (0.1974, 0.2887)
## No Information Rate : 0.7255
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.2998
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3673
## Specificity : 0.1931
## Pos Pred Value : 0.1469
## Neg Pred Value : 0.4464
## Prevalence : 0.2745
## Detection Rate : 0.1008
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.2802
##
## 'Positive' Class : 0
##
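Because column 1 of the probability matrix is P(class = 0), the matrix above has its predicted labels inverted (note the negative Kappa). A corrected sketch, assuming the class levels are 0/1 as printed above, which also reports the mean error rate the problem asks for:
# threshold P(class = 1) instead, and pass predictions first to confusionMatrix
tree_pred <- ifelse(predict(tree, newdata = test)[, "1"] > .5, 1, 0)
mean(tree_pred != test$classvariable)  # mean error rate ~ 0.241, i.e. ~75.9% accuracy
confusionMatrix(factor(tree_pred), factor(test$classvariable))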
B) Analyze the data using random forests. Report the mean error rate and the confusion matrix.
library(randomForest)
# classvariable is numeric, so randomForest fits a *regression* forest (see
# "Type of random forest: regression" below); predictions are thresholded at 0.5 later
RF <- randomForest(classvariable~., data=train)
plot(RF, main="Error vs. Number of Trees", col="red", lwd=6)
print(RF)
##
## Call:
## randomForest(formula = classvariable ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1821668
## % Var explained: 23.06
importance(RF)
## IncNodePurity
## numpreg 6.943031
## plasmacon 20.110726
## bloodpress 7.313031
## skinfold 6.015276
## seruminsulin 6.295963
## BMI 14.875083
## pedigreefunction 11.372439
## age 11.080470
RF_predict <- predict(RF, newdata=test)  # regression-mode predictions in [0, 1]
pred2 <- ifelse(RF_predict > .5, 1, 0)   # threshold at 0.5 to recover class labels
confusionMatrix(test$classvariable, pred2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 217 28
## 1 43 69
##
## Accuracy : 0.8011
## 95% CI : (0.7559, 0.8413)
## No Information Rate : 0.7283
## P-Value [Acc > NIR] : 0.0009002
##
## Kappa : 0.5207
## Mcnemar's Test P-Value : 0.0966142
##
## Sensitivity : 0.8346
## Specificity : 0.7113
## Pos Pred Value : 0.8857
## Neg Pred Value : 0.6161
## Prevalence : 0.7283
## Detection Rate : 0.6078
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.7730
##
## 'Positive' Class : 0
##
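Since the forest above ran in regression mode, a classification-mode fit is worth sketching for comparison: wrapping the response in factor() makes randomForest classify directly and report an out-of-bag confusion matrix (same data objects as above; results may differ slightly from the regression-mode numbers):
set.seed(2)
# factor() response switches randomForest to classification mode
RF_cls <- randomForest(factor(classvariable) ~ ., data = train)
print(RF_cls)                                   # OOB error rate and OOB confusion matrix
RF_cls_pred <- predict(RF_cls, newdata = test)  # class labels, no manual threshold needed
mean(RF_cls_pred != test$classvariable)         # mean test error rate
confusionMatrix(RF_cls_pred, factor(test$classvariable))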
C) Use a gradient boosted decision tree (GBDT) to analyze the data. You can use the gbm package in R. Report the mean error rate and the confusion matrix.
library(gbm)
rows <- nrow(train)
# predictors (columns 1-8) and labels (column 9) for train and test
xtrain <- train[, 1:8]
ytrain <- train[, 9]
xtest <- test[, 1:8]
ytest <- test[, 9]
boostdt = gbm.fit(
x = xtrain,
y = ytrain,
distribution = "bernoulli",
n.trees = 5000,
interaction.depth = 3,
shrinkage = .01,
n.minobsinnode = 10,
nTrain = round(rows*.8),
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.3348 1.2957 0.0100 0.0021
## 2 1.3304 1.2915 0.0100 0.0018
## 3 1.3252 1.2882 0.0100 0.0019
## 4 1.3213 1.2849 0.0100 0.0009
## 5 1.3164 1.2801 0.0100 0.0021
## 6 1.3119 1.2744 0.0100 0.0022
## 7 1.3073 1.2716 0.0100 0.0018
## 8 1.3024 1.2677 0.0100 0.0016
## 9 1.2980 1.2628 0.0100 0.0019
## 10 1.2943 1.2582 0.0100 0.0014
## ... [verbose log truncated: training deviance decreases monotonically, while
## validation deviance reaches its minimum near iteration 400 and rises thereafter]
## 400 0.8200 0.9382 0.0100 -0.0001
## 1000 0.6383 1.0050 0.0100 -0.0004
## 2000 0.4548 1.1201 0.0100 -0.0003
## 3000 0.3362 1.2022 0.0100 -0.0002
## 4000 0.2529 1.2685 0.0100 -0.0001
## 5000 0.1916 1.3574 0.0100 -0.0001
gbm.perf(boostdt)
## Using test method...
## [1] 399
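The 399 trees above are selected on the single 80/20 holdout induced by nTrain. As a sketch (not required by the problem), k-fold cross-validation via gbm()'s formula interface usually gives a more stable choice; the hyperparameters below mirror the fit above:
set.seed(2)
# 5-fold CV variant of the fit above; gbm() takes a formula instead of x/y
boost_cv <- gbm(classvariable ~ ., data = train, distribution = "bernoulli",
                n.trees = 5000, interaction.depth = 3, shrinkage = .01,
                n.minobsinnode = 10, cv.folds = 5)
best_iter_cv <- gbm.perf(boost_cv, method = "cv")  # iteration minimizing CV deviance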
summary(boostdt)
## var rel.inf
## plasmacon plasmacon 20.833809
## pedigreefunction pedigreefunction 18.841435
## BMI BMI 18.453293
## age age 12.487988
## numpreg numpreg 7.636792
## bloodpress bloodpress 7.462160
## seruminsulin seruminsulin 7.384304
## skinfold skinfold 6.900219
oos_pred <- predict(object = boostdt, newdata = xtest, n.trees = gbm.perf(boostdt), type = "response")
## Using test method...
# round predicted probabilities to 0/1 (equivalent to a 0.5 threshold)
pred_results <- round(oos_pred)
confusionMatrix(ytest, pred_results)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 218 27
## 1 44 68
##
## Accuracy : 0.8011
## 95% CI : (0.7559, 0.8413)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 0.001929
##
## Kappa : 0.5183
## Mcnemar's Test P-Value : 0.057584
##
## Sensitivity : 0.8321
## Specificity : 0.7158
## Pos Pred Value : 0.8898
## Neg Pred Value : 0.6071
## Prevalence : 0.7339
## Detection Rate : 0.6106
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.7739
##
## 'Positive' Class : 0
##
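The problem asks for the mean error rate, which is just 1 - accuracy; a short sketch computing it directly from the objects defined above:
# mean test error rate = fraction of misclassified labels
best_iter <- gbm.perf(boostdt, plot.it = FALSE)  # 399 trees, as above
gbdt_pred <- ifelse(predict(boostdt, newdata = xtest,
                            n.trees = best_iter, type = "response") > .5, 1, 0)
mean(gbdt_pred != ytest)            # ~0.199 for the GBDT (1 - 0.8011)
mean(pred2 != test$classvariable)   # ~0.199 for the random forest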
Test-set accuracy for the three models:
Classification Tree: 24.09% as printed (75.91% after correcting the inverted probability column noted in Part A)
Random Forest: 80.11%
Gradient Boosted Decision Tree: 80.11%
Equivalently, the mean error rates are roughly 0.241 for the corrected classification tree and 0.199 for both ensemble methods.
Interestingly, the random forest and the gradient boosted decision tree give identical accuracy and nearly identical sensitivity and specificity. Both improve on the corrected single-tree accuracy by four to five percentage points; the apparent collapse of the single tree to 24% is an artifact of thresholding the wrong probability column, not a failure of the model itself.