1. (5+5+5=15 pts) Ensembles
In this problem we are going to analyze the same diabetes dataset that we used in HW4 to predict whether or not an individual suffers from diabetes. We have already preprocessed the train and test sets for your convenience (diabetes train-std.csv, diabetes test-std.csv). We will be using a decision tree classifier with and without the meta-methods Bagging and Boosting. You are free to use either Python or R (take your pick) for this problem.
A) Fit a classification tree. Plot the tree, and report the mean error rate (fraction of incorrect labels) on the test data. Report the confusion matrix. You can use rpart in R to fit the decision tree. In Python, use scikit-learn's tree module.
set.seed(2)
library(rpart)
library(caret)
#reading in files
train <- read.csv("diabetes_train.csv", header = TRUE)
test <- read.csv("diabetes_test.csv", header = TRUE)
tree <- rpart(classvariable~., method = "class", data = train)
plot(tree, uniform = TRUE, branch = .6, margin = .05)
text(tree, all = TRUE, use.n = TRUE, cex = 1, col='red')
title("Training Set Classification Tree")
#summary(tree)
# print the complexity-parameter (CP) table; plotcp(tree) plots the cv error
# note: the relative (training) error is lowest at 10 splits, but the
# cross-validated error (xerror) is minimized at a single split
printcp(tree)
##
## Classification tree:
## rpart(formula = classvariable ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] age BMI pedigreefunction plasmacon
##
## Root node error: 154/400 = 0.385
##
## n= 400
##
## CP nsplit rel error xerror xstd
## 1 0.279221 0 1.00000 1.00000 0.063194
## 2 0.032468 1 0.72078 0.79870 0.059930
## 3 0.016234 6 0.55195 0.85714 0.061067
## 4 0.012987 8 0.51948 0.80519 0.060064
## 5 0.010000 10 0.49351 0.82468 0.060455
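The CP table invites an optional pruning step: the cross-validated error (xerror) is minimized at nsplit = 1 (xerror = 0.7987), not at the full 10 splits. A minimal sketch of pruning at that CP value, using only objects defined above:
# pick the CP row minimizing cross-validated error, then prune to it
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree, cp = best_cp)
The pruned tree could then be plotted and evaluated exactly as above.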
predict1 <- predict(tree, newdata = test)  # class-probability matrix: col 1 = P(class 0), col 2 = P(class 1)
cutoff <- ifelse(predict1 > .5, 1, 0)
# note: cutoff[,1] flags class 0, so the labels fed to confusionMatrix are
# inverted -- this drives the 24% accuracy below (see the corrected sketch
# after the output); caret also expects predictions first, reference second
confusionMatrix(test$classvariable, cutoff[,1])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 209
## 1 62 50
##
## Accuracy : 0.2409
## 95% CI : (0.1974, 0.2887)
## No Information Rate : 0.7255
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.2998
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3673
## Specificity : 0.1931
## Pos Pred Value : 0.1469
## Neg Pred Value : 0.4464
## Prevalence : 0.2745
## Detection Rate : 0.1008
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.2802
##
## 'Positive' Class : 0
##
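Because column 1 of the probability matrix is P(class = 0), the matrix above has its predicted labels inverted (note the negative Kappa). A corrected sketch, assuming the class levels are 0/1 as printed above, which also reports the mean error rate the problem asks for:
# threshold P(class = 1) instead, and pass predictions first to confusionMatrix
tree_pred <- ifelse(predict(tree, newdata = test)[, "1"] > .5, 1, 0)
mean(tree_pred != test$classvariable)  # mean error rate ~ 0.241, i.e. ~75.9% accuracy
confusionMatrix(factor(tree_pred), factor(test$classvariable))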
B) Analyze the data using random forests. Report the mean error rate and the confusion matrix.
library(randomForest)
# classvariable is numeric, so randomForest fits a *regression* forest (see
# "Type of random forest: regression" below); predictions are thresholded at 0.5 later
RF <- randomForest(classvariable~., data=train)
plot(RF, main="Error vs. Number of Trees", col="red", lwd=6)
print(RF)
##
## Call:
## randomForest(formula = classvariable ~ ., data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1821668
## % Var explained: 23.06
importance(RF)
## IncNodePurity
## numpreg 6.943031
## plasmacon 20.110726
## bloodpress 7.313031
## skinfold 6.015276
## seruminsulin 6.295963
## BMI 14.875083
## pedigreefunction 11.372439
## age 11.080470
RF_predict <- predict(RF, newdata=test)  # regression-mode predictions in [0, 1]
pred2 <- ifelse(RF_predict > .5, 1, 0)   # threshold at 0.5 to recover class labels
confusionMatrix(test$classvariable, pred2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 217 28
## 1 43 69
##
## Accuracy : 0.8011
## 95% CI : (0.7559, 0.8413)
## No Information Rate : 0.7283
## P-Value [Acc > NIR] : 0.0009002
##
## Kappa : 0.5207
## Mcnemar's Test P-Value : 0.0966142
##
## Sensitivity : 0.8346
## Specificity : 0.7113
## Pos Pred Value : 0.8857
## Neg Pred Value : 0.6161
## Prevalence : 0.7283
## Detection Rate : 0.6078
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.7730
##
## 'Positive' Class : 0
##
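Since the forest above ran in regression mode, a classification-mode fit is worth sketching for comparison: wrapping the response in factor() makes randomForest classify directly and report an out-of-bag confusion matrix (same data objects as above; results may differ slightly from the regression-mode numbers):
set.seed(2)
# factor() response switches randomForest to classification mode
RF_cls <- randomForest(factor(classvariable) ~ ., data = train)
print(RF_cls)                                   # OOB error rate and OOB confusion matrix
RF_cls_pred <- predict(RF_cls, newdata = test)  # class labels, no manual threshold needed
mean(RF_cls_pred != test$classvariable)         # mean test error rate
confusionMatrix(RF_cls_pred, factor(test$classvariable))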
C) Use a gradient boosted decision tree (GBDT) to analyze the data. You can use the gbm package in R. Report the mean error rate and the confusion matrix.
library(gbm)
rows <- nrow(train)
# predictors (columns 1-8) and labels (column 9) for train and test
xtrain <- train[, 1:8]
ytrain <- train[, 9]
xtest <- test[, 1:8]
ytest <- test[, 9]
boostdt = gbm.fit(
x = xtrain,
y = ytrain,
distribution = "bernoulli",
n.trees = 5000,
interaction.depth = 3,
shrinkage = .01,
n.minobsinnode = 10,
nTrain = round(rows*.8),
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.3348 1.2957 0.0100 0.0021
## 2 1.3304 1.2915 0.0100 0.0018
## 3 1.3252 1.2882 0.0100 0.0019
## 4 1.3213 1.2849 0.0100 0.0009
## 5 1.3164 1.2801 0.0100 0.0021
## 6 1.3119 1.2744 0.0100 0.0022
## 7 1.3073 1.2716 0.0100 0.0018
## 8 1.3024 1.2677 0.0100 0.0016
## 9 1.2980 1.2628 0.0100 0.0019
## 10 1.2943 1.2582 0.0100 0.0014
## ... [verbose log truncated: training deviance decreases monotonically, while
## validation deviance reaches its minimum near iteration 400 and rises thereafter]
## 400 0.8200 0.9382 0.0100 -0.0001
## 1000 0.6383 1.0050 0.0100 -0.0004
## 2000 0.4548 1.1201 0.0100 -0.0003
## 3000 0.3362 1.2022 0.0100 -0.0002
## 4000 0.2529 1.2685 0.0100 -0.0001
## 5000 0.1916 1.3574 0.0100 -0.0001
gbm.perf(boostdt)
## Using test method...
## [1] 399
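The 399 trees above are selected on the single 80/20 holdout induced by nTrain. As a sketch (not required by the problem), k-fold cross-validation via gbm()'s formula interface usually gives a more stable choice; the hyperparameters below mirror the fit above:
set.seed(2)
# 5-fold CV variant of the fit above; gbm() takes a formula instead of x/y
boost_cv <- gbm(classvariable ~ ., data = train, distribution = "bernoulli",
                n.trees = 5000, interaction.depth = 3, shrinkage = .01,
                n.minobsinnode = 10, cv.folds = 5)
best_iter_cv <- gbm.perf(boost_cv, method = "cv")  # iteration minimizing CV deviance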
summary(boostdt)
## var rel.inf
## plasmacon plasmacon 20.833809
## pedigreefunction pedigreefunction 18.841435
## BMI BMI 18.453293
## age age 12.487988
## numpreg numpreg 7.636792
## bloodpress bloodpress 7.462160
## seruminsulin seruminsulin 7.384304
## skinfold skinfold 6.900219
oos_pred <- predict(object = boostdt, newdata = xtest, n.trees = gbm.perf(boostdt), type = "response")
## Using test method...
# round predicted probabilities to 0/1 (equivalent to a 0.5 threshold)
pred_results <- round(oos_pred)
confusionMatrix(ytest, pred_results)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 218 27
## 1 44 68
##
## Accuracy : 0.8011
## 95% CI : (0.7559, 0.8413)
## No Information Rate : 0.7339
## P-Value [Acc > NIR] : 0.001929
##
## Kappa : 0.5183
## Mcnemar's Test P-Value : 0.057584
##
## Sensitivity : 0.8321
## Specificity : 0.7158
## Pos Pred Value : 0.8898
## Neg Pred Value : 0.6071
## Prevalence : 0.7339
## Detection Rate : 0.6106
## Detection Prevalence : 0.6863
## Balanced Accuracy : 0.7739
##
## 'Positive' Class : 0
##
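The problem asks for the mean error rate, which is just 1 - accuracy; a short sketch computing it directly from the objects defined above:
# mean test error rate = fraction of misclassified labels
best_iter <- gbm.perf(boostdt, plot.it = FALSE)  # 399 trees, as above
gbdt_pred <- ifelse(predict(boostdt, newdata = xtest,
                            n.trees = best_iter, type = "response") > .5, 1, 0)
mean(gbdt_pred != ytest)            # ~0.199 for the GBDT (1 - 0.8011)
mean(pred2 != test$classvariable)   # ~0.199 for the random forest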
Test-set accuracy for the three models:
Classification Tree: 24.09% as printed (75.91% after correcting the inverted probability column noted in Part A)
Random Forest: 80.11%
Gradient Boosted Decision Tree: 80.11%
Equivalently, the mean error rates are roughly 0.241 for the corrected classification tree and 0.199 for both ensemble methods.
Interestingly, the random forest and the gradient boosted decision tree give identical accuracy and nearly identical sensitivity and specificity. Both improve on the corrected single-tree accuracy by four to five percentage points; the apparent collapse of the single tree to 24% is an artifact of thresholding the wrong probability column, not a failure of the model itself.