Notes June 24

Tree Models

These notes follow the models presented in the Datacamp course on tree-based models with the exception of the chapter on regression. There are for models used to solve binary classification problems. These notes use the caret package consistently.

Basic recursive partitioning
Bagging
Random Forest
Boosting

For each model, we examine the list of models available in caret to determine the specifics of using caret to implement the model. We need to know the required packages , the method to invoke the model, and the hyperparameters to be tuned. The list is at https://topepo.github.io/caret/available-models.html.

Libraries

library(tidyverse)

## ── Attaching packages ────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(rpart)
library(ipred)
library(e1071)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(gbm)

## Loaded gbm 2.1.5

library(xgboost)

## 
## Attaching package: 'xgboost'

## The following object is masked from 'package:dplyr':
## 
##     slice

library(ranger)

Get German Credit Data

We will use the German credit data, which can be found in many places on the web. The one below will work. I renamed the target variable to “OK” to reduce typing.

url="http://freakonometrics.free.fr/german_credit.csv"
credit=read.csv(url, header = TRUE, sep = ",")
credit %>% rename(OK = Creditability) %>%  
           mutate( OK = factor(OK) ) -> credit
str(credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ OK                               : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

Split.

This is the standard caret splitting process.

set.seed(123)
ind = createDataPartition(credit$OK,
                          p = .8,
                          list = FALSE)
traind = credit[ind,]
testd = credit[-ind,]
table(traind$OK)

## 
##   0   1 
## 240 560

table(testd$OK)

## 
##   0   1 
##  60 140

Recursive Partitioning

myTc = trainControl(method = "cv",number = 10)
mod_rpart <- train(OK ~ .,
                   method = "rpart",
                   data = traind,
                   tuneLength = 10, 
                   metric = "Accuracy",
                   trControl = myTc)
mod_rpart$bestTune

##           cp
## 4 0.02175926

pred_rpart = predict(mod_rpart,newdata = testd)
confusionMatrix(pred_rpart,testd$OK)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  24  27
##          1  36 113
##                                           
##                Accuracy : 0.685           
##                  95% CI : (0.6157, 0.7487)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.7079          
##                                           
##                   Kappa : 0.2164          
##  Mcnemar's Test P-Value : 0.3135          
##                                           
##             Sensitivity : 0.4000          
##             Specificity : 0.8071          
##          Pos Pred Value : 0.4706          
##          Neg Pred Value : 0.7584          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1200          
##    Detection Prevalence : 0.2550          
##       Balanced Accuracy : 0.6036          
##                                           
##        'Positive' Class : 0               
##

Bagging

mod_bag <- train(OK ~ .,
                   method = "treebag",
                   data = traind,
                   trControl = myTc,
                   metric = "Accuracy")
pred_bag = predict(mod_bag,newdata = testd)
confusionMatrix(pred_bag,testd$OK)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  29  17
##          1  31 123
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6947, 0.8174)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.03595         
##                                           
##                   Kappa : 0.3878          
##  Mcnemar's Test P-Value : 0.06060         
##                                           
##             Sensitivity : 0.4833          
##             Specificity : 0.8786          
##          Pos Pred Value : 0.6304          
##          Neg Pred Value : 0.7987          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1450          
##    Detection Prevalence : 0.2300          
##       Balanced Accuracy : 0.6810          
##                                           
##        'Positive' Class : 0               
##

mod_bag

## Bagged CART 
## 
## 800 samples
##  20 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.75375   0.3759284

Ranger

mod_ranger <- train(OK ~ .,
                   method = "ranger",
                   data = traind,
                   tuneLength = 10, 
                   metric = "Accuracy",
                   trControl = myTc)
mod_ranger$bestTune

##   mtry  splitrule min.node.size
## 4    4 extratrees             1

pred_ranger = predict(mod_ranger,newdata = testd)
confusionMatrix(pred_ranger,testd$OK)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  24  15
##          1  36 125
##                                           
##                Accuracy : 0.745           
##                  95% CI : (0.6787, 0.8039)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.093436        
##                                           
##                   Kappa : 0.3254          
##  Mcnemar's Test P-Value : 0.005101        
##                                           
##             Sensitivity : 0.4000          
##             Specificity : 0.8929          
##          Pos Pred Value : 0.6154          
##          Neg Pred Value : 0.7764          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1200          
##    Detection Prevalence : 0.1950          
##       Balanced Accuracy : 0.6464          
##                                           
##        'Positive' Class : 0               
##

gbm

Gradient Boosting.

mod_gbm <- train(OK ~ .,
                   method = "gbm",
                   data = traind,
                   tuneLength = 10, 
                   metric = "Accuracy",
                   trControl = myTc,
                   verbose = FALSE)
mod_gbm$bestTune

##    n.trees interaction.depth shrinkage n.minobsinnode
## 23     150                 3       0.1             10

pred_gbm = predict(mod_gbm,newdata = testd)
confusionMatrix(pred_gbm,testd$OK)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  29  15
##          1  31 125
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.7054, 0.8264)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.01687         
##                                           
##                   Kappa : 0.4072          
##  Mcnemar's Test P-Value : 0.02699         
##                                           
##             Sensitivity : 0.4833          
##             Specificity : 0.8929          
##          Pos Pred Value : 0.6591          
##          Neg Pred Value : 0.8013          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1450          
##    Detection Prevalence : 0.2200          
##       Balanced Accuracy : 0.6881          
##                                           
##        'Positive' Class : 0               
##

Resampling

Caret makes it easy to compare a collection of models focused on the same target.

results <- resamples(list(rpart=mod_rpart, bag=mod_bag, ranger=mod_ranger, gbm = mod_gbm))
# summarize the distributions
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: rpart, bag, ranger, gbm 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.  1st Qu.  Median    Mean  3rd Qu.   Max. NA's
## rpart  0.6625 0.731250 0.76875 0.75750 0.787500 0.8125    0
## bag    0.6500 0.715625 0.77500 0.75375 0.787500 0.8375    0
## ranger 0.7250 0.737500 0.75000 0.76750 0.784375 0.8750    0
## gbm    0.7625 0.775000 0.78125 0.78875 0.809375 0.8250    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rpart  0.22413793 0.3009868 0.4074943 0.3836654 0.4364583 0.5370370    0
## bag    0.05405405 0.2840441 0.4355644 0.3759284 0.4878049 0.5666667    0
## ranger 0.21428571 0.2856164 0.3424242 0.3698351 0.4133222 0.6621622    0
## gbm    0.36666667 0.4152961 0.4695857 0.4631315 0.5104167 0.5481928    0

# boxplots of results
bwplot(results)