Week 5 Coding Practice - Decision Trees

Part 1. Decision Tree in R

Step 1. Follow Along Chapter 5.5 - Supervised Learning: Random Forest

Overview of Random Forest Models:

  • Accurate and non-linear models
  • Robust to over-fitting
  • Require manual hyperparameter tuning
  • Built by generating a large number of individual decision trees.
  • Bootstrap aggregation, or bagging: the technique of building different trees from different inputs (bootstrapped observations and features) so that a broad search space is explored, then combining them to produce accurate models (a hand-rolled sketch follows this list).
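To make the bagging idea concrete, here is a minimal hand-rolled sketch (not from the chapter): it grows a handful of rpart trees on bootstrap samples of the Sonar data used below and combines their class predictions by majority vote. The package names (mlbench, rpart) and the number of trees are the only assumptions.

library("mlbench")  ## provides the Sonar data set
library("rpart")
data(Sonar)
set.seed(1)
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  ## resample the observations with replacement and grow one tree per sample
  boot <- Sonar[sample(nrow(Sonar), replace = TRUE), ]
  as.character(predict(rpart(Class ~ ., data = boot, method = "class"),
                       Sonar, type = "class"))
})
## majority vote over the individual trees for each observation
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
table(bagged, Sonar$Class)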

5.5.1 Decision Trees

  • Advantage: make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy.
  • Essentially split data according to the value of the features through a set of if-else conditions.
  • Decision trees choose splits that produce the most homogeneous partitions, leading to smaller and more homogeneous partitions at each iteration.
  • Disadvantage: they can become large and complex, which corresponds to over-fitting (modelling noise rather than patterns in the data).
library("mlbench") ## provides the Sonar data set
data(Sonar)
library("rpart")   ## recursive partitioning
m <- rpart(Class ~ ., data = Sonar,
           method = "class")
library("rpart.plot")
rpart.plot(m)

p <- predict(m, Sonar, type = "class")
table(p, Sonar$Class)
##    
## p    M  R
##   M 95 10
##   R 16 87
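As a quick check (not in the original material), the overall accuracy on the training data can be read off the confusion matrix: (95 + 87) / 208 = 0.875. Note that this is accuracy on the data the tree was fitted to, so it is optimistic.

## proportion of correctly classified observations (diagonal of the confusion matrix)
sum(diag(table(p, Sonar$Class))) / nrow(Sonar)
## [1] 0.875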

Pruning to Avoid Over-fitting

  • Pre-pruning: stop growing the tree after a certain number of iterations, or require a minimum number of observations in each node to allow splitting.
  • Post-pruning: grow a large and complex tree, then reduce its size by removing nodes and branches with a negligible effect on classification accuracy (see the sketch below).
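As an illustration of post-pruning (a sketch, not from the chapter), rpart's complexity-parameter table and prune() can be applied to the tree m fitted above:

## complexity parameter (cp) table from rpart's internal cross-validation
printcp(m)
## prune back to the cp value with the lowest cross-validated error (xerror)
best_cp <- m$cptable[which.min(m$cptable[, "xerror"]), "CP"]
m_pruned <- prune(m, cp = best_cp)
rpart.plot(m_pruned)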

5.5.2 Training A Random Forest

## Use the train() function from caret (method = "ranger" requires the ranger package to be installed)
library("caret")
set.seed(12)
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger")
print(model)
## Random Forest 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 208, 208, 208, 208, 208, 208, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.8090731  0.6131571
##    2    extratrees  0.8136902  0.6234492
##   31    gini        0.7736954  0.5423516
##   31    extratrees  0.8285153  0.6521921
##   60    gini        0.7597299  0.5140905
##   60    extratrees  0.8157646  0.6255929
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 31, splitrule = extratrees
##  and min.node.size = 1.
plot(model)

  • The main hyperparameter is mtry, i.e. the number of randomly selected variables used at each split.
  • Using only 2 variables produces very random models, while using hundreds of variables makes the models less random but risks over-fitting.
  • The caret package can automate hyperparameter tuning using a grid search, parametrised either by setting tuneLength (the number of hyperparameter values to test) or by directly defining tuneGrid (the hyperparameter values themselves), which requires knowledge of the model.
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger",
               tuneLength = 5)
set.seed(42)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
                      splitrule = c("gini", "extratrees"),
                      min.node.size = 1) ## Minimal node size; default 1 for classification
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger",
               tuneGrid = myGrid,
               trControl = trainControl(method = "cv",
                                       number = 5,
                                       verboseIter = FALSE))
print(model)
## Random Forest 
## 
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 166, 167, 167, 167, 165 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    5    gini        0.8076277  0.6098253
##    5    extratrees  0.8416579  0.6784745
##   10    gini        0.7927667  0.5799348
##   10    extratrees  0.8418848  0.6791453
##   20    gini        0.7882316  0.5718852
##   20    extratrees  0.8516355  0.6991879
##   40    gini        0.7880048  0.5716461
##   40    extratrees  0.8371229  0.6695638
##   60    gini        0.7833482  0.5613525
##   60    extratrees  0.8322448  0.6599318
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 20, splitrule = extratrees
##  and min.node.size = 1.
plot(model)

Experiment with training an RF model using 5-fold cross-validation and a tuneLength of 5:

set.seed(42)
model <- train(Class ~ .,
               data = Sonar,
               method = "ranger",
               tuneLength = 5,
               trControl = trainControl(method = "cv",
                                        number = 5,
                                        verboseIter = FALSE))
plot(model)

Step 2.

Load Data

library("readr")
chocolate <- read_csv("data/chocolate_tibble.csv", show_col_types = FALSE)
names(chocolate) <- make.names(names(chocolate)) ## normalise column names

library("knitr")
library("kableExtra")
kable(head(chocolate, 5), "html",
      caption = "Table 1. First five rows of the chocolate data") %>%
  kable_styling("striped")

Table 1. First five rows of the chocolate data

final_grade  review_date  cocoa_percent  company_location  bean_type  broad_bean_origin
       3.75         2016           0.63  France            NA         Sao Tome
       2.75         2015           0.70  France            NA         Togo
       3.00         2015           0.70  France            NA         Togo
       3.50         2015           0.70  France            NA         Togo
       3.50         2015           0.70  France            NA         Peru

1. Randomly split your data into a training set (80%) and a test set (20%)

set.seed(3456)
trainIndex <- createDataPartition(chocolate$final_grade, p = .8,
                                  list = FALSE)
chocolate_train <- chocolate[ trainIndex,]
chocolate_test <- chocolate[-trainIndex,]
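A quick sanity check (not part of the assignment prompt) that the partition sizes are roughly 80/20:

nrow(chocolate_train)
nrow(chocolate_test)
nrow(chocolate_train) / nrow(chocolate)  ## should be close to 0.8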

2. Build the decision tree regression model using the training set (libraries rpart and party, and parsnip's decision_tree())

library("parsnip")
spec <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")
print(spec)
## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart
model <- spec %>%
  parsnip::fit(formula = final_grade ~ ., data = chocolate_train)

#predict(model, new_data = chocolate_test)

## rpart (loaded above)
model2 <- rpart(final_grade ~ cocoa_percent + company_location,
                data = chocolate_train, method = "anova")

library("party")
model3 <- ctree(
  final_grade ~ cocoa_percent + as.factor(company_location), 
  data = chocolate_train)
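The commented-out predict() call above can be turned into a simple hold-out evaluation. This is a minimal sketch, not part of the prompt, and it assumes every categorical level present in the test set also occurs in the training set (otherwise predict() on an rpart fit will complain about new factor levels):

## predictions from the parsnip/rpart model on the held-out test set
pred <- predict(model, new_data = chocolate_test)$.pred
## root-mean-square error on the test set
sqrt(mean((chocolate_test$final_grade - pred)^2))

## the same evaluation for the two-predictor rpart model
pred2 <- predict(model2, newdata = chocolate_test)
sqrt(mean((chocolate_test$final_grade - pred2)^2))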

3. Visualize the decision trees with rpart.plot() and plot()

# Visualize the decision tree built with rpart using rpart.plot
rpart.plot(model2, box.palette="RdBu", shadow.col="gray", nn=TRUE)

# Plot the tree.
plot(model3)

Hyperparameters

decision_tree(tree_depth = 1) %>%
  set_mode("regression") %>%
  set_engine("rpart") %>%
  parsnip::fit(formula = final_grade ~ .,
               data = chocolate_train)
## parsnip model object
## 
## n= 1438 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 1438 321.2239 3.183936  
##   2) cocoa_percent>=0.885 31  11.5000 2.500000 *
##   3) cocoa_percent< 0.885 1407 294.9036 3.199005 *
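tree_depth is only one of the hyperparameters exposed by decision_tree(); cost_complexity and min_n are the other two. A brief illustrative fit with arbitrarily chosen (untuned) values:

decision_tree(cost_complexity = 0.01, min_n = 20) %>%
  set_mode("regression") %>%
  set_engine("rpart") %>%
  parsnip::fit(formula = final_grade ~ .,
               data = chocolate_train)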

Random Forest

library(MASS)
data(package="MASS")
boston<-Boston
dim(boston)
## [1] 506  14
names(boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Training Sample with 300 observations:

library("randomForest")
train <- sample(1:nrow(Boston), 300)
Boston.rf <- randomForest(medv ~ ., data = Boston, subset = train)

Plotting the error against the number of trees shows how the error drops as more trees are added and averaged.

plot(Boston.rf)

Evaluating variable importance

importance(Boston.rf)
##         IncNodePurity
## crim       1167.57012
## zn          105.68616
## indus      1382.61992
## chas         76.04384
## nox        1448.30458
## rm         6039.15927
## age         616.75173
## dis        1167.19886
## rad         186.45117
## tax         874.09457
## ptratio    1697.38179
## black       480.12419
## lstat      6329.55370
varImpPlot(Boston.rf)
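As a final check (not shown above), the forest can be evaluated on the observations left out of the 300-row training sample by computing the test-set mean squared error:

## predict medv for the held-out observations and compute the test MSE
yhat <- predict(Boston.rf, newdata = Boston[-train, ])
mean((yhat - Boston$medv[-train])^2)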