In this lab we will explore decision trees and random forests using the palmerpenguins package. We will also use a few supporting packages: rpart, rpart.plot, ranger, and vip.

library(tidymodels)
library(palmerpenguins)

Split the Data

set.seed(1234)
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
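
As a quick sanity check (an addition to the original lab), initial_split() uses a default proportion of 3/4, so the 344 penguins should be divided into roughly 258 training and 86 testing rows:

nrow(penguins_train)
nrow(penguins_test)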

Decision Tree

  1. Fit a decision tree using decision_tree(), and visualize the structure of the tree.
decision_tree_rpart_spec <-
  decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('classification')
dt_fit <- fit(decision_tree_rpart_spec, species ~ ., data = penguins_train)
dt_fit
## parsnip model object
## 
## Fit time:  9ms 
## n= 258 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 258 148 Adelie (0.42635659 0.20542636 0.36821705)  
##   2) flipper_length_mm< 206.5 160  51 Adelie (0.68125000 0.31250000 0.00625000)  
##     4) bill_length_mm< 44.2 110   3 Adelie (0.97272727 0.02727273 0.00000000) *
##     5) bill_length_mm>=44.2 50   3 Chinstrap (0.04000000 0.94000000 0.02000000) *
##   3) flipper_length_mm>=206.5 98   4 Gentoo (0.01020408 0.03061224 0.95918367) *

The first split is on flipper_length_mm, with 206.5 mm as the cutoff. Among penguins with shorter flippers, a second split on bill_length_mm at 44.2 mm separates Adelie from Chinstrap, while penguins with flipper length of at least 206.5 mm are classified as Gentoo.

library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
rpart.plot(dt_fit$fit)

  2. Try different values of the hyperparameters for the tree and see how the shape of the tree changes.

The complexity parameter (cost_complexity, rpart's cp) sets how much a split must improve the fit before it is kept, and so controls the size of the tree. When we raise cost_complexity to 0.5, the tree is pruned back to a single split, leaving only two terminal nodes. Other hyperparameters such as tree_depth and min_n can be varied the same way (see the sketch after the output below).

decision_tree_rpart_spec_lambda <-
  decision_tree(cost_complexity = 0.5) %>%
  set_engine('rpart') %>%
  set_mode('classification')
dt_fit_lambda <- fit(decision_tree_rpart_spec_lambda, species ~ ., data = penguins_train)
dt_fit_lambda
## parsnip model object
## 
## Fit time:  4ms 
## n= 258 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 258 148 Adelie (0.42635659 0.20542636 0.36821705)  
##   2) flipper_length_mm< 206.5 160  51 Adelie (0.68125000 0.31250000 0.00625000) *
##   3) flipper_length_mm>=206.5 98   4 Gentoo (0.01020408 0.03061224 0.95918367) *
rpart.plot(dt_fit_lambda$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.
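
Besides cost_complexity, decision_tree() also exposes tree_depth and min_n. As a hedged sketch (the value is for illustration only, not from the original lab), capping the depth at one level should likewise prune the tree back to a single split; passing roundint = FALSE silences the warning shown above:

decision_tree_rpart_spec_depth <-
  decision_tree(tree_depth = 1) %>%   # limit the tree to one level of splits
  set_engine('rpart') %>%
  set_mode('classification')
dt_fit_depth <- fit(decision_tree_rpart_spec_depth, species ~ ., data = penguins_train)
rpart.plot(dt_fit_depth$fit, roundint = FALSE)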

Variable Importance Plots

  1. Use the vip package to showcase the variable importance.

For predicting species, vi() lists the variables in descending order of importance and vip() plots them. The most important variable in the decision tree is flipper_length_mm.

library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
vi(dt_fit)
## # A tibble: 5 x 2
##   Variable          Importance
##   <chr>                  <dbl>
## 1 flipper_length_mm       96.1
## 2 bill_length_mm          93.4
## 3 bill_depth_mm           73.3
## 4 body_mass_g             63.5
## 5 island                  52.8
vip(dt_fit)
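
vip() returns a ggplot object and accepts styling arguments such as num_features and geom; a small example (an aside, not part of the original lab):

vip(dt_fit, num_features = 5, geom = "point")  # points instead of bars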

Random Forest Models

  1. Fit a random forest model using rand_forest(). What do you see in the output?

We set importance = "impurity" in the engine arguments so that ranger records Gini impurity-based variable importance.

rand_forest_ranger_spec <-
  rand_forest() %>%
  set_engine('ranger', importance = "impurity") %>%
  set_mode('classification')
rand_forest_ranger_spec
## Random Forest Model Specification (classification)
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger
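
ranger also supports permutation-based importance. As a hedged aside (not required for this lab), the engine argument can simply be swapped:

rand_forest() %>%
  set_engine('ranger', importance = "permutation") %>%  # permutation instead of impurity
  set_mode('classification')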

The printed model shows ranger's defaults: 500 trees, an mtry of 2, and a target (minimum) node size of 10.

set.seed(4321)
rf_fit <- fit(rand_forest_ranger_spec, species ~ ., data = penguins_train)
rf_fit
## parsnip model object
## 
## Fit time:  55ms 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      251 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.01752356
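
The defaults printed above (500 trees, mtry of 2, minimum node size of 10) can all be overridden through rand_forest(); a brief sketch with illustrative values (not from the original lab):

rand_forest(mtry = 3, trees = 1000, min_n = 5) %>%  # illustrative values, not tuned
  set_engine('ranger', importance = "impurity") %>%
  set_mode('classification')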

  2. Use the vip package to showcase the variable importance for the random forest.

According to the random forest, the most important variable is bill_length_mm (importance of roughly 57), followed by flipper_length_mm. Note that this ordering differs from the single decision tree, where flipper_length_mm ranked first.

vi(rf_fit$fit)
## # A tibble: 7 x 2
##   Variable          Importance
##   <chr>                  <dbl>
## 1 bill_length_mm        57.3  
## 2 flipper_length_mm     36.6  
## 3 bill_depth_mm         25.1  
## 4 body_mass_g           19.3  
## 5 island                15.7  
## 6 sex                    0.768
## 7 year                   0.719
vip(rf_fit$fit)
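
The test set created at the start has not been used yet. As a final hedged sketch (not part of the original lab), the random forest could be assessed on penguins_test; drop_na() is needed because ranger cannot predict with missing values:

test_complete <- drop_na(penguins_test)  # ranger cannot handle NAs at prediction time
predict(rf_fit, new_data = test_complete) %>%
  bind_cols(test_complete) %>%
  accuracy(truth = species, estimate = .pred_class)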
