In this lab we explore decision trees and random forests using the penguins data from the palmerpenguins package. We also use the rpart, rpart.plot, ranger, and vip packages.
library(tidymodels)
library(palmerpenguins)
# Set the seed before splitting so the train/test split is reproducible
set.seed(1234)
penguins_split <- initial_split(penguins)
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
decision_tree(), and visualize the structure of the tree.
decision_tree_rpart_spec <-
  decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('classification')
dt_fit <- fit(decision_tree_rpart_spec, species ~ ., data = penguins_train)
dt_fit
## parsnip model object
##
## Fit time: 9ms
## n= 258
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 258 148 Adelie (0.42635659 0.20542636 0.36821705)
##   2) flipper_length_mm< 206.5 160 51 Adelie (0.68125000 0.31250000 0.00625000)
##     4) bill_length_mm< 44.2 110 3 Adelie (0.97272727 0.02727273 0.00000000) *
##     5) bill_length_mm>=44.2 50 3 Chinstrap (0.04000000 0.94000000 0.02000000) *
##   3) flipper_length_mm>=206.5 98 4 Gentoo (0.01020408 0.03061224 0.95918367) *
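The root node splits on flipper_length_mm, with 206.5 mm as the cutoff: penguins with longer flippers are classified as Gentoo. The remaining penguins are split once more on bill_length_mm at 44.2 mm, which separates Adelie (shorter bills) from Chinstrap (longer bills). No other variable appears in the printed splits.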
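To read the printed tree, it may help to check how the loss column relates to n and (yprob). The sketch below assumes that loss counts the training rows in a node that do not belong to the node's predicted class. It re-types the numbers from the output above rather than extracting them from dt_fit, and the data frame name nodes is only illustrative.

```r
# Node summaries re-typed from the printed tree (illustrative, not extracted from dt_fit)
nodes <- data.frame(
  node     = c(1, 2, 3),
  n        = c(258, 160, 98),
  top_prob = c(0.42635659, 0.68125000, 0.95918367)  # largest class probability per node
)

# Assumed relationship: loss = number of rows outside the majority class
nodes$loss <- round(nodes$n * (1 - nodes$top_prob))
nodes
```

Under that assumption, the computed values can be compared directly with the loss column printed for nodes 1, 2, and 3.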
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
rpart.plot(dt_fit$fit)
The complexity parameter (cp) controls the size of the decision tree: a larger value penalizes additional splits more heavily, so fewer splits survive. When we set cost_complexity to 0.5, the tree keeps only a single split, on flipper_length_mm, and has two terminal nodes.
decision_tree_rpart_spec_lambda <-
  decision_tree(cost_complexity = 0.5) %>%
  set_engine('rpart') %>%
  set_mode('classification')
dt_fit_lambda <- fit(decision_tree_rpart_spec_lambda, species ~ ., data = penguins_train)
dt_fit_lambda
## parsnip model object
##
## Fit time: 4ms
## n= 258
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 258 148 Adelie (0.42635659 0.20542636 0.36821705)
##   2) flipper_length_mm< 206.5 160 51 Adelie (0.68125000 0.31250000 0.00625000) *
##   3) flipper_length_mm>=206.5 98 4 Gentoo (0.01020408 0.03061224 0.95918367) *
rpart.plot(dt_fit_lambda$fit)
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
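One way to see why the second split disappears at cost_complexity = 0.5 is to check the numbers by hand. This is a rough sketch, assuming rpart keeps a split only when it reduces the misclassification loss by at least cp times the root node's loss. The values are re-typed from the two printed trees, and the variable names are illustrative.

```r
root_loss         <- 148     # loss at node 1
loss_after_split1 <- 51 + 4  # nodes 2 and 3, the children of the root
loss_node2        <- 51      # loss at node 2 before it is split
loss_after_split2 <- 3 + 3   # nodes 4 and 5, the children of node 2

# Improvement of each split, scaled by the root loss (assumed cp scale)
gain1 <- (root_loss - loss_after_split1) / root_loss
gain2 <- (loss_node2 - loss_after_split2) / root_loss

cp <- 0.5
c(split1_kept = gain1 >= cp, split2_kept = gain2 >= cp)
```

Under this assumption the flipper_length_mm split clears the 0.5 threshold while the bill_length_mm split does not, which is consistent with the smaller tree printed above.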
vip package to showcase the variable importance.
For the species outcome, vi() returns the variable importance scores as a tibble and vip() plots them, both in descending order. The most important variable is flipper_length_mm, closely followed by bill_length_mm.
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
vi(dt_fit)
## # A tibble: 5 x 2
## Variable Importance
## <chr> <dbl>
## 1 flipper_length_mm 96.1
## 2 bill_length_mm 93.4
## 3 bill_depth_mm 73.3
## 4 body_mass_g 63.5
## 5 island 52.8
vip(dt_fit)
rand_forest(). What do you see in the output?
We set importance = "impurity" in the model specification so that ranger records impurity-based variable importance.
rand_forest_ranger_spec <-
  rand_forest() %>%
  set_engine('ranger', importance = "impurity") %>%
  set_mode('classification')
rand_forest_ranger_spec
## Random Forest Model Specification (classification)
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
In the fitted model output, the target node size is 10, which is ranger's default minimum node size for probability forests.
set.seed(4321)
rf_fit <- fit(rand_forest_ranger_spec, species ~ ., data = penguins_train)
rf_fit
## parsnip model object
##
## Fit time: 55ms
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 251
## Number of independent variables: 7
## Mtry: 2
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.01752356
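The Mtry: 2 line in the output appears to follow from ranger's default when mtry is not set in rand_forest(): the rounded-down square root of the number of predictors. A quick check, with n_predictors and default_mtry as illustrative names:

```r
n_predictors <- 7  # "Number of independent variables" in the output above

# ranger's documented default for mtry when none is supplied
default_mtry <- floor(sqrt(n_predictors))
default_mtry
```

Setting mtry explicitly in rand_forest() would override this default.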
vip package to showcase the variable importance for the random forest.
In the random forest, the most important variable is bill_length_mm, with an impurity importance of about 57.3, ahead of flipper_length_mm at about 36.6. This reverses the order of the top two variables from the single decision tree.
vi(rf_fit$fit)
## # A tibble: 7 x 2
## Variable Importance
## <chr> <dbl>
## 1 bill_length_mm 57.3
## 2 flipper_length_mm 36.6
## 3 bill_depth_mm 25.1
## 4 body_mass_g 19.3
## 5 island 15.7
## 6 sex 0.768
## 7 year 0.719
vip(rf_fit$fit)
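Raw impurity importances are easier to compare when expressed as shares of the total. The sketch below re-types the values from the random forest table above (it does not call vi()), and the names rf_imp and rf_share are illustrative.

```r
# Importance values re-typed from the printed vi(rf_fit$fit) table
rf_imp <- c(
  bill_length_mm    = 57.3,
  flipper_length_mm = 36.6,
  bill_depth_mm     = 25.1,
  body_mass_g       = 19.3,
  island            = 15.7,
  sex               = 0.768,
  year              = 0.719
)

# Each variable's share of the total importance, in percent
rf_share <- 100 * rf_imp / sum(rf_imp)
round(sort(rf_share, decreasing = TRUE), 1)
```

Shares like these only describe relative importance within this one fitted forest; they are not comparable across models with different importance measures.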