S3729C Data Analytics Seminar

04 Machine Learning

Class on 28 August 2021

Decision Trees and Random Forests

In this section, we will continue to use the Community2Campus dataset to develop predictive models of stigma with decision trees and random forests. These are supervised learning algorithms that can be used for classification or regression.

Brief Introduction to Decision Trees

A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. At its core, it is simply a series of sequential decisions made to reach a specific result. We will use the loan-approval example below to explain how such a tree works.

First, it checks if the customer has a good credit history. Based on that, it classifies the customer into two groups, i.e., customers with good credit history and customers with bad credit history. Then, it checks the income of the customer and again classifies him/her into two groups. Finally, it checks the loan amount requested by the customer. Based on the outcomes from checking these three features, the decision tree decides if the customer’s loan should be approved or not.

The features/attributes and conditions can change based on the data and complexity of the problem but the overall idea remains the same. So, a decision tree makes a series of decisions based on a set of features/attributes present in the data, which in this case were credit history, income, and loan amount.

Now, you might be wondering:

Why did the decision tree check the credit history first and not the income?

This ordering reflects feature importance: the sequence of attributes to be checked is decided on the basis of splitting criteria such as Gini impurity or information gain.
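To make the splitting criterion concrete, here is a minimal R sketch of how Gini impurity can be computed for a node. The helper function gini_impurity() is our own illustrative name and is not part of any package used later; rpart computes this internally.

# A minimal sketch of Gini impurity for a node (illustrative helper, not from a package)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)                         # Gini impurity = 1 - sum of squared proportions
}

# Example: a node with 6 approved and 4 rejected loan applications
gini_impurity(c(rep("approve", 6), rep("reject", 4)))  # 0.48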

An Overview of Random Forest

The decision tree algorithm is quite easy to understand and interpret. But often, a single tree is not sufficient for producing effective results. This is where the Random Forest algorithm comes into the picture.

Random Forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. As the name suggests, it is a “forest” of trees!

But why do we call it a “random” forest? That’s because it is a forest of randomly created decision trees. Each node in the decision tree works on a random subset of features to calculate the output. The random forest then combines the output of individual decision trees to generate the final output.

In simple words:

The Random Forest Algorithm combines the output of multiple (randomly created) Decision Trees to generate the final output. This process of combining the output of multiple individual models (also known as weak learners) is called Ensemble Learning.

Loading the Data and Packages

This is the code for installing pacman, which is used to load all the packages for this section. You have used it in Sections 01, 02, and 03 as well.

install.packages("pacman",repos = "http://cran.us.r-project.org")
## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\Rtmp6ZYZZ0\downloaded_packages

Load the packages required for this section

pacman::p_load(pacman, psych, rio, magrittr, ggplot2, tidyverse, tidymodels, vip, rpart, rpart.plot, ranger)

You can now import the source data from the CSV file, which I have placed online on GitHub at the link below, using the read.csv() command.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2Cdecisiontree.csv

C2CdTx <- read.csv(file = "https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/C2Cdecisiontree.csv", header = TRUE, sep = ",")

Check the output by reading the column names

C2CdTx %>% colnames()
##  [1] "UNQ_ID"        "AGE.RANGE"     "GENDER"        "NCSS_B03A"    
##  [5] "NCSS_B04"      "NCSS_B05"      "PARTICIPATION" "Process_E_LS" 
##  [9] "Process_E_EX"  "Process_E_CT"  "X_MR1"         "X_MR2"        
## [13] "X_MR3"         "X_MR4"         "X_MR5"         "I_MR1"        
## [17] "I_MR2"         "I_MR3"         "I_MR4"         "I_MR5"

Let's now prepare the data that we want to use for the decision tree model.

C2CdT <- C2CdTx %>% 
  select(AGE.RANGE, GENDER, NCSS_B03A, NCSS_B04, NCSS_B05, PARTICIPATION, Process_E_LS, Process_E_EX, Process_E_CT, I_MR1)

We will now convert several variables to factors so that they are treated as categorical variables during decision tree construction.

C2CdT$AGE.RANGE <- factor(C2CdT$AGE.RANGE)
C2CdT$GENDER <- factor(C2CdT$GENDER)
C2CdT$NCSS_B03A <- factor(C2CdT$NCSS_B03A)
C2CdT$NCSS_B04 <- factor(C2CdT$NCSS_B04)
C2CdT$NCSS_B05 <- factor(C2CdT$NCSS_B05)
C2CdT$I_MR1 <- factor(C2CdT$I_MR1)
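As an optional alternative (not part of the original script), the same conversions can be written in a single dplyr call, since tidyverse is already loaded:

# Optional alternative: convert all of these columns to factors in one call
C2CdT <- C2CdT %>%
  mutate(across(c(AGE.RANGE, GENDER, NCSS_B03A, NCSS_B04, NCSS_B05, I_MR1), factor))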

We will be working with the C2CdT data frame in this session. This code shows the first 6 rows of the dataset.

head(C2CdT)
##         AGE.RANGE GENDER NCSS_B03A NCSS_B04            NCSS_B05 PARTICIPATION
## 1 18-24 years old Female        No       No       7 Polytechnic             7
## 2 18-24 years old   Male       Yes       No       7 Polytechnic             6
## 3 18-24 years old Female       Yes       No       7 Polytechnic             4
## 4 18-24 years old Female       Yes       No       7 Polytechnic             3
## 5 18-24 years old Female       Yes       No       7 Polytechnic             9
## 6 18-24 years old Female       Yes       No 8 University Degree            10
##   Process_E_LS Process_E_EX Process_E_CT I_MR1
## 1            5            5            5     N
## 2            5            5            5     N
## 3            5            5            5     Y
## 4            5            5            5     N
## 5            5            5            5     N
## 6            5            5            5     N

A row in this data frame represents a participant in one of the Community2Campus events. Each participant completed both pre- and post-event surveys, which supply the data fields used here.

The response variable in this data is I_MR1, which indicates whether there is a significant positive impact on the aggregated variable MR1, a measure of the impact on self- and social stigma associated with mental health.
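Before modelling, it can be helpful to check how the two levels of I_MR1 are distributed. This quick check is an addition to the original script and its output is not shown here.

# Optional: check the class balance of the response variable
C2CdT %>% count(I_MR1)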

Decision Trees

To demonstrate fitting decision trees, we will use the C2CdT data set and predict I_MR1 using all available predictor variables.

A decision tree is specified with the decision_tree() function from tidymodels and has three hyperparameters: cost_complexity, tree_depth, and min_n. Since we will need to perform hyperparameter tuning, we will create cross validation folds from our training data.

Data Splitting

We will split the data into a training and test set. The training data will be further divided into 5 folds for hyperparameter tuning.

set.seed(314) # Remember to always set your seed. Any integer will work

C2CdT_split <- initial_split(C2CdT, prop = 0.75, 
                             strata = I_MR1)

C2CdT_training <- C2CdT_split %>% training()

C2CdT_test <- C2CdT_split %>% testing()

# Create folds for cross validation on the training data set
## These will be used to tune model hyperparameters
set.seed(314)

C2CdT_folds <- vfold_cv(C2CdT_training, v = 5)
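If you want to confirm how many rows ended up in the training set, the test set, and each fold, you can simply print the split and fold objects. This optional check is not part of the original script.

# Optional: inspect the split and the cross-validation folds
C2CdT_split   # prints the training/testing/total row counts
C2CdT_folds   # a tibble with one row per fold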

Feature Engineering

Now we can create a feature engineering recipe for this data. We will train the following transformations on our training data: a Yeo-Johnson transformation of all numeric predictors, normalization (centering and scaling) of all numeric predictors, and dummy-variable encoding of all nominal predictors.

C2CdT_recipe <- recipe(I_MR1 ~ ., data = C2CdT_training) %>% 
                       step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                       step_normalize(all_numeric(), -all_outcomes()) %>% 
                       step_dummy(all_nominal(), -all_outcomes())

Let’s check to see if the feature engineering steps have been carried out correctly.

C2CdT_recipe %>% 
  prep() %>% 
  bake(new_data = C2CdT_training)
## # A tibble: 721 x 22
##    PARTICIPATION Process_E_LS Process_E_EX Process_E_CT I_MR1 AGE.RANGE_X18.24.~
##            <dbl>        <dbl>        <dbl>        <dbl> <fct>              <dbl>
##  1       -0.0525        0.735        0.684        0.667 N                      1
##  2       -1.66          0.735        0.684        0.667 N                      1
##  3        0.876         0.735        0.684        0.667 N                      1
##  4        1.36          0.735        0.684        0.667 N                      1
##  5        0.403         0.735        0.684        0.667 N                      1
##  6       -0.0525        0.735        0.684        0.667 N                      1
##  7        0.403         0.735        0.684        0.667 N                      1
##  8        1.36          0.735        0.684        0.667 N                      1
##  9       -1.30          0.735        0.684        0.667 N                      1
## 10        0.876         0.735        0.684        0.667 N                      1
## # ... with 711 more rows, and 16 more variables:
## #   AGE.RANGE_X25.34.years.old <dbl>, AGE.RANGE_X35.44.years.old <dbl>,
## #   AGE.RANGE_X45.54.years.old <dbl>, AGE.RANGE_X55.64.years.old <dbl>,
## #   AGE.RANGE_X65.74.years.old <dbl>, AGE.RANGE_Below.12.years.old <dbl>,
## #   GENDER_Male <dbl>, NCSS_B03A_Yes <dbl>, NCSS_B04_Yes <dbl>,
## #   NCSS_B05_X2.Primary.School <dbl>, NCSS_B05_X3.Lower.Secondary.School <dbl>,
## #   NCSS_B05_X4.Upper.Secondary.School <dbl>, ...

Model Specification

Next, we specify a decision tree classifier with the following hyperparameters: cost_complexity, tree_depth, and min_n.

To specify a decision tree model with tidymodels, we need the decision_tree() function. The hyperparameters of the model are arguments within the decision_tree() function and may be set to specific values.

However, if tuning is required, then each of these parameters must be set to tune().

We will be using the rpart engine since it will allow us to easily make plots of our decision tree models with the rpart.plot() function.

tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
              set_engine('rpart') %>% 
              set_mode('classification')

Workflow

Next, we combine our model and recipe into a workflow to easily manage the model-building process.

## Create a grid of hyperparameter values to test
tree_workflow <- workflow() %>% 
                 add_model(tree_model) %>% 
                 add_recipe(C2CdT_recipe)

Hyperparameter Tuning

We will perform a grid search on the decision tree hyperparameters and select the best performing model based on the area under the ROC curve during cross validation.

For models with multiple hyperparameters, tidymodels has the grid_regular() function for automatically creating a tuning grid with combinations of suggested hyperparameter values.

The grid_regular() function takes the hyperparameters as arguments and a levels option. The levels option is used to determine the number of values to create for each unique hyperparameter.

The number of rows in the grid will be the number of levels raised to the power of the number of hyperparameters (levels^p).

For example, with levels = 2 and 3 model hyperparameters, the grid_regular() function will produce a tuning grid with 2^3 = 8 hyperparameter combinations.

See the example below, where we create the tree_grid object.

## Create a grid of hyperparameter values to test
tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          min_n(), 
                          levels = 2)
# View grid
tree_grid
## # A tibble: 8 x 3
##   cost_complexity tree_depth min_n
##             <dbl>      <int> <int>
## 1    0.0000000001          1     2
## 2    0.1                   1     2
## 3    0.0000000001         15     2
## 4    0.1                  15     2
## 5    0.0000000001          1    40
## 6    0.1                   1    40
## 7    0.0000000001         15    40
## 8    0.1                  15    40

Tuning Hyperparameters with tune_grid()

To find the optimal combination of hyperparameters from our tuning grid, we will use the tune_grid() function.

This function takes a model object or workflow as the first argument, cross validation folds as the second argument, and a tuning grid data frame as the third argument.

It is always suggested to use set.seed() before calling tune_grid() in order to be able to reproduce the results at a later time.

## Tune decision tree workflow
set.seed(314)

tree_tuning <- tree_workflow %>% 
               tune_grid(resamples = C2CdT_folds,
                         grid = tree_grid)
## Warning: package 'rlang' was built under R version 3.6.3
## Warning: package 'vctrs' was built under R version 3.6.3

To view the results of our hyperparameter tuning, we can use the show_best() function. We must pass the type of performance metric we would like to see into the show_best() function.

From the results below, we see that for each combination of hyperparameters in our grid, tune_grid() fit a decision tree model with that combination 5 times (since we have 5 folds in our cross validation object).

The mean column in the results below indicates the average value of the performance metric that was obtained.

## Tune decision tree workflow
## Show the top 5 best models based on roc_auc metric
tree_tuning %>% show_best('roc_auc')
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.828     5  0.0178
## 2    0.0000000001         15     2 roc_auc binary     0.758     5  0.0198
## 3    0.1                  15     2 roc_auc binary     0.653     5  0.0266
## 4    0.1                  15    40 roc_auc binary     0.653     5  0.0266
## 5    0.0000000001          1     2 roc_auc binary     0.627     5  0.0169
## # ... with 1 more variable: .config <fct>
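Besides show_best(), the full set of tuning results can be retrieved with collect_metrics(). This optional call, an addition to the original script, returns one row per hyperparameter combination and metric, averaged across the 5 folds.

# Optional: view the cross-validated results for every hyperparameter combination
tree_tuning %>% collect_metrics()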

We can use the select_best() function to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the roc_auc metric.

We can see which model is best by looking at the cost complexity, tree depth, and min_n values that give the best roc_auc metric. In this case it is the model with a cost complexity of 0.0000000001, a tree depth of 15, and min_n of 40.

## Select best model based on roc_auc
best_tree <- tree_tuning %>% 
             select_best(metric = 'roc_auc')

# View the best tree parameters
best_tree
## # A tibble: 1 x 4
##   cost_complexity tree_depth min_n .config             
##             <dbl>      <int> <int> <fct>               
## 1    0.0000000001         15    40 Preprocessor1_Model7

Finalize Workflow

The last step in hyperparameter tuning is to use finalize_workflow() to add our optimal model to our workflow object.

final_tree_workflow <- tree_workflow %>% 
                       finalize_workflow(best_tree)

Visualize Results

In order to visualize our decision tree model, we will need to manually train our workflow object with the fit() function.

This step is optional, since visualizing the model is not necessary in all applications. However, if the goal is to understand why the model is predicting certain values, then it is recommended.

A later section shows how to fit the model with last_fit(), which automatically obtains test set performance.

Fit the Model

Next we fit our workflow to the training data. This is done by passing our workflow object to the fit() function.

tree_wf_fit <- final_tree_workflow %>% 
               fit(data = C2CdT_training)

Exploring our Trained Model

Once we have trained our model on our training data, we can study variable importance with the vip() function.

The first step is to extract the trained model from our workflow fit, tree_wf_fit. This can be done by passing tree_wf_fit to the pull_workflow_fit() function.

tree_fit <- tree_wf_fit %>% 
            pull_workflow_fit()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.
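The warning above notes that pull_workflow_fit() has been deprecated; on workflows 0.2.3 or later, the equivalent call would be extract_fit_parsnip(), as sketched below.

# Equivalent call on newer versions of workflows (avoids the deprecation warning)
tree_fit <- tree_wf_fit %>% 
            extract_fit_parsnip()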

Variable Importance

Next we pass tree_fit to the vip() function. This will return a ggplot object with the variable importance scores from our model. The importance scores are based on the splitting criteria of the trained decision tree.

From the results below, we can identify which variables are the most important predictors of I_MR1. From the plot, this would include PARTICIPATION, AGE RANGE, EDUCATION LEVEL and Process Indicators for LOGISTICS & EQUIPMENT as well as EVENT PERSONNEL.

vip(tree_fit)

Decision Tree Plot

We can visualize the trained decision tree by using the rpart.plot() function from the rpart.plot package. We must pass the fit object stored within tree_fit into the rpart.plot() function to make a decision tree plot.

When passing a tidymodels object into this function, we also need to pass the additional parameter roundint = FALSE.

This visualization is a tool to communicate the prediction rules of the trained decision tree (the splits in the predictor space).

Many times, the decision tree plot will be quite large and difficult to read. This is the case for our example below. There are specialized packages in R for zooming into regions of a decision tree plot. However, we do not have time to get into the details this semester.

rpart.plot(tree_fit$fit, roundint = FALSE)

Train and Evaluate With last_fit()

Next we fit our final model workflow to the training data and evaluate performance on the test data.

The last_fit() function will fit our workflow to the training data and generate predictions on the test data as defined by our C2CdT_split object.

tree_last_fit <- final_tree_workflow %>% 
                 last_fit(C2CdT_split)

We can view our performance metrics on the test data

tree_last_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <fct>               
## 1 accuracy binary         0.743 Preprocessor1_Model1
## 2 roc_auc  binary         0.810 Preprocessor1_Model1

ROC Curve

We can plot the ROC curve to visualize test set performance of our tuned decision tree

tree_last_fit %>% collect_predictions() %>% 
                  roc_curve(truth = I_MR1, estimate = .pred_N) %>% 
                  autoplot()
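If you would like the numeric area under the curve alongside the plot, the same predictions can be passed to roc_auc(). This optional call, added here for reference, should reproduce the roc_auc value reported by collect_metrics() above.

# Optional: numeric AUC computed from the same test-set predictions
tree_last_fit %>% collect_predictions() %>% 
                  roc_auc(truth = I_MR1, .pred_N)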

Confusion Matrix

We can also see the number of false negatives and false positives in our test data set by constructing a confusion matrix.

tree_predictions <- tree_last_fit %>% collect_predictions()

conf_mat(tree_predictions, truth = I_MR1, estimate = .pred_class)
##           Truth
## Prediction  N  Y
##          N 97 30
##          Y 32 82
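As a quick sanity check, the accuracy reported earlier by collect_metrics() (0.743) can be recovered directly from this confusion matrix; the short calculation below is an optional addition.

# Accuracy from the confusion matrix: correct predictions / total predictions
(97 + 82) / (97 + 30 + 32 + 82)   # approximately 0.743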

Random Forests

In this section, we will fit a random forest model to the C2CdT data set, which we used earlier to generate the decision tree. Random forests build on decision trees to construct more powerful models in terms of prediction accuracy. The main mechanism that powers this algorithm is repeated sampling (with replacement) of the training data to produce a collection of decision tree models. The predictions of these trees are then combined (averaged for regression, majority vote for classification) to obtain a single prediction for a given point in the predictor space.

The random forest model also selects a random subset of predictor variables as candidates for splitting the predictor space in the tree-building process. This is repeated for every tree in the forest, typically 100 to 2,000 trees.
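As a conceptual illustration of these two sources of randomness, the short sketch below (an addition for teaching purposes, not the ranger implementation used later) grows a handful of small rpart trees, each on a bootstrap sample of rows and a random subset of predictors. Note that a true random forest re-draws the predictor subset at every split, whereas this simplified sketch draws it once per tree.

# Conceptual sketch of a random forest: bootstrap rows + random predictor subsets
set.seed(314)
n_trees <- 5                                               # kept small for illustration
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows  <- sample(nrow(C2CdT), replace = TRUE)        # bootstrap sample of rows
  predictors <- sample(setdiff(names(C2CdT), "I_MR1"), 3)  # random subset of predictors
  rpart(reformulate(predictors, response = "I_MR1"),
        data = C2CdT[boot_rows, ], method = "class")
})
# Each element of `trees` is one decision tree; a random forest combines their
# predictions (majority vote for classification) to produce the final output.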

Data Splitting and Feature Engineering

We have already split our data into training, test, and cross validation sets as well as trained our feature engineering recipe, C2CdT_recipe. These can be reused in our random forest workflow.

Model Specification

Next, we specify a random forest classifier with the following hyperparameters: mtry, trees, and min_n.

To specify a random forest model with tidymodels, we need the rand_forest() function. The hyperparameters of the model are arguments within the rand_forest() function and may be set to specific values. However, if tuning is required, then each of these parameters must be set to tune().

We will be using the ranger engine. This engine has an optional importance argument which can be used to track variable importance measures based on the Gini index. In order to make a variable importance plot with vip(), we must add importance = 'impurity' inside our set_engine() function.

rf_model <- rand_forest(mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>% 
            set_engine('ranger', importance = "impurity") %>% 
            set_mode('classification')

Workflow

Next, we combine our model and recipe into a workflow to easily manage the model-building process.

rf_workflow <- workflow() %>% 
               add_model(rf_model) %>% 
               add_recipe(C2CdT_recipe)

Hyperparameter Tuning

Random Grid Search

We will perform a grid search on the random forest hyperparameters and select the best performing model based on the area under the ROC curve during cross validation.

In the previous section, we used grid_regular() to create a grid of hyperparameter values. This created a regular grid of recommended default values.

Another way to do hyperparameter tuning is to create a random grid of values. Random search often covers the hyperparameter space more efficiently than a regular grid, particularly when only a few of the hyperparameters strongly influence performance.

Random grid search is implemented with the grid_random() function in tidymodels. Like grid_regular() it takes a sequence of hyperparameter names to create the grid. It also has a size parameter that specifies the number of random combinations to create.

The mtry() hyperparameter requires a pre-set range of values to test since it cannot exceed the number of columns in our data. When we add this to grid_random() we can pass mtry() into the range_set() function and set a range for the hyperparameter with a numeric vector.

In the code below, we set the range from 2 to 9. This is because C2CdT has 10 columns, so at most 9 predictor variables are available (I_MR1 is the outcome), and we would like to test mtry() values across most of that range while avoiding the extreme of splitting on a single predictor.

When using grid_random(), it is suggested to use set.seed() for reproducibility.

## Create a grid of hyperparameter values to test

set.seed(314)

rf_grid <- grid_random(mtry() %>% range_set(c(2, 9)),
                       trees(),
                       min_n(),
                       size = 10)
# View grid
rf_grid
## # A tibble: 10 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     7  1644    34
##  2     9  1440    20
##  3     5  1549    31
##  4     4  1734    39
##  5     5   332    11
##  6     4  1064     3
##  7     8   218    37
##  8     3  1304    24
##  9     7   477    32
## 10     8  1621     6

Tuning Hyperparameters with tune_grid()

To find the optimal combination of hyperparameters from our tuning grid, we will use the tune_grid() function.

## Tune random forest workflow
set.seed(314)

rf_tuning <- rf_workflow %>% 
             tune_grid(resamples = C2CdT_folds,
                       grid = rf_grid)

To view the results of our hyperparameter tuning, we can use the show_best() function. We must pass the type of performance metric we would like to see into the show_best() function.

## Show the top 5 best models based on roc_auc metric
rf_tuning %>% show_best('roc_auc')
## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>                
## 1     9  1440    20 roc_auc binary     0.855     5  0.0139 Preprocessor1_Model02
## 2     7   477    32 roc_auc binary     0.853     5  0.0167 Preprocessor1_Model09
## 3     5   332    11 roc_auc binary     0.852     5  0.0162 Preprocessor1_Model05
## 4     7  1644    34 roc_auc binary     0.851     5  0.0168 Preprocessor1_Model01
## 5     8   218    37 roc_auc binary     0.851     5  0.0182 Preprocessor1_Model07

We can use the select_best() function to select the model from our tuning results that had the best overall performance. In the code below, we specify to select the best performing model based on the roc_auc metric.

## Select best model based on roc_auc
best_rf <- rf_tuning %>% 
           select_best(metric = 'roc_auc')

# View the best parameters
best_rf
## # A tibble: 1 x 4
##    mtry trees min_n .config              
##   <int> <int> <int> <fct>                
## 1     9  1440    20 Preprocessor1_Model02

Finalize Workflow

The last step in hyperparameter tuning is to use finalize_workflow() to add our optimal model to our workflow object.

final_rf_workflow <- rf_workflow %>% 
                     finalize_workflow(best_rf)

Variable Importance

In order to visualize the variable importance scores of our random forest model, we will need to manually train our workflow object with the fit() function.

Fit the Model

Next we fit our workflow to the training data. This is done by passing our workflow object to the fit() function.

rf_wf_fit <- final_rf_workflow %>% 
             fit(data = C2CdT_training)

Once we have trained our model on our training data, we can study variable importance with the vip() function.

The first step is to extract the trained model from our workflow fit, rf_wf_fit. This can be done by passing rf_wf_fit to the pull_workflow_fit() function.

rf_fit <- rf_wf_fit %>% 
          pull_workflow_fit()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## Please use `extract_fit_parsnip()` instead.

Variable Importance

Next we pass rf_fit to the vip() function. This will return a ggplot object with the variable importance scores from our model. Because we set importance = 'impurity' in set_engine(), these scores reflect how much each predictor reduces Gini impurity, aggregated across all the trees in the forest.

vip(rf_fit)

Similar to the decision tree model, the most important variables for predicting impact on MR1 are PARTICIPATION, EDUCATION LEVEL, and AGE RANGE, together with the Process Indicators on EQUIPMENT & LOGISTICS and PROGRAMME PERSONNEL.

Train and Evaluate With last_fit()

Next we fit our final model workflow to the training data and evaluate performance on the test data.

The last_fit() function will fit our workflow to the training data and generate predictions on the test data as defined by our C2CdT_split object.

rf_last_fit <- final_rf_workflow %>% 
               last_fit(C2CdT_split)

We can view our performance metrics on the test data

rf_last_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <fct>               
## 1 accuracy binary         0.776 Preprocessor1_Model1
## 2 roc_auc  binary         0.822 Preprocessor1_Model1

ROC Curve

We can plot the ROC curve to visualize test set performance of our random forest model.

rf_last_fit %>% collect_predictions() %>% 
                roc_curve(truth = I_MR1, estimate = .pred_N) %>% 
                autoplot()

Confusion Matrix

We can also construct a confusion matrix on our test data set to see the number of false negatives and false positives produced by the random forest, and to check whether it outperforms the decision tree model created earlier.

rf_predictions <- rf_last_fit %>% collect_predictions()

conf_mat(rf_predictions, truth = I_MR1, estimate = .pred_class)
##           Truth
## Prediction   N   Y
##          N 109  34
##          Y  20  78
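Repeating the sanity check from the decision tree section, the accuracy implied by this matrix matches the 0.776 reported by collect_metrics() for the random forest, which is higher than the decision tree's 0.743; on this test set, the random forest does outperform the single tree. The short calculation below is an optional addition.

# Accuracy from the random forest confusion matrix
(109 + 78) / (109 + 34 + 20 + 78)   # approximately 0.776, vs 0.743 for the decision tree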

Congratulations!

You have completed session 4 of the S3729C Data Analytics Seminar.