Info about the activity

Learning aim

The aim of this lab is to learn how to train and tune a random forest in R.

Objectives

By the end of this lab session, students should be able to:

Undersample highly imbalanced data,
Train a random forest, and
Tune the hyperparameters of a random forest to minimize the out-of-bag error.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type directly in this document; others require modifying R code, which you also do here. In either case, you can discuss your work with the lab instructor.

Data

We will again work with a dataset of credit card fraud.

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

This is a collection of credit card transactions in Europe. The original features have been transformed to ensure anonymity.

library(tidyverse)  # data transformation and plotting

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(caret)  # framework for training ML models

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ranger) # for training random forest

C <- read_csv("creditcard.csv") %>%
  select(-Time) %>%
  mutate(Class = ifelse(Class == 1, "yes", "no")) %>%
  mutate(Class = as.factor(Class)) %>%
  arrange(-Amount)

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.

head(C)

Training and test sets

Our dataset is large and highly imbalanced:

C %>%
  group_by(Class) %>%
  summarise(N = n()) %>%
  ungroup %>%
  mutate(Frequency = N / sum(N))

## `summarise()` ungrouping output (override with `.groups` argument)

We will create a new dataset by keeping, say, 5000 records of the “no” class and all the records of the “yes” class:

set.seed(78)
ind_subset_of_our_data <- c(sample(which(C$Class == "no"), 5000), which(C$Class == "yes"))
new_data <- C %>% slice(ind_subset_of_our_data)
new_data %>%
  group_by(Class) %>%
  summarise(N = n()) %>%
  ungroup %>%
  mutate(Frequency = N / sum(N))

## `summarise()` ungrouping output (override with `.groups` argument)
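The undersampling step above can also be wrapped in a small reusable function. The sketch below is illustrative only: `undersample()` and the `toy` data frame are hypothetical names that do not appear in the lab, and the data is synthetic.

```r
# Hypothetical helper: keep every minority row and a random subset of majority rows.
undersample <- function(data, class_col, majority, n_keep) {
  idx_major <- which(data[[class_col]] == majority)
  idx_minor <- which(data[[class_col]] != majority)
  # Never request more majority rows than exist
  n_keep <- min(n_keep, length(idx_major))
  # sample.int() on positions avoids the sample(x) pitfall when x has length 1
  keep <- c(idx_major[sample.int(length(idx_major), n_keep)], idx_minor)
  data[keep, , drop = FALSE]
}

set.seed(1)
# Synthetic, illustrative data: 990 "no" rows and 10 "yes" rows
toy <- data.frame(x = rnorm(1000),
                  Class = factor(rep(c("no", "yes"), times = c(990, 10))))

toy_small <- undersample(toy, "Class", majority = "no", n_keep = 90)
table(toy_small$Class)              # class counts after undersampling
prop.table(table(toy_small$Class))  # class frequencies after undersampling
```

Only the majority class is thinned, so no minority record is lost; the price is that the class frequencies in the new data no longer match those of the original population.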

The data is still imbalanced (about 9% of the records are now “yes”), but this is acceptable. Next, we split this new dataset into 50% training and 50% test sets.

set.seed(128)
ind_train <- sample(1:nrow(new_data), round(nrow(new_data)/2))
train_data <- new_data %>% slice(ind_train)
test_data <- new_data %>% slice(-ind_train)
cat("Dimensions of training data are", dim(train_data), "\n")

## Dimensions of training data are 2746 30

cat("Dimensions of test data are", dim(test_data), "\n")

## Dimensions of test data are 2746 30
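The split above is a simple random one, so the share of “yes” records in the two halves can differ slightly by chance. A possible alternative, not used in this lab, is to sample half of the rows within each class. The sketch below uses base R and a synthetic data frame `toy` (a hypothetical name) to show the idea.

```r
set.seed(2)
# Synthetic, illustrative data: 180 "no" rows and 20 "yes" rows
toy <- data.frame(x = rnorm(200),
                  Class = factor(rep(c("no", "yes"), times = c(180, 20))))

# Row positions grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$Class)

# Draw half of the positions within each class
ind_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), round(length(idx) / 2))]
}), use.names = FALSE)

train_toy <- toy[ind_train, ]
test_toy  <- toy[-ind_train, ]

table(train_toy$Class)  # class counts in the training half
table(test_toy$Class)   # class counts in the test half
```

With this scheme both halves keep the same class proportions as the data they were drawn from.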

Random forest

We will use the ranger package because, through caret, it allows tuning more than one hyperparameter.

Training

First, we train a random forest model and print the result. Note that by default, caret tries three values of mtry (the number of predictors considered at each split), \(2\), roughly \(p/2\) and \(p\), two values of splitrule (we did not cover this in class; you can either ignore it or look it up), and only one value of min.node.size, 1.

Here we set train control to "oob", i.e., the out-of-bag error (5-fold cross-validation would make training roughly 5 times slower), and num.trees to 50 (the default value of 500 would make training roughly 10 times slower).
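As a side note on why the out-of-bag error is cheap: each tree is grown on a bootstrap sample, so some rows are left out of that tree and can serve as its validation set at no extra cost. The base R simulation below is a sketch that assumes sampling with replacement at the full sample size; it estimates how large that left-out share is.

```r
set.seed(3)
n <- 2746       # same size as the training set in this lab
n_trees <- 50   # same number of trees as in the model below

# For each tree, draw a bootstrap sample and record the share of rows left out
oob_share <- replicate(n_trees, {
  in_bag <- sample.int(n, n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean(oob_share)   # simulated out-of-bag share, averaged over trees
(1 - 1 / n)^n     # probability that a given row is out-of-bag, close to exp(-1)
```

Roughly a third of the rows are out-of-bag for any given tree, which is why a single fitted forest already provides an honest error estimate.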

set.seed(100)
mod_rf <- train(Class ~ . , data = train_data, method = "ranger",
                num.trees = 50,
                importance = 'impurity',
                trControl = trainControl("oob"))

print(mod_rf)

## Random Forest 
## 
## 2746 samples
##   29 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa    
##    2    gini        0.9868900  0.9165982
##    2    extratrees  0.9854334  0.9066255
##   15    gini        0.9857975  0.9108277
##   15    extratrees  0.9868900  0.9175333
##   29    gini        0.9854334  0.9093862
##   29    extratrees  0.9857975  0.9104938
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.
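The three mtry values in the output (2, 15 and 29) are consistent with an evenly spaced grid from 2 to \(p\), rounded down. The arithmetic below is a sketch of that pattern, not a copy of caret's internal code.

```r
p <- 29   # number of predictors, as reported in the output above

# Three evenly spaced values between 2 and p, rounded down to integers
mtry_grid <- floor(seq(2, p, length.out = 3))
mtry_grid
# → 2 15 29
```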

Predictions

Here we compute predictions on the test set and report the test error:

mod_rf %>% 
  predict(test_data) %>%
  confusionMatrix(test_data$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2504   42
##        yes    3  197
##                                          
##                Accuracy : 0.9836         
##                  95% CI : (0.9781, 0.988)
##     No Information Rate : 0.913          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8887         
##                                          
##  Mcnemar's Test P-Value : 1.473e-08      
##                                          
##             Sensitivity : 0.9988         
##             Specificity : 0.8243         
##          Pos Pred Value : 0.9835         
##          Neg Pred Value : 0.9850         
##              Prevalence : 0.9130         
##          Detection Rate : 0.9119         
##    Detection Prevalence : 0.9272         
##       Balanced Accuracy : 0.9115         
##                                          
##        'Positive' Class : no             
##
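Because the 'Positive' class is "no", the reported Sensitivity describes legitimate transactions, and the share of frauds that were caught appears under Specificity. The base R sketch below recomputes the main statistics from the four counts in the matrix above to make this explicit.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(2504, 3, 42, 197), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

accuracy   <- sum(diag(cm)) / sum(cm)
sens_no    <- cm["no", "no"] / sum(cm[, "no"])     # "Sensitivity" when positive = "no"
recall_yes <- cm["yes", "yes"] / sum(cm[, "yes"])  # reported as "Specificity" above
balanced   <- (sens_no + recall_yes) / 2

round(c(accuracy = accuracy, sens_no = sens_no,
        recall_yes = recall_yes, balanced = balanced), 4)
# → accuracy 0.9836, sens_no 0.9988, recall_yes 0.8243, balanced 0.9115
```

To have caret report the statistics from the fraud point of view, `confusionMatrix()` accepts a `positive` argument, e.g. `positive = "yes"`.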

Variable importance

Here is the variable importance for the 20 most important variables:

varImp(mod_rf)

## ranger variable importance
## 
##   only 20 most important variables shown (out of 29)
## 
##     Overall
## V17 100.000
## V12  96.986
## V10  89.289
## V11  65.045
## V4   64.709
## V14  62.480
## V16  52.945
## V7   42.886
## V2   36.009
## V9   33.874
## V18  31.327
## V3   30.557
## V6   25.167
## V27  24.855
## V5   23.095
## V19  17.406
## V1   16.190
## V28  11.731
## V8    7.853
## V20   6.261

To get all of them, extract the full importance vector from the fitted model (we print only the top 10 most important variables here, but you can easily print the whole vector):

var_importance <- mod_rf$finalModel$variable.importance %>% 
  sort(decreasing = TRUE)

var_importance %>% head(10)

##      V17      V12      V10      V11       V4      V14      V16       V7 
## 47.65450 46.28231 42.77736 31.73847 31.58564 30.57044 26.22929 21.64892 
##       V2       V9 
## 18.51773 17.54532
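The raw values here are not simply the `varImp()` scores divided by the maximum: 46.28231 / 47.65450 is about 0.9712, whereas `varImp()` reports 96.986 for V12. This is consistent with a min-max rescaling to the range 0 to 100, which is assumed in the sketch below. The importance vector in the sketch is made up for illustration.

```r
# Made-up raw importances for three hypothetical variables
raw_importance <- c(var_a = 40, var_b = 25, var_c = 10)

# Min-max rescaling: the smallest value maps to 0 and the largest to 100
scaled_importance <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100
scaled_importance
# → var_a 100, var_b 50, var_c 0
```

The ranking of the variables is the same on both scales; only the units differ.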

Here is the built-in plot of variable importance:

varImp(mod_rf) %>%
  plot(top = 10)

And here is how you can make a custom plot with ggplot2:

var_importance <- mod_rf$finalModel$variable.importance %>% 
  sort(decreasing = TRUE) %>% head(10)

data.frame(variable = names(var_importance),
           importance = var_importance) %>%
  ggplot(aes(x = reorder(variable, -importance), y = importance)) +
  geom_col() + xlab("variable") + ylab("importance") +
  theme(axis.text.x = element_text(angle = 45))

Tuning random forest

Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).

set.seed(199)

rfGrid <- expand.grid(mtry = c(5, 10, 15, 20), 
                      min.node.size = c(5, 10, 20, 40),
                      splitrule = "gini")

mod_rf_tune <- train(Class ~ . , data = train_data, method = "ranger",
                num.trees = 100,
                importance = 'impurity',
                tuneGrid = rfGrid,
                trControl = trainControl("oob"))
mod_rf_tune

## Random Forest 
## 
## 2746 samples
##   29 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa    
##    5     5             0.9868900  0.9172239
##    5    10             0.9865259  0.9147647
##    5    20             0.9865259  0.9147647
##    5    40             0.9865259  0.9144432
##   10     5             0.9865259  0.9150839
##   10    10             0.9857975  0.9108277
##   10    20             0.9868900  0.9178403
##   10    40             0.9861617  0.9126252
##   15     5             0.9857975  0.9108277
##   15    10             0.9861617  0.9129518
##   15    20             0.9861617  0.9132759
##   15    40             0.9850692  0.9059037
##   20     5             0.9854334  0.9090501
##   20    10             0.9850692  0.9062547
##   20    20             0.9847050  0.9037888
##   20    40             0.9847050  0.9037888
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule = gini
##  and min.node.size = 5.
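What `train()` does with `trainControl("oob")` can be approximated by hand: fit one forest per row of the grid and compare their out-of-bag errors. The sketch below calls ranger directly on synthetic data (`toy` and its columns are hypothetical) and reads the OOB misclassification rate from the `prediction.error` element of the fitted object. It illustrates the idea and is not caret's actual implementation.

```r
library(ranger)

set.seed(4)
# Synthetic two-class data, for illustration only
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$Class <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# One row per hyperparameter combination
grid <- expand.grid(mtry = c(1, 2, 4), min.node.size = c(1, 10))
grid$oob_error <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- ranger(Class ~ ., data = toy, num.trees = 100,
                mtry = grid$mtry[i],
                min.node.size = grid$min.node.size[i],
                seed = 4)
  grid$oob_error[i] <- fit$prediction.error  # OOB misclassification rate
}

grid                                # OOB error for every combination
grid[which.min(grid$oob_error), ]   # combination with the lowest OOB error
```

The Accuracy column printed by caret corresponds to one minus this error. Differences between neighbouring grid points are often tiny, as in the table above, so the selected combination can change with the seed.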

The tuning process can be plotted to get a better picture of what is going on:

plot(mod_rf_tune)

The optimal values of the hyperparameters are

mod_rf_tune$bestTune

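Here is the new confusion matrix. We might hope for an improvement over the first random forest; in this run, overall accuracy is essentially unchanged (0.9832 versus 0.9836), while the tuned model catches slightly more frauds (200 versus 197 of the 239 in the test set) at the cost of a few more false alarms (7 versus 3).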

mod_rf_tune %>% 
  predict(test_data) %>%
  confusionMatrix(test_data$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2500   39
##        yes    7  200
##                                           
##                Accuracy : 0.9832          
##                  95% CI : (0.9777, 0.9877)
##     No Information Rate : 0.913           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8878          
##                                           
##  Mcnemar's Test P-Value : 4.861e-06       
##                                           
##             Sensitivity : 0.9972          
##             Specificity : 0.8368          
##          Pos Pred Value : 0.9846          
##          Neg Pred Value : 0.9662          
##              Prevalence : 0.9130          
##          Detection Rate : 0.9104          
##    Detection Prevalence : 0.9246          
##       Balanced Accuracy : 0.9170          
##                                           
##        'Positive' Class : no              
##

Question

Retrain the random forest with 500 trees. Also, try more values of mtry and min.node.size. Print the confusion matrix on the test set.

set.seed(100)

rfGrid <- expand.grid(mtry = c(2, 3, 4, 5, 8, 10, 15), 
                      min.node.size = c(2, 3, 5, 8, 13),
                      splitrule = "gini")

mod_rf_final <- train(Class ~ . , data = train_data, method = "ranger",
                num.trees = 500,
                importance = 'impurity',
                tuneGrid = rfGrid,
                trControl = trainControl("oob"))

mod_rf_final %>% 
  predict(test_data) %>%
  confusionMatrix(test_data$Class)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2503   40
##        yes    4  199
##                                           
##                Accuracy : 0.984           
##                  95% CI : (0.9785, 0.9883)
##     No Information Rate : 0.913           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8918          
##                                           
##  Mcnemar's Test P-Value : 1.317e-07       
##                                           
##             Sensitivity : 0.9984          
##             Specificity : 0.8326          
##          Pos Pred Value : 0.9843          
##          Neg Pred Value : 0.9803          
##              Prevalence : 0.9130          
##          Detection Rate : 0.9115          
##    Detection Prevalence : 0.9261          
##       Balanced Accuracy : 0.9155          
##                                           
##        'Positive' Class : no              
##

Class 2, notebook 2 — random forest