The aim of this lab is to learn how to train and tune a random forest in R.
By the end of this lab session, students should be able to:
Undersample highly imbalanced data,
Train a random forest, and
Tune the hyperparameters of a random forest to minimize the out-of-bag error.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type right in this document; others require modifying R code, which you also do here. In either case, you can discuss your work with the lab instructor.
We will again work with a dataset of credit card fraud.
Source: https://www.kaggle.com/mlg-ulb/creditcardfraud
This is a collection of credit card transactions in Europe. The original features have been transformed to ensure anonymity.
library(tidyverse) # data transformation and plotting
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret) # framework for training ML models
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ranger) # for training random forest
C <- read_csv("creditcard.csv") %>%
  select(-Time) %>%                                    # drop the Time column
  mutate(Class = ifelse(Class == 1, "yes", "no")) %>%  # recode 1/0 as yes/no
  mutate(Class = as.factor(Class)) %>%
  arrange(-Amount)
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double()
## )
## i Use `spec()` for the full column specifications.
head(C)
Our dataset is large and highly imbalanced:
C %>%
  group_by(Class) %>%
  summarise(N = n()) %>%
  ungroup() %>%
  mutate(Frequency = N / sum(N))
## `summarise()` ungrouping output (override with `.groups` argument)
We will create a new dataset by keeping, say, 5000 randomly chosen records of the “no” class and all the records of the “yes” class:
set.seed(78)
ind_subset_of_our_data <- c(sample(which(C$Class == "no"), 5000), which(C$Class == "yes"))
new_data <- C %>% slice(ind_subset_of_our_data)
new_data %>%
  group_by(Class) %>%
  summarise(N = n()) %>%
  ungroup() %>%
  mutate(Frequency = N / sum(N))
## `summarise()` ungrouping output (override with `.groups` argument)
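The undersampling step above can be wrapped in a small reusable function. The following is only a sketch on made-up toy data; the helper name `undersample` and its arguments are hypothetical and not part of any package.

```r
# Toy data: 1000 "no" rows and 20 "yes" rows (made-up values, not the lab data)
set.seed(1)
toy <- data.frame(
  x     = rnorm(1020),
  Class = factor(c(rep("no", 1000), rep("yes", 20)))
)

# Hypothetical helper: keep n_major random rows of the majority class
# and every row of the remaining class(es)
undersample <- function(data, target, majority, n_major) {
  idx_major <- which(data[[target]] == majority)
  idx_minor <- which(data[[target]] != majority)
  keep <- c(sample(idx_major, n_major), idx_minor)
  data[keep, , drop = FALSE]
}

toy_small <- undersample(toy, "Class", "no", 100)
table(toy_small$Class)
```

With the lab data, the equivalent call would be `undersample(C, "Class", "no", 5000)`.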
The data are still imbalanced, but much less so, which is acceptable for this lab. Next, we split this new dataset at random into a 50% training set and a 50% test set.
set.seed(128)
ind_train <- sample(1:nrow(new_data), round(nrow(new_data)/2))
train_data <- new_data %>% slice(ind_train)
test_data <- new_data %>% slice(-ind_train)
cat("Dimensions of training data are", dim(train_data), "\n")
## Dimensions of training data are 2746 30
cat("Dimensions of test data are", dim(test_data), "\n")
## Dimensions of test data are 2746 30
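The split above is purely random, so the share of “yes” records can differ slightly between the two halves. As an alternative (not what this lab does), a stratified split can be sketched with `createDataPartition()` from caret, which samples within each class. The toy data below are made up so that the snippet runs on its own.

```r
library(caret)

set.seed(128)
# Made-up data: 1000 "no" rows and 100 "yes" rows
toy <- data.frame(
  x     = rnorm(1100),
  Class = factor(c(rep("no", 1000), rep("yes", 100)))
)

# Sample 50% of the rows within each class, so both halves keep
# approximately the same class proportions
ind_train_strat <- createDataPartition(toy$Class, p = 0.5, list = FALSE)
toy_train <- toy[ind_train_strat, ]
toy_test  <- toy[-ind_train_strat, ]

prop.table(table(toy_train$Class))
prop.table(table(toy_test$Class))
```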
We will use the library ranger because it allows tuning more than just one hyperparameter.
First, we train a random forest model and print the result. Note that by default, train tries three values of mtry (the number of predictors considered at each split), \(2\), approximately \(p/2\), and \(p\); two values of splitrule (this is something that we did not cover in class, and you can either ignore it or look it up); and only one value of min.node.size, which is 1 for classification, as the printed output confirms.
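The default mtry candidates can be checked without training anything. The sketch below assumes that caret builds them with its exported helper `var_seq()`; treat that as an assumption about caret internals rather than a documented guarantee.

```r
library(caret)

p <- 29  # number of predictors in train_data (30 columns minus Class)

# Three candidate values spaced between 2 and p, rounded down to integers
mtry_candidates <- var_seq(p = p, classification = TRUE, len = 3)
mtry_candidates
```

If the assumption holds, the values agree with the mtry column printed by `print(mod_rf)`.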
Here we set the train control to oob, i.e., the out-of-bag error (5-fold cross-validation would take roughly 5 times longer), and num.trees to 50 (the default value of 500 would take roughly 10 times longer).
set.seed(100)
mod_rf <- train(Class ~ . , data = train_data, method = "ranger",
                num.trees = 50,
                importance = 'impurity',
                trControl = trainControl("oob"))
print(mod_rf)
## Random Forest
##
## 2746 samples
## 29 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9868900 0.9165982
## 2 extratrees 0.9854334 0.9066255
## 15 gini 0.9857975 0.9108277
## 15 extratrees 0.9868900 0.9175333
## 29 gini 0.9854334 0.9093862
## 29 extratrees 0.9857975 0.9104938
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
## and min.node.size = 1.
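The Accuracy column above is an out-of-bag estimate: each tree is evaluated only on the training rows that were left out of its bootstrap sample. A minimal sketch of the same quantity obtained from ranger directly is shown below; the built-in iris data stand in for the lab data so that the snippet is self-contained.

```r
library(ranger)

set.seed(100)
# iris is only a stand-in; any data frame with a factor response would do
fit <- ranger(Species ~ ., data = iris, num.trees = 50)

# For classification, prediction.error is the out-of-bag misclassification rate
oob_error    <- fit$prediction.error
oob_accuracy <- 1 - oob_error

# fit$predictions holds the out-of-bag class predictions, so the error rate
# can also be computed by hand and should match oob_error
manual_error <- mean(fit$predictions != iris$Species, na.rm = TRUE)

c(oob_error = oob_error, oob_accuracy = oob_accuracy, manual_error = manual_error)
```

Minimizing the OOB error is the same as maximizing the OOB accuracy that caret reports.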
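Here we construct predictions on the test set and report the test error. Note that caret treats “no” as the positive class, so Specificity is the fraction of frauds (“yes”) that the model detects.
mod_rf %>%
  predict(test_data) %>%
  confusionMatrix(test_data$Class)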
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2504 42
## yes 3 197
##
## Accuracy : 0.9836
## 95% CI : (0.9781, 0.988)
## No Information Rate : 0.913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8887
##
## Mcnemar's Test P-Value : 1.473e-08
##
## Sensitivity : 0.9988
## Specificity : 0.8243
## Pos Pred Value : 0.9835
## Neg Pred Value : 0.9850
## Prevalence : 0.9130
## Detection Rate : 0.9119
## Detection Prevalence : 0.9272
## Balanced Accuracy : 0.9115
##
## 'Positive' Class : no
##
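To see where the reported statistics come from, the main ones can be recomputed by hand from the four counts in the confusion matrix above. This is a plain base R sketch; the object name `cm` is just for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(2504, 3, 42, 197), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference  = c("no", "yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # (2504 + 197) / 2746
sensitivity <- cm["no", "no"]   / sum(cm[, "no"])   # "no" is the positive class
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])  # share of frauds flagged

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The results agree with the Accuracy, Sensitivity and Specificity lines above. To have the statistics reported from the point of view of fraud, `confusionMatrix()` accepts a `positive` argument, e.g. `positive = "yes"`.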
Here is the variable importance for the 20 most important variables:
varImp(mod_rf)
## ranger variable importance
##
## only 20 most important variables shown (out of 29)
##
## Overall
## V17 100.000
## V12 96.986
## V10 89.289
## V11 65.045
## V4 64.709
## V14 62.480
## V16 52.945
## V7 42.886
## V2 36.009
## V9 33.874
## V18 31.327
## V3 30.557
## V6 25.167
## V27 24.855
## V5 23.095
## V19 17.406
## V1 16.190
## V28 11.731
## V8 7.853
## V20 6.261
If we want all of them, here is how to get the full information (we print only the 10 most important variables, but you can easily print the whole vector):
var_importance <- mod_rf$finalModel$variable.importance %>%
  sort(decreasing = TRUE)
var_importance %>% head(10)
## V17 V12 V10 V11 V4 V14 V16 V7
## 47.65450 46.28231 42.77736 31.73847 31.58564 30.57044 26.22929 21.64892
## V2 V9
## 18.51773 17.54532
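The raw values here differ from the `varImp()` output because `varImp()` rescales the importances to a 0 to 100 range by default (`scale = TRUE`). The rescaling is a min-max transformation, sketched below on made-up numbers rather than the lab's values.

```r
# Made-up raw importances for illustration (not the lab's values)
raw_importance <- c(a = 40, b = 25, c = 10, d = 5)

# Min-max rescaling: the largest value maps to 100, the smallest to 0
scaled_importance <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

round(scaled_importance, 3)
```

Calling `varImp(mod_rf, scale = FALSE)` returns the unscaled values instead.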
Here is the built-in plot of variable importance:
varImp(mod_rf) %>%
  plot(top = 10)
And here is how you can make a custom plot with ggplot2:
# Keep the 10 most important variables
var_importance <- mod_rf$finalModel$variable.importance %>%
  sort(decreasing = TRUE) %>% head(10)
data.frame(variable = names(var_importance),
           importance = var_importance) %>%
  ggplot(aes(x = reorder(variable, -importance), y = importance)) +
  geom_col() + xlab("variable") + ylab("importance") +
  theme(axis.text.x = element_text(angle = 45))
Random forest has a number of hyperparameters that can be tuned with OOB error (faster) or with cross-validation (slower).
set.seed(199)
rfGrid <- expand.grid(mtry = c(5, 10, 15, 20),
                      min.node.size = c(5, 10, 20, 40),
                      splitrule = "gini")
mod_rf_tune <- train(Class ~ . , data = train_data, method = "ranger",
                     num.trees = 100,
                     importance = 'impurity',
                     tuneGrid = rfGrid,
                     trControl = trainControl("oob"))
mod_rf_tune
## Random Forest
##
## 2746 samples
## 29 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 5 5 0.9868900 0.9172239
## 5 10 0.9865259 0.9147647
## 5 20 0.9865259 0.9147647
## 5 40 0.9865259 0.9144432
## 10 5 0.9865259 0.9150839
## 10 10 0.9857975 0.9108277
## 10 20 0.9868900 0.9178403
## 10 40 0.9861617 0.9126252
## 15 5 0.9857975 0.9108277
## 15 10 0.9861617 0.9129518
## 15 20 0.9861617 0.9132759
## 15 40 0.9850692 0.9059037
## 20 5 0.9854334 0.9090501
## 20 10 0.9850692 0.9062547
## 20 20 0.9847050 0.9037888
## 20 40 0.9847050 0.9037888
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule = gini
## and min.node.size = 5.
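The 16 rows in the table above correspond to the rows of the tuning grid: `expand.grid()` builds every combination of the supplied values. A quick check, using the illustrative name `grid_demo` so as not to overwrite `rfGrid`:

```r
grid_demo <- expand.grid(mtry = c(5, 10, 15, 20),
                         min.node.size = c(5, 10, 20, 40),
                         splitrule = "gini")

# 4 values of mtry x 4 values of min.node.size x 1 splitrule
n_candidates <- nrow(grid_demo)
n_candidates
head(grid_demo, 3)
```

Each row costs one full forest fit, so the grid size directly drives the tuning time.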
The tuning process can be plotted to get a better picture of what is going on:
plot(mod_rf_tune)
The optimal values of the hyperparameters are:
mod_rf_tune$bestTune
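Since caret reports OOB accuracy rather than OOB error, it may help to make the link explicit: the error is one minus the accuracy, and the best setting is the one with the smallest error. The sketch below uses four rows copied from the tuning table above; the data frame `res` is only for illustration.

```r
# Four rows copied from the tuning table above
res <- data.frame(
  mtry          = c(5, 5, 20, 20),
  min.node.size = c(5, 10, 20, 40),
  Accuracy      = c(0.9868900, 0.9865259, 0.9847050, 0.9847050)
)

# OOB error is the complement of OOB accuracy
res$oob_error <- 1 - res$Accuracy

# Row with the smallest OOB error
best <- res[which.min(res$oob_error), ]
best
```

With the fitted model at hand, the full table is available as `mod_rf_tune$results`.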
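Here is the new confusion matrix. Compared with the first random forest, the overall accuracy is essentially unchanged (0.9832 versus 0.9836), but more frauds are caught: 200 instead of 197, so Specificity (the detection rate for “yes”, since “no” is the positive class) rises from 0.8243 to 0.8368.
mod_rf_tune %>%
  predict(test_data) %>%
  confusionMatrix(test_data$Class)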
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2500 39
## yes 7 200
##
## Accuracy : 0.9832
## 95% CI : (0.9777, 0.9877)
## No Information Rate : 0.913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8878
##
## Mcnemar's Test P-Value : 4.861e-06
##
## Sensitivity : 0.9972
## Specificity : 0.8368
## Pos Pred Value : 0.9846
## Neg Pred Value : 0.9662
## Prevalence : 0.9130
## Detection Rate : 0.9104
## Detection Prevalence : 0.9246
## Balanced Accuracy : 0.9170
##
## 'Positive' Class : no
##
Retrain the random forest with 500 trees. Also, try more values for mtry and min.node.size. Print the confusion matrix on the test set.
set.seed(100)
rfGrid <- expand.grid(mtry = c(2, 3, 4, 5, 8, 10, 15),
                      min.node.size = c(2, 3, 5, 8, 13),
                      splitrule = "gini")
mod_rf_final <- train(Class ~ . , data = train_data, method = "ranger",
                      num.trees = 500,
                      importance = 'impurity',
                      tuneGrid = rfGrid,
                      trControl = trainControl("oob"))
mod_rf_final %>%
  predict(test_data) %>%
  confusionMatrix(test_data$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2503 40
## yes 4 199
##
## Accuracy : 0.984
## 95% CI : (0.9785, 0.9883)
## No Information Rate : 0.913
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8918
##
## Mcnemar's Test P-Value : 1.317e-07
##
## Sensitivity : 0.9984
## Specificity : 0.8326
## Pos Pred Value : 0.9843
## Neg Pred Value : 0.9803
## Prevalence : 0.9130
## Detection Rate : 0.9115
## Detection Prevalence : 0.9261
## Balanced Accuracy : 0.9155
##
## 'Positive' Class : no
##
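All models in this lab were tuned with the OOB error. The cross-validation alternative mentioned earlier only requires a different train control. The sketch below uses a two-class subset of the built-in iris data as a stand-in for the lab data; the grid values are chosen for that toy problem and are not recommendations.

```r
library(caret)
library(ranger)

set.seed(199)
# Two-class toy problem (stand-in for the credit card data)
toy <- droplevels(subset(iris, Species != "setosa"))

# iris has only 4 predictors, so mtry cannot exceed 4
cv_grid <- expand.grid(mtry = c(1, 2, 4),
                       min.node.size = c(1, 5),
                       splitrule = "gini")

# Same call as before, but with 5-fold cross-validation instead of "oob"
mod_cv <- train(Species ~ ., data = toy, method = "ranger",
                num.trees = 50,
                tuneGrid = cv_grid,
                trControl = trainControl(method = "cv", number = 5))
mod_cv$bestTune
```

Each candidate is now fitted once per fold plus a final refit, which is why this route is slower than OOB tuning.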