Random Forest is one of the most popular and most powerful machine learning algorithms. It is an ensemble machine learning method that builds on Bootstrap Aggregation, or bagging.
Random forests are built on the same fundamental principles as decision trees and bagging (check out this tutorial if you need a refresher on these techniques). Bagging trees introduces a random component into the tree-building process that reduces the variance of a single tree's prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have a similar structure to each other (especially at the top of the tree) due to underlying relationships.
However, Random Forests have important parameters that cannot be estimated directly from the data. The process of searching for the parameter values that maximize model performance is called parameter tuning.
There are different approaches to searching for the best parameters. A general approach that can be applied to almost any model is to define a set of candidate values, generate reliable estimates of model performance across the candidate values, and then choose the optimal settings. This process can be summarized as follows:
More specifically, once a candidate set of parameter values has been selected, we must obtain trustworthy estimates of model performance for each candidate. To do so, the training data are resampled and the model is fit and evaluated many times for each tuning parameter value. The performance on the hold-out samples is then aggregated into a performance profile, which is used to determine the final tuning parameters. Finally, we build the final model on all of the training data using the selected tuning parameters.
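To make this loop concrete, below is a minimal sketch (not part of the original post) of the generic tuning process, written with the randomForest package rather than h2o: for each candidate value of mtry, the training data are resampled with k-fold cross-validation, hold-out accuracy is computed, and the results are aggregated into a performance profile. The data frame my_data and the outcome column name target are hypothetical placeholders.
# A minimal illustration of the generic tuning loop, assuming a data frame
# `my_data` with a factor outcome column named by `target` (both hypothetical):
library(randomForest)
tune_mtry <- function(my_data, target, candidates, k = 5) {
  set.seed(29)
  folds <- sample(rep(1:k, length.out = nrow(my_data)))
  profile <- sapply(candidates, function(m) {
    acc <- sapply(1:k, function(i) {
      fit <- randomForest(reformulate(".", target),
                          data = my_data[folds != i, ], mtry = m)
      pred <- predict(fit, my_data[folds == i, ])
      mean(pred == my_data[folds == i, ][[target]])  # hold-out accuracy for fold i
    })
    mean(acc)  # aggregate the k hold-out estimates into one value per candidate
  })
  candidates[which.max(profile)]  # candidate with the best performance profile
}
The h2o workflow used in the rest of this post does the same thing, with cross-validation and grid search handled for us.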
Currently, there are over 20 random forest packages in R. In this post I will present how to train and tune Random Forest models using the h2o package. This package is a powerful and efficient Java-based interface that provides parallel, distributed algorithms. Moreover, h2o allows for different search strategies in a grid search, which lets us tune our models more efficiently.
#=================================
# Data Pre-processing
#=================================
# Load some packages for data manipulation:
library(tidyverse)
library(magrittr)
# Clear workspace:
rm(list = ls())
# Import data:
hmeq <- read.csv("http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv")
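Before imputing, it is worth checking the extent of missingness. This is an optional check, not part of the original workflow:
# Count missing values per column (empty strings in REASON and JOB are not
# counted as NA yet; they are converted to NA in the pre-processing step below):
colSums(is.na(hmeq))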
# Function replaces NA by mean:
replace_by_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}
# A function imputes NA observations for categorical variables:
replace_na_categorical <- function(x) {
  my_df <- x %>%
    table() %>%
    as.data.frame() %>%
    arrange(-Freq)
  n_obs <- sum(my_df$Freq)
  pop <- my_df$. %>% as.character()
  set.seed(29)
  x[is.na(x)] <- sample(pop, sum(is.na(x)), replace = TRUE, prob = my_df$Freq)
  return(x)
}
# Use the two functions:
df <- hmeq %>%
  mutate_if(is.factor, as.character) %>%
  mutate(REASON = case_when(REASON == "" ~ NA_character_, TRUE ~ REASON),
         JOB = case_when(JOB == "" ~ NA_character_, TRUE ~ JOB),
         BAD = case_when(BAD == 1 ~ "BAD", TRUE ~ "GOOD")) %>%
  mutate_if(is_character, as.factor) %>%
  mutate_if(is.numeric, replace_by_mean) %>%
  mutate_if(is.factor, replace_na_categorical)
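As a quick sanity check (optional), confirm that no missing values remain after imputation:
# Should return 0 if imputation handled every NA:
sum(is.na(df))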
# Split our data:
set.seed(1)
df_train <- df %>%
  group_by(BAD) %>%
  sample_frac(0.7) %>%
  ungroup() # Use 70% of the data for training the model.
df_test <- dplyr::setdiff(df, df_train) # Use the remaining 30% for validation.
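Since the split is stratified by BAD, the class shares in the training and test sets should be roughly equal. A quick optional check:
# Class shares in each split (should be similar):
df_train %>% count(BAD) %>% mutate(share = n / sum(n))
df_test %>% count(BAD) %>% mutate(share = n / sum(n))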
# Load the h2o package:
library(h2o)
h2o.init(nthreads = 20, max_mem_size = "16g")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpyvFm1p/h2o_chidung_started_from_r.out
## /tmp/RtmpyvFm1p/h2o_chidung_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 46 milliseconds
## H2O cluster timezone: Asia/Ho_Chi_Minh
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 2 months and 20 days
## H2O cluster name: H2O_started_from_R_chidung_meh345
## H2O cluster total nodes: 1
## H2O cluster total memory: 16.00 GB
## H2O cluster total cores: 32
## H2O cluster allowed cores: 20
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
h2o.no_progress()
# Convert to h2o Frame and identify inputs and output:
test <- as.h2o(df_test)
train <- as.h2o(df_train)
y <- "BAD"
x <- setdiff(names(train), y)
The R code below trains a "default" Random Forest (for further information see: http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/h2o.randomForest.html):
# Train default Random Forest:
default_rf <- h2o.randomForest(x = x, y = y,
                               training_frame = train,
                               stopping_rounds = 5,
                               stopping_tolerance = 0.001,
                               stopping_metric = "AUC",
                               seed = 29,
                               balance_classes = FALSE,
                               nfolds = 10)
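As a quick check, h2o also reports cross-validated metrics directly on the model object; for example, the cross-validated AUC of the default model:
# Cross-validated AUC of the default Random Forest:
h2o.auc(default_rf, xval = TRUE)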
# Function for collecting cross-validation results:
results_cross_validation <- function(h2o_model) {
  h2o_model@model$cross_validation_metrics_summary %>%
    as.data.frame() %>%
    select(-mean, -sd) %>%
    t() %>%
    as.data.frame() %>%
    mutate_all(as.character) %>%
    mutate_all(as.numeric) %>%
    select(Accuracy = accuracy,
           AUC = auc,
           Precision = precision,
           Specificity = specificity,
           Recall = recall,
           Logloss = logloss)
}
# Use function:
results_cross_validation(default_rf) -> ket_qua_default
# Model Performance by Graph:
theme_set(theme_minimal())
plot_results <- function(df_results) {
  df_results %>%
    gather(Metrics, Values) %>%
    ggplot(aes(Metrics, Values, fill = Metrics, color = Metrics)) +
    geom_boxplot(alpha = 0.3, show.legend = FALSE) +
    theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) +
    scale_y_continuous(labels = scales::percent) +
    facet_wrap(~ Metrics, scales = "free") +
    labs(title = "Model Performance by Some Criteria Selected", y = NULL)
}
plot_results(ket_qua_default) +
labs(subtitle = "Model: Random Forest (h2o package)")
Now use the model for prediction:
# Model performance based on test data:
pred_class <- h2o.predict(default_rf, test) %>% as.data.frame() %>% pull(predict)
library(caret)
confusionMatrix(pred_class, df_test$BAD, positive = "BAD")
## Confusion Matrix and Statistics
##
## Reference
## Prediction BAD GOOD
## BAD 287 52
## GOOD 70 1379
##
## Accuracy : 0.9318
## 95% CI : (0.9191, 0.943)
## No Information Rate : 0.8003
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7824
## Mcnemar's Test P-Value : 0.1238
##
## Sensitivity : 0.8039
## Specificity : 0.9637
## Pos Pred Value : 0.8466
## Neg Pred Value : 0.9517
## Prevalence : 0.1997
## Detection Rate : 0.1605
## Detection Prevalence : 0.1896
## Balanced Accuracy : 0.8838
##
## 'Positive' Class : BAD
##
# ROC curve and AUC:
library(pROC)
# Function calculates AUC:
auc_for_test <- function(model_selected) {
  actual <- df_test$BAD
  pred_prob <- h2o.predict(model_selected, test) %>% as.data.frame() %>% pull(BAD)
  return(roc(actual, pred_prob))
}
# Use this function:
my_auc <- auc_for_test(default_rf)
my_auc$auc
## Area under the curve: 0.9682
# Graph ROC and AUC:
sen_spec_df <- tibble(TPR = my_auc$sensitivities, FPR = 1 - my_auc$specificities)
sen_spec_df %>%
  ggplot(aes(x = FPR, ymin = 0, ymax = TPR)) +
  geom_polygon(aes(y = TPR), fill = "red", alpha = 0.3) +
  geom_path(aes(y = TPR), col = "firebrick", size = 1.2) +
  geom_abline(intercept = 0, slope = 1, color = "gray37", size = 1, linetype = "dashed") +
  theme_bw() +
  coord_equal() +
  labs(x = "FPR (1 - Specificity)",
       y = "TPR (Sensitivity)",
       title = "Model Performance for RF Classifier based on Test Data",
       subtitle = paste0("AUC Value: ", my_auc$auc %>% round(2)))
# Function for calculating CM:
my_cm_com_rf <- function(thre) {
  du_bao_prob <- h2o.predict(default_rf, test) %>% as.data.frame() %>% pull(BAD)
  du_bao <- case_when(du_bao_prob >= thre ~ "BAD",
                      du_bao_prob < thre ~ "GOOD") %>% as.factor()
  cm <- confusionMatrix(du_bao, df_test$BAD, positive = "BAD")
  return(cm)
}
Accuracy when predicting bad applications at the selected classification thresholds:
# Set a range of threshold for classification:
my_threshold <- c(0.10, 0.15, 0.35, 0.5)
results_list_rf <- lapply(my_threshold, my_cm_com_rf)
# Function for presenting prediction power by class:
vis_detection_rate_rf <- function(x) {
  m <- results_list_rf[[x]]$table %>% as.data.frame()
  rate <- round(100*m$Freq[1] / sum(m$Freq[c(1, 2)]), 2)
  acc <- round(100*sum(m$Freq[c(1, 4)]) / sum(m$Freq), 2)
  acc <- paste0(acc, "%")
  m %>%
    ggplot(aes(Reference, Freq, fill = Prediction)) +
    geom_col(position = "fill") +
    scale_fill_manual(values = c("#e41a1c", "#377eb8"), name = "") +
    theme(panel.grid.minor.y = element_blank()) +
    theme(panel.grid.minor.x = element_blank()) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = NULL, y = NULL,
         title = paste0("Detecting Default Cases when Threshold = ", my_threshold[x]),
         subtitle = paste0("Detecting Rate for Default Cases: ", rate, "%", ", ", "Accuracy: ", acc))
}
# Use this function:
gridExtra::grid.arrange(vis_detection_rate_rf(1),
vis_detection_rate_rf(2),
vis_detection_rate_rf(3),
vis_detection_rate_rf(4))
When using a full Cartesian grid search, we examine every combination of the hyperparameter settings specified in the hyper_grid.h2o object:
#=================================
# Full Cartesian Grid Search
#=================================
# Set hyperparameter grid:
hyper_grid.h2o <- list(ntrees = seq(50, 500, by = 50),
                       mtries = seq(3, 5, by = 1),
                       # max_depth = seq(10, 30, by = 10),
                       # min_rows = seq(1, 3, by = 1),
                       # nbins = seq(20, 30, by = 10),
                       sample_rate = c(0.55, 0.632, 0.75))
# The number of models is 90:
sapply(hyper_grid.h2o, length) %>% prod()
## [1] 90
# Train a Random Forest for each of the 90 combinations (with 10-fold cross-validation):
system.time(grid_cartesian <- h2o.grid(algorithm = "randomForest",
                                       grid_id = "rf_grid1",
                                       x = x,
                                       y = y,
                                       seed = 29,
                                       nfolds = 10,
                                       training_frame = train,
                                       stopping_metric = "AUC",
                                       hyper_params = hyper_grid.h2o,
                                       search_criteria = list(strategy = "Cartesian")))
## user system elapsed
## 987.759 60.825 15471.287
# Collect the results and sort by our model performance metric of choice:
grid_perf <- h2o.getGrid(grid_id = "rf_grid1",
                         sort_by = "auc",
                         decreasing = TRUE)
# Best model (highest cross-validated AUC):
best_model <- h2o.getModel(grid_perf@model_ids[[1]])
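Before making predictions, it can be useful to inspect the grid results and the settings of the selected model. This is an optional check:
# Grid results sorted by cross-validated AUC:
grid_perf
# Hyperparameters of the selected model (these should have been set by the grid):
best_model@parameters[c("ntrees", "mtries", "sample_rate")]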
# Use best model for making predictions:
confusionMatrix(h2o.predict(best_model, test) %>% as.data.frame() %>% pull(predict),
df_test$BAD, positive = "BAD")
## Confusion Matrix and Statistics
##
## Reference
## Prediction BAD GOOD
## BAD 270 66
## GOOD 87 1365
##
## Accuracy : 0.9144
## 95% CI : (0.9005, 0.927)
## No Information Rate : 0.8003
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7262
## Mcnemar's Test P-Value : 0.1059
##
## Sensitivity : 0.7563
## Specificity : 0.9539
## Pos Pred Value : 0.8036
## Neg Pred Value : 0.9401
## Prevalence : 0.1997
## Detection Rate : 0.1510
## Detection Prevalence : 0.1879
## Balanced Accuracy : 0.8551
##
## 'Positive' Class : BAD
##
# Compare:
confusionMatrix(h2o.predict(default_rf, test) %>% as.data.frame() %>% pull(predict),
df_test$BAD, positive = "BAD")
## Confusion Matrix and Statistics
##
## Reference
## Prediction BAD GOOD
## BAD 287 52
## GOOD 70 1379
##
## Accuracy : 0.9318
## 95% CI : (0.9191, 0.943)
## No Information Rate : 0.8003
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7824
## Mcnemar's Test P-Value : 0.1238
##
## Sensitivity : 0.8039
## Specificity : 0.9637
## Pos Pred Value : 0.8466
## Neg Pred Value : 0.9517
## Prevalence : 0.1997
## Detection Rate : 0.1605
## Detection Prevalence : 0.1896
## Balanced Accuracy : 0.8838
##
## 'Positive' Class : BAD
##
# By Graph:
my_cm_com_rf_best <- function(thre) {
  du_bao_prob <- h2o.predict(best_model, test) %>% as.data.frame() %>% pull(BAD)
  du_bao <- case_when(du_bao_prob >= thre ~ "BAD",
                      du_bao_prob < thre ~ "GOOD") %>% as.factor()
  cm <- confusionMatrix(du_bao, df_test$BAD, positive = "BAD")
  return(cm)
}
results_list_rf_best <- lapply(my_threshold, my_cm_com_rf_best)
# Function for presenting prediction power by class:
vis_detection_rate_rf_best <- function(x) {
  m <- results_list_rf_best[[x]]$table %>% as.data.frame()
  rate <- round(100*m$Freq[1] / sum(m$Freq[c(1, 2)]), 2)
  acc <- round(100*sum(m$Freq[c(1, 4)]) / sum(m$Freq), 2)
  acc <- paste0(acc, "%")
  m %>%
    ggplot(aes(Reference, Freq, fill = Prediction)) +
    geom_col(position = "fill") +
    scale_fill_manual(values = c("#e41a1c", "#377eb8"), name = "") +
    theme(panel.grid.minor.y = element_blank()) +
    theme(panel.grid.minor.x = element_blank()) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = NULL, y = NULL,
         title = paste0("Detecting Default Cases when Threshold = ", my_threshold[x]),
         subtitle = paste0("Detecting Rate for Default Cases: ", rate, "%", ", ", "Accuracy: ", acc))
}
# Use this function:
gridExtra::grid.arrange(vis_detection_rate_rf_best(1),
vis_detection_rate_rf_best(2),
vis_detection_rate_rf_best(3),
vis_detection_rate_rf_best(4))
Because of the combinatorial explosion, each additional hyperparameter added to our grid search has a huge effect on the time to complete. Consequently, h2o provides an additional grid search strategy called RandomDiscrete, which jumps from one random combination to another and stops once a certain level of improvement has been made, a certain amount of time has been exceeded, or a certain number of models have been run (or a combination of these criteria has been met). Although a random discrete search will likely not find the optimal model, it typically does a good job of finding a very good model.
For example, the following code searches across the 90 hyperparameter combinations defined above. We create a random grid search that will stop if either of the following happens:
None of the last 10 models has improved AUC by at least 0.5% relative to the best model before it.
We are still finding improvements, but the total runtime exceeds 1,800 seconds (30 minutes).
#=================================
# Random Discrete Grid Search
#=================================
# Set random grid search criteria:
search_criteria_2 <- list(strategy = "RandomDiscrete",
                          stopping_metric = "AUC",
                          stopping_tolerance = 0.005,
                          stopping_rounds = 10,
                          max_runtime_secs = 30*60)
# Tune parameters for RF:
system.time(random_grid <- h2o.grid(algorithm = "randomForest",
                                    grid_id = "rf_grid2",
                                    x = x,
                                    y = y,
                                    seed = 29,
                                    nfolds = 10,
                                    training_frame = train,
                                    hyper_params = hyper_grid.h2o,
                                    search_criteria = search_criteria_2))
## user system elapsed
## 121.971 7.949 1818.201
# Collect the results and sort by AUC:
grid_perf2 <- h2o.getGrid(grid_id = "rf_grid2",
                          sort_by = "AUC",
                          decreasing = TRUE)
# Best RF:
best_model2 <- h2o.getModel(grid_perf2@model_ids[[1]])
# Use this best model for prediction with some thresholds selected:
my_cm_com_rf_best2 <- function(thre) {
  du_bao_prob <- h2o.predict(best_model2, test) %>% as.data.frame() %>% pull(BAD)
  du_bao <- case_when(du_bao_prob >= thre ~ "BAD",
                      du_bao_prob < thre ~ "GOOD") %>% as.factor()
  cm <- confusionMatrix(du_bao, df_test$BAD, positive = "BAD")
  return(cm)
}
results_list_rf_best2 <- lapply(my_threshold, my_cm_com_rf_best2)
# Function for presenting prediction power by class:
vis_detection_rate_rf_best2 <- function(x) {
  m <- results_list_rf_best2[[x]]$table %>% as.data.frame()
  rate <- round(100*m$Freq[1] / sum(m$Freq[c(1, 2)]), 2)
  acc <- round(100*sum(m$Freq[c(1, 4)]) / sum(m$Freq), 2)
  acc <- paste0(acc, "%")
  m %>%
    ggplot(aes(Reference, Freq, fill = Prediction)) +
    geom_col(position = "fill") +
    scale_fill_manual(values = c("#e41a1c", "#377eb8"), name = "") +
    theme(panel.grid.minor.y = element_blank()) +
    theme(panel.grid.minor.x = element_blank()) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = NULL, y = NULL,
         title = paste0("Detecting Default Cases when Threshold = ", my_threshold[x]),
         subtitle = paste0("Detecting Rate for Default Cases: ", rate, "%", ", ", "Accuracy: ", acc))
}
gridExtra::grid.arrange(vis_detection_rate_rf_best2(1),
vis_detection_rate_rf_best2(2),
vis_detection_rate_rf_best2(3),
vis_detection_rate_rf_best2(4))
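To wrap up, here is a compact optional comparison of the three models on the test set by AUC:
# Test-set AUC for the default model and the two grid-search winners:
sapply(list(default = default_rf, cartesian = best_model, random = best_model2),
       function(m) h2o.auc(h2o.performance(m, newdata = test)))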
Random forests provide a very powerful out-of-the-box algorithm that often achieves great predictive accuracy. Because they are relatively simple to tune and require little, if any, feature pre-processing, they are often one of the first go-to algorithms when tackling a predictive modeling problem.
In this post you discovered the importance of tuning a Random Forest model in order to get the best performance from it. You worked through an example of tuning the Random Forest algorithm in R and saw two ways of tuning parameters with the h2o package.