setwd("D:/Class Materials & Work/Summer 2020 practice/5_Tidymodel_Case study")
getwd()
## [1] "D:/Class Materials & Work/Summer 2020 practice/5_Tidymodel_Case study"
In this final case study, we will use all of the previous practices (tidymodels 1-4) as a foundation to build a predictive model from beginning to end with data on hotel stays.
We have learned to build a model (1), preprocess the data set on the fly with a recipe (2), realistically evaluate the model with resampled data (3), and tune the model's hyperparameters for optimal performance (4).
Loading required packages:
library(tidymodels) # for the tune package, along with the rest of tidymodels
#helper packages
library(readr) #for importing the data
library(vip) #for variable importance plots
#The hotel booking data----
We will use the hotel bookings data to predict which hotel stays included children and/or babies, based on variables such as the hotel, the cost of the stay, and the arrival date.
First, let’s read the data set.
hotels <-
read_csv('hotels.csv', col_names = T) %>%
mutate_if(is.character, as.factor) #convert character variables into factors
We are only analyzing actual hotel stays (rather than all bookings) because guests who cancel their bookings tend to have a lot of missing data.
glimpse(hotels)
## Rows: 50,000
## Columns: 23
## $ hotel <fct> City_Hotel, City_Hotel, Resort_Hotel...
## $ lead_time <dbl> 217, 2, 95, 143, 136, 67, 47, 56, 80...
## $ stays_in_weekend_nights <dbl> 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, ...
## $ stays_in_week_nights <dbl> 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, ...
## $ adults <dbl> 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, ...
## $ children <fct> none, none, none, none, none, none, ...
## $ meal <fct> BB, BB, BB, HB, HB, SC, BB, BB, BB, ...
## $ country <fct> DEU, PRT, GBR, ROU, PRT, GBR, ESP, E...
## $ market_segment <fct> Offline_TA/TO, Direct, Online_TA, On...
## $ distribution_channel <fct> TA/TO, Direct, TA/TO, TA/TO, Direct,...
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <fct> A, D, A, A, F, A, C, B, D, A, A, D, ...
## $ assigned_room_type <fct> A, K, A, A, F, A, C, A, D, A, D, D, ...
## $ booking_changes <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <fct> No_Deposit, No_Deposit, No_Deposit, ...
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <fct> Transient-Party, Transient, Transien...
## $ average_daily_rate <dbl> 80.75, 170.00, 8.00, 81.00, 157.60, ...
## $ required_car_parking_spaces <fct> none, none, none, none, none, none, ...
## $ total_of_special_requests <dbl> 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, ...
## $ arrival_date <date> 2016-09-01, 2017-08-25, 2016-11-19,...
We will build a model to predict which actual hotel stays included children and/or babies, and which did not. Our outcome variable children is a factor variable with two levels:
hotels %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 4038 0.0808
## 2 none 45962 0.919
We can see that children were present in only 8.1% of the reservations. This type of class imbalance can often wreak havoc on an analysis. We could address the imbalance by upsampling or downsampling, for example with the themis package, but this practice will analyze the data as-is and rely on stratified sampling instead.
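For reference, here is a minimal sketch of how the imbalance could be handled instead, assuming the themis package is installed; we do not run this step in the analysis below.
library(themis) #provides sampling steps such as step_downsample()
balance_recipe <-
recipe(children ~ ., data = hotels) %>%
step_downsample(children) #randomly drop rows of the majority class ("none")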
#Data splitting and resampling----
We will reserve 25% of the data for the test set, while the remaining 75% will serve as the non-test set for model training and validation.
set.seed(123)
splits <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test <- testing(splits)
Training set proportions by children:
hotel_other %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 3048 0.0813
## 2 none 34452 0.919
Test set proportions by children:
hotel_test %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 990 0.0792
## 2 none 11510 0.921
Until now, we have relied on rsample::vfold_cv() to create 10 different resampled data sets that would produce 10 performance metrics for us to average.
However, here we will split the non-test set just once: a single 20% validation set to measure the model's performance and an 80% training set to train the model (see the diagram below).
Diagram 1. Data set validation split
We will use validation_split() to allocate 20% (N = 7,500) of the non-test hotel_other data to the validation set and 80% (N = 30,000) to the training set.
This means that our model performance metrics will be computed on a single set of 7,500 hotel stays.
The sample size is fairly large, so the metrics should provide a reliable index of how well the model predicts our outcome variable in a single resampling iteration.
set.seed(234)
val_set <- validation_split(hotel_other,
strata = children, #stratified sampling
prop = 0.80) #indicating the proportion
val_set
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 2
## splits id
## <named list> <chr>
## 1 <split [30K/7.5K]> validation
validation_split() works similarly to initial_split() in splitting the data set; both have the same strata argument, which sets a reference variable for keeping proportions comparable between the two resulting data sets.
This means we'll have roughly the same proportions of hotel stays with and without children in our new validation and training sets as in the original hotel_other data.
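If we want to double-check the split, we can peek at the two pieces of this single resample (a quick sanity check, not required for the modeling below):
val_split <- val_set$splits[[1]] #the one and only split object
nrow(analysis(val_split)) #rows used for training (about 30,000)
nrow(assessment(val_split)) #rows used for validation (about 7,500)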
#The first model: Penalized Logistic Regression----
Since our outcome variable, children, is categorical, we will use logistic regression as the first predictive model. The generalized linear model fit via penalized maximum likelihood estimates the regression slope parameters with a penalty, so that less relevant predictors are driven towards zero, or exactly to zero if a large enough penalty is used (the lasso method).
First, let’s build a model with the glmnet engine.
lr_mod <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
We set the penalty to tune() for now, as a hyperparameter to be optimized later. Setting mixture to a value of one means that the glmnet model will potentially remove irrelevant predictors entirely and choose a simpler model.
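For reference, mixture controls the type of penalty: a value of one gives the lasso, zero gives ridge regression, and values in between give an elastic net. A hedged sketch of a specification that tunes both hyperparameters, which we do not use here:
lr_mod_elastic <-
logistic_reg(penalty = tune(), mixture = tune()) %>% #tune both the amount and the type of penalty
set_engine("glmnet")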
Second, we create a recipe to pre-process the hotel data set, building date-based predictors from the important components of arrival_date.
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date, holidays = holidays) %>%
step_rm(arrival_date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
step_date() creates predictors for the year, month, and day of the week.
step_holiday() generates a set of indicator variables for specific holidays. Although we don't know where these two hotels are located, we do know that the countries of origin for most stays are in Europe.
step_rm() removes the original date variable since we no longer want it in the model.
All categorical predictors should be converted to dummy variables, and all numeric predictors need to be centered and scaled, so step_dummy() converts characters or factors into one or more numeric binary model terms for the levels of the original data.
step_zv() removes indicator variables that contain only a single unique value. This is important because the predictors should be centered and scaled in a penalized model. Finally, step_normalize() centers and scales the numeric variables.
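If we want to see exactly what these steps produce, a quick way (for illustration only; the workflow below handles this for us) is to prep the recipe on the training data and extract the processed result:
lr_recipe %>%
prep(training = hotel_other) %>% #estimate the recipe steps from the training data
juice() %>% #return the processed training set
glimpse()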
Third, we bundle both the model and the recipe into a single workflow() object for ease of management.
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe)
Fourth, we create a grid of penalty values to tune. Previously, we used grid_regular() to create an expanded grid from combinations of two hyperparameters. Since we only have one hyperparameter in this model, we can set the grid up manually as a one-column tibble with 30 candidate values:
lr_reg_grid <- tibble(penalty = 10^seq(from = -4, to = -1, length.out = 30))
lr_reg_grid %>% top_n(-5) # lowest penalty values. Negative values select bottom from group.
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0001
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
lr_reg_grid %>% top_n(5) # highest penalty values
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0386
## 2 0.0489
## 3 0.0621
## 4 0.0788
## 5 0.1
Fifth, we can train and tune the model. Let's use tune::tune_grid() to train these 30 penalized logistic regression models. We'll also save the validation set predictions (via the call to control_grid()) so that diagnostic information is available after the model fit.
The area under the ROC curve (AUC) will be used to quantify how well the model performs across a continuum of event thresholds (recall that the event rate, the proportion of stays including children, is very low for these data).
lr_res <- #res means resample
lr_workflow %>% #pipe the workflow, which consists of the model and the recipe
tune_grid(val_set, #val_set contains both training and validation data sets for the model.
grid = lr_reg_grid, #specify the grid for penalty value
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
Comparing val_set and lr_res, we can see that lr_res has three additional columns: .metrics, .notes, and .predictions.
val_set
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 2
## splits id
## <named list> <chr>
## 1 <split [30K/7.5K]> validation
lr_res
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [30K/7.5~ validati~ <tibble [30 x ~ <tibble [0 x ~ <tibble [224,970 x ~
It might be easier to visualize the validation set metrics by plotting the area under the ROC curve against the range of penalty values:
lr_plot <-
lr_res %>%
collect_metrics() %>% #collect the tuning results (ROC AUC) for plotting.
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
ylab("Area under the ROC Curve") +
scale_x_log10(labels = scales::label_number())
lr_plot
The plot shows us that model performance is generally better at the smaller penalty values. This suggests that the majority of the predictors are important to the model.
The steep drop in the AUC toward the highest penalty values indicates that a large enough penalty will remove all predictors from the model, thus reducing its predictive accuracy. An AUC of 0.50 means the model is no better than chance.
Model performance peaks at the smaller penalty values, so we can use the roc_auc metric to identify candidate values for the best hyperparameter.
top_models <-
lr_res %>% #the tuning results
show_best("roc_auc", n = 15) %>%
arrange(penalty) #arrange rows by penalty, from smallest to largest
top_models
## # A tibble: 15 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0001 roc_auc binary 0.880 1 NA
## 2 0.000127 roc_auc binary 0.881 1 NA
## 3 0.000161 roc_auc binary 0.881 1 NA
## 4 0.000204 roc_auc binary 0.881 1 NA
## 5 0.000259 roc_auc binary 0.881 1 NA
## 6 0.000329 roc_auc binary 0.881 1 NA
## 7 0.000418 roc_auc binary 0.881 1 NA
## 8 0.000530 roc_auc binary 0.881 1 NA
## 9 0.000672 roc_auc binary 0.881 1 NA
## 10 0.000853 roc_auc binary 0.881 1 NA
## 11 0.00108 roc_auc binary 0.881 1 NA
## 12 0.00137 roc_auc binary 0.881 1 NA
## 13 0.00174 roc_auc binary 0.881 1 NA
## 14 0.00221 roc_auc binary 0.880 1 NA
## 15 0.00281 roc_auc binary 0.879 1 NA
Every candidate model in this tibble likely includes more predictor variables than the model in the row below it. If we used select_best(), it would return the 8th candidate model, with a penalty value of 0.00053, shown with the dotted line below. (Try replacing show_best() with select_best() and removing the n argument in the chunk above.)
Plot: area under the ROC curve across penalty values, with the best model marked
However, we may want to choose a penalty value further along the x-axis, closer to where we start to see the decline in model performance. For example, candidate model 12, with a penalty value of 0.00137, has effectively the same performance as the numerically best model (see the dashed line) but with fewer predictors.
Essentially, we want to keep as few irrelevant predictors in the model as possible; if performance is about the same, we'd prefer the higher penalty value.
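One way to make this choice programmatically, sketched here under the assumption that a loss of up to 0.005 AUC is acceptable (the tune package also offers select_by_pct_loss() for this kind of trade-off), is to keep only the candidates close to the best result and pick the largest penalty among them:
lr_simpler <-
lr_res %>%
collect_metrics() %>%
filter(mean >= max(mean) - 0.005) %>% #keep candidates within 0.005 AUC of the best
arrange(desc(penalty)) %>% #prefer the largest penalty, i.e., the fewest predictors
slice(1)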
Let’s select this value and visualize the validation set ROC curve:
lr_best <-
lr_res %>%
collect_metrics() %>%
arrange(penalty) %>%
slice(12) #model 12
lr_best
## # A tibble: 1 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.00137 roc_auc binary 0.881 1 NA
Now we can plot the ROC curve based on our chosen model:
lr_auc <-
lr_res %>% #the tuning results (they contain the saved predictions)
collect_predictions(parameters = lr_best) %>% #use the parameters from the chosen model
roc_curve(children, .pred_children) %>% #compute the ROC curve from the truth and the class probability for children
mutate(model = "Logistic Regression") #label the model for the comparison plot later
autoplot(lr_auc)
The level of performance generated by this logistic regression model is good, but not groundbreaking. Perhaps the linear nature of the prediction equation is too limiting for this data set. As a next step, we might consider a highly non-linear model generated using a tree-based ensemble method.
#The second model: Tree-based ensemble----
Aside from the generalized linear model (GLM), the random forest is another effective and low-maintenance model we can use.
A random forest (RF) is an ensemble of many non-linear decision trees, which makes it more flexible than the GLM. Aggregating results across trees keeps the final predictions stable. The model itself requires little data preprocessing and can handle many types of predictors (sparse, skewed, continuous, categorical, etc.).
Although it performs decently with default settings, we will tune two hyperparameters for optimization: mtry (the number of predictors sampled at each split in a tree) and min_n (the minimum number of data points required to split a node). The RF is computationally expensive to train and tune, but this strain can be lessened with parallel processing.
While the tune package can run resamples in parallel, we only have a single validation set here, so we turn to an alternative: the ranger engine can parallelize the computations itself, but we need to know how many cores we have to work with.
cores <- parallel::detectCores()
cores
## [1] 8
Now, we can pass on the core number to the ranger engine when we set up our parsnip rand_forest() model.
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")
However, in other contexts, it is recommended that you use the tune package for parallel processing.
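For example, with a multi-fold resampling scheme, one could register a parallel backend that tune picks up automatically; a minimal sketch, assuming the doParallel package is installed:
library(doParallel)
registerDoParallel(cores = cores) #register a backend with the detected number of cores
# ... run tune_grid() as usual; resamples would then be processed in parallel ...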
We do not need to create dummy variables or normalize the predictors as we did for the generalized linear model, but we still want to engineer features from our arrival_date variable to make date-related patterns easier to identify.
rf_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date) %>%
step_rm(arrival_date)
Next, we combine both the model and the recipe into a workflow, so that we can pre-process the data set efficiently.
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
We have two hyperparameters to be tuned:
rf_mod
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
# show what will be tuned
rf_mod %>%
parameters()
## Collection of 2 parameters for tuning
##
## id parameter type object class
## mtry mtry nparam[?]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
We will use a space-filling design to tune the hyperparameter, with 25 candidate models:
set.seed(345) #For replicability
rf_res <- #res = resampled
rf_workflow %>%
tune_grid(val_set,#val_set contains both training and validation data sets for the model.
grid = 25, #intended number of candidate models
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
The message printed above “Creating pre-processing data to finalize unknown parameter: mtry” is related to the size of the data set. Since mtry depends on the number of predictors in the data set, tune_grid() determines the upper bound for mtry once it receives the data.
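For illustration, this finalization could also be done by hand with dials::finalize(); a rough sketch (the upper bound is approximate here, since the recipe adds extra date-based columns):
rf_mod %>%
parameters() %>%
finalize(hotel_other %>% select(-children)) #set the upper bound of mtry from the predictor columns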
Here are our top 5 random forest models, out of the 25 candidates in terms of AUC:
rf_res %>%
show_best(metric = "roc_auc")
## # A tibble: 5 x 7
## mtry min_n .metric .estimator mean n std_err
## <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 3 3 roc_auc binary 0.933 1 NA
## 2 8 7 roc_auc binary 0.933 1 NA
## 3 6 18 roc_auc binary 0.933 1 NA
## 4 7 25 roc_auc binary 0.932 1 NA
## 5 9 12 roc_auc binary 0.932 1 NA
We can see that all of our top 5 random forest models perform better than the selected penalized logistic regression model (best AUC = 0.881).
Now we plot the results of the tuning process, which highlight both mtry (the number of predictors sampled at each split) and min_n (the minimum number of data points required to keep splitting). For these data, smaller values of both hyperparameters give the best performance.
The good news is that the range of the y-axis indicates the model is very robust to the choice of these parameter values: all but one of the ROC AUC values are greater than 0.90.
autoplot(rf_res)
Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:
rf_best <-
rf_res %>% #the tuning results
select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 2
## mtry min_n
## <int> <int>
## 1 3 3
To calculate the data needed to plot the ROC curve, we use collect_predictions(). This is only possible after tuning with control_grid(save_pred = TRUE).
rf_res %>%
collect_predictions()
## # A tibble: 187,475 x 7
## id .pred_children .pred_none .row mtry min_n children
## <chr> <dbl> <dbl> <int> <int> <int> <fct>
## 1 validation 0.00207 0.998 11 12 7 none
## 2 validation 0.00025 1.00 13 12 7 none
## 3 validation 0.000333 1.00 31 12 7 none
## 4 validation 0.000143 1.00 32 12 7 none
## 5 validation 0 1 36 12 7 none
## 6 validation 0.0888 0.911 43 12 7 none
## 7 validation 0.123 0.877 45 12 7 none
## 8 validation 0.0673 0.933 47 12 7 none
## 9 validation 0.167 0.833 48 12 7 none
## 10 validation 0.00424 0.996 53 12 7 none
## # ... with 187,465 more rows
In the output, you can see the two columns that hold our class probabilities for predicting hotel stays with and without children: .pred_children and .pred_none.
To filter the predictions for only our best random forest model, we can use the parameters argument and pass it our tibble with the best hyperparameter values from tuning, which we called rf_best:
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>% #to collect only the best model.
roc_curve(children, .pred_children) %>% #compute the ROC curve from these two columns.
mutate(model = "Random Forest") #specify the model.
Now, we can compare the validation set ROC curves for our top penalized logistic regression model and top random forest model:
rf_auc
## # A tibble: 7,135 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Random Forest
## 2 0.0000769 0 1 Random Forest
## 3 0.0000830 0.000145 1 Random Forest
## 4 0.0000859 0.000290 1 Random Forest
## 5 0.0000928 0.000435 1 Random Forest
## 6 0.000100 0.000581 1 Random Forest
## 7 0.000112 0.000726 1 Random Forest
## 8 0.000115 0.000871 1 Random Forest
## 9 0.000116 0.00102 1 Random Forest
## 10 0.000122 0.00116 1 Random Forest
## # ... with 7,125 more rows
lr_auc
## # A tibble: 7,105 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Logistic Regression
## 2 0.000536 0 1 Logistic Regression
## 3 0.000802 0.000145 1 Logistic Regression
## 4 0.000909 0.000290 1 Logistic Regression
## 5 0.00110 0.000435 1 Logistic Regression
## 6 0.00129 0.000581 1 Logistic Regression
## 7 0.00133 0.000726 1 Logistic Regression
## 8 0.00135 0.000871 1 Logistic Regression
## 9 0.00141 0.00116 1 Logistic Regression
## 10 0.00143 0.00131 1 Logistic Regression
## # ... with 7,095 more rows
bind_rows(rf_auc, lr_auc) %>% #to draw the two AUC together
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + #map the roc_curve columns to the axes and color by model.
geom_path(lwd = 1.5 , alpha = 0.8) + #to connect the two AUC. lwd = linewidth. Alpha = color transparency value.
geom_abline(lty = 3) + # abline to annotate the plot. lty=line type.
coord_equal() + #ensures that the ranges of axes are equal.
scale_color_viridis_d(option = "plasma", end = .6)
The random forest is uniformly better across event probability thresholds.
#The Last Fit----
Our goal was to predict which hotel stays included children and/or babies, and based on the validation set results, the random forest is our best bet for prediction.
After selecting our best model and hyperparameter values, our last task is to fit the final model on all the rows of the non-test data (hotel_other), i.e., the training plus validation sets, to use as many samples as possible. Then we evaluate the model's performance one last time with the held-out test set (hotel_test).
We’ll start by building our parsnip model object again from scratch. We take our best hyperparameter values from our random forest model. When we set the engine, we will add a new argument: importance = "impurity" to provide variable importance scores for this last model, which gives some insight into which predictors drive model performance.
# the last model
last_rf_mod <-
rand_forest(mtry = 8, min_n = 7, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>% #random forest workflow
update_model(last_rf_mod) #update the workflow with the above model.
# the last fit
set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits) #Fit the final best model to the training set and evaluate the test set. "splits" contains both the training and the testing set.
last_rf_fit
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [37.5K~ train/test~ <tibble [2 ~ <tibble [0~ <tibble [12,500~ <workflo~
This fitted workflow contains everything, including our final metrics based on the test set. So, how did this model do on the test set? Was the validation set a good estimate of future performance?
last_rf_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.948
## 2 roc_auc binary 0.922
The ROC AUC value is pretty close to what we saw when we tuned the random forest model with the validation set, which is good news. This means that our estimate of how well our model would perform with new data was not too far off from how well our model actually performed with the unseen test data.
We can access those variable importance scores via the .workflow column. We first need to pluck out the first element in the workflow column, then pull workflow fit from the workflow object. Finally, the vip package helps us visualize the variable importance scores for the top 20 features:
last_rf_fit %>%
pluck(".workflow", 1) %>% #take out the first element in the ".workflow" column.
pull_workflow_fit() %>% #extract the fitted model object from the workflow.
vip(num_features = 20) #display the top 20 variables.
The most important predictors in whether a hotel stay had children or not were the daily cost for the room, the type of room reserved, the type of room that was ultimately assigned, and the time between the creation of the reservation and the arrival date.
Let's generate our last ROC curve to visualize. Since the event we are predicting is the first level of the children factor ("children"), we provide roc_curve() with the relevant class probability, .pred_children:
last_rf_fit %>%
collect_predictions() %>% #collect the test set predictions.
roc_curve(children, .pred_children) %>% #construct the full ROC curve from the truth and the class probability.
autoplot()
Based on these results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our random forest model with the selected hyperparameters would perform well when predicting new data.
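As a possible next step beyond this case study, if we wanted to use the model on genuinely new bookings, we could refit the finalized workflow on all available data; a hedged sketch (new_bookings is a hypothetical data frame with the same columns as hotels):
final_hotel_model <- fit(last_rf_workflow, data = hotels) #train on every row we have
# predict(final_hotel_model, new_data = new_bookings, type = "prob") #score hypothetical new data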