setwd("D:/Class Materials & Work/Summer 2020 practice/5_Tidymodel_Case study")
getwd()
## [1] "D:/Class Materials & Work/Summer 2020 practice/5_Tidymodel_Case study"
In this final case study, we will use all of the previous practices (tidymodels 1-4) as a foundation to build a predictive model from beginning to end with data on hotel stays.
We have learned to build a model (1), preprocess the data set on the fly with a recipe (2), realistically evaluate the model with resampled data (3), and tune the model's hyperparameters for optimal performance (4).
Loading required packages:
library(tidymodels) # for the tune package, along with the rest of tidymodels
#helper packages
library(readr) #for importing the data
library(vip) #for variable importance plots
#The hotel booking data----
We will use the hotel bookings data to predict which hotel stays included children and/or babies, based on variables such as the hotel, the cost of the stay, and the arrival date.
First, let’s read the data set.
hotels <-
read_csv('hotels.csv', col_names = T) %>%
mutate_if(is.character, as.factor) #convert character variables into factors
We are only analyzing actual hotel stays (rather than all bookings) because guests who cancel their bookings tend to have a lot of missing data.
glimpse(hotels)
## Rows: 50,000
## Columns: 23
## $ hotel <fct> City_Hotel, City_Hotel, Resort_Hotel...
## $ lead_time <dbl> 217, 2, 95, 143, 136, 67, 47, 56, 80...
## $ stays_in_weekend_nights <dbl> 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, ...
## $ stays_in_week_nights <dbl> 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, ...
## $ adults <dbl> 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, ...
## $ children <fct> none, none, none, none, none, none, ...
## $ meal <fct> BB, BB, BB, HB, HB, SC, BB, BB, BB, ...
## $ country <fct> DEU, PRT, GBR, ROU, PRT, GBR, ESP, E...
## $ market_segment <fct> Offline_TA/TO, Direct, Online_TA, On...
## $ distribution_channel <fct> TA/TO, Direct, TA/TO, TA/TO, Direct,...
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <fct> A, D, A, A, F, A, C, B, D, A, A, D, ...
## $ assigned_room_type <fct> A, K, A, A, F, A, C, A, D, A, D, D, ...
## $ booking_changes <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <fct> No_Deposit, No_Deposit, No_Deposit, ...
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <fct> Transient-Party, Transient, Transien...
## $ average_daily_rate <dbl> 80.75, 170.00, 8.00, 81.00, 157.60, ...
## $ required_car_parking_spaces <fct> none, none, none, none, none, none, ...
## $ total_of_special_requests <dbl> 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, ...
## $ arrival_date <date> 2016-09-01, 2017-08-25, 2016-11-19,...
We will build a model to predict which actual hotel stays included children and/or babies, and which did not. Our outcome variable children is a factor variable with two levels:
hotels %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 4038 0.0808
## 2 none 45962 0.919
We can see that children were present in only 8.1% of the reservations. This type of class imbalance can often wreak havoc on an analysis. We could address the imbalance by upsampling or downsampling, for example with the themis package, but this practice will analyze the data as-is and rely on stratified sampling instead.
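For reference, here is a minimal sketch of how the imbalance could be handled instead, assuming the themis package is installed; we do not run this step in the analysis below.
library(themis) #provides sampling steps such as step_downsample()
balance_recipe <-
recipe(children ~ ., data = hotels) %>%
step_downsample(children) #randomly drop rows of the majority class ("none")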
#Data splitting and resampling----
We will reserve 25% of the data for the test set, while the remaining 75% will serve as the non-test set for model training and validation.
set.seed(123)
splits <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test <- testing(splits)
Training set proportions by children:
hotel_other %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 3048 0.0813
## 2 none 34452 0.919
Test set proportions by children:
hotel_test %>%
count(children) %>%
mutate(prop = n/sum(n))
## # A tibble: 2 x 3
## children n prop
## <fct> <int> <dbl>
## 1 children 990 0.0792
## 2 none 11510 0.921
Until now, we have relied on rsample::vfold_cv() to create 10 different resampled data sets that would produce 10 performance metrics for us to average.
However, here we will split the non-test set just once: a single 20% validation set to measure the model's performance and an 80% training set to train the model (see the diagram below).
Diagram 1. Data set validation split
We will use validation_split() to allocate 20% (N = 7,500) of the non-test hotel_other data to the validation set and 80% (N = 30,000) to the training set.
This means that our model performance metrics will be computed on a single set of 7,500 hotel stays.
The sample size is fairly large, so the metrics should provide a reliable index of how well the model predicts our outcome variable in a single resampling iteration.
set.seed(234)
val_set <- validation_split(hotel_other,
strata = children, #stratified sampling
prop = 0.80) #indicating the proportion
val_set
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 2
## splits id
## <named list> <chr>
## 1 <split [30K/7.5K]> validation
validation_split() works similarly to initial_split() in splitting the data set; both have the same strata argument, which sets a reference variable for keeping proportions comparable between the two resulting data sets.
This means we'll have roughly the same proportions of hotel stays with and without children in our new validation and training sets as in the original hotel_other data.
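If we want to double-check the split, we can peek at the two pieces of this single resample (a quick sanity check, not required for the modeling below):
val_split <- val_set$splits[[1]] #the one and only split object
nrow(analysis(val_split)) #rows used for training (about 30,000)
nrow(assessment(val_split)) #rows used for validation (about 7,500)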
#The first model: Penalized Logistic Regression----
Since our outcome variable, children, is categorical, we will use logistic regression as the first predictive model. The generalized linear model fit via penalized maximum likelihood estimates the regression slope parameters with a penalty, so that less relevant predictors are driven towards zero, or exactly to zero if a large enough penalty is used (the lasso method).
First, let’s build a model with the glmnet engine.
lr_mod <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
We set the penalty to tune() for now, as a hyperparameter to be optimized later. Setting mixture to a value of one means that the glmnet model will potentially remove irrelevant predictors entirely and choose a simpler model.
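For reference, mixture controls the type of penalty: a value of one gives the lasso, zero gives ridge regression, and values in between give an elastic net. A hedged sketch of a specification that tunes both hyperparameters, which we do not use here:
lr_mod_elastic <-
logistic_reg(penalty = tune(), mixture = tune()) %>% #tune both the amount and the type of penalty
set_engine("glmnet")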
Second, we create a recipe to pre-process the hotel data set, building date-based predictors from the important components of arrival_date.
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date, holidays = holidays) %>%
step_rm(arrival_date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
step_date() creates predictors for the year, month, and day of the week.
step_holiday() generates a set of indicator variables for specific holidays. Although we don't know where these two hotels are located, we do know that the countries of origin for most stays are in Europe.
step_rm() removes the original date variable since we no longer want it in the model.
All categorical predictors should be converted to dummy variables, and all numeric predictors need to be centered and scaled, so step_dummy() converts characters or factors into one or more numeric binary model terms for the levels of the original data.
step_zv() removes indicator variables that contain only a single unique value. This is important because the predictors should be centered and scaled in a penalized model. Finally, step_normalize() centers and scales the numeric variables.
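If we want to see exactly what these steps produce, a quick way (for illustration only; the workflow below handles this for us) is to prep the recipe on the training data and extract the processed result:
lr_recipe %>%
prep(training = hotel_other) %>% #estimate the recipe steps from the training data
juice() %>% #return the processed training set
glimpse()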
Third, we bundle both the model and the recipe into a single workflow() object for ease of management.
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe)
Fourth, we create a grid of penalty values to tune. Previously, we used grid_regular() to create an expanded grid from combinations of two hyperparameters. Since we only have one hyperparameter in this model, we can set the grid up manually as a one-column tibble with 30 candidate values:
lr_reg_grid <- tibble(penalty = 10^seq(from = -4, to = -1, length.out = 30))
lr_reg_grid %>% top_n(-5) # lowest penalty values. Negative values select bottom from group.
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0001
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
lr_reg_grid %>% top_n(5) # highest penalty values
## Selecting by penalty
## # A tibble: 5 x 1
## penalty
## <dbl>
## 1 0.0386
## 2 0.0489
## 3 0.0621
## 4 0.0788
## 5 0.1
Fifth, we can train and tune the model. Let's use tune::tune_grid() to train these 30 penalized logistic regression models. We'll also save the validation set predictions (via the call to control_grid()) so that diagnostic information is available after the model fit.
The area under the ROC curve (AUC) will be used to quantify how well the model performs across a continuum of event thresholds (recall that the event rate, the proportion of stays including children, is very low for these data).
lr_res <- #res means resample
lr_workflow %>% #pipe the workflow, which consists of the model and the recipe
tune_grid(val_set, #val_set contains both training and validation data sets for the model.
grid = lr_reg_grid, #specify the grid for penalty value
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
Comparing val_set and lr_res, we can see that lr_res has three additional columns: .metrics, .notes, and .predictions.
val_set
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 2
## splits id
## <named list> <chr>
## 1 <split [30K/7.5K]> validation
lr_res
## # Validation Set Split (0.8/0.2) using stratification
## # A tibble: 1 x 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [30K/7.5~ validati~ <tibble [30 x ~ <tibble [0 x ~ <tibble [224,970 x ~
It might be easier to visualize the validation set metrics by plotting the area under the ROC curve against the range of penalty values:
lr_plot <-
lr_res %>%
collect_metrics() %>% #collect the tuning results (ROC AUC) for plotting.
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
ylab("Area under the ROC Curve") +
scale_x_log10(labels = scales::label_number())
lr_plot
The plot shows us that model performance is generally better at the smaller penalty values. This suggests that the majority of the predictors are important to the model.
The steep drop in the AUC toward the highest penalty values indicates that a large enough penalty will remove all predictors from the model, thus reducing its predictive accuracy. An AUC of 0.50 means the model is no better than chance.
Model performance peaks at the smaller penalty values, so we can use the roc_auc metric to identify candidate values for the best hyperparameter.
top_models <-
lr_res %>% #the tuning results
show_best("roc_auc", n = 15) %>%
arrange(penalty) #arrange rows by penalty, from smallest to largest
top_models
## # A tibble: 15 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0001 roc_auc binary 0.880 1 NA
## 2 0.000127 roc_auc binary 0.881 1 NA
## 3 0.000161 roc_auc binary 0.881 1 NA
## 4 0.000204 roc_auc binary 0.881 1 NA
## 5 0.000259 roc_auc binary 0.881 1 NA
## 6 0.000329 roc_auc binary 0.881 1 NA
## 7 0.000418 roc_auc binary 0.881 1 NA
## 8 0.000530 roc_auc binary 0.881 1 NA
## 9 0.000672 roc_auc binary 0.881 1 NA
## 10 0.000853 roc_auc binary 0.881 1 NA
## 11 0.00108 roc_auc binary 0.881 1 NA
## 12 0.00137 roc_auc binary 0.881 1 NA
## 13 0.00174 roc_auc binary 0.881 1 NA
## 14 0.00221 roc_auc binary 0.880 1 NA
## 15 0.00281 roc_auc binary 0.879 1 NA
Every candidate model in this tibble likely includes more predictor variables than the model in the row below it. If we used select_best(), it would return the 8th candidate model, with a penalty value of 0.00053, shown with the dotted line below. (Try replacing show_best() with select_best() and removing the n argument in the chunk above.)
Plot: area under the ROC curve across penalty values, with the best model marked
However, we may want to choose a penalty value further along the x-axis, closer to where we start to see the decline in model performance. For example, candidate model 12, with a penalty value of 0.00137, has effectively the same performance as the numerically best model (see the dashed line) but with fewer predictors.
Essentially, we want to keep as few irrelevant predictors in the model as possible; if performance is about the same, we'd prefer the higher penalty value.
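One way to make this choice programmatically, sketched here under the assumption that a loss of up to 0.005 AUC is acceptable (the tune package also offers select_by_pct_loss() for this kind of trade-off), is to keep only the candidates close to the best result and pick the largest penalty among them:
lr_simpler <-
lr_res %>%
collect_metrics() %>%
filter(mean >= max(mean) - 0.005) %>% #keep candidates within 0.005 AUC of the best
arrange(desc(penalty)) %>% #prefer the largest penalty, i.e., the fewest predictors
slice(1)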
Let’s select this value and visualize the validation set ROC curve:
lr_best <-
lr_res %>%
collect_metrics() %>%
arrange(penalty) %>%
slice(12) #model 12
lr_best
## # A tibble: 1 x 6
## penalty .metric .estimator mean n std_err
## <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.00137 roc_auc binary 0.881 1 NA
Now we can plot the ROC curve based on our chosen model:
lr_auc <-
lr_res %>% #the tuning results (they contain the saved predictions)
collect_predictions(parameters = lr_best) %>% #use the parameters from the chosen model
roc_curve(children, .pred_children) %>% #compute the ROC curve from the truth and the class probability for children
mutate(model = "Logistic Regression") #label the model for the comparison plot later
autoplot(lr_auc)
The level of performance generated by this logistic regression model is good, but not groundbreaking. Perhaps the linear nature of the prediction equation is too limiting for this data set. As a next step, we might consider a highly non-linear model generated using a tree-based ensemble method.
#The second model: Tree-based ensemble----
Aside from the generalized linear model (GLM), the random forest is another effective and low-maintenance model we can use.
A random forest (RF) is an ensemble of many non-linear decision trees, which makes it more flexible than the GLM. Aggregating results across trees keeps the final predictions stable. The model itself requires little data preprocessing and can handle many types of predictors (sparse, skewed, continuous, categorical, etc.).
Although it performs decently with default settings, we will tune two hyperparameters for optimization: mtry (the number of predictors sampled at each split in a tree) and min_n (the minimum number of data points required to split a node). The RF is computationally expensive to train and tune, but this strain can be lessened with parallel processing.
While the tune package can run resamples in parallel, we only have a single validation set here, so we turn to an alternative: the ranger engine can parallelize the computations itself, but we need to know how many cores we have to work with.
cores <- parallel::detectCores()
cores
## [1] 8
Now, we can pass on the core number to the ranger engine when we set up our parsnip rand_forest() model.
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")
However, in other contexts, it is recommended that you use the tune package for parallel processing.
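For example, with a multi-fold resampling scheme, one could register a parallel backend that tune picks up automatically; a minimal sketch, assuming the doParallel package is installed:
library(doParallel)
registerDoParallel(cores = cores) #register a backend with the detected number of cores
# ... run tune_grid() as usual; resamples would then be processed in parallel ...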
We do not need to create dummy variables or normalize the predictors as we did for the generalized linear model, but we still want to engineer features from our arrival_date variable to make date-related patterns easier to identify.
rf_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date) %>%
step_rm(arrival_date)
Next, we combine both the model and the recipe into a workflow, so that we can pre-process the data set efficiently.
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
We have two hyperparameters to be tuned:
rf_mod
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
##
## Engine-Specific Arguments:
## num.threads = cores
##
## Computational engine: ranger
# show what will be tuned
rf_mod %>%
parameters()
## Collection of 2 parameters for tuning
##
## id parameter type object class
## mtry mtry nparam[?]
## min_n min_n nparam[+]
##
## Model parameters needing finalization:
## # Randomly Selected Predictors ('mtry')
##
## See `?dials::finalize` or `?dials::update.parameters` for more information.
We will use a space-filling design to tune the hyperparameter, with 25 candidate models:
set.seed(345) #For replicability
rf_res <- #res = resampled
rf_workflow %>%
tune_grid(val_set,#val_set contains both training and validation data sets for the model.
grid = 25, #intended number of candidate models
control = control_grid(save_pred = TRUE), #retain this prediction value for diagnosis.
metrics = metric_set(roc_auc)) #ask for the AUC.
The message printed above “Creating pre-processing data to finalize unknown parameter: mtry” is related to the size of the data set. Since mtry depends on the number of predictors in the data set, tune_grid() determines the upper bound for mtry once it receives the data.
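For illustration, this finalization could also be done by hand with dials::finalize(); a rough sketch (the upper bound is approximate here, since the recipe adds extra date-based columns):
rf_mod %>%
parameters() %>%
finalize(hotel_other %>% select(-children)) #set the upper bound of mtry from the predictor columns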
Here are our top 5 random forest models, out of the 25 candidates in terms of AUC:
rf_res %>%
show_best(metric = "roc_auc")
## # A tibble: 5 x 7
## mtry min_n .metric .estimator mean n std_err
## <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 3 3 roc_auc binary 0.933 1 NA
## 2 8 7 roc_auc binary 0.933 1 NA
## 3 6 18 roc_auc binary 0.933 1 NA
## 4 7 25 roc_auc binary 0.932 1 NA
## 5 9 12 roc_auc binary 0.932 1 NA
We can see that all of our top 5 random forest models perform better than the selected penalized logistic regression model (best AUC = 0.881).
Now we plot the results of the tuning process, which highlight both mtry (the number of predictors sampled at each split) and min_n (the minimum number of data points required to keep splitting). For these data, smaller values of both hyperparameters give the best performance.
The good news is that the range of the y-axis indicates the model is very robust to the choice of these parameter values: all but one of the ROC AUC values are greater than 0.90.
autoplot(rf_res)
Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:
rf_best <-
rf_res %>% #the tuning results
select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 2
## mtry min_n
## <int> <int>
## 1 3 3
To calculate the data needed to plot the ROC curve, we use collect_predictions(). This is only possible after tuning with control_grid(save_pred = TRUE).
rf_res %>%
collect_predictions()
## # A tibble: 187,475 x 7
## id .pred_children .pred_none .row mtry min_n children
## <chr> <dbl> <dbl> <int> <int> <int> <fct>
## 1 validation 0.00207 0.998 11 12 7 none
## 2 validation 0.00025 1.00 13 12 7 none
## 3 validation 0.000333 1.00 31 12 7 none
## 4 validation 0.000143 1.00 32 12 7 none
## 5 validation 0 1 36 12 7 none
## 6 validation 0.0888 0.911 43 12 7 none
## 7 validation 0.123 0.877 45 12 7 none
## 8 validation 0.0673 0.933 47 12 7 none
## 9 validation 0.167 0.833 48 12 7 none
## 10 validation 0.00424 0.996 53 12 7 none
## # ... with 187,465 more rows
In the output, you can see the two columns that hold our class probabilities for predicting hotel stays with and without children: .pred_children and .pred_none.
To filter the predictions for only our best random forest model, we can use the parameters argument and pass it our tibble with the best hyperparameter values from tuning, which we called rf_best:
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>% #to collect only the best model.
roc_curve(children, .pred_children) %>% #compute the ROC curve from these two columns.
mutate(model = "Random Forest") #specify the model.
Now, we can compare the validation set ROC curves for our top penalized logistic regression model and top random forest model:
rf_auc
## # A tibble: 7,135 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Random Forest
## 2 0.0000769 0 1 Random Forest
## 3 0.0000830 0.000145 1 Random Forest
## 4 0.0000859 0.000290 1 Random Forest
## 5 0.0000928 0.000435 1 Random Forest
## 6 0.000100 0.000581 1 Random Forest
## 7 0.000112 0.000726 1 Random Forest
## 8 0.000115 0.000871 1 Random Forest
## 9 0.000116 0.00102 1 Random Forest
## 10 0.000122 0.00116 1 Random Forest
## # ... with 7,125 more rows
lr_auc
## # A tibble: 7,105 x 4
## .threshold specificity sensitivity model
## <dbl> <dbl> <dbl> <chr>
## 1 -Inf 0 1 Logistic Regression
## 2 0.000536 0 1 Logistic Regression
## 3 0.000802 0.000145 1 Logistic Regression
## 4 0.000909 0.000290 1 Logistic Regression
## 5 0.00110 0.000435 1 Logistic Regression
## 6 0.00129 0.000581 1 Logistic Regression
## 7 0.00133 0.000726 1 Logistic Regression
## 8 0.00135 0.000871 1 Logistic Regression
## 9 0.00141 0.00116 1 Logistic Regression
## 10 0.00143 0.00131 1 Logistic Regression
## # ... with 7,095 more rows
bind_rows(rf_auc, lr_auc) %>% #to draw the two AUC together
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + #map the roc_curve columns to the axes and color by model.
geom_path(lwd = 1.5 , alpha = 0.8) + #to connect the two AUC. lwd = linewidth. Alpha = color transparency value.
geom_abline(lty = 3) + # abline to annotate the plot. lty=line type.
coord_equal() + #ensures that the ranges of axes are equal.
scale_color_viridis_d(option = "plasma", end = .6)
The random forest is uniformly better across event probability thresholds.
#The Last Fit----
Our goal was to predict which hotel stays included children and/or babies, and based on the validation set results, the random forest is our best bet for prediction.
After selecting our best model and hyperparameter values, our last task is to fit the final model on all the rows of the non-test data (hotel_other), i.e., the training plus validation sets, to use as many samples as possible. Then we evaluate the model's performance one last time with the held-out test set (hotel_test).
We’ll start by building our parsnip model object again from scratch. We take our best hyperparameter values from our random forest model. When we set the engine, we will add a new argument: importance = "impurity" to provide variable importance scores for this last model, which gives some insight into which predictors drive model performance.
# the last model
last_rf_mod <-
rand_forest(mtry = 8, min_n = 7, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>% #random forest workflow
update_model(last_rf_mod) #update the workflow with the above model.
# the last fit
set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits) #Fit the final best model to the training set and evaluate the test set. "splits" contains both the training and the testing set.
last_rf_fit
## # Monte Carlo cross-validation (0.75/0.25) with 1 resamples
## # A tibble: 1 x 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [37.5K~ train/test~ <tibble [2 ~ <tibble [0~ <tibble [12,500~ <workflo~
This fitted workflow contains everything, including our final metrics based on the test set. So, how did this model do on the test set? Was the validation set a good estimate of future performance?
last_rf_fit %>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.948
## 2 roc_auc binary 0.922
The ROC AUC value is pretty close to what we saw when we tuned the random forest model with the validation set, which is good news. This means that our estimate of how well our model would perform with new data was not too far off from how well our model actually performed with the unseen test data.
We can access those variable importance scores via the .workflow column. We first need to pluck out the first element in the workflow column, then pull workflow fit from the workflow object. Finally, the vip package helps us visualize the variable importance scores for the top 20 features:
last_rf_fit %>%
pluck(".workflow", 1) %>% #take out the first element in the ".workflow" column.
pull_workflow_fit() %>% #extract the fitted model object from the workflow.
vip(num_features = 20) #display the top 20 variables.
The most important predictors in whether a hotel stay had children or not were the daily cost for the room, the type of room reserved, the type of room that was ultimately assigned, and the time between the creation of the reservation and the arrival date.
Let's generate our last ROC curve to visualize. Since the event we are predicting is the first level of the children factor ("children"), we provide roc_curve() with the relevant class probability, .pred_children:
last_rf_fit %>%
collect_predictions() %>% #collect the test set predictions.
roc_curve(children, .pred_children) %>% #construct the full ROC curve from the truth and the class probability.
autoplot()
Based on these results, the validation set and test set performance statistics are very close, so we would have pretty high confidence that our random forest model with the selected hyperparameters would perform well when predicting new data.
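As a possible next step beyond this case study, if we wanted to use the model on genuinely new bookings, we could refit the finalized workflow on all available data; a hedged sketch (new_bookings is a hypothetical data frame with the same columns as hotels):
final_hotel_model <- fit(last_rf_workflow, data = hotels) #train on every row we have
# predict(final_hotel_model, new_data = new_bookings, type = "prob") #score hypothetical new data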