Get started with building a model in this R Markdown document, which accompanies the A predictive modeling case study article from the tidymodels Get Started series.

If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.

Take advantage of the RStudio IDE and use the “Run All Chunks Above” or “Run Current Chunk” buttons to execute code chunks easily. If you have been running other tidymodels articles in this project, restart R before working on this article so you don’t run out of memory on RStudio Cloud.

Introduction

Let’s put everything we learned from each of the previous Get Started articles together and build a predictive model from beginning to end with data on hotel stays.

Load necessary packages:

library(tidymodels)  
# Helper packages
library(readr)       # for importing data
library(vip)         # for variable importance plots

The Hotel Bookings Data

Let’s read the hotel data into R and randomly select 30% of the rows in the data set to avoid long computation times later on.

Note that your results will differ from the original article, since you are only using 30% of the data.

# Fix the random numbers by setting the seed 
# This enables the analysis to be reproducible when random numbers are used
set.seed(123)
hotels <- 
  read_csv('https://tidymodels.org/start/case-study/hotels.csv') %>%
  mutate_if(is.character, as.factor) %>%
  # randomly select rows
  slice_sample(prop = 0.30)
dim(hotels)
## [1] 15000    23
glimpse(hotels)
## Rows: 15,000
## Columns: 23
## $ hotel                          <fct> City_Hotel, Resort_Hotel, City_Hotel...
## $ lead_time                      <dbl> 7, 1, 237, 134, 140, 1, 210, 131, 12...
## $ stays_in_weekend_nights        <dbl> 0, 0, 1, 1, 0, 1, 2, 2, 0, 1, 1, 2, ...
## $ stays_in_week_nights           <dbl> 1, 1, 2, 1, 3, 1, 5, 5, 2, 0, 1, 1, ...
## $ adults                         <dbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, ...
## $ children                       <fct> children, none, none, children, none...
## $ meal                           <fct> BB, BB, SC, BB, BB, BB, HB, HB, BB, ...
## $ country                        <fct> PRT, PRT, SWE, DEU, PRT, PRT, DEU, G...
## $ market_segment                 <fct> Direct, Corporate, Online_TA, Online...
## $ distribution_channel           <fct> Direct, Corporate, TA/TO, TA/TO, TA/...
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <dbl> 0, 15, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0...
## $ reserved_room_type             <fct> E, A, A, A, A, A, A, C, A, A, A, E, ...
## $ assigned_room_type             <fct> E, E, A, A, A, D, A, C, A, A, A, E, ...
## $ booking_changes                <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, ...
## $ deposit_type                   <fct> No_Deposit, No_Deposit, No_Deposit, ...
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 87, 0, 0, 0, 56, 0, 0, 0...
## $ customer_type                  <fct> Transient, Transient, Transient, Tra...
## $ average_daily_rate             <dbl> 173.95, 50.00, 72.42, 134.50, 65.00,...
## $ required_car_parking_spaces    <fct> none, none, none, none, none, none, ...
## $ total_of_special_requests      <dbl> 1, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, ...
## $ arrival_date                   <date> 2016-07-23, 2016-04-20, 2017-04-28,...

Let’s look at proportions of hotel stays that include children and/or babies:

hotels %>% 
  count(children) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   children     n   prop
##   <fct>    <int>  <dbl>
## 1 children  1192 0.0795
## 2 none     13808 0.921

A first model: penalized logistic regression

Do you recall the Evaluate your model with resampling article, where we covered data splitting?

Let’s reserve 25% of the hotels data for the test set:

set.seed(123)
splits      <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test  <- testing(splits)
# training set proportions by children
hotel_other %>% 
  count(children) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   children     n   prop
##   <fct>    <int>  <dbl>
## 1 children   874 0.0777
## 2 none     10376 0.922
# test set proportions by children
hotel_test  %>% 
  count(children) %>% 
  mutate(prop = n/sum(n))
## # A tibble: 2 x 3
##   children     n   prop
##   <fct>    <int>  <dbl>
## 1 children   318 0.0848
## 2 none      3432 0.915

Now let’s reserve 20% of the hotel_other data for our validation set; the remaining 80% will be used to fit the models.

set.seed(234)
val_set <- validation_split(hotel_other, 
                            strata = children, 
                            prop = 0.80)
val_set
## # Validation Set Split (0.8/0.2)  using stratification 
## # A tibble: 1 x 2
##   splits            id        
##   <list>            <chr>     
## 1 <split [9K/2.2K]> validation
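
If you would like to peek inside this single resample, the rsample helpers analysis() and assessment() return the two data sets it holds (a quick, optional sketch):

# Optional: inspect the split stored in val_set.
# analysis() returns the 80% used for model fitting;
# assessment() returns the 20% held out for validation.
val_split <- val_set$splits[[1]]
dim(analysis(val_split))
dim(assessment(val_split))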

Build the model

Let’s specify a penalized logistic regression model that uses the lasso method. Note that we set penalty = tune() as a placeholder so we can tune it in the next steps, and since we are using the lasso, we set mixture = 1.

lr_mod <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

For more details, try typing ?logistic_reg in the console.
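
If you are curious how parsnip will hand this specification off to glmnet, translate() shows the underlying call it will build (optional):

# Optional: see the glmnet call that parsnip will construct;
# the tuned penalty shows up as a placeholder.
lr_mod %>% translate()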

Create the recipe

Remember the second article, Preprocess your data with recipes?

Let’s preprocess the data by creating a recipe:

holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter", 
              "ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe <- 
  recipe(children ~ ., data = hotel_other) %>% 
  step_date(arrival_date) %>% 
  step_holiday(arrival_date, holidays = holidays) %>% 
  step_rm(arrival_date) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())
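
If you want to double-check what these steps produce, you can prep() the recipe and bake() the training data. The workflow below handles this automatically, so this is purely an optional sketch:

# Optional: estimate the recipe on hotel_other and preview the processed predictors.
lr_recipe %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  glimpse()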

Create the workflow

Let’s bundle the model and recipe into a single workflow():

lr_workflow <- 
  workflow() %>% 
  add_model(lr_mod) %>% 
  add_recipe(lr_recipe)

Create the grid for tuning

We can now tune our model, similar to what we did in the previous article, Tune model parameters. Let’s create a grid of 30 candidate values for the hyperparameter we would like to tune:

lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5) # lowest penalty values
## # A tibble: 5 x 1
##    penalty
##      <dbl>
## 1 0.0001  
## 2 0.000127
## 3 0.000161
## 4 0.000204
## 5 0.000259
lr_reg_grid %>% top_n(5)  # highest penalty values
## # A tibble: 5 x 1
##   penalty
##     <dbl>
## 1  0.0386
## 2  0.0489
## 3  0.0621
## 4  0.0788
## 5  0.1
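
As an aside, the dials package can build an equivalent grid for you. penalty() works on the log10 scale, so the range below matches 10^seq(-4, -1, length.out = 30) (optional sketch):

# Optional: an equivalent 30-value penalty grid built with dials.
grid_regular(penalty(range = c(-4, -1)), levels = 30)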

Train and tune the model

Let’s train these 30 penalized logistic regression models, one for each penalty value in the grid. We provide the validation set val_set as the resampling object, so the performance metrics computed on val_set will be available after the fit.

lr_res <- 
  lr_workflow %>% 
  tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE, verbose = TRUE),
            metrics = metric_set(roc_auc))
lr_res
## # Tuning results
## # Validation Set Split (0.8/0.2)  using stratification 
## # A tibble: 1 x 5
##   splits           id         .metrics        .notes         .predictions       
##   <list>           <chr>      <list>          <list>         <list>             
## 1 <split [9K/2.2K~ validation <tibble [30 x ~ <tibble [0 x ~ <tibble [67,470 x ~

Now visualize the validation set metrics by plotting the area under the ROC curve against the range of penalty values:

lr_plot <- 
  lr_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line() + 
  ylab("Area under the ROC Curve") +
  scale_x_log10(labels = scales::label_number())
lr_plot 

Get the best values for this hyperparameter:

top_models <-
  lr_res %>% 
  show_best("roc_auc", n = 15) %>% 
  arrange(penalty) 
top_models
## # A tibble: 15 x 7
##     penalty .metric .estimator  mean     n std_err .config              
##       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.000204 roc_auc binary     0.854     1      NA Preprocessor1_Model04
##  2 0.000259 roc_auc binary     0.855     1      NA Preprocessor1_Model05
##  3 0.000329 roc_auc binary     0.857     1      NA Preprocessor1_Model06
##  4 0.000418 roc_auc binary     0.858     1      NA Preprocessor1_Model07
##  5 0.000530 roc_auc binary     0.859     1      NA Preprocessor1_Model08
##  6 0.000672 roc_auc binary     0.860     1      NA Preprocessor1_Model09
##  7 0.000853 roc_auc binary     0.860     1      NA Preprocessor1_Model10
##  8 0.00108  roc_auc binary     0.861     1      NA Preprocessor1_Model11
##  9 0.00137  roc_auc binary     0.861     1      NA Preprocessor1_Model12
## 10 0.00174  roc_auc binary     0.862     1      NA Preprocessor1_Model13
## 11 0.00221  roc_auc binary     0.862     1      NA Preprocessor1_Model14
## 12 0.00281  roc_auc binary     0.862     1      NA Preprocessor1_Model15
## 13 0.00356  roc_auc binary     0.860     1      NA Preprocessor1_Model16
## 14 0.00452  roc_auc binary     0.858     1      NA Preprocessor1_Model17
## 15 0.00574  roc_auc binary     0.856     1      NA Preprocessor1_Model18

Let’s pick candidate model 12 with a penalty value of 0.00137. Its performance is essentially the same as the numerically best model, but the larger penalty is likely to remove more irrelevant predictors:

Note that because you are using less data, your mean ROC AUC will be slightly lower than what’s shown in the article.

lr_best <- 
  lr_res %>% 
  collect_metrics() %>% 
  arrange(penalty) %>% 
  slice(12)
lr_best
## # A tibble: 1 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.00137 roc_auc binary     0.861     1      NA Preprocessor1_Model12
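
For comparison, select_best() returns the numerically best penalty value on its own (optional):

# Optional: the single numerically best penalty, for comparison with model 12.
lr_res %>% select_best(metric = "roc_auc")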

And visualize the validation set ROC curve:

lr_auc <- 
  lr_res %>% 
  collect_predictions(parameters = lr_best) %>% 
  roc_curve(children, .pred_children) %>% 
  mutate(model = "Logistic Regression")
autoplot(lr_auc)

A second model: tree-based ensemble

Let’s try to improve our prediction performance with a random forest, a model type we also explored in the Evaluate your model with resampling article.

Check the number of cores we have to work with:

cores <- parallel::detectCores()
cores
## [1] 8

Set the model specification and pass the number of cores to ranger so the model computations can run in parallel while tuning:

rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("classification")

Create the recipe and workflow

Let’s create the recipe for this model. Unlike the penalized logistic regression, a random forest does not require dummy or normalized predictor variables, so the recipe is simpler:

rf_recipe <- 
  recipe(children ~ ., data = hotel_other) %>% 
  step_date(arrival_date) %>% 
  step_holiday(arrival_date) %>% 
  step_rm(arrival_date) 

Then bundle it with the model specification:

rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)

Train and tune the model

When we set up our parsnip model, we chose two hyperparameters for tuning:

rf_mod
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 1000
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   num.threads = cores
## 
## Computational engine: ranger
# show what will be tuned
rf_mod %>%    
  parameters()  
## Collection of 2 parameters for tuning
## 
##  identifier  type    object
##        mtry  mtry nparam[?]
##       min_n min_n nparam[+]
## 
## Model parameters needing finalization:
##    # Randomly Selected Predictors ('mtry')
## 
## See `?dials::finalize` or `?dials::update.parameters` for more information.
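
The mtry parameter has an unknown upper bound until the number of predictors is known. tune_grid() will finalize it for us when we hand it an integer grid size, but you could also do this by hand with dials::finalize() (optional sketch):

# Optional: finalize mtry's range against the predictor columns by hand.
rf_mod %>% 
  parameters() %>% 
  finalize(hotel_other %>% select(-children))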

We will use a space-filling design to tune with 12 candidate models (instead of 25, to reduce computation time).

Be patient here! Computing these results will take several minutes to complete if you are using the default RStudio Cloud resources (1 GB memory, 1 CPU).

set.seed(345)
rf_res <- 
  rf_workflow %>% 
  tune_grid(val_set,
            grid = 12,
            control = control_grid(save_pred = TRUE, verbose = TRUE),
            metrics = metric_set(roc_auc))

Here are our top 5 random forest models, out of the 12 candidates:

Note that your results will be different and your ROC AUC will take a small hit since you are using less data (only 30% of the whole data set) and a smaller grid to tune the model hyperparameters.

rf_res %>% 
  show_best(metric = "roc_auc")
## # A tibble: 5 x 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     7    13 roc_auc binary     0.895     1      NA Preprocessor1_Model12
## 2     9     5 roc_auc binary     0.894     1      NA Preprocessor1_Model06
## 3     4    38 roc_auc binary     0.893     1      NA Preprocessor1_Model07
## 4    13     4 roc_auc binary     0.891     1      NA Preprocessor1_Model03
## 5    12    28 roc_auc binary     0.890     1      NA Preprocessor1_Model09

But we’re already getting much better results than our penalized logistic regression!

Let’s plot the results:

autoplot(rf_res)

Let’s select the best model according to the ROC AUC metric. Our final tuning parameter values are:

rf_best <- 
  rf_res %>% 
  select_best(metric = "roc_auc")
rf_best
## # A tibble: 1 x 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     7    13 Preprocessor1_Model12

Let’s collect the predictions to plot the ROC curve. Calling collect_predictions() with no arguments returns the predictions from every tuned model; to keep only our best random forest, we pass its parameter values rf_best to the parameters argument.

rf_res %>% 
  collect_predictions()
## # A tibble: 26,988 x 8
##   id        .pred_children .pred_none  .row  mtry min_n children .config        
##   <chr>              <dbl>      <dbl> <int> <int> <int> <fct>    <chr>          
## 1 validati~       0.000611      0.999     1    16    32 none     Preprocessor1_~
## 2 validati~       0.119         0.881     4    16    32 children Preprocessor1_~
## 3 validati~       0             1         7    16    32 none     Preprocessor1_~
## 4 validati~       0.00212       0.998    25    16    32 none     Preprocessor1_~
## 5 validati~       0.00208       0.998    26    16    32 none     Preprocessor1_~
## # ... with 26,983 more rows
rf_auc <- 
  rf_res %>% 
  collect_predictions(parameters = rf_best) %>% 
  roc_curve(children, .pred_children) %>% 
  mutate(model = "Random Forest")

Now, we can compare the validation set ROC curves for our top penalized logistic regression model and random forest model:

bind_rows(rf_auc, lr_auc) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal() + 
  scale_color_viridis_d(option = "plasma", end = .6)

The random forest is uniformly better across event probability thresholds.
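
To put numbers behind that comparison, you can also look at the best validation set ROC AUC for each model side by side (a quick optional sketch; your values will vary):

# Optional: compare the best validation set ROC AUC values numerically.
bind_rows(
  lr_res %>% show_best(metric = "roc_auc", n = 1) %>% mutate(model = "logistic regression"),
  rf_res %>% show_best(metric = "roc_auc", n = 1) %>% mutate(model = "random forest")
) %>% 
  select(model, mean)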

The last fit

Let’s evaluate the model performance one last time with the held-out test set. We’ll start by building our parsnip model object again from scratch, plugging in the best hyperparameter values from the random forest tuned in the original article (mtry = 8, min_n = 7); your own rf_best values may differ slightly since you are working with a smaller sample:

# the last model
last_rf_mod <- 
  rand_forest(mtry = 8, min_n = 7, trees = 1000) %>% 
  set_engine("ranger", num.threads = cores, importance = "impurity") %>% 
  set_mode("classification")
# the last workflow
last_rf_workflow <- 
  rf_workflow %>% 
  update_model(last_rf_mod)
# the last fit
set.seed(345)
last_rf_fit <- 
  last_rf_workflow %>% 
  last_fit(splits)
last_rf_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 x 6
##   splits         id          .metrics     .notes      .predictions     .workflow
##   <list>         <chr>       <list>       <list>      <list>           <list>   
## 1 <split [11.2K~ train/test~ <tibble [2 ~ <tibble [0~ <tibble [3,750 ~ <workflo~

Note that we added a new argument, importance = "impurity", to set_engine() so that we get variable importance scores. This is an optional, engine-specific argument, so its documentation lives with the underlying ranger() function; to see it and the other options, type ?ranger in the console.

Now let’s collect the metrics:

last_rf_fit %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.942 Preprocessor1_Model1
## 2 roc_auc  binary         0.921 Preprocessor1_Model1

Now let’s pluck the fitted workflow from the results, pull out the parsnip model fit, and visualize the variable importance scores for the top 20 features:

last_rf_fit %>% 
  pluck(".workflow", 1) %>%   
  pull_workflow_fit() %>% 
  vip(num_features = 20)
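
If you are on a newer tidymodels release where pull_workflow_fit() is deprecated, the extract_*() helpers should produce the same plot (optional sketch):

# Optional: the same variable importance plot with the newer extract helpers.
last_rf_fit %>% 
  extract_workflow() %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 20)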

Finally, let’s generate our last ROC curve, this time computed on the test set predictions:

last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(children, .pred_children) %>% 
  autoplot()

Not bad!