Hotel Bookings Cancellation

1 Introduction
2 Data Wrangling
3 Data Visualization
- 3.1 Categorical Features
- 3.2 Numeric Features
4 Data Preparation
- 4.1 Cross Validation
- 4.2 X-y Splitting
5 Machine Learning Modelling
6 Model Evaluation
7 Model Tuning
- 7.1 Parameter Tuning
- 7.2 Feature Selection and Feature Engineering
8 Conclusion

1 Introduction

In this article, we will try to build a classification model to predict whether a hotel booking will be canceled or not.

1.1 Background

The data used in this article describes hotel bookings. It contains various features related to a booking, and the target variable indicates whether the booking was canceled. We want to predict whether a customer is likely to cancel a booking, so that we can either prepare for the cancellation or simply reject the request.

The data was collected from Kaggle. I have refactored the data and removed some redundant columns.

1.2 Read Data

We will store the data in the hotel variable.

hotel <- read.csv("hotel_bookings_small.csv")

head(hotel)

dim(hotel)

[1] 119390     16

There are 119390 observations and 16 features. Most of the features are self-explanatory; a few deserve a short description.

1. is_canceled : The target variable; 1 means that the booking was canceled and 0 means it wasn’t.
2. adults, children and babies : The number of adults, children and babies in the booking.
3. meal : The meal package in the booking. BB for Bed and Breakfast, FB for Full Board (Breakfast, Lunch, and Dinner), HB for Half Board (usually Breakfast and Dinner), and RO for Room Only. Note that the data itself uses the levels SC and Undefined rather than RO.
4. customer_type : The type of customer: Contract, Group, Transient (short-term), or Transient-Party.
5. adr : Average daily rate, a common hotel-industry indicator of room price: the average revenue earned per occupied room per night.

1.3 Libraries

For this analysis, I will use dplyr for data wrangling, ggplot2 for data visualization, caret for model evaluation, and various machine learning libraries for building the model.

library(dplyr)
library(ggplot2)
library(caret)
library(e1071)
library(party)
library(randomForest)

We also define a template theme for the visualizations.

theme_ds <- theme(
           panel.background = element_rect(fill="#6CADDF"),
           panel.border = element_rect(fill=NA),
           panel.grid.minor.x = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_blank(),
           panel.grid.minor.y = element_blank(),
           plot.background = element_rect(fill="#00285E"),
           text = element_text(color="white"),
           axis.text = element_text(color="white")
           )

2 Data Wrangling

The data we collect is not always tidy; there are usually some flaws here and there. The process of tidying data is called data wrangling.

2.1 Data Type

To keep the analysis on track, we have to make sure each column has the proper data type.

str(hotel)

'data.frame':   119390 obs. of  16 variables:
 $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
 $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
 $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
 $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...

Two columns, is_canceled and is_repeated_guest, are stored as integers when they should be categorical.

hotel$is_canceled <- as.factor(hotel$is_canceled)
hotel$is_repeated_guest <- as.factor(hotel$is_repeated_guest)

str(hotel)

'data.frame':   119390 obs. of  16 variables:
 $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
 $ is_canceled                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
 $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
 $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
 $ is_repeated_guest             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
 $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
 $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ adr                           : num  0 0 75 75 98 ...
 $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
 $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...

2.2 Missing Values

We also have to check for missing values. Not only do they add noise to the data, but many machine learning algorithms cannot process them.

colSums(is.na(hotel))

                         hotel                    is_canceled 
                             0                              0 
                        adults                       children 
                             0                              4 
                        babies                           meal 
                             0                              0 
             is_repeated_guest         previous_cancellations 
                             0                              0 
previous_bookings_not_canceled             reserved_room_type 
                             0                              0 
                  deposit_type                  customer_type 
                             0                              0 
                           adr    required_car_parking_spaces 
                             0                              0 
     total_of_special_requests             reservation_status 
                             0                              0

Typically, there are three common techniques for dealing with missing values: removing them, replacing them with a constant, or predicting them with an algorithm. In this case we only have 4 missing values out of 119390 rows, so we can simply remove them.

hotel <- na.omit(hotel)

anyNA(hotel)

[1] FALSE
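For comparison with the removal approach used above, here is a minimal sketch of the "replace with a constant" strategy next to it. The data frame `toy` and its values are made up for illustration and are not taken from the hotel data.

```r
# Toy data frame (hypothetical values, not taken from the hotel data)
toy <- data.frame(
  adults   = c(2, 1, 2, 2, 3),
  children = c(0, NA, 1, NA, 0)
)

# Strategy 1: drop every row that contains a missing value
toy_removed <- na.omit(toy)

# Strategy 2: replace missing values with a constant, here the column median
toy_imputed <- toy
med <- median(toy$children, na.rm = TRUE)
toy_imputed$children[is.na(toy_imputed$children)] <- med

nrow(toy_removed)      # 3 rows remain
toy_imputed$children   # 0 0 1 0 0
```

Removal shrinks the data, while imputation keeps every row at the cost of inventing values; with only 4 affected rows in the hotel data, either choice would have very little effect.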

2.3 Feature Selection

Features, or columns, are very important to model performance. Feature selection and feature engineering are two powerful methods that can significantly boost it. Feature selection can be done either by manually selecting the columns or by applying statistical significance tests such as the F-test and the chi-square test.
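As an illustration of the statistical route, the sketch below runs a chi-square test of independence between a categorical feature and a binary target. The counts in `tab` are hypothetical; on the real data the table would come from something like `table(hotel$deposit_type, hotel$is_canceled)`.

```r
# Hypothetical contingency table: rows are feature levels, columns are classes
tab <- matrix(c(70, 20, 30, 80), nrow = 2,
              dimnames = list(deposit  = c("No Deposit", "Non Refund"),
                              canceled = c("0", "1")))

# Chi-square test of independence: a small p-value suggests that the
# feature and the target are associated, so the feature is worth keeping
res <- chisq.test(tab)

res$p.value < 0.05   # TRUE for these counts
```

A test like this only looks at one feature at a time, so it complements manual selection rather than replacing it.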

head(hotel)

As we can see, there are 2 columns we can remove:
1. reservation_status : it encodes the outcome we are trying to predict (Check-Out corresponds to is_canceled = 0 and Canceled to 1), so keeping it would leak the target into the predictors.
2. required_car_parking_spaces : it might be useful, but for now we remove this column.

hotel %>% 
  select(-reservation_status, -required_car_parking_spaces) -> hotel

3 Data Visualization

We can check the summary of the data.

summary(hotel)

          hotel       is_canceled     adults          children      
 City Hotel  :79326   0:75166     Min.   : 0.000   Min.   : 0.0000  
 Resort Hotel:40060   1:44220     1st Qu.: 2.000   1st Qu.: 0.0000  
                                  Median : 2.000   Median : 0.0000  
                                  Mean   : 1.856   Mean   : 0.1039  
                                  3rd Qu.: 2.000   3rd Qu.: 0.0000  
                                  Max.   :55.000   Max.   :10.0000  
                                                                    
     babies                 meal       is_repeated_guest previous_cancellations
 Min.   : 0.000000   BB       :92306   0:115576          Min.   : 0.00000      
 1st Qu.: 0.000000   FB       :  798   1:  3810          1st Qu.: 0.00000      
 Median : 0.000000   HB       :14463                     Median : 0.00000      
 Mean   : 0.007949   SC       :10650                     Mean   : 0.08712      
 3rd Qu.: 0.000000   Undefined: 1169                     3rd Qu.: 0.00000      
 Max.   :10.000000                                       Max.   :26.00000      
                                                                               
 previous_bookings_not_canceled reserved_room_type     deposit_type   
 Min.   : 0.0000                A      :85994      No Deposit:104637  
 1st Qu.: 0.0000                D      :19201      Non Refund: 14587  
 Median : 0.0000                E      : 6535      Refundable:   162  
 Mean   : 0.1371                F      : 2897                         
 3rd Qu.: 0.0000                G      : 2094                         
 Max.   :72.0000                B      : 1114                         
                                (Other): 1551                         
         customer_type        adr          total_of_special_requests
 Contract       : 4076   Min.   :  -6.38   Min.   :0.0000           
 Group          :  577   1st Qu.:  69.29   1st Qu.:0.0000           
 Transient      :89613   Median :  94.59   Median :0.0000           
 Transient-Party:25120   Mean   : 101.83   Mean   :0.5713           
                         3rd Qu.: 126.00   3rd Qu.:1.0000           
                         Max.   :5400.00   Max.   :5.0000

Looking at a bunch of numbers like this is not a good way to explore the data; visualizing it is much better practice.

3.1 Categorical Features

We can check the distribution of the categorical features using bar plots.

p1 <- ggplot(hotel, aes(hotel)) + geom_bar() + theme_ds
p2 <- ggplot(hotel, aes(meal)) + geom_bar() + theme_ds
p3 <- ggplot(hotel, aes(deposit_type)) + geom_bar() + theme_ds
p4 <- ggplot(hotel, aes(customer_type)) + geom_bar() + theme_ds

ggpubr::ggarrange(p1, p2, p3, p4)

From the charts above we can say that:
1. Most of the clients booked a City Hotel
2. Most of the clients ordered the BB meal
3. Most of the clients paid no deposit
4. Most of the clients are transient, i.e. booking for a short time

3.2 Numeric Features

For numeric features, we can use histograms.

n1 <- ggplot(hotel, aes(x=adr)) + geom_histogram() + theme_ds
n2 <- ggplot(hotel, aes(x=adults+children+babies)) + geom_histogram() + theme_ds

ggpubr::ggarrange(n1, n2)

We can see that the numeric attributes are right-skewed with a long tail: most of the data is concentrated at low values, while a few observations lie far out in the tail. There are also some outliers in the numeric data, which we can check by looking at the summary.

summary(select(hotel, adr, adults, children, babies))

      adr              adults          children           babies         
 Min.   :  -6.38   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000000  
 1st Qu.:  69.29   1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.: 0.000000  
 Median :  94.59   Median : 2.000   Median : 0.0000   Median : 0.000000  
 Mean   : 101.83   Mean   : 1.856   Mean   : 0.1039   Mean   : 0.007949  
 3rd Qu.: 126.00   3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.: 0.000000  
 Max.   :5400.00   Max.   :55.000   Max.   :10.0000   Max.   :10.000000

The children and babies columns look fine, but adults shows a massive leap from the mean (1.856) to the maximum (55), and so does adr (101.83 versus 5400). We can judge whether these are outliers by examining the extreme values.

nrow(hotel[hotel$adults>20,])

[1] 10

Indeed, there are several bookings with a very large number of adults.

nrow(hotel[hotel$adr>1000,])

[1] 1

On the other hand, there is only one booking with an adr above 1000, so this observation is clearly an outlier and we remove it.

hotel <- hotel[hotel$adr<1000,]
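The cut-off of 1000 above was chosen by eye. A common rule-based alternative is Tukey's 1.5 × IQR rule, sketched below on a made-up vector `x`; this is a different technique from the fixed threshold used in this article and would flag many more rows on skewed data.

```r
# Toy vector (hypothetical values) with one extreme observation
x <- c(60, 75, 80, 95, 100, 110, 126, 5400)

q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- x[x < lower | x > upper]

outliers   # 5400
```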

4 Data Preparation

4.1 Cross Validation

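A reliable way to validate or compare model performance is to split the data into a training set and a validation (test) set. The training data is used to fit the model, and the held-out data is used to obtain the evaluation score. Strictly speaking, a single split like this is hold-out validation rather than k-fold cross-validation.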

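library(rsample)
# 75% of the rows go to training; stratifying on the target keeps the class balance similar in both sets
data <- initial_split(hotel, prop = 0.75, strata = is_canceled)
train <- training(data)
test <- testing(data)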

nrow(train)

[1] 89540

nrow(test)

[1] 29845
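The split above is a single hold-out split. For reference, k-fold cross-validation repeats the same idea k times so that every row is used for validation exactly once. The sketch below uses base R only; the data frame `toy` is simulated, and logistic regression (`glm`) is just a stand-in classifier, not one of the models used in this article.

```r
set.seed(42)

# Toy data (simulated): 100 rows with one numeric feature and a binary target
toy <- data.frame(x = rnorm(100), y = factor(rep(c(0, 1), each = 50)))

k <- 5
# Assign each row to one of k folds at random, keeping the fold sizes equal
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

acc <- numeric(k)
for (i in seq_len(k)) {
  train_fold <- toy[fold_id != i, ]
  valid_fold <- toy[fold_id == i, ]

  # Any classifier can go here; logistic regression keeps the sketch in base R
  fit  <- glm(y ~ x, data = train_fold, family = binomial)
  prob <- predict(fit, valid_fold, type = "response")
  pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

  acc[i] <- mean(pred == valid_fold$y)
}

mean(acc)   # cross-validated accuracy estimate
```

Averaging over folds gives a more stable estimate than one split, at the cost of fitting the model k times.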

4.2 X-y Splitting

This is the process of splitting the data into two parts: the predictors (X) and the target (y).

train_x <- select(train, -is_canceled)
test_x <- select(test, -is_canceled)
train_y <- train$is_canceled
test_y <- test$is_canceled

dim(train_x)

[1] 89540    13

length(test_y)

[1] 29845

5 Machine Learning Modelling

Now that the data is ready, we can build our machine learning models. Here I will use 3 different models: Naive-Bayes, Decision Tree, and Random Forest.

5.1 Naive-Bayes

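Naive-Bayes is often regarded as a weak model because of its naive assumption: it treats every feature as conditionally independent of the others given the class, which rarely holds in practice. It is a purely probabilistic model that multiplies a class prior by per-feature likelihoods, typically modelled as categorical (binomial/multinomial) or Gaussian distributions. The model has often been used for Natural Language Processing, even though nowadays Recurrent Neural Network models do a much better job than Naive-Bayes.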

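In addition, my attempt to fit the model raised an error: attempt to make a table with >= 2^31 elements. The cause is most likely the call itself rather than the size of the data. naiveBayes() expects the predictors as its first argument and the class labels as its second, so passing hotel, train_x, train_y makes it treat the whole train_x data frame as the labels and cross-tabulate all of its columns at once, which needs a table with more than \(2^{31}\) cells. The corrected call is kept below for reference; it is not evaluated in this article.

#model.nb <- naiveBayes(train_x, train_y, laplace = 1)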
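To make the "prior times independent likelihoods" idea concrete, here is a hand-rolled Naive-Bayes calculation on a tiny made-up data set. The data frame `toy` and the helper `score()` are hypothetical and written for this sketch only; this is not how e1071 implements the model internally, and no smoothing is applied.

```r
# Toy training data (hypothetical values)
toy <- data.frame(
  deposit  = c("No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No"),
  repeated = c("0",  "1",  "0",   "0",   "0",   "0",   "1",  "0"),
  canceled = c("0",  "0",  "1",   "1",   "0",   "1",   "0",  "1")
)

# Score one class: prior P(class) times the per-feature likelihoods
# P(feature = value | class), multiplied as if the features were independent
score <- function(cls, deposit, repeated) {
  rows  <- toy[toy$canceled == cls, ]
  prior <- nrow(rows) / nrow(toy)
  prior * mean(rows$deposit == deposit) * mean(rows$repeated == repeated)
}

# New booking: deposit = "Yes", repeated = "0"
s <- c("0" = score("0", "Yes", "0"), "1" = score("1", "Yes", "0"))
posterior <- s / sum(s)   # normalise so that the two scores sum to 1

posterior   # about 0.143 for class 0 and 0.857 for class 1
```

If a feature value never occurs within a class, its likelihood is zero and wipes out the whole product; the laplace argument of naiveBayes() exists to smooth such counts.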

5.2 Decision Tree

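A Decision Tree is a model that splits the data into a tree of logical rules. In the classic formulation, each candidate split is tested on every category (for categorical features) or on the midpoints between adjacent values (for numeric features), and the split is evaluated with the ID3 criterion, which calculates the entropy and information gain of each split. The split with the highest information gain is used in the final tree. Note that the ctree() function from party used here builds a conditional inference tree, which chooses its splits with statistical significance tests instead of information gain.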
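The entropy and information gain calculation described above can be sketched in a few lines of base R. The helper names `entropy()` and `info_gain()` and the toy vectors are assumptions made for this illustration; they are not functions from party or any other package.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # skip empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting the labels y by a logical condition
info_gain <- function(y, split) {
  w <- mean(split)              # share of rows that satisfy the condition
  entropy(y) - (w * entropy(y[split]) + (1 - w) * entropy(y[!split]))
}

# Toy example (hypothetical values)
y   <- c(1, 1, 1, 0, 0, 0, 0, 1)
adr <- c(150, 140, 160, 60, 70, 80, 90, 85)

info_gain(y, adr > 100)   # about 0.549
info_gain(y, adr > 75)    # about 0.311
```

An ID3-style tree would pick the split at 100 here because it yields the larger gain, then repeat the search inside each branch.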

A Decision Tree has a high risk of overfitting, because the depth is usually not limited. To reduce that risk, a method called cost-complexity pruning scores a tree by its error plus a penalty that grows with the size of the tree. Other options include limiting the depth, setting the minimum number of samples required for a split, etc.

# Fit on the training split only, so that the test rows stay unseen during training
model.dt <- ctree(is_canceled~., train)

5.3 Random Forest

Random Forest is an ensemble method that combines Bagging and Decision Trees. Bagging, or Bootstrap Aggregating, fits the same kind of model many times, each time on a different bootstrap sample (rows drawn with replacement); the final prediction is then decided by majority vote (aggregating).

The difference between bagged trees and a Random Forest is that, at each split of a tree, Random Forest only considers a random subset of the features. This produces more varied, less correlated trees and makes each split cheaper to compute.

set.seed(42)
model.rf <- randomForest(train_x, train_y, ntree = 100)
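The bootstrap-then-vote mechanism can be illustrated without any package. The sketch below bags one-split "stumps" on simulated data; `fit_stump()` and `predict_bagged()` are hypothetical helpers for this example. It shows plain bagging only: with a single feature there is nothing to subsample, so it is not a Random Forest.

```r
set.seed(42)

# Toy data (simulated): one numeric feature and a noisy binary target
n <- 200
x <- runif(n, 0, 200)
y <- as.integer(x + rnorm(n, sd = 30) > 100)

# Base learner: a one-split stump that picks the threshold with the
# fewest training errors and predicts 1 above it
fit_stump <- function(x, y) {
  cands <- sort(unique(x))
  errs  <- sapply(cands, function(t) mean((x > t) != y))
  cands[which.min(errs)]
}

# Bagging: fit one stump per bootstrap sample (rows drawn with replacement)
n_models   <- 25
thresholds <- replicate(n_models, {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregating: every stump votes and the majority wins
predict_bagged <- function(x_new) {
  votes <- outer(x_new, thresholds, ">")   # rows: observations, columns: models
  as.integer(rowMeans(votes) > 0.5)
}

mean(predict_bagged(x) == y)   # training accuracy of the ensemble
```

Using an odd number of models avoids tied votes in a two-class problem.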

6 Model Evaluation

After fitting the models, we can finally evaluate their performance on the test dataset.

Decision Tree

confusionMatrix(predict(model.dt, test_x), test_y)

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 17640  5139
         1  1151  5915
                                          
               Accuracy : 0.7892          
                 95% CI : (0.7846, 0.7939)
    No Information Rate : 0.6296          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5119          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9387          
            Specificity : 0.5351          
         Pos Pred Value : 0.7744          
         Neg Pred Value : 0.8371          
             Prevalence : 0.6296          
         Detection Rate : 0.5911          
   Detection Prevalence : 0.7632          
      Balanced Accuracy : 0.7369          
                                          
       'Positive' Class : 0

Random Forest

confusionMatrix(predict(model.rf, test_x), test_y)

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 17880  5440
         1   911  5614
                                          
               Accuracy : 0.7872          
                 95% CI : (0.7825, 0.7918)
    No Information Rate : 0.6296          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5017          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9515          
            Specificity : 0.5079          
         Pos Pred Value : 0.7667          
         Neg Pred Value : 0.8604          
             Prevalence : 0.6296          
         Detection Rate : 0.5991          
   Detection Prevalence : 0.7814          
      Balanced Accuracy : 0.7297          
                                          
       'Positive' Class : 0

The two models differ only slightly in accuracy (0.7892 for the Decision Tree versus 0.7872 for the Random Forest), and their 95% confidence intervals overlap. A difference this small can be ignored because it may simply come from the randomness in Random Forest, so either model can be used.
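To see where the headline numbers come from, the sketch below recomputes accuracy, sensitivity and specificity by hand from the Decision Tree confusion matrix reported above. The variable names are chosen for this illustration only.

```r
# Counts copied from the Decision Tree confusion matrix above
# (rows: prediction, columns: reference)
cm <- matrix(c(17640, 1151, 5139, 5915), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)

# caret treats class 0 (not canceled) as the positive class here
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # share of real 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # share of real 1s predicted as 1

round(c(accuracy, sensitivity, specificity), 4)   # 0.7892 0.9387 0.5351
```

Because the positive class is 0, the specificity is the figure that describes how many actual cancellations the model catches.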

Another thing to focus on is feature importance, which indicates how much each feature contributes to the performance of the model.

fi <- data.frame(rownames(model.rf$importance), model.rf$importance)
fi <- fi[order(-fi$MeanDecreaseGini),]
colnames(fi) <- c("Features", "MeanDecreaseGini")

# Refer to columns by bare name inside aes(); the data argument already supplies fi
fip <- ggplot(fi, aes(reorder(Features, MeanDecreaseGini), MeanDecreaseGini)) + geom_col() +
  coord_flip() + 
  labs(title="Feature Importance", x="Features", y="Mean Decrease Gini") +
  theme_ds
fip

As we can see, deposit type, previous cancellations, total special requests, and adr have the highest importance in the model. Note that Mean Decrease Gini only measures how much a feature helps the trees separate the classes; it does not tell us in which direction the feature pushes the prediction.

The deposit type has the highest importance. Importance alone does not reveal which deposit type (No Deposit, Non Refund, or Refundable) is associated with cancellations; to find out, we would need to compare the cancellation rate within each level.
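One simple way to check the direction of a categorical feature is to tabulate the cancellation rate per level. The sketch below does this on a made-up data frame `toy`; the values are hypothetical and say nothing about the real bookings.

```r
# Toy data (hypothetical values, not the real hotel bookings)
toy <- data.frame(
  deposit_type = c("No Deposit", "No Deposit", "No Deposit", "No Deposit",
                   "Non Refund", "Non Refund", "Refundable", "Refundable"),
  is_canceled  = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Row-wise proportions: within each deposit type, the share of each outcome
rates <- prop.table(table(toy$deposit_type, toy$is_canceled), margin = 1)

rates[, "1"]   # cancellation rate per deposit type
```

On the article's data the same idea would be `prop.table(table(hotel$deposit_type, hotel$is_canceled), margin = 1)`.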

7 Model Tuning

Model Tuning is a process of improving the performance of the model, either by tuning the parameters or doing feature selection or feature engineering.

7.1 Parameter Tuning

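We can set the parameters of the decision tree so that the model does not overfit. Here we raise mincriterion, the confidence level (1 - p-value) that a split must exceed, from its default of 0.95 to 0.99, so that fewer splits are made.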

# Fit on the training split only, so that the test rows stay unseen during training
model.dt2 <- ctree(is_canceled~., train, controls = ctree_control(mincriterion = 0.99))

confusionMatrix(predict(model.dt2, test_x), test_y)

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 17655  5189
         1  1136  5865
                                          
               Accuracy : 0.7881          
                 95% CI : (0.7834, 0.7927)
    No Information Rate : 0.6296          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5085          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9395          
            Specificity : 0.5306          
         Pos Pred Value : 0.7729          
         Neg Pred Value : 0.8377          
             Prevalence : 0.6296          
         Detection Rate : 0.5916          
   Detection Prevalence : 0.7654          
      Balanced Accuracy : 0.7351          
                                          
       'Positive' Class : 0

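The accuracy drops slightly compared with the first model (0.7881 versus 0.7892). Next, we limit the size of each tree in the Random Forest with maxnodes.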

set.seed(42)
model.rf2 <- randomForest(train_x, train_y, ntree = 100, maxnodes = 32)

confusionMatrix(predict(model.rf2, test_x), test_y)

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 18771  6927
         1    20  4127
                                         
               Accuracy : 0.7672         
                 95% CI : (0.7624, 0.772)
    No Information Rate : 0.6296         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.4272         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9989         
            Specificity : 0.3733         
         Pos Pred Value : 0.7304         
         Neg Pred Value : 0.9952         
             Prevalence : 0.6296         
         Detection Rate : 0.6289         
   Detection Prevalence : 0.8610         
      Balanced Accuracy : 0.6861         
                                         
       'Positive' Class : 0

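Again, by limiting the size of the trees (maxnodes caps the number of terminal nodes), we get a drop in overall accuracy. However, the sensitivity reaches a striking 99.9%. Keep in mind that the positive class here is 0, so this is the recall for bookings that were not canceled; the specificity shows that only about 37% of the actual cancellations are detected.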
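Since cancellations are the class we care about, it helps to restate the tuned Random Forest result from the point of view of class 1. The sketch below derives precision, recall and F1 for the canceled class from the counts in the confusion matrix above; the variable names are chosen for this illustration.

```r
# Counts copied from the tuned Random Forest confusion matrix above
tp <- 4127   # predicted canceled, actually canceled
fp <- 20     # predicted canceled, actually not canceled
fn <- 6927   # predicted not canceled, actually canceled

precision <- tp / (tp + fp)   # how often a predicted cancellation is right
recall    <- tp / (tp + fn)   # how many real cancellations are caught
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision, recall), 4)   # 0.9952 0.3733
f1
```

The model is almost never wrong when it does predict a cancellation, but it misses most of them, which is the opposite of what the 99.9% sensitivity might suggest at first glance.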

7.2 Feature Selection and Feature Engineering

fip

As we can see on the feature importance plot, adults, children and babies are not very important on their own. We can try to combine these variables into a new feature called total_member. We also remove some of the less important features: is_repeated_guest, reserved_room_type, meal, previous_bookings_not_canceled, and hotel.

hotel$total_member <- hotel$adults + hotel$children + hotel$babies
hotel %>% 
  select(-is_repeated_guest, -reserved_room_type, -meal, -previous_bookings_not_canceled, -hotel) -> hotel_fs
head(hotel_fs)

Now we can split the data and fit the model again.

data2 <- initial_split(hotel_fs, 0.75, is_canceled)
train2 <- training(data2)
test2 <- testing(data2)
test_x2 <- select(test2, -is_canceled)
test_y2 <- test2$is_canceled

set.seed(42)
model.rf3 <- randomForest(is_canceled~., train2, ntree=100)

confusionMatrix(predict(model.rf3, test_x2), test_y2)

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 18309  6219
         1   482  4835
                                          
               Accuracy : 0.7755          
                 95% CI : (0.7707, 0.7802)
    No Information Rate : 0.6296          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.461           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9743          
            Specificity : 0.4374          
         Pos Pred Value : 0.7465          
         Neg Pred Value : 0.9093          
             Prevalence : 0.6296          
         Detection Rate : 0.6135          
   Detection Prevalence : 0.8218          
      Balanced Accuracy : 0.7059          
                                          
       'Positive' Class : 0

Unfortunately, this still does not improve the model performance: the accuracy is 0.7755, compared with 0.7872 for the first Random Forest.

8 Conclusion

Generally, with about 78% accuracy, the model performed reasonably well: nothing fancy, but not bad either. The model also has a high sensitivity of about 94%, which means it did a good job of identifying bookings that were not canceled. That is of limited help, however, since our main goal is to predict the clients who are likely to cancel, and only about 51% to 54% of the actual cancellations were detected (the specificity).

Some of the most important features are deposit type, previous cancellations, special requests, and adr. These are the fields to examine first when assessing a booking, although feature importance alone does not tell us which values (for example, which deposit type or how high an adr) go with cancellation.