Hotel Bookings Cancellation
1 Introduction
In this article, we will try to build a classification model to predict whether a hotel booking will be canceled or not.
1.1 Background
The data used in this article is a set of hotel bookings. It contains various features, or fields, related to a hotel booking, and the target variable is whether the booking was canceled or not. We want to be able to predict whether a customer is likely to cancel a booking, so that we can either prepare for it or simply reject the request.
The data was collected from Kaggle. I have refactored the data and removed some redundant columns.
1.2 Read Data
We will store the data in the hotel variable.
[1] 119390 16
There are 119390 observations and 16 columns. Most of the columns are self-explanatory; the less obvious ones are:
1. is_canceled : The target variable, 1 means that the booking was canceled and 0 means it wasn’t.
2. adults, children and babies : the number of adults, children and babies in the booking.
3. meal : the meal package of the booking. BB for Bed and Breakfast, FB for Full Board (breakfast, lunch, and dinner), HB for Half Board (usually breakfast and dinner), and SC or Undefined for bookings without a meal package (room only).
4. customer_type : the type of booking, one of Contract, Group, Transient (a short-stay booking that is not part of a group or contract), or Transient-Party.
5. adr : the average daily rate, i.e. the average revenue earned per occupied room per night. It is the most common indicator of a room’s price.
1.3 Libraries
For this analysis, I will use dplyr for data wrangling, ggplot2 for data visualization, caret for model evaluation, and various machine learning libraries for building the model.
I also define a template theme for the visualizations.
theme_ds <- theme(
panel.background = element_rect(fill="#6CADDF"),
panel.border = element_rect(fill=NA),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
plot.background = element_rect(fill="#00285E"),
text = element_text(color="white"),
axis.text = element_text(color="white")
)
2 Data Wrangling
The data we collect isn’t always in its tidiest form; there are flaws here and there. The process of tidying the data is called data wrangling.
2.1 Data Type
To make sure the analysis goes in the direction we want, each column must have the proper data type.
'data.frame': 119390 obs. of 16 variables:
$ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
$ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
$ adults : int 2 2 1 1 2 2 2 2 2 2 ...
$ children : int 0 0 0 0 0 0 0 0 0 0 ...
$ babies : int 0 0 0 0 0 0 0 0 0 0 ...
$ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
$ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
$ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
$ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
$ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
$ adr : num 0 0 75 75 98 ...
$ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
Two columns are stored as integers when they should be categorical (factors).
hotel$is_canceled <- as.factor(hotel$is_canceled)
hotel$is_repeated_guest <- as.factor(hotel$is_repeated_guest)
'data.frame': 119390 obs. of 16 variables:
$ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
$ is_canceled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
$ adults : int 2 2 1 1 2 2 2 2 2 2 ...
$ children : int 0 0 0 0 0 0 0 0 0 0 ...
$ babies : int 0 0 0 0 0 0 0 0 0 0 ...
$ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
$ is_repeated_guest : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
$ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
$ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
$ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
$ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
$ adr : num 0 0 75 75 98 ...
$ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
$ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
$ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
2.2 Missing Values
We also have to check for missing values. Not only do they add noise to the data, but many machine learning algorithms cannot process them.
hotel is_canceled
0 0
adults children
0 4
babies meal
0 0
is_repeated_guest previous_cancellations
0 0
previous_bookings_not_canceled reserved_room_type
0 0
deposit_type customer_type
0 0
adr required_car_parking_spaces
0 0
total_of_special_requests reservation_status
0 0
Typically, there are three common techniques for dealing with missing values: removing them, replacing them with constants, or predicting them with an algorithm. In this case, we only have 4 missing values (all in children), which is a tiny fraction of the 119390 rows, so we can simply remove them.
[1] FALSE
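The check-and-remove step described above can be sketched as follows. This is a minimal illustration on a made-up data frame (toy is a hypothetical stand-in for hotel), not the exact code used to produce the output above.

```r
# Toy data standing in for the hotel data; the column names mirror the
# article, the values are invented for illustration.
toy <- data.frame(
  adults   = c(2, 2, 1, 2, 1),
  children = c(0, NA, 0, 1, NA),
  adr      = c(75, 98, 60, 120, 88)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop every row that contains at least one missing value.
toy_clean <- na.omit(toy)

# Confirm that no missing values remain.
print(anyNA(toy_clean))
```

Removing rows is only reasonable when very few are affected, as is the case here; with many missing values, imputation would usually be the safer choice.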
2.3 Feature Selection
Features, or columns, are very important to model performance. Feature Selection and Feature Engineering are two powerful methods that can significantly boost it. Feature Selection can be done either by manually selecting the columns or by applying statistical significance tests such as the F-test and the chi-square test.
As we can see, there are 2 candidate columns to remove:
1. reservation_status, which is just the string representation of the target variable is_canceled, where Check-Out means 0 and Canceled means 1. Keeping it would leak the target into the predictors.
2. required_car_parking_spaces : it might be useful, but for now we can remove this column.
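Both selection approaches mentioned above can be sketched in a few lines. The data frame below is invented (only the column names follow the article), and the chi-square test is shown as one possible check rather than something the article actually ran.

```r
# Toy data; the column names follow the article, the values are invented.
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
n <- 200
toy <- data.frame(
  is_canceled  = factor(rbinom(n, 1, 0.4)),
  deposit_type = factor(sample(c("No Deposit", "Non Refund"), n, replace = TRUE)),
  reservation_status = "placeholder",
  required_car_parking_spaces = 0L
)

# Manual selection: drop the two columns named above.
drop_cols <- c("reservation_status", "required_car_parking_spaces")
toy_sel <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_sel))

# Statistical selection: chi-square test of independence between a
# categorical feature and the target. A small p-value suggests that the
# feature and the target are associated.
chi <- chisq.test(table(toy_sel$deposit_type, toy_sel$is_canceled))
print(chi$p.value)
```

Because the toy columns are generated independently, the p-value here is not expected to be small; on real data the same call would be applied to each candidate feature.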
3 Data Visualization
We can check the summary of the data.
hotel is_canceled adults children
City Hotel :79326 0:75166 Min. : 0.000 Min. : 0.0000
Resort Hotel:40060 1:44220 1st Qu.: 2.000 1st Qu.: 0.0000
Median : 2.000 Median : 0.0000
Mean : 1.856 Mean : 0.1039
3rd Qu.: 2.000 3rd Qu.: 0.0000
Max. :55.000 Max. :10.0000
babies meal is_repeated_guest previous_cancellations
Min. : 0.000000 BB :92306 0:115576 Min. : 0.00000
1st Qu.: 0.000000 FB : 798 1: 3810 1st Qu.: 0.00000
Median : 0.000000 HB :14463 Median : 0.00000
Mean : 0.007949 SC :10650 Mean : 0.08712
3rd Qu.: 0.000000 Undefined: 1169 3rd Qu.: 0.00000
Max. :10.000000 Max. :26.00000
previous_bookings_not_canceled reserved_room_type deposit_type
Min. : 0.0000 A :85994 No Deposit:104637
1st Qu.: 0.0000 D :19201 Non Refund: 14587
Median : 0.0000 E : 6535 Refundable: 162
Mean : 0.1371 F : 2897
3rd Qu.: 0.0000 G : 2094
Max. :72.0000 B : 1114
(Other): 1551
customer_type adr total_of_special_requests
Contract : 4076 Min. : -6.38 Min. :0.0000
Group : 577 1st Qu.: 69.29 1st Qu.:0.0000
Transient :89613 Median : 94.59 Median :0.0000
Transient-Party:25120 Mean : 101.83 Mean :0.5713
3rd Qu.: 126.00 3rd Qu.:1.0000
Max. :5400.00 Max. :5.0000
Looking at a bunch of numbers like this isn’t really a good way to explore the data; a much better practice is to visualize it.
3.1 Categorical Features
We can check the distribution of the categorical features using bar plots.
p1 <- ggplot(hotel, aes(hotel)) + geom_bar() + theme_ds
p2 <- ggplot(hotel, aes(meal)) + geom_bar() + theme_ds
p3 <- ggplot(hotel, aes(deposit_type)) + geom_bar() + theme_ds
p4 <- ggplot(hotel, aes(customer_type)) + geom_bar() + theme_ds
ggpubr::ggarrange(p1, p2, p3, p4)
From the charts above we can say that:
1. Most of the clients booked a City Hotel
2. Most of the clients ordered BB meal
3. Most of the clients didn’t pay a deposit
4. Most of the clients are transient or booking for a short time
3.2 Numeric Features
For numeric features, we can use histograms.
n1 <- ggplot(hotel, aes(x=adr)) + geom_histogram() + theme_ds
n2 <- ggplot(hotel, aes(x=adults+children+babies)) + geom_histogram() + theme_ds
ggpubr::ggarrange(n1, n2)
We can see that the numeric attributes are heavy-tailed (right-skewed): most of the values are concentrated at the low end, with a long tail of rare, much larger values. There are also some outliers in the numeric data, which we can check by looking at the summary.
adr adults children babies
Min. : -6.38 Min. : 0.000 Min. : 0.0000 Min. : 0.000000
1st Qu.: 69.29 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000
Median : 94.59 Median : 2.000 Median : 0.0000 Median : 0.000000
Mean : 101.83 Mean : 1.856 Mean : 0.1039 Mean : 0.007949
3rd Qu.: 126.00 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000
Max. :5400.00 Max. :55.000 Max. :10.0000 Max. :10.000000
The children and babies data look fine, but adults has a massive leap from the mean (1.856) to the maximum (55), and so does adr (101.83 versus 5400). We can see whether these are outliers by looking at the extreme values.
[1] 10
Indeed, there are several bookings with a very large number of adults.
[1] 1
On the other hand, there is only one booking with an adr above 1000, so this record is clearly an outlier.
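Counting extreme values like this takes one line per check. The sketch below uses an invented adr vector; only the cutoff of 1000 comes from the text, and the filtering step is an assumption about how such a record might be handled.

```r
# Toy values for illustration only; they are not taken from the hotel data.
adr <- c(75, 98, 101.8, 126, 5400, 60, 88)

# Count the bookings above the chosen cutoff (1000 is the cutoff named in the text).
n_extreme <- sum(adr > 1000)
print(n_extreme)

# Inspect the extreme values before deciding whether to drop them.
print(adr[adr > 1000])

# Keep only the values at or below the cutoff.
adr_clean <- adr[adr <= 1000]
print(length(adr_clean))
```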
4 Data Preparation
4.1 Cross Validation
A simple way to validate or compare model performance is to split the data into a training set and a validation (test) set. The training data is used to fit the model, and the held-out data is used to obtain the evaluation score. Strictly speaking, this is a single hold-out split rather than k-fold cross-validation.
library(rsample)
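# Hold out 25% of the rows, stratified by the target so both parts keep the same class balance
data <- initial_split(hotel, prop = 0.75, strata = is_canceled)
train <- training(data)
test <- testing(data)
[1] 89540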
[1] 29845
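To show what the stratification in the split above does, here is a minimal base R sketch of a stratified 75/25 split. The toy data frame and the seed are assumptions made for illustration; rsample handles the same idea internally.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Toy target with roughly the class balance reported in the summary (63% / 37%).
toy <- data.frame(
  id = 1:1000,
  is_canceled = factor(rep(c(0, 1), times = c(630, 370)))
)

# Sample 75% of the row indices within each class separately.
idx_by_class <- split(seq_len(nrow(toy)), toy$is_canceled)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both parts keep almost the same share of cancellations.
train_share <- prop.table(table(toy_train$is_canceled))
test_share  <- prop.table(table(toy_test$is_canceled))
print(train_share)
print(test_share)
```

Without stratification, a random split can by chance give the test set a noticeably different cancellation rate, which makes the evaluation scores harder to compare.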
4.2 X-y Splitting
This is the process of splitting the data into two parts: the predictors (X) and the target (y).
train_x <- select(train, -is_canceled)
test_x <- select(test, -is_canceled)
train_y <- train$is_canceled
test_y <- test$is_canceled
[1] 89540 13
[1] 29845
5 Machine Learning Modelling
Now that the data is ready, we can build our machine learning models. Here I will use 3 different models: Naive Bayes, Decision Tree, and Random Forest.
5.1 Naive-Bayes
Naive Bayes is often considered a weak model because of its naive assumption that the features are independent of each other given the class. It works by combining per-feature probabilities, so it is most natural for categorical or count (binomial/multinomial) features; numeric features have to be discretized or given a distribution such as a Gaussian. The model is often used in Natural Language Processing, even though nowadays Recurrent Neural Networks usually do a much better job.
In addition, I tried to fit the model, but it raised an error: attempt to make a table with >= 2^31 elements. This is due to the size of the data. There are more than 100000 values in adr, and the model tries to build a frequency table from them, so the table becomes extremely large: it would need at least \(2^{31}\) cells, which is more than R can allocate for a table.
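The arithmetic behind that error can be sketched without fitting anything. A contingency table needs one cell for every combination of distinct values, so the cell count is the product of the distinct counts per column. The columns below are invented, and whether the Naive Bayes function used here builds exactly such a table is an assumption.

```r
set.seed(7)  # arbitrary seed, only to make the sketch reproducible

# Toy columns; what matters is the number of distinct values in each.
toy <- data.frame(
  adr        = runif(50000, 0, 300),   # continuous: almost every value is distinct
  other_rate = runif(50000, 0, 300)    # hypothetical second continuous column
)

# One table cell per combination of distinct values.
n_distinct <- sapply(toy, function(col) length(unique(col)))
n_cells <- prod(as.numeric(n_distinct))
print(n_distinct)
print(n_cells)

# table() refuses to build a table once the cell count reaches 2^31.
print(n_cells >= 2^31)
```

This is why continuous features are usually binned or modeled with a density before being passed to a count-based method.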
5.2 Decision Tree
A Decision Tree is a model that splits the data with a tree of logical rules. Candidate splits are generated by testing each category (for categorical features) and the midpoint between adjacent sorted values (for numeric features). In the ID3 algorithm, each candidate is then evaluated by computing the entropy before and after the split and the resulting information gain. The split with the highest information gain is the one used in the final tree.
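The entropy and information gain calculation can be written out directly. This is a small sketch with invented labels; the helper names entropy and info_gain are hypothetical and do not come from any package.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]               # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting the labels y by a logical condition.
info_gain <- function(y, goes_left) {
  w_left <- mean(goes_left)
  children <- w_left * entropy(y[goes_left]) +
    (1 - w_left) * entropy(y[!goes_left])
  entropy(y) - children
}

# Toy example (invented values): does a deposit flag separate cancellations?
y       <- c(1, 1, 1, 0, 0, 0, 0, 1)
deposit <- c("NR", "NR", "NR", "ND", "ND", "ND", "ND", "ND")

parent_entropy <- entropy(y)
gain <- info_gain(y, deposit == "NR")
print(parent_entropy)
print(gain)
```

A tree builder would compute this gain for every candidate split and keep the one with the largest value.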
A Decision Tree has a high risk of overfitting, because the depth is usually not limited and the tree keeps growing until its stopping rules are met. To reduce this risk, there is a method called cost-complexity pruning, which scores a tree by its error plus a penalty that grows with the size of the tree. Other ways of limiting the tree are capping the depth, setting the minimum number of samples required to split, and so on.
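The pruning score described above is usually written, in the CART formulation, as a cost-complexity criterion. The symbols below are the standard textbook ones and are not taken from the article's code:

\[ R_\alpha(T) = R(T) + \alpha \, |\tilde{T}| \]

Here \(R(T)\) is the misclassification error of tree \(T\) on the training data, \(|\tilde{T}|\) is its number of leaves, and \(\alpha \ge 0\) is the complexity parameter. With \(\alpha = 0\) the full tree always wins; larger values of \(\alpha\) favor smaller trees.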
5.3 Random Forest
Random Forest is an ensemble method that combines Bagging and Decision Trees. Bagging, or Bootstrap Aggregating, fits the same kind of model many times, each time on a different bootstrap sample (rows drawn with replacement); the final prediction is decided by majority vote (aggregating).
The difference between Bagged Trees and Random Forest is that, at each split of a tree, Random Forest only considers a random subset of the features. This results in more varied, less correlated trees, and each split is cheaper to compute.
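The bootstrap-then-vote idea can be sketched with the simplest possible tree, a one-split stump. Everything below (data, seed, the helpers fit_stump and predict_bagged) is hypothetical and only meant to illustrate bagging, not to reproduce the randomForest package.

```r
set.seed(3)  # arbitrary seed, only to make the sketch reproducible

# Toy data (invented): one numeric feature and a noisy binary label.
n <- 200
x <- runif(n, 0, 200)
y <- as.integer(x + rnorm(n, sd = 40) > 100)

# A one-split "stump": pick the threshold with the best training accuracy.
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  acc <- sapply(cuts, function(cut) mean((x > cut) == y))
  cuts[which.max(acc)]
}

# Bagging: fit one stump per bootstrap sample (rows drawn with replacement).
n_models <- 25
thresholds <- replicate(n_models, {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregating: every stump votes, and the majority decides.
predict_bagged <- function(x_new) {
  votes <- sapply(thresholds, function(cut) x_new > cut)
  votes <- matrix(votes, nrow = length(x_new))  # rows: observations, columns: models
  as.integer(rowMeans(votes) > 0.5)
}

bagged_accuracy <- mean(predict_bagged(x) == y)
print(bagged_accuracy)
```

A Random Forest would additionally draw a random subset of the features at each split; with a single feature, that step has nothing to choose from and is omitted here.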
6 Model Evaluation
After fitting the models, we can finally evaluate their performance on the test dataset.
Decision Tree
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 17640 5139
1 1151 5915
Accuracy : 0.7892
95% CI : (0.7846, 0.7939)
No Information Rate : 0.6296
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5119
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9387
Specificity : 0.5351
Pos Pred Value : 0.7744
Neg Pred Value : 0.8371
Prevalence : 0.6296
Detection Rate : 0.5911
Detection Prevalence : 0.7632
Balanced Accuracy : 0.7369
'Positive' Class : 0
Random Forest
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 17880 5440
1 911 5614
Accuracy : 0.7872
95% CI : (0.7825, 0.7918)
No Information Rate : 0.6296
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5017
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9515
Specificity : 0.5079
Pos Pred Value : 0.7667
Neg Pred Value : 0.8604
Prevalence : 0.6296
Detection Rate : 0.5991
Detection Prevalence : 0.7814
Balanced Accuracy : 0.7297
'Positive' Class : 0
The two models differ only slightly in accuracy (0.7892 versus 0.7872). Such a small difference can be ignored, because it may simply come from the randomness in Random Forest. So, in practice, either model can be used.
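To make the reported statistics concrete, the headline numbers can be recomputed by hand from the Decision Tree confusion matrix printed above. This is only a sketch in base R; it re-enters the four counts and does not call caret.

```r
# Decision Tree confusion matrix as printed above
# (rows: prediction, columns: reference).
cm <- matrix(
  c(17640, 5139,
     1151, 5915),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

accuracy <- sum(diag(cm)) / sum(cm)

# The output treats class 0 (not canceled) as the positive class.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # recall for bookings that were kept
specificity <- cm["1", "1"] / sum(cm[, "1"])  # recall for bookings that were canceled

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Read this way, sensitivity is the share of non-canceled bookings the model recognizes, while specificity is the share of actual cancellations it catches.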
Another thing to look at is the feature importance, which shows how much each feature contributes to the model’s performance.
fi <- data.frame(rownames(model.rf$importance), model.rf$importance)
fi <- fi[order(-fi$MeanDecreaseGini),]
colnames(fi) <- c("Features", "MeanDecreaseGini")
fip <- ggplot(fi, aes(reorder(Features, MeanDecreaseGini), MeanDecreaseGini)) + geom_col() +
coord_flip() +
labs(title="Feature Importance", x="Features", y="Mean Decrease Gini") +
theme_ds
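fip
As we can see, deposit type, previous cancellations, total of special requests, and adr have the highest importance in the model. Note that Mean Decrease Gini only measures how much a feature helps the forest separate the classes; it does not say in which direction the feature pushes the prediction, so by itself it does not tell us that higher values mean a higher probability of cancellation.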
The deposit type has the largest impact on the model’s performance. One reading is that clients who have the right to a refund are more likely to cancel, because cancelling costs them nothing. However, the importance score alone cannot confirm this: Refundable is by far the smallest category (162 bookings), so the relationship should be checked by cross-tabulating deposit_type against is_canceled.
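The direction of the relationship between deposit type and cancellation can be checked with a simple cross-tabulation. The data below is invented for illustration (only the column names follow the article), so the rates it prints say nothing about the real bookings.

```r
# Toy data (invented values); the column names follow the article.
toy <- data.frame(
  deposit_type = c("No Deposit", "No Deposit", "No Deposit", "No Deposit",
                   "Non Refund", "Non Refund", "Refundable", "Refundable"),
  is_canceled  = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Counts of canceled and not canceled bookings per deposit type.
counts <- table(toy$deposit_type, toy$is_canceled)
print(counts)

# Row-wise proportions: the cancellation rate within each deposit type.
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Running the same two calls on hotel$deposit_type and hotel$is_canceled would show which deposit type actually has the highest cancellation rate.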
7 Model Tuning
Model Tuning is a process of improving the performance of the model, either by tuning the parameters or doing feature selection or feature engineering.
7.1 Parameter Tuning
We can set the parameters in decision tree so that the model does not overfit.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 17655 5189
1 1136 5865
Accuracy : 0.7881
95% CI : (0.7834, 0.7927)
No Information Rate : 0.6296
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5085
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9395
Specificity : 0.5306
Pos Pred Value : 0.7729
Neg Pred Value : 0.8377
Prevalence : 0.6296
Detection Rate : 0.5916
Detection Prevalence : 0.7654
Balanced Accuracy : 0.7351
'Positive' Class : 0
This is a slight drop in accuracy from the first model (0.7881 versus 0.7892).
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 18771 6927
1 20 4127
Accuracy : 0.7672
95% CI : (0.7624, 0.772)
No Information Rate : 0.6296
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4272
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9989
Specificity : 0.3733
Pos Pred Value : 0.7304
Neg Pred Value : 0.9952
Prevalence : 0.6296
Detection Rate : 0.6289
Detection Prevalence : 0.8610
Balanced Accuracy : 0.6861
'Positive' Class : 0
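Again, by limiting the tree’s depth, we get a drop in overall accuracy (0.7672). However, the recall for the positive class is at a stunning 99%. Keep in mind that the positive class here is 0 (not canceled): the model recognizes almost every booking that is kept, but only about 37% of the actual cancellations (specificity 0.3733).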
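The kind of depth limit discussed above can be sketched as follows. The article does not show which tree package or parameter values were used, so rpart, the built-in iris data, and the settings below are all assumptions chosen to keep the example self-contained.

```r
library(rpart)  # assumed here; the article does not name its tree package

# iris stands in for the hotel data so that the sketch runs on its own.
fit_full <- rpart(Species ~ ., data = iris, method = "class")

# Pre-pruning: cap the depth and require more rows before a split is tried.
fit_small <- rpart(
  Species ~ ., data = iris, method = "class",
  control = rpart.control(maxdepth = 1, minsplit = 30, cp = 0.01)
)

# Compare the training accuracy and the number of leaves of both trees.
acc <- function(fit) mean(predict(fit, iris, type = "class") == iris$Species)
leaves <- function(fit) sum(fit$frame$var == "<leaf>")

print(c(full = acc(fit_full), small = acc(fit_small)))
print(c(full = leaves(fit_full), small = leaves(fit_small)))
```

A shallower tree is simpler and less likely to overfit, but as the results above show, it can also give up accuracy on the class we care about.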
7.2 Feature Selection and Feature Engineering
As we can see in the feature importance plot, adults, children and babies are not really significant. We can try to merge these variables into a new feature called total_member. We can also remove some of the less important features: the code below drops is_repeated_guest, reserved_room_type, meal, previous_bookings_not_canceled and hotel.
hotel$total_member <- hotel$adults + hotel$children + hotel$babies
hotel %>%
select(-is_repeated_guest, -reserved_room_type, -meal, -previous_bookings_not_canceled, -hotel) -> hotel_fs
head(hotel_fs)
Now we can split the data and fit the model again.
data2 <- initial_split(hotel_fs, prop = 0.75, strata = is_canceled)
train2 <- training(data2)
test2 <- testing(data2)
test_x2 <- select(test2, -is_canceled)
test_y2 <- test2$is_canceled
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 18309 6219
1 482 4835
Accuracy : 0.7755
95% CI : (0.7707, 0.7802)
No Information Rate : 0.6296
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.461
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9743
Specificity : 0.4374
Pos Pred Value : 0.7465
Neg Pred Value : 0.9093
Prevalence : 0.6296
Detection Rate : 0.6135
Detection Prevalence : 0.8218
Balanced Accuracy : 0.7059
'Positive' Class : 0
Unfortunately, this still doesn’t improve the model: the accuracy (0.7755) is below that of the first model (0.7892).
8 Conclusion
Overall, with about 78% accuracy, the model performs reasonably well: nothing fancy, but not bad either. The model also has a very high recall of 94%, but that recall is for the positive class 0, which means that the model does a good job of identifying bookings that are not canceled. That is not what we set out to do, since our main goal is to predict the clients who are likely to cancel: for that class, the model only finds about 54% of the cancellations (specificity 0.5351).
Some of the most important features are deposit type, previous cancellations, special requests, and adr. These are the fields to look at first when assessing whether a client is likely to cancel, although the direction of each effect (for example, which deposit type is the riskiest) still has to be confirmed against the data.