A significant problem for hotels, resorts, and other lodging businesses is that it is difficult to tell whether a customer plans to cancel their room reservation. Unexpected cancellations can mean a substantial loss of revenue, as those reservations could have been granted to paying customers. In this analysis, we will work with the `hotel_booking` dataset (from Kaggle) and use various features to predict whether a customer with a room reservation will cancel their booking.
To begin, we first must process the data in `hotel_booking`. The dataset is composed of both numeric and categorical features, so we handle these separately. Note that several of the features in `hotel_booking`, namely `name`, `email`, `phone.number`, and `credit_card`, are too granular for analysis: each value corresponds to only a few observations, so they do not generalize. We will ignore these entirely. Keep in mind that our target variable, `is_canceled`, is a factor with 2 levels, where 0 means "no" and 1 means "yes".
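Below is a minimal sketch of this preprocessing step. The column names come from the Kaggle dataset, but the file name and the `hotel` data frame are assumptions for illustration.

```r
# Sketch of the initial preprocessing (file and object names assumed)
hotel <- read.csv("hotel_booking.csv")

# Drop the overly granular identifier columns
hotel <- subset(hotel, select = -c(name, email, phone.number, credit_card))

# Encode the target as a two-level factor: 0 = "no", 1 = "yes"
hotel$is_canceled <- factor(hotel$is_canceled, levels = c(0, 1))
```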
For the numeric features, we mostly use Pearson's correlation tests and boxplot examinations. In total, only 11 of the original 19 numeric features are sufficiently associated with `is_canceled` to be further considered for our models.
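A sketch of this screening is below. Since `is_canceled` is binary, a Pearson correlation against the 0/1 target is a point-biserial correlation; the loop and formatting are illustrative, not the analysis's exact code.

```r
# Correlation screening of numeric features against the 0/1 target
numeric_cols <- names(hotel)[sapply(hotel, is.numeric)]
target <- as.numeric(as.character(hotel$is_canceled))

for (col in numeric_cols) {
  ct <- cor.test(hotel[[col]], target, method = "pearson")
  cat(sprintf("%-30s r = %6.3f  p = %.3g\n", col, ct$estimate, ct$p.value))
}

# Boxplot of one numeric feature split by cancellation status
boxplot(lead_time ~ is_canceled, data = hotel)
```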
Additionally, the original dataset contained a categorical feature, `reservation_status_date`, consisting of dates of the form "yyyy-mm-dd". Since many observations shared the same date value, I decomposed `reservation_status_date` into 3 numeric features, `year`, `month`, and `day`, in order to capture temporal trends. All three of these new numeric features were found to be significant. Lastly, to aid in training the logistic regression (more on this model below), I scaled the numeric data by transforming it into z-scores.
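A sketch of the decomposition and scaling, assuming the `hotel` data frame from above and base-R date handling:

```r
# Split reservation_status_date into numeric year/month/day features
dates <- as.Date(hotel$reservation_status_date, format = "%Y-%m-%d")
hotel$year  <- as.numeric(format(dates, "%Y"))
hotel$month <- as.numeric(format(dates, "%m"))
hotel$day   <- as.numeric(format(dates, "%d"))
hotel$reservation_status_date <- NULL

# z-score scaling of the numeric predictors (the factor target is untouched)
num_cols <- names(hotel)[sapply(hotel, is.numeric)]
hotel[num_cols] <- scale(hotel[num_cols])
```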
For the categorical features, I first attempted to use chi-square tests to find significant associations between individual features and `is_canceled`, but that yielded no useful results. So instead, I trained a random forest on the categorical features and found 6 of them to be significant predictors. These were either factor-encoded or encoded with dummy variables, depending on what they were measuring.
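Both steps might look like the sketch below. The `randomForest` package is an assumption (the report only names `caret`), and note that `randomForest` cannot handle factors with more than 53 levels, so any very high-cardinality columns would need to be dropped or re-encoded first.

```r
library(randomForest)

cat_cols <- names(hotel)[sapply(hotel, is.character)]
hotel[cat_cols] <- lapply(hotel[cat_cols], factor)

# Chi-square test of association for a single categorical feature
chisq.test(table(hotel$deposit_type, hotel$is_canceled))

# Random forest on the categorical features only, ranked by importance
rf_cat <- randomForest(is_canceled ~ .,
                       data = hotel[c(cat_cols, "is_canceled")],
                       ntree = 100, importance = TRUE)
importance(rf_cat)
```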
One more thing: none of the chosen predictors had `NA` values, so no data removal or imputation was necessary.
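This is quick to verify:

```r
# Count missing values per column; all zeros means no imputation is needed
colSums(is.na(hotel))
```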
In total, these are the chosen columns (the 19 predictors, plus the target `is_canceled`):
## [1] "year" "month"
## [3] "day" "lead_time"
## [5] "stays_in_week_nights" "adults"
## [7] "babies" "previous_cancellations"
## [9] "previous_bookings_not_canceled" "booking_changes"
## [11] "days_in_waiting_list" "adr"
## [13] "required_car_parking_spaces" "total_of_special_requests"
## [15] "deposit_type" "market_segment"
## [17] "meal" "assigned_room_type"
## [19] "reserved_room_type" "is_canceled"
The models selected for this binary classification task were logistic regression, a decision tree, a random forest, and a gradient boosting model (GBM).
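Fitting these might look like the sketch below. The package choices (`rpart`, `randomForest`, `gbm`) and the `train_set` split are assumptions, since the report does not name them; the tree counts come from the tuning described later, and the logistic regression's interaction terms are shown in a separate sketch further down.

```r
library(rpart)
library(randomForest)
library(gbm)

fit_glm  <- glm(is_canceled ~ ., data = train_set, family = binomial)
fit_tree <- rpart(is_canceled ~ ., data = train_set, method = "class")
fit_rf   <- randomForest(is_canceled ~ ., data = train_set, ntree = 40)

# gbm with a bernoulli loss expects a numeric 0/1 response
train_gbm <- train_set
train_gbm$is_canceled <- as.numeric(as.character(train_gbm$is_canceled))
fit_gbm <- gbm(is_canceled ~ ., data = train_gbm,
               distribution = "bernoulli", n.trees = 35)
```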
Above are the top 15 variables in terms of importance for the logistic regression. Interaction terms were included, specifically between `stays_in_week_nights`/`total_of_special_requests` and `lead_time`/`booking_changes`. My reasoning: if a hotel is unwilling to accommodate special requests over a stay of many nights, then cancellation is likely. Also, a longer lead time means more leeway to find an alternative hotel in the event of booking changes.
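In formula terms, the logistic regression with these two interactions might be specified as follows (a sketch; the full predictor list is abbreviated with `.`):

```r
# Logistic regression with the two interaction terms described above
fit_glm <- glm(is_canceled ~ . +
                 stays_in_week_nights:total_of_special_requests +
                 lead_time:booking_changes,
               data = train_set, family = binomial)
summary(fit_glm)
```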
The decision tree found only 9 predictors to be significant, and their importance scores decrease fairly linearly.
Here are the variables deemed important by the random forest. The ranking looks similar to the decision tree's, but more features were deemed "important" here.
Finally, here is the gradient boosting model's plot of variable importance. Not as many features are deemed important here!
In total, several features are deemed important by all of the models. Notably, `deposit_type`, `total_of_special_requests`, and `lead_time` rank as quite important in every model! These make sense, too: if a deposit is non-refundable, cancellation seems unlikely, and the longer the lead time, the more likely something can get in the way of the trip, so cancellation seems more likely.
Some hyperparameter tuning was also done for each model, specifically for the number of trees in the random forest and gradient boosting model (40 and 35 trees, respectively) and for the classification threshold of each model. Both were tuned to maximize the F1 score.
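The threshold search might look like this sketch, assuming a vector of predicted probabilities `probs` and true 0/1 labels `truth` from a held-out set:

```r
# F1 score for a given classification threshold
f1_at <- function(threshold, probs, truth) {
  pred <- as.integer(probs >= threshold)
  tp        <- sum(pred == 1 & truth == 1)
  precision <- tp / sum(pred == 1)
  recall    <- tp / sum(truth == 1)
  2 * precision * recall / (precision + recall)
}

# Grid search over thresholds, keeping the one with the best F1
thresholds <- seq(0.1, 0.9, by = 0.01)
f1s  <- sapply(thresholds, f1_at, probs = probs, truth = truth)
best <- thresholds[which.max(f1s)]
```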
To judge the models, we compare their performance metrics and see which performs best. Here they are below:
## accuracy recall precision F1
## logistic 0.8169863 0.6170911 0.8467153 0.7138929
## tree 0.8097412 0.6127900 0.8283354 0.7044434
## forest 0.8964737 0.8479909 0.8690407 0.8583868
## gbm 0.8106207 0.6078098 0.8355376 0.7037086
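For reference, here is a sketch of how these four metrics follow from a confusion matrix, given predicted labels `pred` and true labels `truth` (both 0/1):

```r
cm <- table(predicted = pred, actual = truth)
tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
recall    <- tp / (tp + fn)
precision <- tp / (tp + fp)
f1        <- 2 * precision * recall / (precision + recall)
```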
We can see a clear winner. The random forest scores the highest on every metric, notably with an accuracy of ~0.9. The other three models score approximately equally across each metric, so I would consider either the logistic regression or the decision tree second best, due to their relative simplicity.
Here are the ROC curves (and AUC values) for each model as well:
## AUC Logistic: 0.8644579
## AUC Tree: 0.8217884
## AUC Random Forest: 0.9892999
## AUC GBM: 0.8421582
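These curves can be produced with the `pROC` package (an assumption; the report does not name the package it used):

```r
library(pROC)

# ROC curve and AUC for one model's predicted probabilities
roc_rf <- roc(response = truth, predictor = rf_probs)
plot(roc_rf)
auc(roc_rf)
```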
Again, the random forest outperforms the others! I was skeptical of this extremely high AUC of 0.989, as it may be the result of overfitting. So, I attempted to use cross-validation (with the `caret` package), but could not get the code to work. What I can say is that there is likely no data leakage into the training set, and the important variables for the random forest do make logical sense. More model validation is necessary, but the random forest seems quite promising.
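For completeness, here is a sketch of what the attempted `caret` cross-validation might look like; it is untested against this exact data, so treat it as a starting point rather than working code from the analysis.

```r
library(caret)

# 5-fold CV, scoring by ROC AUC
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# caret needs valid R names for the class levels, hence "no"/"yes"
train_cv <- train_set
levels(train_cv$is_canceled) <- c("no", "yes")

cv_rf <- train(is_canceled ~ ., data = train_cv, method = "rf",
               metric = "ROC", trControl = ctrl, ntree = 40)
cv_rf
```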
In conclusion, the random forest was found to be the best model for predicting whether a customer will cancel their hotel booking, with an accuracy of ~0.9 and an F1 score of ~0.85. The other models may be adequate, but the random forest is best. Future work would mainly involve further model validation (especially cross-validation, if I could get that working).