A significant problem for hotels, resorts, and other lodging businesses is that it is difficult to tell whether a customer plans to cancel a room reservation. Cancellations can cause a substantial loss of revenue, as those rooms could have been given to a paying customer. In this analysis, we will work with the hotel_booking dataset (from Kaggle) and use various features to predict when a customer with a room reservation will cancel their booking.

Data Preprocessing

To begin, we must first process the data in hotel_booking. The dataset is composed of both numeric and categorical features, so we handle these separately. Note that several of the features in hotel_booking, namely name, email, phone.number, and credit_card, are too granular for analysis: each value corresponds to only a few observations, so they do not generalize. We will ignore these entirely. Keep in mind that our target variable, is_canceled, is a factor with 2 levels, with 0 being “no” and 1 being “yes”.


For the numeric features, we mostly use Pearson’s correlation tests and boxplot examinations. In total, we’ll find that only 11 of the original 19 numeric features are associated strongly enough with is_canceled to be considered further for our models. Additionally, the original dataset had a categorical feature, reservation_status_date, made up of dates of the form “yyyy-mm-dd”. As most dates in reservation_status_date were shared by many observations, I decomposed it into 3 numeric features (year, month, and day) in order to capture temporal trends. All three of these new numeric features were found to be significant. Lastly, to aid in training the logistic regression (more on this model below), I scaled the data by transforming it into z-scores.
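The date decomposition and z-score scaling described above can be sketched in a few lines. This is a minimal Python illustration, not the code used in this analysis; the function names and example values are hypothetical, and the sketch assumes the sample standard deviation is used for scaling.

```python
from datetime import date
from statistics import mean, stdev


def split_date(value):
    """Split a 'yyyy-mm-dd' string into numeric (year, month, day)."""
    parsed = date.fromisoformat(value)
    return parsed.year, parsed.month, parsed.day


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical example values, not taken from the dataset.
example_dates = ["2015-07-01", "2016-02-14", "2017-08-31"]
date_parts = [split_date(d) for d in example_dates]

example_lead_times = [10, 20, 30]
scaled_lead_times = z_scores(example_lead_times)
```

Scaling puts features such as lead_time and adr on a comparable footing, which makes logistic regression coefficients easier to compare across features.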


For the categorical features, I first attempted to use chi-square tests to find significant associations between individual features and is_canceled, but that yielded no useful results. Instead, I trained a random forest on the categorical features and found 6 that were considered significant predictors. These were either factor-encoded or encoded with dummy variables, depending on what they were measuring.
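To make the two encodings concrete, here is a small Python sketch of factor (integer) encoding versus dummy-variable encoding. It is illustrative only: the function names, the `is_` column prefix, and the example category values are assumptions, and it is not the code used in this analysis.

```python
def factor_encode(values):
    """Map each category to an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def dummy_encode(values):
    """Build one 0/1 column per category, dropping the first (sorted)
    category so that it acts as the reference level."""
    levels = sorted(set(values))
    return {f"is_{lvl}": [int(v == lvl) for v in values] for lvl in levels[1:]}


# Hypothetical category values for illustration.
example_values = ["B", "A", "B", "C"]
factor_codes, factor_levels = factor_encode(example_values)
dummies = dummy_encode(example_values)
```

Factor encoding implies an ordering between categories, so it suits ordered or tree-based use; dummy variables avoid that implied ordering at the cost of more columns.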


Finally, none of the chosen predictors had NA values, so no data removal or imputation was necessary. In total, these are the chosen predictors, listed together with the target is_canceled:

##  [1] "year"                           "month"                         
##  [3] "day"                            "lead_time"                     
##  [5] "stays_in_week_nights"           "adults"                        
##  [7] "babies"                         "previous_cancellations"        
##  [9] "previous_bookings_not_canceled" "booking_changes"               
## [11] "days_in_waiting_list"           "adr"                           
## [13] "required_car_parking_spaces"    "total_of_special_requests"     
## [15] "deposit_type"                   "market_segment"                
## [17] "meal"                           "assigned_room_type"            
## [19] "reserved_room_type"             "is_canceled"

Models

The models selected for this binary classification task were logistic regression, a decision tree, a random forest, and a gradient boosting model.

Logistic Regression

Above are the top 15 variables in terms of importance for the logistic regression. Two interaction terms were included: stays_in_week_nights with total_of_special_requests, and lead_time with booking_changes. My reasoning was that, if a hotel is unwilling to accommodate special requests over a stay of many nights, then cancellation is likely. Also, a longer lead time gives the customer more time to find an alternative hotel in the event of booking changes.
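An interaction term is simply the product of two features, added as an extra column so the model can learn an effect that depends on both at once. The Python sketch below shows the idea for the two pairs named above; the function name, the `a:b` column naming, and the example row are hypothetical and not taken from the analysis.

```python
def add_interactions(row):
    """Return a copy of a feature row with the two product terms appended."""
    out = dict(row)
    out["stays_in_week_nights:total_of_special_requests"] = (
        row["stays_in_week_nights"] * row["total_of_special_requests"]
    )
    out["lead_time:booking_changes"] = (
        row["lead_time"] * row["booking_changes"]
    )
    return out


# Hypothetical (unscaled) feature values for a single booking.
example_row = {
    "stays_in_week_nights": 4,
    "total_of_special_requests": 2,
    "lead_time": 30,
    "booking_changes": 1,
}
row_with_interactions = add_interactions(example_row)
```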

Decision Tree

The tree found only 9 predictors to be significant, and we can see their predictive power decrease fairly linearly.

Random Forest

Here are the variables deemed important by the random forest. The ranking looks similar to the decision tree’s, but more variables were deemed “important” here.

Gradient Boost

Finally, here is the gradient boosting plot of variable importance. Not as many variables stand out here!


Across the plots, we can see several features that all the models deem important. Specifically, note deposit_type, total_of_special_requests, and lead_time as all being quite important! These make sense too: if a deposit is non-refundable, cancellation seems unlikely. Additionally, the longer the lead time, the more likely it is that something gets in the way of the trip, so cancellation seems more likely.


There was also some hyperparameter tuning done for each model, specifically of the number of trees in the random forest and gradient boosting models (40 and 35 trees respectively) and of the labeling threshold for each model. This was done to maximize F1 scores.
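Tuning a labeling threshold for F1 amounts to trying several cutoffs on the predicted probabilities and keeping the one with the highest F1. The following Python sketch shows one way to do that; it is not the tuning code used here, and the function names, candidate grid, and example values are assumptions for illustration.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are labeled as cancellations."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Hypothetical labels and predicted probabilities, not from the dataset.
example_labels = [0, 0, 1, 1, 1, 0]
example_probs = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
candidate_thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

chosen_threshold = best_threshold(example_labels, example_probs, candidate_thresholds)
chosen_f1 = f1_at_threshold(example_labels, example_probs, chosen_threshold)
```

The threshold should be chosen on validation data rather than on the test set, since otherwise the reported F1 is optimistic.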

Performance Metrics

To judge the models, we need to look at the performance metrics and see which model performs best. Here they are:

##           accuracy    recall precision        F1
## logistic 0.8169863 0.6170911 0.8467153 0.7138929
## tree     0.8097412 0.6127900 0.8283354 0.7044434
## forest   0.8964737 0.8479909 0.8690407 0.8583868
## gbm      0.8106207 0.6078098 0.8355376 0.7037086

We can see a clear winner. The random forest scores the highest in every metric, with an accuracy of ~0.9 and a recall of ~0.85 against ~0.61 for the other models. The other three models score approximately equally on each metric, so I would likely consider either the logistic regression or the decision tree the second best, due to their relative simplicity.
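As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it can be recomputed from the other two columns. The Python sketch below does this with the reported values; the variable and function names are my own.

```python
# (recall, precision, reported F1) copied from the metrics table above.
reported = {
    "logistic": (0.6170911, 0.8467153, 0.7138929),
    "tree": (0.6127900, 0.8283354, 0.7044434),
    "forest": (0.8479909, 0.8690407, 0.8583868),
    "gbm": (0.6078098, 0.8355376, 0.7037086),
}


def f1_score(recall, precision):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: f1_score(recall, precision)
    for name, (recall, precision, _) in reported.items()
}

# Largest absolute difference between recomputed and reported F1.
max_gap = max(abs(recomputed[name] - reported[name][2]) for name in reported)
```

The recomputed values agree with the reported F1 column up to rounding, so the table is internally consistent.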


Here are the ROC curves for each model, along with their AUC values:

## AUC Logistic: 0.8644579
## AUC Tree: 0.8217884
## AUC Random Forest: 0.9892999
## AUC GBM: 0.8421582

Again, the random forest outperforms the others! I was skeptical of this extremely high AUC of 0.989, as it may be the result of overfitting. I attempted to check this with cross-validation (using the caret package), but could not get the code to work. What I can say is that there is likely no data leakage in the training set, and the important variables for the random forest do make logical sense. More model validation is necessary, but the random forest seems quite promising.
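For reference, the core of k-fold cross-validation is small enough to write by hand. The Python sketch below builds class-balanced folds and scores a model on each held-out fold. It is not the caret-based approach attempted above: the function names are my own, and `fit_and_score` is a hypothetical callback that would train a model on the training indices and return a metric on the held-out indices.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, k, fit_and_score, seed=0):
    """Hold out each fold once; return one score per fold."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        scores.append(fit_and_score(train, held_out))
    return scores


# Hypothetical labels: 6 non-cancellations and 4 cancellations.
example_labels = [0] * 6 + [1] * 4
example_folds = stratified_folds(example_labels, k=2)
```

If the held-out AUC of the random forest stayed close to 0.989 across folds, overfitting would be a less likely explanation; a large drop would point the other way.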

Conclusion

In conclusion, the random forest was found to be the best model for predicting when a customer will cancel their hotel booking, with an accuracy of ~0.9 and an F1 of ~0.86. The other models may be adequate, but the random forest performed best on every metric. Future work would mainly involve model validation (especially cross-validation, if I can get that working).