In this project, Our goal is to develop a prediction model for predicting the possibility of hotel booking cancellation using data mining algorithms. This data is extracted from Hotel Profit Management System (PMS) and it contains booking information due to arrive between 7/1/2015 and 8/31/2017 from one resort hotel and one city hotel located in Portugal.
Hotel cancellations are always a bad news for the business yet it’s unavoidable. The incorrect estimation of guest can lead to wastage of resources for eg. if a large group of individuals were to cancel their booking, at the last moment, this may lead to wastage of raw food materials and it is often difficult to find new guests at such a short notice
If hotel’s booking system were to have an algorithm, which could predict the possibility of cancellation, it would be easier for the hotel management to develop better cancellation policy, to predict demand accurately and to take proactive actions to manage business and resources accordingly. We are trying to analyze the booking data of 3 years (between 7/1/2015 and 8/31/2017) of multiple hotels and trying to find a common trend which can be seen in the case of cancellation. In short, we are trying to determine, what are some of the common characteristics which indicate possibility of a cancellation.
We will first identify which parameters are correlated with booking cancellation. After deciding the correlated variables, we will see whether they positively influence the cancellation or negatively influence the cancellation. Finally, we plan to come up with a logistic regression model which can predict cancellation.
Below is the list of packages used in the project.
library(psych)
library(magrittr)
library(tidyverse)
library(car)
library(knitr)
The purpose of the packages are as below:
psych is used to plot multiple histograms.tidyverse consists of 6 core packages out of which the below 3 are most important for this project:ggplot2: Used for creating powerful visualizationsdplyr: Used for data manipulationtidyr: Used for data modificationscar is used for identifying outliers in a boxplot.knitr is for displaying tables.The data was obtained at tidytuesday and detailed information can be found in the original publication (Nuno Antonio, Ana de Almeida, Luis Nunes, 2019). Since it is uncommon to find real business data, the authors intended to provide the hotel demand data to researchers and educators in revenue management, machine learning or data mining.
There are 119390 observation and 32 variables in hotels.csv file.
# Importing .csv files
hotels <- read.csv("data/hotels.csv")
# View the structure of hotels
str(hotels)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
# View the first 6 rows of hotels
head(hotels)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
Following integer variables have been switched to factors for the convenience in analysis.
hotels$hotel <- as.factor(hotels$hotel)
hotels$is_canceled <- as.factor(hotels$is_canceled)
hotels$arrival_date_year <- as.factor(hotels$arrival_date_year )
hotels$arrival_date_month <- as.factor(hotels$arrival_date_month )
hotels$arrival_date_week_number <- as.factor(hotels$arrival_date_week_number )
hotels$arrival_date_day_of_month <- as.factor(hotels$arrival_date_day_of_month)
hotels$meal <- as.factor(hotels$meal)
hotels$country <- as.factor(hotels$country)
hotels$market_segment <- as.factor(hotels$market_segment)
hotels$distribution_channel <- as.factor(hotels$distribution_channel)
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$previous_cancellations <- as.factor(hotels$previous_cancellations)
hotels$previous_bookings_not_canceled <- as.factor(hotels$previous_bookings_not_canceled)
hotels$previous_cancellations <- as.factor(hotels$previous_cancellations)
hotels$reserved_room_type <- as.factor(hotels$reserved_room_type)
hotels$assigned_room_type <- as.factor(hotels$assigned_room_type)
hotels$booking_changes <- as.factor(hotels$booking_changes)
hotels$deposit_type <- as.factor(hotels$deposit_type)
hotels$agent <- as.factor(hotels$agent)
hotels$company <- as.factor(hotels$company)
hotels$customer_type <- as.factor(hotels$customer_type)
hotels$reservation_status <- as.factor(hotels$reservation_status)
hotels$reservation_status_date <- as.factor(hotels$reservation_status_date)
#hotels$arrival_date <- as.factor(hotels$arrival_date)
New variable “arrival_date” and “stay_nights” were created using existing variables.
#create new variable: arrival_date
hotels$arrival_date <-paste(hotels$arrival_date_month,
hotels$arrival_date_day_of_month,
hotels$arrival_date_year,sep="-")
as.Date(hotels$arrival_date, format="%B-%d-%Y")
#create new variable: stays_nights
hotels %>% mutate(stays_nights= stays_in_weekend_nights + stays_in_week_nights) -> hotels_new
Regarding type of meal, there are 10,650 records of SC (no meal package) and 1,169 records of Undefined. The level of “Undefined” were combined as the level of SC.
#summary(hotels_new)
levels(hotels_new$meal)[5] <- "SC"
summary(hotels_new)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 0:75166 Min. : 0 2015:21996
## Resort Hotel:40060 1:44224 1st Qu.: 18 2016:56707
## Median : 69 2017:40687
## Mean :104
## 3rd Qu.:160
## Max. :737
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 33 : 3580 17 : 4406
## July :12661 30 : 3087 5 : 4317
## May :11791 32 : 3045 15 : 4196
## October:11160 34 : 3040 25 : 4160
## April :11089 18 : 2926 26 : 4147
## June :10939 21 : 2854 9 : 4096
## (Other):47873 (Other):100858 (Other):94068
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB:92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB: 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB:14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC:11819 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## NA's :4 (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 0:115580
## Offline TA/TO:24219 Direct :14645 1: 3810
## Groups :19811 GDS : 193
## Direct :12606 TA/TO :97870
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## 0 :112906 0 :115770 A :85994
## 1 : 6051 1 : 1542 D :19201
## 2 : 116 2 : 580 E : 6535
## 3 : 65 3 : 333 F : 2897
## 24 : 48 4 : 229 G : 2094
## 11 : 35 5 : 181 B : 1118
## (Other): 169 (Other): 755 (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 0 :101314 No Deposit:104641 9 :31961
## D :25322 1 : 12701 Non Refund: 14587 NULL :16340
## E : 7806 2 : 3805 Refundable: 162 240 :13922
## F : 3751 3 : 927 1 : 7191
## G : 2553 4 : 376 14 : 3640
## C : 2375 5 : 118 7 : 3539
## (Other): 3530 (Other): 149 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date arrival_date stays_nights
## Canceled :43017 2015-10-21: 1461 Length:119390 Min. : 0.000
## Check-Out:75166 2015-07-06: 805 Class :character 1st Qu.: 2.000
## No-Show : 1207 2016-11-25: 790 Mode :character Median : 3.000
## 2015-01-01: 763 Mean : 3.428
## 2016-01-18: 625 3rd Qu.: 4.000
## 2015-07-02: 469 Max. :69.000
## (Other) :114477
There are almost no missing values except “children”.
colSums(is.na(hotels_new))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## arrival_date stays_nights
## 0 0
There are 4 missing values in “children” and they were replaced by zero.
hotels_new$children[is.na(hotels_new$children)] <- 0
To see the distribution of adr (Average Daily Rate) and to investigate any existing outliers, a histogram and a boxplot were created.
hist(hotels_new$adr)
boxplot(hotels_new$adr)
The outlier in adr was identified.
Boxplot(~adr, data=hotels_new, id=list(n=Inf))
The outlier was replaced by NA and below is new boxplot without the outlier.
hotels_new[48516, 28] <- NA # row num. is 48516; column num. for adr is 28
boxplot(hotels_new$adr)
Below is the table of variable description.
| Variable | Type | Description |
|---|---|---|
| hotel | factor | Hotel (City Hotel or Resort Hotel) |
| is_canceled | factor | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | integer | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | factor | Year of arrival date |
| arrival_date_month | factor | Month of arrival date |
| arrival_date_week_number | factor | Week number of year for arrival date |
| arrival_date_day_of_month | factor | Day of arrival date |
| stays_in_weekend_nights | integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | integer | Number of adults |
| children | numeric | Number of children |
| babies | integer | Number of babies |
| meal | factor | Type of meal booked. SC - no meal package; BB - Bed & Breakfast; HB - Half board (breakfast and one other meal – usually dinner); FB - Full board (breakfast, lunch and dinner) |
| country | factor | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
| market_segment | factor | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | factor | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | factor | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | factor | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | factor | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | factor | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| assigned_room_type | factor | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
| booking_changes | factor | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | factor | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
| agent | factor | ID of the travel agency that made the booking |
| company | factor | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| days_in_waiting_list | integer | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | factor | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
| adr | numeric | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | integer | Number of car parking spaces required by the customer |
| total_of_special_requests | integer | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | factor | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
| reservation_status_date | factor | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
| arrival_date | character | Arrival Date |
| stays_nights | integer | Stays in nights |
We begin Exploratory Data Analysis by looking at the histograms of the numeric columns.This helps us to visualize if there is a skewness to the data or if any particular variable follows a perfectly normal distribution. We see that stays_nights follows almost a normal distribution curve with a longer tail to the right.
multi.hist(hotels_new[,sapply(hotels_new, is.numeric)])
The canceled and not canceled groups were displayed for the two hotels. It is clear that the number of cancellations for city hotel were higher than resort hotel.
# hotel and cancellation
hotels_new %>% ggplot( aes(x=hotel, fill=is_canceled)) + geom_bar() + scale_fill_discrete(labels=c("not canceled", "canceled")) + guides(fill=guide_legend(title=NULL)) + scale_fill_manual(values=c("#999999", "#E69F00"))
We further expand our analysis to visualize the number of cancellations on a monthly basis as below. This will help us to know in which months the hotel owners were affected the worst.
count <- table(hotels_new$is_canceled, hotels_new$arrival_date_month)
barplot(count, main = "Number of cancellations by Months",horiz = TRUE, names.arg = c("Jan", "Feb","March","April","May","June","July","Aug","Sept","Oct","Nov","Dec"),col=c("red","pink") ,cex.names = 0.4)
We can find the total number of guests by adding 3 columns of adults, children and babies. This will tell us the total number of guests per room which is an important metric but was not included in the original dataset.
hotels_new$no_of_guests = rowSums(cbind(hotels_new$adults,hotels_new$children, hotels_new$babies),na.rm = TRUE)
summary (hotels_new$no_of_guests)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.968 2.000 55.000
Next we see the trend of bookings that were converted and canceled over the period of three years period over which the data was collected. This will also help us to identify if any outside factors or events led people to cancel their hotel bookings during a particular year.
ggplot(data = hotels_new,aes(arrival_date_year,fill = (is_canceled))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "No. of Bookings",labels = scales::comma) +
xlab("Year") +
ggtitle("Status by number of bookings across years") +
labs(fill = 'Booking Status')
We tried correlating the variables with the dependent variable is_canceled but we were not able to make conclusive decisions on the basis of that information. We only see variables that are weakly correlated or are negatively correlated. It would be helpful to learn a modification technique that would help us to increase the correlation of the variables.
corr_hotels <- hotels_new[unlist(lapply(hotels_new, is.numeric))]
cor(corr_hotels)[,1] * 100
## lead_time stays_in_weekend_nights
## 100.000000 8.567113
## stays_in_week_nights adults
## 16.579936 11.951869
## children babies
## -3.761282 -2.091502
## days_in_waiting_list adr
## 17.008418 NA
## required_car_parking_spaces total_of_special_requests
## -11.645057 -9.571205
## stays_nights no_of_guests
## 15.716697 7.226505
We have done a preliminary fit in a linear regression. Going forward we plan to use forward and backward variable selection techniques to optimize the models. As of now we have dropped certain variables from our regression model.
lm_model<- glm(is_canceled ~ lead_time + hotel + no_of_guests + deposit_type + adr + total_of_special_requests + required_car_parking_spaces + days_in_waiting_list + customer_type, data = hotels_new , family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lm_model)
##
## Call:
## glm(formula = is_canceled ~ lead_time + hotel + no_of_guests +
## deposit_type + adr + total_of_special_requests + required_car_parking_spaces +
## days_in_waiting_list + customer_type, family = "binomial",
## data = hotels_new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0201 -0.8151 -0.5512 0.1949 2.7214
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.217e+00 5.059e-02 -43.811 < 2e-16 ***
## lead_time 4.474e-03 8.072e-05 55.432 < 2e-16 ***
## hotelResort Hotel -1.425e-01 1.580e-02 -9.022 < 2e-16 ***
## no_of_guests 1.772e-01 1.212e-02 14.620 < 2e-16 ***
## deposit_typeNon Refund 5.180e+00 1.054e-01 49.156 < 2e-16 ***
## deposit_typeRefundable -5.177e-02 1.992e-01 -0.260 0.794956
## adr 4.671e-03 1.701e-04 27.462 < 2e-16 ***
## total_of_special_requests -5.504e-01 1.032e-02 -53.350 < 2e-16 ***
## required_car_parking_spaces -1.665e+01 4.289e+01 -0.388 0.697812
## days_in_waiting_list -2.822e-03 4.739e-04 -5.955 2.61e-09 ***
## customer_typeGroup -7.372e-01 1.607e-01 -4.588 4.47e-06 ***
## customer_typeTransient 7.130e-01 4.486e-02 15.895 < 2e-16 ***
## customer_typeTransient-Party -1.691e-01 4.701e-02 -3.597 0.000322 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157396 on 119388 degrees of freedom
## Residual deviance: 112211 on 119376 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 112237
##
## Number of Fisher Scoring iterations: 16