In this project, Our goal is to develop a prediction model for predicting the possibility of hotel booking cancellation using data mining algorithms. This data is extracted from Hotel Profit Management System (PMS) and it contains booking information due to arrive between 7/1/2015 and 8/31/2017 from one resort hotel and one city hotel located in Portugal.
Hotel cancellations are always a bad news for the business yet it’s unavoidable. The incorrect estimation of guest can lead to wastage of resources for eg. if a large group of individuals were to cancel their booking, at the last moment, this may lead to wastage of raw food materials and it is often difficult to find new guests at such a short notice
If hotel’s booking system were to have an algorithm, which could predict the possibility of cancellation, it would be easier for the hotel management to develop better cancellation policy, to predict demand accurately and to take proactive actions to manage business and resources accordingly. We are trying to analyze the booking data of 3 years (between 7/1/2015 and 8/31/2017) of multiple hotels and trying to find a common trend which can be seen in the case of cancellation. In short, we are trying to determine, what are some of the common characteristics which indicate possibility of a cancellation.
We will first identify which parameters are correlated with booking cancellation. After deciding the correlated variables, we will see whether they positively influence the cancellation or negatively influence the cancellation. Finally, we plan to come up with a logistic regression model which can predict cancellation.
Below is the list of packages used in the project.
library(magrittr)
library(tidyverse)
library(car)
library(knitr)
library(caret)
library(ROCR)
library(scales)
library(pander)
library(plotly)
library(MASS)
library(psych)
The purpose of the packages are as below:
psych is used to plot multiple histograms.tidyverse consists of 6 core packages out of which the below 3 are most important for this project:ggplot2: Used for creating powerful visualizationsdplyr: Used for data manipulationtidyr: Used for data modificationscar is used for identifying outliers in a boxplot.knitr is for displaying tables.The data was obtained at tidytuesday and detailed information can be found in the original publication (Nuno Antonio, Ana de Almeida, Luis Nunes, 2019). Since it is uncommon to find real business data, the authors intended to provide the hotel demand data to researchers and educators in revenue management, machine learning or data mining.
There are 119390 observation and 32 variables in hotels.csv file. We have set the seed to ensure the repeatability of the code and data split during logistic regression.
Data type of each variable with sample values:
# Importing .csv files
set.seed(70252021)
hotels <- read.csv("data/hotels.csv")
# View the structure of hotels
str(hotels)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
Each observation represents a hotel booking. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields. The data was extracted from hotels’ Property Management System (PMS) SQL databases Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.
The following diagram presents the PMS database tables from where the data variables were extracted:
Viewing the raw data
# View the first 10 rows of hotels
knitr::kable(head(hotels,n = 10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
Following integer/Character variables have been switched to factors for convenience in analysis.
| Variables | Data Types |
|---|---|
| hotel | character |
| is_canceled | Integer |
| arrival_date_year | Integer |
| arrival_date_month | character |
| arrival_date_week_number | Integer |
| arrival_date_day_of_month | Integer |
| meal | character |
| country | character |
| market_segment | character |
| distribution_channel | character |
| is_repeated_guest | Integer |
| previous_cancellations | Integer |
| previous_bookings_not_canceled | Integer |
| reserved_room_type | character |
| assigned_room_type | character |
| booking_changes | Integer |
| deposit_type | character |
| agent | character |
| company | character |
| customer_type | character |
| reservation_status | character |
| reservation_status_date | character |
hotels$hotel <- as.factor(hotels$hotel)
hotels$is_canceled <- as.factor(hotels$is_canceled)
hotels$arrival_date_year <- as.factor(hotels$arrival_date_year )
hotels$arrival_date_month <- as.factor(hotels$arrival_date_month )
hotels$arrival_date_week_number <- as.factor(hotels$arrival_date_week_number )
hotels$arrival_date_day_of_month <- as.factor(hotels$arrival_date_day_of_month)
hotels$meal <- as.factor(hotels$meal)
hotels$country <- as.factor(hotels$country)
hotels$market_segment <- as.factor(hotels$market_segment)
hotels$distribution_channel <- as.factor(hotels$distribution_channel)
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$previous_cancellations <- as.factor(hotels$previous_cancellations)
hotels$previous_bookings_not_canceled <- as.factor(hotels$previous_bookings_not_canceled)
hotels$reserved_room_type <- as.factor(hotels$reserved_room_type)
hotels$assigned_room_type <- as.factor(hotels$assigned_room_type)
hotels$booking_changes <- as.factor(hotels$booking_changes)
hotels$deposit_type <- as.factor(hotels$deposit_type)
hotels$agent <- as.factor(hotels$agent)
hotels$company <- as.factor(hotels$company)
hotels$customer_type <- as.factor(hotels$customer_type)
hotels$reservation_status <- as.factor(hotels$reservation_status)
hotels$reservation_status_date <- as.factor(hotels$reservation_status_date)
Following new variables “arrival_date” and “stay_nights” were created using existing variables.
| Variables | Data Types |
|---|---|
| arrival_date | character |
| stay_nights | Integer |
#create new variable: arrival_date
hotels$arrival_date <-paste(hotels$arrival_date_month,
hotels$arrival_date_day_of_month,
hotels$arrival_date_year,sep="-")
hotels$arrival_date <-as.Date(hotels$arrival_date, format="%B-%d-%Y")
#create new variable: stays_nights
hotels %>% mutate(stays_nights= stays_in_weekend_nights + stays_in_week_nights) -> hotels_new
Regarding type of meal, there are 10,650 records of SC (no meal package) and 1,169 records of Undefined. The level of “Undefined” were combined as the level of SC. Further we also dropped 16 values from booking_changes and 21 values from previous_cancellations since they were not part of the training data but were part of the testing dataset.
levels(hotels_new$meal)[5] <- "SC"
# Data Summary (Booking Data)
#hotels_new<-hotels_new[!(hotels_new$booking_changes=="18" ),]
hotels_new<-hotels_new[!(hotels_new$booking_changes=="16"| hotels_new$previous_cancellations=="21"),]
summary(hotels_new)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79328 0:75165 Min. : 0 2015:21995
## Resort Hotel:40059 1:44222 1st Qu.: 18 2016:56706
## Median : 69 2017:40686
## Mean :104
## 3rd Qu.:160
## Max. :737
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 33 : 3580 17 : 4406
## July :12660 30 : 3087 5 : 4317
## May :11790 32 : 3045 15 : 4196
## October:11159 34 : 3040 25 : 4160
## April :11089 18 : 2926 26 : 4147
## June :10939 21 : 2854 9 : 4095
## (Other):47873 (Other):100855 (Other):94066
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB:92307 PRT :48589
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB: 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB:14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC:11819 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## NA's :4 (Other):28633
## market_segment distribution_channel is_repeated_guest
## Online TA :56475 Corporate: 6677 0:115578
## Offline TA/TO:24219 Direct :14645 1: 3809
## Groups :19810 GDS : 193
## Direct :12606 TA/TO :97867
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## 0 :112904 0 :115767 A :85991
## 1 : 6051 1 : 1542 D :19201
## 2 : 116 2 : 580 E : 6535
## 3 : 65 3 : 333 F : 2897
## 24 : 48 4 : 229 G : 2094
## 11 : 35 5 : 181 B : 1118
## (Other): 168 (Other): 755 (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74050 0 :101313 No Deposit:104638 9 :31960
## D :25322 1 : 12701 Non Refund: 14587 NULL :16340
## E : 7806 2 : 3805 Refundable: 162 240 :13921
## F : 3751 3 : 927 1 : 7190
## G : 2553 4 : 376 14 : 3640
## C : 2375 5 : 118 7 : 3539
## (Other): 3530 (Other): 147 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112590 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89611
## 67 : 267 Mean : 2.321 Transient-Party:25123
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.59 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5713
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date arrival_date
## Canceled :43015 2015-10-21: 1461 Min. :2015-07-01
## Check-Out:75165 2015-07-06: 805 1st Qu.:2016-03-13
## No-Show : 1207 2016-11-25: 790 Median :2016-09-06
## 2015-01-01: 763 Mean :2016-08-28
## 2016-01-18: 625 3rd Qu.:2017-03-18
## 2015-07-02: 469 Max. :2017-08-31
## (Other) :114474
## stays_nights
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 3.000
## Mean : 3.428
## 3rd Qu.: 4.000
## Max. :69.000
##
There are almost no missing values except for “children”.
colSums(is.na(hotels_new))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## arrival_date stays_nights
## 0 0
There are 4 missing values for “children” variable and they were replaced by zero.
hotels_new$children[is.na(hotels_new$children)] <- 0
To see the distribution of adr (Average Daily Rate) and to investigate any existing outliers, a histogram and a boxplot were created.
hist(hotels_new$adr,prob = T, col ="grey", breaks = 100, xlim = c(min(hotels_new$adr) - 1 ,max(hotels_new$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
boxplot(hotels_new$adr,main = "Boxplot : Average Daily Rate")
The outlier in adr was identified.
Boxplot(~adr, data=hotels_new, id=list(n=Inf),main = "Boxplot : Average Daily Rate")
The outlier was replaced by NA and below is new boxplot without the outlier.
hotels_new[48516, 28] <- NA # row num. is 48516; column num. for adr is 28
boxplot(hotels_new$adr, main = "Boxplot : Average Daily Rate")
Below is the table of variable description.
| Variable | Type | Description |
|---|---|---|
| hotel | factor | Hotel (City Hotel or Resort Hotel) |
| is_canceled | factor | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | integer | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | factor | Year of arrival date |
| arrival_date_month | factor | Month of arrival date |
| arrival_date_week_number | factor | Week number of year for arrival date |
| arrival_date_day_of_month | factor | Day of arrival date |
| stays_in_weekend_nights | integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | integer | Number of adults |
| children | numeric | Number of children |
| babies | integer | Number of babies |
| meal | factor | Type of meal booked. SC - no meal package; BB - Bed & Breakfast; HB - Half board (breakfast and one other meal – usually dinner); FB - Full board (breakfast, lunch and dinner) |
| country | factor | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
| market_segment | factor | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | factor | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | factor | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | factor | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | factor | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | factor | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| assigned_room_type | factor | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
| booking_changes | factor | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | factor | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
| agent | factor | ID of the travel agency that made the booking |
| company | factor | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| days_in_waiting_list | integer | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | factor | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
| adr | numeric | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | integer | Number of car parking spaces required by the customer |
| total_of_special_requests | integer | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | factor | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
| reservation_status_date | factor | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
| arrival_date | Date | Arrival Date |
| stays_nights | integer | Stays in nights |
We begin Exploratory Data Analysis by looking at the histograms of the numeric columns.This helps us to visualize if there is a skewness to the data or if any particular variable follows a perfectly normal distribution. We see that stays_nights follows almost a normal distribution curve with a longer tail to the right.
multi.hist(hotels_new[,sapply(hotels_new, is.numeric)],freq=TRUE, dcol= c("blue","red"),dlty=c("dotted", "solid"))
The canceled and not canceled groups were displayed for the two hotels. It is clear that the number of cancellations for city hotel were higher than resort hotel.
# hotel and cancellation
hotels_new %>% ggplot( aes(x=hotel, fill=is_canceled)) + geom_bar() + scale_fill_discrete(labels=c("not canceled", "canceled")) + guides(fill=guide_legend(title=NULL)) + scale_fill_manual(values=c("#F8766D","#00BFc4"))
We further expand our analysis to visualize the number of cancellations on a monthly basis as below. This will help us to know in which months the hotel owners were affected the worst.
count <- table(hotels_new$is_canceled, hotels_new$arrival_date_month)
barplot(count, main = "Number of cancellations by Months",horiz = TRUE, names.arg = c("Jan", "Feb","March","April","May","June","July","Aug","Sept","Oct","Nov","Dec"),col=c("#F8766D","#00BFc4") ,cex.names = 0.4)
We can find the total number of guests by adding 3 columns of adults, children and babies. This will tell us the total number of guests per room which is an important metric but was not included in the original dataset.
hotels_new$no_of_guests = rowSums(cbind(hotels_new$adults,hotels_new$children, hotels_new$babies),na.rm = TRUE)
summary (hotels_new$no_of_guests)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.968 2.000 55.000
Total bookings across months for 2015, 2016 and 2017 is as below:
hotels_new1 <- hotels_new %>%
mutate(Months = fct_relevel(arrival_date_month,"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"))
ggplot(data = hotels_new1,aes(Months,fill = (hotel))) +
geom_bar(position = 'dodge') +
facet_grid(hotels_new1$arrival_date_year) +
scale_y_continuous(name = "No. of Bookings",labels = scales::comma) +
xlab("Months") +
ggtitle("Hotel Type vs No. of bookings across months") +
labs(fill = 'Hotel Type')
We can see that the data for 2015 and 2017 is not available for all the months. This indicates that all the annual numbers need to be normalized to a common baseline.
Next we see the trend of bookings that were converted and canceled over the period of three years period over which the data was collected. This will also help us to identify if any outside factors or events led people to cancel their hotel bookings during a particular year.
ggplot(data = hotels_new,aes(arrival_date_year,fill = (is_canceled))) +
geom_bar(position = 'dodge') +
scale_y_continuous(name = "No. of Bookings",labels = scales::comma) +
xlab("Years") +
ggtitle("Status by number of bookings across years") +
labs(fill = 'Booking Status')
We tried correlating the variables with the dependent variable is_canceled but we were not able to make conclusive decisions on the basis of that information. We only see variables that are weakly correlated or are negatively correlated.
corr_hotels <- hotels_new[unlist(lapply(hotels_new, is.numeric))]
cor(corr_hotels)[,1] * 100
## lead_time stays_in_weekend_nights
## 100.000000 8.569450
## stays_in_week_nights adults
## 16.579416 11.951311
## children babies
## -3.760406 -2.091256
## days_in_waiting_list adr
## 17.009473 NA
## required_car_parking_spaces total_of_special_requests
## -11.644463 -9.576870
## stays_nights no_of_guests
## 15.717236 7.226568
We start with a full model to observe with all the variable and then we will proceed to use a logistic regression model with the number of cancellations as response variable.
lm_model<- glm(is_canceled ~ lead_time + hotel + no_of_guests + deposit_type + adr + total_of_special_requests + booking_changes + required_car_parking_spaces + days_in_waiting_list + customer_type, data = hotels_new , family = "binomial")
summary(lm_model)
##
## Call:
## glm(formula = is_canceled ~ lead_time + hotel + no_of_guests +
## deposit_type + adr + total_of_special_requests + booking_changes +
## required_car_parking_spaces + days_in_waiting_list + customer_type,
## family = "binomial", data = hotels_new)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0455 -0.8120 -0.4961 0.2007 2.8011
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.186e+00 5.108e-02 -42.787 < 2e-16 ***
## lead_time 4.661e-03 8.211e-05 56.764 < 2e-16 ***
## hotelResort Hotel -9.598e-02 1.598e-02 -6.006 1.90e-09 ***
## no_of_guests 1.609e-01 1.233e-02 13.046 < 2e-16 ***
## deposit_typeNon Refund 5.096e+00 1.055e-01 48.309 < 2e-16 ***
## deposit_typeRefundable 1.005e-01 2.037e-01 0.493 0.6218
## adr 5.100e-03 1.729e-04 29.497 < 2e-16 ***
## total_of_special_requests -5.497e-01 1.042e-02 -52.755 < 2e-16 ***
## booking_changes1 -1.085e+00 2.873e-02 -37.779 < 2e-16 ***
## booking_changes2 -5.999e-01 4.409e-02 -13.607 < 2e-16 ***
## booking_changes3 -8.255e-01 9.648e-02 -8.556 < 2e-16 ***
## booking_changes4 -7.208e-01 1.449e-01 -4.976 6.50e-07 ***
## booking_changes5 -5.260e-01 2.639e-01 -1.993 0.0462 *
## booking_changes6 -7.012e-02 2.984e-01 -0.235 0.8142
## booking_changes7 -1.507e+00 6.263e-01 -2.405 0.0162 *
## booking_changes8 -2.206e-01 6.183e-01 -0.357 0.7212
## booking_changes9 -1.613e+00 1.109e+00 -1.454 0.1458
## booking_changes10 -1.614e-01 1.141e+00 -0.141 0.8876
## booking_changes11 -1.577e+01 2.637e+03 -0.006 0.9952
## booking_changes12 -1.609e+01 2.754e+03 -0.006 0.9953
## booking_changes13 -1.699e+01 1.722e+03 -0.010 0.9921
## booking_changes14 -4.488e-01 1.162e+00 -0.386 0.6994
## booking_changes15 -1.716e+01 2.242e+03 -0.008 0.9939
## booking_changes17 -1.682e+01 2.653e+03 -0.006 0.9949
## booking_changes18 -1.690e+01 3.956e+03 -0.004 0.9966
## booking_changes20 -1.613e+01 3.956e+03 -0.004 0.9967
## booking_changes21 -1.540e+01 3.956e+03 -0.004 0.9969
## required_car_parking_spaces -1.659e+01 4.251e+01 -0.390 0.6963
## days_in_waiting_list -3.203e-03 4.778e-04 -6.704 2.02e-11 ***
## customer_typeGroup -6.541e-01 1.608e-01 -4.068 4.74e-05 ***
## customer_typeTransient 7.501e-01 4.521e-02 16.591 < 2e-16 ***
## customer_typeTransient-Party -5.918e-02 4.742e-02 -1.248 0.2121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157391 on 119385 degrees of freedom
## Residual deviance: 110283 on 119354 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 110347
##
## Number of Fisher Scoring iterations: 16
AIC (lm_model)
## [1] 110346.8
We split the full data in 80:20 ratio as training and testing datasets.
index <- createDataPartition(hotels_new$is_canceled, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- hotels_new[-index,]
# select 80% of data to train the models
trainset <- hotels_new[index,]
fullmodel <- glm(is_canceled ~ hotel + lead_time + adr + stays_nights + adults + children + babies +
meal + total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + days_in_waiting_list + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
Let’s find the best model using backward elimination technique.
# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')
## Start: AIC=82509.79
## is_canceled ~ hotel + lead_time + adr + stays_nights + adults +
## children + babies + meal + total_of_special_requests + distribution_channel +
## is_repeated_guest + previous_cancellations + booking_changes +
## deposit_type + days_in_waiting_list + required_car_parking_spaces +
## customer_type
##
## Df Deviance AIC
## <none> 82398 82510
## - babies 1 82405 82515
## - days_in_waiting_list 1 82406 82516
## - hotel 1 82411 82521
## - adults 1 82431 82541
## - children 1 82464 82574
## - stays_nights 1 82522 82632
## - meal 3 82720 82826
## - is_repeated_guest 1 82963 83073
## - distribution_channel 4 83263 83367
## - booking_changes 19 83410 83484
## - adr 1 83442 83552
## - lead_time 1 83746 83856
## - customer_type 3 84230 84336
## - total_of_special_requests 1 85019 85129
## - required_car_parking_spaces 1 85712 85822
## - previous_cancellations 13 86569 86655
## - deposit_type 2 91159 91267
Considering the lowest AIC for days_in_waiting_list, babies and hotel, we tried below 2 cases.
We select our final model after removing days_in_waiting_list and babies as this model gave a lower AIC value.
#Dropped previous cancellations
Finalmodel <- glm(is_canceled ~ hotel + lead_time + adr + stays_nights + adults + children +
meal + total_of_special_requests + distribution_channel +
is_repeated_guest + booking_changes +previous_cancellations +
deposit_type + days_in_waiting_list + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
summary(Finalmodel)
##
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + adr + stays_nights +
## adults + children + meal + total_of_special_requests + distribution_channel +
## is_repeated_guest + booking_changes + previous_cancellations +
## deposit_type + days_in_waiting_list + required_car_parking_spaces +
## customer_type, family = "binomial", data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9832 -0.7698 -0.4160 0.1849 3.3571
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.489e+00 8.522e-02 -40.942 < 2e-16 ***
## hotelResort Hotel 7.376e-02 2.035e-02 3.624 0.00029 ***
## lead_time 3.730e-03 1.014e-04 36.801 < 2e-16 ***
## adr 6.714e-03 2.074e-04 32.380 < 2e-16 ***
## stays_nights 3.921e-02 3.492e-03 11.225 < 2e-16 ***
## adults 8.562e-02 1.569e-02 5.457 4.84e-08 ***
## children 1.717e-01 2.101e-02 8.172 3.02e-16 ***
## mealFB 5.297e-01 1.219e-01 4.344 1.40e-05 ***
## mealHB -3.375e-01 2.930e-02 -11.518 < 2e-16 ***
## mealSC 3.152e-01 2.612e-02 12.067 < 2e-16 ***
## total_of_special_requests -5.879e-01 1.216e-02 -48.338 < 2e-16 ***
## distribution_channelDirect -3.673e-01 5.555e-02 -6.613 3.78e-11 ***
## distribution_channelGDS -4.323e-01 2.073e-01 -2.086 0.03701 *
## distribution_channelTA/TO 4.601e-01 4.922e-02 9.350 < 2e-16 ***
## distribution_channelUndefined 1.387e+01 6.542e+01 0.212 0.83212
## is_repeated_guest1 -2.239e+00 1.159e-01 -19.319 < 2e-16 ***
## booking_changes1 -9.302e-01 3.331e-02 -27.931 < 2e-16 ***
## booking_changes2 -5.033e-01 5.123e-02 -9.824 < 2e-16 ***
## booking_changes3 -6.394e-01 1.105e-01 -5.789 7.10e-09 ***
## booking_changes4 -5.041e-01 1.653e-01 -3.050 0.00229 **
## booking_changes5 -2.258e-01 2.836e-01 -0.796 0.42600
## booking_changes6 -5.836e-02 3.261e-01 -0.179 0.85798
## booking_changes7 -1.440e+00 7.637e-01 -1.885 0.05938 .
## booking_changes8 -8.610e-01 8.303e-01 -1.037 0.29976
## booking_changes9 -1.649e+00 1.126e+00 -1.465 0.14284
## booking_changes10 -1.746e+01 3.013e+03 -0.006 0.99538
## booking_changes11 -1.603e+01 6.523e+03 -0.002 0.99804
## booking_changes12 -1.900e+01 6.523e+03 -0.003 0.99768
## booking_changes13 -1.790e+01 3.127e+03 -0.006 0.99543
## booking_changes14 -6.065e-01 1.239e+00 -0.490 0.62439
## booking_changes15 -1.821e+01 4.597e+03 -0.004 0.99684
## booking_changes17 -1.832e+01 6.523e+03 -0.003 0.99776
## booking_changes18 -1.755e+01 6.523e+03 -0.003 0.99785
## booking_changes20 -1.768e+01 6.523e+03 -0.003 0.99784
## booking_changes21 -1.818e+01 6.523e+03 -0.003 0.99778
## previous_cancellations1 4.465e+00 1.080e-01 41.344 < 2e-16 ***
## previous_cancellations2 2.877e+00 2.951e-01 9.749 < 2e-16 ***
## previous_cancellations3 2.978e+00 3.814e-01 7.809 5.74e-15 ***
## previous_cancellations4 3.748e+00 5.519e-01 6.790 1.12e-11 ***
## previous_cancellations5 2.614e+00 7.957e-01 3.285 0.00102 **
## previous_cancellations6 -6.496e-01 2.702e+00 -0.240 0.80998
## previous_cancellations11 1.649e+00 5.482e-01 3.007 0.00264 **
## previous_cancellations13 5.625e+00 1.070e+00 5.257 1.46e-07 ***
## previous_cancellations14 1.384e+01 1.967e+03 0.007 0.99439
## previous_cancellations19 1.351e+01 1.684e+03 0.008 0.99360
## previous_cancellations24 1.818e+01 7.532e+02 0.024 0.98074
## previous_cancellations25 1.332e+01 1.458e+03 0.009 0.99271
## previous_cancellations26 1.368e+01 1.423e+03 0.010 0.99233
## deposit_typeNon Refund 5.052e+00 1.233e-01 40.976 < 2e-16 ***
## deposit_typeRefundable 5.975e-01 2.307e-01 2.591 0.00958 **
## days_in_waiting_list -1.502e-03 5.385e-04 -2.789 0.00528 **
## required_car_parking_spaces -2.150e+01 9.104e+01 -0.236 0.81330
## customer_typeGroup 1.723e-01 1.920e-01 0.897 0.36969
## customer_typeTransient 1.515e+00 6.418e-02 23.599 < 2e-16 ***
## customer_typeTransient-Party 6.581e-01 6.698e-02 9.825 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125913 on 95508 degrees of freedom
## Residual deviance: 82405 on 95454 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 82515
##
## Number of Fisher Scoring iterations: 17
Next, we look at the model fitting criteria.
** Model deviance
Finalmodel$deviance
## [1] 82405.07
** We check the AIC of the final model.
AIC(Finalmodel)
## [1] 82515.07
** We also check the BIC of the final model.
BIC(Finalmodel)
## [1] 83035.76
A Receiver operating characteristic curve is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True positive rate and False positive rate. Below ROC curve for training data set is created.
#In-sample prediction
pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve
trainset <- trainset[-c(1),]
length(pred.glm0.train)==length(as.numeric(trainset$is_canceled)) #This checks if the lengths of pred.glm0.train and is_cancelled variable are equal to ensure the next function runs correctly.
## [1] TRUE
pred <- ROCR::prediction(pred.glm0.train, as.numeric(trainset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
Area Under the Curve (AUC) for training dataset was computed.
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8264023
Now that we have our final model, we will use it to predict on the testing data set.
#out-of-sample prediction
pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
Area under the curve (AUC) for out of sample prediction.
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8351158
Below are our observations from the project:
p <- hotels %>% group_by(arrival_date) %>%
count(is_canceled) %>%
ggplot(aes(x=arrival_date, y=n, color=is_canceled)) +
geom_line( ) +
geom_hline(aes(yintercept=mean(n, na.rm=TRUE)), linetype="dashed")+
scale_x_date(date_breaks= "2 month", date_labels = "%Y %b") +
theme(axis.text.x=element_text(angle=45, size=10, hjust=1)) +
ggtitle("") +
labs(x="", y="Booking", color="") +
scale_color_manual(values=c("#F8766D","#00BFC4"), labels=c("Not canceled", "canceled")) +
ggtitle("Interactive Daily Booking with/wo Cancellation")
ggplotly(p)