Hotel reservation cancellation is unavoidable but one of the major challenges in Hospitality industry because it directly affects a hotel’s profit margins. If a predictive model of cancellation is available in revenue management, it allows hotel managers to develop better cancellation policy, to predict demand accurately, to take actions proactively managing man-hours and resources accordingly, and to make those canceled rooms available for walk-ins or last minute bookings.
This data was obtained from Hotel Profit Management System (PMS), and it contains booking information due to arrive between 7/1/2015 and 8/31/2017 from one resort hotel and one city hotel located in Portugal.
The objectives of this project are
Statistical analyses and visualization were performed using R 3.6.3.
Below is the list of packages used in the project.
library(magrittr)
library(tidyverse)
library(car)
library(knitr)
library(caret)
library(ROCR)
library(scales)
library(pander)
library(plotly)
magrittr
is for running tidyverse functions with the pipe operator, %>%.tidyverse
is used for dplyr for data manipulation and ggplot2 package for creating plots all at once.car
is used for identifying outliers in a boxplot.knitr
is for displaying tables.caret
is for partitioning data.ROCR
is for creating ROC curve.Scales
is to demonstrate ggplot2 style scales for specific types of data.pander
is to produce simple tables from summary() output.plotly
is for creating interactive charts.The data was obtained at tidytuesday and detailed information can be found in the original publication (Nuno Antonio, Ana de Almeida, Luis Nunes, 2019). For finding real business data is uncommon, the authors intended to provide the hotel demand data to researchers and educators in revenue management, machine learning or data mining.
The hotels.csv
file was imported.
# Importing .csv files
hotels <- read.csv("data/hotels.csv")
Below is a brief snapshot of the data set.
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
The following is the table of variable description.
Variable | Type | Description |
---|---|---|
hotel | factor | Hotel (City Hotel or Resort Hotel) |
is_canceled | integer | Value indicating if the booking was canceled (1) or not (0) |
lead_time | integer | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
arrival_date_year | integer | Year of arrival date |
arrival_date_month | factor | Month of arrival date |
arrival_date_week_number | integer | Week number of year for arrival date |
arrival_date_day_of_month | integer | Day of arrival date |
stays_in_weekend_nights | integer | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
stays_in_week_nights | integer | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
adults | integer | Number of adults |
children | integer | Number of children |
babies | integer | Number of babies |
meal | factor | Type of meal booked. Undefined/SC - no meal package; BB - Bed & Breakfast; HB - Half board (breakfast and one other meal – usually dinner); FB - Full board (breakfast, lunch and dinner) |
country | factor | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
market_segment | factor | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
distribution_channel | factor | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
is_repeated_guest | integer | Value indicating if the booking name was from a repeated guest (1) or not (0) |
previous_cancellations | integer | Number of previous bookings that were cancelled by the customer prior to the current booking |
previous_bookings_not_canceled | integer | Number of previous bookings not cancelled by the customer prior to the current booking |
reserved_room_type | factor | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
assigned_room_type | factor | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
booking_changes | integer | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
deposit_type | factor | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay |
agent | factor | ID of the travel agency that made the booking |
company | factor | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
days_in_waiting_list | integer | Number of days the booking was in the waiting list before it was confirmed to the customer |
customer_type | factor | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
adr | numeric | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
required_car_parking_spaces | integer | Number of car parking spaces required by the customer |
total_of_special_requests | integer | Number of special requests made by the customer (e.g. twin bed or high floor) |
reservation_status | factor | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
reservation_status_date | factor | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
The hotel booking data is a data frame, and includes 119390 rows (observations) and 32 columns (variables).
# View the structure of hotels
str(hotels)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
Some integer variables have been switched to factors for the convenience in analysis.
hotels$arrival_date_year <- as.factor(hotels$arrival_date_year )
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
#label the levels of is_canceled variable
hotels$is_canceled <- factor(hotels$is_canceled, levels=c(0, 1),
labels=c("Not canceled", "Canceled"))
#The order for month was specified explicitly
hotels$arrival_date_month<- factor(hotels$arrival_date_month, levels =
c("January", "February", "March", "April", "May", "June", "July",
"August", "September", "October", "November", "December"))
The new variable “arrival_date” was created using existing month, day, and year variables.
#create new variable: arrival_date
hotels$arrival_date <- paste(hotels$arrival_date_month,
hotels$arrival_date_day_of_month,
hotels$arrival_date_year,sep="-")
hotels$arrival_date <-as.Date(hotels$arrival_date, format="%B-%d-%Y")
New variables of “stays_nights” and “total_guests” were also created using existing variables.
#create new variables: stays_nights and total_guests
hotels <- hotels %>%
dplyr::mutate(stays_nights = stays_in_weekend_nights + stays_in_week_nights,
total_guests = adults + children + babies) %>%
dplyr::select(-c(stays_in_weekend_nights, stays_in_week_nights, adults, children, babies,
arrival_date_week_number ))
The summary output below shows, there are 10650 records of “SC” (no meal package) and 1169 records of “Undefined” in meal
.
summary(hotels)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 Not canceled:75166 Min. : 0 2015:21996
## Resort Hotel:40060 Canceled :44224 1st Qu.: 18 2016:56707
## Median : 69 2017:40687
## Mean :104
## 3rd Qu.:160
## Max. :737
##
## arrival_date_month arrival_date_day_of_month meal country
## August :13877 Min. : 1.0 BB :92310 PRT :48590
## July :12661 1st Qu.: 8.0 FB : 798 GBR :12129
## May :11791 Median :16.0 HB :14463 FRA :10415
## October:11160 Mean :15.8 SC :10650 ESP : 8568
## April :11089 3rd Qu.:23.0 Undefined: 1169 DEU : 7287
## June :10939 Max. :31.0 ITA : 3766
## (Other):47873 (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 0:115580
## Offline TA/TO:24219 Direct :14645 1: 3810
## Groups :19811 GDS : 193
## Direct :12606 TA/TO :97870
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date arrival_date
## Canceled :43017 2015-10-21: 1461 Min. :2015-07-01
## Check-Out:75166 2015-07-06: 805 1st Qu.:2016-03-13
## No-Show : 1207 2016-11-25: 790 Median :2016-09-06
## 2015-01-01: 763 Mean :2016-08-28
## 2016-01-18: 625 3rd Qu.:2017-03-18
## 2015-07-02: 469 Max. :2017-08-31
## (Other) :114477
## stays_nights total_guests
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 2.000
## Median : 3.000 Median : 2.000
## Mean : 3.428 Mean : 1.968
## 3rd Qu.: 4.000 3rd Qu.: 2.000
## Max. :69.000 Max. :55.000
## NA's :4
The level of “Undefined” was combined to the level of “SC” according to the data description, and the new variable with 4 levels was created as meal_new
and dropped meal
.
hotels<- hotels %>%
dplyr::mutate(meal_new = fct_collapse(meal, SC = c("Undefined" , "SC"),
BB = "BB",
FB = "FB",
HB = "HB"),
meal_new = fct_relevel(meal_new, "FB", "HB", "BB", "SC"),
meal_new = fct_explicit_na(meal_new)) %>%
dplyr::select(-meal)
A boxplot was created to see the distribution of adr
(Average Daily Rate) and to investigate any existing outliers.
par(mfrow=c(1,1))
boxplot(hotels$adr, ylab ="adr", main="Distribution of Average Daily Rate (adr) \n and Presence of an Outlier")
The outlier was identified on the left side figure, and the adr
value from the row number of 48,516 was 5,400 which was extremely far from the rest. The record was replaced by the median of adr
, and new boxplot without the outlier has been displayed on the right side, and it looks more reasonable.
par(mfrow=c(1,2))
Boxplot(~adr, data=hotels, id=list(n=Inf), ylab="adr",
main="Identifying the Outlier" )
which(colnames(hotels) == "adr")
hotels[48516, 22] <- median(hotels$adr)
boxplot(hotels$adr, ylab="adr",
main="Boxplot of adr \n after Removing the Outlier")
There are almost no missing values except “total_guests”.
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_day_of_month
## 0 0
## country market_segment
## 0 0
## distribution_channel is_repeated_guest
## 0 0
## previous_cancellations previous_bookings_not_canceled
## 0 0
## reserved_room_type assigned_room_type
## 0 0
## booking_changes deposit_type
## 0 0
## agent company
## 0 0
## days_in_waiting_list customer_type
## 0 0
## adr required_car_parking_spaces
## 0 0
## total_of_special_requests reservation_status
## 0 0
## reservation_status_date arrival_date
## 0 0
## stays_nights total_guests
## 0 4
## meal_new
## 0
There are 4 missing values in “total_guests” and they were replaced by zero.
hotels$total_guests[is.na(hotels$total_guests)] <- 0
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_day_of_month
## 0 0
## country market_segment
## 0 0
## distribution_channel is_repeated_guest
## 0 0
## previous_cancellations previous_bookings_not_canceled
## 0 0
## reserved_room_type assigned_room_type
## 0 0
## booking_changes deposit_type
## 0 0
## agent company
## 0 0
## days_in_waiting_list customer_type
## 0 0
## adr required_car_parking_spaces
## 0 0
## total_of_special_requests reservation_status
## 0 0
## reservation_status_date arrival_date
## 0 0
## stays_nights total_guests
## 0 0
## meal_new
## 0
A summary for all variables in the final data is displayed in the table below.
hotel | is_canceled | lead_time | arrival_date_year |
---|---|---|---|
City Hotel :79330 | Not canceled:75166 | Min. : 0 | 2015:21996 |
Resort Hotel:40060 | Canceled :44224 | 1st Qu.: 18 | 2016:56707 |
Median : 69 | 2017:40687 | ||
Mean :104 | |||
3rd Qu.:160 | |||
Max. :737 |
arrival_date_month | arrival_date_day_of_month | country |
---|---|---|
August :13877 | Min. : 1.0 | PRT :48590 |
July :12661 | 1st Qu.: 8.0 | GBR :12129 |
May :11791 | Median :16.0 | FRA :10415 |
October:11160 | Mean :15.8 | ESP : 8568 |
April :11089 | 3rd Qu.:23.0 | DEU : 7287 |
June :10939 | Max. :31.0 | ITA : 3766 |
(Other):47873 | (Other):28635 |
market_segment | distribution_channel | is_repeated_guest |
---|---|---|
Online TA :56477 | Corporate: 6677 | 0:115580 |
Offline TA/TO:24219 | Direct :14645 | 1: 3810 |
Groups :19811 | GDS : 193 | |
Direct :12606 | TA/TO :97870 | |
Corporate : 5295 | Undefined: 5 | |
Complementary: 743 | ||
(Other) : 239 |
previous_cancellations | previous_bookings_not_canceled | reserved_room_type |
---|---|---|
Min. : 0.00000 | Min. : 0.0000 | A :85994 |
1st Qu.: 0.00000 | 1st Qu.: 0.0000 | D :19201 |
Median : 0.00000 | Median : 0.0000 | E : 6535 |
Mean : 0.08712 | Mean : 0.1371 | F : 2897 |
3rd Qu.: 0.00000 | 3rd Qu.: 0.0000 | G : 2094 |
Max. :26.00000 | Max. :72.0000 | B : 1118 |
(Other): 1551 |
assigned_room_type | booking_changes | deposit_type | agent |
---|---|---|---|
A :74053 | Min. : 0.0000 | No Deposit:104641 | 9 :31961 |
D :25322 | 1st Qu.: 0.0000 | Non Refund: 14587 | NULL :16340 |
E : 7806 | Median : 0.0000 | Refundable: 162 | 240 :13922 |
F : 3751 | Mean : 0.2211 | 1 : 7191 | |
G : 2553 | 3rd Qu.: 0.0000 | 14 : 3640 | |
C : 2375 | Max. :21.0000 | 7 : 3539 | |
(Other): 3530 | (Other):42797 |
company | days_in_waiting_list | customer_type | adr |
---|---|---|---|
NULL :112593 | Min. : 0.000 | Contract : 4076 | Min. : -6.38 |
40 : 927 | 1st Qu.: 0.000 | Group : 577 | 1st Qu.: 69.29 |
223 : 784 | Median : 0.000 | Transient :89613 | Median : 94.58 |
67 : 267 | Mean : 2.321 | Transient-Party:25124 | Mean : 101.83 |
45 : 250 | 3rd Qu.: 0.000 | 3rd Qu.: 126.00 | |
153 : 215 | Max. :391.000 | Max. :5400.00 | |
(Other): 4354 |
required_car_parking_spaces | total_of_special_requests | reservation_status |
---|---|---|
Min. : 0.00000 | Min. :0.0000 | Canceled :43017 |
1st Qu.: 0.00000 | 1st Qu.:0.0000 | Check-Out:75166 |
Median : 0.00000 | Median :0.0000 | No-Show : 1207 |
Mean : 0.06331 | Mean :0.5714 | |
3rd Qu.: 0.00000 | 3rd Qu.:1.0000 | |
Max. :94.57500 | Max. :5.0000 |
reservation_status_date | arrival_date | stays_nights | total_guests |
---|---|---|---|
2015-10-21: 1461 | Min. :2015-07-01 | Min. : 0.000 | Min. : 0.000 |
2015-07-06: 805 | 1st Qu.:2016-03-13 | 1st Qu.: 2.000 | 1st Qu.: 2.000 |
2016-11-25: 790 | Median :2016-09-06 | Median : 3.000 | Median : 2.000 |
2015-01-01: 763 | Mean :2016-08-28 | Mean : 3.428 | Mean : 1.968 |
2016-01-18: 625 | 3rd Qu.:2017-03-18 | 3rd Qu.: 4.000 | 3rd Qu.: 2.000 |
2015-07-02: 469 | Max. :2017-08-31 | Max. :69.000 | Max. :55.000 |
(Other) :114477 |
meal_new |
---|
FB: 798 |
HB:14463 |
BB:92310 |
SC:11819 |
The hotel booking data contains booking records from July 2015 to August 2017. Thus, there exist three years of records for July and August, but only two years of records for other months. Figure 1. is the barchart of total Bookings for two hotel. The city hotel has much more customers than the resort hotel. People made more reservations for staying in April, May, June, and September and October. On the other hand, less reservations were made for staying in November, December, and January.
hotels %>% ggplot(aes(x=arrival_date_month, fill=hotel)) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("azure4", "azure3"),labels=c("City Hotel", "Resort Hotel")) +
scale_y_continuous(name = "Bookings",labels = scales::comma) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(arrival_date_year ~ .) +
theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
scale_x_discrete(labels=c("Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec" )) +
labs(x="", y="Bookings") +
ggtitle("Figure 1. Bookings Data", subtitle="7/1/2015 - 8/31/2017")
The barchart in Figure 2. displays 37% of total bookings was canceled, and majority of cancellation was from the city hotel.
hotels %>% ggplot( aes(x=hotel,
fill=is_canceled)) +
geom_bar(position="dodge") +
geom_text(stat = "Count", aes(label=scales::percent(..count../sum(..count..))),position=position_dodge(0.9), vjust=1.5) +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
ggtitle("Figure 2. Two Hotels Booking with/without Cancellation")
Figure 3. displays that as the city hotels’ bookings without cancellation increase, canceled bookings also increase. However, many bookings without cancellation were in the months of July and August, but much less cancellation were made in those two months.
Both Bookings and Cancellations in Resort Hotel (bottom row) less fluctuates and stays fairly constant except Winter months. In November, December, and January, less cancellation were made comparing to the other months.
hotels %>% ggplot(aes(x=arrival_date_month, fill=is_canceled)) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
scale_y_continuous(name = "Bookings",labels = scales::comma) +
guides(fill=guide_legend(title=NULL)) +
facet_wrap(~hotel + arrival_date_year ) +
theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
scale_x_discrete(labels=c("Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec" )) +
labs(x="", y="Bookings") +
ggtitle("Figure 3. Bookings and cancellation: City vs Resort Hotel", subtitle="7/1/2015 - 8/31/2017")
Average booking, not total booking, was used for plotting. Since the duration of the data collection doesn’t allow equal numbers per month, average booking records per month was calculated. The following is a line chart of monthly average booking for two hotels.
monthly_booking_all<- hotels %>% group_by(arrival_date_year, arrival_date_month, hotel) %>%
count()
agg_monthly_booking_all2 <- monthly_booking_all %>%
group_by(arrival_date_month, hotel) %>%
summarize(avg_count=mean(n))
agg_monthly_booking_all2 %>% ggplot(aes(x=as.numeric(arrival_date_month), y=avg_count, color=hotel, linetype=hotel))+
geom_line(size=1) +
scale_color_manual(values=c("azure4", "azure3")) +
theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),labels=c("Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec" )) +
labs(x="", y="Bookings", title = "Figure 4. Monthly Average Booking")
With the same reason above, average canceled and not canceled bookings per month was calculated, and a line chart was plotted for booking with and without cancellation.
monthly_booking_all3<- hotels %>% group_by(arrival_date_year, arrival_date_month, is_canceled) %>%
count()
agg_monthly_booking_all4 <- monthly_booking_all3 %>%
group_by(arrival_date_month, is_canceled) %>%
summarize(avg_count=mean(n))
agg_monthly_booking_all4 %>% ggplot(aes(x=as.numeric(arrival_date_month), y=avg_count, color=is_canceled, linetype=is_canceled))+
geom_line(size=1) +
scale_color_manual(values=c("#80CDC1", "#C2A5CF")) +
theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),labels=c("Jan",
"Feb",
"Mar",
"Apr",
"May",
"Jun",
"Jul",
"Aug",
"Sep",
"Oct",
"Nov",
"Dec" )) +
labs(x="", y="Bookings", title = "Figure 5. Average Monthly Booking with/without cancellation")
The following table displays the average lead time
per month. July and September have two longest ‘lead time’, and it can be interpreted as people make a hotel reservation in advance for staying in July or September.
#Average Lead_time per month
monthly_lead_overall <- hotels %>% group_by(arrival_date_month) %>%
summarize(avg_lead_time=mean(lead_time))
pander(monthly_lead_overall, caption="Monthly Lead Time (days)")
arrival_date_month | avg_lead_time |
---|---|
January | 45.31 |
February | 47.86 |
March | 70.12 |
April | 94.82 |
May | 117.6 |
June | 128.2 |
July | 136.3 |
August | 121.1 |
September | 136.7 |
October | 123 |
November | 76.58 |
December | 73.56 |
Figure 6. shows average lead time per month.
hotels %>% group_by(arrival_date_month, is_canceled) %>%
summarize(avg_lead_time=mean(lead_time)) %>%
ggplot(aes(x=arrival_date_month, y=avg_lead_time, fill=is_canceled)) +
geom_col(position="dodge") +
geom_text(aes(label=round(avg_lead_time)), position=position_dodge(0.9), vjust=1.5, size=3) +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
labs(x="", y="Lead Time (days)", title = "Figure 6. Average Lead time per month")
## `summarise()` regrouping output by 'arrival_date_month' (override with `.groups` argument)
Figure 7. shows lead time for a canceled booking was much longer than that of a booking without cancellation.
hotels %>% ggplot(aes( x = hotel, y = lead_time, fill = is_canceled)) +
geom_boxplot(position = position_dodge()) +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
labs( title = "Figure 7. Lead time and Cancellation", x = "", y = "Lead Time (days)")
There are four categories of customer_type
. As shown in Data Summary earlier, majorities are Transient
and Transient-Party
. Almost all types of customer, city hotel bookings were more frequently canceled than resort hotel bookings, except Group
customer. For Group
customer, there is no significant difference between two hotels.
# hotel and cancellation - facet: customer_type
hotels %>% ggplot( aes(x=hotel, fill=is_canceled)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(. ~ customer_type) +
theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
labs(x="", y="Proportion", title = "Figure 8. Cancelation and Customer Type ")
There are three deposit_type
: No Deposit
, Non-Refundable
, and Refundable
. As shown in Data Summary, majority type was No Deposit
, followed by Non-refundable
. A Non-refundable
booking is likely to be canceled regardless of hotel types. For bookings with No deposit
, the proportion of cancellation in city hotel is slightly higher, but pretty close to that in resort hotel. For Refundable
bookings, significantly more cancellations were made in city hotel.
# Deposit type
hotels%>% ggplot( aes(x=hotel, fill=is_canceled)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(. ~ deposit_type) +
theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
labs(x="", y="Proportion", title = "Figure 9. Cancelation and Deposit Type ")
Out of 8 market_segment
, Within Groups market, City hotel bookings are likely to be canceled than Resort Hotel. Similary, within Offline TA/To, City hotel bookings are likely to be canceled than Resort Hotel. However, for Online TA, Corporate and Direct Market, there are no significant difference between Two hotels.
# hotel and cancellation - facet: market_segment
hotels %>% ggplot(aes(x=hotel, fill=is_canceled)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(. ~ market_segment) +
theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
labs(x="", y="Proportion", title = "Figure 10. Cancelation and Market Segment ")
## $x
## [1] ""
##
## $y
## [1] "Proportion"
##
## $title
## [1] "Figure 10. Cancelation and Market Segment "
##
## attr(,"class")
## [1] "labels"
hotels %>% filter(market_segment %in% c( "complementary", "Corporate", "Direct", "Groups", "Offline TA/TO", "Online TA")) %>%
ggplot(aes(x=hotel, fill=is_canceled)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
guides(fill=guide_legend(title=NULL)) +
facet_grid(. ~ market_segment) +
theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))+
labs(x="", y="Proportion", title = "Figure 10. Cancelation and Market Segment ")
The data was Randomly splitted to training (80%) and testing (20%) datasets.
index <- createDataPartition(hotels$is_canceled, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- hotels[-index,]
# select 80% of data to train the models
trainset <- hotels[index,]
Let’s fit the full model with some explanatory variables, after excluding some less interesting variables such as date, month, year, country etc.
fullmodel <- glm(is_canceled ~ hotel + lead_time + adr + stays_nights + total_guests +
meal_new + total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + days_in_waiting_list + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
Let’s find the best model using backward elimination technique.
# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')
## Start: AIC=85891.19
## is_canceled ~ hotel + lead_time + adr + stays_nights + total_guests +
## meal_new + total_of_special_requests + distribution_channel +
## is_repeated_guest + previous_cancellations + booking_changes +
## deposit_type + days_in_waiting_list + required_car_parking_spaces +
## customer_type
##
## Df Deviance AIC
## <none> 85843 85891
## - hotel 1 85853 85899
## - days_in_waiting_list 1 85863 85909
## - total_guests 1 85935 85981
## - stays_nights 1 85949 85995
## - is_repeated_guest 1 86158 86204
## - meal_new 3 86180 86222
## - booking_changes 1 86423 86469
## - adr 1 86798 86844
## - distribution_channel 4 86810 86850
## - customer_type 3 87388 87430
## - lead_time 1 87394 87440
## - previous_cancellations 1 87694 87740
## - total_of_special_requests 1 88648 88694
## - required_car_parking_spaces 1 88911 88957
## - deposit_type 2 95056 95100
#Remove hotel and days_in_waiting_list, Lower AIC or BOC value indicates a better fit.
Finalmodel <- glm(is_canceled ~ lead_time + adr + stays_nights + total_guests +
meal_new + total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
summary(Finalmodel)
##
## Call:
## glm(formula = is_canceled ~ lead_time + adr + stays_nights +
## total_guests + meal_new + total_of_special_requests + distribution_channel +
## is_repeated_guest + previous_cancellations + booking_changes +
## deposit_type + required_car_parking_spaces + customer_type,
## family = "binomial", data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.3868 -0.7851 -0.4586 0.2277 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.652e+00 1.401e-01 -18.928 < 2e-16 ***
## lead_time 3.815e-03 9.702e-05 39.319 < 2e-16 ***
## adr 6.336e-03 1.995e-04 31.761 < 2e-16 ***
## stays_nights 3.874e-02 3.322e-03 11.663 < 2e-16 ***
## total_guests 1.191e-01 1.355e-02 8.790 < 2e-16 ***
## meal_newHB -8.712e-01 1.201e-01 -7.252 4.11e-13 ***
## meal_newBB -4.915e-01 1.181e-01 -4.161 3.17e-05 ***
## meal_newSC -2.367e-01 1.204e-01 -1.966 0.0493 *
## total_of_special_requests -5.962e-01 1.192e-02 -50.001 < 2e-16 ***
## distribution_channelDirect -3.599e-01 5.340e-02 -6.739 1.60e-11 ***
## distribution_channelGDS -5.194e-01 2.172e-01 -2.391 0.0168 *
## distribution_channelTA/TO 4.922e-01 4.687e-02 10.500 < 2e-16 ***
## distribution_channelUndefined 1.284e+01 5.124e+01 0.251 0.8021
## is_repeated_guest1 -1.304e+00 8.129e-02 -16.043 < 2e-16 ***
## previous_cancellations 1.786e+00 4.942e-02 36.146 < 2e-16 ***
## booking_changes -3.703e-01 1.680e-02 -22.043 < 2e-16 ***
## deposit_typeNon Refund 5.028e+00 1.200e-01 41.905 < 2e-16 ***
## deposit_typeRefundable 3.295e-01 2.343e-01 1.406 0.1597
## required_car_parking_spaces -3.208e+00 1.068e-01 -30.035 < 2e-16 ***
## customer_typeGroup -1.081e-01 1.806e-01 -0.599 0.5495
## customer_typeTransient 1.133e+00 5.424e-02 20.896 < 2e-16 ***
## customer_typeTransient-Party 3.276e-01 5.686e-02 5.762 8.30e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125920 on 95512 degrees of freedom
## Residual deviance: 85874 on 95491 degrees of freedom
## AIC: 85918
##
## Number of Fisher Scoring iterations: 8
The following displays Model fitting Criteria. * Model deviance
Finalmodel$deviance
## [1] 85873.65
AIC(Finalmodel)
## [1] 85917.65
BIC(Finalmodel)
## [1] 86125.93
In-sample prediction and Confusion Matrix with cut-off probability of 0.5 were calculated.
#Setting cut-off probability=0.5
table(predict(Finalmodel,type="response") > 0.5)
##
## FALSE TRUE
## 73438 22075
#confusion matrix
pred_prob=predict(Finalmodel, data=trainset, type="response")
pred_value=1*(pred_prob>0.5)
actual_value <-trainset$is_canceled
confusion_matrix <- table(actual_value, pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## Not canceled 56897 3236
## Canceled 16541 18839
#misclasscification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate #0.21
## [1] 0.2070608
A ROC curve is a graph showing the performance of a classification model at all classsification thresholds. ROC curve for training data set was created.
#In-sample prediction
pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(trainset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
And Area Under the Curve(AUC) for training dataset was computed.
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8356457
Similarly, out-of-sample prediction was evaluated and ROC curve for testing data set was created.
The Receiver Operating Characteristic (ROC) curve was created.
#out-of-sample prediction
pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
AUC was calculated.
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8352637
Here are some data-driven insights from this project.
Group
customer.Non-refundable
booking is likely to be canceled regardless of hotel types.No deposit
, the proportion of cancellation is similar between two hotels. - For Refundable
bookings, significantly more cancellations were made in city hotel.p <- hotels %>% group_by(arrival_date) %>%
count(is_canceled) %>%
ggplot(aes(x=arrival_date, y=n, color=is_canceled)) +
geom_line( ) +
geom_hline(aes(yintercept=mean(n, na.rm=TRUE)), linetype="dashed")+
scale_x_date(date_breaks= "2 month", date_labels = "%Y %b") +
theme(axis.text.x=element_text(angle=45, size=10, hjust=1)) +
ggtitle("") +
labs(x="", y="Booking", color="") +
scale_color_manual(values=c("#80CDC1", "#C2A5CF"), labels=c("Not canceled", "canceled")) +
ggtitle("Interactive Daily Booking with/wo Cancellation")
ggplotly(p)