Hotel Booking Data Analysis

Introduction

Hotel reservation cancellation is unavoidable but one of the major challenges in Hospitality industry because it directly affects a hotel’s profit margins. If a predictive model of cancellation is available in revenue management, it allows hotel managers to develop better cancellation policy, to predict demand accurately, to take actions proactively managing man-hours and resources accordingly, and to make those canceled rooms available for walk-ins or last minute bookings.

This data was obtained from Hotel Profit Management System (PMS), and it contains booking information due to arrive between 7/1/2015 and 8/31/2017 from one resort hotel and one city hotel located in Portugal.

The objectives of this project are

  • to explore the data and identify attributes having some relationship with canceling a hotel booking,
  • to produce compelling visualizations to gain insights from real hospitality industry data, and
  • to develop a predictive model of bookings with and without cancellation using logistic regression.

Statistical analyses and visualization were performed using R 3.6.3.

Packages Required

Below is the list of packages used in the project.

library(magrittr)   
library(tidyverse)  
library(car)
library(knitr)
library(caret)
library(ROCR)
library(scales)
library(pander)
library(plotly)
  • Package magrittr is for running tidyverse functions with the pipe operator, %>%.
  • Package tidyverse is used for dplyr for data manipulation and ggplot2 package for creating plots all at once.
  • Package car is used for identifying outliers in a boxplot.
  • Package knitr is for displaying tables.
  • Package caret is for partitioning data.
  • Package ROCR is for creating ROC curve.
  • Package Scales is to demonstrate ggplot2 style scales for specific types of data.
  • Package pander is to produce simple tables from summary() output.
  • Package plotly is for creating interactive charts.

Data Preparation

The data was obtained at tidytuesday and detailed information can be found in the original publication (Nuno Antonio, Ana de Almeida, Luis Nunes, 2019). For finding real business data is uncommon, the authors intended to provide the hotel demand data to researchers and educators in revenue management, machine learning or data mining.

Import Data

The hotels.csv file was imported.

# Importing .csv files 
hotels <- read.csv("data/hotels.csv") 

Below is a brief snapshot of the data set.

Table 1: Hotel Booking data
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Resort Hotel 0 342 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 737 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 7 2015 July 27 1 0 1 1 0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NULL NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 13 2015 July 27 1 0 1 1 0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304 NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 0 2015 July 27 1 0 2 2 0 0 BB PRT Direct Direct 0 0 0 C C 0 No Deposit NULL NULL 0 Transient 107.0 0 0 Check-Out 2015-07-03
Resort Hotel 0 9 2015 July 27 1 0 2 2 0 0 FB PRT Direct Direct 0 0 0 C C 0 No Deposit 303 NULL 0 Transient 103.0 0 1 Check-Out 2015-07-03
Resort Hotel 1 85 2015 July 27 1 0 3 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 82.0 0 1 Canceled 2015-05-06
Resort Hotel 1 75 2015 July 27 1 0 3 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 15 NULL 0 Transient 105.5 0 0 Canceled 2015-04-22

Data Dictionary

The following is the table of variable description.

Table 2: Data Dictionary
Variable Type Description
hotel factor Hotel (City Hotel or Resort Hotel)
is_canceled integer Value indicating if the booking was canceled (1) or not (0)
lead_time integer Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year integer Year of arrival date
arrival_date_month factor Month of arrival date
arrival_date_week_number integer Week number of year for arrival date
arrival_date_day_of_month integer Day of arrival date
stays_in_weekend_nights integer Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights integer Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults integer Number of adults
children integer Number of children
babies integer Number of babies
meal factor Type of meal booked. Undefined/SC - no meal package; BB - Bed & Breakfast; HB - Half board (breakfast and one other meal – usually dinner); FB - Full board (breakfast, lunch and dinner)
country factor Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment factor Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel factor Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest integer Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations integer Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled integer Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type factor Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type factor Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes integer Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type factor Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay
agent factor ID of the travel agency that made the booking
company factor ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list integer Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type factor Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr numeric Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces integer Number of car parking spaces required by the customer
total_of_special_requests integer Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status factor Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date factor Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Structure of Data

The hotel booking data is a data frame, and includes 119390 rows (observations) and 32 columns (variables).

# View the structure of hotels
str(hotels)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ country                       : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
##  $ market_segment                : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
##  $ assigned_room_type            : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ agent                         : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
##  $ company                       : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ reservation_status_date       : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...

Some integer variables have been switched to factors for the convenience in analysis.

hotels$arrival_date_year   <- as.factor(hotels$arrival_date_year )
hotels$is_repeated_guest   <- as.factor(hotels$is_repeated_guest)

#label the levels of is_canceled variable 
hotels$is_canceled  <- factor(hotels$is_canceled, levels=c(0, 1), 
                              labels=c("Not canceled", "Canceled"))

#The order for month was specified explicitly
hotels$arrival_date_month<- factor(hotels$arrival_date_month, levels = 
                                     c("January", "February", "March", "April", "May", "June", "July",
                                       "August", "September", "October", "November", "December"))

Data Manipulation

The new variable “arrival_date” was created using existing month, day, and year variables.

#create new variable: arrival_date
hotels$arrival_date <- paste(hotels$arrival_date_month, 
                            hotels$arrival_date_day_of_month,
                            hotels$arrival_date_year,sep="-")
hotels$arrival_date <-as.Date(hotels$arrival_date, format="%B-%d-%Y")

New variables of “stays_nights” and “total_guests” were also created using existing variables.

#create new variables: stays_nights and total_guests
hotels <- hotels %>% 
          dplyr::mutate(stays_nights = stays_in_weekend_nights + stays_in_week_nights,
                  total_guests = adults + children + babies) %>% 
          dplyr::select(-c(stays_in_weekend_nights, stays_in_week_nights, adults, children, babies,
                        arrival_date_week_number )) 

The summary output below shows, there are 10650 records of “SC” (no meal package) and 1169 records of “Undefined” in meal.

summary(hotels)
##           hotel             is_canceled      lead_time   arrival_date_year
##  City Hotel  :79330   Not canceled:75166   Min.   :  0   2015:21996       
##  Resort Hotel:40060   Canceled    :44224   1st Qu.: 18   2016:56707       
##                                            Median : 69   2017:40687       
##                                            Mean   :104                    
##                                            3rd Qu.:160                    
##                                            Max.   :737                    
##                                                                           
##  arrival_date_month arrival_date_day_of_month        meal          country     
##  August :13877      Min.   : 1.0              BB       :92310   PRT    :48590  
##  July   :12661      1st Qu.: 8.0              FB       :  798   GBR    :12129  
##  May    :11791      Median :16.0              HB       :14463   FRA    :10415  
##  October:11160      Mean   :15.8              SC       :10650   ESP    : 8568  
##  April  :11089      3rd Qu.:23.0              Undefined: 1169   DEU    : 7287  
##  June   :10939      Max.   :31.0                                ITA    : 3766  
##  (Other):47873                                                  (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      0:115580         
##  Offline TA/TO:24219   Direct   :14645      1:  3810         
##  Groups       :19811   GDS      :  193                       
##  Direct       :12606   TA/TO    :97870                       
##  Corporate    : 5295   Undefined:    5                       
##  Complementary:  743                                         
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1118     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes       deposit_type        agent      
##  A      :74053      Min.   : 0.0000   No Deposit:104641   9      :31961  
##  D      :25322      1st Qu.: 0.0000   Non Refund: 14587   NULL   :16340  
##  E      : 7806      Median : 0.0000   Refundable:   162   240    :13922  
##  F      : 3751      Mean   : 0.2211                       1      : 7191  
##  G      : 2553      3rd Qu.: 0.0000                       14     : 3640  
##  C      : 2375      Max.   :21.0000                       7      : 3539  
##  (Other): 3530                                            (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  NULL   :112593   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89613  
##  67     :   267   Mean   :  2.321      Transient-Party:25124  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date  arrival_date       
##  Canceled :43017    2015-10-21:  1461       Min.   :2015-07-01  
##  Check-Out:75166    2015-07-06:   805       1st Qu.:2016-03-13  
##  No-Show  : 1207    2016-11-25:   790       Median :2016-09-06  
##                     2015-01-01:   763       Mean   :2016-08-28  
##                     2016-01-18:   625       3rd Qu.:2017-03-18  
##                     2015-07-02:   469       Max.   :2017-08-31  
##                     (Other)   :114477                           
##   stays_nights     total_guests   
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 2.000  
##  Mean   : 3.428   Mean   : 1.968  
##  3rd Qu.: 4.000   3rd Qu.: 2.000  
##  Max.   :69.000   Max.   :55.000  
##                   NA's   :4

The level of “Undefined” was combined to the level of “SC” according to the data description, and the new variable with 4 levels was created as meal_new and dropped meal.

hotels<- hotels %>% 
         dplyr::mutate(meal_new = fct_collapse(meal, SC = c("Undefined" , "SC"),
                                       BB = "BB",
                                       FB = "FB",
                                       HB = "HB"),
         meal_new = fct_relevel(meal_new, "FB", "HB", "BB", "SC"),
         meal_new = fct_explicit_na(meal_new)) %>% 
        dplyr::select(-meal) 

Outliers

A boxplot was created to see the distribution of adr (Average Daily Rate) and to investigate any existing outliers.

par(mfrow=c(1,1))
boxplot(hotels$adr, ylab ="adr", main="Distribution of Average Daily Rate (adr) \n and Presence of an Outlier") 

The outlier was identified on the left side figure, and the adr value from the row number of 48,516 was 5,400 which was extremely far from the rest. The record was replaced by the median of adr, and new boxplot without the outlier has been displayed on the right side, and it looks more reasonable.

par(mfrow=c(1,2))
Boxplot(~adr, data=hotels, id=list(n=Inf), ylab="adr",
         main="Identifying the Outlier" ) 
which(colnames(hotels) == "adr")
hotels[48516, 22] <- median(hotels$adr)
boxplot(hotels$adr, ylab="adr",
         main="Boxplot of adr \n after Removing the Outlier")

Missing Values

There are almost no missing values except “total_guests”.

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month      arrival_date_day_of_month 
##                              0                              0 
##                        country                 market_segment 
##                              0                              0 
##           distribution_channel              is_repeated_guest 
##                              0                              0 
##         previous_cancellations previous_bookings_not_canceled 
##                              0                              0 
##             reserved_room_type             assigned_room_type 
##                              0                              0 
##                booking_changes                   deposit_type 
##                              0                              0 
##                          agent                        company 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##                            adr    required_car_parking_spaces 
##                              0                              0 
##      total_of_special_requests             reservation_status 
##                              0                              0 
##        reservation_status_date                   arrival_date 
##                              0                              0 
##                   stays_nights                   total_guests 
##                              0                              4 
##                       meal_new 
##                              0

There are 4 missing values in “total_guests” and they were replaced by zero.

hotels$total_guests[is.na(hotels$total_guests)]  <- 0
colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month      arrival_date_day_of_month 
##                              0                              0 
##                        country                 market_segment 
##                              0                              0 
##           distribution_channel              is_repeated_guest 
##                              0                              0 
##         previous_cancellations previous_bookings_not_canceled 
##                              0                              0 
##             reserved_room_type             assigned_room_type 
##                              0                              0 
##                booking_changes                   deposit_type 
##                              0                              0 
##                          agent                        company 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##                            adr    required_car_parking_spaces 
##                              0                              0 
##      total_of_special_requests             reservation_status 
##                              0                              0 
##        reservation_status_date                   arrival_date 
##                              0                              0 
##                   stays_nights                   total_guests 
##                              0                              0 
##                       meal_new 
##                              0

Data Summary

A summary for all variables in the final data is displayed in the table below.

Table continues below
hotel is_canceled lead_time arrival_date_year
City Hotel :79330 Not canceled:75166 Min. : 0 2015:21996
Resort Hotel:40060 Canceled :44224 1st Qu.: 18 2016:56707
Median : 69 2017:40687
Mean :104
3rd Qu.:160
Max. :737
Table continues below
arrival_date_month arrival_date_day_of_month country
August :13877 Min. : 1.0 PRT :48590
July :12661 1st Qu.: 8.0 GBR :12129
May :11791 Median :16.0 FRA :10415
October:11160 Mean :15.8 ESP : 8568
April :11089 3rd Qu.:23.0 DEU : 7287
June :10939 Max. :31.0 ITA : 3766
(Other):47873 (Other):28635
Table continues below
market_segment distribution_channel is_repeated_guest
Online TA :56477 Corporate: 6677 0:115580
Offline TA/TO:24219 Direct :14645 1: 3810
Groups :19811 GDS : 193
Direct :12606 TA/TO :97870
Corporate : 5295 Undefined: 5
Complementary: 743
(Other) : 239
Table continues below
previous_cancellations previous_bookings_not_canceled reserved_room_type
Min. : 0.00000 Min. : 0.0000 A :85994
1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
Median : 0.00000 Median : 0.0000 E : 6535
Mean : 0.08712 Mean : 0.1371 F : 2897
3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
Max. :26.00000 Max. :72.0000 B : 1118
(Other): 1551
Table continues below
assigned_room_type booking_changes deposit_type agent
A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
E : 7806 Median : 0.0000 Refundable: 162 240 :13922
F : 3751 Mean : 0.2211 1 : 7191
G : 2553 3rd Qu.: 0.0000 14 : 3640
C : 2375 Max. :21.0000 7 : 3539
(Other): 3530 (Other):42797
Table continues below
company days_in_waiting_list customer_type adr
NULL :112593 Min. : 0.000 Contract : 4076 Min. : -6.38
40 : 927 1st Qu.: 0.000 Group : 577 1st Qu.: 69.29
223 : 784 Median : 0.000 Transient :89613 Median : 94.58
67 : 267 Mean : 2.321 Transient-Party:25124 Mean : 101.83
45 : 250 3rd Qu.: 0.000 3rd Qu.: 126.00
153 : 215 Max. :391.000 Max. :5400.00
(Other): 4354
Table continues below
required_car_parking_spaces total_of_special_requests reservation_status
Min. : 0.00000 Min. :0.0000 Canceled :43017
1st Qu.: 0.00000 1st Qu.:0.0000 Check-Out:75166
Median : 0.00000 Median :0.0000 No-Show : 1207
Mean : 0.06331 Mean :0.5714
3rd Qu.: 0.00000 3rd Qu.:1.0000
Max. :94.57500 Max. :5.0000
Table continues below
reservation_status_date arrival_date stays_nights total_guests
2015-10-21: 1461 Min. :2015-07-01 Min. : 0.000 Min. : 0.000
2015-07-06: 805 1st Qu.:2016-03-13 1st Qu.: 2.000 1st Qu.: 2.000
2016-11-25: 790 Median :2016-09-06 Median : 3.000 Median : 2.000
2015-01-01: 763 Mean :2016-08-28 Mean : 3.428 Mean : 1.968
2016-01-18: 625 3rd Qu.:2017-03-18 3rd Qu.: 4.000 3rd Qu.: 2.000
2015-07-02: 469 Max. :2017-08-31 Max. :69.000 Max. :55.000
(Other) :114477
meal_new
FB: 798
HB:14463
BB:92310
SC:11819

Exploratory Data Analysis

Cancellation by Month

The hotel booking data contains booking records from July 2015 to August 2017. Thus, there exist three years of records for July and August, but only two years of records for other months. Figure 1. is the barchart of total Bookings for two hotel. The city hotel has much more customers than the resort hotel. People made more reservations for staying in April, May, June, and September and October. On the other hand, less reservations were made for staying in November, December, and January.

hotels %>% ggplot(aes(x=arrival_date_month, fill=hotel)) + 
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("azure4", "azure3"),labels=c("City Hotel", "Resort Hotel")) +
  scale_y_continuous(name = "Bookings",labels = scales::comma) +
  guides(fill=guide_legend(title=NULL))  +
  facet_grid(arrival_date_year ~ .) + 
  theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
  scale_x_discrete(labels=c("Jan", 
                            "Feb",
                            "Mar",
                            "Apr",
                            "May",
                            "Jun",
                            "Jul",
                            "Aug",
                            "Sep",
                            "Oct",
                            "Nov",
                            "Dec" )) +
  labs(x="", y="Bookings") +
  ggtitle("Figure 1. Bookings Data", subtitle="7/1/2015 - 8/31/2017") 

The barchart in Figure 2. displays 37% of total bookings was canceled, and majority of cancellation was from the city hotel.

hotels %>% ggplot( aes(x=hotel, 
                       fill=is_canceled)) + 
  geom_bar(position="dodge") +
  geom_text(stat = "Count", aes(label=scales::percent(..count../sum(..count..))),position=position_dodge(0.9), vjust=1.5) +
  scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
  guides(fill=guide_legend(title=NULL))  +
  ggtitle("Figure 2. Two Hotels Booking with/without Cancellation") 

Figure 3. displays that as the city hotels’ bookings without cancellation increase, canceled bookings also increase. However, many bookings without cancellation were in the months of July and August, but much less cancellation were made in those two months.
Both Bookings and Cancellations in Resort Hotel (bottom row) less fluctuates and stays fairly constant except Winter months. In November, December, and January, less cancellation were made comparing to the other months.

hotels %>% ggplot(aes(x=arrival_date_month, fill=is_canceled)) + 
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
  scale_y_continuous(name = "Bookings",labels = scales::comma) +
  guides(fill=guide_legend(title=NULL))  +
  facet_wrap(~hotel + arrival_date_year ) + 
  theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
  scale_x_discrete(labels=c("Jan", 
                            "Feb",
                            "Mar",
                            "Apr",
                            "May",
                            "Jun",
                            "Jul",
                            "Aug",
                            "Sep",
                            "Oct",
                            "Nov",
                            "Dec" )) +
  labs(x="", y="Bookings") +
  ggtitle("Figure 3. Bookings and cancellation: City vs Resort Hotel", subtitle="7/1/2015 - 8/31/2017") 

Average booking, not total booking, was used for plotting. Since the duration of the data collection doesn’t allow equal numbers per month, average booking records per month was calculated. The following is a line chart of monthly average booking for two hotels.

monthly_booking_all<- hotels %>%  group_by(arrival_date_year, arrival_date_month, hotel) %>% 
  count()

agg_monthly_booking_all2 <- monthly_booking_all %>% 
  group_by(arrival_date_month, hotel) %>% 
  summarize(avg_count=mean(n))
agg_monthly_booking_all2 %>% ggplot(aes(x=as.numeric(arrival_date_month), y=avg_count, color=hotel, linetype=hotel))+
  geom_line(size=1) + 
  scale_color_manual(values=c("azure4", "azure3")) +
  theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
  scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),labels=c("Jan", 
                                                                              "Feb",
                                                                              "Mar",
                                                                              "Apr",
                                                                              "May",
                                                                              "Jun",
                                                                              "Jul",
                                                                              "Aug",
                                                                              "Sep",
                                                                              "Oct",
                                                                              "Nov",
                                                                              "Dec" )) +
  labs(x="", y="Bookings", title = "Figure 4. Monthly Average Booking") 

With the same reason above, average canceled and not canceled bookings per month was calculated, and a line chart was plotted for booking with and without cancellation.

monthly_booking_all3<- hotels %>%  group_by(arrival_date_year, arrival_date_month, is_canceled) %>% 
  count()

agg_monthly_booking_all4 <- monthly_booking_all3 %>% 
  group_by(arrival_date_month, is_canceled) %>% 
  summarize(avg_count=mean(n))
agg_monthly_booking_all4 %>% ggplot(aes(x=as.numeric(arrival_date_month), y=avg_count, color=is_canceled, linetype=is_canceled))+
  geom_line(size=1) + 
  scale_color_manual(values=c("#80CDC1", "#C2A5CF")) +
  theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
  scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),labels=c("Jan", 
                                                                              "Feb",
                                                                              "Mar",
                                                                              "Apr",
                                                                              "May",
                                                                              "Jun",
                                                                              "Jul",
                                                                              "Aug",
                                                                              "Sep",
                                                                              "Oct",
                                                                              "Nov",
                                                                              "Dec" )) +
  labs(x="", y="Bookings", title = "Figure 5. Average Monthly Booking with/without cancellation") 

Lead Time

The following table displays the average lead time per month. July and September have two longest ‘lead time’, and it can be interpreted as people make a hotel reservation in advance for staying in July or September.

#Average Lead_time per month
monthly_lead_overall <- hotels %>%  group_by(arrival_date_month) %>% 
  summarize(avg_lead_time=mean(lead_time))
pander(monthly_lead_overall, caption="Monthly Lead Time (days)")
Monthly Lead Time (days)
arrival_date_month avg_lead_time
January 45.31
February 47.86
March 70.12
April 94.82
May 117.6
June 128.2
July 136.3
August 121.1
September 136.7
October 123
November 76.58
December 73.56

Figure 6. shows average lead time per month.

hotels %>%  group_by(arrival_date_month, is_canceled) %>% 
  summarize(avg_lead_time=mean(lead_time)) %>% 
  ggplot(aes(x=arrival_date_month, y=avg_lead_time, fill=is_canceled)) +
  geom_col(position="dodge") + 
  geom_text(aes(label=round(avg_lead_time)), position=position_dodge(0.9), vjust=1.5, size=3) +
  scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
  guides(fill=guide_legend(title=NULL)) +
  theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
  labs(x="", y="Lead Time (days)", title = "Figure 6. Average Lead time per month")
## `summarise()` regrouping output by 'arrival_date_month' (override with `.groups` argument)

Figure 7. shows lead time for a canceled booking was much longer than that of a booking without cancellation.

hotels %>%  ggplot(aes( x = hotel, y = lead_time, fill = is_canceled)) +
geom_boxplot(position = position_dodge()) +
scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +  
labs( title = "Figure 7. Lead time and Cancellation", x = "", y = "Lead Time (days)") 

Types of Customer and Deposit

There are four categories of customer_type. As shown in Data Summary earlier, majorities are Transient and Transient-Party. Almost all types of customer, city hotel bookings were more frequently canceled than resort hotel bookings, except Group customer. For Group customer, there is no significant difference between two hotels.

# hotel and cancellation - facet: customer_type
hotels %>% ggplot( aes(x=hotel, fill=is_canceled)) + 
           geom_bar(position="fill") +
           scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
           guides(fill=guide_legend(title=NULL)) +
           facet_grid(. ~ customer_type) + 
           theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
           labs(x="", y="Proportion", title = "Figure 8. Cancelation and Customer Type ")

There are three deposit_type: No Deposit, Non-Refundable, and Refundable. As shown in Data Summary, majority type was No Deposit, followed by Non-refundable. A Non-refundable booking is likely to be canceled regardless of hotel types. For bookings with No deposit, the proportion of cancellation in city hotel is slightly higher, but pretty close to that in resort hotel. For Refundable bookings, significantly more cancellations were made in city hotel.

# Deposit type
hotels%>% ggplot( aes(x=hotel, fill=is_canceled)) + 
  geom_bar(position="fill") +
  scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
  guides(fill=guide_legend(title=NULL)) +
  facet_grid(. ~ deposit_type) + 
  theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
  labs(x="", y="Proportion", title = "Figure 9. Cancelation and Deposit Type ")

Out of 8 market_segment, Within Groups market, City hotel bookings are likely to be canceled than Resort Hotel. Similary, within Offline TA/To, City hotel bookings are likely to be canceled than Resort Hotel. However, for Online TA, Corporate and Direct Market, there are no significant difference between Two hotels.

# hotel and cancellation - facet: market_segment
hotels %>% ggplot(aes(x=hotel, fill=is_canceled)) + 
  geom_bar(position="fill") +
  scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
  guides(fill=guide_legend(title=NULL)) +
  facet_grid(. ~ market_segment) + 
  theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

  labs(x="", y="Proportion", title = "Figure 10. Cancelation and Market Segment ")
## $x
## [1] ""
## 
## $y
## [1] "Proportion"
## 
## $title
## [1] "Figure 10. Cancelation and Market Segment "
## 
## attr(,"class")
## [1] "labels"
hotels %>% filter(market_segment %in% c( "complementary", "Corporate", "Direct", "Groups", "Offline TA/TO", "Online TA")) %>% 
    ggplot(aes(x=hotel, fill=is_canceled)) + 
    geom_bar(position="fill") +
    scale_fill_manual(values=c("#80CDC1", "#C2A5CF"),labels=c("not canceled", "canceled")) +
    guides(fill=guide_legend(title=NULL)) +
    facet_grid(. ~ market_segment) + 
    theme(legend.position="bottom", axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))+
  labs(x="", y="Proportion", title = "Figure 10. Cancelation and Market Segment ")

Logistic Regression

Splitting the Data into Training and Testing Sets

The data was Randomly splitted to training (80%) and testing (20%) datasets.

index <- createDataPartition(hotels$is_canceled, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- hotels[-index,]
# select 80% of data to train the models
trainset <- hotels[index,]

Train a logistic regression model

Let’s fit the full model with some explanatory variables, after excluding some less interesting variables such as date, month, year, country etc.

fullmodel <- glm(is_canceled ~ hotel + lead_time  + adr +  stays_nights + total_guests + 
                  meal_new + total_of_special_requests + distribution_channel +
                 is_repeated_guest + previous_cancellations + booking_changes +
                 deposit_type + days_in_waiting_list + required_car_parking_spaces +
                 customer_type, family = "binomial" , data=trainset)

Let’s find the best model using backward elimination technique.

# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')
## Start:  AIC=85891.19
## is_canceled ~ hotel + lead_time + adr + stays_nights + total_guests + 
##     meal_new + total_of_special_requests + distribution_channel + 
##     is_repeated_guest + previous_cancellations + booking_changes + 
##     deposit_type + days_in_waiting_list + required_car_parking_spaces + 
##     customer_type
## 
##                               Df Deviance   AIC
## <none>                              85843 85891
## - hotel                        1    85853 85899
## - days_in_waiting_list         1    85863 85909
## - total_guests                 1    85935 85981
## - stays_nights                 1    85949 85995
## - is_repeated_guest            1    86158 86204
## - meal_new                     3    86180 86222
## - booking_changes              1    86423 86469
## - adr                          1    86798 86844
## - distribution_channel         4    86810 86850
## - customer_type                3    87388 87430
## - lead_time                    1    87394 87440
## - previous_cancellations       1    87694 87740
## - total_of_special_requests    1    88648 88694
## - required_car_parking_spaces  1    88911 88957
## - deposit_type                 2    95056 95100
#Remove hotel and days_in_waiting_list, Lower AIC or BOC value indicates a better fit.
Finalmodel <- glm(is_canceled ~  lead_time  + adr +  stays_nights + total_guests + 
                    meal_new + total_of_special_requests + distribution_channel +
                    is_repeated_guest + previous_cancellations + booking_changes +
                    deposit_type + required_car_parking_spaces +
                    customer_type, family = "binomial" , data=trainset)
summary(Finalmodel)
## 
## Call:
## glm(formula = is_canceled ~ lead_time + adr + stays_nights + 
##     total_guests + meal_new + total_of_special_requests + distribution_channel + 
##     is_repeated_guest + previous_cancellations + booking_changes + 
##     deposit_type + required_car_parking_spaces + customer_type, 
##     family = "binomial", data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.3868  -0.7851  -0.4586   0.2277   8.4904  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.652e+00  1.401e-01 -18.928  < 2e-16 ***
## lead_time                      3.815e-03  9.702e-05  39.319  < 2e-16 ***
## adr                            6.336e-03  1.995e-04  31.761  < 2e-16 ***
## stays_nights                   3.874e-02  3.322e-03  11.663  < 2e-16 ***
## total_guests                   1.191e-01  1.355e-02   8.790  < 2e-16 ***
## meal_newHB                    -8.712e-01  1.201e-01  -7.252 4.11e-13 ***
## meal_newBB                    -4.915e-01  1.181e-01  -4.161 3.17e-05 ***
## meal_newSC                    -2.367e-01  1.204e-01  -1.966   0.0493 *  
## total_of_special_requests     -5.962e-01  1.192e-02 -50.001  < 2e-16 ***
## distribution_channelDirect    -3.599e-01  5.340e-02  -6.739 1.60e-11 ***
## distribution_channelGDS       -5.194e-01  2.172e-01  -2.391   0.0168 *  
## distribution_channelTA/TO      4.922e-01  4.687e-02  10.500  < 2e-16 ***
## distribution_channelUndefined  1.284e+01  5.124e+01   0.251   0.8021    
## is_repeated_guest1            -1.304e+00  8.129e-02 -16.043  < 2e-16 ***
## previous_cancellations         1.786e+00  4.942e-02  36.146  < 2e-16 ***
## booking_changes               -3.703e-01  1.680e-02 -22.043  < 2e-16 ***
## deposit_typeNon Refund         5.028e+00  1.200e-01  41.905  < 2e-16 ***
## deposit_typeRefundable         3.295e-01  2.343e-01   1.406   0.1597    
## required_car_parking_spaces   -3.208e+00  1.068e-01 -30.035  < 2e-16 ***
## customer_typeGroup            -1.081e-01  1.806e-01  -0.599   0.5495    
## customer_typeTransient         1.133e+00  5.424e-02  20.896  < 2e-16 ***
## customer_typeTransient-Party   3.276e-01  5.686e-02   5.762 8.30e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125920  on 95512  degrees of freedom
## Residual deviance:  85874  on 95491  degrees of freedom
## AIC: 85918
## 
## Number of Fisher Scoring iterations: 8

The following displays Model fitting Criteria. * Model deviance

Finalmodel$deviance
## [1] 85873.65
  • AIC
AIC(Finalmodel)
## [1] 85917.65
  • BIC
BIC(Finalmodel)
## [1] 86125.93

In-sample Prediction and ROC curve

In-sample prediction and Confusion Matrix with cut-off probability of 0.5 were calculated.

#Setting cut-off probability=0.5
table(predict(Finalmodel,type="response") > 0.5)
## 
## FALSE  TRUE 
## 73438 22075
#confusion matrix
pred_prob=predict(Finalmodel, data=trainset, type="response")
pred_value=1*(pred_prob>0.5)
actual_value <-trainset$is_canceled
confusion_matrix <- table(actual_value, pred_value)
confusion_matrix
##               pred_value
## actual_value       0     1
##   Not canceled 56897  3236
##   Canceled     16541 18839
#misclasscification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate #0.21
## [1] 0.2070608

A ROC curve is a graph showing the performance of a classification model at all classsification thresholds. ROC curve for training data set was created.

#In-sample prediction
pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(trainset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

And Area Under the Curve(AUC) for training dataset was computed.

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8356457

Out-of-sample Prediction and ROC curve

Similarly, out-of-sample prediction was evaluated and ROC curve for testing data set was created.

The Receiver Operating Characteristic (ROC) curve was created.

#out-of-sample prediction
pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

AUC was calculated.

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8352637

Summary

Here are some data-driven insights from this project.

  • Overall, 37% of total bookings were canceled.
  • April, May, June, September, and October were busiest months for two hotels, and more cancellations were made during those months. On the other hand, November, December, and January are less occupied months, and less cancellations were made.
  • Lead time for a canceled booking was much longer than that of a booking without cancellation.
  • Majority of cancellations were from the city hotel.
    • Almost all types of customer, city hotel bookings were more frequently canceled than resort hotel bookings, except Group customer.
    • A Non-refundable booking is likely to be canceled regardless of hotel types.
    • For bookings with No deposit, the proportion of cancellation is similar between two hotels. - For Refundable bookings, significantly more cancellations were made in city hotel.
  • The logistic regression model correctly predicts the cancellation with 84% of accuracy.
p <- hotels %>% group_by(arrival_date) %>% 
  count(is_canceled) %>% 
  ggplot(aes(x=arrival_date, y=n, color=is_canceled)) + 
  geom_line( ) + 
  geom_hline(aes(yintercept=mean(n, na.rm=TRUE)), linetype="dashed")+
  scale_x_date(date_breaks= "2 month", date_labels = "%Y %b") + 
  theme(axis.text.x=element_text(angle=45, size=10, hjust=1)) +
  ggtitle("") +
  labs(x="", y="Booking", color="") +
  scale_color_manual(values=c("#80CDC1", "#C2A5CF"), labels=c("Not canceled", "canceled")) +
  ggtitle("Interactive Daily Booking with/wo Cancellation") 

ggplotly(p)