Hotel Booking Data Analysis

In this project, Our goal is to develop a prediction model for predicting the possibility of hotel booking cancellation using data mining algorithms. This data is extracted from Hotel Profit Management System (PMS) and it contains booking information due to arrive between 7/1/2015 and 8/31/2017 from one resort hotel and one city hotel located in Portugal.

Introduction

Hotel cancellations are always a bad news for the business yet it’s unavoidable. The incorrect estimation of guest can lead to wastage of resources for eg. if a large group of individuals were to cancel their booking, at the last moment, this may lead to wastage of raw food materials and it is often difficult to find new guests at such a short notice

If hotel’s booking system were to have an algorithm, which could predict the possibility of cancellation, it would be easier for the hotel management to develop better cancellation policy, to predict demand accurately and to take proactive actions to manage business and resources accordingly. We are trying to analyze the booking data of 3 years (between 7/1/2015 and 8/31/2017) of multiple hotels and trying to find a common trend which can be seen in the case of cancellation. In short, we are trying to determine, what are some of the common characteristics which indicate possibility of a cancellation.

We will first identify which parameters are correlated with booking cancellation. After deciding the correlated variables, we will see whether they positively influence the cancellation or negatively influence the cancellation. Finally, we plan to come up with a logistic regression model which can predict cancellation.

Packages Required

Below is the list of packages used in the project.

library(magrittr)   
library(tidyverse)  
library(car)
library(knitr)
library(caret)
library(ROCR)
library(scales)
library(pander)
library(plotly)
library(MASS)
library(psych)

The purpose of the packages are as below:

  • Package psych is used to plot multiple histograms.
  • Package tidyverse consists of 6 core packages out of which the below 3 are most important for this project:
  • ggplot2: Used for creating powerful visualizations
  • dplyr: Used for data manipulation
  • tidyr: Used for data modifications
  • Package car is used for identifying outliers in a boxplot.
  • Package knitr is for displaying tables.

Data Preparation

The data was obtained at tidytuesday and detailed information can be found in the original publication (Nuno Antonio, Ana de Almeida, Luis Nunes, 2019). Since it is uncommon to find real business data, the authors intended to provide the hotel demand data to researchers and educators in revenue management, machine learning or data mining.

Data Import

There are 119390 observation and 32 variables in hotels.csv file. We have set the seed to ensure the repeatability of the code and data split during logistic regression.

Data type of each variable with sample values:

# Importing .csv files 
set.seed(70252021)
hotels <- read.csv("data/hotels.csv")
# View the structure of hotels
str(hotels)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

Each observation represents a hotel booking. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields. The data was extracted from hotels’ Property Management System (PMS) SQL databases Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon.

The following diagram presents the PMS database tables from where the data variables were extracted:

Viewing the raw data

# View the first 10 rows of hotels
knitr::kable(head(hotels,n = 10), "pandoc")
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
Resort Hotel 0 342 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 737 2015 July 27 1 0 0 2 0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NULL NULL 0 Transient 0.0 0 0 Check-Out 2015-07-01
Resort Hotel 0 7 2015 July 27 1 0 1 1 0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NULL NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 13 2015 July 27 1 0 1 1 0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304 NULL 0 Transient 75.0 0 0 Check-Out 2015-07-02
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 14 2015 July 27 1 0 2 2 0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 98.0 0 1 Check-Out 2015-07-03
Resort Hotel 0 0 2015 July 27 1 0 2 2 0 0 BB PRT Direct Direct 0 0 0 C C 0 No Deposit NULL NULL 0 Transient 107.0 0 0 Check-Out 2015-07-03
Resort Hotel 0 9 2015 July 27 1 0 2 2 0 0 FB PRT Direct Direct 0 0 0 C C 0 No Deposit 303 NULL 0 Transient 103.0 0 1 Check-Out 2015-07-03
Resort Hotel 1 85 2015 July 27 1 0 3 2 0 0 BB PRT Online TA TA/TO 0 0 0 A A 0 No Deposit 240 NULL 0 Transient 82.0 0 1 Canceled 2015-05-06
Resort Hotel 1 75 2015 July 27 1 0 3 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 D D 0 No Deposit 15 NULL 0 Transient 105.5 0 0 Canceled 2015-04-22

Change Variable Types

Following integer/Character variables have been switched to factors for convenience in analysis.

Variables Data Types
hotel character
is_canceled Integer
arrival_date_year Integer
arrival_date_month character
arrival_date_week_number Integer
arrival_date_day_of_month Integer
meal character
country character
market_segment character
distribution_channel character
is_repeated_guest Integer
previous_cancellations Integer
previous_bookings_not_canceled Integer
reserved_room_type character
assigned_room_type character
booking_changes Integer
deposit_type character
agent character
company character
customer_type character
reservation_status character
reservation_status_date character
hotels$hotel <- as.factor(hotels$hotel)
hotels$is_canceled  <- as.factor(hotels$is_canceled)
hotels$arrival_date_year   <- as.factor(hotels$arrival_date_year )
hotels$arrival_date_month   <- as.factor(hotels$arrival_date_month )
hotels$arrival_date_week_number  <- as.factor(hotels$arrival_date_week_number )
hotels$arrival_date_day_of_month   <- as.factor(hotels$arrival_date_day_of_month)
hotels$meal <- as.factor(hotels$meal)
hotels$country <- as.factor(hotels$country)
hotels$market_segment <- as.factor(hotels$market_segment)
hotels$distribution_channel <- as.factor(hotels$distribution_channel)
hotels$is_repeated_guest   <- as.factor(hotels$is_repeated_guest)
hotels$previous_cancellations  <- as.factor(hotels$previous_cancellations)
hotels$previous_bookings_not_canceled <- as.factor(hotels$previous_bookings_not_canceled)
hotels$reserved_room_type <- as.factor(hotels$reserved_room_type)
hotels$assigned_room_type <- as.factor(hotels$assigned_room_type)
hotels$booking_changes  <- as.factor(hotels$booking_changes)
hotels$deposit_type <- as.factor(hotels$deposit_type)
hotels$agent <- as.factor(hotels$agent)
hotels$company <- as.factor(hotels$company)
hotels$customer_type <- as.factor(hotels$customer_type)
hotels$reservation_status <- as.factor(hotels$reservation_status)
hotels$reservation_status_date <- as.factor(hotels$reservation_status_date)

Create new variables

Following new variables “arrival_date” and “stay_nights” were created using existing variables.

Variables Data Types
arrival_date character
stay_nights Integer
#create new variable: arrival_date
hotels$arrival_date <-paste(hotels$arrival_date_month, 
                            hotels$arrival_date_day_of_month,
                            hotels$arrival_date_year,sep="-")
hotels$arrival_date <-as.Date(hotels$arrival_date, format="%B-%d-%Y")
#create new variable: stays_nights
hotels %>% mutate(stays_nights= stays_in_weekend_nights + stays_in_week_nights) -> hotels_new

Regrouping variable and dropping values

Regarding type of meal, there are 10,650 records of SC (no meal package) and 1,169 records of Undefined. The level of “Undefined” were combined as the level of SC. Further we also dropped 16 values from booking_changes and 21 values from previous_cancellations since they were not part of the training data but were part of the testing dataset.

levels(hotels_new$meal)[5] <- "SC"
# Data Summary (Booking Data)

#hotels_new<-hotels_new[!(hotels_new$booking_changes=="18" ),] 
hotels_new<-hotels_new[!(hotels_new$booking_changes=="16"| hotels_new$previous_cancellations=="21"),]
summary(hotels_new)
##           hotel       is_canceled   lead_time   arrival_date_year
##  City Hotel  :79328   0:75165     Min.   :  0   2015:21995       
##  Resort Hotel:40059   1:44222     1st Qu.: 18   2016:56706       
##                                   Median : 69   2017:40686       
##                                   Mean   :104                    
##                                   3rd Qu.:160                    
##                                   Max.   :737                    
##                                                                  
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13877      33     :  3580           17     : 4406            
##  July   :12660      30     :  3087           5      : 4317            
##  May    :11790      32     :  3045           15     : 4196            
##  October:11159      34     :  3040           25     : 4160            
##  April  :11089      18     :  2926           26     : 4147            
##  June   :10939      21     :  2854           9      : 4095            
##  (Other):47873      (Other):100855           (Other):94066            
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies          meal          country     
##  Min.   : 0.0000   Min.   : 0.000000   BB:92307   PRT    :48589  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB:  798   GBR    :12129  
##  Median : 0.0000   Median : 0.000000   HB:14463   FRA    :10415  
##  Mean   : 0.1039   Mean   : 0.007949   SC:11819   ESP    : 8568  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000              DEU    : 7287  
##  Max.   :10.0000   Max.   :10.000000              ITA    : 3766  
##  NA's   :4                                        (Other):28633  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56475   Corporate: 6677      0:115578         
##  Offline TA/TO:24219   Direct   :14645      1:  3809         
##  Groups       :19810   GDS      :  193                       
##  Direct       :12606   TA/TO    :97867                       
##  Corporate    : 5295   Undefined:    5                       
##  Complementary:  743                                         
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  0      :112904         0      :115767                 A      :85991     
##  1      :  6051         1      :  1542                 D      :19201     
##  2      :   116         2      :   580                 E      : 6535     
##  3      :    65         3      :   333                 F      : 2897     
##  24     :    48         4      :   229                 G      : 2094     
##  11     :    35         5      :   181                 B      : 1118     
##  (Other):   168         (Other):   755                 (Other): 1551     
##  assigned_room_type booking_changes      deposit_type        agent      
##  A      :74050      0      :101313   No Deposit:104638   9      :31960  
##  D      :25322      1      : 12701   Non Refund: 14587   NULL   :16340  
##  E      : 7806      2      :  3805   Refundable:   162   240    :13921  
##  F      : 3751      3      :   927                       1      : 7190  
##  G      : 2553      4      :   376                       14     : 3640  
##  C      : 2375      5      :   118                       7      : 3539  
##  (Other): 3530      (Other):   147                       (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  NULL   :112590   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89611  
##  67     :   267   Mean   :  2.321      Transient-Party:25123  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.59   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5713           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date  arrival_date       
##  Canceled :43015    2015-10-21:  1461       Min.   :2015-07-01  
##  Check-Out:75165    2015-07-06:   805       1st Qu.:2016-03-13  
##  No-Show  : 1207    2016-11-25:   790       Median :2016-09-06  
##                     2015-01-01:   763       Mean   :2016-08-28  
##                     2016-01-18:   625       3rd Qu.:2017-03-18  
##                     2015-07-02:   469       Max.   :2017-08-31  
##                     (Other)   :114474                           
##   stays_nights   
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 3.000  
##  Mean   : 3.428  
##  3rd Qu.: 4.000  
##  Max.   :69.000  
## 

Missing Values

There are almost no missing values except for “children”.

colSums(is.na(hotels_new))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0 
##                   arrival_date                   stays_nights 
##                              0                              0

There are 4 missing values for “children” variable and they were replaced by zero.

hotels_new$children[is.na(hotels_new$children)]  <- 0

Outlier

To see the distribution of adr (Average Daily Rate) and to investigate any existing outliers, a histogram and a boxplot were created.

hist(hotels_new$adr,prob = T, col ="grey", breaks = 100, xlim = c(min(hotels_new$adr) - 1 ,max(hotels_new$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")

boxplot(hotels_new$adr,main = "Boxplot : Average Daily Rate") 

The outlier in adr was identified.

Boxplot(~adr, data=hotels_new, id=list(n=Inf),main = "Boxplot : Average Daily Rate")  

The outlier was replaced by NA and below is new boxplot without the outlier.

hotels_new[48516, 28] <- NA # row num.  is 48516; column num. for adr is 28
boxplot(hotels_new$adr, main = "Boxplot : Average Daily Rate") 

Variable Description

Below is the table of variable description.

Variable Type Description
hotel factor Hotel (City Hotel or Resort Hotel)
is_canceled factor Value indicating if the booking was canceled (1) or not (0)
lead_time integer Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year factor Year of arrival date
arrival_date_month factor Month of arrival date
arrival_date_week_number factor Week number of year for arrival date
arrival_date_day_of_month factor Day of arrival date
stays_in_weekend_nights integer Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights integer Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults integer Number of adults
children numeric Number of children
babies integer Number of babies
meal factor Type of meal booked. SC - no meal package; BB - Bed & Breakfast; HB - Half board (breakfast and one other meal – usually dinner); FB - Full board (breakfast, lunch and dinner)
country factor Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment factor Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel factor Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest factor Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations factor Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled factor Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type factor Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type factor Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes factor Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type factor Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
agent factor ID of the travel agency that made the booking
company factor ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list integer Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type factor Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr numeric Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces integer Number of car parking spaces required by the customer
total_of_special_requests integer Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status factor Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date factor Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel
arrival_date Date Arrival Date
stays_nights integer Stays in nights

Exploratory Data Analysis

Variable histograms

We begin Exploratory Data Analysis by looking at the histograms of the numeric columns.This helps us to visualize if there is a skewness to the data or if any particular variable follows a perfectly normal distribution. We see that stays_nights follows almost a normal distribution curve with a longer tail to the right.

multi.hist(hotels_new[,sapply(hotels_new, is.numeric)],freq=TRUE, dcol= c("blue","red"),dlty=c("dotted", "solid"))

City vs resort cancellations

The canceled and not canceled groups were displayed for the two hotels. It is clear that the number of cancellations for city hotel were higher than resort hotel.

# hotel and cancellation
hotels_new %>% ggplot( aes(x=hotel, fill=is_canceled)) + geom_bar() + scale_fill_discrete(labels=c("not canceled", "canceled")) + guides(fill=guide_legend(title=NULL)) +  scale_fill_manual(values=c("#F8766D","#00BFc4")) 

Monthly cancellations

We further expand our analysis to visualize the number of cancellations on a monthly basis as below. This will help us to know in which months the hotel owners were affected the worst.

count <- table(hotels_new$is_canceled, hotels_new$arrival_date_month)

barplot(count, main = "Number of cancellations by Months",horiz = TRUE, names.arg = c("Jan", "Feb","March","April","May","June","July","Aug","Sept","Oct","Nov","Dec"),col=c("#F8766D","#00BFc4") ,cex.names = 0.4)

We can find the total number of guests by adding 3 columns of adults, children and babies. This will tell us the total number of guests per room which is an important metric but was not included in the original dataset.

hotels_new$no_of_guests = rowSums(cbind(hotels_new$adults,hotels_new$children, hotels_new$babies),na.rm = TRUE)
summary (hotels_new$no_of_guests)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.968   2.000  55.000

Total bookings across months for 2015, 2016 and 2017 is as below:

hotels_new1 <- hotels_new %>% 
  mutate(Months = fct_relevel(arrival_date_month,"January",
                              "February",
                              "March",
                              "April",
                              "May",
                              "June",
                              "July",
                              "August",
                              "September",
                              "October",
                              "November",
                              "December"))
ggplot(data = hotels_new1,aes(Months,fill = (hotel))) +
  geom_bar(position = 'dodge') +
  facet_grid(hotels_new1$arrival_date_year) +
  scale_y_continuous(name = "No. of Bookings",labels = scales::comma) +
  xlab("Months") +
  ggtitle("Hotel Type vs No. of bookings across months") +
  labs(fill = 'Hotel Type')

We can see that the data for 2015 and 2017 is not available for all the months. This indicates that all the annual numbers need to be normalized to a common baseline.

Yearly trend

Next we see the trend of bookings that were converted and canceled over the period of three years period over which the data was collected. This will also help us to identify if any outside factors or events led people to cancel their hotel bookings during a particular year.

ggplot(data = hotels_new,aes(arrival_date_year,fill = (is_canceled))) +
  geom_bar(position = 'dodge') +
  scale_y_continuous(name = "No. of Bookings",labels = scales::comma) +
  xlab("Years") +
  ggtitle("Status by number of bookings across years") +
  labs(fill = 'Booking Status')

We tried correlating the variables with the dependent variable is_canceled but we were not able to make conclusive decisions on the basis of that information. We only see variables that are weakly correlated or are negatively correlated.

corr_hotels <- hotels_new[unlist(lapply(hotels_new, is.numeric))]
cor(corr_hotels)[,1] * 100
##                   lead_time     stays_in_weekend_nights 
##                  100.000000                    8.569450 
##        stays_in_week_nights                      adults 
##                   16.579416                   11.951311 
##                    children                      babies 
##                   -3.760406                   -2.091256 
##        days_in_waiting_list                         adr 
##                   17.009473                          NA 
## required_car_parking_spaces   total_of_special_requests 
##                  -11.644463                   -9.576870 
##                stays_nights                no_of_guests 
##                   15.717236                    7.226568

Logistic Regression

Linear Regression

We start with a full model to observe with all the variable and then we will proceed to use a logistic regression model with the number of cancellations as response variable.

lm_model<- glm(is_canceled ~ lead_time  + hotel + no_of_guests + deposit_type + adr + total_of_special_requests + booking_changes + required_car_parking_spaces  + days_in_waiting_list + customer_type, data = hotels_new , family = "binomial")

summary(lm_model)
## 
## Call:
## glm(formula = is_canceled ~ lead_time + hotel + no_of_guests + 
##     deposit_type + adr + total_of_special_requests + booking_changes + 
##     required_car_parking_spaces + days_in_waiting_list + customer_type, 
##     family = "binomial", data = hotels_new)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0455  -0.8120  -0.4961   0.2007   2.8011  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.186e+00  5.108e-02 -42.787  < 2e-16 ***
## lead_time                     4.661e-03  8.211e-05  56.764  < 2e-16 ***
## hotelResort Hotel            -9.598e-02  1.598e-02  -6.006 1.90e-09 ***
## no_of_guests                  1.609e-01  1.233e-02  13.046  < 2e-16 ***
## deposit_typeNon Refund        5.096e+00  1.055e-01  48.309  < 2e-16 ***
## deposit_typeRefundable        1.005e-01  2.037e-01   0.493   0.6218    
## adr                           5.100e-03  1.729e-04  29.497  < 2e-16 ***
## total_of_special_requests    -5.497e-01  1.042e-02 -52.755  < 2e-16 ***
## booking_changes1             -1.085e+00  2.873e-02 -37.779  < 2e-16 ***
## booking_changes2             -5.999e-01  4.409e-02 -13.607  < 2e-16 ***
## booking_changes3             -8.255e-01  9.648e-02  -8.556  < 2e-16 ***
## booking_changes4             -7.208e-01  1.449e-01  -4.976 6.50e-07 ***
## booking_changes5             -5.260e-01  2.639e-01  -1.993   0.0462 *  
## booking_changes6             -7.012e-02  2.984e-01  -0.235   0.8142    
## booking_changes7             -1.507e+00  6.263e-01  -2.405   0.0162 *  
## booking_changes8             -2.206e-01  6.183e-01  -0.357   0.7212    
## booking_changes9             -1.613e+00  1.109e+00  -1.454   0.1458    
## booking_changes10            -1.614e-01  1.141e+00  -0.141   0.8876    
## booking_changes11            -1.577e+01  2.637e+03  -0.006   0.9952    
## booking_changes12            -1.609e+01  2.754e+03  -0.006   0.9953    
## booking_changes13            -1.699e+01  1.722e+03  -0.010   0.9921    
## booking_changes14            -4.488e-01  1.162e+00  -0.386   0.6994    
## booking_changes15            -1.716e+01  2.242e+03  -0.008   0.9939    
## booking_changes17            -1.682e+01  2.653e+03  -0.006   0.9949    
## booking_changes18            -1.690e+01  3.956e+03  -0.004   0.9966    
## booking_changes20            -1.613e+01  3.956e+03  -0.004   0.9967    
## booking_changes21            -1.540e+01  3.956e+03  -0.004   0.9969    
## required_car_parking_spaces  -1.659e+01  4.251e+01  -0.390   0.6963    
## days_in_waiting_list         -3.203e-03  4.778e-04  -6.704 2.02e-11 ***
## customer_typeGroup           -6.541e-01  1.608e-01  -4.068 4.74e-05 ***
## customer_typeTransient        7.501e-01  4.521e-02  16.591  < 2e-16 ***
## customer_typeTransient-Party -5.918e-02  4.742e-02  -1.248   0.2121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157391  on 119385  degrees of freedom
## Residual deviance: 110283  on 119354  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 110347
## 
## Number of Fisher Scoring iterations: 16
AIC (lm_model)
## [1] 110346.8

Splitting the Data

We split the full data in 80:20 ratio as training and testing datasets.

index <- createDataPartition(hotels_new$is_canceled, p=0.80, list=FALSE)

# select 20% of the data for testing
testset <- hotels_new[-index,]
# select 80% of data to train the models
trainset <- hotels_new[index,]

Training the model

fullmodel <- glm(is_canceled ~ hotel + lead_time  + adr +  stays_nights + adults + children + babies +
                  meal + total_of_special_requests + distribution_channel +
                 is_repeated_guest + previous_cancellations + booking_changes +
                 deposit_type + days_in_waiting_list + required_car_parking_spaces +
                 customer_type, family = "binomial" , data=trainset)

Let’s find the best model using backward elimination technique.

# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')
## Start:  AIC=82509.79
## is_canceled ~ hotel + lead_time + adr + stays_nights + adults + 
##     children + babies + meal + total_of_special_requests + distribution_channel + 
##     is_repeated_guest + previous_cancellations + booking_changes + 
##     deposit_type + days_in_waiting_list + required_car_parking_spaces + 
##     customer_type
## 
##                               Df Deviance   AIC
## <none>                              82398 82510
## - babies                       1    82405 82515
## - days_in_waiting_list         1    82406 82516
## - hotel                        1    82411 82521
## - adults                       1    82431 82541
## - children                     1    82464 82574
## - stays_nights                 1    82522 82632
## - meal                         3    82720 82826
## - is_repeated_guest            1    82963 83073
## - distribution_channel         4    83263 83367
## - booking_changes             19    83410 83484
## - adr                          1    83442 83552
## - lead_time                    1    83746 83856
## - customer_type                3    84230 84336
## - total_of_special_requests    1    85019 85129
## - required_car_parking_spaces  1    85712 85822
## - previous_cancellations      13    86569 86655
## - deposit_type                 2    91159 91267

Considering the lowest AIC for days_in_waiting_list, babies and hotel, we tried below 2 cases.

  • Removed days_in_waiting_list and babies.
  • Removed days_in_waiting_list, babies and hotel.

We select our final model after removing days_in_waiting_list and babies as this model gave a lower AIC value.

#Dropped previous cancellations

Finalmodel <- glm(is_canceled ~ hotel + lead_time  + adr +  stays_nights + adults + children +
                  meal + total_of_special_requests + distribution_channel +
                 is_repeated_guest + booking_changes +previous_cancellations +
                 deposit_type + days_in_waiting_list + required_car_parking_spaces +
                 customer_type, family = "binomial" , data=trainset)

summary(Finalmodel)
## 
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + adr + stays_nights + 
##     adults + children + meal + total_of_special_requests + distribution_channel + 
##     is_repeated_guest + booking_changes + previous_cancellations + 
##     deposit_type + days_in_waiting_list + required_car_parking_spaces + 
##     customer_type, family = "binomial", data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9832  -0.7698  -0.4160   0.1849   3.3571  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -3.489e+00  8.522e-02 -40.942  < 2e-16 ***
## hotelResort Hotel              7.376e-02  2.035e-02   3.624  0.00029 ***
## lead_time                      3.730e-03  1.014e-04  36.801  < 2e-16 ***
## adr                            6.714e-03  2.074e-04  32.380  < 2e-16 ***
## stays_nights                   3.921e-02  3.492e-03  11.225  < 2e-16 ***
## adults                         8.562e-02  1.569e-02   5.457 4.84e-08 ***
## children                       1.717e-01  2.101e-02   8.172 3.02e-16 ***
## mealFB                         5.297e-01  1.219e-01   4.344 1.40e-05 ***
## mealHB                        -3.375e-01  2.930e-02 -11.518  < 2e-16 ***
## mealSC                         3.152e-01  2.612e-02  12.067  < 2e-16 ***
## total_of_special_requests     -5.879e-01  1.216e-02 -48.338  < 2e-16 ***
## distribution_channelDirect    -3.673e-01  5.555e-02  -6.613 3.78e-11 ***
## distribution_channelGDS       -4.323e-01  2.073e-01  -2.086  0.03701 *  
## distribution_channelTA/TO      4.601e-01  4.922e-02   9.350  < 2e-16 ***
## distribution_channelUndefined  1.387e+01  6.542e+01   0.212  0.83212    
## is_repeated_guest1            -2.239e+00  1.159e-01 -19.319  < 2e-16 ***
## booking_changes1              -9.302e-01  3.331e-02 -27.931  < 2e-16 ***
## booking_changes2              -5.033e-01  5.123e-02  -9.824  < 2e-16 ***
## booking_changes3              -6.394e-01  1.105e-01  -5.789 7.10e-09 ***
## booking_changes4              -5.041e-01  1.653e-01  -3.050  0.00229 ** 
## booking_changes5              -2.258e-01  2.836e-01  -0.796  0.42600    
## booking_changes6              -5.836e-02  3.261e-01  -0.179  0.85798    
## booking_changes7              -1.440e+00  7.637e-01  -1.885  0.05938 .  
## booking_changes8              -8.610e-01  8.303e-01  -1.037  0.29976    
## booking_changes9              -1.649e+00  1.126e+00  -1.465  0.14284    
## booking_changes10             -1.746e+01  3.013e+03  -0.006  0.99538    
## booking_changes11             -1.603e+01  6.523e+03  -0.002  0.99804    
## booking_changes12             -1.900e+01  6.523e+03  -0.003  0.99768    
## booking_changes13             -1.790e+01  3.127e+03  -0.006  0.99543    
## booking_changes14             -6.065e-01  1.239e+00  -0.490  0.62439    
## booking_changes15             -1.821e+01  4.597e+03  -0.004  0.99684    
## booking_changes17             -1.832e+01  6.523e+03  -0.003  0.99776    
## booking_changes18             -1.755e+01  6.523e+03  -0.003  0.99785    
## booking_changes20             -1.768e+01  6.523e+03  -0.003  0.99784    
## booking_changes21             -1.818e+01  6.523e+03  -0.003  0.99778    
## previous_cancellations1        4.465e+00  1.080e-01  41.344  < 2e-16 ***
## previous_cancellations2        2.877e+00  2.951e-01   9.749  < 2e-16 ***
## previous_cancellations3        2.978e+00  3.814e-01   7.809 5.74e-15 ***
## previous_cancellations4        3.748e+00  5.519e-01   6.790 1.12e-11 ***
## previous_cancellations5        2.614e+00  7.957e-01   3.285  0.00102 ** 
## previous_cancellations6       -6.496e-01  2.702e+00  -0.240  0.80998    
## previous_cancellations11       1.649e+00  5.482e-01   3.007  0.00264 ** 
## previous_cancellations13       5.625e+00  1.070e+00   5.257 1.46e-07 ***
## previous_cancellations14       1.384e+01  1.967e+03   0.007  0.99439    
## previous_cancellations19       1.351e+01  1.684e+03   0.008  0.99360    
## previous_cancellations24       1.818e+01  7.532e+02   0.024  0.98074    
## previous_cancellations25       1.332e+01  1.458e+03   0.009  0.99271    
## previous_cancellations26       1.368e+01  1.423e+03   0.010  0.99233    
## deposit_typeNon Refund         5.052e+00  1.233e-01  40.976  < 2e-16 ***
## deposit_typeRefundable         5.975e-01  2.307e-01   2.591  0.00958 ** 
## days_in_waiting_list          -1.502e-03  5.385e-04  -2.789  0.00528 ** 
## required_car_parking_spaces   -2.150e+01  9.104e+01  -0.236  0.81330    
## customer_typeGroup             1.723e-01  1.920e-01   0.897  0.36969    
## customer_typeTransient         1.515e+00  6.418e-02  23.599  < 2e-16 ***
## customer_typeTransient-Party   6.581e-01  6.698e-02   9.825  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125913  on 95508  degrees of freedom
## Residual deviance:  82405  on 95454  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 82515
## 
## Number of Fisher Scoring iterations: 17

Next, we look at the model fitting criteria.

** Model deviance

Finalmodel$deviance
## [1] 82405.07

** We check the AIC of the final model.

AIC(Finalmodel)
## [1] 82515.07

** We also check the BIC of the final model.

BIC(Finalmodel)
## [1] 83035.76

In sample prediction & ROC curve

A Receiver operating characteristic curve is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True positive rate and False positive rate. Below ROC curve for training data set is created.

#In-sample prediction

pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve

trainset <- trainset[-c(1),]

length(pred.glm0.train)==length(as.numeric(trainset$is_canceled)) #This checks if the lengths of pred.glm0.train and is_cancelled variable are equal to ensure the next function runs correctly. 
## [1] TRUE
pred <- ROCR::prediction(pred.glm0.train, as.numeric(trainset$is_canceled))

perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

Area Under the Curve (AUC) for training dataset was computed.

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8264023

Out of sample prediction

Now that we have our final model, we will use it to predict on the testing data set.

#out-of-sample prediction


pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")



##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

Area under the curve (AUC) for out of sample prediction.

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8351158

Summary

Below are our observations from the project:

  • City hotel saw higher number of bookings and even higher number of cancellations.It clearly shows city hotel has much more demand than resort hotel.
  • The year 2016 saw the highest number of cancellations but also saw the highest number of bookings.
  • March, April, May and October clearly saw the least number of overall bookings and cancellations. Hence, we can say the overall demand is low in within these months.
  • February, June, September and November were the peak months in demand, and consequently they saw higher number of cancellations.
  • The logistic regression model correctly predicts the cancellation with 83.5% of accuracy.
p <- hotels %>% group_by(arrival_date) %>% 
  count(is_canceled) %>% 
  ggplot(aes(x=arrival_date, y=n, color=is_canceled)) + 
  geom_line( ) + 
  geom_hline(aes(yintercept=mean(n, na.rm=TRUE)), linetype="dashed")+
  scale_x_date(date_breaks= "2 month", date_labels = "%Y %b") + 
  theme(axis.text.x=element_text(angle=45, size=10, hjust=1)) +
  ggtitle("") +
  labs(x="", y="Booking", color="") +
  scale_color_manual(values=c("#F8766D","#00BFC4"), labels=c("Not canceled", "canceled")) +
  ggtitle("Interactive Daily Booking with/wo Cancellation") 

ggplotly(p)