Introduction

Three years ago my fammily have visited the Yellow Stone national park and stayed in a very expensive lodge even though I was trying to book the less expensive ones two months earlier. I was checking the booking page every hour expecting someone may cancel a booking and still ended up with that very expensive one. I was wondering if there were any trends associated with Hotel booking cancelation. Admittedly, booking cancellation prediction is more practical for hotel managers to orgnize hospitality and optimize revenue.

The hotel booking data contains comprehensive information to predict hotel booking cancellations and more.

I will go through every variable, conduct univariat analyses on most of them, use univariat and bivariat graphs to explore connections between variables, and conduct logistic regression and classification tree approaches to predict booking cansellation probability for certain bookings.

This analyses can benifit hotel managers “making the right room available for the right guest and the right price at the right time via the right distribution channel” (Mehrotra & Ruttley, 2006)

Packages required

These packages are required to load and munipulate data

library(data.table) # load tata 
library(tidyverse)  # tidy data
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(dplyr) # monipulate data
library(feasts) 
## Warning: package 'feasts' was built under R version 3.6.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 3.6.3
library(knitr)

These packages are required to build model

Data preparation

The original datasets come from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019.

Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking.

I load the data “hotel” and split it into “h1” for resort hotel and “h2” for city hotel and create a new dataframe combining “h1” and “h2”.

# resort hotel
h1 <- fread("hotels.csv") %>% 
  janitor::clean_names() %>% 
  mutate(hotel = "Resort Hotel") %>% 
  select(hotel, everything())

# city hotel
h2 <- fread("hotels.csv") %>% 
  janitor::clean_names() %>% 
  mutate(hotel = "City Hotel") %>% 
  select(hotel, everything())

hotel_df <- bind_rows(h1, h2)

I checked the data stucture, it has 238780 observations of 32 variables, there are 8 missing values in “children”, I removed the 8 observations of each clumn.

hotel_df <- na.omit(hotel_df)
glimpse(hotel_df)
## Observations: 238,772
## Variables: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month             <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type                  <chr> "Transient", "Transient", "Transient...
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date        <chr> "2015-07-01", "2015-07-01", "2015-07...

I removed some variables which are redundant or not effective for booking cancellation prediction.

drop.col <- c("adults","agent","arrival_date_day_of_month","market_segment",
              "reservation_status","reservation_status_date",
              "arrival_date_year","arrival_date_week_number")
hotel <- hotel_df %>% select(-one_of(drop.col))
#glimpse(hotel)

People cancel their booking sometimes because the assigned room type is not what they reserved. Here I combined the two columns into one categorrical variable “wanted_type” with 2 categories: “0” means same and “1” means diffrent.

Considering hotel booking is seasonal as some months in a year are more popular like summer vaction months, I define July, August, popullar months as “2”, “December”,“February” “November” as “1”, and other months as “0”.

hotel <- hotel %>% 
  mutate(arrival_date_month = ifelse(arrival_date_month %in% c("July","August"),2,ifelse(arrival_date_month %in% c("December","January","November"),0,1)))

In the categorical variable Company, “NULL” means that the booking did not came from a company. Here I define “NULL” as “individual” and other observations than “Null” as “company”

hotel <- hotel %>% 
  mutate(company = ifelse(company == "NULL","individual","company"))

Domestic tourists and international tourists may have different decissions when conselling a booking, so I condense the Country variable into two categorries: “Domestic” and “international”.

hotel <- hotel %>% 
  mutate(country = ifelse(country == "PRT","domestic","international"))

If tourists have children especially babies they are more likely to cancele a booking due to kid sickness. Here I combine “children” and “babies” and condense into three categorries: “0”,“2” and “1” for bookings having kids but no babies.

hotel <- hotel %>% 
  mutate(kids = ifelse(children == 0 & babies==0,0,ifelse(babies != 0,2,1))) %>% 
  mutate(children = NULL,babies = NULL)

Guests can be classified by three categorries based on cancellation record: new guest(0), loyal guest(1) and non loyal guest(2).

hotel <- hotel %>%
  mutate(loyalty = ifelse(is_repeated_guest == 0,0, ifelse(previous_bookings_not_canceled != 0, 1, 2))) %>% 
  mutate(is_repeated_guest = NULL, previous_bookings_not_canceled = NULL, previous_cancellations = NULL)

Convert most of the categorical variables to factors using forcats as_factor function, and then drop previous character version of that variable

hotel <- hotel %>% 
  mutate(hotel = as_factor(hotel),
         distribution_channel = as_factor(distribution_channel),
         is_canceled = as_factor(is_canceled),
         arrival_date_month = as_factor(arrival_date_month),
         meal = as_factor(meal),
         country  = as_factor(country),
         deposit_type = as_factor(deposit_type),
         company = as_factor(company),
         customer_type = as_factor(customer_type),
         kids = as_factor(kids),
         wanted_type = as_factor(wanted_type),
         loyalty = as_factor(loyalty)) %>% 
  glimpse()
## Observations: 238,772
## Variables: 20
## $ hotel                       <fct> Resort Hotel, Resort Hotel, Resort Hote...
## $ is_canceled                 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
## $ lead_time                   <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ...
## $ arrival_date_month          <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ stays_in_weekend_nights     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights        <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ...
## $ meal                        <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,...
## $ country                     <fct> domestic, domestic, international, inte...
## $ distribution_channel        <fct> Direct, Direct, Direct, Corporate, TA/T...
## $ booking_changes             <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                <fct> No Deposit, No Deposit, No Deposit, No ...
## $ company                     <fct> individual, individual, individual, ind...
## $ days_in_waiting_list        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type               <fct> Transient, Transient, Transient, Transi...
## $ adr                         <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,...
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests   <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ...
## $ wanted_type                 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ kids                        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ loyalty                     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Below is the preview of cleaned data

hotel %>% head(6) %>% data.table()
##           hotel is_canceled lead_time arrival_date_month
## 1: Resort Hotel           0       342                  2
## 2: Resort Hotel           0       737                  2
## 3: Resort Hotel           0         7                  2
## 4: Resort Hotel           0        13                  2
## 5: Resort Hotel           0        14                  2
## 6: Resort Hotel           0        14                  2
##    stays_in_weekend_nights stays_in_week_nights meal       country
## 1:                       0                    0   BB      domestic
## 2:                       0                    0   BB      domestic
## 3:                       0                    1   BB international
## 4:                       0                    1   BB international
## 5:                       0                    2   BB international
## 6:                       0                    2   BB international
##    distribution_channel booking_changes deposit_type    company
## 1:               Direct               3   No Deposit individual
## 2:               Direct               4   No Deposit individual
## 3:               Direct               0   No Deposit individual
## 4:            Corporate               0   No Deposit individual
## 5:                TA/TO               0   No Deposit individual
## 6:                TA/TO               0   No Deposit individual
##    days_in_waiting_list customer_type adr required_car_parking_spaces
## 1:                    0     Transient   0                           0
## 2:                    0     Transient   0                           0
## 3:                    0     Transient  75                           0
## 4:                    0     Transient  75                           0
## 5:                    0     Transient  98                           0
## 6:                    0     Transient  98                           0
##    total_of_special_requests wanted_type kids loyalty
## 1:                         0           1    0       0
## 2:                         0           1    0       0
## 3:                         0           1    0       0
## 4:                         0           1    0       0
## 5:                         1           1    0       0
## 6:                         1           1    0       0

Below is a table of variable names, data type and description of variables

hotel.type <- lapply(hotel, class)
hotel.var_desc <- c('Hotel tpye',
               'If the booking was canceled(1) or not(0)',
               'Number of date between booking and arrival',
               'if the arrival date is in a popular month',
               'Number of nights booked in weekend nights',
               'Number of nights booked in week nights',
               'Type of meal booked',
               'Country of origin',
               'Booking distribution channel',
               'Number of booking changed',
               'If deopsit was made to guarantee booking',
               'ID of the company that made the booking',
               'Number of days the books was booking',
               'Type of booking categories',
               'Average daily rate',
               'Number of car parking space required',
               'Number of special request made',
               'If wanted type was matched(1) or not(0)',
               'If booking with kids or babies',
               'If guest is new(0) or loyal(1) or less loyal(2)'
               )
hotel.var_names <- colnames(hotel)
data.description <- cbind(hotel.var_names, hotel.type, hotel.var_desc)
colnames(data.description) <- c('Variable Name', 'Data Type', 'Variable Description')
#data.description
kable(data.description,row.names = FALSE)
Variable Name Data Type Variable Description
hotel factor Hotel tpye
is_canceled factor If the booking was canceled(1) or not(0)
lead_time integer Number of date between booking and arrival
arrival_date_month factor if the arrival date is in a popular month
stays_in_weekend_nights integer Number of nights booked in weekend nights
stays_in_week_nights integer Number of nights booked in week nights
meal factor Type of meal booked
country factor Country of origin
distribution_channel factor Booking distribution channel
booking_changes integer Number of booking changed
deposit_type factor If deopsit was made to guarantee booking
company factor ID of the company that made the booking
days_in_waiting_list integer Number of days the books was booking
customer_type factor Type of booking categories
adr numeric Average daily rate
required_car_parking_spaces integer Number of car parking space required
total_of_special_requests integer Number of special request made
wanted_type factor If wanted type was matched(1) or not(0)
kids factor If booking with kids or babies
loyalty factor If guest is new(0) or loyal(1) or less loyal(2)

Here is a summary of the cleaned dataset.

summary(hotel)
##           hotel        is_canceled   lead_time   arrival_date_month
##  Resort Hotel:119386   0:150332    Min.   :  0   0: 39006          
##  City Hotel  :119386   1: 88440    1st Qu.: 18   1:146698          
##                                    Median : 69   2: 53068          
##                                    Mean   :104                     
##                                    3rd Qu.:160                     
##                                    Max.   :737                     
##  stays_in_weekend_nights stays_in_week_nights        meal       
##  Min.   : 0.0000         Min.   : 0.0         BB       :184612  
##  1st Qu.: 0.0000         1st Qu.: 1.0         FB       :  1596  
##  Median : 1.0000         Median : 2.0         HB       : 28926  
##  Mean   : 0.9276         Mean   : 2.5         SC       : 21300  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         Undefined:  2338  
##  Max.   :19.0000         Max.   :50.0                           
##           country       distribution_channel booking_changes  
##  domestic     : 97172   Direct   : 29290     Min.   : 0.0000  
##  international:141600   Corporate: 13354     1st Qu.: 0.0000  
##                         TA/TO    :195740     Median : 0.0000  
##                         Undefined:     2     Mean   : 0.2211  
##                         GDS      :   386     3rd Qu.: 0.0000  
##                                              Max.   :21.0000  
##      deposit_type          company       days_in_waiting_list
##  No Deposit:209274   individual:225178   Min.   :  0.000     
##  Refundable:   324   company   : 13594   1st Qu.:  0.000     
##  Non Refund: 29174                       Median :  0.000     
##                                          Mean   :  2.321     
##                                          3rd Qu.:  0.000     
##                                          Max.   :391.000     
##          customer_type         adr          required_car_parking_spaces
##  Transient      :179226   Min.   :  -6.38   Min.   :0.00000            
##  Contract       :  8152   1st Qu.:  69.29   1st Qu.:0.00000            
##  Transient-Party: 50240   Median :  94.59   Median :0.00000            
##  Group          :  1154   Mean   : 101.83   Mean   :0.06252            
##                           3rd Qu.: 126.00   3rd Qu.:0.00000            
##                           Max.   :5400.00   Max.   :8.00000            
##  total_of_special_requests wanted_type kids       loyalty   
##  Min.   :0.0000            0:  1284    0:220108   0:231152  
##  1st Qu.:0.0000            1:237488    1: 16830   1:  5676  
##  Median :0.0000                        2:  1834   2:  1944  
##  Mean   :0.5713                                             
##  3rd Qu.:1.0000                                             
##  Max.   :5.0000

Exploretory data analyses

Since I roughly went through each variable and already did combining, creating and slicing of some variables, some variables may need to be dug much further in order to draw more information from the data. I’m going to rethink each variable and optimize.

I’m going to product some univariat and bivariat gragh approches like histgram, boxplot, pairewise correlation, ext. to identify outliers and trends.

Until this step, I’m not very clear if those removed variables have any helpful informations and I’m going to remeasure them.

Many models can be deployed on this data, I’m going to conduct logistic regression and classification tree approaches then compare the two models.

https://rpubs.com/Jing_Wang/594408