Three years ago my fammily have visited the Yellow Stone national park and stayed in a very expensive lodge even though I was trying to book the less expensive ones two months earlier. I was checking the booking page every hour expecting someone may cancel a booking and still ended up with that very expensive one. I was wondering if there were any trends associated with Hotel booking cancelation. Admittedly, booking cancellation prediction is more practical for hotel managers to orgnize hospitality and optimize revenue.
The hotel booking data contains comprehensive information to predict hotel booking cancellations and more.
I will go through every variable, conduct univariat analyses on most of them, use univariat and bivariat graphs to explore connections between variables, and conduct logistic regression and classification tree approaches to predict booking cansellation probability for certain bookings.
This analyses can benifit hotel managers “making the right room available for the right guest and the right price at the right time via the right distribution channel” (Mehrotra & Ruttley, 2006)
These packages are required to load and munipulate data
library(data.table) # load tata
library(tidyverse) # tidy data
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages ------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(dplyr) # monipulate data
library(feasts)
## Warning: package 'feasts' was built under R version 3.6.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 3.6.3
library(knitr)
These packages are required to build model
The original datasets come from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019.
Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking.
I load the data “hotel” and split it into “h1” for resort hotel and “h2” for city hotel and create a new dataframe combining “h1” and “h2”.
# resort hotel
h1 <- fread("hotels.csv") %>%
janitor::clean_names() %>%
mutate(hotel = "Resort Hotel") %>%
select(hotel, everything())
# city hotel
h2 <- fread("hotels.csv") %>%
janitor::clean_names() %>%
mutate(hotel = "City Hotel") %>%
select(hotel, everything())
hotel_df <- bind_rows(h1, h2)
I checked the data stucture, it has 238780 observations of 32 variables, there are 8 missing values in “children”, I removed the 8 observations of each clumn.
hotel_df <- na.omit(hotel_df)
glimpse(hotel_df)
## Observations: 238,772
## Variables: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <chr> "Transient", "Transient", "Transient...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date <chr> "2015-07-01", "2015-07-01", "2015-07...
I removed some variables which are redundant or not effective for booking cancellation prediction.
“Adults” is the number of adults which is not so relevant in this case.
“Agent” was removed because there were too many choices of agents and the choices had no significant impact on cancellation.
“ArrivalDateDayofMonth” was removed because this information was already included in other variables.
“MarketSegment” was removed because a symillar variable “DistributionChannel” is more relevant.
“ReservationStatus” and “reservationstatusdate” are not relevant and removed.
“ArrivalDateofYear” and “ArrivalDateofWeek” are also considered not relevant as other variables like “daysofweekend” and “arrivaldateofmonth”.
drop.col <- c("adults","agent","arrival_date_day_of_month","market_segment",
"reservation_status","reservation_status_date",
"arrival_date_year","arrival_date_week_number")
hotel <- hotel_df %>% select(-one_of(drop.col))
#glimpse(hotel)
People cancel their booking sometimes because the assigned room type is not what they reserved. Here I combined the two columns into one categorrical variable “wanted_type” with 2 categories: “0” means same and “1” means diffrent.
Considering hotel booking is seasonal as some months in a year are more popular like summer vaction months, I define July, August, popullar months as “2”, “December”,“February” “November” as “1”, and other months as “0”.
hotel <- hotel %>%
mutate(arrival_date_month = ifelse(arrival_date_month %in% c("July","August"),2,ifelse(arrival_date_month %in% c("December","January","November"),0,1)))
In the categorical variable Company, “NULL” means that the booking did not came from a company. Here I define “NULL” as “individual” and other observations than “Null” as “company”
hotel <- hotel %>%
mutate(company = ifelse(company == "NULL","individual","company"))
Domestic tourists and international tourists may have different decissions when conselling a booking, so I condense the Country variable into two categorries: “Domestic” and “international”.
hotel <- hotel %>%
mutate(country = ifelse(country == "PRT","domestic","international"))
If tourists have children especially babies they are more likely to cancele a booking due to kid sickness. Here I combine “children” and “babies” and condense into three categorries: “0”,“2” and “1” for bookings having kids but no babies.
hotel <- hotel %>%
mutate(kids = ifelse(children == 0 & babies==0,0,ifelse(babies != 0,2,1))) %>%
mutate(children = NULL,babies = NULL)
Guests can be classified by three categorries based on cancellation record: new guest(0), loyal guest(1) and non loyal guest(2).
hotel <- hotel %>%
mutate(loyalty = ifelse(is_repeated_guest == 0,0, ifelse(previous_bookings_not_canceled != 0, 1, 2))) %>%
mutate(is_repeated_guest = NULL, previous_bookings_not_canceled = NULL, previous_cancellations = NULL)
Convert most of the categorical variables to factors using forcats as_factor function, and then drop previous character version of that variable
hotel <- hotel %>%
mutate(hotel = as_factor(hotel),
distribution_channel = as_factor(distribution_channel),
is_canceled = as_factor(is_canceled),
arrival_date_month = as_factor(arrival_date_month),
meal = as_factor(meal),
country = as_factor(country),
deposit_type = as_factor(deposit_type),
company = as_factor(company),
customer_type = as_factor(customer_type),
kids = as_factor(kids),
wanted_type = as_factor(wanted_type),
loyalty = as_factor(loyalty)) %>%
glimpse()
## Observations: 238,772
## Variables: 20
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote...
## $ is_canceled <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ...
## $ arrival_date_month <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ...
## $ meal <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,...
## $ country <fct> domestic, domestic, international, inte...
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T...
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No ...
## $ company <fct> individual, individual, individual, ind...
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <fct> Transient, Transient, Transient, Transi...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,...
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ...
## $ wanted_type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ kids <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ loyalty <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Below is the preview of cleaned data
hotel %>% head(6) %>% data.table()
## hotel is_canceled lead_time arrival_date_month
## 1: Resort Hotel 0 342 2
## 2: Resort Hotel 0 737 2
## 3: Resort Hotel 0 7 2
## 4: Resort Hotel 0 13 2
## 5: Resort Hotel 0 14 2
## 6: Resort Hotel 0 14 2
## stays_in_weekend_nights stays_in_week_nights meal country
## 1: 0 0 BB domestic
## 2: 0 0 BB domestic
## 3: 0 1 BB international
## 4: 0 1 BB international
## 5: 0 2 BB international
## 6: 0 2 BB international
## distribution_channel booking_changes deposit_type company
## 1: Direct 3 No Deposit individual
## 2: Direct 4 No Deposit individual
## 3: Direct 0 No Deposit individual
## 4: Corporate 0 No Deposit individual
## 5: TA/TO 0 No Deposit individual
## 6: TA/TO 0 No Deposit individual
## days_in_waiting_list customer_type adr required_car_parking_spaces
## 1: 0 Transient 0 0
## 2: 0 Transient 0 0
## 3: 0 Transient 75 0
## 4: 0 Transient 75 0
## 5: 0 Transient 98 0
## 6: 0 Transient 98 0
## total_of_special_requests wanted_type kids loyalty
## 1: 0 1 0 0
## 2: 0 1 0 0
## 3: 0 1 0 0
## 4: 0 1 0 0
## 5: 1 1 0 0
## 6: 1 1 0 0
Below is a table of variable names, data type and description of variables
hotel.type <- lapply(hotel, class)
hotel.var_desc <- c('Hotel tpye',
'If the booking was canceled(1) or not(0)',
'Number of date between booking and arrival',
'if the arrival date is in a popular month',
'Number of nights booked in weekend nights',
'Number of nights booked in week nights',
'Type of meal booked',
'Country of origin',
'Booking distribution channel',
'Number of booking changed',
'If deopsit was made to guarantee booking',
'ID of the company that made the booking',
'Number of days the books was booking',
'Type of booking categories',
'Average daily rate',
'Number of car parking space required',
'Number of special request made',
'If wanted type was matched(1) or not(0)',
'If booking with kids or babies',
'If guest is new(0) or loyal(1) or less loyal(2)'
)
hotel.var_names <- colnames(hotel)
data.description <- cbind(hotel.var_names, hotel.type, hotel.var_desc)
colnames(data.description) <- c('Variable Name', 'Data Type', 'Variable Description')
#data.description
kable(data.description,row.names = FALSE)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| hotel | factor | Hotel tpye |
| is_canceled | factor | If the booking was canceled(1) or not(0) |
| lead_time | integer | Number of date between booking and arrival |
| arrival_date_month | factor | if the arrival date is in a popular month |
| stays_in_weekend_nights | integer | Number of nights booked in weekend nights |
| stays_in_week_nights | integer | Number of nights booked in week nights |
| meal | factor | Type of meal booked |
| country | factor | Country of origin |
| distribution_channel | factor | Booking distribution channel |
| booking_changes | integer | Number of booking changed |
| deposit_type | factor | If deopsit was made to guarantee booking |
| company | factor | ID of the company that made the booking |
| days_in_waiting_list | integer | Number of days the books was booking |
| customer_type | factor | Type of booking categories |
| adr | numeric | Average daily rate |
| required_car_parking_spaces | integer | Number of car parking space required |
| total_of_special_requests | integer | Number of special request made |
| wanted_type | factor | If wanted type was matched(1) or not(0) |
| kids | factor | If booking with kids or babies |
| loyalty | factor | If guest is new(0) or loyal(1) or less loyal(2) |
Here is a summary of the cleaned dataset.
summary(hotel)
## hotel is_canceled lead_time arrival_date_month
## Resort Hotel:119386 0:150332 Min. : 0 0: 39006
## City Hotel :119386 1: 88440 1st Qu.: 18 1:146698
## Median : 69 2: 53068
## Mean :104
## 3rd Qu.:160
## Max. :737
## stays_in_weekend_nights stays_in_week_nights meal
## Min. : 0.0000 Min. : 0.0 BB :184612
## 1st Qu.: 0.0000 1st Qu.: 1.0 FB : 1596
## Median : 1.0000 Median : 2.0 HB : 28926
## Mean : 0.9276 Mean : 2.5 SC : 21300
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 Undefined: 2338
## Max. :19.0000 Max. :50.0
## country distribution_channel booking_changes
## domestic : 97172 Direct : 29290 Min. : 0.0000
## international:141600 Corporate: 13354 1st Qu.: 0.0000
## TA/TO :195740 Median : 0.0000
## Undefined: 2 Mean : 0.2211
## GDS : 386 3rd Qu.: 0.0000
## Max. :21.0000
## deposit_type company days_in_waiting_list
## No Deposit:209274 individual:225178 Min. : 0.000
## Refundable: 324 company : 13594 1st Qu.: 0.000
## Non Refund: 29174 Median : 0.000
## Mean : 2.321
## 3rd Qu.: 0.000
## Max. :391.000
## customer_type adr required_car_parking_spaces
## Transient :179226 Min. : -6.38 Min. :0.00000
## Contract : 8152 1st Qu.: 69.29 1st Qu.:0.00000
## Transient-Party: 50240 Median : 94.59 Median :0.00000
## Group : 1154 Mean : 101.83 Mean :0.06252
## 3rd Qu.: 126.00 3rd Qu.:0.00000
## Max. :5400.00 Max. :8.00000
## total_of_special_requests wanted_type kids loyalty
## Min. :0.0000 0: 1284 0:220108 0:231152
## 1st Qu.:0.0000 1:237488 1: 16830 1: 5676
## Median :0.0000 2: 1834 2: 1944
## Mean :0.5713
## 3rd Qu.:1.0000
## Max. :5.0000
Since I roughly went through each variable and already did combining, creating and slicing of some variables, some variables may need to be dug much further in order to draw more information from the data. I’m going to rethink each variable and optimize.
I’m going to product some univariat and bivariat gragh approches like histgram, boxplot, pairewise correlation, ext. to identify outliers and trends.
Until this step, I’m not very clear if those removed variables have any helpful informations and I’m going to remeasure them.
Many models can be deployed on this data, I’m going to conduct logistic regression and classification tree approaches then compare the two models.