Hotel Cancellation Forecasting

Synopsis

As one of the most important components on our travelling agenda, securing a hotel room in a strange city or even country ahead of our travelling seem to be one of the first things to hit on our to-do list. Thanks to the technologies and various hotel booking platforms, travelers nowadays are able to book or cancel hotel reservations in just a blink of an eye. Along with the convinience of making hotel reservations, the cancellation of a room reservation has also become much easier. As a result, the potential loss due to hotel cancellation might post additional cost to hotel management in general. In other words, to have a good sense about how cancellation activities look like has become increasingly important for hotels for their daily management as well as long-term business decision making. For this final project, I used the data collected by hotel booking platforms regarding different attributes of the reservation to better understand hotel cancellation in general, then try to answer questions in related to hotel cancellation such as what factors are closely associated with hotel cancellation, and how can we better predict hotel cancellation. With these answers and information, the hotel management team would be better informed with the following aspects:

  • How does cancellation look like in general (e.g., cancellation rate, cancellation distribution, etc.)?
  • What factors are more closely related to cancellation (e.g., attributes of the customers, timing of the booking, room characteristics, booking activities, booking avenues, etc.)?
  • How to better predict cancellation based on other known factors? And
  • actions to minimize potential cancellations.

To answer these questions, I will

  1. clean the dataset (e.g., check for variable names, types, missing values, and duplicates; fix any probelms to make the dataset tidy)
  2. perform a diagnostic analysis on the dataset to visualize distributions, detect any invalid values and/or outliers
  3. perform an exploratory data analysis to have a better understanding about the relationships among variables (e.g., check data/variable distributions, check data by group, x-y plots, scatterplots, etc.)
  4. conduct regressional analyses to build a predicting model for cancellation

Package Required

The main R package used for this project is tidyverse. The core functions inbeded in tidyverse include ggplot2, tidyr, dplyr etc. Some non-core functions of the tidyverse package, some plotting packages, and the decision tree package were also use in this project.

#install.packages('tidyverse')
#install.packages('readxl')
#install.packages('lubridate')
#install.packages('magrittr')
#install.packages('glue')
#install.packages("ggpubr")
#install.packages("rpart")
#install.packages("party")
#install.packages("rpart.plot")
#install.packages("patchwork")
#install.packages("kableExtra")
#install.packages("readxl")
library(kableExtra)
library(patchwork)
library(tidyverse)
library(readxl)
library(lubridate)
library(magrittr)
library(glue)
library(ggpubr)
library(rpart)
library(party)
library(rpart)
library(rpart.plot)
library(readxl)

Data Set & Variables

The hotels dataset used for this project was downloaded from the BANA 7025 Course Website. A total number of 32 varibales are included in the hotel dataset, most of which are categorical variables. Detailed variable name, type, and descriptions are listed below.

  1. hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)

  2. is_canceled double Value indicating if the booking was canceled (1) or not (0)

  3. lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

  4. arrival_date_year double Year of arrival date

  5. arrival_date_month character Month of arrival date

  6. arrival_date_week_number double Week number of year for arrival date

  7. arrival_date_day_of_month double Day of arrival date

  8. stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

  9. stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

  10. adults double Number of adults

  11. children double Number of children

  12. babies double Number of babies

  13. meal character Type of meal booked. Categories are presented in standard hospitality meal packages:
  • Undefined/SC – no meal package;
  • BB – Bed & Breakfast;
  • HB – Half board (breakfast and one other meal – usually dinner);
  • FB – Full board (breakfast, lunch and dinner)
  1. country character Country of origin. Categories are represented in the ISO 3155–3:2013 format

  2. market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  3. distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  4. is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)

  5. previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking

  6. previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking

  7. reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons

  8. assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

  9. booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

  10. deposit_type character Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
  • No Deposit – no deposit was made;
  • Non Refund – a deposit was made in the value of the total stay cost;
  • Refundable – a deposit was made with a value under the total cost of stay.
  1. agent character ID of the travel agency that made the booking

  2. company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

  3. days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer

  4. customer_type character Type of booking, assuming one of four categories:
  • Contract - when the booking has an allotment or other type of contract associated to it;
  • Group – when the booking is associated to a group;
  • Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;
  • Transient-party – when the booking is transient, but is associated to at least other transient booking
  1. adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

  2. required_car_parking_spaces double Number of car parking spaces required by the customer

  3. total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)

  4. reservation_status character Reservation last status, assuming one of three categories:
  • Canceled – booking was canceled by the customer;
  • Check-Out – customer has checked in but already departed;
  • No-Show – customer did not check-in and did inform the hotel of the reason why
  1. reservation_status_date double Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

As you can see, it is a rather long variable list, and it might be not easy to digest. To have a better idea about the variables, I grouped them into four general categories:

  1. Customer Attributes

adults, children, babies, country, is_repeated_guest, previous_cancellations, previous_bookings_not_cancelled, agent, company, customer_type, and total_of_special_requests

  1. Hotel Attributes

hotel, meal, reserved_room_type, assigned_room_type, adr, and required_car_parking_spaces

  1. Booking Attributes

lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list, and reservation_status_date

  1. Outcome Variables

is_canceled, and reservation_status


Data Preparation

Glimpse of Data

Now let's take a closer look at the dataset.

#load dataset.
hotels <- read.csv("hotels.csv", header = TRUE)
glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <chr> "2015-07-01", "2015-07-01", "2015-07-02…
  • As the output above shows, there are a total of 119390 observations and 32 variables in the dataset

Missing Data

# missing data in general
sum(is.na(hotels))
## [1] 4
# missing data broken down by variable
# colSums(is.na(hotels))

The output above indicates that the hotels dataset is a fairly complete dataset, with only 4 cases of missing data for children. To further check for missing values that might be labeled by other expressions in the dataset (e.g., -99, 99, NULL, etc.), unique values for each variable were extracted using the R fuction table(data$variable). Two variavles agent and company seem to have more NULL values than the others.

table(hotels$agent)
table(hotels$company)
  • Variable agent and company, there were a large amount of cases being labeled as "NULL" (n=16340, 112593 respectively), which indicated missing data

To replace the "NULL" values with NA for variable agent and company, the codes below was used. All missing data will remain in the form of "NA" due to the fact that data inputation does not really apply to categorical variables.

hotels$agent[hotels$agent %in% c("NULL")] <- "NA"
hotels$company[hotels$company %in% c("NULL")] <- "NA"
# QA
table(hotels$agent)
table(hotels$company)

Check Duplicates

Dupicate rows can be problematic and cause issues for data analysis and interpretation. Therefore in this step, I first identified duplicate rows (if any), and then removed them. The code and output below demonstrate this step.

# check the total number of duplicates in the dataset
sum(duplicated(hotels))
## [1] 31994
# remove duplicates
hotels_no_dup <- hotels[!duplicated(hotels), ]
# QA
sum(duplicated(hotels_no_dup))
## [1] 0
  • As the output above shows, there were 31994 duplicate rows in the hotel dataset. After removing the duplicate rows, the new dataset was saved as hotels_no_dup, and no more duplicate rows remianed in the new dataset

Summary Statistics

Of the 32 variables, variable lead_timeand adr are more understandable as numeric, so summary statistics were computed as shown below. Other categorical variables were demonstrated in the form of barplots.

Numerical Variables

summary(hotels_no_dup$lead_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   49.00   79.89  125.00  737.00
summary(hotels_no_dup$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   72.00   98.10  106.34  134.00 5400.00

The minimum value for variable adr seems to be invalid as the averaged hotel rate should not be negative in most situations. Examined the dataset again (with the table(dataset$variable function), there was one value of "-6.38" in the dataset, along with a number of cases with very low adr values such as "0", "0.26" and "0.5" etc. The existance of various low adr values suggests that the case of "-6.38" might not be a misentering or invalid data. Instead, there must be a systematic reason for those low adr values such as gift cards or credits being redeamed or hotel compensation. In a word, without conforming with the client first, all those low values in adr are still considered to be valid and thus no action is needed now.

Categorical Variables

R function barplot(table(data$variable)) was used to visualize frequencies for several selected categorical variables. Due to the large variable numbers, the output figures are not shown here. Instead, the table below summarizes some of the major patterns found by just eyeballing the data.

Variable Aspect Measured Finding
hotel hotel type About 2/3 were city hotels, the rest were resort hotels
is_canceled Is the reservation cancelled? Most reservations (about 2/3) were not cancelled, while about 1/3 were cancelled.
lead_time number of days between booking and arrival Mean lead time was 79.89 (not sure in what unit)
arrival_date_year arrival year Data was for 2015, 2016, 2017, and 2016 and 2017 seem to have most of the arrivals.
arrival_date_month arrival month Arrivals seem to be evenly distributed throughout the year, with some peaks in July and August, some lows in January and November.
arrival_date_week_number arrival week N/A
arrival_date_day_of_month arrival day N/A
stays_in_weekend_nights number of weekend nights stayed Most guest who stayed on weekends had spent 1 to two nights.
stays_in_week_nights number of week nights stayed Most guest who stayed on week days had spent 1 to five nights. In addition, more guest stayed on week days than weekends.
adults number of adults 2 adults for most of the cases
children number of children 0 child for most of the cases
babies number of babies 0 babies for most of the cases
meal type of mean booked Most guests booked bed & breakfast meal type
country guest country of origin PRT, FRA, and GBR seem to be the top 3 countries of origin
market_segment market segment Mostly booked through online travel agencies
distribution_channel distribution channel Mostly booked through travel agents or tour operators
is_repeated_guest is repeated guest Most were not repeated guests
previous_cancellations number of previous cancellations Most guests did not cancel reservations previously
previous_bookings_not_canceled Number of previous bookings not canceled Most had value in 0
reserved_room_type room type reserved Mostly type A and D
assigned_room_type room type assigned Mostly type A and D
booking_changes number of changes made to the booking Mostly no changes made to the reservation
deposit_type deposit type Mostly no deposit
agent ID of the travel agency that made the booking 9 and 240 account for most
company ID of the company/entity that made the booking or responsible for paying the booking NA
days_in_waiting_list number of days in waitlist Mostly 0
customer_type type of booking Mostly transient guests
adr average daily rate Mean is $106.34
required_car_parking_spaces number of parking spaces required Mostly 0, with some 1
total_of_special_requests number of special requests made Mostly 0, some had 1 to 3 special requests
reservation_status reservation last status Mostly checked out
reservation_status_date reservation last status date N/A

Final Dataset

  • Now after replacing the missing value, removing duplicates, checking for data distribution and invalid values, the datset hotels_no_dup is considered clean, and it was renamed as hotels_final. The final dataset has 87396 rows and 32 variables.

EDA

The purpose of EDA is to have a better idea about the interelationships among predicting variables and whether the reservation was canceled. To achieve this, data-subsetting, dataset-joinning, and composing new variables were done in this step.

Cancellation Overview

To see whether certain patterns would emerge, the final dataset was subsetted for cases that were canceled only. Barplots were also created for the subset hotels_canceled to see if there are any observable patterns in the canceled data. The barplots (not shown here due to large amount) revealed that all variables seemed to have followed the same pattern in the overall dataset and the canceled only dataset. In other words, no specific patterns emerged at this step.

# Create a subset with canceled cases only
hotels_canceled <- hotels_final %>% 
  filter(is_canceled == 1)
hotels_canceled %>% 
  summarise(cancelation_rate = sum (is_canceled == 1)/87396 * 100, na.rm = TRUE)
##   cancelation_rate na.rm
## 1         27.48982  TRUE
  • A total number of 24025 reservations were canceled in the dataset, and the cancellation rate is 27.49%.

Subsetting by Predicting Variable Type

The hotels_final dataset were subsetted in this step. The dataset was subsetted by predicting variable categoris (i.e., customer attributes, hotel attributes, and booking attributes).

# customer attribute subset
hotels_ca <- hotels_final %>% 
  select(is_canceled, adults, children, babies, country, is_repeated_guest, previous_cancellations, agent, company, customer_type, total_of_special_requests) %>% 
  glimpse()
# hotel attribute subset
hotels_ha <- hotels_final %>% 
  select(is_canceled, hotel, meal, reserved_room_type, assigned_room_type, adr, required_car_parking_spaces) %>%
  glimpse()
# booking attribute subset, `reservation_status_date` variable excluded due to small relavancy
hotels_ba <- hotels_final %>% 
  select(is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list) %>% 
  glimpse()

Cancellation & Customer Attributes

Due to the fact that the outcome variable is_canceled and most of the variables in the dataset are categorical in nature, chi-squre test was used as the primary method to test variable independency. In other words, test statistics will be used to determine whether a predicting variable is correlated with the outcome variable is_canceled. For some double type variables (i.e., adults, children, babies, previous_cancellations, and total_of_special_requests), dummy variables were created first so that chi-square test could be performed. The codes and output are shown below.

# independency test for is_canceled with other variables
## Whether traveling with child/baby matters
hotels_ca_1 <- hotels_ca %>%
  mutate(with_children = case_when(
    children > 0 | babies >0 ~ 1,
    TRUE ~ 0)) 
### Visualize with stacked bar plot.
p1 <- ggplot(hotels_ca_1, aes(is_canceled, fill = factor(with_children))) +
  geom_bar(position = "fill")
chisq.test(hotels_ca_1$with_children, hotels_ca_1$is_canceled)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hotels_ca_1$with_children and hotels_ca_1$is_canceled
## X-squared = 232.76, df = 1, p-value < 2.2e-16
## Whether guest who previously canceled matters
hotels_ca_2 <- hotels_ca_1 %>%
  mutate(previous_canceled = case_when(
    previous_cancellations > 0 ~ 1,
    TRUE ~ 0))
### Visualize with stacked bar plot.
p2 <- ggplot(hotels_ca_2, aes(is_canceled, fill = factor(previous_canceled))) +
  geom_bar(position = "fill")
chisq.test(hotels_ca_2$previous_canceled, hotels_ca_2$is_canceled)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hotels_ca_2$previous_canceled and hotels_ca_2$is_canceled
## X-squared = 1405, df = 1, p-value < 2.2e-16
## Whether customer with special request matters
hotels_ca_3 <- hotels_ca_2 %>%
  mutate(special_requests = case_when(
    total_of_special_requests > 0 ~ 1,
    TRUE ~ 0))
### Visualize with stacked bar plot.
p3 <- ggplot(hotels_ca_3, aes(is_canceled, fill = factor(special_requests))) +
  geom_bar(position = "fill")
chisq.test(hotels_ca_3$special_requests, hotels_ca_3$is_canceled)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hotels_ca_3$special_requests and hotels_ca_3$is_canceled
## X-squared = 1443.3, df = 1, p-value < 2.2e-16
## Whether customer type matters
p4 <- ggplot(hotels_ca, aes(is_canceled, fill = factor(customer_type))) +
  geom_bar(position = "fill")
chisq.test(hotels_ca$customer_type, hotels_ca$is_canceled)
## 
##  Pearson's Chi-squared test
## 
## data:  hotels_ca$customer_type and hotels_ca$is_canceled
## X-squared = 1411.2, df = 3, p-value < 2.2e-16
(p1 | p2) /
  (p3 | p4)

  • The chi-square test statistics showed that all customer attributes examined (with_children, previous_canceled, special_requests, and customer_type) were significantly associated with the cancellation status.

  • Triangluating with the figures, variable special_requests was selected as the most correlatable customer attribute with hotel cancellation becuase it had the largest x-squared value (1443.3)

Cancellation & Hotel Attributes

Similarly, several hotel attribute variables hotel, reserved_room_type, assigned_room_type, and adr were examined in terms of their relationships with is_canceled. Sepcifically, a variavle room_type_matched was created based on reserved_room_type and assigned_room_type to see whether not getting what they wanted would result in room being canceled. Codes, figures, and chi-square test statistics are all shown below.

# independency test for is_canceled with other variables
## Whether hotel type matters
p5 <- ggplot(hotels_ha, aes(is_canceled, fill = factor(hotel))) +
  geom_bar(position = "fill")
chisq.test(hotels_ha$hotel, hotels_ha$is_canceled)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hotels_ha$hotel and hotels_ha$is_canceled
## X-squared = 447.66, df = 1, p-value < 2.2e-16
## Whether getting the correct room type matters
hotels_ha_1 <- hotels_ha %>%
  mutate(room_type_matched = case_when(
    reserved_room_type == assigned_room_type ~ 1,
    TRUE ~ 0)) 
### Visualize with stacked bar plot.
p6 <- ggplot(hotels_ha_1, aes(is_canceled, fill = factor(room_type_matched))) +
  geom_bar(position = "fill")
chisq.test(hotels_ha_1$room_type_matched, hotels_ha_1$is_canceled)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hotels_ha_1$room_type_matched and hotels_ha_1$is_canceled
## X-squared = 4014.8, df = 1, p-value < 2.2e-16
(p5 | p6)

## Whether rate/night matters
### Visualization.
boxplot(adr ~ is_canceled, data = hotels)

### Correlation test
cor.test(hotels_ha$is_canceled, hotels_ha$adr)
## 
##  Pearson's product-moment correlation
## 
## data:  hotels_ha$is_canceled and hotels_ha$adr
## t = 38.149, df = 87394, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1214588 0.1345013
## sample estimates:
##       cor 
## 0.1279856

The test statistics revealed that hotel type, whether customers get the room type that they wanted, and hotel rate all significantly correlated with a reservation's cancellation status. specifically, whether the room type reserved and assigned match matters a lot in terms of whether the room would be canceled or not (x-squared = 4041.8, p < .001). The visualizations suggested that there were slightly more city hotels being canceled than the resort hotels, but test statistics was not very high comparing to room_type_mathced. The adr was positively correlated with cancellation with a statistical significance (correlation coefficient = 0.128, p < 0.001), however, the correlation coefficient was not very high.

  • In summery, room_type_matched is selected as the most correlatable hotel attribute with hotel cancellation.

Cancellation & Booking Attributes

Lastly, several booking attribute variables arrival_date_month, stays_in_week_nights, stays_in_weekend_nights, and distribution_channel were examined in terms of their relationships with is_canceled. To simplify the analysis, arrival_date_month were aggregated into arrival_season, stays_in_week_nights and stays_in_weekend_nights were combined as nights_staying. Codes and output are shown below.

# independency test for is_canceled with other BA variables
## Whether season matters
### Relevel the month
hotels_ba_1 <- hotels_ba %>% 
  mutate(arrival_date_month = fct_relevel(arrival_date_month, "January",
                                          "February",
                                          "March",
                                          "April",
                                          "May",
                                          "June",
                                          "July",
                                          "August",
                                          "September",
                                          "October",
                                          "November",
                                          "December"))
### Categorize months into seasons
hotels_ba_2 <- hotels_ba_1 %>% 
  mutate(arrival_season = fct_collapse(arrival_date_month,
    spring = c("March", "April", "May"),
    summer = c("June", "July", "August"),
    fall = c("September", "October", "November"),
    winter = c("December", "January", "February")
  ),
  arrival_season = fct_explicit_na(arrival_season)
  )
### Visualization
p7 <- ggplot(hotels_ba_2, aes(is_canceled, fill = factor(arrival_season))) +
  geom_bar(position = "fill")
### Chi-squared test
chisq.test(hotels_ba_2$arrival_season, hotels_ba_2$is_canceled)
## 
##  Pearson's Chi-squared test
## 
## data:  hotels_ba_2$arrival_season and hotels_ba_2$is_canceled
## X-squared = 503.14, df = 3, p-value < 2.2e-16
## Whether distribution channel matters
### Visualization
p8 <- ggplot(hotels_ba_2, aes(is_canceled, fill = factor(distribution_channel))) +
  geom_bar(position = "fill")
### Chi-squared test
chisq.test(hotels_ba_2$distribution_channel, hotels_ba_2$is_canceled)
## Warning in chisq.test(hotels_ba_2$distribution_channel,
## hotels_ba_2$is_canceled): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  hotels_ba_2$distribution_channel and hotels_ba_2$is_canceled
## X-squared = 2031.2, df = 4, p-value < 2.2e-16
(p7 | p8)

## Whether number of nights staying matters
hotels_ba_3 <- hotels_ba_2 %>% 
  mutate(nights_staying = stays_in_week_nights + stays_in_weekend_nights)
### Visualization
boxplot(nights_staying ~ is_canceled, data = hotels_ba_3)

### Correlation test
cor.test(hotels_ba_3$is_canceled, hotels_ba_3$nights_staying)
## 
##  Pearson's product-moment correlation
## 
## data:  hotels_ba_3$is_canceled and hotels_ba_3$nights_staying
## t = 24.951, df = 87394, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07751542 0.09068133
## sample estimates:
##        cor 
## 0.08410204

Test statistics suggested that season, number of nights staying, and distribution channel all significantly correlated with hotel cancellation. Traingulating with the figures, we know that summer time had more cancellation than other seasons, and bookings through travelling agents or tour operators had more cancellations.

  • Integrating both statistics and visualizations, distribution_channel was selected as the booking attribute that correlated the most with hotel cancellation because it had the largest x-squared value (2031.2).

EDA Summary

The final predicting variables selected for a further predictive analysis were special_requests, room_type_mathced, and distribution_channel.


Cancellation Prediction

The EDA results showed that a lot of the variables were strongly correlated with hotel cancellation. Therefore, in this step, a prediction model for hotel cancellation was tested and fine-tuned. Decision tree was used for the creation of this model.

Decision Tree

A final dataset for decision tree needs to be created first. This final dataset should include all relative variables inlcuding the composite variables that were computed in the EDA step (e.g., special_requests, room_type_matched, and distribution_channel. Decision tree will first be perfomred with all related variables, then the model will be compared with the selected composite variables by EDA. Ideally, the variables selected by decision tree would match the variables selected by the correlational analysis in EDA, and a predicting model consist of those variables would be considered to be a good fit for our data.

Creating Final Datasets

# Joining EDA subsets
## compute an identifier variable for later merging purpose
hotels_ca_3a <- hotels_ca_3 %>% 
  mutate(ID = row_number())
hotels_ha_1a <- hotels_ha_1 %>% 
  mutate(ID = row_number())
hotels_ba_3a <- hotels_ba_3 %>% 
  mutate(ID = row_number())
## Merge the three datasets to create a general dataset
hotels_dt_full <- hotels_ca_3a %>% 
  full_join(hotels_ha_1a, by = "ID") %>% 
  full_join(hotels_ba_3a, by = "ID") %>% 
  select(is_canceled, with_children, previous_canceled, special_requests, adr, room_type_matched, distribution_channel, arrival_season, nights_staying)
## convert is_canceled to factor
hotels_dt_full$canceled_f <- factor(hotels_dt_full$is_canceled)
hotels_dt_full$children_f <- factor(hotels_dt_full$with_children)
hotels_dt_full$pre_canceled_f <- factor(hotels_dt_full$previous_canceled)
hotels_dt_full$special_f <- factor(hotels_dt_full$special_requests)
hotels_dt_full$room_matched_f <- factor(hotels_dt_full$room_type_matched)
hotels_dt_full$channel_f <- factor(hotels_dt_full$distribution_channel)
hotels_dt_full$season_f <- factor(hotels_dt_full$arrival_season)
hotels_dt <- hotels_dt_full %>% 
  select(canceled_f, children_f, pre_canceled_f, special_f, adr, room_matched_f, channel_f, season_f, nights_staying)
glimpse(hotels_dt)

Grow the Decision Tree

# Partition data into training and testing subsets
set.seed(1234)
pd <- sample(2, nrow(hotels_dt), replace = TRUE, prob = c(0.8, 0.2))
train <- hotels_dt[pd == 1,]
test <- hotels_dt[pd == 2,]
## plant the tree in the training set
tree <- rpart(canceled_f ~., train)
## visualize the tree
rpart.plot(tree)

The decision tree above suggests that room_type_matched, distribution_channel, and special_requests are the top tree important variables in predicting cancellation, which is an exact match with our EDA selected variables. It means that our variable selection was robust and valid.

Specifically, the tree above suggests that a customer who received a matched room type, booked through travel agent/tour operators, had special requests, and had previously canceled an reservation would have a chance of 93% to cancel his/her reservation. Also, a customer who received a matched room type, booked through travel agent/tour operators, did not have special requests, paid less than $96/night, and had previously canceled an reservation would have a 97% chance to cancel the reservation.


Accuray Rate of the Model

The accuray rate of our tree was probed by calculating the classification error. As the output below above, our predicting model has a classification error of 0.24, meaning that for 76% of the time, our model made the correct prediction. A similar accuracy rate was found in in the test dataset as well.

## misclassification table for train data
tree1 <- ctree(canceled_f ~., data = train, controls = ctree_control(mincriterion = 0.99, minsplit = 500))
mis_tab <- table(predict(tree1, train), train$canceled_f)
## Misclassification Rate
1 - sum(diag(mis_tab))/sum(mis_tab)
## [1] 0.2440926
## Misclassification Table with Test data
testPred <- predict(tree1, newdata = test)
tab <- table(testPred, test$canceled_f)
### Misclassification Rate in test data
1 - sum(diag(tab))/sum(tab)
## [1] 0.2446809
  • The tree that we have now seems to have offered us more information than we wanted (too many variables), therefore, a tree-pruning process was applied to see if any redundant variables could be removed. However, the pruned tree was exactly the same as our original tree, indicating that our current model is the best already.
## Prune the tree to optimize
ptree <- prune (tree, 
                cp = tree$cptable[which.min(tree$cptable[,"xerror"]), 
                                  "CP"])
rpart.plot(ptree)

Conclusion

In conclusion, to better predict hotel cancellations, a decision tree model has been created and tested with strong reliability. This model includes the following variables: room_type_matched, distribution_channel,special_requests, previously_canceled, and adr. Our finding suggests that in order to minimize cancellation rate, hotels could consider using the following strategies:

  1. upgrading a reservation when it is possible (no to room_type_matched)
  2. setting more strict cancellation restrictions for travelling agents and tour operators
  3. setting more strict cancellation restrictions for customers who had previously canceled their reservations