As one of the most important components on our travelling agenda, securing a hotel room in a strange city or even country ahead of our travelling seem to be one of the first things to hit on our to-do list. Thanks to the technologies and various hotel booking platforms, travelers nowadays are able to book or cancel hotel reservations in just a blink of an eye. Along with the convinience of making hotel reservations, the cancellation of a room reservation has also become much easier. As a result, the potential loss due to hotel cancellation might post additional cost to hotel management in general. In other words, to have a good sense about how cancellation activities look like has become increasingly important for hotels for their daily management as well as long-term business decision making. For this final project, I used the data collected by hotel booking platforms regarding different attributes of the reservation to better understand hotel cancellation in general, then try to answer questions in related to hotel cancellation such as what factors are closely associated with hotel cancellation, and how can we better predict hotel cancellation. With these answers and information, the hotel management team would be better informed with the following aspects:
To answer these questions, I will
The main R package used for this project is tidyverse. The core functions inbeded in tidyverse include ggplot2, tidyr, dplyr etc. Some non-core functions of the tidyverse package, some plotting packages, and the decision tree package were also use in this project.
#install.packages('tidyverse')
#install.packages('readxl')
#install.packages('lubridate')
#install.packages('magrittr')
#install.packages('glue')
#install.packages("ggpubr")
#install.packages("rpart")
#install.packages("party")
#install.packages("rpart.plot")
#install.packages("patchwork")
#install.packages("kableExtra")
#install.packages("readxl")
library(kableExtra)
library(patchwork)
library(tidyverse)
library(readxl)
library(lubridate)
library(magrittr)
library(glue)
library(ggpubr)
library(rpart)
library(party)
library(rpart)
library(rpart.plot)
library(readxl)
The hotels dataset used for this project was downloaded from the BANA 7025 Course Website. A total number of 32 varibales are included in the hotel dataset, most of which are categorical variables. Detailed variable name, type, and descriptions are listed below.
hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled double Value indicating if the booking was canceled (1) or not (0)
lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year double Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number double Week number of year for arrival date
arrival_date_day_of_month double Day of arrival date
stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults double Number of adults
children double Number of children
babies double Number of babies
country character Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
agent character ID of the travel agency that made the booking
company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer
adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces double Number of car parking spaces required by the customer
total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)
As you can see, it is a rather long variable list, and it might be not easy to digest. To have a better idea about the variables, I grouped them into four general categories:
adults, children, babies, country, is_repeated_guest, previous_cancellations, previous_bookings_not_cancelled, agent, company, customer_type, and total_of_special_requests
hotel, meal, reserved_room_type, assigned_room_type, adr, and required_car_parking_spaces
lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list, and reservation_status_date
is_canceled, and reservation_status
Now let's take a closer look at the dataset.
#load dataset.
hotels <- read.csv("hotels.csv", header = TRUE)
glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <chr> "2015-07-01", "2015-07-01", "2015-07-02…
# missing data in general
sum(is.na(hotels))
## [1] 4
# missing data broken down by variable
# colSums(is.na(hotels))
The output above indicates that the hotels dataset is a fairly complete dataset, with only 4 cases of missing data for children. To further check for missing values that might be labeled by other expressions in the dataset (e.g., -99, 99, NULL, etc.), unique values for each variable were extracted using the R fuction table(data$variable). Two variavles agent and company seem to have more NULL values than the others.
table(hotels$agent)
table(hotels$company)
agent and company, there were a large amount of cases being labeled as "NULL" (n=16340, 112593 respectively), which indicated missing dataTo replace the "NULL" values with NA for variable agent and company, the codes below was used. All missing data will remain in the form of "NA" due to the fact that data inputation does not really apply to categorical variables.
hotels$agent[hotels$agent %in% c("NULL")] <- "NA"
hotels$company[hotels$company %in% c("NULL")] <- "NA"
# QA
table(hotels$agent)
table(hotels$company)
Dupicate rows can be problematic and cause issues for data analysis and interpretation. Therefore in this step, I first identified duplicate rows (if any), and then removed them. The code and output below demonstrate this step.
# check the total number of duplicates in the dataset
sum(duplicated(hotels))
## [1] 31994
# remove duplicates
hotels_no_dup <- hotels[!duplicated(hotels), ]
# QA
sum(duplicated(hotels_no_dup))
## [1] 0
hotel dataset. After removing the duplicate rows, the new dataset was saved as hotels_no_dup, and no more duplicate rows remianed in the new datasetOf the 32 variables, variable lead_timeand adr are more understandable as numeric, so summary statistics were computed as shown below. Other categorical variables were demonstrated in the form of barplots.
summary(hotels_no_dup$lead_time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 49.00 79.89 125.00 737.00
summary(hotels_no_dup$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 72.00 98.10 106.34 134.00 5400.00
The minimum value for variable adr seems to be invalid as the averaged hotel rate should not be negative in most situations. Examined the dataset again (with the table(dataset$variable function), there was one value of "-6.38" in the dataset, along with a number of cases with very low adr values such as "0", "0.26" and "0.5" etc. The existance of various low adr values suggests that the case of "-6.38" might not be a misentering or invalid data. Instead, there must be a systematic reason for those low adr values such as gift cards or credits being redeamed or hotel compensation. In a word, without conforming with the client first, all those low values in adr are still considered to be valid and thus no action is needed now.
R function barplot(table(data$variable)) was used to visualize frequencies for several selected categorical variables. Due to the large variable numbers, the output figures are not shown here. Instead, the table below summarizes some of the major patterns found by just eyeballing the data.
| Variable | Aspect Measured | Finding |
|---|---|---|
| hotel | hotel type | About 2/3 were city hotels, the rest were resort hotels |
| is_canceled | Is the reservation cancelled? | Most reservations (about 2/3) were not cancelled, while about 1/3 were cancelled. |
| lead_time | number of days between booking and arrival | Mean lead time was 79.89 (not sure in what unit) |
| arrival_date_year | arrival year | Data was for 2015, 2016, 2017, and 2016 and 2017 seem to have most of the arrivals. |
| arrival_date_month | arrival month | Arrivals seem to be evenly distributed throughout the year, with some peaks in July and August, some lows in January and November. |
| arrival_date_week_number | arrival week | N/A |
| arrival_date_day_of_month | arrival day | N/A |
| stays_in_weekend_nights | number of weekend nights stayed | Most guest who stayed on weekends had spent 1 to two nights. |
| stays_in_week_nights | number of week nights stayed | Most guest who stayed on week days had spent 1 to five nights. In addition, more guest stayed on week days than weekends. |
| adults | number of adults | 2 adults for most of the cases |
| children | number of children | 0 child for most of the cases |
| babies | number of babies | 0 babies for most of the cases |
| meal | type of mean booked | Most guests booked bed & breakfast meal type |
| country | guest country of origin | PRT, FRA, and GBR seem to be the top 3 countries of origin |
| market_segment | market segment | Mostly booked through online travel agencies |
| distribution_channel | distribution channel | Mostly booked through travel agents or tour operators |
| is_repeated_guest | is repeated guest | Most were not repeated guests |
| previous_cancellations | number of previous cancellations | Most guests did not cancel reservations previously |
| previous_bookings_not_canceled | Number of previous bookings not canceled | Most had value in 0 |
| reserved_room_type | room type reserved | Mostly type A and D |
| assigned_room_type | room type assigned | Mostly type A and D |
| booking_changes | number of changes made to the booking | Mostly no changes made to the reservation |
| deposit_type | deposit type | Mostly no deposit |
| agent | ID of the travel agency that made the booking | 9 and 240 account for most |
| company | ID of the company/entity that made the booking or responsible for paying the booking | NA |
| days_in_waiting_list | number of days in waitlist | Mostly 0 |
| customer_type | type of booking | Mostly transient guests |
| adr | average daily rate | Mean is $106.34 |
| required_car_parking_spaces | number of parking spaces required | Mostly 0, with some 1 |
| total_of_special_requests | number of special requests made | Mostly 0, some had 1 to 3 special requests |
| reservation_status | reservation last status | Mostly checked out |
| reservation_status_date | reservation last status date | N/A |
hotels_no_dup is considered clean, and it was renamed as hotels_final. The final dataset has 87396 rows and 32 variables.The purpose of EDA is to have a better idea about the interelationships among predicting variables and whether the reservation was canceled. To achieve this, data-subsetting, dataset-joinning, and composing new variables were done in this step.
To see whether certain patterns would emerge, the final dataset was subsetted for cases that were canceled only. Barplots were also created for the subset hotels_canceled to see if there are any observable patterns in the canceled data. The barplots (not shown here due to large amount) revealed that all variables seemed to have followed the same pattern in the overall dataset and the canceled only dataset. In other words, no specific patterns emerged at this step.
# Create a subset with canceled cases only
hotels_canceled <- hotels_final %>%
filter(is_canceled == 1)
hotels_canceled %>%
summarise(cancelation_rate = sum (is_canceled == 1)/87396 * 100, na.rm = TRUE)
## cancelation_rate na.rm
## 1 27.48982 TRUE
The hotels_final dataset were subsetted in this step. The dataset was subsetted by predicting variable categoris (i.e., customer attributes, hotel attributes, and booking attributes).
# customer attribute subset
hotels_ca <- hotels_final %>%
select(is_canceled, adults, children, babies, country, is_repeated_guest, previous_cancellations, agent, company, customer_type, total_of_special_requests) %>%
glimpse()
# hotel attribute subset
hotels_ha <- hotels_final %>%
select(is_canceled, hotel, meal, reserved_room_type, assigned_room_type, adr, required_car_parking_spaces) %>%
glimpse()
# booking attribute subset, `reservation_status_date` variable excluded due to small relavancy
hotels_ba <- hotels_final %>%
select(is_canceled, lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list) %>%
glimpse()
Due to the fact that the outcome variable is_canceled and most of the variables in the dataset are categorical in nature, chi-squre test was used as the primary method to test variable independency. In other words, test statistics will be used to determine whether a predicting variable is correlated with the outcome variable is_canceled. For some double type variables (i.e., adults, children, babies, previous_cancellations, and total_of_special_requests), dummy variables were created first so that chi-square test could be performed. The codes and output are shown below.
# independency test for is_canceled with other variables
## Whether traveling with child/baby matters
hotels_ca_1 <- hotels_ca %>%
mutate(with_children = case_when(
children > 0 | babies >0 ~ 1,
TRUE ~ 0))
### Visualize with stacked bar plot.
p1 <- ggplot(hotels_ca_1, aes(is_canceled, fill = factor(with_children))) +
geom_bar(position = "fill")
chisq.test(hotels_ca_1$with_children, hotels_ca_1$is_canceled)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hotels_ca_1$with_children and hotels_ca_1$is_canceled
## X-squared = 232.76, df = 1, p-value < 2.2e-16
## Whether guest who previously canceled matters
hotels_ca_2 <- hotels_ca_1 %>%
mutate(previous_canceled = case_when(
previous_cancellations > 0 ~ 1,
TRUE ~ 0))
### Visualize with stacked bar plot.
p2 <- ggplot(hotels_ca_2, aes(is_canceled, fill = factor(previous_canceled))) +
geom_bar(position = "fill")
chisq.test(hotels_ca_2$previous_canceled, hotels_ca_2$is_canceled)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hotels_ca_2$previous_canceled and hotels_ca_2$is_canceled
## X-squared = 1405, df = 1, p-value < 2.2e-16
## Whether customer with special request matters
hotels_ca_3 <- hotels_ca_2 %>%
mutate(special_requests = case_when(
total_of_special_requests > 0 ~ 1,
TRUE ~ 0))
### Visualize with stacked bar plot.
p3 <- ggplot(hotels_ca_3, aes(is_canceled, fill = factor(special_requests))) +
geom_bar(position = "fill")
chisq.test(hotels_ca_3$special_requests, hotels_ca_3$is_canceled)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hotels_ca_3$special_requests and hotels_ca_3$is_canceled
## X-squared = 1443.3, df = 1, p-value < 2.2e-16
## Whether customer type matters
p4 <- ggplot(hotels_ca, aes(is_canceled, fill = factor(customer_type))) +
geom_bar(position = "fill")
chisq.test(hotels_ca$customer_type, hotels_ca$is_canceled)
##
## Pearson's Chi-squared test
##
## data: hotels_ca$customer_type and hotels_ca$is_canceled
## X-squared = 1411.2, df = 3, p-value < 2.2e-16
(p1 | p2) /
(p3 | p4)
The chi-square test statistics showed that all customer attributes examined (with_children, previous_canceled, special_requests, and customer_type) were significantly associated with the cancellation status.
Triangluating with the figures, variable special_requests was selected as the most correlatable customer attribute with hotel cancellation becuase it had the largest x-squared value (1443.3)
Similarly, several hotel attribute variables hotel, reserved_room_type, assigned_room_type, and adr were examined in terms of their relationships with is_canceled. Sepcifically, a variavle room_type_matched was created based on reserved_room_type and assigned_room_type to see whether not getting what they wanted would result in room being canceled. Codes, figures, and chi-square test statistics are all shown below.
# independency test for is_canceled with other variables
## Whether hotel type matters
p5 <- ggplot(hotels_ha, aes(is_canceled, fill = factor(hotel))) +
geom_bar(position = "fill")
chisq.test(hotels_ha$hotel, hotels_ha$is_canceled)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hotels_ha$hotel and hotels_ha$is_canceled
## X-squared = 447.66, df = 1, p-value < 2.2e-16
## Whether getting the correct room type matters
hotels_ha_1 <- hotels_ha %>%
mutate(room_type_matched = case_when(
reserved_room_type == assigned_room_type ~ 1,
TRUE ~ 0))
### Visualize with stacked bar plot.
p6 <- ggplot(hotels_ha_1, aes(is_canceled, fill = factor(room_type_matched))) +
geom_bar(position = "fill")
chisq.test(hotels_ha_1$room_type_matched, hotels_ha_1$is_canceled)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hotels_ha_1$room_type_matched and hotels_ha_1$is_canceled
## X-squared = 4014.8, df = 1, p-value < 2.2e-16
(p5 | p6)
## Whether rate/night matters
### Visualization.
boxplot(adr ~ is_canceled, data = hotels)
### Correlation test
cor.test(hotels_ha$is_canceled, hotels_ha$adr)
##
## Pearson's product-moment correlation
##
## data: hotels_ha$is_canceled and hotels_ha$adr
## t = 38.149, df = 87394, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1214588 0.1345013
## sample estimates:
## cor
## 0.1279856
The test statistics revealed that hotel type, whether customers get the room type that they wanted, and hotel rate all significantly correlated with a reservation's cancellation status. specifically, whether the room type reserved and assigned match matters a lot in terms of whether the room would be canceled or not (x-squared = 4041.8, p < .001). The visualizations suggested that there were slightly more city hotels being canceled than the resort hotels, but test statistics was not very high comparing to room_type_mathced. The adr was positively correlated with cancellation with a statistical significance (correlation coefficient = 0.128, p < 0.001), however, the correlation coefficient was not very high.
room_type_matched is selected as the most correlatable hotel attribute with hotel cancellation.Lastly, several booking attribute variables arrival_date_month, stays_in_week_nights, stays_in_weekend_nights, and distribution_channel were examined in terms of their relationships with is_canceled. To simplify the analysis, arrival_date_month were aggregated into arrival_season, stays_in_week_nights and stays_in_weekend_nights were combined as nights_staying. Codes and output are shown below.
# independency test for is_canceled with other BA variables
## Whether season matters
### Relevel the month
hotels_ba_1 <- hotels_ba %>%
mutate(arrival_date_month = fct_relevel(arrival_date_month, "January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"))
### Categorize months into seasons
hotels_ba_2 <- hotels_ba_1 %>%
mutate(arrival_season = fct_collapse(arrival_date_month,
spring = c("March", "April", "May"),
summer = c("June", "July", "August"),
fall = c("September", "October", "November"),
winter = c("December", "January", "February")
),
arrival_season = fct_explicit_na(arrival_season)
)
### Visualization
p7 <- ggplot(hotels_ba_2, aes(is_canceled, fill = factor(arrival_season))) +
geom_bar(position = "fill")
### Chi-squared test
chisq.test(hotels_ba_2$arrival_season, hotels_ba_2$is_canceled)
##
## Pearson's Chi-squared test
##
## data: hotels_ba_2$arrival_season and hotels_ba_2$is_canceled
## X-squared = 503.14, df = 3, p-value < 2.2e-16
## Whether distribution channel matters
### Visualization
p8 <- ggplot(hotels_ba_2, aes(is_canceled, fill = factor(distribution_channel))) +
geom_bar(position = "fill")
### Chi-squared test
chisq.test(hotels_ba_2$distribution_channel, hotels_ba_2$is_canceled)
## Warning in chisq.test(hotels_ba_2$distribution_channel,
## hotels_ba_2$is_canceled): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: hotels_ba_2$distribution_channel and hotels_ba_2$is_canceled
## X-squared = 2031.2, df = 4, p-value < 2.2e-16
(p7 | p8)
## Whether number of nights staying matters
hotels_ba_3 <- hotels_ba_2 %>%
mutate(nights_staying = stays_in_week_nights + stays_in_weekend_nights)
### Visualization
boxplot(nights_staying ~ is_canceled, data = hotels_ba_3)
### Correlation test
cor.test(hotels_ba_3$is_canceled, hotels_ba_3$nights_staying)
##
## Pearson's product-moment correlation
##
## data: hotels_ba_3$is_canceled and hotels_ba_3$nights_staying
## t = 24.951, df = 87394, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07751542 0.09068133
## sample estimates:
## cor
## 0.08410204
Test statistics suggested that season, number of nights staying, and distribution channel all significantly correlated with hotel cancellation. Traingulating with the figures, we know that summer time had more cancellation than other seasons, and bookings through travelling agents or tour operators had more cancellations.
distribution_channel was selected as the booking attribute that correlated the most with hotel cancellation because it had the largest x-squared value (2031.2).The final predicting variables selected for a further predictive analysis were special_requests, room_type_mathced, and distribution_channel.
The EDA results showed that a lot of the variables were strongly correlated with hotel cancellation. Therefore, in this step, a prediction model for hotel cancellation was tested and fine-tuned. Decision tree was used for the creation of this model.
A final dataset for decision tree needs to be created first. This final dataset should include all relative variables inlcuding the composite variables that were computed in the EDA step (e.g., special_requests, room_type_matched, and distribution_channel. Decision tree will first be perfomred with all related variables, then the model will be compared with the selected composite variables by EDA. Ideally, the variables selected by decision tree would match the variables selected by the correlational analysis in EDA, and a predicting model consist of those variables would be considered to be a good fit for our data.
# Joining EDA subsets
## compute an identifier variable for later merging purpose
hotels_ca_3a <- hotels_ca_3 %>%
mutate(ID = row_number())
hotels_ha_1a <- hotels_ha_1 %>%
mutate(ID = row_number())
hotels_ba_3a <- hotels_ba_3 %>%
mutate(ID = row_number())
## Merge the three datasets to create a general dataset
hotels_dt_full <- hotels_ca_3a %>%
full_join(hotels_ha_1a, by = "ID") %>%
full_join(hotels_ba_3a, by = "ID") %>%
select(is_canceled, with_children, previous_canceled, special_requests, adr, room_type_matched, distribution_channel, arrival_season, nights_staying)
## convert is_canceled to factor
hotels_dt_full$canceled_f <- factor(hotels_dt_full$is_canceled)
hotels_dt_full$children_f <- factor(hotels_dt_full$with_children)
hotels_dt_full$pre_canceled_f <- factor(hotels_dt_full$previous_canceled)
hotels_dt_full$special_f <- factor(hotels_dt_full$special_requests)
hotels_dt_full$room_matched_f <- factor(hotels_dt_full$room_type_matched)
hotels_dt_full$channel_f <- factor(hotels_dt_full$distribution_channel)
hotels_dt_full$season_f <- factor(hotels_dt_full$arrival_season)
hotels_dt <- hotels_dt_full %>%
select(canceled_f, children_f, pre_canceled_f, special_f, adr, room_matched_f, channel_f, season_f, nights_staying)
glimpse(hotels_dt)
# Partition data into training and testing subsets
set.seed(1234)
pd <- sample(2, nrow(hotels_dt), replace = TRUE, prob = c(0.8, 0.2))
train <- hotels_dt[pd == 1,]
test <- hotels_dt[pd == 2,]
## plant the tree in the training set
tree <- rpart(canceled_f ~., train)
## visualize the tree
rpart.plot(tree)
The decision tree above suggests that room_type_matched, distribution_channel, and special_requests are the top tree important variables in predicting cancellation, which is an exact match with our EDA selected variables. It means that our variable selection was robust and valid.
Specifically, the tree above suggests that a customer who received a matched room type, booked through travel agent/tour operators, had special requests, and had previously canceled an reservation would have a chance of 93% to cancel his/her reservation. Also, a customer who received a matched room type, booked through travel agent/tour operators, did not have special requests, paid less than $96/night, and had previously canceled an reservation would have a 97% chance to cancel the reservation.
The accuray rate of our tree was probed by calculating the classification error. As the output below above, our predicting model has a classification error of 0.24, meaning that for 76% of the time, our model made the correct prediction. A similar accuracy rate was found in in the test dataset as well.
## misclassification table for train data
tree1 <- ctree(canceled_f ~., data = train, controls = ctree_control(mincriterion = 0.99, minsplit = 500))
mis_tab <- table(predict(tree1, train), train$canceled_f)
## Misclassification Rate
1 - sum(diag(mis_tab))/sum(mis_tab)
## [1] 0.2440926
## Misclassification Table with Test data
testPred <- predict(tree1, newdata = test)
tab <- table(testPred, test$canceled_f)
### Misclassification Rate in test data
1 - sum(diag(tab))/sum(tab)
## [1] 0.2446809
## Prune the tree to optimize
ptree <- prune (tree,
cp = tree$cptable[which.min(tree$cptable[,"xerror"]),
"CP"])
rpart.plot(ptree)
In conclusion, to better predict hotel cancellations, a decision tree model has been created and tested with strong reliability. This model includes the following variables: room_type_matched, distribution_channel,special_requests, previously_canceled, and adr. Our finding suggests that in order to minimize cancellation rate, hotels could consider using the following strategies:
room_type_matched)