Three years ago my family visited Yellowstone National Park and stayed in a very expensive lodge, even though I had started trying to book a less expensive one two months earlier. I checked the booking page every hour, hoping someone would cancel a booking, and still ended up with the very expensive lodge. That made me wonder whether there are any trends associated with hotel booking cancellation. Admittedly, predicting booking cancellations is of more practical use to hotel managers, who need to organize hospitality and optimize revenue.
The hotel booking data contains comprehensive information to predict hotel booking cancellations and more.
I will go through every variable, conduct univariate analyses on most of them, use univariate and bivariate graphs to explore connections between variables, and apply logistic regression and classification tree approaches to predict the cancellation probability of a given booking.
This analysis can help hotel managers in “making the right room available for the right guest and the right price at the right time via the right distribution channel” (Mehrotra & Ruttley, 2006).
These packages are required to load and manipulate data.
library(data.table) # load data
library(tidyverse) # tidy data
## -- Attaching packages --------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(dplyr) # manipulate data
library(feasts)
## Loading required package: fabletools
library(knitr) # knit
library(stringr) # manipulate strings
## modeling package
library(rpart) # fit tree models
library(rpart.plot) # draw result of tree models
library(rattle) # fancy tree plot
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
The modeling packages above (rpart, rpart.plot and rattle) are required to build and plot the tree model.
The original datasets come from the open hotel booking demand dataset published by Antonio, Almeida and Nunes (2019).
Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking.
I load the data “hotel” and split it into “h1” for resort hotel and “h2” for city hotel and create a new dataframe combining “h1” and “h2”.
# read the file once and clean the column names
hotels_raw <- fread("hotels.csv") %>%
  janitor::clean_names()
# resort hotel
h1 <- hotels_raw %>%
  filter(hotel == "Resort Hotel")
# city hotel
h2 <- hotels_raw %>%
  filter(hotel == "City Hotel")
#table(h1$hotel)
hotel_df <- bind_rows(h1, h2)
I checked the data structure: after removing the 4 observations with missing values in “children”, the data has 119,386 observations of 32 variables.
hotel_df <- na.omit(hotel_df)
glimpse(hotel_df)
## Rows: 119,386
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <chr> "Transient", "Transient", "Transient...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date <chr> "2015-07-01", "2015-07-01", "2015-07...
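Before dropping rows with `na.omit()`, it can help to see which columns actually contain missing values. The sketch below uses a small made-up data frame (`toy`, not the real booking data). It also counts the literal string "NULL", which `na.omit()` does not treat as missing; the glimpse output above shows such strings in `agent` and `company`.

```r
# Toy data frame standing in for the booking data (values are made up)
toy <- data.frame(
  children = c(0, NA, 1, 0),
  agent    = c("NULL", "304", "240", "NULL"),
  stringsAsFactors = FALSE
)

# Count real missing values (NA) per column
na_per_column <- colSums(is.na(toy))

# Count literal "NULL" strings per column; these are not NA
null_per_column <- sapply(toy, function(col) sum(col == "NULL", na.rm = TRUE))

# Rows with at least one NA are removed; rows with "NULL" strings are kept
toy_clean <- na.omit(toy)

print(na_per_column)
print(null_per_column)
print(nrow(toy_clean))
```

In this toy case only the row with a missing `children` value is dropped, while both "NULL" agents survive, which is why such strings need their own recoding step.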
I removed some variables that are redundant or not useful for predicting booking cancellation.
“Adults”, the number of adults, is not very relevant in this case.
“Agent” was removed because there are too many distinct agents and the choice of agent had no significant impact on cancellation.
“ArrivalDateDayOfMonth” was removed because this information is already included in other variables.
“MarketSegment” was removed because the similar variable “DistributionChannel” is more relevant.
“ReservationStatus” and “ReservationStatusDate” are not relevant and were removed.
“ArrivalDateYear” and “ArrivalDateWeekNumber” were also removed as not relevant, because other variables such as “StaysInWeekendNights” and “ArrivalDateMonth” already describe the timing of the stay.
drop.col <- c("adults","agent","arrival_date_day_of_month","market_segment",
"reservation_status","reservation_status_date",
"arrival_date_year","arrival_date_week_number")
hotel <- hotel_df %>% select(-one_of(drop.col))
glimpse(hotel)
## Rows: 119,386
## Columns: 24
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_month <chr> "July", "July", "July", "July", "Jul...
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Depo...
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <chr> "Transient", "Transient", "Transient...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
Guests sometimes cancel a booking because the assigned room type is not the one they reserved. Here I combined the two columns into one categorical variable, “wanted_type”, with two categories: “0” means the types are the same and “1” means they are different.
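The code that builds `wanted_type` is not shown in this report. A minimal sketch of one way to derive it, assuming the coding stated above (0 = same, 1 = different) and using a made-up data frame `toy` rather than the real data:

```r
library(dplyr)

# Made-up reserved and assigned room types
toy <- data.frame(
  reserved_room_type = c("C", "A", "A", "D"),
  assigned_room_type = c("C", "C", "A", "D"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  # 0 when the assigned type equals the reserved type, 1 otherwise
  mutate(wanted_type = ifelse(reserved_room_type == assigned_room_type, 0, 1)) %>%
  # drop the two source columns once the combined variable exists
  select(-reserved_room_type, -assigned_room_type)

print(toy$wanted_type)
```

Whether this reproduces the counts reported for the real data depends on how the original step was written, so treat it as an illustration of the idea only.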
Hotel booking is seasonal, and some months, such as the summer vacation months, are more popular than others. I therefore code the popular months July and August as “2”, the months November, December and January as “0”, and all other months as “1”.
hotel <- hotel %>%
mutate(arrival_date_month = ifelse(arrival_date_month %in% c("July","August"),2,ifelse(arrival_date_month %in% c("December","January","November"),0,1)))
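The nested `ifelse()` above can be hard to read. A sketch of the same mapping written with `dplyr::case_when()`, applied to a few month names (the helper name `code_month` is hypothetical):

```r
library(dplyr)

# Map a month name to the seasonal code used above:
# 2 = July/August, 0 = November/December/January, 1 = every other month
code_month <- function(month) {
  case_when(
    month %in% c("July", "August") ~ 2,
    month %in% c("December", "January", "November") ~ 0,
    TRUE ~ 1
  )
}

codes <- code_month(c("July", "January", "April", "November", "August"))
print(codes)
```

Each condition is checked in order, so a month that matches none of the named groups falls through to the final `TRUE ~ 1` branch.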
In the categorical variable “company”, “NULL” means that the booking did not come from a company. Here I recode “NULL” as “individual” and every other value as “company”.
hotel <- hotel %>%
mutate(company = ifelse(company == "NULL","individual","company"))
Domestic and international tourists may decide differently when canceling a booking, so I condense the country variable into two categories: “domestic” and “international”.
hotel <- hotel %>%
mutate(country = ifelse(country == "PRT","domestic","international"))
Guests travelling with children, especially babies, may be more likely to cancel a booking, for example when a child falls ill. Here I combine “children” and “babies” into one variable, “kids”, with three categories: “0” for bookings with no children or babies, “1” for bookings with children but no babies, and “2” for bookings with babies.
hotel <- hotel %>%
mutate(kids = ifelse(children == 0 & babies==0,0,ifelse(babies != 0,2,1))) %>%
mutate(children = NULL,babies = NULL)
Guests can be classified into three categories based on their booking record: new guest (0), loyal guest (1) and non-loyal guest (2).
hotel <- hotel %>%
mutate(loyalty = ifelse(is_repeated_guest == 0,0, ifelse(previous_bookings_not_canceled != 0, 1, 2))) %>%
mutate(is_repeated_guest = NULL, previous_bookings_not_canceled = NULL, previous_cancellations = NULL)
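To make the three loyalty codes concrete, here is the same rule applied to a made-up data frame (`toy`, not the real data). Note that a guest not flagged as repeated is coded 0 whatever the booking history says.

```r
library(dplyr)

# Made-up guest records
toy <- data.frame(
  is_repeated_guest              = c(0, 1, 1, 0),
  previous_bookings_not_canceled = c(0, 3, 0, 2)
)

toy <- toy %>%
  mutate(loyalty = ifelse(is_repeated_guest == 0, 0,                 # new guest
                   ifelse(previous_bookings_not_canceled != 0, 1,    # loyal guest
                          2)))                                       # non-loyal guest

print(toy$loyalty)
```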
I convert most of the categorical variables to factors with the forcats as_factor function, replacing the previous character version of each variable.
hotel <- hotel %>%
mutate(hotel = as_factor(hotel),
distribution_channel = as_factor(distribution_channel),
is_canceled = as_factor(is_canceled),
arrival_date_month = as_factor(arrival_date_month),
meal = as_factor(meal),
country = as_factor(country),
deposit_type = as_factor(deposit_type),
company = as_factor(company),
customer_type = as_factor(customer_type),
kids = as_factor(kids),
wanted_type = as_factor(wanted_type),
loyalty = as_factor(loyalty)) %>%
glimpse()
## Rows: 119,386
## Columns: 20
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote...
## $ is_canceled <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ...
## $ arrival_date_month <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ...
## $ meal <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,...
## $ country <fct> domestic, domestic, international, inte...
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T...
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No ...
## $ company <fct> individual, individual, individual, ind...
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type <fct> Transient, Transient, Transient, Transi...
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,...
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ...
## $ wanted_type <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ kids <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ loyalty <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Check outliers
As shown below, there is one extreme outlier in “adr” that is much higher than the rest, so I removed it.
#check outliers in numeric variables
hotel_num <- hotel %>% select_if(is.numeric)
ggplot(gather(hotel_num,key,value=cancelation),
aes(x=cancelation)) +
geom_boxplot () +
facet_wrap(~key,scales = "free_x") +
ggtitle(" Boxplot of numeric variables")
hotel<- hotel %>% filter(adr<2000)
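A sketch of how the extreme `adr` value could be located before filtering, using made-up rates (the vector `adr_toy` is illustrative, and the 2000 cutoff is the one used above):

```r
# Made-up average daily rates with one extreme value
adr_toy <- c(75, 98, 120.5, 5400, 62, 110)

# The five-number summary makes the gap between the maximum and the rest visible
print(summary(adr_toy))

# Position and value of the largest observation
worst <- which.max(adr_toy)
print(adr_toy[worst])

# Keep only observations under the cutoff
adr_kept <- adr_toy[adr_toy < 2000]
print(length(adr_kept))
```

A fixed cutoff is the simplest choice when a single point stands far apart from the rest; a rule based on quantiles would be an alternative if several extreme values were present.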
Below is the preview of cleaned data
hotel %>% head(6) %>% data.table()
## hotel is_canceled lead_time arrival_date_month
## 1: Resort Hotel 0 342 2
## 2: Resort Hotel 0 737 2
## 3: Resort Hotel 0 7 2
## 4: Resort Hotel 0 13 2
## 5: Resort Hotel 0 14 2
## 6: Resort Hotel 0 14 2
## stays_in_weekend_nights stays_in_week_nights meal country
## 1: 0 0 BB domestic
## 2: 0 0 BB domestic
## 3: 0 1 BB international
## 4: 0 1 BB international
## 5: 0 2 BB international
## 6: 0 2 BB international
## distribution_channel booking_changes deposit_type company
## 1: Direct 3 No Deposit individual
## 2: Direct 4 No Deposit individual
## 3: Direct 0 No Deposit individual
## 4: Corporate 0 No Deposit individual
## 5: TA/TO 0 No Deposit individual
## 6: TA/TO 0 No Deposit individual
## days_in_waiting_list customer_type adr required_car_parking_spaces
## 1: 0 Transient 0 0
## 2: 0 Transient 0 0
## 3: 0 Transient 75 0
## 4: 0 Transient 75 0
## 5: 0 Transient 98 0
## 6: 0 Transient 98 0
## total_of_special_requests wanted_type kids loyalty
## 1: 0 1 0 0
## 2: 0 1 0 0
## 3: 0 1 0 0
## 4: 0 1 0 0
## 5: 1 1 0 0
## 6: 1 1 0 0
Below is a table of variable names, data type and description of variables
hotel.type <- lapply(hotel, class)
hotel.var_desc <- c('Hotel type',
'Whether the booking was canceled (1) or not (0)',
'Number of days between booking and arrival',
'Whether the arrival date is in a popular month',
'Number of weekend nights booked',
'Number of week nights booked',
'Type of meal booked',
'Country of origin',
'Booking distribution channel',
'Number of changes made to the booking',
'Whether a deposit was made to guarantee the booking',
'Whether the booking was made by a company or an individual',
'Number of days the booking was on the waiting list',
'Type of booking',
'Average daily rate',
'Number of car parking spaces required',
'Number of special requests made',
'Whether the wanted room type was matched (1) or not (0)',
'Whether the booking includes kids (1) or babies (2)',
'Whether the guest is new (0), loyal (1) or less loyal (2)'
)
hotel.var_names <- colnames(hotel)
data.description <- cbind(hotel.var_names, hotel.type, hotel.var_desc)
colnames(data.description) <- c('Variable Name', 'Data Type', 'Variable Description')
#data.description
kable(data.description,row.names = FALSE)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| hotel | factor | Hotel type |
| is_canceled | factor | Whether the booking was canceled (1) or not (0) |
| lead_time | integer | Number of days between booking and arrival |
| arrival_date_month | factor | Whether the arrival date is in a popular month |
| stays_in_weekend_nights | integer | Number of weekend nights booked |
| stays_in_week_nights | integer | Number of week nights booked |
| meal | factor | Type of meal booked |
| country | factor | Country of origin |
| distribution_channel | factor | Booking distribution channel |
| booking_changes | integer | Number of changes made to the booking |
| deposit_type | factor | Whether a deposit was made to guarantee the booking |
| company | factor | Whether the booking was made by a company or an individual |
| days_in_waiting_list | integer | Number of days the booking was on the waiting list |
| customer_type | factor | Type of booking |
| adr | numeric | Average daily rate |
| required_car_parking_spaces | integer | Number of car parking spaces required |
| total_of_special_requests | integer | Number of special requests made |
| wanted_type | factor | Whether the wanted room type was matched (1) or not (0) |
| kids | factor | Whether the booking includes kids (1) or babies (2) |
| loyalty | factor | Whether the guest is new (0), loyal (1) or less loyal (2) |
Here is a summary of the cleaned dataset.
summary(hotel)
## hotel is_canceled lead_time arrival_date_month
## Resort Hotel:40060 0:75166 Min. : 0 0:19503
## City Hotel :79325 1:44219 1st Qu.: 18 1:73348
## Median : 69 2:26534
## Mean :104
## 3rd Qu.:160
## Max. :737
## stays_in_weekend_nights stays_in_week_nights meal
## Min. : 0.0000 Min. : 0.0 BB :92305
## 1st Qu.: 0.0000 1st Qu.: 1.0 FB : 798
## Median : 1.0000 Median : 2.0 HB :14463
## Mean : 0.9276 Mean : 2.5 SC :10650
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 Undefined: 1169
## Max. :19.0000 Max. :50.0
## country distribution_channel booking_changes
## domestic :48585 Direct :14645 Min. : 0.0000
## international:70800 Corporate: 6677 1st Qu.: 0.0000
## TA/TO :97869 Median : 0.0000
## Undefined: 1 Mean : 0.2211
## GDS : 193 3rd Qu.: 0.0000
## Max. :21.0000
## deposit_type company days_in_waiting_list
## No Deposit:104637 individual:112588 Min. : 0.000
## Refundable: 162 company : 6797 1st Qu.: 0.000
## Non Refund: 14586 Median : 0.000
## Mean : 2.321
## 3rd Qu.: 0.000
## Max. :391.000
## customer_type adr required_car_parking_spaces
## Transient :89612 Min. : -6.38 Min. :0.00000
## Contract : 4076 1st Qu.: 69.29 1st Qu.:0.00000
## Transient-Party:25120 Median : 94.59 Median :0.00000
## Group : 577 Mean :101.79 Mean :0.06252
## 3rd Qu.:126.00 3rd Qu.:0.00000
## Max. :510.00 Max. :8.00000
## total_of_special_requests wanted_type kids loyalty
## Min. :0.0000 0: 642 0:110053 0:115575
## 1st Qu.:0.0000 1:118743 1: 8415 1: 2838
## Median :0.0000 2: 917 2: 972
## Mean :0.5713
## 3rd Qu.:1.0000
## Max. :5.0000
The data set came from two hotels: one is a city hotel and the other is a resort hotel. First I checked the overall cancellation percentage and the cancellation percentage for each hotel. The cleaned data contains 40,060 resort hotel bookings and 79,325 city hotel bookings, and the breakdown by hotel and cancellation status is shown below. Canceled city hotel bookings make up 27.7% of all bookings, while canceled resort hotel bookings make up 9.3%.
#hot <- hotel_df
hot <- hotel %>%
group_by(hotel)%>%
count(is_canceled) %>%
unite("hot_status",1:2)
pie(hot$n,labels = as.character(hot$hot_status))
hot %>% mutate(percent = n/sum(n)) %>%
select(-n)
## # A tibble: 4 x 2
## hot_status percent
## <chr> <dbl>
## 1 Resort Hotel_0 0.242
## 2 Resort Hotel_1 0.0932
## 3 City Hotel_0 0.387
## 4 City Hotel_1 0.277
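These shares are taken over all bookings, and the city hotel has about twice as many bookings as the resort hotel. Within each hotel, the cancellation rate is roughly 42% for the city hotel (0.277 / (0.387 + 0.277)) and roughly 28% for the resort hotel (0.0932 / (0.242 + 0.0932)), so a city hotel booking is about 1.5 times as likely to be canceled as a resort hotel booking.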
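A within-hotel cancellation rate can be computed by grouping by hotel before dividing. A minimal sketch with made-up counts (the data frame `toy_counts` is illustrative, not the real data):

```r
library(dplyr)

# Made-up booking counts by hotel and cancellation status
toy_counts <- data.frame(
  hotel       = c("Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0, 1),
  n           = c(70, 30, 120, 80),
  stringsAsFactors = FALSE
)

rates <- toy_counts %>%
  group_by(hotel) %>%
  # sum(n) is now the total per hotel, not the grand total
  mutate(rate_within_hotel = n / sum(n)) %>%
  ungroup() %>%
  filter(is_canceled == 1)

print(rates)
```

Dividing by the grand total instead would mix the cancellation rate with the relative size of each hotel, which is why the two calculations can give very different impressions.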
Here are histograms of the numeric features, split by cancellation status.
hotel_num <- hotel_num %>% filter(adr<2000)
hotel_num <- data.frame(is_canceled=hotel$is_canceled,hotel_num)
ggplot(gather(hotel_num,key,value=cancelation,2:9),
aes(x=cancelation,fill = is_canceled))+
geom_histogram()+
facet_wrap(~key,scales = "free") +
ggtitle(" Histogram of numeric variables")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histograms of the numeric variables, I found that certain patterns exist between the likelihood of cancellation and both the average daily rate and the length of stay. The two scatter plots below reveal some of them. More cancellations occurred among bookings with stays longer than one week and shorter than three weeks. Bookings with fewer than 4 week nights are more likely to be canceled when they are more expensive.
ggplot(hotel_num,aes(x=stays_in_weekend_nights,y=adr,color=is_canceled))+
  geom_point()+
  ggtitle("Booking cancellation by adr and weekend nights")
ggplot(hotel_num,aes(x=stays_in_week_nights,y=adr,color=is_canceled))+
  geom_point()+
  ggtitle("Booking cancellation by adr and week nights")
Here are the bar charts of the categorical features by cancellation status. They show some patterns:
Bookings with arrival_date_month equal to 0, meaning arrival in November, December or January, are more likely to be canceled.
Bookings from outside of the country are less likely to be canceled.
Most Non Refund bookings were canceled.
Bookings involving kids, especially babies, are more likely to be canceled.
## bar charts of factor variables
hotel_hist_fact <- hotel %>% select_if(is.factor)
ggplot(gather(hotel_hist_fact,key,value = cancelation,3:ncol(hotel_hist_fact)),
  aes(x=cancelation,fill=is_canceled))+
  geom_bar()+
  facet_wrap(~key,scales = "free",nrow = 3)+
  ggtitle("Bar charts of categorical variables")
## Warning: attributes are not identical across measure variables;
## they will be dropped
A classification tree is a simple representation used in supervised machine learning, in which the data is repeatedly split according to the value of a feature. Such a tree is built through a process known as binary recursive partitioning: an iterative process of splitting the data into partitions, and then splitting each partition further along each branch. A classification tree is inexpensive to construct, extremely fast at classifying unknown records, and very easy to interpret. To prevent overfitting, we usually limit the depth of the tree and prune it.
In this case we are trying to predict whether a booking will be canceled from 19 features. First I randomly split the data set into a training set and a testing set in a 7:3 ratio.
set.seed(123)
index <- sample(nrow(hotel),0.7*nrow(hotel))
hotel_train <- hotel[index,]
hotel_test <- hotel[-index,]
Then I use rpart to fit the model and fancyRpartPlot from rattle to plot it.
fit <- rpart(is_canceled ~., data = hotel_train, method = 'class')
fancyRpartPlot(fit,tweak = 3)
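The fit above uses the `rpart` defaults. The depth limit and pruning mentioned earlier can be sketched as follows, using the `kyphosis` data that ships with `rpart` as a stand-in for the booking data. The `maxdepth` and `cp` values are arbitrary choices for illustration, not tuned settings from this analysis.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the booking data here
set.seed(123)
fit_small <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  control = rpart.control(maxdepth = 3, cp = 0.001)  # cap the depth, allow fine splits
)

# Pick the complexity parameter with the lowest cross-validated error
cp_table <- fit_small$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the tree back to that complexity
fit_pruned <- prune(fit_small, cp = best_cp)

# Number of nodes before and after pruning
print(nrow(fit_small$frame))
print(nrow(fit_pruned$frame))
```

The same pattern would apply to the booking model by swapping in `is_canceled ~ .` and `hotel_train`; the pruned tree never has more nodes than the tree it was pruned from.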
I test the model on the testing data set and finally compute the in-sample and out-of-sample accuracy.
in_sample_pred <- predict(fit,hotel_train,type = "class")
in_table <- table(in_sample_pred,hotel_train$is_canceled)
in_sample_cost <- (in_table[1,2]+in_table[2,1])/(in_table[2,1]+in_table[2,2]+in_table[1,1]+in_table[1,2])
in_sample_accuracy <- 1- in_sample_cost
out_pred <- predict(fit,hotel_test,type = "class")
out_table <- table(out_pred,hotel_test$is_canceled)
out_cost <- (out_table[1,2]+out_table[2,1])/(out_table[2,1]+out_table[2,2]+out_table[1,1]+out_table[1,2])
out_accuracy = 1- out_cost
model_result = data.frame(round(cbind(c(in_sample_accuracy,out_accuracy),c(in_sample_cost,out_cost)),3))
colnames(model_result) <- c("accuracy","cost")
rownames(model_result) <- c("in sample","out-of-sample")
kable(model_result)
| | accuracy | cost |
|---|---|---|
| in sample | 0.801 | 0.199 |
| out-of-sample | 0.796 | 0.204 |
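The accuracy calculation above adds up the four cells of the confusion table by hand. A compact equivalent, together with precision and recall for the canceled class, is sketched below on a made-up confusion table (`conf` is illustrative; precision and recall are not reported in the original analysis).

```r
# Made-up confusion table: rows are predictions, columns are actual classes
conf <- matrix(c(50, 10,
                  5, 35),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy  <- sum(diag(conf)) / sum(conf)        # share of correct predictions
cost      <- 1 - accuracy                       # misclassification rate
precision <- conf["1", "1"] / sum(conf["1", ])  # predicted cancellations that were right
recall    <- conf["1", "1"] / sum(conf[, "1"])  # actual cancellations that were caught

print(round(c(accuracy = accuracy, cost = cost,
              precision = precision, recall = recall), 3))
```

Because cancellations are the minority class, recall for the canceled class can be noticeably lower than overall accuracy, so it may be worth checking alongside it.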
This project revealed some trends in hotel booking cancellation and produced a model with acceptable performance.
Bookings with a nonrefundable deposit are more likely to be canceled.
Bookings from outside of the country are less likely to be canceled.
Bookings with a high average daily rate are more likely to be canceled.
Bookings that are changed frequently are more likely to be canceled.
Bookings involving kids or babies are more likely to be canceled.
This analysis can help hotel managers predict customer needs, arrange room assignments, set booking policies and so on.
Unsupervised learning models may also be worth exploring.