Hotel Cancelation Prediction

Introduction

Three years ago my family visited Yellowstone National Park and stayed in a very expensive lodge, even though I had started trying to book a less expensive one two months earlier. I checked the booking page every hour, hoping someone would cancel a booking, and still ended up with the very expensive one. That made me wonder whether there are any trends associated with hotel booking cancellation. Admittedly, booking cancellation prediction is of more practical use to hotel managers, who need it to organize hospitality and optimize revenue.

The hotel booking data contains comprehensive information to predict hotel booking cancellations and more.

I will go through every variable, conduct univariate analyses on most of them, use univariate and bivariate graphs to explore connections between variables, and apply logistic regression and classification tree approaches to predict the cancellation probability of a given booking.

This analysis can benefit hotel managers in “making the right room available for the right guest and the right price at the right time via the right distribution channel” (Mehrotra & Ruttley, 2006).

Packages required

These packages are required to load and manipulate the data

library(data.table) # load data
library(tidyverse)  # tidy data

## -- Attaching packages --------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()

library(dplyr) # manipulate data
library(feasts)

## Loading required package: fabletools

library(knitr) # knit 
library(stringr) # manipulate strings
## modeling package
library(rpart)   # fit tree models
library(rpart.plot)  # draw result of tree models
library(rattle) # fancy tree plot

## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

The last three packages (rpart, rpart.plot, and rattle) are required to build the model

Data preparation

The original datasets come from an open hotel booking demand dataset from Antonio, Almeida and Nunes, 2019.

Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking.

I load the data “hotel” and split it into “h1” for resort hotel and “h2” for city hotel and create a new dataframe combining “h1” and “h2”.

# resort hotel
h1 <- fread("hotels.csv")%>% 
  janitor::clean_names() %>% 
  filter(hotel== "Resort Hotel")

# city hotel
h2 <- fread("hotels.csv")%>% 
  janitor::clean_names() %>% 
  filter(hotel== "City Hotel")
#table(h1$hotel)
hotel_df <- bind_rows(h1, h2)

I checked the data structure: after removing the 4 observations with missing values in “children”, it has 119,386 observations of 32 variables.

hotel_df <- na.omit(hotel_df)
glimpse(hotel_df)

## Rows: 119,386
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ arrival_date_month             <chr> "July", "July", "July", "July", "Jul...
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, ...
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corpo...
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Depo...
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240"...
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type                  <chr> "Transient", "Transient", "Transient...
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out...
## $ reservation_status_date        <chr> "2015-07-01", "2015-07-01", "2015-07...
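Before rows are dropped with na.omit, it can help to see which columns actually contain missing values. The following is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are assumptions for illustration, not rows from the hotel data):

```r
# Hypothetical toy data standing in for hotel_df; only `children` has missing values
toy <- data.frame(
  adults   = c(2, 1, 2, 2),
  children = c(0, NA, 1, NA),
  babies   = c(0, 0, 0, 0)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop every row that contains at least one NA, as na.omit() does
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Running the same colSums(is.na(...)) call on hotel_df should point to “children” as the only column with missing values, if the statement above holds.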

I removed some variables that are redundant or not useful for predicting booking cancellation.

“adults” is the number of adults, which is not very relevant in this case.
“agent” was removed because there were too many distinct agents and the choice of agent had no significant impact on cancellation.
“arrival_date_day_of_month” was removed because this information is already included in other variables.
“market_segment” was removed because the similar variable “distribution_channel” is more relevant.
“reservation_status” and “reservation_status_date” are not relevant and were removed.
“arrival_date_year” and “arrival_date_week_number” were also considered not relevant, since related information is kept in other variables such as “stays_in_weekend_nights” and “arrival_date_month”.

drop.col <- c("adults","agent","arrival_date_day_of_month","market_segment",
              "reservation_status","reservation_status_date",
              "arrival_date_year","arrival_date_week_number")
hotel <- hotel_df %>% select(-one_of(drop.col))
glimpse(hotel)

## Rows: 119,386
## Columns: 24
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Res...
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 7...
## $ arrival_date_month             <chr> "July", "July", "July", "July", "Jul...
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, ...
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", ...
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "...
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpo...
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "...
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "...
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Depo...
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NUL...
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type                  <chr> "Transient", "Transient", "Transient...
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98....
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, ...

People sometimes cancel a booking because the assigned room type is not the one they reserved. Here I combined the two columns into one categorical variable, “wanted_type”, with 2 categories: “0” means the types are the same and “1” means they are different.
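The code that builds “wanted_type” is not shown in this report. The following is a minimal sketch of one way to do it, assuming the coding described above (0 = same room type, 1 = different) and using hypothetical toy rows rather than the real data:

```r
# Hypothetical toy rows; the room-type letters are illustrative only
toy <- data.frame(
  reserved_room_type = c("C", "A", "A", "D"),
  assigned_room_type = c("C", "C", "A", "D"),
  stringsAsFactors = FALSE
)

# Assumed coding, following the text: 0 = same room type, 1 = different
toy$wanted_type <- ifelse(toy$reserved_room_type == toy$assigned_room_type, 0, 1)

# Drop the two source columns once the combined flag exists
toy$reserved_room_type <- NULL
toy$assigned_room_type <- NULL
print(toy)
```

Only the second toy row has a mismatch between reserved and assigned type, so it is the only one coded 1.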

Hotel booking is seasonal, since some months of the year, such as the summer vacation months, are more popular. I code the popular months July and August as “2”, November, December, and January as “0”, and all other months as “1”.

hotel <- hotel %>% 
  mutate(arrival_date_month = ifelse(arrival_date_month %in% c("July","August"),2,ifelse(arrival_date_month %in% c("December","January","November"),0,1)))
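To check which code each month receives, the same nested ifelse can be applied to all twelve month names. This is a small sketch using R's built-in month.name constant; the names `season` and `season_by_month` are mine and do not appear in the analysis:

```r
# month.name is a built-in R constant holding the twelve English month names
season <- ifelse(month.name %in% c("July", "August"), 2,
          ifelse(month.name %in% c("December", "January", "November"), 0, 1))

# Name each code by its month so the mapping can be read at a glance
season_by_month <- setNames(season, month.name)
print(season_by_month)
```

The printed vector shows 2 for July and August, 0 for November, December, and January, and 1 for the remaining seven months.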

In the categorical variable “company”, “NULL” means that the booking did not come from a company. Here I recode “NULL” as “individual” and every other value as “company”.

hotel <- hotel %>% 
  mutate(company = ifelse(company == "NULL","individual","company"))

Domestic and international tourists may decide differently when canceling a booking, so I condense the “country” variable into two categories: “domestic” (country code “PRT”) and “international”.

hotel <- hotel %>% 
  mutate(country = ifelse(country == "PRT","domestic","international"))

Tourists with children, and especially with babies, may be more likely to cancel a booking because a child gets sick. Here I combine “children” and “babies” into one variable, “kids”, with three categories: “0” for bookings with no children or babies, “1” for bookings with children but no babies, and “2” for bookings with babies.

hotel <- hotel %>% 
  mutate(kids = ifelse(children == 0 & babies==0,0,ifelse(babies != 0,2,1))) %>% 
  mutate(children = NULL,babies = NULL)

Guests can be classified into three categories based on their cancellation record: new guest (0), loyal guest (1), and non-loyal guest (2).

hotel <- hotel %>%
  mutate(loyalty = ifelse(is_repeated_guest == 0,0, ifelse(previous_bookings_not_canceled != 0, 1, 2))) %>% 
  mutate(is_repeated_guest = NULL, previous_bookings_not_canceled = NULL, previous_cancellations = NULL)

I convert most of the categorical variables to factors using the forcats as_factor function, which replaces the previous character version of each variable.

hotel <- hotel %>% 
  mutate(hotel = as_factor(hotel),
         distribution_channel = as_factor(distribution_channel),
         is_canceled = as_factor(is_canceled),
         arrival_date_month = as_factor(arrival_date_month),
         meal = as_factor(meal),
         country  = as_factor(country),
         deposit_type = as_factor(deposit_type),
         company = as_factor(company),
         customer_type = as_factor(customer_type),
         kids = as_factor(kids),
         wanted_type = as_factor(wanted_type),
         loyalty = as_factor(loyalty)) %>% 
  glimpse()

## Rows: 119,386
## Columns: 20
## $ hotel                       <fct> Resort Hotel, Resort Hotel, Resort Hote...
## $ is_canceled                 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
## $ lead_time                   <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ...
## $ arrival_date_month          <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ stays_in_weekend_nights     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stays_in_week_nights        <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ...
## $ meal                        <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,...
## $ country                     <fct> domestic, domestic, international, inte...
## $ distribution_channel        <fct> Direct, Direct, Direct, Corporate, TA/T...
## $ booking_changes             <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deposit_type                <fct> No Deposit, No Deposit, No Deposit, No ...
## $ company                     <fct> individual, individual, individual, ind...
## $ days_in_waiting_list        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ customer_type               <fct> Transient, Transient, Transient, Transi...
## $ adr                         <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,...
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ total_of_special_requests   <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ...
## $ wanted_type                 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ kids                        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ loyalty                     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Check outliers

As shown below, there is one extreme outlier in “adr” that is much higher than the rest, so I removed it.

#check outliers in numeric variables
hotel_num <- hotel %>% select_if(is.numeric)
ggplot(gather(hotel_num,key,value=cancelation),
              aes(x=cancelation)) + 
    geom_boxplot () +
      facet_wrap(~key,scales = "free_x") +
  ggtitle(" Boxplot of numeric variables")

hotel<- hotel %>% filter(adr<2000)
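The effect of a cutoff like this can be checked by looking at the largest values before and after filtering. Below is a minimal sketch on hypothetical adr values (the numbers are made up for illustration; only the 2000 cutoff comes from the code above):

```r
# Hypothetical adr values; the last one plays the role of the extreme outlier
adr <- c(0, 75, 98, 126, 510, 5400)

# Inspect the largest values before deciding on a cutoff
print(sort(adr, decreasing = TRUE)[1:3])

# Keep only values below the cutoff used above (adr < 2000)
adr_clean <- adr[adr < 2000]
print(max(adr_clean))
```

A fixed cutoff is simple and transparent, but it has to be chosen by eye from the boxplot.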

Below is the preview of cleaned data

hotel %>% head(6) %>% data.table()

##           hotel is_canceled lead_time arrival_date_month
## 1: Resort Hotel           0       342                  2
## 2: Resort Hotel           0       737                  2
## 3: Resort Hotel           0         7                  2
## 4: Resort Hotel           0        13                  2
## 5: Resort Hotel           0        14                  2
## 6: Resort Hotel           0        14                  2
##    stays_in_weekend_nights stays_in_week_nights meal       country
## 1:                       0                    0   BB      domestic
## 2:                       0                    0   BB      domestic
## 3:                       0                    1   BB international
## 4:                       0                    1   BB international
## 5:                       0                    2   BB international
## 6:                       0                    2   BB international
##    distribution_channel booking_changes deposit_type    company
## 1:               Direct               3   No Deposit individual
## 2:               Direct               4   No Deposit individual
## 3:               Direct               0   No Deposit individual
## 4:            Corporate               0   No Deposit individual
## 5:                TA/TO               0   No Deposit individual
## 6:                TA/TO               0   No Deposit individual
##    days_in_waiting_list customer_type adr required_car_parking_spaces
## 1:                    0     Transient   0                           0
## 2:                    0     Transient   0                           0
## 3:                    0     Transient  75                           0
## 4:                    0     Transient  75                           0
## 5:                    0     Transient  98                           0
## 6:                    0     Transient  98                           0
##    total_of_special_requests wanted_type kids loyalty
## 1:                         0           1    0       0
## 2:                         0           1    0       0
## 3:                         0           1    0       0
## 4:                         0           1    0       0
## 5:                         1           1    0       0
## 6:                         1           1    0       0

Below is a table of variable names, data type and description of variables

hotel.type <- lapply(hotel, class)
hotel.var_desc <- c('Hotel type',
               'If the booking was canceled (1) or not (0)',
               'Number of days between booking and arrival',
               'If the arrival date is in a popular month',
               'Number of weekend nights booked',
               'Number of week nights booked',
               'Type of meal booked',
               'Country of origin',
               'Booking distribution channel',
               'Number of changes made to the booking',
               'If a deposit was made to guarantee the booking',
               'If the booking was made by a company or an individual',
               'Number of days the booking was on the waiting list',
               'Type of customer',
               'Average daily rate',
               'Number of car parking spaces required',
               'Number of special requests made',
               'If wanted type was matched (1) or not (0)',
               'If the booking includes kids or babies',
               'If guest is new (0) or loyal (1) or less loyal (2)'
               )
hotel.var_names <- colnames(hotel)
data.description <- cbind(hotel.var_names, hotel.type, hotel.var_desc)
colnames(data.description) <- c('Variable Name', 'Data Type', 'Variable Description')
#data.description
kable(data.description,row.names = FALSE)

Variable Name	Data Type	Variable Description
hotel	factor	Hotel type
is_canceled	factor	If the booking was canceled (1) or not (0)
lead_time	integer	Number of days between booking and arrival
arrival_date_month	factor	If the arrival date is in a popular month
stays_in_weekend_nights	integer	Number of weekend nights booked
stays_in_week_nights	integer	Number of week nights booked
meal	factor	Type of meal booked
country	factor	Country of origin
distribution_channel	factor	Booking distribution channel
booking_changes	integer	Number of changes made to the booking
deposit_type	factor	If a deposit was made to guarantee the booking
company	factor	If the booking was made by a company or an individual
days_in_waiting_list	integer	Number of days the booking was on the waiting list
customer_type	factor	Type of customer
adr	numeric	Average daily rate
required_car_parking_spaces	integer	Number of car parking spaces required
total_of_special_requests	integer	Number of special requests made
wanted_type	factor	If wanted type was matched (1) or not (0)
kids	factor	If the booking includes kids or babies
loyalty	factor	If guest is new (0) or loyal (1) or less loyal (2)

Here is a summary of the cleaned dataset.

summary(hotel)

##           hotel       is_canceled   lead_time   arrival_date_month
##  Resort Hotel:40060   0:75166     Min.   :  0   0:19503           
##  City Hotel  :79325   1:44219     1st Qu.: 18   1:73348           
##                                   Median : 69   2:26534           
##                                   Mean   :104                     
##                                   3rd Qu.:160                     
##                                   Max.   :737                     
##  stays_in_weekend_nights stays_in_week_nights        meal      
##  Min.   : 0.0000         Min.   : 0.0         BB       :92305  
##  1st Qu.: 0.0000         1st Qu.: 1.0         FB       :  798  
##  Median : 1.0000         Median : 2.0         HB       :14463  
##  Mean   : 0.9276         Mean   : 2.5         SC       :10650  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         Undefined: 1169  
##  Max.   :19.0000         Max.   :50.0                          
##           country      distribution_channel booking_changes  
##  domestic     :48585   Direct   :14645      Min.   : 0.0000  
##  international:70800   Corporate: 6677      1st Qu.: 0.0000  
##                        TA/TO    :97869      Median : 0.0000  
##                        Undefined:    1      Mean   : 0.2211  
##                        GDS      :  193      3rd Qu.: 0.0000  
##                                             Max.   :21.0000  
##      deposit_type          company       days_in_waiting_list
##  No Deposit:104637   individual:112588   Min.   :  0.000     
##  Refundable:   162   company   :  6797   1st Qu.:  0.000     
##  Non Refund: 14586                       Median :  0.000     
##                                          Mean   :  2.321     
##                                          3rd Qu.:  0.000     
##                                          Max.   :391.000     
##          customer_type        adr         required_car_parking_spaces
##  Transient      :89612   Min.   : -6.38   Min.   :0.00000            
##  Contract       : 4076   1st Qu.: 69.29   1st Qu.:0.00000            
##  Transient-Party:25120   Median : 94.59   Median :0.00000            
##  Group          :  577   Mean   :101.79   Mean   :0.06252            
##                          3rd Qu.:126.00   3rd Qu.:0.00000            
##                          Max.   :510.00   Max.   :8.00000            
##  total_of_special_requests wanted_type kids       loyalty   
##  Min.   :0.0000            0:   642    0:110053   0:115575  
##  1st Qu.:0.0000            1:118743    1:  8415   1:  2838  
##  Median :0.0000                        2:   917   2:   972  
##  Mean   :0.5713                                             
##  3rd Qu.:1.0000                                             
##  Max.   :5.0000

Exploratory data analysis

Cancelation Percentage

The data set comes from two hotels: one is a city hotel and the other is a resort hotel. First I checked the overall cancellation percentage and the cancellation percentage for each hotel. The cleaned data contains 40,060 resort hotel bookings and 79,325 city hotel bookings, and the breakdown by hotel and cancellation status is shown below. Canceled city hotel bookings make up a larger share of all bookings (27.7%) than canceled resort hotel bookings (9.3%).

#hot <- hotel_df
hot <- hotel %>% 
  group_by(hotel)%>% 
  count(is_canceled) %>% 
  unite("hot_status",1:2) 
pie(hot$n,labels =  as.character(hot$hot_status))

hot %>% mutate(percent = n/sum(n)) %>% 
  select(-n)

## # A tibble: 4 x 2
##   hot_status     percent
##   <chr>            <dbl>
## 1 Resort Hotel_0  0.242 
## 2 Resort Hotel_1  0.0932
## 3 City Hotel_0    0.387 
## 4 City Hotel_1    0.277
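The shares in this table are fractions of all bookings, so they mix each hotel's cancellation rate with its size. Dividing the canceled share by the hotel's total share gives the within-hotel cancellation rate. Here is a short worked calculation that uses only the four rounded shares printed above (the variable names are mine):

```r
# Shares of all bookings, copied from the table above
share <- c(resort_kept = 0.242, resort_canceled = 0.0932,
           city_kept   = 0.387, city_canceled   = 0.277)

# Within-hotel cancellation rate = canceled share / total share of that hotel
resort_rate <- share[["resort_canceled"]] /
  (share[["resort_kept"]] + share[["resort_canceled"]])
city_rate <- share[["city_canceled"]] /
  (share[["city_kept"]] + share[["city_canceled"]])

print(round(c(resort = resort_rate, city = city_rate), 3))
```

Because the inputs are rounded, the results are approximate: roughly 28% of resort hotel bookings and roughly 42% of city hotel bookings were canceled.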

Cancelation Features

Booking cancelation by hotel types

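The shares in the table above are fractions of all bookings. Relative to each hotel's own bookings, about 42% of city hotel bookings were canceled versus about 28% of resort hotel bookings, so a city hotel booking is roughly 1.5 times as likely to be canceled as a resort hotel booking.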

Booking cancelation by numeric features

Here are histograms of the numeric features, colored by cancellation status.

hotel_num <- hotel_num %>% filter(adr<2000)
hotel_num <- data.frame(is_canceled=hotel$is_canceled,hotel_num)

ggplot(gather(hotel_num,key,value=cancelation,2:9),
              aes(x=cancelation,fill = is_canceled))+
  geom_histogram()+
  facet_wrap(~key,scales = "free") +
  ggtitle(" Histogram of numeric variables")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Scatter plots of numeric variables

From the histograms of the numeric variables, I found patterns linking the likelihood of cancellation to the average daily rate and the number of nights stayed. The two scatter plots below show them. More cancellations happened among bookings with stays of more than one week and less than three weeks. For bookings of fewer than 4 week nights, a booking is more likely to be canceled if it is more expensive.

ggplot(hotel_num,aes(x=stays_in_weekend_nights,y=adr,color=is_canceled))+
  geom_point()+
  ggtitle("Booking cancelation by adr and weekend nights")

ggplot(hotel_num,aes(x=stays_in_week_nights,y=adr,color=is_canceled))+
  geom_point()+
  ggtitle("Booking cancelation by adr and week nights")

Booking cancelation by categorical features

Here are bar charts of the categorical features by cancellation status. They show some patterns:

Bookings with arrival_date_month equal to 0, meaning arrival in November, December, or January, are more likely to be canceled.
Bookings from outside of the country are less likely to be canceled.
Most Non Refund bookings were canceled.
Bookings involving kids, and especially babies, are more likely to be canceled.

## bar charts of factor variables
hotel_hist_fact <- hotel %>% select_if(is.factor)
ggplot(gather(hotel_hist_fact,key,value = cancelation,3:ncol(hotel_hist_fact)),
       aes(x=cancelation,fill=is_canceled))+
         geom_bar()+
         facet_wrap(~key,scales = "free",nrow = 3)+
         ggtitle("Bar charts of categorical variables")

## Warning: attributes are not identical across measure variables;
## they will be dropped

Modeling

Classification tree

A classification tree is a simple supervised machine learning model in which the data is repeatedly split according to the value of a chosen feature. Such a tree is built through a process known as binary recursive partitioning: an iterative process of splitting the data into partitions, and then splitting each branch further. Classification trees are inexpensive to construct, extremely fast at classifying unknown records, and very easy to interpret. To prevent overfitting, we usually limit the depth of the tree and prune it.

In this case we are trying to predict whether a booking will be canceled from 19 features. First I randomly split the data set into a training set and a testing set in a 7:3 ratio.
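Depth limits and pruning can both be expressed with rpart. The sketch below is illustrative only: it uses the small kyphosis example dataset that ships with rpart instead of the hotel data, and the maxdepth and cp values are arbitrary assumptions rather than tuned settings.

```r
library(rpart)

# kyphosis is a small example dataset shipped with rpart, used here only for illustration
set.seed(123)
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(maxdepth = 4, cp = 0.001))

# Each row of the cp table is one candidate subtree with its cross-validated error
print(full_tree$cptable)

# Pick the complexity parameter with the lowest cross-validated error and prune to it
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# The pruned tree never has more nodes than the full tree
print(c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame)))
```

Choosing cp by the lowest cross-validated error is one common convention; other rules, such as the one-standard-error rule, would select a smaller tree.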

set.seed(123)
index <- sample(nrow(hotel),0.7*nrow(hotel))
hotel_train <- hotel[index,]
hotel_test <- hotel[-index,]

Then I use rpart to fit the model and fancyRpartPlot (from rattle) to graph it

fit <- rpart(is_canceled ~., data = hotel_train, method = 'class')
fancyRpartPlot(fit,tweak = 3)

I test the model on the testing data set and, finally, compute the in-sample and out-of-sample accuracy.

in_sample_pred <- predict(fit,hotel_train,type = "class")
in_table <- table(in_sample_pred,hotel_train$is_canceled)
in_sample_cost <- (in_table[1,2]+in_table[2,1])/(in_table[2,1]+in_table[2,2]+in_table[1,1]+in_table[1,2])
in_sample_accuracy <- 1- in_sample_cost
out_pred <- predict(fit,hotel_test,type = "class")
out_table <- table(out_pred,hotel_test$is_canceled)
out_cost <- (out_table[1,2]+out_table[2,1])/(out_table[2,1]+out_table[2,2]+out_table[1,1]+out_table[1,2])
out_accuracy = 1- out_cost
model_result = data.frame(round(cbind(c(in_sample_accuracy,out_accuracy),c(in_sample_cost,out_cost)),3))
colnames(model_result) <- c("accuracy","cost")
rownames(model_result) <- c("in sample","out-of-sample")
kable(model_result)

	accuracy	cost
in sample	0.801	0.199
out-of-sample	0.796	0.204
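Accuracy alone does not show how the errors split between missed cancellations and false alarms. Precision and recall can be read from the same kind of confusion table, as in this minimal sketch on ten hypothetical labels (the vectors are invented for illustration and are not model output):

```r
# Hypothetical labels for ten bookings (1 = canceled, 0 = not canceled)
actual    <- factor(c(0, 0, 0, 1, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are true labels, as in table(pred, actual)
confusion <- table(predicted, actual)

# Share of all predictions that are correct
accuracy <- sum(diag(confusion)) / sum(confusion)
# Of the bookings predicted as canceled, the share that really were canceled
precision <- confusion["1", "1"] / sum(confusion["1", ])
# Of the bookings that really were canceled, the share the model caught
recall <- confusion["1", "1"] / sum(confusion[, "1"])

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

On these toy labels the accuracy is 0.8 and both precision and recall are 0.75. The same three lines could be applied to out_table to get the corresponding figures for the fitted tree.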

Summary

This project revealed some trends in hotel booking cancellation and produced a model with acceptable performance (out-of-sample accuracy of 0.796).

Bookings with a nonrefundable deposit are more likely to be canceled.
Bookings from outside of the country are less likely to be canceled.
Bookings with a high average daily rate are more likely to be canceled.
Bookings that are changed frequently are more likely to be canceled.
Bookings involving kids or babies are more likely to be canceled.

This analysis can help hotel managers predict customer needs, arrange room assignments, set booking policies, and so on.

Unsupervised learning models may also be useful here.