Hotel Bookings Data

Introduction & packages

This hotel dataset help readers know the information about hotel booking. how long do customers stay? what is the most pattern of customers? how often do customers cancel their booking? when is the busiest week of the year? which factors affect the customers make the decision of cancellation.

The dataset information comes from Antonio, Almeida and Nunes, 2019 I would build models(logistic regression, classification tree and Neural Networks) using the data, Split the data on variables by groups. with statistical models find the relationship between variables to forecast customers’ future behavior help the hotels to make preparation for customers. look into the models’ accuracy and compare with these 3 different models.

Packages Required

library(readr)          # load data
library(knitr)          # inset tables and 
library(dplyr)          # filter/ group by function
library(ggplot2)        # plot function
library(tidyverse)      # pipe function
library(ggpubr)         # combined ggplots
library(rpart)          # build classification tree
library(rpart.plot)     # plot classification tree
library(nnet)           # Neural Networks Models
library(NeuralNetTools) # plot neoral nework models
library(ROCR)           # prediction

Data Preparation

Load the data into R, the data is from Antonio, Almeida and Nunes, 2019. the dataset contains 119390 observations and 32 variables. Look into this hotel dataset, it contains information about clients’ stay at hotels. specifically, it contains infromation including booking information for a Resort Hotel or City Hotel, the booking was cancelled or not, when the booking was made, the length of the stay, number of the adults, children, and babies. and many other information.

hotel <- readr::read_csv("C:/Users/yizha/Desktop/Spring 2020/Data wrangling/hotels.csv") #load the dataset
# build the table of data dictionary
my_tbl <- tibble::tribble(
     ~Variable, ~VariableName,      ~Class,    ~Description,
 "1", "hotel", "character",  "Resort Hotel or City Hotel",
 "2" ,"is_canceled", "numeric",  "booking was canceled (1) or not (0)",
 "3", "lend_time",  "numeric",  "number of days between booking and arrival date",
"4",  "arrival_date_year", "numeric", "Year of arrival date",
"5",  "arrival_date_month","character","Month of arrival date",
"6",  "arrival_date_week_number","nemeric", "Week number of year for arrival date",
"7",  "arrival_date_day_of_month",  "numeric","Day of arrival date",
"8", "stays_in_weekend_nights","numeric","Number of nights(weekend) the guest stayed",
"9", "stays_in_week_nights","numeric","Number of nights(weekday) the guest stayed",
 "10", "adults","numeric","Number of adults",
 "11", "children","numeric","Number of children",
 "12", "babies","numeric","Number of babies",
 "13", "meal","character","Tpye of meal ",
 "14", "country","character","Country of origin",
"15","market_segmet","character","Market segment designation",
"16", "distribution_channel","character","Booking distribution channel",
"17", "is_repeated_guest","numeric","repeated guest:yes(1),no(0)",
"18", "previous_cancellations","numeric","Number of bookings cancelled",
"19", "previous_bookings_not_canceled","numeric","Number of bookings not cancelled",
"20", "reserved_room_type","character","Code of room type reserved",
"21","assigned_room_type","character","Code for the type of room assigned",
"22","booking_changes","numeric","number of changes made to the booking",
"23","deposit_type","character","deposit to guarantee the booking",
"24","agent","character","ID of the agency made the booking",
"25","company","character","ID of the company that made the booking",
"26","days_in_waiting_list","numeric","number of days before comfirmed",
"27","customer_type","character","Type of booking",
"28","adr","numeric","Average Daily Rate",
"29","required_car_parking_spaces","numeric","Number of car parking spaces required",
"30","total_of_special_requests","numeric","Number of special requests by customer",
"31","reservation_status","character","Reservation last status",
"32","reservation_status_date","numeric","Date at which the last status was set",) 
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Variable VariableName Class Description
1 hotel character Resort Hotel or City Hotel
2 is_canceled numeric booking was canceled (1) or not (0)
3 lend_time numeric number of days between booking and arrival date
4 arrival_date_year numeric Year of arrival date
5 arrival_date_month character Month of arrival date
6 arrival_date_week_number nemeric Week number of year for arrival date
7 arrival_date_day_of_month numeric Day of arrival date
8 stays_in_weekend_nights numeric Number of nights(weekend) the guest stayed
9 stays_in_week_nights numeric Number of nights(weekday) the guest stayed
10 adults numeric Number of adults
11 children numeric Number of children
12 babies numeric Number of babies
13 meal character Tpye of meal
14 country character Country of origin
15 market_segmet character Market segment designation
16 distribution_channel character Booking distribution channel
17 is_repeated_guest numeric repeated guest:yes(1),no(0)
18 previous_cancellations numeric Number of bookings cancelled
19 previous_bookings_not_canceled numeric Number of bookings not cancelled
20 reserved_room_type character Code of room type reserved
21 assigned_room_type character Code for the type of room assigned
22 booking_changes numeric number of changes made to the booking
23 deposit_type character deposit to guarantee the booking
24 agent character ID of the agency made the booking
25 company character ID of the company that made the booking
26 days_in_waiting_list numeric number of days before comfirmed
27 customer_type character Type of booking
28 adr numeric Average Daily Rate
29 required_car_parking_spaces numeric Number of car parking spaces required
30 total_of_special_requests numeric Number of special requests by customer
31 reservation_status character Reservation last status
32 reservation_status_date numeric Date at which the last status was set

Adjustment with the original data

  1. Table of variables that be converted:
# table of converted variables
my_tbl2 <- tibble::tribble(
  ~Variable,~Name, ~From, ~To, ~Levels,
"1", "hotel", "chr",  "factor",  "2",
"2", "is canceled",  "chr", "factor","2",
"5", " arrival_date_month", "chr", "factor", "12",
"13", " meal","chr","factor","5",
"14", " country","chr","factor","178",
"15", " market_segment","chr","factor","7",
"16", " distribution_channel","chr","factor","5",
"17", " is_repeated_guest","chr","factor","2",
"20", " reserved_room_type","chr","factor","10",
"21", " assigned_room_type","chr","factor","12",
"23", " deposit_type","chr","factor","3",
"24", " agent","chr","factor","334",
"27", " customer_type","chr","factor","4",
"31", " reservation_status","chr","factor","3",
"32", " reservation_status_date","chr","date",NA)
require(knitr)
kable(my_tbl2, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Variable Name From To Levels
1 hotel chr factor 2
2 is canceled chr factor 2
5 arrival_date_month chr factor 12
13 meal chr factor 5
14 country chr factor 178
15 market_segment chr factor 7
16 distribution_channel chr factor 5
17 is_repeated_guest chr factor 2
20 reserved_room_type chr factor 10
21 assigned_room_type chr factor 12
23 deposit_type chr factor 3
24 agent chr factor 334
27 customer_type chr factor 4
31 reservation_status chr factor 3
32 reservation_status_date chr date NA
# convert some variables
hotel <- hotel %>% 
   mutate(hotel = as.factor(hotel),
          is_canceled = as.factor(is_canceled),
          arrival_date_month = as.factor(arrival_date_month),
          meal = as.factor(meal),
          country = as.factor(country),
          market_segment = as.factor(market_segment),
          distribution_channel = as.factor(distribution_channel),
          is_repeated_guest = as.factor(is_repeated_guest),
          reserved_room_type = as.factor(reserved_room_type),
          assigned_room_type = as.factor(assigned_room_type),
          agent = as.factor(agent),
          deposit_type = as.factor(deposit_type),
          customer_type = as.factor(customer_type),
          reservation_status = as.factor(reservation_status),
          reservation_status_date = as.Date(reservation_status_date,"%m/%d/%Y"))
# define the levels
hotel$arrival_date_month <- factor(hotel$arrival_date_month,levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
levels(hotel$arrival_date_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  1. Check the missing value in the hotel dataset:
  • the children variable has 4 missing value. for the entire dataset has 119390, the missing value is only a tiny portion in hotel. I impute the missing value in children variable with median.

  • the company variable has missing value “NULL”. the percentage of the NULL in this variable is 94.31%, we can simply removed the entire column from the dataset.

# check the missing value
hotel %>%
  map_dbl(~ sum(is.na(.)))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
# impute the missing value in children variable with median
hotel <- hotel %>%
  mutate(children
         = replace(children,
                   is.na(children),
                   median(children, na.rm = TRUE)))
# check the missing value in 'company' variable
hotel %>% 
   summarise(missingValue 
             = sum(company == "NULL"),Percent = sum(company == "NULL")/length(company)*100)
## # A tibble: 1 x 2
##   missingValue Percent
##          <int>   <dbl>
## 1       112593    94.3
hotel <- hotel %>% select(-company)
  1. The dataset has weekend nights(variable 8) & week nights(variable 9) seperate, combine this two variable into one vector-length_stay (length of time guests stay). for length_stay surely larger than 0 when the booking was not canceled, there are 680 missing value (length_stay = 0) in the variable of length_stay.the mean of the length_stay is 3.449, impute the missing value with the median. after the adjustment, the cleaned column of length_stay will have 35 rows of zero.
# combine column 8&9 into new column length_stay
hotel <- hotel %>% 
   mutate(length_stay=stays_in_weekend_nights + stays_in_week_nights)
# impute the missing value with median
temp <- ifelse(hotel$is_canceled == 0 & hotel$length_stay==0, median(hotel$length_stay), 0) 
hotel$length_stay <- hotel$length_stay +  temp 
# if the booking was canceled, 35 rows of length_stay will be zero
sum(hotel$length_stay==0) 
## [1] 35
hotel <- hotel %>% select(c(-stays_in_weekend_nights,stays_in_week_nights))
  1. Combine the variable 11 and 12 into a new logical vector (kids) to see the relationship between reservation cancel and have kids or not.
# create the new column kids, separate data into with-kid(1) and with-out kid(0)
hotel <- hotel %>% 
   mutate(kids =  children + babies) %>% 
   mutate(kids = ifelse(kids > 0,1,0)) %>% 
   mutate(kids = as.factor(kids))

hotel <- hotel %>% select(c(-children,-babies))
  1. Variable ‘adr’ is the average of daily rate. combine “adr” and “length_stay” to get total cost per booking.
hotel %>% 
  ggplot(aes(adr))+
  geom_histogram(binwidth = 10)+
  coord_cartesian(xlim = c(0,400))

hotel <- hotel %>%
  mutate(adr = replace(adr, adr <= 0, median(adr, na.rm = TRUE))) %>%  # impute the missing value with de median
  mutate(total_cost = adr * length_stay) # create new column named total_cost = adr * length_stay

summary(hotel$total_cost)  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     150     270     362     448    7590

To sum up, the cleaned hotel dataset 119390 observations and 31 variables. the response variable ‘is_canceled’ is a categorical factor.

Exploratoty Data Analysis (EDA)

The clean dataset of hotel contains 119390 observations and 32 variables.

  1. Figure-1 & Figure-2, in the dataset the percentage of City Hotel is 66.45 %, we can see the booking cancelling rate of City Hotel is 74.85%. we can tell the cancelling is high for City Hotel.
# plot the pie of hotels
value <- hotel %>% 
   summarise(sum(hotel=="City Hotel")/length(hotel),sum(hotel=="Resort Hotel")/length(hotel))

df <- data.frame(value= c(66.45,33.55), group = c("City Hotel", "Resort Hotel"))
df %>% 
    ggplot(aes(x="", y=value, fill=group)) +
    geom_bar(stat="identity", width=1) +
    coord_polar("y", start=0) +
    geom_text(aes(label = paste0(value, "%")), position = position_stack(vjust = 0.5)) +
    labs(x = NULL, y = NULL, fill = NULL, title = "Figure-1: Pie Chart of the hotel")

# which hotel canceled more
hotel %>% 
   ggplot(mapping = aes(x = is_canceled, fill = hotel)) +
   geom_bar() + 
   ggtitle("Figure-2: cancelling rate") 

  1. Figure-3, the correlation between how long the reservation was made and the which month it was made. we can tell the longest lead time about 140days that reservations were made during the summer,but short lead time during the winter.
# boxplot of the lead time
hotel %>% 
   ggplot(aes(arrival_date_month,lead_time))+
   geom_boxplot(fill="steelblue",color="black")+
   ggtitle("Figure-3.1: arrival month vs lead time")

hotel %>% 
   ggplot(aes(lead_time, is_canceled))+
   geom_point(fill="steelblue",color="black")+
   ggtitle("Figure-3.2: cancellation vs lead time")

  1. Figure-4, From the plot, we can tell that the cancellation happened soon after the booking. the longer the lead time the chances of cancellation are higher after the booking. there is no difference between City hotel and Resort Hotel in the relation of lead time and cancell
hotel %>% 
   ggplot(aes(hotel,lead_time, fill = factor(is_canceled)))+
   geom_boxplot(position = "dodge")+
   labs(title = "Figure-4: Cancellation By Hotel Type")

  1. Figure-5, we can tell that guests with kids or not has not influence on the cancelling rate.
# with-kid(1) and with-out kid(0)
hotel %>% 
   ggplot(aes(is_canceled,fill = kids))+
   geom_bar(position = "fill")+
   ggtitle("Figure-5: kids VS canceled")

  1. Figure-6, the repeated guests are not tend to cancel the reservation. the repeated guest is 3.19% of total guests. the repeated guest from the canceled group is 2.53% of total guests. the number is lower compare to entire population. we can say the repeated guests tend to fullfil the booking.
hotel %>% 
   ggplot(aes(is_canceled, fill = is_repeated_guest)) +
   geom_bar() + 
   ggtitle("Figure-6: repeated guests VS canceled")

  1. Figure-7, the more times the previous cancellations customers the more likely they would cancel for this time.
# a repeated guest (1) or not (0)
hotel %>% 
   ggplot(aes(is_canceled, previous_cancellations)) +
   geom_count(color = "darkred") +
   scale_size(range=c(2, 15))+
   ggtitle("Figure-7: Previous cancellations")

  1. Figure-8, the number of the reservations made per month, over the three year. we can see the month of August receive the largest number of booking. consistent with Figure-3, summer months need the longest lead time.
year <- as.factor(hotel$arrival_date_year)
hotel %>% 
   ggplot(aes(arrival_date_month, fill = year)) +
   geom_bar() + 
   ggtitle("Figure-8: booking of the month")

  1. Figure-9, Customer perfer Resort Hotel for short term stay. City Hotel becomes more pupular longer the customers stay. the Resort Hotel has long right tail, guest spend more at Resort Hotel. the mean of total cost mainly between 250-350 for City Hotel, 350-450 for Resort Hotel.
# create a dataset that customer has checked in but already departed
hotel_sub <- hotel %>% 
   filter(reservation_status=="Check-Out")

hotel_sub %>% 
ggplot(aes(x = total_cost, fill = hotel, color = hotel)) + 
  geom_histogram(aes(y = ..density..), position = "dodge", binwidth = 50) +
  geom_density(alpha = 0.2) + 
  coord_cartesian(xlim = c(0,2500))+
  ggtitle("Figure-9: Total Cost")

  1. Figure-10, Contract customers prefer Resort Hotel to City Hotel. Group type doesn’t have preference, but transient and transient-party prefer to City Hotel.
hotel_sub %>% 
ggplot(aes(customer_type, fill = hotel)) + 
  geom_bar(position = "dodge") + 
  ggtitle("Figure-10 Customer type of hotels")

10. Figure-11, The customers short term trip choose City Hotel more than Resort Hotel,

hotel_sub %>% 
     ggplot(aes(length_stay)) + 
     geom_density(col = "red")+
     facet_wrap(~ hotel)+
     ggtitle("Figure-11 Length of stay VS type of hotel")

  1. Figure-12, we can tell that the booking from the Transient is the most important part for both the hotels. contract gradually from Jan to Oct but decrease sharply on Nov and Dec. peak booking from group appear at Oct, while peak appear at Aug for Transient. Transient-party seems follow the seasonal pattern from spring to winter.
hotel %>% 
ggplot(aes(arrival_date_month)) +
        geom_bar(fill="steelblue",color="black") +
        facet_wrap(~customer_type, scales = "free_y")+
        ggtitle("Figure-12 monthly booking on customer type")

Statistical modeling

Randomly split the data to training(70%) and testing(30%) datasets:

set.seed(7025)
index <- sample(nrow(hotel),nrow(hotel)*0.7)
hotel.train <- hotel[index,]
hotel.test <-  hotel[-index,]
  1. fit a logistic regression. we pick variable hotel, is_canceled,previous_cancellations, customer_type,total_cost, total_of_special_requests, lead time, deposit type. the AIC of the logistic regession is 81352 which is high. the probability of the canceled the booking is 37.18%. the accuracy of the model,the AUC under the roc curve is 0.7938.
# create a subset
hotel.train.sub <- subset(hotel.train,select 
                          = c(hotel, is_canceled,
                             previous_cancellations, customer_type,total_cost, 
                             total_of_special_requests, lead_time, deposit_type))
hotel.test.sub <- subset(hotel.test,select 
                          = c(hotel, is_canceled,
                             previous_cancellations, customer_type,total_cost, 
                             total_of_special_requests,lead_time,agent, deposit_type))
# fit a logistic regression
hotel.glm <- glm(is_canceled ~., family = "binomial", data = hotel.train.sub)
summary(hotel.glm)
## 
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = hotel.train.sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4547  -0.7962  -0.5774   0.2396   2.8880  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.898e+00  5.819e-02 -32.620   <2e-16 ***
## hotelResort Hotel            -4.161e-01  1.876e-02 -22.188   <2e-16 ***
## previous_cancellations        1.171e+00  4.511e-02  25.966   <2e-16 ***
## customer_typeGroup           -3.691e-01  1.784e-01  -2.069   0.0385 *  
## customer_typeTransient        9.499e-01  5.586e-02  17.004   <2e-16 ***
## customer_typeTransient-Party  7.510e-03  5.853e-02   0.128   0.8979    
## total_cost                    6.735e-04  2.518e-05  26.742   <2e-16 ***
## total_of_special_requests    -5.215e-01  1.202e-02 -43.404   <2e-16 ***
## lead_time                     4.258e-03  9.557e-05  44.557   <2e-16 ***
## deposit_typeNon Refund        4.975e+00  1.210e-01  41.124   <2e-16 ***
## deposit_typeRefundable        2.373e-01  2.208e-01   1.075   0.2826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110301  on 83572  degrees of freedom
## Residual deviance:  81330  on 83562  degrees of freedom
## AIC: 81352
## 
## Number of Fisher Scoring iterations: 7
#prediction
pred.glm <- predict(hotel.glm, type = "response")
(p <- round(mean(as.numeric(hotel.train.sub$is_canceled)-1),4))
## [1] 0.3718
# AUC
pred <- prediction(pred.glm, hotel.train.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)

round(unlist(slot(performance(pred, "auc"), "y.values")),4)
## [1] 0.7938
# get binary prediction
class.hotel.train<- (pred.glm>p)*1
table(hotel.train.sub$is_canceled, class.hotel.train, dnn = c("True", "Predicted"))
##     Predicted
## True     0     1
##    0 44312  8188
##    1 12078 18995
# misclassification rate (MR)
MR<- mean(hotel.train.sub$is_canceled!=class.hotel.train)
MR
## [1] 0.2424946
  • Split the data set by customer_type, compare the AIC and Deviance on logistic regression model. From the result we can see that the Customer_type = Group is the most significant.
# split on customer_type
# deviance
hotel.train.sub %>% 
   split(.$customer_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("deviance")
# AIC
hotel.train.sub %>% 
   split(.$customer_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("aic")
# table of comparison
my_tbl_custmer <- tibble::tribble(
  ~Split,~deviance, ~AIC, 
"Contract", "2923.33", "2927.33",
"Group", "268.07",  "272.07", 
"Transient", "83652.09 ", "83656.09",
"Transient_party","19894.95","19898.95")
require(knitr)
kable(my_tbl_custmer, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Split deviance AIC
Contract 2923.33 2927.33
Group 268.07 272.07
Transient 83652.09 83656.09
Transient_party 19894.95 19898.95
  • Split the data set by deposit_type, compare the AIC and Deviance on logistic regression model. From the result we can see that the deposit_type = Refundable is the most significant.
# split on customer_type
# deviance
hotel.train.sub %>% 
   split(.$deposit_type) %>% 
   map(~ glm(is_canceled ~  hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("deviance")
# AIC
hotel.train.sub %>% 
   split(.$deposit_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("aic")
# table of comparison
my_tbl_deposit <- tibble::tribble(
  ~Split,~deviance, ~AIC, 
"No Deposit", "87154.43", "87158.43",
"Non Refund", "696.94",  "700.94", 
"Refundable", "120.93 ", "124.93")
require(knitr)
kable(my_tbl_deposit, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Split deviance AIC
No Deposit 87154.43 87158.43
Non Refund 696.94 700.94
Refundable 120.93 124.93
  1. fit a classification tree, make some adjustment to the train and test dataset. look into the result of the classification tree, surprisingly “hotel” is not one of the componet in tree constrction with this model. the accuracy of the model,the AUC under the roc curve is 82.44%
# create a subset
hotel.train.sub <- subset(hotel.train,select = c(-reservation_status))
hotel.test.sub <- subset(hotel.test,select = c(-reservation_status))
# fit a classification tree
hotel.rpart <- rpart(is_canceled ~ ., data = hotel.train.sub, method = "class")
hotel.rpart
## n= 83573 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 83573 31073 0 (0.628193316 0.371806684)  
##     2) deposit_type=No Deposit,Refundable 73293 20863 0 (0.715347987 0.284652013)  
##       4) agent=10,103,104,106,11,111,112,114,115,117,12,121,122,126,127,128,129,132,133,135,138,139,14,141,143,144,146,147,148,15,150,151,152,154,156,157,158,159,16,163,165,167,168,171,173,174,175,177,180,182,183,184,185,187,19,193,195,196,2,201,205,210,213,215,216,219,22,220,223,227,23,232,24,241,243,245,247,25,250,251,252,253,254,256,258,26,261,265,267,269,27,270,273,275,28,280,282,285,288,289,29,290,295,298,299,3,30,301,302,303,304,305,306,308,31,310,313,314,32,321,324,325,327,328,33,330,333,334,335,336,337,339,34,344,346,348,35,350,354,358,359,36,360,363,364,367,37,370,371,375,378,384,387,388,39,390,391,393,394,40,404,406,411,414,416,418,42,423,425,426,427,429,430,432,434,436,438,44,441,444,45,450,451,453,454,455,459,461,464,467,468,469,47,474,475,479,480,481,484,493,495,497,5,50,502,508,509,510,52,526,527,53,535,59,6,61,63,64,66,68,69,7,70,71,72,73,74,75,77,78,79,81,82,83,85,88,89,91,92,94,95,98,NULL 33453  4841 0 (0.855289511 0.144710489) *
##       5) agent=1,105,107,110,118,119,13,134,142,149,153,155,17,179,181,191,192,20,208,21,214,229,234,240,242,244,248,249,262,276,281,291,296,307,315,326,332,341,355,368,38,385,4,403,405,410,420,440,531,56,57,58,67,8,86,87,9,93,96,99 39840 16022 0 (0.597841365 0.402158635)  
##        10) total_of_special_requests>=0.5 22753  6411 0 (0.718234958 0.281765042) *
##        11) total_of_special_requests< 0.5 17087  7476 1 (0.437525604 0.562474396)  
##          22) lead_time< 8.5 2083   396 0 (0.809889582 0.190110418) *
##          23) lead_time>=8.5 15004  5789 1 (0.385830445 0.614169555)  
##            46) customer_type=Transient-Party 4541  1938 0 (0.573221757 0.426778243)  
##              92) country=ARG,ARM,AUS,AUT,BDI,BEL,BGR,BRA,CHE,CHL,CMR,CN,COL,CZE,DEU,DNK,DZA,EGY,ESP,FIN,FRA,GBR,GIB,GNB,GUY,HRV,HUN,IRN,IRQ,ISL,ISR,JPN,LBN,LBY,LUX,MAR,MOZ,MYS,NLD,NOR,NZL,PAN,PER,POL,ROU,RUS,SEN,SVK,SWE,TUR,UKR,USA,VEN,ZAF 1654   112 0 (0.932285369 0.067714631) *
##              93) country=CHN,GGY,GRC,IRL,ITA,NULL,OMN,PRT,SVN 2887  1061 1 (0.367509525 0.632490475) *
##            47) customer_type=Contract,Group,Transient 10463  3186 1 (0.304501577 0.695498423)  
##              94) required_car_parking_spaces>=0.5 392     0 0 (1.000000000 0.000000000) *
##              95) required_car_parking_spaces< 0.5 10071  2794 1 (0.277430245 0.722569755)  
##               190) reservation_status_date>=17304.5 971   321 0 (0.669412976 0.330587024) *
##               191) reservation_status_date< 17304.5 9100  2144 1 (0.235604396 0.764395604) *
##     3) deposit_type=Non Refund 10280    70 1 (0.006809339 0.993190661) *
prp(hotel.rpart)

# variable used in the tree construction
printcp(hotel.rpart)
## 
## Classification tree:
## rpart(formula = is_canceled ~ ., data = hotel.train.sub, method = "class")
## 
## Variables actually used in tree construction:
## [1] agent                       country                    
## [3] customer_type               deposit_type               
## [5] lead_time                   required_car_parking_spaces
## [7] reservation_status_date     total_of_special_requests  
## 
## Root node error: 31073/83573 = 0.37181
## 
## n= 83573 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.326328      0   1.00000 1.00000 0.0044963
## 2 0.034355      1   0.67367 0.67367 0.0040311
## 3 0.023010      4   0.56342 0.56538 0.0037908
## 4 0.012615      6   0.51739 0.52042 0.0036753
## 5 0.010588      7   0.50478 0.50758 0.0036404
## 6 0.010000      8   0.49419 0.50101 0.0036222
# prediction
pred.rpart <- predict(hotel.rpart, hotel.test.sub, type="prob")
pred.train <- predict(hotel.rpart, type="class")
# AUC
pred <- prediction(pred.rpart[,2], hotel.test.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)

round(unlist(slot(performance(pred, "auc"), "y.values")),4)
## [1] 0.8244
# misclassification rate (MR)
table(hotel.train.sub$is_canceled, pred.train , dnn = c("True", "Predicted"))
##     Predicted
## True     0     1
##    0 49225  3275
##    1 12081 18992
MR<- mean(hotel.train.sub$is_canceled!=pred.train )
MR
## [1] 0.1837436
  1. Neural Networks Models
hotel.nnet <- nnet(is_canceled~., data=hotel.train, size=1, maxit=500,MaxNWts=2000)
## # weights:  585
## initial  value 60240.385636 
## final  value 55150.677772 
## converged
plotnet(hotel.nnet,main="Neural Network for hotel data")

#in-sample
prob.nnet.train = predict(hotel.nnet, hotel.train)
pred.nnet.train = as.numeric(prob.nnet.train > 0.5)
table(hotel.train$is_canceled, pred.nnet.train, dnn = c("Observed","Predicted"))
##         Predicted
## Observed     0
##        0 52500
##        1 31073
mean(ifelse(hotel.train$is_canceled!= pred.nnet.train, 1, 0))
## [1] 0.3718067
#out of -sample
prob.nnet.test = predict(hotel.nnet, hotel.test)
pred.nnet.test = as.numeric(prob.nnet.test >0.5)
table(hotel.test$is_canceled, pred.nnet.test, dnn = c("Observed","Predicted"))
##         Predicted
## Observed     0
##        0 22666
##        1 13151
mean(ifelse(hotel.test$is_canceled!= pred.nnet.test, 1, 0))
## [1] 0.367172

Summary

From EDA

  1. In the dataset the percentage of City Hotel is 66.45 %, we can see the booking cancelling rate of City Hotel is 74.85%. we can tell the cancelling is a little high for City Hotel.

  2. lead time is the most important factor in cancellation, the longer the lead time the higher the cancellation rate. the shorter lead time during the winter than in summer. no difference between City Hotel and Resort Hotel.

  3. If guests with kids or not have no affect on cancellation.

  4. repearted guests have lower cancelllation rate. the more time the guests cancel the booking before the more likely to cancel the booking for this time.

  5. The longer the length_stay the more likely customers will choose Resort Hotel.

From modeling

  1. logistic regression. pick up the variable of hotel, is_canceled,previous_cancellations, customer_type,total_cost, total_of_special_requests, lead time, deposit type.

  2. Classification tree. the variables used in the tree construction is agent, country, customer type, deposit type, lead time, required car parking spaces, reservation status date, total of special requests.

  3. Neural network has only one hidden layer. the output will be complicated, but we can still workout the accuracyl.

# table of comparison
my_tbl3 <- tibble::tribble(
  ~Model,~MR, ~AUC, 
"Logistic regression", "0.2425", "0.7938",
"Classification tree", "0.1837",  "0.8244", 
"Neural Networks", "0.3672 ", "NA")
require(knitr)
kable(my_tbl3, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)
Model MR AUC
Logistic regression 0.2425 0.7938
Classification tree 0.1837 0.8244
Neural Networks 0.3672 NA