Hotel Bookings Data

Introduction & packages

This hotel dataset help readers know the information about hotel booking. how long do customers stay? what is the most pattern of customers? how often do customers cancel their booking? when is the busiest week of the year? which factors affect the customers make the decision of cancellation.

The dataset information comes from Antonio, Almeida and Nunes, 2019 I would build models(logistic regression, classification tree and Neural Networks) using the data, Split the data on variables by groups. with statistical models find the relationship between variables to forecast customers’ future behavior help the hotels to make preparation for customers. look into the models’ accuracy and compare with these 3 different models.

Packages Required

library(readr)          # load data
library(knitr)          # inset tables and 
library(dplyr)          # filter/ group by function
library(ggplot2)        # plot function
library(tidyverse)      # pipe function
library(ggpubr)         # combined ggplots
library(rpart)          # build classification tree
library(rpart.plot)     # plot classification tree
library(nnet)           # Neural Networks Models
library(NeuralNetTools) # plot neoral nework models
library(ROCR)           # prediction

Data Preparation

Load the data into R, the data is from Antonio, Almeida and Nunes, 2019. the dataset contains 119390 observations and 32 variables. Look into this hotel dataset, it contains information about clients’ stay at hotels. specifically, it contains infromation including booking information for a Resort Hotel or City Hotel, the booking was cancelled or not, when the booking was made, the length of the stay, number of the adults, children, and babies. and many other information.

hotel <- readr::read_csv("C:/Users/yizha/Desktop/Spring 2020/Data wrangling/hotels.csv") #load the dataset

# build the table of data dictionary
my_tbl <- tibble::tribble(
     ~Variable, ~VariableName,      ~Class,    ~Description,
 "1", "hotel", "character",  "Resort Hotel or City Hotel",
 "2" ,"is_canceled", "numeric",  "booking was canceled (1) or not (0)",
 "3", "lend_time",  "numeric",  "number of days between booking and arrival date",
"4",  "arrival_date_year", "numeric", "Year of arrival date",
"5",  "arrival_date_month","character","Month of arrival date",
"6",  "arrival_date_week_number","nemeric", "Week number of year for arrival date",
"7",  "arrival_date_day_of_month",  "numeric","Day of arrival date",
"8", "stays_in_weekend_nights","numeric","Number of nights(weekend) the guest stayed",
"9", "stays_in_week_nights","numeric","Number of nights(weekday) the guest stayed",
 "10", "adults","numeric","Number of adults",
 "11", "children","numeric","Number of children",
 "12", "babies","numeric","Number of babies",
 "13", "meal"，"character","Tpye of meal ",
 "14", "country","character","Country of origin",
"15","market_segmet","character","Market segment designation",
"16", "distribution_channel","character","Booking distribution channel",
"17", "is_repeated_guest","numeric","repeated guest:yes(1),no(0)",
"18", "previous_cancellations","numeric","Number of bookings cancelled",
"19", "previous_bookings_not_canceled","numeric","Number of bookings not cancelled",
"20", "reserved_room_type","character","Code of room type reserved",
"21","assigned_room_type","character","Code for the type of room assigned",
"22","booking_changes","numeric","number of changes made to the booking",
"23","deposit_type","character","deposit to guarantee the booking",
"24","agent","character","ID of the agency made the booking",
"25","company","character","ID of the company that made the booking",
"26","days_in_waiting_list","numeric","number of days before comfirmed",
"27","customer_type","character","Type of booking",
"28","adr","numeric","Average Daily Rate",
"29","required_car_parking_spaces","numeric","Number of car parking spaces required",
"30","total_of_special_requests","numeric","Number of special requests by customer",
"31","reservation_status","character","Reservation last status",
"32","reservation_status_date","numeric","Date at which the last status was set",) 
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)

Variable	VariableName	Class	Description
1	hotel	character	Resort Hotel or City Hotel
2	is_canceled	numeric	booking was canceled (1) or not (0)
3	lend_time	numeric	number of days between booking and arrival date
4	arrival_date_year	numeric	Year of arrival date
5	arrival_date_month	character	Month of arrival date
6	arrival_date_week_number	nemeric	Week number of year for arrival date
7	arrival_date_day_of_month	numeric	Day of arrival date
8	stays_in_weekend_nights	numeric	Number of nights(weekend) the guest stayed
9	stays_in_week_nights	numeric	Number of nights(weekday) the guest stayed
10	adults	numeric	Number of adults
11	children	numeric	Number of children
12	babies	numeric	Number of babies
13	meal	character	Tpye of meal
14	country	character	Country of origin
15	market_segmet	character	Market segment designation
16	distribution_channel	character	Booking distribution channel
17	is_repeated_guest	numeric	repeated guest:yes(1),no(0)
18	previous_cancellations	numeric	Number of bookings cancelled
19	previous_bookings_not_canceled	numeric	Number of bookings not cancelled
20	reserved_room_type	character	Code of room type reserved
21	assigned_room_type	character	Code for the type of room assigned
22	booking_changes	numeric	number of changes made to the booking
23	deposit_type	character	deposit to guarantee the booking
24	agent	character	ID of the agency made the booking
25	company	character	ID of the company that made the booking
26	days_in_waiting_list	numeric	number of days before comfirmed
27	customer_type	character	Type of booking
28	adr	numeric	Average Daily Rate
29	required_car_parking_spaces	numeric	Number of car parking spaces required
30	total_of_special_requests	numeric	Number of special requests by customer
31	reservation_status	character	Reservation last status
32	reservation_status_date	numeric	Date at which the last status was set

Adjustment with the original data

Table of variables that be converted:

# table of converted variables
my_tbl2 <- tibble::tribble(
  ~Variable,~Name, ~From, ~To, ~Levels,
"1", "hotel", "chr",  "factor",  "2",
"2", "is canceled",  "chr", "factor","2",
"5", " arrival_date_month", "chr", "factor", "12",
"13", " meal","chr","factor","5",
"14", " country","chr","factor","178",
"15", " market_segment","chr","factor","7",
"16", " distribution_channel","chr","factor","5",
"17", " is_repeated_guest","chr","factor","2",
"20", " reserved_room_type","chr","factor","10",
"21", " assigned_room_type","chr","factor","12",
"23", " deposit_type","chr","factor","3",
"24", " agent","chr","factor","334",
"27", " customer_type","chr","factor","4",
"31", " reservation_status","chr","factor","3",
"32", " reservation_status_date","chr","date",NA)
require(knitr)
kable(my_tbl2, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)

Variable	Name	From	To	Levels
1	hotel	chr	factor	2
2	is canceled	chr	factor	2
5	arrival_date_month	chr	factor	12
13	meal	chr	factor	5
14	country	chr	factor	178
15	market_segment	chr	factor	7
16	distribution_channel	chr	factor	5
17	is_repeated_guest	chr	factor	2
20	reserved_room_type	chr	factor	10
21	assigned_room_type	chr	factor	12
23	deposit_type	chr	factor	3
24	agent	chr	factor	334
27	customer_type	chr	factor	4
31	reservation_status	chr	factor	3
32	reservation_status_date	chr	date	NA

# convert some variables
hotel <- hotel %>% 
   mutate(hotel = as.factor(hotel),
          is_canceled = as.factor(is_canceled),
          arrival_date_month = as.factor(arrival_date_month),
          meal = as.factor(meal),
          country = as.factor(country),
          market_segment = as.factor(market_segment),
          distribution_channel = as.factor(distribution_channel),
          is_repeated_guest = as.factor(is_repeated_guest),
          reserved_room_type = as.factor(reserved_room_type),
          assigned_room_type = as.factor(assigned_room_type),
          agent = as.factor(agent),
          deposit_type = as.factor(deposit_type),
          customer_type = as.factor(customer_type),
          reservation_status = as.factor(reservation_status),
          reservation_status_date = as.Date(reservation_status_date,"%m/%d/%Y"))
# define the levels
hotel$arrival_date_month <- factor(hotel$arrival_date_month,levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
levels(hotel$arrival_date_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

Check the missing value in the hotel dataset:

the children variable has 4 missing value. for the entire dataset has 119390, the missing value is only a tiny portion in hotel. I impute the missing value in children variable with median.
the company variable has missing value “NULL”. the percentage of the NULL in this variable is 94.31%, we can simply removed the entire column from the dataset.

# check the missing value
hotel %>%
  map_dbl(~ sum(is.na(.)))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

# impute the missing value in children variable with median
hotel <- hotel %>%
  mutate(children
         = replace(children,
                   is.na(children),
                   median(children, na.rm = TRUE)))
# check the missing value in 'company' variable
hotel %>% 
   summarise(missingValue 
             = sum(company == "NULL"),Percent = sum(company == "NULL")/length(company)*100)

## # A tibble: 1 x 2
##   missingValue Percent
##          <int>   <dbl>
## 1       112593    94.3

hotel <- hotel %>% select(-company)

The dataset has weekend nights(variable 8) & week nights(variable 9) seperate, combine this two variable into one vector-length_stay (length of time guests stay). for length_stay surely larger than 0 when the booking was not canceled, there are 680 missing value (length_stay = 0) in the variable of length_stay.the mean of the length_stay is 3.449, impute the missing value with the median. after the adjustment, the cleaned column of length_stay will have 35 rows of zero.

# combine column 8&9 into new column length_stay
hotel <- hotel %>% 
   mutate(length_stay=stays_in_weekend_nights + stays_in_week_nights)
# impute the missing value with median
temp <- ifelse(hotel$is_canceled == 0 & hotel$length_stay==0, median(hotel$length_stay), 0) 
hotel$length_stay <- hotel$length_stay +  temp 
# if the booking was canceled, 35 rows of length_stay will be zero
sum(hotel$length_stay==0)

## [1] 35

hotel <- hotel %>% select(c(-stays_in_weekend_nights,stays_in_week_nights))

Combine the variable 11 and 12 into a new logical vector (kids) to see the relationship between reservation cancel and have kids or not.

# create the new column kids, separate data into with-kid(1) and with-out kid(0)
hotel <- hotel %>% 
   mutate(kids =  children + babies) %>% 
   mutate(kids = ifelse(kids > 0,1,0)) %>% 
   mutate(kids = as.factor(kids))

hotel <- hotel %>% select(c(-children,-babies))

Variable ‘adr’ is the average of daily rate. combine “adr” and “length_stay” to get total cost per booking.

hotel %>% 
  ggplot(aes(adr))+
  geom_histogram(binwidth = 10)+
  coord_cartesian(xlim = c(0,400))

hotel <- hotel %>%
  mutate(adr = replace(adr, adr <= 0, median(adr, na.rm = TRUE))) %>%  # impute the missing value with de median
  mutate(total_cost = adr * length_stay) # create new column named total_cost = adr * length_stay

summary(hotel$total_cost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     150     270     362     448    7590

To sum up, the cleaned hotel dataset 119390 observations and 31 variables. the response variable ‘is_canceled’ is a categorical factor.

Exploratoty Data Analysis (EDA)

The clean dataset of hotel contains 119390 observations and 32 variables.

Figure-1 & Figure-2, in the dataset the percentage of City Hotel is 66.45 %, we can see the booking cancelling rate of City Hotel is 74.85%. we can tell the cancelling is high for City Hotel.

# plot the pie of hotels
value <- hotel %>% 
   summarise(sum(hotel=="City Hotel")/length(hotel),sum(hotel=="Resort Hotel")/length(hotel))

df <- data.frame(value= c(66.45,33.55), group = c("City Hotel", "Resort Hotel"))
df %>% 
    ggplot(aes(x="", y=value, fill=group)) +
    geom_bar(stat="identity", width=1) +
    coord_polar("y", start=0) +
    geom_text(aes(label = paste0(value, "%")), position = position_stack(vjust = 0.5)) +
    labs(x = NULL, y = NULL, fill = NULL, title = "Figure-1: Pie Chart of the hotel")

# which hotel canceled more
hotel %>% 
   ggplot(mapping = aes(x = is_canceled, fill = hotel)) +
   geom_bar() + 
   ggtitle("Figure-2: cancelling rate")

Figure-3, the correlation between how long the reservation was made and the which month it was made. we can tell the longest lead time about 140days that reservations were made during the summer,but short lead time during the winter.

# boxplot of the lead time
hotel %>% 
   ggplot(aes(arrival_date_month,lead_time))+
   geom_boxplot(fill="steelblue",color="black")+
   ggtitle("Figure-3.1: arrival month vs lead time")

hotel %>% 
   ggplot(aes(lead_time, is_canceled))+
   geom_point(fill="steelblue",color="black")+
   ggtitle("Figure-3.2: cancellation vs lead time")

Figure-4, From the plot, we can tell that the cancellation happened soon after the booking. the longer the lead time the chances of cancellation are higher after the booking. there is no difference between City hotel and Resort Hotel in the relation of lead time and cancell

hotel %>% 
   ggplot(aes(hotel,lead_time, fill = factor(is_canceled)))+
   geom_boxplot(position = "dodge")+
   labs(title = "Figure-4: Cancellation By Hotel Type")

Figure-5, we can tell that guests with kids or not has not influence on the cancelling rate.

# with-kid(1) and with-out kid(0)
hotel %>% 
   ggplot(aes(is_canceled,fill = kids))+
   geom_bar(position = "fill")+
   ggtitle("Figure-5: kids VS canceled")

Figure-6, the repeated guests are not tend to cancel the reservation. the repeated guest is 3.19% of total guests. the repeated guest from the canceled group is 2.53% of total guests. the number is lower compare to entire population. we can say the repeated guests tend to fullfil the booking.

hotel %>% 
   ggplot(aes(is_canceled, fill = is_repeated_guest)) +
   geom_bar() + 
   ggtitle("Figure-6: repeated guests VS canceled")

Figure-7, the more times the previous cancellations customers the more likely they would cancel for this time.

# a repeated guest (1) or not (0)
hotel %>% 
   ggplot(aes(is_canceled, previous_cancellations)) +
   geom_count(color = "darkred") +
   scale_size(range=c(2, 15))+
   ggtitle("Figure-7: Previous cancellations")

Figure-8, the number of the reservations made per month, over the three year. we can see the month of August receive the largest number of booking. consistent with Figure-3, summer months need the longest lead time.

year <- as.factor(hotel$arrival_date_year)
hotel %>% 
   ggplot(aes(arrival_date_month, fill = year)) +
   geom_bar() + 
   ggtitle("Figure-8: booking of the month")

Figure-9, Customer perfer Resort Hotel for short term stay. City Hotel becomes more pupular longer the customers stay. the Resort Hotel has long right tail, guest spend more at Resort Hotel. the mean of total cost mainly between 250-350 for City Hotel, 350-450 for Resort Hotel.

# create a dataset that customer has checked in but already departed
hotel_sub <- hotel %>% 
   filter(reservation_status=="Check-Out")

hotel_sub %>% 
ggplot(aes(x = total_cost, fill = hotel, color = hotel)) + 
  geom_histogram(aes(y = ..density..), position = "dodge", binwidth = 50) +
  geom_density(alpha = 0.2) + 
  coord_cartesian(xlim = c(0,2500))+
  ggtitle("Figure-9: Total Cost")

Figure-10, Contract customers prefer Resort Hotel to City Hotel. Group type doesn’t have preference, but transient and transient-party prefer to City Hotel.

hotel_sub %>% 
ggplot(aes(customer_type, fill = hotel)) + 
  geom_bar(position = "dodge") + 
  ggtitle("Figure-10 Customer type of hotels")

10. Figure-11, The customers short term trip choose City Hotel more than Resort Hotel,

hotel_sub %>% 
     ggplot(aes(length_stay)) + 
     geom_density(col = "red")+
     facet_wrap(~ hotel)+
     ggtitle("Figure-11 Length of stay VS type of hotel")

Figure-12, we can tell that the booking from the Transient is the most important part for both the hotels. contract gradually from Jan to Oct but decrease sharply on Nov and Dec. peak booking from group appear at Oct, while peak appear at Aug for Transient. Transient-party seems follow the seasonal pattern from spring to winter.

hotel %>% 
ggplot(aes(arrival_date_month)) +
        geom_bar(fill="steelblue",color="black") +
        facet_wrap(~customer_type, scales = "free_y")+
        ggtitle("Figure-12 monthly booking on customer type")

Statistical modeling

Randomly split the data to training(70%) and testing(30%) datasets:

set.seed(7025)
index <- sample(nrow(hotel),nrow(hotel)*0.7)
hotel.train <- hotel[index,]
hotel.test <-  hotel[-index,]

fit a logistic regression. we pick variable hotel, is_canceled,previous_cancellations, customer_type,total_cost, total_of_special_requests, lead time, deposit type. the AIC of the logistic regession is 81352 which is high. the probability of the canceled the booking is 37.18%. the accuracy of the model,the AUC under the roc curve is 0.7938.

# create a subset
hotel.train.sub <- subset(hotel.train,select 
                          = c(hotel, is_canceled,
                             previous_cancellations, customer_type,total_cost, 
                             total_of_special_requests, lead_time, deposit_type))
hotel.test.sub <- subset(hotel.test,select 
                          = c(hotel, is_canceled,
                             previous_cancellations, customer_type,total_cost, 
                             total_of_special_requests,lead_time,agent, deposit_type))
# fit a logistic regression
hotel.glm <- glm(is_canceled ~., family = "binomial", data = hotel.train.sub)
summary(hotel.glm)

## 
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = hotel.train.sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4547  -0.7962  -0.5774   0.2396   2.8880  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.898e+00  5.819e-02 -32.620   <2e-16 ***
## hotelResort Hotel            -4.161e-01  1.876e-02 -22.188   <2e-16 ***
## previous_cancellations        1.171e+00  4.511e-02  25.966   <2e-16 ***
## customer_typeGroup           -3.691e-01  1.784e-01  -2.069   0.0385 *  
## customer_typeTransient        9.499e-01  5.586e-02  17.004   <2e-16 ***
## customer_typeTransient-Party  7.510e-03  5.853e-02   0.128   0.8979    
## total_cost                    6.735e-04  2.518e-05  26.742   <2e-16 ***
## total_of_special_requests    -5.215e-01  1.202e-02 -43.404   <2e-16 ***
## lead_time                     4.258e-03  9.557e-05  44.557   <2e-16 ***
## deposit_typeNon Refund        4.975e+00  1.210e-01  41.124   <2e-16 ***
## deposit_typeRefundable        2.373e-01  2.208e-01   1.075   0.2826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110301  on 83572  degrees of freedom
## Residual deviance:  81330  on 83562  degrees of freedom
## AIC: 81352
## 
## Number of Fisher Scoring iterations: 7

#prediction
pred.glm <- predict(hotel.glm, type = "response")
(p <- round(mean(as.numeric(hotel.train.sub$is_canceled)-1),4))

## [1] 0.3718

# AUC
pred <- prediction(pred.glm, hotel.train.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)

round(unlist(slot(performance(pred, "auc"), "y.values")),4)

## [1] 0.7938

# get binary prediction
class.hotel.train<- (pred.glm>p)*1
table(hotel.train.sub$is_canceled, class.hotel.train, dnn = c("True", "Predicted"))

##     Predicted
## True     0     1
##    0 44312  8188
##    1 12078 18995

# misclassification rate (MR)
MR<- mean(hotel.train.sub$is_canceled!=class.hotel.train)
MR

## [1] 0.2424946

Split the data set by customer_type, compare the AIC and Deviance on logistic regression model. From the result we can see that the Customer_type = Group is the most significant.

# split on customer_type
# deviance
hotel.train.sub %>% 
   split(.$customer_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("deviance")
# AIC
hotel.train.sub %>% 
   split(.$customer_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("aic")

# table of comparison
my_tbl_custmer <- tibble::tribble(
  ~Split,~deviance, ~AIC, 
"Contract", "2923.33", "2927.33",
"Group", "268.07",  "272.07", 
"Transient", "83652.09 ", "83656.09",
"Transient_party","19894.95","19898.95")
require(knitr)
kable(my_tbl_custmer, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)

Split	deviance	AIC
Contract	2923.33	2927.33
Group	268.07	272.07
Transient	83652.09	83656.09
Transient_party	19894.95	19898.95

Split the data set by deposit_type, compare the AIC and Deviance on logistic regression model. From the result we can see that the deposit_type = Refundable is the most significant.

# split on customer_type
# deviance
hotel.train.sub %>% 
   split(.$deposit_type) %>% 
   map(~ glm(is_canceled ~  hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("deviance")
# AIC
hotel.train.sub %>% 
   split(.$deposit_type) %>% 
   map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>% 
   map(summary) %>% 
   map("aic")

# table of comparison
my_tbl_deposit <- tibble::tribble(
  ~Split,~deviance, ~AIC, 
"No Deposit", "87154.43", "87158.43",
"Non Refund", "696.94",  "700.94", 
"Refundable", "120.93 ", "124.93")
require(knitr)
kable(my_tbl_deposit, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)

Split	deviance	AIC
No Deposit	87154.43	87158.43
Non Refund	696.94	700.94
Refundable	120.93	124.93

fit a classification tree, make some adjustment to the train and test dataset. look into the result of the classification tree, surprisingly “hotel” is not one of the componet in tree constrction with this model. the accuracy of the model,the AUC under the roc curve is 82.44%

# create a subset
hotel.train.sub <- subset(hotel.train,select = c(-reservation_status))
hotel.test.sub <- subset(hotel.test,select = c(-reservation_status))
# fit a classification tree
hotel.rpart <- rpart(is_canceled ~ ., data = hotel.train.sub, method = "class")
hotel.rpart

## n= 83573 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 83573 31073 0 (0.628193316 0.371806684)  
##     2) deposit_type=No Deposit,Refundable 73293 20863 0 (0.715347987 0.284652013)  
##       4) agent=10,103,104,106,11,111,112,114,115,117,12,121,122,126,127,128,129,132,133,135,138,139,14,141,143,144,146,147,148,15,150,151,152,154,156,157,158,159,16,163,165,167,168,171,173,174,175,177,180,182,183,184,185,187,19,193,195,196,2,201,205,210,213,215,216,219,22,220,223,227,23,232,24,241,243,245,247,25,250,251,252,253,254,256,258,26,261,265,267,269,27,270,273,275,28,280,282,285,288,289,29,290,295,298,299,3,30,301,302,303,304,305,306,308,31,310,313,314,32,321,324,325,327,328,33,330,333,334,335,336,337,339,34,344,346,348,35,350,354,358,359,36,360,363,364,367,37,370,371,375,378,384,387,388,39,390,391,393,394,40,404,406,411,414,416,418,42,423,425,426,427,429,430,432,434,436,438,44,441,444,45,450,451,453,454,455,459,461,464,467,468,469,47,474,475,479,480,481,484,493,495,497,5,50,502,508,509,510,52,526,527,53,535,59,6,61,63,64,66,68,69,7,70,71,72,73,74,75,77,78,79,81,82,83,85,88,89,91,92,94,95,98,NULL 33453  4841 0 (0.855289511 0.144710489) *
##       5) agent=1,105,107,110,118,119,13,134,142,149,153,155,17,179,181,191,192,20,208,21,214,229,234,240,242,244,248,249,262,276,281,291,296,307,315,326,332,341,355,368,38,385,4,403,405,410,420,440,531,56,57,58,67,8,86,87,9,93,96,99 39840 16022 0 (0.597841365 0.402158635)  
##        10) total_of_special_requests>=0.5 22753  6411 0 (0.718234958 0.281765042) *
##        11) total_of_special_requests< 0.5 17087  7476 1 (0.437525604 0.562474396)  
##          22) lead_time< 8.5 2083   396 0 (0.809889582 0.190110418) *
##          23) lead_time>=8.5 15004  5789 1 (0.385830445 0.614169555)  
##            46) customer_type=Transient-Party 4541  1938 0 (0.573221757 0.426778243)  
##              92) country=ARG,ARM,AUS,AUT,BDI,BEL,BGR,BRA,CHE,CHL,CMR,CN,COL,CZE,DEU,DNK,DZA,EGY,ESP,FIN,FRA,GBR,GIB,GNB,GUY,HRV,HUN,IRN,IRQ,ISL,ISR,JPN,LBN,LBY,LUX,MAR,MOZ,MYS,NLD,NOR,NZL,PAN,PER,POL,ROU,RUS,SEN,SVK,SWE,TUR,UKR,USA,VEN,ZAF 1654   112 0 (0.932285369 0.067714631) *
##              93) country=CHN,GGY,GRC,IRL,ITA,NULL,OMN,PRT,SVN 2887  1061 1 (0.367509525 0.632490475) *
##            47) customer_type=Contract,Group,Transient 10463  3186 1 (0.304501577 0.695498423)  
##              94) required_car_parking_spaces>=0.5 392     0 0 (1.000000000 0.000000000) *
##              95) required_car_parking_spaces< 0.5 10071  2794 1 (0.277430245 0.722569755)  
##               190) reservation_status_date>=17304.5 971   321 0 (0.669412976 0.330587024) *
##               191) reservation_status_date< 17304.5 9100  2144 1 (0.235604396 0.764395604) *
##     3) deposit_type=Non Refund 10280    70 1 (0.006809339 0.993190661) *

prp(hotel.rpart)

# variable used in the tree construction
printcp(hotel.rpart)

## 
## Classification tree:
## rpart(formula = is_canceled ~ ., data = hotel.train.sub, method = "class")
## 
## Variables actually used in tree construction:
## [1] agent                       country                    
## [3] customer_type               deposit_type               
## [5] lead_time                   required_car_parking_spaces
## [7] reservation_status_date     total_of_special_requests  
## 
## Root node error: 31073/83573 = 0.37181
## 
## n= 83573 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.326328      0   1.00000 1.00000 0.0044963
## 2 0.034355      1   0.67367 0.67367 0.0040311
## 3 0.023010      4   0.56342 0.56538 0.0037908
## 4 0.012615      6   0.51739 0.52042 0.0036753
## 5 0.010588      7   0.50478 0.50758 0.0036404
## 6 0.010000      8   0.49419 0.50101 0.0036222

# prediction
pred.rpart <- predict(hotel.rpart, hotel.test.sub, type="prob")
pred.train <- predict(hotel.rpart, type="class")
# AUC
pred <- prediction(pred.rpart[,2], hotel.test.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)

round(unlist(slot(performance(pred, "auc"), "y.values")),4)

## [1] 0.8244

# misclassification rate (MR)
table(hotel.train.sub$is_canceled, pred.train , dnn = c("True", "Predicted"))

##     Predicted
## True     0     1
##    0 49225  3275
##    1 12081 18992

MR<- mean(hotel.train.sub$is_canceled!=pred.train )
MR

## [1] 0.1837436

Neural Networks Models

hotel.nnet <- nnet(is_canceled~., data=hotel.train, size=1, maxit=500,MaxNWts=2000)

## # weights:  585
## initial  value 60240.385636 
## final  value 55150.677772 
## converged

plotnet(hotel.nnet,main="Neural Network for hotel data")

#in-sample
prob.nnet.train = predict(hotel.nnet, hotel.train)
pred.nnet.train = as.numeric(prob.nnet.train > 0.5)
table(hotel.train$is_canceled, pred.nnet.train, dnn = c("Observed","Predicted"))

##         Predicted
## Observed     0
##        0 52500
##        1 31073

mean(ifelse(hotel.train$is_canceled!= pred.nnet.train, 1, 0))

## [1] 0.3718067

#out of -sample
prob.nnet.test = predict(hotel.nnet, hotel.test)
pred.nnet.test = as.numeric(prob.nnet.test >0.5)
table(hotel.test$is_canceled, pred.nnet.test, dnn = c("Observed","Predicted"))

##         Predicted
## Observed     0
##        0 22666
##        1 13151

mean(ifelse(hotel.test$is_canceled!= pred.nnet.test, 1, 0))

## [1] 0.367172

Summary

From EDA

In the dataset the percentage of City Hotel is 66.45 %, we can see the booking cancelling rate of City Hotel is 74.85%. we can tell the cancelling is a little high for City Hotel.
lead time is the most important factor in cancellation, the longer the lead time the higher the cancellation rate. the shorter lead time during the winter than in summer. no difference between City Hotel and Resort Hotel.
If guests with kids or not have no affect on cancellation.
repearted guests have lower cancelllation rate. the more time the guests cancel the booking before the more likely to cancel the booking for this time.
The longer the length_stay the more likely customers will choose Resort Hotel.

From modeling

logistic regression. pick up the variable of hotel, is_canceled,previous_cancellations, customer_type,total_cost, total_of_special_requests, lead time, deposit type.
Classification tree. the variables used in the tree construction is agent, country, customer type, deposit type, lead time, required car parking spaces, reservation status date, total of special requests.
Neural network has only one hidden layer. the output will be complicated, but we can still workout the accuracyl.

# table of comparison
my_tbl3 <- tibble::tribble(
  ~Model,~MR, ~AUC, 
"Logistic regression", "0.2425", "0.7938",
"Classification tree", "0.1837",  "0.8244", 
"Neural Networks", "0.3672 ", "NA")
require(knitr)
kable(my_tbl3, digits = 3, row.names = FALSE, align = "c",
              caption = NULL)

Model	MR	AUC
Logistic regression	0.2425	0.7938
Classification tree	0.1837	0.8244
Neural Networks	0.3672	NA

Hotel Booking Data Analysis

Yao Lin