This hotel dataset help readers know the information about hotel booking. how long do customers stay? what is the most pattern of customers? how often do customers cancel their booking? when is the busiest week of the year? which factors affect the customers make the decision of cancellation.
The dataset information comes from Antonio, Almeida and Nunes, 2019 I would build models(logistic regression, classification tree and Neural Networks) using the data, Split the data on variables by groups. with statistical models find the relationship between variables to forecast customers’ future behavior help the hotels to make preparation for customers. look into the models’ accuracy and compare with these 3 different models.
library(readr) # load data
library(knitr) # inset tables and
library(dplyr) # filter/ group by function
library(ggplot2) # plot function
library(tidyverse) # pipe function
library(ggpubr) # combined ggplots
library(rpart) # build classification tree
library(rpart.plot) # plot classification tree
library(nnet) # Neural Networks Models
library(NeuralNetTools) # plot neoral nework models
library(ROCR) # prediction
Load the data into R, the data is from Antonio, Almeida and Nunes, 2019. the dataset contains 119390 observations and 32 variables. Look into this hotel dataset, it contains information about clients’ stay at hotels. specifically, it contains infromation including booking information for a Resort Hotel or City Hotel, the booking was cancelled or not, when the booking was made, the length of the stay, number of the adults, children, and babies. and many other information.
hotel <- readr::read_csv("C:/Users/yizha/Desktop/Spring 2020/Data wrangling/hotels.csv") #load the dataset
# build the table of data dictionary
my_tbl <- tibble::tribble(
~Variable, ~VariableName, ~Class, ~Description,
"1", "hotel", "character", "Resort Hotel or City Hotel",
"2" ,"is_canceled", "numeric", "booking was canceled (1) or not (0)",
"3", "lend_time", "numeric", "number of days between booking and arrival date",
"4", "arrival_date_year", "numeric", "Year of arrival date",
"5", "arrival_date_month","character","Month of arrival date",
"6", "arrival_date_week_number","nemeric", "Week number of year for arrival date",
"7", "arrival_date_day_of_month", "numeric","Day of arrival date",
"8", "stays_in_weekend_nights","numeric","Number of nights(weekend) the guest stayed",
"9", "stays_in_week_nights","numeric","Number of nights(weekday) the guest stayed",
"10", "adults","numeric","Number of adults",
"11", "children","numeric","Number of children",
"12", "babies","numeric","Number of babies",
"13", "meal","character","Tpye of meal ",
"14", "country","character","Country of origin",
"15","market_segmet","character","Market segment designation",
"16", "distribution_channel","character","Booking distribution channel",
"17", "is_repeated_guest","numeric","repeated guest:yes(1),no(0)",
"18", "previous_cancellations","numeric","Number of bookings cancelled",
"19", "previous_bookings_not_canceled","numeric","Number of bookings not cancelled",
"20", "reserved_room_type","character","Code of room type reserved",
"21","assigned_room_type","character","Code for the type of room assigned",
"22","booking_changes","numeric","number of changes made to the booking",
"23","deposit_type","character","deposit to guarantee the booking",
"24","agent","character","ID of the agency made the booking",
"25","company","character","ID of the company that made the booking",
"26","days_in_waiting_list","numeric","number of days before comfirmed",
"27","customer_type","character","Type of booking",
"28","adr","numeric","Average Daily Rate",
"29","required_car_parking_spaces","numeric","Number of car parking spaces required",
"30","total_of_special_requests","numeric","Number of special requests by customer",
"31","reservation_status","character","Reservation last status",
"32","reservation_status_date","numeric","Date at which the last status was set",)
require(knitr)
kable(my_tbl, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Variable | VariableName | Class | Description |
|---|---|---|---|
| 1 | hotel | character | Resort Hotel or City Hotel |
| 2 | is_canceled | numeric | booking was canceled (1) or not (0) |
| 3 | lend_time | numeric | number of days between booking and arrival date |
| 4 | arrival_date_year | numeric | Year of arrival date |
| 5 | arrival_date_month | character | Month of arrival date |
| 6 | arrival_date_week_number | nemeric | Week number of year for arrival date |
| 7 | arrival_date_day_of_month | numeric | Day of arrival date |
| 8 | stays_in_weekend_nights | numeric | Number of nights(weekend) the guest stayed |
| 9 | stays_in_week_nights | numeric | Number of nights(weekday) the guest stayed |
| 10 | adults | numeric | Number of adults |
| 11 | children | numeric | Number of children |
| 12 | babies | numeric | Number of babies |
| 13 | meal | character | Tpye of meal |
| 14 | country | character | Country of origin |
| 15 | market_segmet | character | Market segment designation |
| 16 | distribution_channel | character | Booking distribution channel |
| 17 | is_repeated_guest | numeric | repeated guest:yes(1),no(0) |
| 18 | previous_cancellations | numeric | Number of bookings cancelled |
| 19 | previous_bookings_not_canceled | numeric | Number of bookings not cancelled |
| 20 | reserved_room_type | character | Code of room type reserved |
| 21 | assigned_room_type | character | Code for the type of room assigned |
| 22 | booking_changes | numeric | number of changes made to the booking |
| 23 | deposit_type | character | deposit to guarantee the booking |
| 24 | agent | character | ID of the agency made the booking |
| 25 | company | character | ID of the company that made the booking |
| 26 | days_in_waiting_list | numeric | number of days before comfirmed |
| 27 | customer_type | character | Type of booking |
| 28 | adr | numeric | Average Daily Rate |
| 29 | required_car_parking_spaces | numeric | Number of car parking spaces required |
| 30 | total_of_special_requests | numeric | Number of special requests by customer |
| 31 | reservation_status | character | Reservation last status |
| 32 | reservation_status_date | numeric | Date at which the last status was set |
Adjustment with the original data
# table of converted variables
my_tbl2 <- tibble::tribble(
~Variable,~Name, ~From, ~To, ~Levels,
"1", "hotel", "chr", "factor", "2",
"2", "is canceled", "chr", "factor","2",
"5", " arrival_date_month", "chr", "factor", "12",
"13", " meal","chr","factor","5",
"14", " country","chr","factor","178",
"15", " market_segment","chr","factor","7",
"16", " distribution_channel","chr","factor","5",
"17", " is_repeated_guest","chr","factor","2",
"20", " reserved_room_type","chr","factor","10",
"21", " assigned_room_type","chr","factor","12",
"23", " deposit_type","chr","factor","3",
"24", " agent","chr","factor","334",
"27", " customer_type","chr","factor","4",
"31", " reservation_status","chr","factor","3",
"32", " reservation_status_date","chr","date",NA)
require(knitr)
kable(my_tbl2, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Variable | Name | From | To | Levels |
|---|---|---|---|---|
| 1 | hotel | chr | factor | 2 |
| 2 | is canceled | chr | factor | 2 |
| 5 | arrival_date_month | chr | factor | 12 |
| 13 | meal | chr | factor | 5 |
| 14 | country | chr | factor | 178 |
| 15 | market_segment | chr | factor | 7 |
| 16 | distribution_channel | chr | factor | 5 |
| 17 | is_repeated_guest | chr | factor | 2 |
| 20 | reserved_room_type | chr | factor | 10 |
| 21 | assigned_room_type | chr | factor | 12 |
| 23 | deposit_type | chr | factor | 3 |
| 24 | agent | chr | factor | 334 |
| 27 | customer_type | chr | factor | 4 |
| 31 | reservation_status | chr | factor | 3 |
| 32 | reservation_status_date | chr | date | NA |
# convert some variables
hotel <- hotel %>%
mutate(hotel = as.factor(hotel),
is_canceled = as.factor(is_canceled),
arrival_date_month = as.factor(arrival_date_month),
meal = as.factor(meal),
country = as.factor(country),
market_segment = as.factor(market_segment),
distribution_channel = as.factor(distribution_channel),
is_repeated_guest = as.factor(is_repeated_guest),
reserved_room_type = as.factor(reserved_room_type),
assigned_room_type = as.factor(assigned_room_type),
agent = as.factor(agent),
deposit_type = as.factor(deposit_type),
customer_type = as.factor(customer_type),
reservation_status = as.factor(reservation_status),
reservation_status_date = as.Date(reservation_status_date,"%m/%d/%Y"))
# define the levels
hotel$arrival_date_month <- factor(hotel$arrival_date_month,levels = c("January","February","March","April","May","June","July","August","September","October","November","December"))
levels(hotel$arrival_date_month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
the children variable has 4 missing value. for the entire dataset has 119390, the missing value is only a tiny portion in hotel. I impute the missing value in children variable with median.
the company variable has missing value “NULL”. the percentage of the NULL in this variable is 94.31%, we can simply removed the entire column from the dataset.
# check the missing value
hotel %>%
map_dbl(~ sum(is.na(.)))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
# impute the missing value in children variable with median
hotel <- hotel %>%
mutate(children
= replace(children,
is.na(children),
median(children, na.rm = TRUE)))
# check the missing value in 'company' variable
hotel %>%
summarise(missingValue
= sum(company == "NULL"),Percent = sum(company == "NULL")/length(company)*100)
## # A tibble: 1 x 2
## missingValue Percent
## <int> <dbl>
## 1 112593 94.3
hotel <- hotel %>% select(-company)
# combine column 8&9 into new column length_stay
hotel <- hotel %>%
mutate(length_stay=stays_in_weekend_nights + stays_in_week_nights)
# impute the missing value with median
temp <- ifelse(hotel$is_canceled == 0 & hotel$length_stay==0, median(hotel$length_stay), 0)
hotel$length_stay <- hotel$length_stay + temp
# if the booking was canceled, 35 rows of length_stay will be zero
sum(hotel$length_stay==0)
## [1] 35
hotel <- hotel %>% select(c(-stays_in_weekend_nights,stays_in_week_nights))
# create the new column kids, separate data into with-kid(1) and with-out kid(0)
hotel <- hotel %>%
mutate(kids = children + babies) %>%
mutate(kids = ifelse(kids > 0,1,0)) %>%
mutate(kids = as.factor(kids))
hotel <- hotel %>% select(c(-children,-babies))
hotel %>%
ggplot(aes(adr))+
geom_histogram(binwidth = 10)+
coord_cartesian(xlim = c(0,400))
hotel <- hotel %>%
mutate(adr = replace(adr, adr <= 0, median(adr, na.rm = TRUE))) %>% # impute the missing value with de median
mutate(total_cost = adr * length_stay) # create new column named total_cost = adr * length_stay
summary(hotel$total_cost)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 150 270 362 448 7590
To sum up, the cleaned hotel dataset 119390 observations and 31 variables. the response variable ‘is_canceled’ is a categorical factor.
The clean dataset of hotel contains 119390 observations and 32 variables.
# plot the pie of hotels
value <- hotel %>%
summarise(sum(hotel=="City Hotel")/length(hotel),sum(hotel=="Resort Hotel")/length(hotel))
df <- data.frame(value= c(66.45,33.55), group = c("City Hotel", "Resort Hotel"))
df %>%
ggplot(aes(x="", y=value, fill=group)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
geom_text(aes(label = paste0(value, "%")), position = position_stack(vjust = 0.5)) +
labs(x = NULL, y = NULL, fill = NULL, title = "Figure-1: Pie Chart of the hotel")
# which hotel canceled more
hotel %>%
ggplot(mapping = aes(x = is_canceled, fill = hotel)) +
geom_bar() +
ggtitle("Figure-2: cancelling rate")
# boxplot of the lead time
hotel %>%
ggplot(aes(arrival_date_month,lead_time))+
geom_boxplot(fill="steelblue",color="black")+
ggtitle("Figure-3.1: arrival month vs lead time")
hotel %>%
ggplot(aes(lead_time, is_canceled))+
geom_point(fill="steelblue",color="black")+
ggtitle("Figure-3.2: cancellation vs lead time")
hotel %>%
ggplot(aes(hotel,lead_time, fill = factor(is_canceled)))+
geom_boxplot(position = "dodge")+
labs(title = "Figure-4: Cancellation By Hotel Type")
# with-kid(1) and with-out kid(0)
hotel %>%
ggplot(aes(is_canceled,fill = kids))+
geom_bar(position = "fill")+
ggtitle("Figure-5: kids VS canceled")
hotel %>%
ggplot(aes(is_canceled, fill = is_repeated_guest)) +
geom_bar() +
ggtitle("Figure-6: repeated guests VS canceled")
# a repeated guest (1) or not (0)
hotel %>%
ggplot(aes(is_canceled, previous_cancellations)) +
geom_count(color = "darkred") +
scale_size(range=c(2, 15))+
ggtitle("Figure-7: Previous cancellations")
year <- as.factor(hotel$arrival_date_year)
hotel %>%
ggplot(aes(arrival_date_month, fill = year)) +
geom_bar() +
ggtitle("Figure-8: booking of the month")
# create a dataset that customer has checked in but already departed
hotel_sub <- hotel %>%
filter(reservation_status=="Check-Out")
hotel_sub %>%
ggplot(aes(x = total_cost, fill = hotel, color = hotel)) +
geom_histogram(aes(y = ..density..), position = "dodge", binwidth = 50) +
geom_density(alpha = 0.2) +
coord_cartesian(xlim = c(0,2500))+
ggtitle("Figure-9: Total Cost")
hotel_sub %>%
ggplot(aes(customer_type, fill = hotel)) +
geom_bar(position = "dodge") +
ggtitle("Figure-10 Customer type of hotels")
10. Figure-11, The customers short term trip choose City Hotel more than Resort Hotel,
hotel_sub %>%
ggplot(aes(length_stay)) +
geom_density(col = "red")+
facet_wrap(~ hotel)+
ggtitle("Figure-11 Length of stay VS type of hotel")
hotel %>%
ggplot(aes(arrival_date_month)) +
geom_bar(fill="steelblue",color="black") +
facet_wrap(~customer_type, scales = "free_y")+
ggtitle("Figure-12 monthly booking on customer type")
Randomly split the data to training(70%) and testing(30%) datasets:
set.seed(7025)
index <- sample(nrow(hotel),nrow(hotel)*0.7)
hotel.train <- hotel[index,]
hotel.test <- hotel[-index,]
# create a subset
hotel.train.sub <- subset(hotel.train,select
= c(hotel, is_canceled,
previous_cancellations, customer_type,total_cost,
total_of_special_requests, lead_time, deposit_type))
hotel.test.sub <- subset(hotel.test,select
= c(hotel, is_canceled,
previous_cancellations, customer_type,total_cost,
total_of_special_requests,lead_time,agent, deposit_type))
# fit a logistic regression
hotel.glm <- glm(is_canceled ~., family = "binomial", data = hotel.train.sub)
summary(hotel.glm)
##
## Call:
## glm(formula = is_canceled ~ ., family = "binomial", data = hotel.train.sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4547 -0.7962 -0.5774 0.2396 2.8880
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.898e+00 5.819e-02 -32.620 <2e-16 ***
## hotelResort Hotel -4.161e-01 1.876e-02 -22.188 <2e-16 ***
## previous_cancellations 1.171e+00 4.511e-02 25.966 <2e-16 ***
## customer_typeGroup -3.691e-01 1.784e-01 -2.069 0.0385 *
## customer_typeTransient 9.499e-01 5.586e-02 17.004 <2e-16 ***
## customer_typeTransient-Party 7.510e-03 5.853e-02 0.128 0.8979
## total_cost 6.735e-04 2.518e-05 26.742 <2e-16 ***
## total_of_special_requests -5.215e-01 1.202e-02 -43.404 <2e-16 ***
## lead_time 4.258e-03 9.557e-05 44.557 <2e-16 ***
## deposit_typeNon Refund 4.975e+00 1.210e-01 41.124 <2e-16 ***
## deposit_typeRefundable 2.373e-01 2.208e-01 1.075 0.2826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110301 on 83572 degrees of freedom
## Residual deviance: 81330 on 83562 degrees of freedom
## AIC: 81352
##
## Number of Fisher Scoring iterations: 7
#prediction
pred.glm <- predict(hotel.glm, type = "response")
(p <- round(mean(as.numeric(hotel.train.sub$is_canceled)-1),4))
## [1] 0.3718
# AUC
pred <- prediction(pred.glm, hotel.train.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)
round(unlist(slot(performance(pred, "auc"), "y.values")),4)
## [1] 0.7938
# get binary prediction
class.hotel.train<- (pred.glm>p)*1
table(hotel.train.sub$is_canceled, class.hotel.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 44312 8188
## 1 12078 18995
# misclassification rate (MR)
MR<- mean(hotel.train.sub$is_canceled!=class.hotel.train)
MR
## [1] 0.2424946
# split on customer_type
# deviance
hotel.train.sub %>%
split(.$customer_type) %>%
map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>%
map(summary) %>%
map("deviance")
# AIC
hotel.train.sub %>%
split(.$customer_type) %>%
map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>%
map(summary) %>%
map("aic")
# table of comparison
my_tbl_custmer <- tibble::tribble(
~Split,~deviance, ~AIC,
"Contract", "2923.33", "2927.33",
"Group", "268.07", "272.07",
"Transient", "83652.09 ", "83656.09",
"Transient_party","19894.95","19898.95")
require(knitr)
kable(my_tbl_custmer, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Split | deviance | AIC |
|---|---|---|
| Contract | 2923.33 | 2927.33 |
| Group | 268.07 | 272.07 |
| Transient | 83652.09 | 83656.09 |
| Transient_party | 19894.95 | 19898.95 |
# split on customer_type
# deviance
hotel.train.sub %>%
split(.$deposit_type) %>%
map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>%
map(summary) %>%
map("deviance")
# AIC
hotel.train.sub %>%
split(.$deposit_type) %>%
map(~ glm(is_canceled ~ hotel, family = "binomial", data = . )) %>%
map(summary) %>%
map("aic")
# table of comparison
my_tbl_deposit <- tibble::tribble(
~Split,~deviance, ~AIC,
"No Deposit", "87154.43", "87158.43",
"Non Refund", "696.94", "700.94",
"Refundable", "120.93 ", "124.93")
require(knitr)
kable(my_tbl_deposit, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Split | deviance | AIC |
|---|---|---|
| No Deposit | 87154.43 | 87158.43 |
| Non Refund | 696.94 | 700.94 |
| Refundable | 120.93 | 124.93 |
# create a subset
hotel.train.sub <- subset(hotel.train,select = c(-reservation_status))
hotel.test.sub <- subset(hotel.test,select = c(-reservation_status))
# fit a classification tree
hotel.rpart <- rpart(is_canceled ~ ., data = hotel.train.sub, method = "class")
hotel.rpart
## n= 83573
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 83573 31073 0 (0.628193316 0.371806684)
## 2) deposit_type=No Deposit,Refundable 73293 20863 0 (0.715347987 0.284652013)
## 4) agent=10,103,104,106,11,111,112,114,115,117,12,121,122,126,127,128,129,132,133,135,138,139,14,141,143,144,146,147,148,15,150,151,152,154,156,157,158,159,16,163,165,167,168,171,173,174,175,177,180,182,183,184,185,187,19,193,195,196,2,201,205,210,213,215,216,219,22,220,223,227,23,232,24,241,243,245,247,25,250,251,252,253,254,256,258,26,261,265,267,269,27,270,273,275,28,280,282,285,288,289,29,290,295,298,299,3,30,301,302,303,304,305,306,308,31,310,313,314,32,321,324,325,327,328,33,330,333,334,335,336,337,339,34,344,346,348,35,350,354,358,359,36,360,363,364,367,37,370,371,375,378,384,387,388,39,390,391,393,394,40,404,406,411,414,416,418,42,423,425,426,427,429,430,432,434,436,438,44,441,444,45,450,451,453,454,455,459,461,464,467,468,469,47,474,475,479,480,481,484,493,495,497,5,50,502,508,509,510,52,526,527,53,535,59,6,61,63,64,66,68,69,7,70,71,72,73,74,75,77,78,79,81,82,83,85,88,89,91,92,94,95,98,NULL 33453 4841 0 (0.855289511 0.144710489) *
## 5) agent=1,105,107,110,118,119,13,134,142,149,153,155,17,179,181,191,192,20,208,21,214,229,234,240,242,244,248,249,262,276,281,291,296,307,315,326,332,341,355,368,38,385,4,403,405,410,420,440,531,56,57,58,67,8,86,87,9,93,96,99 39840 16022 0 (0.597841365 0.402158635)
## 10) total_of_special_requests>=0.5 22753 6411 0 (0.718234958 0.281765042) *
## 11) total_of_special_requests< 0.5 17087 7476 1 (0.437525604 0.562474396)
## 22) lead_time< 8.5 2083 396 0 (0.809889582 0.190110418) *
## 23) lead_time>=8.5 15004 5789 1 (0.385830445 0.614169555)
## 46) customer_type=Transient-Party 4541 1938 0 (0.573221757 0.426778243)
## 92) country=ARG,ARM,AUS,AUT,BDI,BEL,BGR,BRA,CHE,CHL,CMR,CN,COL,CZE,DEU,DNK,DZA,EGY,ESP,FIN,FRA,GBR,GIB,GNB,GUY,HRV,HUN,IRN,IRQ,ISL,ISR,JPN,LBN,LBY,LUX,MAR,MOZ,MYS,NLD,NOR,NZL,PAN,PER,POL,ROU,RUS,SEN,SVK,SWE,TUR,UKR,USA,VEN,ZAF 1654 112 0 (0.932285369 0.067714631) *
## 93) country=CHN,GGY,GRC,IRL,ITA,NULL,OMN,PRT,SVN 2887 1061 1 (0.367509525 0.632490475) *
## 47) customer_type=Contract,Group,Transient 10463 3186 1 (0.304501577 0.695498423)
## 94) required_car_parking_spaces>=0.5 392 0 0 (1.000000000 0.000000000) *
## 95) required_car_parking_spaces< 0.5 10071 2794 1 (0.277430245 0.722569755)
## 190) reservation_status_date>=17304.5 971 321 0 (0.669412976 0.330587024) *
## 191) reservation_status_date< 17304.5 9100 2144 1 (0.235604396 0.764395604) *
## 3) deposit_type=Non Refund 10280 70 1 (0.006809339 0.993190661) *
prp(hotel.rpart)
# variable used in the tree construction
printcp(hotel.rpart)
##
## Classification tree:
## rpart(formula = is_canceled ~ ., data = hotel.train.sub, method = "class")
##
## Variables actually used in tree construction:
## [1] agent country
## [3] customer_type deposit_type
## [5] lead_time required_car_parking_spaces
## [7] reservation_status_date total_of_special_requests
##
## Root node error: 31073/83573 = 0.37181
##
## n= 83573
##
## CP nsplit rel error xerror xstd
## 1 0.326328 0 1.00000 1.00000 0.0044963
## 2 0.034355 1 0.67367 0.67367 0.0040311
## 3 0.023010 4 0.56342 0.56538 0.0037908
## 4 0.012615 6 0.51739 0.52042 0.0036753
## 5 0.010588 7 0.50478 0.50758 0.0036404
## 6 0.010000 8 0.49419 0.50101 0.0036222
# prediction
pred.rpart <- predict(hotel.rpart, hotel.test.sub, type="prob")
pred.train <- predict(hotel.rpart, type="class")
# AUC
pred <- prediction(pred.rpart[,2], hotel.test.sub$is_canceled)
perf <- performance(pred, "tpr","fpr")
plot(perf,colorize = TRUE)
round(unlist(slot(performance(pred, "auc"), "y.values")),4)
## [1] 0.8244
# misclassification rate (MR)
table(hotel.train.sub$is_canceled, pred.train , dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 49225 3275
## 1 12081 18992
MR<- mean(hotel.train.sub$is_canceled!=pred.train )
MR
## [1] 0.1837436
hotel.nnet <- nnet(is_canceled~., data=hotel.train, size=1, maxit=500,MaxNWts=2000)
## # weights: 585
## initial value 60240.385636
## final value 55150.677772
## converged
plotnet(hotel.nnet,main="Neural Network for hotel data")
#in-sample
prob.nnet.train = predict(hotel.nnet, hotel.train)
pred.nnet.train = as.numeric(prob.nnet.train > 0.5)
table(hotel.train$is_canceled, pred.nnet.train, dnn = c("Observed","Predicted"))
## Predicted
## Observed 0
## 0 52500
## 1 31073
mean(ifelse(hotel.train$is_canceled!= pred.nnet.train, 1, 0))
## [1] 0.3718067
#out of -sample
prob.nnet.test = predict(hotel.nnet, hotel.test)
pred.nnet.test = as.numeric(prob.nnet.test >0.5)
table(hotel.test$is_canceled, pred.nnet.test, dnn = c("Observed","Predicted"))
## Predicted
## Observed 0
## 0 22666
## 1 13151
mean(ifelse(hotel.test$is_canceled!= pred.nnet.test, 1, 0))
## [1] 0.367172
From EDA
In the dataset the percentage of City Hotel is 66.45 %, we can see the booking cancelling rate of City Hotel is 74.85%. we can tell the cancelling is a little high for City Hotel.
lead time is the most important factor in cancellation, the longer the lead time the higher the cancellation rate. the shorter lead time during the winter than in summer. no difference between City Hotel and Resort Hotel.
If guests with kids or not have no affect on cancellation.
repearted guests have lower cancelllation rate. the more time the guests cancel the booking before the more likely to cancel the booking for this time.
The longer the length_stay the more likely customers will choose Resort Hotel.
From modeling
logistic regression. pick up the variable of hotel, is_canceled,previous_cancellations, customer_type,total_cost, total_of_special_requests, lead time, deposit type.
Classification tree. the variables used in the tree construction is agent, country, customer type, deposit type, lead time, required car parking spaces, reservation status date, total of special requests.
Neural network has only one hidden layer. the output will be complicated, but we can still workout the accuracyl.
# table of comparison
my_tbl3 <- tibble::tribble(
~Model,~MR, ~AUC,
"Logistic regression", "0.2425", "0.7938",
"Classification tree", "0.1837", "0.8244",
"Neural Networks", "0.3672 ", "NA")
require(knitr)
kable(my_tbl3, digits = 3, row.names = FALSE, align = "c",
caption = NULL)
| Model | MR | AUC |
|---|---|---|
| Logistic regression | 0.2425 | 0.7938 |
| Classification tree | 0.1837 | 0.8244 |
| Neural Networks | 0.3672 | NA |