Hospitality generates revenue for local economies directly when tourists spend money in hotels, restaurants and entertainment venues. The hospitality industry is growing, with more and more people spending money on vacations and leisure activities. Many people stay in a hotel only during a holiday season or for a special event, so the demand for rooms is not evenly distributed across the year. The hotel industry is volatile, and bookings depend on a variety of factors such as the type of hotel, seasonality, day of the week and many more. This makes analyzing the patterns in past data all the more important for helping hotels plan better. Using historical data, hotels can run targeted campaigns to boost business.
The data consists of 119,390 booking transactions from two hotels: an anonymous city hotel in Lisbon and a resort hotel in the Algarve. The dataset comprises bookings due to arrive between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. There is much to explore in this data, but we will focus only on demand forecasting.
We will tackle this problem statement in three segments:
The project aims to gain interesting insights into customers’ behavior when booking a hotel. To maximize revenue, hotel management often employs a pricing strategy, such as raising the room rate when demand is high and offering a promotion when demand is low. The ability to forecast future demand accurately is therefore a vital part of the pricing scheme. Demand may differ between customer segments, and forecasting becomes harder because different segments may require different models. These insights can guide hotels in adjusting their customer strategies and preparing for the unknown.
rpart, rpart.plot and ROCR : rpart and rpart.plot are used for building classification and regression trees and for visualizing the tree structure; ROCR is used for evaluating model performance (ROC curves and AUC).
tidyverse : This package attaches 8 core packages (listed in the startup message below), of which the following 3 are most important for this project: dplyr: used for data manipulation; tidyr: used for reshaping and tidying data; ggplot2: used for creating powerful visualizations.
forecast, tseries and sarima : These packages are used to model time-series data, including the seasonal component of the series (if any).
#Importing the packages required
library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(tidyverse) #used for data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tseries) #used for time-series forecasting
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(sarima) #used for time-series forecasting including seasonal components
## Loading required package: stats4
library(caret) #used for createDataPartition
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR) #used for prediction() and performance() (ROC curve and AUC)
## Warning: package 'ROCR' was built under R version 4.1.2
The data set contains the following variables:
The source data can be found here: Hotel Data
Data Structure
This data set contains 119,390 observations and 32 variables: 18 numeric, 13 character and 1 date. It includes information such as the hotel type, lead time, number of children, booking agent, and much more.
#Importing the dataset
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
Missing values
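Checking the missing values for each variable shows only 4 missing values, all in “children”; no other variable contains NA. Note that “agent” and “company” encode missing entries as the string "NULL" (see the structure output above), which is.na() does not count.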
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
Taking care of the missing value
hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)
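The NA count above cannot see placeholder strings such as "NULL". Below is a minimal sketch of how such placeholders could be recoded before counting and imputing. It runs on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not rows from the hotel data.

```r
# Toy data frame (hypothetical values) with both kinds of missingness:
# a real NA in a numeric column and a "NULL" placeholder in a character column
toy <- data.frame(
  children = c(0, 0, NA, 2, 0),
  agent    = c("NULL", "304", "240", "NULL", "9"),
  stringsAsFactors = FALSE
)

# Recode the "NULL" placeholder as a real missing value
toy$agent[toy$agent == "NULL"] <- NA

# Missing values per column now include the recoded placeholders
na_counts <- colSums(is.na(toy))
print(na_counts)

# Median imputation for the numeric column, as done for hotels$children
toy$children[is.na(toy$children)] <- median(toy$children, na.rm = TRUE)
print(toy$children)
```

Whether the recoded columns should then be imputed, dropped, or kept as an explicit "missing" category depends on the analysis; this sketch only makes the missingness visible.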
Share of repeated guests
hotels %>%
group_by(is_repeated_guest) %>%
count(is_repeated_guest) %>%
summarise(percent = round(n/nrow(hotels)*100,2), total = n)
## # A tibble: 2 x 3
## is_repeated_guest percent total
## <dbl> <dbl> <int>
## 1 0 96.8 115580
## 2 1 3.19 3810
Lead time difference between the two groups of guests
guestbehabar <- function(behavior){
  label <- deparse(substitute(behavior)) #text of the argument, e.g. "hotels$hotel"
  hotels %>%
    ggplot(aes(factor(is_repeated_guest), fill = behavior)) + #factor() keeps the x axis discrete
    geom_bar(position = "fill") +
    labs(title = "Behavior feature by guest type",
         subtitle = label, #a single string instead of the whole column
         x = "Guest type (1 for repeated guests)",
         y = "Percentage of group") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))
}
behaviourboxplot <- function(behavior){
  label <- deparse(substitute(behavior)) #text of the argument, e.g. "hotels$lead_time"
  hotels %>%
    ggplot(aes(x = factor(is_repeated_guest), y = behavior)) + #factor() gives one box per guest type
    geom_boxplot() +
    geom_jitter(width = .15, alpha = .2) +
    labs(title = "Behavior feature by guest type",
         subtitle = label, #a single string instead of the whole column
         x = "Guest type (1 for repeated guests)",
         y = label) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(),
          axis.line = element_line(colour = "black"))
}
behaviourboxplot(hotels$lead_time)
Type of hotel booked by repeated guests
guestbehabar(hotels$hotel)
Both groups book the City Hotel more often than the Resort Hotel. However, 46.7% of repeated guests book the Resort Hotel, compared with only 33.1% of non-repeated guests.
hotels %>%
group_by(is_repeated_guest) %>%
filter(hotel == "Resort Hotel") %>%
count() -> filhot
hotels %>%
group_by(is_repeated_guest) %>%
count() -> total
as.data.frame(filhot/total)
## is_repeated_guest n
## 1 NaN 0.3312165
## 2 1 0.4666667
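Dividing one count table by another also divides the grouping column (0/0 gives NaN, 1/1 gives 1), which is why is_repeated_guest is not readable in the output above. A hedged alternative sketch: the mean of a logical condition within each group yields the same proportion directly. It uses base R on a small made-up data frame; `toy` and its values are assumptions for illustration.

```r
# Toy data (hypothetical): four non-repeated and two repeated guests
toy <- data.frame(
  is_repeated_guest = c(0, 0, 0, 0, 1, 1),
  hotel = c("Resort Hotel", "City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "City Hotel"),
  stringsAsFactors = FALSE
)

# Share of Resort Hotel bookings within each guest group:
# the mean of TRUE/FALSE values is the proportion of TRUE
resort_share <- tapply(toy$hotel == "Resort Hotel", toy$is_repeated_guest, mean)
print(resort_share)

# Equivalent view: a contingency table normalised so each row sums to 1
print(prop.table(table(toy$is_repeated_guest, toy$hotel), margin = 1))
```

With dplyr, the same idea would be group_by(is_repeated_guest) followed by summarise(share = mean(hotel == "Resort Hotel")), which keeps the group labels intact.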
Repeated guests cancelling booking
hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)
guestbehabar(hotels$is_canceled)
hotels %>%
group_by(is_repeated_guest) %>%
filter(is_canceled == "1") %>%
count() -> filcan
hotels %>%
group_by(is_repeated_guest) %>%
count() -> total
as.data.frame(filcan/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.3778508
## 2 NA 0.1448819
Meal choice of repeated guests
guestbehabar(hotels$meal)
table(hotels$meal, hotels$is_repeated_guest)
##
## 0 1
## BB 88837 3473
## FB 789 9
## HB 14277 186
## SC 10540 110
## Undefined 1137 32
hotels %>%
group_by(is_repeated_guest) %>%
filter(meal == "BB") %>%
count() -> filmea
as.data.frame(filmea/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7686191
## 2 NA 0.9115486
Percentage of repeated guests making deposit
guestbehabar(hotels$deposit_type)
About 98.2% of repeated guests make no deposit, compared with 87.3% of non-repeated guests. A possible reason is that hotels regard returning guests as reliable and therefore do not require a deposit from them.
hotels %>%
group_by(is_repeated_guest) %>%
filter(deposit_type == "No Deposit") %>%
count() -> fildep
as.data.frame(fildep/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.8729798
## 2 NA 0.9821522
Customer type of repeated guests
guestbehabar(hotels$customer_type)
80.7% of repeated guests make bookings that are not part of a group or contract and are not associated with another transient booking (customer type “Transient”), compared with 74.9% of non-repeated guests. 4.2% of repeated guests make bookings that are associated with a group, which is also higher than the share among non-repeated guests.
custype <- function(type){
hotels %>%
group_by(is_repeated_guest) %>%
filter(customer_type == type) %>%
count()
}
custype("Transient") -> filtra
custype("Group") -> filgro
as.data.frame(filtra/total)
## Warning in Ops.factor(left, right): '/' not meaningful for factors
## is_repeated_guest n
## 1 NA 0.7487455
## 2 NA 0.8065617
Splitting the dataset into training and test sets
The data was randomly split into training (80%) and testing (20%) sets, stratified on is_canceled.
i <- createDataPartition(hotels$is_canceled, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- hotels[-i,]
# select 80% of data to train the models
trainset <- hotels[i,]
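createDataPartition() draws a random sample, so the split and every number derived from it change between runs unless a seed is set first. Below is a minimal sketch of a reproducible stratified 80/20 split, written in base R on simulated labels so that it runs without caret. The seed value, the `y` vector and the helper name `stratified_split` are illustrative assumptions.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(2021)  # arbitrary seed; fixes the random draw so the split can be repeated
y <- factor(rep(c(0, 1), times = c(630, 370)))  # simulated class labels

i <- stratified_split(y, p = 0.8)
train_y <- y[i]
test_y  <- y[-i]

# Class proportions are preserved in both parts
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

In the report's own code, the equivalent step would be a set.seed() call placed immediately before createDataPartition().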
Training a Logistic Regression Model
We fit the full model with a selection of explanatory variables, after excluding some less interesting variables such as date, month, year, country, etc.
fullmodel <- glm(is_canceled ~ hotel + lead_time + adr
+ total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + days_in_waiting_list + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Next, we search for the best model using backward elimination.
# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')
## Start: AIC=86045.78
## is_canceled ~ hotel + lead_time + adr + total_of_special_requests +
## distribution_channel + is_repeated_guest + previous_cancellations +
## booking_changes + deposit_type + days_in_waiting_list + required_car_parking_spaces +
## customer_type
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - hotel 1 86009 86045
## <none> 86008 86046
## - days_in_waiting_list 1 86020 86056
## - is_repeated_guest 1 86369 86405
## - booking_changes 1 86589 86625
## - distribution_channel 4 87072 87102
## - adr 1 87301 87337
## - lead_time 1 87769 87805
## - previous_cancellations 1 87954 87990
## - customer_type 3 88124 88156
## - total_of_special_requests 1 88747 88783
## - required_car_parking_spaces 1 89483 89519
## - deposit_type 2 94884 94918
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=86045.32
## is_canceled ~ lead_time + adr + total_of_special_requests + distribution_channel +
## is_repeated_guest + previous_cancellations + booking_changes +
## deposit_type + days_in_waiting_list + required_car_parking_spaces +
## customer_type
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 86009 86045
## - days_in_waiting_list 1 86022 86056
## - is_repeated_guest 1 86370 86404
## - booking_changes 1 86589 86623
## - distribution_channel 4 87081 87109
## - adr 1 87308 87342
## - lead_time 1 87786 87820
## - previous_cancellations 1 87956 87990
## - customer_type 3 88130 88160
## - total_of_special_requests 1 88747 88781
## - required_car_parking_spaces 1 89557 89591
## - deposit_type 2 94950 94982
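#Backward elimination drops only hotel; days_in_waiting_list is additionally removed by hand, which raises AIC slightly (86045 to 86056). A lower AIC or BIC value indicates a better fit.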
Finalmodel <- glm(is_canceled ~ lead_time + adr + total_of_special_requests + distribution_channel +
is_repeated_guest + previous_cancellations + booking_changes +
deposit_type + required_car_parking_spaces +
customer_type, family = "binomial" , data=trainset)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(Finalmodel)
##
## Call:
## glm(formula = is_canceled ~ lead_time + adr + total_of_special_requests +
## distribution_channel + is_repeated_guest + previous_cancellations +
## booking_changes + deposit_type + required_car_parking_spaces +
## customer_type, family = "binomial", data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.6016 -0.7944 -0.4589 0.2252 3.6189
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.823e+00 7.110e-02 -39.708 < 2e-16 ***
## lead_time 3.902e-03 9.285e-05 42.021 < 2e-16 ***
## adr 6.460e-03 1.806e-04 35.761 < 2e-16 ***
## total_of_special_requests -5.879e-01 1.191e-02 -49.358 < 2e-16 ***
## distribution_channelDirect -2.735e-01 5.365e-02 -5.099 3.42e-07 ***
## distribution_channelGDS -4.749e-01 2.248e-01 -2.113 0.0346 *
## distribution_channelTA/TO 5.895e-01 4.706e-02 12.526 < 2e-16 ***
## distribution_channelUndefined 1.737e+01 3.319e+02 0.052 0.9582
## is_repeated_guest1 -1.383e+00 8.096e-02 -17.084 < 2e-16 ***
## previous_cancellations 1.906e+00 5.263e-02 36.208 < 2e-16 ***
## booking_changes -3.709e-01 1.688e-02 -21.977 < 2e-16 ***
## deposit_typeNon Refund 4.817e+00 1.144e-01 42.115 < 2e-16 ***
## deposit_typeRefundable 5.075e-01 2.308e-01 2.199 0.0279 *
## required_car_parking_spaces -2.857e+01 5.736e+01 -0.498 0.6184
## customer_typeGroup 5.388e-02 1.643e-01 0.328 0.7429
## customer_typeTransient 1.102e+00 5.307e-02 20.774 < 2e-16 ***
## customer_typeTransient-Party 1.347e-01 5.512e-02 2.444 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 125920 on 95512 degrees of freedom
## Residual deviance: 86022 on 95496 degrees of freedom
## AIC: 86056
##
## Number of Fisher Scoring iterations: 11
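Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. Below is a short worked example using a few estimates copied from the summary above; the vector name `coefs` is an assumption for illustration.

```r
# Estimates copied from the coefficient table above (log-odds scale)
coefs <- c(
  is_repeated_guest1        = -1.383,
  total_of_special_requests = -0.5879,
  previous_cancellations    =  1.906,
  lead_time                 =  0.003902
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Multiplicative change in the odds for 30 extra days of lead time,
# with all other variables held fixed
lead_30_days <- exp(30 * coefs[["lead_time"]])
print(round(lead_30_days, 3))
```

An odds ratio below 1 lowers the odds of cancellation: for a repeated guest the odds are roughly a quarter of those of an otherwise comparable non-repeated guest. The very large estimates with very large standard errors (distribution_channelUndefined, required_car_parking_spaces), together with the "fitted probabilities numerically 0 or 1" warning, suggest (quasi-)separation, so those two estimates are best not interpreted as odds ratios.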
Finalmodel$deviance
## [1] 86022.41
AIC(Finalmodel)
## [1] 86056.41
BIC(Finalmodel)
## [1] 86217.35
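For a binary 0/1 response the residual deviance equals minus twice the log-likelihood, so the three numbers above are linked by AIC = deviance + 2k and BIC = deviance + k log(n), where k is the number of estimated coefficients and n the number of training rows. A quick check with the reported values (k = 17 coefficient rows in the summary, n = 95,512 null degrees of freedom + 1), followed by the same identity on a small simulated model; the simulated data and all variable names are assumptions for illustration.

```r
# Values copied from the output above
dev <- 86022.41   # residual deviance of Finalmodel
k   <- 17         # estimated coefficients, including the intercept
n   <- 95513      # training rows

aic_manual <- dev + 2 * k
bic_manual <- dev + k * log(n)
print(c(AIC = aic_manual, BIC = bic_manual))

# Same identity on a small simulated logistic regression
set.seed(1)
x   <- rnorm(200)
y   <- rbinom(200, size = 1, prob = plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial)

aic_check <- all.equal(AIC(fit), deviance(fit) + 2 * length(coef(fit)))
bic_check <- all.equal(BIC(fit), deviance(fit) + length(coef(fit)) * log(nobs(fit)))
print(c(aic_check, bic_check))
```

Because BIC charges log(n), about 11.5 here, per coefficient instead of 2, it favors smaller models than AIC on a data set of this size.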
In-Sample Prediction and ROC curve
In-sample predictions and the confusion matrix are calculated with a cut-off probability of 0.5.
#Setting cut-off probability=0.5
table(predict(Finalmodel,type="response") > 0.5)
##
## FALSE TRUE
## 73804 21709
#confusion matrix
pred_prob=predict(Finalmodel, newdata=trainset, type="response") #predict() takes newdata; a data= argument is silently ignored
pred_value=1*(pred_prob>0.5)
actual_value <-trainset$is_canceled
confusion_matrix <- table(actual_value, pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## 0 57017 3116
## 1 16787 18593
#misclassification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate #0.21
## [1] 0.20838
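The error rate alone hides how the mistakes are distributed between the two classes. Below is a short sketch that derives the usual per-class metrics from the confusion matrix printed above; the object name `cm` and the metric variable names are assumptions for illustration.

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class; matrix() fills column by column)
cm <- matrix(c(57017, 16787, 3116, 18593), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
error_rate  <- 1 - accuracy
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # share of cancellations that are caught
specificity <- cm["0", "0"] / sum(cm["0", ])   # share of kept bookings that are recognised
precision   <- cm["1", "1"] / sum(cm[, "1"])   # share of predicted cancellations that are real

print(round(c(accuracy = accuracy, error_rate = error_rate,
              sensitivity = sensitivity, specificity = specificity,
              precision = precision), 4))
```

At the 0.5 cut-off the model misses close to half of the actual cancellations (sensitivity about 0.53) while rarely flagging a booking that is kept (specificity about 0.95). A lower cut-off would trade some specificity for higher sensitivity.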
#In-sample prediction
pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(trainset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8336302
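The AUC reported by ROCR can be read as the probability that a randomly chosen canceled booking receives a higher predicted probability than a randomly chosen non-canceled one. Below is a minimal base R sketch of that rank-based (Mann-Whitney) calculation on made-up scores; the helper name `auc_rank` and the toy vectors are assumptions, and this is not how ROCR computes the value internally.

```r
# Hypothetical helper: AUC from ranks; tied scores get average ranks,
# so a tie between a positive and a negative counts as one half
auc_rank <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)   # number of positives
  n0 <- sum(labels == 0)   # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)
auc <- auc_rank(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes, independently of any cut-off.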
Out-of-Sample Prediction and ROC curve
Similarly, the model was evaluated out of sample, and the Receiver Operating Characteristic (ROC) curve for the test set was plotted.
#out-of-sample prediction
pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8306352
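The logistic regression model reaches an AUC of about 0.83 on the test set (0.8306), close to the in-sample value (0.8336), so there is little sign of overfitting. Note that this is an AUC, not an accuracy: at the 0.5 cut-off the in-sample accuracy is about 79% (error rate 0.208).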
A repeated guest is much less likely to cancel a booking than a non-repeated guest (14.5% versus 37.8%), which suggests that repeated guests are more loyal.
81.1% of repeated guests make no changes to their booking.
98.2% of repeated guests make no deposit, a higher share than among non-repeated guests.
80.7% of repeated guests make bookings that are not part of a group or contract and are not associated with another transient booking.
Only 3.19% of the guests are repeated guests.
Repeated guests tend to book about one month ahead of their visit, a much shorter lead time than that of non-repeated guests. This suggests that repeated guests do not feel the need to book far in advance, perhaps because they already know which hotel to book when visiting that place.