Sources used:
https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md
https://www.kaggle.com/jessemostipak/hotel-booking-demand
https://www.sciencedirect.com/science/article/pii/S2352340918315191
hotel - Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled - Value indicating if the booking was canceled (1) or not (0)
lead_time- Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year - Year of arrival date
arrival_date_month - Month of arrival date
arrival_date_week_number - Week number of year for arrival date
arrival_date_day_of_month - Day of arrival date
stays_in_weekend_nights - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights -Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults - Number of adults
children - Number of children
babies - Number of babies
meal - Type of meal booked. Categories are presented in standard hospitality meal packages:
-Undefined/SC – no meal package; -BB – Bed & Breakfast; -HB – Half board (breakfast and one other meal – usually dinner); -FB – Full board (breakfast, lunch and dinner)
country Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment - Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel - Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest - Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations - Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled - Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type - Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type - Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes - Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
-No Deposit – no deposit was made; -Non Refund – a deposit was made in the value of the total stay cost; -Refundable – a deposit was made with a value under the total cost of stay.
agent - ID of the travel agency that made the booking
company - ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list - Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type - Type of booking, assuming one of four categories:
-Contract - when the booking has an allotment or other type of contract associated to it; -Group – when the booking is associated to a group; -Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; -Transient-party – when the booking is transient, but is associated to at least other transient booking
adr - Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces - Number of car parking spaces required by the customer
total_of_special_requests - Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status - Reservation last status, assuming one of three categories:
-Canceled – booking was canceled by the customer; -Check-Out – customer has checked in but already departed; -No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date - Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(broom)
library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggThemeAssist)
library(corrplot)
## corrplot 0.92 loaded
library(survival)
library(rpart)
#library(randomForestSRC)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(varImp)
## Loading required package: measures
##
## Attaching package: 'measures'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'varImp'
## The following object is masked from 'package:caret':
##
## varImp
library(tidytext)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
library(ggplot2)
library(car)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following objects are masked from 'package:BSDA':
##
## Vocab, Wool
##
## Attaching package: 'car'
## The following object is masked from 'package:modeltools':
##
## Predict
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(DescTools)
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
##
## Recode
## The following object is masked from 'package:modeltools':
##
## ParseFormula
## The following objects are masked from 'package:measures':
##
## AUC, MAE, MAPE, MSE, NPV, RMSE, SpearmanRho
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
library(ROCR)
library(lmtest)
library(rattle)
## Loading required package: bitops
##
## Attaching package: 'bitops'
## The following object is masked from 'package:DescTools':
##
## %^%
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
library(purrr)
hotel_data <- read.csv('hotel_bookings.csv')
str(hotel_data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
summary(hotel_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
anyNA(hotel_data)
## [1] TRUE
hotel_data <- na.omit(hotel_data)
Here we’ve taken out several variables based on incompleteness of data or because the variables are assumed to be of little value in predicting booking cancellations. The data frame is saved as a new data frame from the original.
hotel_data <- subset(hotel_data, select = -c(company,agent,country,arrival_date_year,reservation_status_date,reservation_status,reserved_room_type,assigned_room_type))
hotel_data$hotel <- as.factor(hotel_data$hotel)
hotel_data$meal <- as.factor(hotel_data$meal)
hotel_data$arrival_date_month <- as.factor(hotel_data$arrival_date_month)
hotel_data$market_segment <- as.factor(hotel_data$market_segment)
hotel_data$distribution_channel <- as.factor(hotel_data$distribution_channel)
hotel_data$deposit_type <- as.factor(hotel_data$deposit_type)
hotel_data$customer_type <- as.factor(hotel_data$customer_type)
hotel_data$is_canceled <- as.factor(hotel_data$is_canceled)
str(hotel_data)
## 'data.frame': 119386 obs. of 24 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ market_segment : Factor w/ 7 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
We see below, that bookings are not canceled more often than they are canceled, for both City Hotel and Resort Hotel.
A ‘0’ represents a booking not canceled and a ‘1’ represents a booking that was canceled.
table(hotel_data$is_canceled, hotel_data$hotel)
##
## City Hotel Resort Hotel
## 0 46228 28938
## 1 33098 11122
Here we see a potential issue with the Non-Refund deposit type.
table(hotel_data$is_canceled, hotel_data$deposit_type)
##
## No Deposit Non Refund Refundable
## 0 74947 93 126
## 1 29690 14494 36
table(hotel_data$deposit_type)
##
## No Deposit Non Refund Refundable
## 104637 14587 162
https://www.sciencedirect.com/science/article/pii/S2352340918315191
ggplot(data = hotel_data,
aes(x = deposit_type,
group = is_canceled,
fill = as.factor(is_canceled)))+
geom_bar(stat = "count", width = 0.5, position = "dodge")+
facet_grid(. ~ hotel)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
A potential reason for the misinformation of Non-Refund deposit type, is described in the link below, and a comment from Kaggle where the dataset was retrieved from:
https://www.kaggle.com/jessemostipak/hotel-booking-demand/discussion/131787
The author discussed this in his paper, Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior. Here’s the reason, pasted directly from the journal. “As an example, through analysis of the “Nonrefundable” (DepositType) canceled bookings in some Asiatic countries (Country) and from certain distribution channels (DistributionChannel and Agent), it is possible to understand why so many “Nonrefundable” bookings are canceled. These bookings are usually made through OTA using false or invalid credit card details. These bookings are issued as support for requests for visas to enter the country (a hotel booking is mandatory for applying for a Portuguese entry visa). After failing to charge the customer’s credit card, the hotel identifies these bookings as “fake” and contacts the customer; however, during the time required to verify these bookings, they contribute negatively to demand forecast and demand-management decisions." -(commented by user Viriyo Satra Mangala).
Regardless of the reason, the deposit type looks unrealistic. Since the deposit type on ‘Non Refund’ is unreliable, we will take out all records that contain ‘Non Refund’, and will be left with ‘No Deposit’ and ‘Refundable’ deposit types. It may be best to not use Deposit Type at all in building of our stats model.
Take out records with non-refund types, drops from 119836 observations to
hotel_data <- filter(hotel_data, deposit_type != "Non Refund")
Total Bookings cancellation Status graphed below.
ggplot(data = hotel_data,
aes(
x = hotel,
y = prop.table(stat(count)),
fill = factor(is_canceled),
label = scales::percent(prop.table(stat(count)))
)) +
geom_bar(position = position_dodge()) +
geom_text(
stat = "count",
position = position_dodge(.9),
vjust = -0.5,
size = 3
) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Total Population Cancellation Status by Hotel Type",
x = "Hotel Type",
y = "Count") +
theme_dark() +
scale_fill_discrete(
name = "Booking Status",
breaks = c("0", "1"),
labels = c("Not Cancelled", "Cancelled")
)
COunt of marketing segments grouped by Hotel and Canceled Status
ggplot(data = hotel_data,
aes(x = market_segment,
group = is_canceled,
fill = as.factor(is_canceled)))+
geom_bar(stat = "count", width = 0.5, position = "dodge")+
facet_grid(. ~ hotel)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
Below is the total count of bookings for City Hotel and Resort Hotel
hotel_data%>%count(hotel)%>%
ggplot(aes(hotel,n)) +
geom_bar(stat = 'identity', fill = 'blue', alpha = 0.4, col = "black") +
xlab('Hotels ') + ylab('Frecuency')
hotel_data$arrival_date_month <-
factor(hotel_data$arrival_date_month, levels = month.name)
# Visualize Hotel traffic on Monthly basis
ggplot(data = hotel_data, aes(x = arrival_date_month)) +
geom_bar(fill = "steelblue") +
facet_wrap(~hotel) +
geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
coord_flip() + labs(title = "Month Wise Booking Request",
x = "Month",
y = "Count") +
theme_classic()
ADR median per hotel
hotel_data %>%
group_by(hotel) %>%
summarise_at(vars(adr), list(name = median))
## # A tibble: 2 x 2
## hotel name
## <fct> <dbl>
## 1 City Hotel 102.
## 2 Resort Hotel 76
stays week nights average per hotel
hotel_data %>%
group_by(hotel) %>%
summarise_at(vars(stays_in_week_nights), list(name = mean))
## # A tibble: 2 x 2
## hotel name
## <fct> <dbl>
## 1 City Hotel 2.22
## 2 Resort Hotel 3.14
weekend nights average
hotel_data %>%
group_by(hotel) %>%
summarise_at(vars(stays_in_weekend_nights), list(name = mean))
## # A tibble: 2 x 2
## hotel name
## <fct> <dbl>
## 1 City Hotel 0.840
## 2 Resort Hotel 1.19
hotel_data %>%
group_by(hotel, deposit_type, is_canceled) %>%
summarise_at(vars(adr), list(name = median))
## # A tibble: 8 x 4
## # Groups: hotel, deposit_type [4]
## hotel deposit_type is_canceled name
## <fct> <fct> <fct> <dbl>
## 1 City Hotel No Deposit 0 99.9
## 2 City Hotel No Deposit 1 106.
## 3 City Hotel Refundable 0 81
## 4 City Hotel Refundable 1 135.
## 5 Resort Hotel No Deposit 0 72.4
## 6 Resort Hotel No Deposit 1 91.6
## 7 Resort Hotel Refundable 0 66
## 8 Resort Hotel Refundable 1 63
x = deposit_type, y = adr, group = is_canceled,
ggplot(data = hotel_data,
aes(x = is_canceled,
y = adr,
fill = as.factor(is_canceled)))+
geom_bar(stat = "summary", width = 0.5, position = "dodge", fun = "median")+
facet_grid(. ~ hotel)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+labs(y = "adr (Median Value)", fill = "is_canceled")
hotel_data$arrival_date_month = factor(hotel_data$arrival_date_month, levels = month.name)
ggplot(data = hotel_data,
aes(x = arrival_date_month,
y = adr))+
geom_bar(stat = "summary", width = 0.5, position = "dodge", fill = "#FF6666", fun = "median")+
facet_grid(. ~ hotel)+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+
scale_x_discrete(limits = month.name)
Set a temporary training dataset below to test the Logistic Model, and this will help the time it takes to run models. Once the most important variables are decided then use a full training and testing set. This could be a form of ‘Undersampling’, because It reduces the number of observations from majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles. Check the balance of the data afterwords.
set.seed(123)
hotel_temporary <- sample(1:nrow(hotel_data), size = 0.10 * nrow(hotel_data))
hotel_small <- hotel_data[hotel_temporary,]
take out deposit type as it seems unrealiable
set.seed(123)
m1 <- glm(is_canceled ~ .-deposit_type, data = hotel_small, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m1)
##
## Call:
## glm(formula = is_canceled ~ . - deposit_type, family = binomial,
## data = hotel_small)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.9380 -0.7880 -0.5288 0.8791 4.1047
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.310e+00 5.598e-01 -4.127 3.67e-05 ***
## hotelResort Hotel 1.046e-01 5.968e-02 1.753 0.079623 .
## lead_time 3.831e-03 3.170e-04 12.083 < 2e-16 ***
## arrival_date_monthFebruary 9.532e-01 2.864e-01 3.328 0.000875 ***
## arrival_date_monthMarch 1.303e+00 4.950e-01 2.631 0.008509 **
## arrival_date_monthApril 2.069e+00 7.355e-01 2.813 0.004901 **
## arrival_date_monthMay 2.618e+00 9.709e-01 2.697 0.007006 **
## arrival_date_monthJune 3.435e+00 1.213e+00 2.832 0.004624 **
## arrival_date_monthJuly 4.031e+00 1.450e+00 2.781 0.005421 **
## arrival_date_monthAugust 4.852e+00 1.698e+00 2.858 0.004269 **
## arrival_date_monthSeptember 5.440e+00 1.951e+00 2.788 0.005307 **
## arrival_date_monthOctober 6.365e+00 2.186e+00 2.912 0.003589 **
## arrival_date_monthNovember 7.338e+00 2.424e+00 3.027 0.002472 **
## arrival_date_monthDecember 8.063e+00 2.657e+00 3.034 0.002411 **
## arrival_date_week_number -1.647e-01 5.501e-02 -2.994 0.002753 **
## arrival_date_day_of_month 2.148e-02 8.443e-03 2.544 0.010944 *
## stays_in_weekend_nights 1.060e-01 2.774e-02 3.822 0.000133 ***
## stays_in_week_nights 2.491e-02 1.463e-02 1.703 0.088640 .
## adults 1.661e-01 5.197e-02 3.195 0.001397 **
## children 3.252e-02 6.106e-02 0.533 0.594368
## babies 2.921e-01 2.986e-01 0.978 0.327982
## mealFB 8.967e-01 3.533e-01 2.538 0.011136 *
## mealHB -2.101e-02 8.522e-02 -0.247 0.805295
## mealSC 2.692e-01 7.887e-02 3.413 0.000643 ***
## mealUndefined -4.958e-01 2.917e-01 -1.700 0.089156 .
## market_segmentComplementary 8.350e-01 7.157e-01 1.167 0.243288
## market_segmentCorporate -1.285e-01 5.386e-01 -0.239 0.811446
## market_segmentDirect 5.570e-01 5.979e-01 0.931 0.351600
## market_segmentGroups 4.082e-01 5.683e-01 0.718 0.472615
## market_segmentOffline TA/TO -1.480e-01 5.693e-01 -0.260 0.794819
## market_segmentOnline TA 1.169e+00 5.674e-01 2.060 0.039442 *
## distribution_channelDirect -8.420e-01 2.889e-01 -2.914 0.003567 **
## distribution_channelGDS -1.105e+00 5.451e-01 -2.026 0.042717 *
## distribution_channelTA/TO -4.732e-01 2.253e-01 -2.100 0.035691 *
## is_repeated_guest -1.226e+00 2.497e-01 -4.910 9.11e-07 ***
## previous_cancellations 2.490e+00 1.731e-01 14.388 < 2e-16 ***
## previous_bookings_not_canceled -2.026e-01 2.583e-02 -7.845 4.31e-15 ***
## booking_changes -4.295e-01 4.994e-02 -8.600 < 2e-16 ***
## days_in_waiting_list 4.260e-04 1.452e-03 0.293 0.769203
## customer_typeGroup -7.239e-02 5.210e-01 -0.139 0.889499
## customer_typeTransient 7.175e-01 1.616e-01 4.440 9.01e-06 ***
## customer_typeTransient-Party 2.528e-01 1.714e-01 1.475 0.140200
## adr 4.388e-03 7.354e-04 5.966 2.43e-09 ***
## required_car_parking_spaces -1.652e+01 1.321e+02 -0.125 0.900462
## total_of_special_requests -7.577e-01 3.660e-02 -20.705 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12533 on 10478 degrees of freedom
## Residual deviance: 10291 on 10434 degrees of freedom
## AIC: 10381
##
## Number of Fisher Scoring iterations: 16
# model statistics for type 3 ANOVA
Wald1<-Anova(m1,type = 3, test = "Wald")
Wald1
## Analysis of Deviance Table (Type III tests)
##
## Response: is_canceled
## Df Chisq Pr(>Chisq)
## (Intercept) 1 17.0325 3.675e-05 ***
## hotel 1 3.0726 0.0796230 .
## lead_time 1 145.9992 < 2.2e-16 ***
## arrival_date_month 11 42.4198 1.370e-05 ***
## arrival_date_week_number 1 8.9641 0.0027534 **
## arrival_date_day_of_month 1 6.4744 0.0109439 *
## stays_in_weekend_nights 1 14.6053 0.0001325 ***
## stays_in_week_nights 1 2.8989 0.0886398 .
## adults 1 10.2104 0.0013965 **
## children 1 0.2836 0.5943680
## babies 1 0.9568 0.3279824
## meal 4 21.5182 0.0002499 ***
## market_segment 6 239.4975 < 2.2e-16 ***
## distribution_channel 3 10.2164 0.0168138 *
## is_repeated_guest 1 24.1073 9.111e-07 ***
## previous_cancellations 1 207.0151 < 2.2e-16 ***
## previous_bookings_not_canceled 1 61.5516 4.313e-15 ***
## booking_changes 1 73.9573 < 2.2e-16 ***
## days_in_waiting_list 1 0.0861 0.7692028
## customer_type 3 44.3174 1.292e-09 ***
## adr 1 35.5952 2.429e-09 ***
## required_car_parking_spaces 1 0.0156 0.9004615
## total_of_special_requests 1 428.7050 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Build a new model (m3) with insignificant features removed
set.seed(123)
m3 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = hotel_small, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m3)
##
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type -
## required_car_parking_spaces - children - babies, family = binomial,
## data = hotel_small)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.9376 -0.7822 -0.5578 0.9072 4.2227
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2187460 0.5593826 -3.966 7.30e-05 ***
## hotelResort Hotel -0.0813598 0.0576410 -1.411 0.158100
## lead_time 0.0038744 0.0003082 12.573 < 2e-16 ***
## arrival_date_monthFebruary 1.0968597 0.2802167 3.914 9.07e-05 ***
## arrival_date_monthMarch 1.5404270 0.4841058 3.182 0.001463 **
## arrival_date_monthApril 2.4146629 0.7185631 3.360 0.000778 ***
## arrival_date_monthMay 3.1100400 0.9492316 3.276 0.001051 **
## arrival_date_monthJune 3.9993346 1.1861074 3.372 0.000747 ***
## arrival_date_monthJuly 4.6875817 1.4170920 3.308 0.000940 ***
## arrival_date_monthAugust 5.6448084 1.6603014 3.400 0.000674 ***
## arrival_date_monthSeptember 6.3246300 1.9084476 3.314 0.000920 ***
## arrival_date_monthOctober 7.3484622 2.1379976 3.437 0.000588 ***
## arrival_date_monthNovember 8.4108708 2.3713133 3.547 0.000390 ***
## arrival_date_monthDecember 9.2422607 2.5990235 3.556 0.000376 ***
## arrival_date_week_number -0.1887885 0.0538263 -3.507 0.000453 ***
## arrival_date_day_of_month 0.0254640 0.0082523 3.086 0.002031 **
## stays_in_weekend_nights 0.1164072 0.0272114 4.278 1.89e-05 ***
## stays_in_week_nights 0.0307970 0.0144045 2.138 0.032516 *
## adults 0.1501595 0.0495052 3.033 0.002420 **
## mealFB 0.8441445 0.3325315 2.539 0.011132 *
## mealHB 0.0039915 0.0830538 0.048 0.961669
## mealSC 0.2596129 0.0780752 3.325 0.000884 ***
## mealUndefined -0.3574997 0.2895501 -1.235 0.216952
## market_segmentComplementary 0.5044588 0.7083974 0.712 0.476395
## market_segmentCorporate -0.1766358 0.5386526 -0.328 0.742971
## market_segmentDirect 0.2198292 0.5937974 0.370 0.711227
## market_segmentGroups 0.2952615 0.5655767 0.522 0.601633
## market_segmentOffline TA/TO -0.2592766 0.5670897 -0.457 0.647523
## market_segmentOnline TA 1.0224114 0.5649823 1.810 0.070353 .
## distribution_channelDirect -0.6316546 0.2795379 -2.260 0.023844 *
## distribution_channelGDS -0.9316261 0.5431951 -1.715 0.086330 .
## distribution_channelTA/TO -0.3391845 0.2176683 -1.558 0.119171
## is_repeated_guest -1.2528300 0.2492271 -5.027 4.99e-07 ***
## previous_cancellations 2.5006164 0.1702729 14.686 < 2e-16 ***
## previous_bookings_not_canceled -0.2140704 0.0283599 -7.548 4.41e-14 ***
## booking_changes -0.4491725 0.0492317 -9.124 < 2e-16 ***
## customer_typeGroup -0.0112182 0.5145965 -0.022 0.982608
## customer_typeTransient 0.6861320 0.1609171 4.264 2.01e-05 ***
## customer_typeTransient-Party 0.2244410 0.1707857 1.314 0.188790
## adr 0.0033795 0.0006638 5.091 3.56e-07 ***
## total_of_special_requests -0.7540047 0.0358349 -21.041 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12533 on 10478 degrees of freedom
## Residual deviance: 10727 on 10438 degrees of freedom
## AIC: 10809
##
## Number of Fisher Scoring iterations: 6
# model statistics for type 3 ANOVA with the Logit Model build(m3)
Wald3<-Anova(m3,type = 3, test = "Wald")
Wald3
## Analysis of Deviance Table (Type III tests)
##
## Response: is_canceled
## Df Chisq Pr(>Chisq)
## (Intercept) 1 15.7325 7.296e-05 ***
## hotel 1 1.9923 0.1580996
## lead_time 1 158.0701 < 2.2e-16 ***
## arrival_date_month 11 45.0395 4.775e-06 ***
## arrival_date_week_number 1 12.3016 0.0004526 ***
## arrival_date_day_of_month 1 9.5213 0.0020310 **
## stays_in_weekend_nights 1 18.3004 1.887e-05 ***
## stays_in_week_nights 1 4.5711 0.0325158 *
## adults 1 9.2003 0.0024197 **
## meal 4 19.5549 0.0006113 ***
## market_segment 6 235.6665 < 2.2e-16 ***
## distribution_channel 3 6.5646 0.0871488 .
## is_repeated_guest 1 25.2693 4.986e-07 ***
## previous_cancellations 1 215.6767 < 2.2e-16 ***
## previous_bookings_not_canceled 1 56.9774 4.408e-14 ***
## booking_changes 1 83.2407 < 2.2e-16 ***
## customer_type 3 43.1339 2.305e-09 ***
## adr 1 25.9210 3.557e-07 ***
## total_of_special_requests 1 442.7272 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Random Forest to observe variable importance. Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.Out-of-bag estimate error rate of 18.59%
set.seed(123)
forest1 <- randomForest(is_canceled ~.-deposit_type,
data = hotel_small,
importance = TRUE,
ntree = 1000)
forest1
##
## Call:
## randomForest(formula = is_canceled ~ . - deposit_type, data = hotel_small, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 18.59%
## Confusion matrix:
## 0 1 class.error
## 0 6948 540 0.07211538
## 1 1408 1583 0.47074557
We will now run a RF model to find which features are important and if they are similar results to Logistic Regression.
Variable importance
varImpPlot(forest1)
After finding the most significant features, we test Logisitc Regression and RandomForest on the entire dataset and include additional insights. But first we split the data.
set.seed(123)
train.idx <- sample(1:nrow(hotel_data), size = 0.70 * nrow(hotel_data))
train_data <- hotel_data[train.idx,]
test_data <- hotel_data[-train.idx,]
set.seed(123)
m4 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = train_data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m4)
##
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type -
## required_car_parking_spaces - children - babies, family = binomial,
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.7779 -0.5436 0.8742 5.5722
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3797276 0.2210451 -10.766 < 2e-16 ***
## hotelResort Hotel -0.1055449 0.0217673 -4.849 1.24e-06 ***
## lead_time 0.0041393 0.0001166 35.498 < 2e-16 ***
## arrival_date_monthFebruary 0.9545556 0.1069058 8.929 < 2e-16 ***
## arrival_date_monthMarch 1.4448939 0.1840697 7.850 4.17e-15 ***
## arrival_date_monthApril 2.4168787 0.2723987 8.873 < 2e-16 ***
## arrival_date_monthMay 3.1179026 0.3604750 8.649 < 2e-16 ***
## arrival_date_monthJune 3.8856070 0.4498383 8.638 < 2e-16 ***
## arrival_date_monthJuly 4.5935186 0.5381040 8.536 < 2e-16 ***
## arrival_date_monthAugust 5.5302438 0.6300220 8.778 < 2e-16 ***
## arrival_date_monthSeptember 6.2115343 0.7246547 8.572 < 2e-16 ***
## arrival_date_monthOctober 7.2180822 0.8113012 8.897 < 2e-16 ***
## arrival_date_monthNovember 8.0881865 0.9005646 8.981 < 2e-16 ***
## arrival_date_monthDecember 8.9551473 0.9870954 9.072 < 2e-16 ***
## arrival_date_week_number -0.1835498 0.0204423 -8.979 < 2e-16 ***
## arrival_date_day_of_month 0.0261415 0.0031349 8.339 < 2e-16 ***
## stays_in_weekend_nights 0.0848250 0.0104989 8.079 6.51e-16 ***
## stays_in_week_nights 0.0563759 0.0055848 10.095 < 2e-16 ***
## adults 0.0944233 0.0171092 5.519 3.41e-08 ***
## mealFB 0.7783541 0.1242451 6.265 3.74e-10 ***
## mealHB -0.0741201 0.0313609 -2.363 0.018105 *
## mealSC 0.1404857 0.0300397 4.677 2.92e-06 ***
## mealUndefined -0.5301862 0.1185561 -4.472 7.75e-06 ***
## market_segmentComplementary 0.5115529 0.2644214 1.935 0.053038 .
## market_segmentCorporate -0.1751410 0.2118239 -0.827 0.408337
## market_segmentDirect 0.0314691 0.2316087 0.136 0.891923
## market_segmentGroups 0.0638232 0.2199366 0.290 0.771672
## market_segmentOffline TA/TO -0.5500209 0.2207341 -2.492 0.012710 *
## market_segmentOnline TA 0.7859308 0.2200038 3.572 0.000354 ***
## distribution_channelDirect -0.5865386 0.1072301 -5.470 4.50e-08 ***
## distribution_channelGDS -0.8968919 0.2347669 -3.820 0.000133 ***
## distribution_channelTA/TO -0.1256727 0.0815080 -1.542 0.123111
## distribution_channelUndefined -7.3656229 72.4629559 -0.102 0.919037
## is_repeated_guest -0.7521839 0.0969619 -7.758 8.66e-15 ***
## previous_cancellations 2.7343790 0.0694837 39.353 < 2e-16 ***
## previous_bookings_not_canceled -0.4556729 0.0287004 -15.877 < 2e-16 ***
## booking_changes -0.4162856 0.0181431 -22.945 < 2e-16 ***
## customer_typeGroup -0.0171859 0.1966399 -0.087 0.930355
## customer_typeTransient 0.7939001 0.0630203 12.598 < 2e-16 ***
## customer_typeTransient-Party 0.3991553 0.0670266 5.955 2.60e-09 ***
## adr 0.0042139 0.0002522 16.712 < 2e-16 ***
## total_of_special_requests -0.7400127 0.0135619 -54.565 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87345 on 73358 degrees of freedom
## Residual deviance: 74012 on 73317 degrees of freedom
## AIC: 74096
##
## Number of Fisher Scoring iterations: 8
https://www.kaggle.com/ugurdar/logistic-regression-roc-auc-hotel-booking-data#Model-Building
test_data$PredProb = predict.glm(m4, newdata = test_data, type = 'response')
test_data$PredSur = ifelse(test_data$PredProb >= 0.5, 1,0)
caret::confusionMatrix(as.factor(test_data$is_canceled), as.factor(test_data$PredSur), mode="everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21169 1271
## 1 5722 3278
##
## Accuracy : 0.7776
## 95% CI : (0.7729, 0.7822)
## No Information Rate : 0.8553
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3611
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7206
## Specificity : 0.7872
## Pos Pred Value : 0.3642
## Neg Pred Value : 0.9434
## Precision : 0.3642
## Recall : 0.7206
## F1 : 0.4839
## Prevalence : 0.1447
## Detection Rate : 0.1043
## Detection Prevalence : 0.2863
## Balanced Accuracy : 0.7539
##
## 'Positive' Class : 1
##
Accuracy = 77.7% F1 score of 48%
this error occurs when using the full data set. We will shorten the data to make sure we can run the RF Error: cannot allocate vector of size 1.1 Gb
set.seed(123)
train_rf.idx <- sample(1:nrow(hotel_data), size = 0.50 * nrow(hotel_data))
train_rf_data <- hotel_data[train_rf.idx,]
test_rf_data <- hotel_data[-train_rf.idx,]
set.seed(123)
forest2 <- randomForest(is_canceled ~.-deposit_type -days_in_waiting_list -required_car_parking_spaces -children -babies,
data = train_rf_data,
importance = TRUE,
ntree = 1000)
forest2
##
## Call:
## randomForest(formula = is_canceled ~ . - deposit_type - days_in_waiting_list - required_car_parking_spaces - children - babies, data = train_rf_data, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 16.73%
## Confusion matrix:
## 0 1 class.error
## 0 34832 2712 0.07223524
## 1 6053 8802 0.40747223
F1 score of 66.8%. The higher the F1 score, the better. The random forest has better accuracy at 84% and 66.8% F1 score. This is important if there is a cost associated with contacting guest who we think might cancel their bookings. For example, if we send every guest a promo to keep them from canceling, this could be costly if we have a high amount of False Positives (positive meaning that our model gives a positive score of ‘1’ that the guest will cancel). But since these guests were likely to come anyways, we now give guests promos, even when we didn’t need to. Therefore F1 score could be an important metric.
For this reason, we choose the Random Forest Model. In the future, it may be useful to run a RF with hyperparmeters.