By Logan Henslee

Sources used:

https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md

https://www.kaggle.com/jessemostipak/hotel-booking-demand

https://www.sciencedirect.com/science/article/pii/S2352340918315191

Data Dictionary

  • hotel - Hotel (H1 = Resort Hotel or H2 = City Hotel)

  • is_canceled - Value indicating if the booking was canceled (1) or not (0)

  • lead_time- Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

  • arrival_date_year - Year of arrival date

  • arrival_date_month - Month of arrival date

  • arrival_date_week_number - Week number of year for arrival date

  • arrival_date_day_of_month - Day of arrival date

  • stays_in_weekend_nights - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

  • stays_in_week_nights -Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

  • adults - Number of adults

  • children - Number of children

  • babies - Number of babies

  • meal - Type of meal booked. Categories are presented in standard hospitality meal packages:

    -Undefined/SC – no meal package; -BB – Bed & Breakfast; -HB – Half board (breakfast and one other meal – usually dinner); -FB – Full board (breakfast, lunch and dinner)

  • country Country of origin. Categories are represented in the ISO 3155–3:2013 format

  • market_segment - Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  • distribution_channel - Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  • is_repeated_guest - Value indicating if the booking name was from a repeated guest (1) or not (0)

  • previous_cancellations - Number of previous bookings that were cancelled by the customer prior to the current booking

  • previous_bookings_not_canceled - Number of previous bookings not cancelled by the customer prior to the current booking

  • reserved_room_type - Code of room type reserved. Code is presented instead of designation for anonymity reasons

  • assigned_room_type - Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

  • booking_changes - Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

  • deposit_type Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:

    -No Deposit – no deposit was made; -Non Refund – a deposit was made in the value of the total stay cost; -Refundable – a deposit was made with a value under the total cost of stay.

  • agent - ID of the travel agency that made the booking

  • company - ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

  • days_in_waiting_list - Number of days the booking was in the waiting list before it was confirmed to the customer

  • customer_type - Type of booking, assuming one of four categories:

    -Contract - when the booking has an allotment or other type of contract associated to it; -Group – when the booking is associated to a group; -Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; -Transient-party – when the booking is transient, but is associated to at least other transient booking

  • adr - Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

  • required_car_parking_spaces - Number of car parking spaces required by the customer

  • total_of_special_requests - Number of special requests made by the customer (e.g. twin bed or high floor)

  • reservation_status - Reservation last status, assuming one of three categories:

    -Canceled – booking was canceled by the customer; -Check-Out – customer has checked in but already departed; -No-Show – customer did not check-in and did inform the hotel of the reason why

  • reservation_status_date - Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(broom)
library(BSDA)
## Loading required package: lattice
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggThemeAssist)
library(corrplot)
## corrplot 0.92 loaded
library(survival) 
library(rpart) 
#library(randomForestSRC)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(varImp)
## Loading required package: measures
## 
## Attaching package: 'measures'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'varImp'
## The following object is masked from 'package:caret':
## 
##     varImp
library(tidytext)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library(ggplot2)

library(car)
## Loading required package: carData
## 
## Attaching package: 'carData'
## The following objects are masked from 'package:BSDA':
## 
##     Vocab, Wool
## 
## Attaching package: 'car'
## The following object is masked from 'package:modeltools':
## 
##     Predict
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(DescTools)
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
## 
##     Recode
## The following object is masked from 'package:modeltools':
## 
##     ParseFormula
## The following objects are masked from 'package:measures':
## 
##     AUC, MAE, MAPE, MSE, NPV, RMSE, SpearmanRho
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
library(ROCR)
library(lmtest)

library(rattle)
## Loading required package: bitops
## 
## Attaching package: 'bitops'
## The following object is masked from 'package:DescTools':
## 
##     %^%
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(purrr)
hotel_data <- read.csv('hotel_bookings.csv')
str(hotel_data)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
summary(hotel_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
anyNA(hotel_data)
## [1] TRUE
hotel_data <- na.omit(hotel_data)

EDA and Data Transformation

Here we’ve taken out several variables based on incompleteness of data or because the variables are assumed to be of little value in predicting booking cancellations. The data frame is saved as a new data frame from the original.

hotel_data <- subset(hotel_data, select = -c(company,agent,country,arrival_date_year,reservation_status_date,reservation_status,reserved_room_type,assigned_room_type))
hotel_data$hotel <- as.factor(hotel_data$hotel)
hotel_data$meal <- as.factor(hotel_data$meal)
hotel_data$arrival_date_month <- as.factor(hotel_data$arrival_date_month)
hotel_data$market_segment <- as.factor(hotel_data$market_segment)
hotel_data$distribution_channel <- as.factor(hotel_data$distribution_channel)
hotel_data$deposit_type <- as.factor(hotel_data$deposit_type)
hotel_data$customer_type <- as.factor(hotel_data$customer_type)

hotel_data$is_canceled <- as.factor(hotel_data$is_canceled)
str(hotel_data)
## 'data.frame':    119386 obs. of  24 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ market_segment                : Factor w/ 7 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...

We see below, that bookings are not canceled more often than they are canceled, for both City Hotel and Resort Hotel.

A ‘0’ represents a booking not canceled and a ‘1’ represents a booking that was canceled.

table(hotel_data$is_canceled, hotel_data$hotel)
##    
##     City Hotel Resort Hotel
##   0      46228        28938
##   1      33098        11122

Here we see a potential issue with the Non-Refund deposit type.

table(hotel_data$is_canceled, hotel_data$deposit_type)
##    
##     No Deposit Non Refund Refundable
##   0      74947         93        126
##   1      29690      14494         36
table(hotel_data$deposit_type)
## 
## No Deposit Non Refund Refundable 
##     104637      14587        162

ADR (Average Daily Rate). This would be the bookings cost per night.

hotel_data %>%
  group_by(hotel, deposit_type, is_canceled) %>%
  summarise_at(vars(adr), list(name = median))
## # A tibble: 8 x 4
## # Groups:   hotel, deposit_type [4]
##   hotel        deposit_type is_canceled  name
##   <fct>        <fct>        <fct>       <dbl>
## 1 City Hotel   No Deposit   0            99.9
## 2 City Hotel   No Deposit   1           106. 
## 3 City Hotel   Refundable   0            81  
## 4 City Hotel   Refundable   1           135. 
## 5 Resort Hotel No Deposit   0            72.4
## 6 Resort Hotel No Deposit   1            91.6
## 7 Resort Hotel Refundable   0            66  
## 8 Resort Hotel Refundable   1            63

x = deposit_type, y = adr, group = is_canceled,

ggplot(data = hotel_data, 
       aes(x = is_canceled,
           y = adr,
           
           fill = as.factor(is_canceled)))+
  geom_bar(stat = "summary", width = 0.5, position = "dodge", fun = "median")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+labs(y = "adr (Median Value)", fill = "is_canceled")

hotel_data$arrival_date_month = factor(hotel_data$arrival_date_month, levels = month.name)
ggplot(data = hotel_data, 
       aes(x = arrival_date_month, 
           y = adr))+
  geom_bar(stat = "summary", width = 0.5, position = "dodge", fill = "#FF6666", fun = "median")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+ 
  scale_x_discrete(limits = month.name)

Variable Importance

Set a temporary training dataset below to test the Logistic Model, and this will help the time it takes to run models. Once the most important variables are decided then use a full training and testing set. This could be a form of ‘Undersampling’, because It reduces the number of observations from majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles. Check the balance of the data afterwords.

set.seed(123)
hotel_temporary <- sample(1:nrow(hotel_data), size = 0.10 * nrow(hotel_data))
hotel_small <- hotel_data[hotel_temporary,]

Logistic Regression to observe significant variables.

take out deposit type as it seems unrealiable

set.seed(123)
m1 <- glm(is_canceled ~ .-deposit_type, data = hotel_small, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m1)
## 
## Call:
## glm(formula = is_canceled ~ . - deposit_type, family = binomial, 
##     data = hotel_small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9380  -0.7880  -0.5288   0.8791   4.1047  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.310e+00  5.598e-01  -4.127 3.67e-05 ***
## hotelResort Hotel               1.046e-01  5.968e-02   1.753 0.079623 .  
## lead_time                       3.831e-03  3.170e-04  12.083  < 2e-16 ***
## arrival_date_monthFebruary      9.532e-01  2.864e-01   3.328 0.000875 ***
## arrival_date_monthMarch         1.303e+00  4.950e-01   2.631 0.008509 ** 
## arrival_date_monthApril         2.069e+00  7.355e-01   2.813 0.004901 ** 
## arrival_date_monthMay           2.618e+00  9.709e-01   2.697 0.007006 ** 
## arrival_date_monthJune          3.435e+00  1.213e+00   2.832 0.004624 ** 
## arrival_date_monthJuly          4.031e+00  1.450e+00   2.781 0.005421 ** 
## arrival_date_monthAugust        4.852e+00  1.698e+00   2.858 0.004269 ** 
## arrival_date_monthSeptember     5.440e+00  1.951e+00   2.788 0.005307 ** 
## arrival_date_monthOctober       6.365e+00  2.186e+00   2.912 0.003589 ** 
## arrival_date_monthNovember      7.338e+00  2.424e+00   3.027 0.002472 ** 
## arrival_date_monthDecember      8.063e+00  2.657e+00   3.034 0.002411 ** 
## arrival_date_week_number       -1.647e-01  5.501e-02  -2.994 0.002753 ** 
## arrival_date_day_of_month       2.148e-02  8.443e-03   2.544 0.010944 *  
## stays_in_weekend_nights         1.060e-01  2.774e-02   3.822 0.000133 ***
## stays_in_week_nights            2.491e-02  1.463e-02   1.703 0.088640 .  
## adults                          1.661e-01  5.197e-02   3.195 0.001397 ** 
## children                        3.252e-02  6.106e-02   0.533 0.594368    
## babies                          2.921e-01  2.986e-01   0.978 0.327982    
## mealFB                          8.967e-01  3.533e-01   2.538 0.011136 *  
## mealHB                         -2.101e-02  8.522e-02  -0.247 0.805295    
## mealSC                          2.692e-01  7.887e-02   3.413 0.000643 ***
## mealUndefined                  -4.958e-01  2.917e-01  -1.700 0.089156 .  
## market_segmentComplementary     8.350e-01  7.157e-01   1.167 0.243288    
## market_segmentCorporate        -1.285e-01  5.386e-01  -0.239 0.811446    
## market_segmentDirect            5.570e-01  5.979e-01   0.931 0.351600    
## market_segmentGroups            4.082e-01  5.683e-01   0.718 0.472615    
## market_segmentOffline TA/TO    -1.480e-01  5.693e-01  -0.260 0.794819    
## market_segmentOnline TA         1.169e+00  5.674e-01   2.060 0.039442 *  
## distribution_channelDirect     -8.420e-01  2.889e-01  -2.914 0.003567 ** 
## distribution_channelGDS        -1.105e+00  5.451e-01  -2.026 0.042717 *  
## distribution_channelTA/TO      -4.732e-01  2.253e-01  -2.100 0.035691 *  
## is_repeated_guest              -1.226e+00  2.497e-01  -4.910 9.11e-07 ***
## previous_cancellations          2.490e+00  1.731e-01  14.388  < 2e-16 ***
## previous_bookings_not_canceled -2.026e-01  2.583e-02  -7.845 4.31e-15 ***
## booking_changes                -4.295e-01  4.994e-02  -8.600  < 2e-16 ***
## days_in_waiting_list            4.260e-04  1.452e-03   0.293 0.769203    
## customer_typeGroup             -7.239e-02  5.210e-01  -0.139 0.889499    
## customer_typeTransient          7.175e-01  1.616e-01   4.440 9.01e-06 ***
## customer_typeTransient-Party    2.528e-01  1.714e-01   1.475 0.140200    
## adr                             4.388e-03  7.354e-04   5.966 2.43e-09 ***
## required_car_parking_spaces    -1.652e+01  1.321e+02  -0.125 0.900462    
## total_of_special_requests      -7.577e-01  3.660e-02 -20.705  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12533  on 10478  degrees of freedom
## Residual deviance: 10291  on 10434  degrees of freedom
## AIC: 10381
## 
## Number of Fisher Scoring iterations: 16

ANOVA 1

# model statistics for type 3 ANOVA
Wald1<-Anova(m1,type = 3, test = "Wald")
Wald1
## Analysis of Deviance Table (Type III tests)
## 
## Response: is_canceled
##                                Df    Chisq Pr(>Chisq)    
## (Intercept)                     1  17.0325  3.675e-05 ***
## hotel                           1   3.0726  0.0796230 .  
## lead_time                       1 145.9992  < 2.2e-16 ***
## arrival_date_month             11  42.4198  1.370e-05 ***
## arrival_date_week_number        1   8.9641  0.0027534 ** 
## arrival_date_day_of_month       1   6.4744  0.0109439 *  
## stays_in_weekend_nights         1  14.6053  0.0001325 ***
## stays_in_week_nights            1   2.8989  0.0886398 .  
## adults                          1  10.2104  0.0013965 ** 
## children                        1   0.2836  0.5943680    
## babies                          1   0.9568  0.3279824    
## meal                            4  21.5182  0.0002499 ***
## market_segment                  6 239.4975  < 2.2e-16 ***
## distribution_channel            3  10.2164  0.0168138 *  
## is_repeated_guest               1  24.1073  9.111e-07 ***
## previous_cancellations          1 207.0151  < 2.2e-16 ***
## previous_bookings_not_canceled  1  61.5516  4.313e-15 ***
## booking_changes                 1  73.9573  < 2.2e-16 ***
## days_in_waiting_list            1   0.0861  0.7692028    
## customer_type                   3  44.3174  1.292e-09 ***
## adr                             1  35.5952  2.429e-09 ***
## required_car_parking_spaces     1   0.0156  0.9004615    
## total_of_special_requests       1 428.7050  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Build a new model (m3) with insignificant features removed

set.seed(123)
m3 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = hotel_small, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m3)
## 
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type - 
##     required_car_parking_spaces - children - babies, family = binomial, 
##     data = hotel_small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9376  -0.7822  -0.5578   0.9072   4.2227  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.2187460  0.5593826  -3.966 7.30e-05 ***
## hotelResort Hotel              -0.0813598  0.0576410  -1.411 0.158100    
## lead_time                       0.0038744  0.0003082  12.573  < 2e-16 ***
## arrival_date_monthFebruary      1.0968597  0.2802167   3.914 9.07e-05 ***
## arrival_date_monthMarch         1.5404270  0.4841058   3.182 0.001463 ** 
## arrival_date_monthApril         2.4146629  0.7185631   3.360 0.000778 ***
## arrival_date_monthMay           3.1100400  0.9492316   3.276 0.001051 ** 
## arrival_date_monthJune          3.9993346  1.1861074   3.372 0.000747 ***
## arrival_date_monthJuly          4.6875817  1.4170920   3.308 0.000940 ***
## arrival_date_monthAugust        5.6448084  1.6603014   3.400 0.000674 ***
## arrival_date_monthSeptember     6.3246300  1.9084476   3.314 0.000920 ***
## arrival_date_monthOctober       7.3484622  2.1379976   3.437 0.000588 ***
## arrival_date_monthNovember      8.4108708  2.3713133   3.547 0.000390 ***
## arrival_date_monthDecember      9.2422607  2.5990235   3.556 0.000376 ***
## arrival_date_week_number       -0.1887885  0.0538263  -3.507 0.000453 ***
## arrival_date_day_of_month       0.0254640  0.0082523   3.086 0.002031 ** 
## stays_in_weekend_nights         0.1164072  0.0272114   4.278 1.89e-05 ***
## stays_in_week_nights            0.0307970  0.0144045   2.138 0.032516 *  
## adults                          0.1501595  0.0495052   3.033 0.002420 ** 
## mealFB                          0.8441445  0.3325315   2.539 0.011132 *  
## mealHB                          0.0039915  0.0830538   0.048 0.961669    
## mealSC                          0.2596129  0.0780752   3.325 0.000884 ***
## mealUndefined                  -0.3574997  0.2895501  -1.235 0.216952    
## market_segmentComplementary     0.5044588  0.7083974   0.712 0.476395    
## market_segmentCorporate        -0.1766358  0.5386526  -0.328 0.742971    
## market_segmentDirect            0.2198292  0.5937974   0.370 0.711227    
## market_segmentGroups            0.2952615  0.5655767   0.522 0.601633    
## market_segmentOffline TA/TO    -0.2592766  0.5670897  -0.457 0.647523    
## market_segmentOnline TA         1.0224114  0.5649823   1.810 0.070353 .  
## distribution_channelDirect     -0.6316546  0.2795379  -2.260 0.023844 *  
## distribution_channelGDS        -0.9316261  0.5431951  -1.715 0.086330 .  
## distribution_channelTA/TO      -0.3391845  0.2176683  -1.558 0.119171    
## is_repeated_guest              -1.2528300  0.2492271  -5.027 4.99e-07 ***
## previous_cancellations          2.5006164  0.1702729  14.686  < 2e-16 ***
## previous_bookings_not_canceled -0.2140704  0.0283599  -7.548 4.41e-14 ***
## booking_changes                -0.4491725  0.0492317  -9.124  < 2e-16 ***
## customer_typeGroup             -0.0112182  0.5145965  -0.022 0.982608    
## customer_typeTransient          0.6861320  0.1609171   4.264 2.01e-05 ***
## customer_typeTransient-Party    0.2244410  0.1707857   1.314 0.188790    
## adr                             0.0033795  0.0006638   5.091 3.56e-07 ***
## total_of_special_requests      -0.7540047  0.0358349 -21.041  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12533  on 10478  degrees of freedom
## Residual deviance: 10727  on 10438  degrees of freedom
## AIC: 10809
## 
## Number of Fisher Scoring iterations: 6

ANOVA 2

# model statistics for type 3 ANOVA with the Logit Model build(m3)
Wald3<-Anova(m3,type = 3, test = "Wald")
Wald3
## Analysis of Deviance Table (Type III tests)
## 
## Response: is_canceled
##                                Df    Chisq Pr(>Chisq)    
## (Intercept)                     1  15.7325  7.296e-05 ***
## hotel                           1   1.9923  0.1580996    
## lead_time                       1 158.0701  < 2.2e-16 ***
## arrival_date_month             11  45.0395  4.775e-06 ***
## arrival_date_week_number        1  12.3016  0.0004526 ***
## arrival_date_day_of_month       1   9.5213  0.0020310 ** 
## stays_in_weekend_nights         1  18.3004  1.887e-05 ***
## stays_in_week_nights            1   4.5711  0.0325158 *  
## adults                          1   9.2003  0.0024197 ** 
## meal                            4  19.5549  0.0006113 ***
## market_segment                  6 235.6665  < 2.2e-16 ***
## distribution_channel            3   6.5646  0.0871488 .  
## is_repeated_guest               1  25.2693  4.986e-07 ***
## previous_cancellations          1 215.6767  < 2.2e-16 ***
## previous_bookings_not_canceled  1  56.9774  4.408e-14 ***
## booking_changes                 1  83.2407  < 2.2e-16 ***
## customer_type                   3  43.1339  2.305e-09 ***
## adr                             1  25.9210  3.557e-07 ***
## total_of_special_requests       1 442.7272  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random Forest inference

Random Forest to observe variable importance. Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.Out-of-bag estimate error rate of 18.59%

set.seed(123)
forest1 <- randomForest(is_canceled ~.-deposit_type,
                             data = hotel_small, 
                             importance = TRUE, 
                             ntree = 1000)

forest1
## 
## Call:
##  randomForest(formula = is_canceled ~ . - deposit_type, data = hotel_small,      importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.59%
## Confusion matrix:
##      0    1 class.error
## 0 6948  540  0.07211538
## 1 1408 1583  0.47074557

We will now run a RF model to find which features are important and if they are similar results to Logistic Regression.

Variable importance

varImpPlot(forest1)

Prepare Train/Test Data

After finding the most significant features, we test Logisitc Regression and RandomForest on the entire dataset and include additional insights. But first we split the data.

set.seed(123)
train.idx <- sample(1:nrow(hotel_data), size = 0.70 * nrow(hotel_data))
train_data <- hotel_data[train.idx,]
test_data <- hotel_data[-train.idx,]

Train Logistic Regression Model

set.seed(123)
m4 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = train_data, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m4)
## 
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type - 
##     required_car_parking_spaces - children - babies, family = binomial, 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.7779  -0.5436   0.8742   5.5722  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.3797276  0.2210451 -10.766  < 2e-16 ***
## hotelResort Hotel              -0.1055449  0.0217673  -4.849 1.24e-06 ***
## lead_time                       0.0041393  0.0001166  35.498  < 2e-16 ***
## arrival_date_monthFebruary      0.9545556  0.1069058   8.929  < 2e-16 ***
## arrival_date_monthMarch         1.4448939  0.1840697   7.850 4.17e-15 ***
## arrival_date_monthApril         2.4168787  0.2723987   8.873  < 2e-16 ***
## arrival_date_monthMay           3.1179026  0.3604750   8.649  < 2e-16 ***
## arrival_date_monthJune          3.8856070  0.4498383   8.638  < 2e-16 ***
## arrival_date_monthJuly          4.5935186  0.5381040   8.536  < 2e-16 ***
## arrival_date_monthAugust        5.5302438  0.6300220   8.778  < 2e-16 ***
## arrival_date_monthSeptember     6.2115343  0.7246547   8.572  < 2e-16 ***
## arrival_date_monthOctober       7.2180822  0.8113012   8.897  < 2e-16 ***
## arrival_date_monthNovember      8.0881865  0.9005646   8.981  < 2e-16 ***
## arrival_date_monthDecember      8.9551473  0.9870954   9.072  < 2e-16 ***
## arrival_date_week_number       -0.1835498  0.0204423  -8.979  < 2e-16 ***
## arrival_date_day_of_month       0.0261415  0.0031349   8.339  < 2e-16 ***
## stays_in_weekend_nights         0.0848250  0.0104989   8.079 6.51e-16 ***
## stays_in_week_nights            0.0563759  0.0055848  10.095  < 2e-16 ***
## adults                          0.0944233  0.0171092   5.519 3.41e-08 ***
## mealFB                          0.7783541  0.1242451   6.265 3.74e-10 ***
## mealHB                         -0.0741201  0.0313609  -2.363 0.018105 *  
## mealSC                          0.1404857  0.0300397   4.677 2.92e-06 ***
## mealUndefined                  -0.5301862  0.1185561  -4.472 7.75e-06 ***
## market_segmentComplementary     0.5115529  0.2644214   1.935 0.053038 .  
## market_segmentCorporate        -0.1751410  0.2118239  -0.827 0.408337    
## market_segmentDirect            0.0314691  0.2316087   0.136 0.891923    
## market_segmentGroups            0.0638232  0.2199366   0.290 0.771672    
## market_segmentOffline TA/TO    -0.5500209  0.2207341  -2.492 0.012710 *  
## market_segmentOnline TA         0.7859308  0.2200038   3.572 0.000354 ***
## distribution_channelDirect     -0.5865386  0.1072301  -5.470 4.50e-08 ***
## distribution_channelGDS        -0.8968919  0.2347669  -3.820 0.000133 ***
## distribution_channelTA/TO      -0.1256727  0.0815080  -1.542 0.123111    
## distribution_channelUndefined  -7.3656229 72.4629559  -0.102 0.919037    
## is_repeated_guest              -0.7521839  0.0969619  -7.758 8.66e-15 ***
## previous_cancellations          2.7343790  0.0694837  39.353  < 2e-16 ***
## previous_bookings_not_canceled -0.4556729  0.0287004 -15.877  < 2e-16 ***
## booking_changes                -0.4162856  0.0181431 -22.945  < 2e-16 ***
## customer_typeGroup             -0.0171859  0.1966399  -0.087 0.930355    
## customer_typeTransient          0.7939001  0.0630203  12.598  < 2e-16 ***
## customer_typeTransient-Party    0.3991553  0.0670266   5.955 2.60e-09 ***
## adr                             0.0042139  0.0002522  16.712  < 2e-16 ***
## total_of_special_requests      -0.7400127  0.0135619 -54.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87345  on 73358  degrees of freedom
## Residual deviance: 74012  on 73317  degrees of freedom
## AIC: 74096
## 
## Number of Fisher Scoring iterations: 8

Test

https://www.kaggle.com/ugurdar/logistic-regression-roc-auc-hotel-booking-data#Model-Building

test_data$PredProb = predict.glm(m4, newdata = test_data, type = 'response')
test_data$PredSur = ifelse(test_data$PredProb >= 0.5, 1,0)

Run confusion matrix, calculate F1, and Misclassification Error

caret::confusionMatrix(as.factor(test_data$is_canceled), as.factor(test_data$PredSur), mode="everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21169  1271
##          1  5722  3278
##                                           
##                Accuracy : 0.7776          
##                  95% CI : (0.7729, 0.7822)
##     No Information Rate : 0.8553          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3611          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7206          
##             Specificity : 0.7872          
##          Pos Pred Value : 0.3642          
##          Neg Pred Value : 0.9434          
##               Precision : 0.3642          
##                  Recall : 0.7206          
##                      F1 : 0.4839          
##              Prevalence : 0.1447          
##          Detection Rate : 0.1043          
##    Detection Prevalence : 0.2863          
##       Balanced Accuracy : 0.7539          
##                                           
##        'Positive' Class : 1               
## 

Accuracy = 77.7% F1 score of 48%

Run Random Forest on full Training Data.

this error occurs when using the full data set. We will shorten the data to make sure we can run the RF Error: cannot allocate vector of size 1.1 Gb

set.seed(123)
train_rf.idx <- sample(1:nrow(hotel_data), size = 0.50 * nrow(hotel_data))
train_rf_data <- hotel_data[train_rf.idx,]
test_rf_data <- hotel_data[-train_rf.idx,]
set.seed(123)
forest2 <- randomForest(is_canceled ~.-deposit_type -days_in_waiting_list -required_car_parking_spaces -children -babies,
                             data = train_rf_data, 
                             importance = TRUE, 
                             ntree = 1000)

forest2
## 
## Call:
##  randomForest(formula = is_canceled ~ . - deposit_type - days_in_waiting_list -      required_car_parking_spaces - children - babies, data = train_rf_data,      importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 16.73%
## Confusion matrix:
##       0    1 class.error
## 0 34832 2712  0.07223524
## 1  6053 8802  0.40747223

F1 score of 66.8%. The higher the F1 score, the better. The random forest has better accuracy at 84% and 66.8% F1 score. This is important if there is a cost associated with contacting guest who we think might cancel their bookings. For example, if we send every guest a promo to keep them from canceling, this could be costly if we have a high amount of False Positives (positive meaning that our model gives a positive score of ‘1’ that the guest will cancel). But since these guests were likely to come anyways, we now give guests promos, even when we didn’t need to. Therefore F1 score could be an important metric.

For this reason, we choose the Random Forest Model. In the future, it may be useful to run a RF with hyperparmeters.