By Logan Henslee

Sources used:

Data Dictionary

hotel - Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled - Value indicating if the booking was canceled (1) or not (0)
lead_time- Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year - Year of arrival date
arrival_date_month - Month of arrival date
arrival_date_week_number - Week number of year for arrival date
arrival_date_day_of_month - Day of arrival date
stays_in_weekend_nights - Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights -Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults - Number of adults
children - Number of children
babies - Number of babies
meal - Type of meal booked. Categories are presented in standard hospitality meal packages:

-Undefined/SC – no meal package; -BB – Bed & Breakfast; -HB – Half board (breakfast and one other meal – usually dinner); -FB – Full board (breakfast, lunch and dinner)
country Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment - Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel - Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest - Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations - Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled - Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type - Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type - Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes - Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:

-No Deposit – no deposit was made; -Non Refund – a deposit was made in the value of the total stay cost; -Refundable – a deposit was made with a value under the total cost of stay.
agent - ID of the travel agency that made the booking
company - ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list - Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type - Type of booking, assuming one of four categories:

-Contract - when the booking has an allotment or other type of contract associated to it; -Group – when the booking is associated to a group; -Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; -Transient-party – when the booking is transient, but is associated to at least other transient booking
adr - Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces - Number of car parking spaces required by the customer
total_of_special_requests - Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status - Reservation last status, assuming one of three categories:

-Canceled – booking was canceled by the customer; -Check-Out – customer has checked in but already departed; -No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date - Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.3     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(broom)
library(BSDA)

## Loading required package: lattice

## 
## Attaching package: 'BSDA'

## The following object is masked from 'package:datasets':
## 
##     Orange

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggThemeAssist)
library(corrplot)

## corrplot 0.92 loaded

library(survival) 
library(rpart) 
#library(randomForestSRC)
library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

## The following object is masked from 'package:purrr':
## 
##     lift

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(varImp)

## Loading required package: measures

## 
## Attaching package: 'measures'

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## Loading required package: party

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## 
## Attaching package: 'strucchange'

## The following object is masked from 'package:stringr':
## 
##     boundary

## 
## Attaching package: 'varImp'

## The following object is masked from 'package:caret':
## 
##     varImp

library(tidytext)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(hrbrthemes)

## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.

##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and

##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

library(viridis)

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:scales':
## 
##     viridis_pal

library(ggplot2)

library(car)

## Loading required package: carData

## 
## Attaching package: 'carData'

## The following objects are masked from 'package:BSDA':
## 
##     Vocab, Wool

## 
## Attaching package: 'car'

## The following object is masked from 'package:modeltools':
## 
##     Predict

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

library(DescTools)

## 
## Attaching package: 'DescTools'

## The following object is masked from 'package:car':
## 
##     Recode

## The following object is masked from 'package:modeltools':
## 
##     ParseFormula

## The following objects are masked from 'package:measures':
## 
##     AUC, MAE, MAPE, MSE, NPV, RMSE, SpearmanRho

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

library(ROCR)
library(lmtest)

library(rattle)

## Loading required package: bitops

## 
## Attaching package: 'bitops'

## The following object is masked from 'package:DescTools':
## 
##     %^%

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

## 
## Attaching package: 'rattle'

## The following object is masked from 'package:randomForest':
## 
##     importance

library(purrr)

hotel_data <- read.csv('hotel_bookings.csv')

str(hotel_data)

## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

summary(hotel_data)

##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
##

anyNA(hotel_data)

## [1] TRUE

hotel_data <- na.omit(hotel_data)

EDA and Data Transformation

Here we’ve taken out several variables based on incompleteness of data or because the variables are assumed to be of little value in predicting booking cancellations. The data frame is saved as a new data frame from the original.

hotel_data <- subset(hotel_data, select = -c(company,agent,country,arrival_date_year,reservation_status_date,reservation_status,reserved_room_type,assigned_room_type))

hotel_data$hotel <- as.factor(hotel_data$hotel)
hotel_data$meal <- as.factor(hotel_data$meal)
hotel_data$arrival_date_month <- as.factor(hotel_data$arrival_date_month)
hotel_data$market_segment <- as.factor(hotel_data$market_segment)
hotel_data$distribution_channel <- as.factor(hotel_data$distribution_channel)
hotel_data$deposit_type <- as.factor(hotel_data$deposit_type)
hotel_data$customer_type <- as.factor(hotel_data$customer_type)

hotel_data$is_canceled <- as.factor(hotel_data$is_canceled)

str(hotel_data)

## 'data.frame':    119386 obs. of  24 variables:
##  $ hotel                         : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
##  $ is_canceled                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_month            : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
##  $ market_segment                : Factor w/ 7 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
##  $ distribution_channel          : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...

We see below, that bookings are not canceled more often than they are canceled, for both City Hotel and Resort Hotel.

A ‘0’ represents a booking not canceled and a ‘1’ represents a booking that was canceled.

table(hotel_data$is_canceled, hotel_data$hotel)

##    
##     City Hotel Resort Hotel
##   0      46228        28938
##   1      33098        11122

Here we see a potential issue with the Non-Refund deposit type.

table(hotel_data$is_canceled, hotel_data$deposit_type)

##    
##     No Deposit Non Refund Refundable
##   0      74947         93        126
##   1      29690      14494         36

table(hotel_data$deposit_type)

## 
## No Deposit Non Refund Refundable 
##     104637      14587        162

For the below graph, refer to the link below. Deposit type is recorded before the bookings arrival or cancellation date.

https://www.sciencedirect.com/science/article/pii/S2352340918315191

ggplot(data = hotel_data, 
       aes(x = deposit_type, 
           group = is_canceled,
           fill = as.factor(is_canceled)))+
  geom_bar(stat = "count", width = 0.5, position = "dodge")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))

A potential reason for the misinformation of Non-Refund deposit type, is described in the link below, and a comment from Kaggle where the dataset was retrieved from:

https://www.kaggle.com/jessemostipak/hotel-booking-demand/discussion/131787

The author discussed this in his paper, Big Data in Hotel Revenue Management: Exploring Cancellation Drivers to Gain Insights Into Booking Cancellation Behavior. Here’s the reason, pasted directly from the journal. “As an example, through analysis of the “Nonrefundable” (DepositType) canceled bookings in some Asiatic countries (Country) and from certain distribution channels (DistributionChannel and Agent), it is possible to understand why so many “Nonrefundable” bookings are canceled. These bookings are usually made through OTA using false or invalid credit card details. These bookings are issued as support for requests for visas to enter the country (a hotel booking is mandatory for applying for a Portuguese entry visa). After failing to charge the customer’s credit card, the hotel identifies these bookings as “fake” and contacts the customer; however, during the time required to verify these bookings, they contribute negatively to demand forecast and demand-management decisions." -(commented by user Viriyo Satra Mangala).

Regardless of the reason, the deposit type looks unrealistic. Since the deposit type on ‘Non Refund’ is unreliable, we will take out all records that contain ‘Non Refund’, and will be left with ‘No Deposit’ and ‘Refundable’ deposit types. It may be best to not use Deposit Type at all in building of our stats model.

Take out records with non-refund types, drops from 119836 observations to

hotel_data <- filter(hotel_data, deposit_type != "Non Refund")

Total Bookings cancellation Status graphed below.

ggplot(data = hotel_data,
       aes(
         x = hotel,
         y = prop.table(stat(count)),
         fill = factor(is_canceled),
         label = scales::percent(prop.table(stat(count)))
       )) +
  geom_bar(position = position_dodge()) +
  geom_text(
    stat = "count",
    position = position_dodge(.9),
    vjust = -0.5,
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Total Population Cancellation Status by Hotel Type",
       x = "Hotel Type",
       y = "Count") +
  theme_dark() +
  scale_fill_discrete(
    name = "Booking Status",
    breaks = c("0", "1"),
    labels = c("Not Cancelled", "Cancelled")
  )

COunt of marketing segments grouped by Hotel and Canceled Status

ggplot(data = hotel_data, 
       aes(x = market_segment, 
           group = is_canceled,
           fill = as.factor(is_canceled)))+
  geom_bar(stat = "count", width = 0.5, position = "dodge")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))

Below is the total count of bookings for City Hotel and Resort Hotel

hotel_data%>%count(hotel)%>%
  ggplot(aes(hotel,n)) + 
  geom_bar(stat = 'identity', fill = 'blue', alpha = 0.4, col = "black") +
  xlab('Hotels ') + ylab('Frecuency')

hotel_data$arrival_date_month <-
  factor(hotel_data$arrival_date_month, levels = month.name)
# Visualize Hotel traffic on Monthly basis
ggplot(data = hotel_data, aes(x = arrival_date_month)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~hotel) +
  geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
  coord_flip() + labs(title = "Month Wise Booking Request",
                      x = "Month",
                      y = "Count") +
  theme_classic()

ADR median per hotel

hotel_data %>%
  group_by(hotel) %>%
  summarise_at(vars(adr), list(name = median))

## # A tibble: 2 x 2
##   hotel         name
##   <fct>        <dbl>
## 1 City Hotel    102.
## 2 Resort Hotel   76

stays week nights average per hotel

hotel_data %>%
  group_by(hotel) %>%
  summarise_at(vars(stays_in_week_nights), list(name = mean))

## # A tibble: 2 x 2
##   hotel         name
##   <fct>        <dbl>
## 1 City Hotel    2.22
## 2 Resort Hotel  3.14

weekend nights average

hotel_data %>%
  group_by(hotel) %>%
  summarise_at(vars(stays_in_weekend_nights), list(name = mean))

## # A tibble: 2 x 2
##   hotel         name
##   <fct>        <dbl>
## 1 City Hotel   0.840
## 2 Resort Hotel 1.19

ADR (Average Daily Rate). This would be the bookings cost per night.

hotel_data %>%
  group_by(hotel, deposit_type, is_canceled) %>%
  summarise_at(vars(adr), list(name = median))

## # A tibble: 8 x 4
## # Groups:   hotel, deposit_type [4]
##   hotel        deposit_type is_canceled  name
##   <fct>        <fct>        <fct>       <dbl>
## 1 City Hotel   No Deposit   0            99.9
## 2 City Hotel   No Deposit   1           106. 
## 3 City Hotel   Refundable   0            81  
## 4 City Hotel   Refundable   1           135. 
## 5 Resort Hotel No Deposit   0            72.4
## 6 Resort Hotel No Deposit   1            91.6
## 7 Resort Hotel Refundable   0            66  
## 8 Resort Hotel Refundable   1            63

x = deposit_type, y = adr, group = is_canceled,

ggplot(data = hotel_data, 
       aes(x = is_canceled,
           y = adr,
           
           fill = as.factor(is_canceled)))+
  geom_bar(stat = "summary", width = 0.5, position = "dodge", fun = "median")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+labs(y = "adr (Median Value)", fill = "is_canceled")

hotel_data$arrival_date_month = factor(hotel_data$arrival_date_month, levels = month.name)

ggplot(data = hotel_data, 
       aes(x = arrival_date_month, 
           y = adr))+
  geom_bar(stat = "summary", width = 0.5, position = "dodge", fill = "#FF6666", fun = "median")+
  facet_grid(. ~ hotel)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+ 
  scale_x_discrete(limits = month.name)

Variable Importance

Set a temporary training dataset below to test the Logistic Model, and this will help the time it takes to run models. Once the most important variables are decided then use a full training and testing set. This could be a form of ‘Undersampling’, because It reduces the number of observations from majority class to make the data set balanced. This method is best to use when the data set is huge and reducing the number of training samples helps to improve run time and storage troubles. Check the balance of the data afterwords.

set.seed(123)
hotel_temporary <- sample(1:nrow(hotel_data), size = 0.10 * nrow(hotel_data))
hotel_small <- hotel_data[hotel_temporary,]

Logistic Regression to observe significant variables.

take out deposit type as it seems unrealiable

set.seed(123)
m1 <- glm(is_canceled ~ .-deposit_type, data = hotel_small, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(m1)

## 
## Call:
## glm(formula = is_canceled ~ . - deposit_type, family = binomial, 
##     data = hotel_small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9380  -0.7880  -0.5288   0.8791   4.1047  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.310e+00  5.598e-01  -4.127 3.67e-05 ***
## hotelResort Hotel               1.046e-01  5.968e-02   1.753 0.079623 .  
## lead_time                       3.831e-03  3.170e-04  12.083  < 2e-16 ***
## arrival_date_monthFebruary      9.532e-01  2.864e-01   3.328 0.000875 ***
## arrival_date_monthMarch         1.303e+00  4.950e-01   2.631 0.008509 ** 
## arrival_date_monthApril         2.069e+00  7.355e-01   2.813 0.004901 ** 
## arrival_date_monthMay           2.618e+00  9.709e-01   2.697 0.007006 ** 
## arrival_date_monthJune          3.435e+00  1.213e+00   2.832 0.004624 ** 
## arrival_date_monthJuly          4.031e+00  1.450e+00   2.781 0.005421 ** 
## arrival_date_monthAugust        4.852e+00  1.698e+00   2.858 0.004269 ** 
## arrival_date_monthSeptember     5.440e+00  1.951e+00   2.788 0.005307 ** 
## arrival_date_monthOctober       6.365e+00  2.186e+00   2.912 0.003589 ** 
## arrival_date_monthNovember      7.338e+00  2.424e+00   3.027 0.002472 ** 
## arrival_date_monthDecember      8.063e+00  2.657e+00   3.034 0.002411 ** 
## arrival_date_week_number       -1.647e-01  5.501e-02  -2.994 0.002753 ** 
## arrival_date_day_of_month       2.148e-02  8.443e-03   2.544 0.010944 *  
## stays_in_weekend_nights         1.060e-01  2.774e-02   3.822 0.000133 ***
## stays_in_week_nights            2.491e-02  1.463e-02   1.703 0.088640 .  
## adults                          1.661e-01  5.197e-02   3.195 0.001397 ** 
## children                        3.252e-02  6.106e-02   0.533 0.594368    
## babies                          2.921e-01  2.986e-01   0.978 0.327982    
## mealFB                          8.967e-01  3.533e-01   2.538 0.011136 *  
## mealHB                         -2.101e-02  8.522e-02  -0.247 0.805295    
## mealSC                          2.692e-01  7.887e-02   3.413 0.000643 ***
## mealUndefined                  -4.958e-01  2.917e-01  -1.700 0.089156 .  
## market_segmentComplementary     8.350e-01  7.157e-01   1.167 0.243288    
## market_segmentCorporate        -1.285e-01  5.386e-01  -0.239 0.811446    
## market_segmentDirect            5.570e-01  5.979e-01   0.931 0.351600    
## market_segmentGroups            4.082e-01  5.683e-01   0.718 0.472615    
## market_segmentOffline TA/TO    -1.480e-01  5.693e-01  -0.260 0.794819    
## market_segmentOnline TA         1.169e+00  5.674e-01   2.060 0.039442 *  
## distribution_channelDirect     -8.420e-01  2.889e-01  -2.914 0.003567 ** 
## distribution_channelGDS        -1.105e+00  5.451e-01  -2.026 0.042717 *  
## distribution_channelTA/TO      -4.732e-01  2.253e-01  -2.100 0.035691 *  
## is_repeated_guest              -1.226e+00  2.497e-01  -4.910 9.11e-07 ***
## previous_cancellations          2.490e+00  1.731e-01  14.388  < 2e-16 ***
## previous_bookings_not_canceled -2.026e-01  2.583e-02  -7.845 4.31e-15 ***
## booking_changes                -4.295e-01  4.994e-02  -8.600  < 2e-16 ***
## days_in_waiting_list            4.260e-04  1.452e-03   0.293 0.769203    
## customer_typeGroup             -7.239e-02  5.210e-01  -0.139 0.889499    
## customer_typeTransient          7.175e-01  1.616e-01   4.440 9.01e-06 ***
## customer_typeTransient-Party    2.528e-01  1.714e-01   1.475 0.140200    
## adr                             4.388e-03  7.354e-04   5.966 2.43e-09 ***
## required_car_parking_spaces    -1.652e+01  1.321e+02  -0.125 0.900462    
## total_of_special_requests      -7.577e-01  3.660e-02 -20.705  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12533  on 10478  degrees of freedom
## Residual deviance: 10291  on 10434  degrees of freedom
## AIC: 10381
## 
## Number of Fisher Scoring iterations: 16

ANOVA 1

# model statistics for type 3 ANOVA
Wald1<-Anova(m1,type = 3, test = "Wald")
Wald1

## Analysis of Deviance Table (Type III tests)
## 
## Response: is_canceled
##                                Df    Chisq Pr(>Chisq)    
## (Intercept)                     1  17.0325  3.675e-05 ***
## hotel                           1   3.0726  0.0796230 .  
## lead_time                       1 145.9992  < 2.2e-16 ***
## arrival_date_month             11  42.4198  1.370e-05 ***
## arrival_date_week_number        1   8.9641  0.0027534 ** 
## arrival_date_day_of_month       1   6.4744  0.0109439 *  
## stays_in_weekend_nights         1  14.6053  0.0001325 ***
## stays_in_week_nights            1   2.8989  0.0886398 .  
## adults                          1  10.2104  0.0013965 ** 
## children                        1   0.2836  0.5943680    
## babies                          1   0.9568  0.3279824    
## meal                            4  21.5182  0.0002499 ***
## market_segment                  6 239.4975  < 2.2e-16 ***
## distribution_channel            3  10.2164  0.0168138 *  
## is_repeated_guest               1  24.1073  9.111e-07 ***
## previous_cancellations          1 207.0151  < 2.2e-16 ***
## previous_bookings_not_canceled  1  61.5516  4.313e-15 ***
## booking_changes                 1  73.9573  < 2.2e-16 ***
## days_in_waiting_list            1   0.0861  0.7692028    
## customer_type                   3  44.3174  1.292e-09 ***
## adr                             1  35.5952  2.429e-09 ***
## required_car_parking_spaces     1   0.0156  0.9004615    
## total_of_special_requests       1 428.7050  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Build a new model (m3) with insignificant features removed

set.seed(123)
m3 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = hotel_small, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(m3)

## 
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type - 
##     required_car_parking_spaces - children - babies, family = binomial, 
##     data = hotel_small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.9376  -0.7822  -0.5578   0.9072   4.2227  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.2187460  0.5593826  -3.966 7.30e-05 ***
## hotelResort Hotel              -0.0813598  0.0576410  -1.411 0.158100    
## lead_time                       0.0038744  0.0003082  12.573  < 2e-16 ***
## arrival_date_monthFebruary      1.0968597  0.2802167   3.914 9.07e-05 ***
## arrival_date_monthMarch         1.5404270  0.4841058   3.182 0.001463 ** 
## arrival_date_monthApril         2.4146629  0.7185631   3.360 0.000778 ***
## arrival_date_monthMay           3.1100400  0.9492316   3.276 0.001051 ** 
## arrival_date_monthJune          3.9993346  1.1861074   3.372 0.000747 ***
## arrival_date_monthJuly          4.6875817  1.4170920   3.308 0.000940 ***
## arrival_date_monthAugust        5.6448084  1.6603014   3.400 0.000674 ***
## arrival_date_monthSeptember     6.3246300  1.9084476   3.314 0.000920 ***
## arrival_date_monthOctober       7.3484622  2.1379976   3.437 0.000588 ***
## arrival_date_monthNovember      8.4108708  2.3713133   3.547 0.000390 ***
## arrival_date_monthDecember      9.2422607  2.5990235   3.556 0.000376 ***
## arrival_date_week_number       -0.1887885  0.0538263  -3.507 0.000453 ***
## arrival_date_day_of_month       0.0254640  0.0082523   3.086 0.002031 ** 
## stays_in_weekend_nights         0.1164072  0.0272114   4.278 1.89e-05 ***
## stays_in_week_nights            0.0307970  0.0144045   2.138 0.032516 *  
## adults                          0.1501595  0.0495052   3.033 0.002420 ** 
## mealFB                          0.8441445  0.3325315   2.539 0.011132 *  
## mealHB                          0.0039915  0.0830538   0.048 0.961669    
## mealSC                          0.2596129  0.0780752   3.325 0.000884 ***
## mealUndefined                  -0.3574997  0.2895501  -1.235 0.216952    
## market_segmentComplementary     0.5044588  0.7083974   0.712 0.476395    
## market_segmentCorporate        -0.1766358  0.5386526  -0.328 0.742971    
## market_segmentDirect            0.2198292  0.5937974   0.370 0.711227    
## market_segmentGroups            0.2952615  0.5655767   0.522 0.601633    
## market_segmentOffline TA/TO    -0.2592766  0.5670897  -0.457 0.647523    
## market_segmentOnline TA         1.0224114  0.5649823   1.810 0.070353 .  
## distribution_channelDirect     -0.6316546  0.2795379  -2.260 0.023844 *  
## distribution_channelGDS        -0.9316261  0.5431951  -1.715 0.086330 .  
## distribution_channelTA/TO      -0.3391845  0.2176683  -1.558 0.119171    
## is_repeated_guest              -1.2528300  0.2492271  -5.027 4.99e-07 ***
## previous_cancellations          2.5006164  0.1702729  14.686  < 2e-16 ***
## previous_bookings_not_canceled -0.2140704  0.0283599  -7.548 4.41e-14 ***
## booking_changes                -0.4491725  0.0492317  -9.124  < 2e-16 ***
## customer_typeGroup             -0.0112182  0.5145965  -0.022 0.982608    
## customer_typeTransient          0.6861320  0.1609171   4.264 2.01e-05 ***
## customer_typeTransient-Party    0.2244410  0.1707857   1.314 0.188790    
## adr                             0.0033795  0.0006638   5.091 3.56e-07 ***
## total_of_special_requests      -0.7540047  0.0358349 -21.041  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12533  on 10478  degrees of freedom
## Residual deviance: 10727  on 10438  degrees of freedom
## AIC: 10809
## 
## Number of Fisher Scoring iterations: 6

ANOVA 2

# model statistics for type 3 ANOVA with the Logit Model build(m3)
Wald3<-Anova(m3,type = 3, test = "Wald")
Wald3

## Analysis of Deviance Table (Type III tests)
## 
## Response: is_canceled
##                                Df    Chisq Pr(>Chisq)    
## (Intercept)                     1  15.7325  7.296e-05 ***
## hotel                           1   1.9923  0.1580996    
## lead_time                       1 158.0701  < 2.2e-16 ***
## arrival_date_month             11  45.0395  4.775e-06 ***
## arrival_date_week_number        1  12.3016  0.0004526 ***
## arrival_date_day_of_month       1   9.5213  0.0020310 ** 
## stays_in_weekend_nights         1  18.3004  1.887e-05 ***
## stays_in_week_nights            1   4.5711  0.0325158 *  
## adults                          1   9.2003  0.0024197 ** 
## meal                            4  19.5549  0.0006113 ***
## market_segment                  6 235.6665  < 2.2e-16 ***
## distribution_channel            3   6.5646  0.0871488 .  
## is_repeated_guest               1  25.2693  4.986e-07 ***
## previous_cancellations          1 215.6767  < 2.2e-16 ***
## previous_bookings_not_canceled  1  56.9774  4.408e-14 ***
## booking_changes                 1  83.2407  < 2.2e-16 ***
## customer_type                   3  43.1339  2.305e-09 ***
## adr                             1  25.9210  3.557e-07 ***
## total_of_special_requests       1 442.7272  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random Forest inference

Random Forest to observe variable importance. Random forest can be very effective to find a set of predictors that best explains the variance in the response variable.Out-of-bag estimate error rate of 18.59%

set.seed(123)
forest1 <- randomForest(is_canceled ~.-deposit_type,
                             data = hotel_small, 
                             importance = TRUE, 
                             ntree = 1000)

forest1

## 
## Call:
##  randomForest(formula = is_canceled ~ . - deposit_type, data = hotel_small,      importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.59%
## Confusion matrix:
##      0    1 class.error
## 0 6948  540  0.07211538
## 1 1408 1583  0.47074557

We will now run a RF model to find which features are important and if they are similar results to Logistic Regression.

Variable importance

varImpPlot(forest1)

Prepare Train/Test Data

After finding the most significant features, we test Logisitc Regression and RandomForest on the entire dataset and include additional insights. But first we split the data.

set.seed(123)
train.idx <- sample(1:nrow(hotel_data), size = 0.70 * nrow(hotel_data))
train_data <- hotel_data[train.idx,]
test_data <- hotel_data[-train.idx,]

Train Logistic Regression Model

set.seed(123)
m4 <- glm(is_canceled ~ . -days_in_waiting_list -deposit_type -required_car_parking_spaces -children -babies, data = train_data, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(m4)

## 
## Call:
## glm(formula = is_canceled ~ . - days_in_waiting_list - deposit_type - 
##     required_car_parking_spaces - children - babies, family = binomial, 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.7779  -0.5436   0.8742   5.5722  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -2.3797276  0.2210451 -10.766  < 2e-16 ***
## hotelResort Hotel              -0.1055449  0.0217673  -4.849 1.24e-06 ***
## lead_time                       0.0041393  0.0001166  35.498  < 2e-16 ***
## arrival_date_monthFebruary      0.9545556  0.1069058   8.929  < 2e-16 ***
## arrival_date_monthMarch         1.4448939  0.1840697   7.850 4.17e-15 ***
## arrival_date_monthApril         2.4168787  0.2723987   8.873  < 2e-16 ***
## arrival_date_monthMay           3.1179026  0.3604750   8.649  < 2e-16 ***
## arrival_date_monthJune          3.8856070  0.4498383   8.638  < 2e-16 ***
## arrival_date_monthJuly          4.5935186  0.5381040   8.536  < 2e-16 ***
## arrival_date_monthAugust        5.5302438  0.6300220   8.778  < 2e-16 ***
## arrival_date_monthSeptember     6.2115343  0.7246547   8.572  < 2e-16 ***
## arrival_date_monthOctober       7.2180822  0.8113012   8.897  < 2e-16 ***
## arrival_date_monthNovember      8.0881865  0.9005646   8.981  < 2e-16 ***
## arrival_date_monthDecember      8.9551473  0.9870954   9.072  < 2e-16 ***
## arrival_date_week_number       -0.1835498  0.0204423  -8.979  < 2e-16 ***
## arrival_date_day_of_month       0.0261415  0.0031349   8.339  < 2e-16 ***
## stays_in_weekend_nights         0.0848250  0.0104989   8.079 6.51e-16 ***
## stays_in_week_nights            0.0563759  0.0055848  10.095  < 2e-16 ***
## adults                          0.0944233  0.0171092   5.519 3.41e-08 ***
## mealFB                          0.7783541  0.1242451   6.265 3.74e-10 ***
## mealHB                         -0.0741201  0.0313609  -2.363 0.018105 *  
## mealSC                          0.1404857  0.0300397   4.677 2.92e-06 ***
## mealUndefined                  -0.5301862  0.1185561  -4.472 7.75e-06 ***
## market_segmentComplementary     0.5115529  0.2644214   1.935 0.053038 .  
## market_segmentCorporate        -0.1751410  0.2118239  -0.827 0.408337    
## market_segmentDirect            0.0314691  0.2316087   0.136 0.891923    
## market_segmentGroups            0.0638232  0.2199366   0.290 0.771672    
## market_segmentOffline TA/TO    -0.5500209  0.2207341  -2.492 0.012710 *  
## market_segmentOnline TA         0.7859308  0.2200038   3.572 0.000354 ***
## distribution_channelDirect     -0.5865386  0.1072301  -5.470 4.50e-08 ***
## distribution_channelGDS        -0.8968919  0.2347669  -3.820 0.000133 ***
## distribution_channelTA/TO      -0.1256727  0.0815080  -1.542 0.123111    
## distribution_channelUndefined  -7.3656229 72.4629559  -0.102 0.919037    
## is_repeated_guest              -0.7521839  0.0969619  -7.758 8.66e-15 ***
## previous_cancellations          2.7343790  0.0694837  39.353  < 2e-16 ***
## previous_bookings_not_canceled -0.4556729  0.0287004 -15.877  < 2e-16 ***
## booking_changes                -0.4162856  0.0181431 -22.945  < 2e-16 ***
## customer_typeGroup             -0.0171859  0.1966399  -0.087 0.930355    
## customer_typeTransient          0.7939001  0.0630203  12.598  < 2e-16 ***
## customer_typeTransient-Party    0.3991553  0.0670266   5.955 2.60e-09 ***
## adr                             0.0042139  0.0002522  16.712  < 2e-16 ***
## total_of_special_requests      -0.7400127  0.0135619 -54.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 87345  on 73358  degrees of freedom
## Residual deviance: 74012  on 73317  degrees of freedom
## AIC: 74096
## 
## Number of Fisher Scoring iterations: 8

Test

https://www.kaggle.com/ugurdar/logistic-regression-roc-auc-hotel-booking-data#Model-Building

test_data$PredProb = predict.glm(m4, newdata = test_data, type = 'response')

test_data$PredSur = ifelse(test_data$PredProb >= 0.5, 1,0)

Run confusion matrix, calculate F1, and Misclassification Error

caret::confusionMatrix(as.factor(test_data$is_canceled), as.factor(test_data$PredSur), mode="everything", positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 21169  1271
##          1  5722  3278
##                                           
##                Accuracy : 0.7776          
##                  95% CI : (0.7729, 0.7822)
##     No Information Rate : 0.8553          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3611          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7206          
##             Specificity : 0.7872          
##          Pos Pred Value : 0.3642          
##          Neg Pred Value : 0.9434          
##               Precision : 0.3642          
##                  Recall : 0.7206          
##                      F1 : 0.4839          
##              Prevalence : 0.1447          
##          Detection Rate : 0.1043          
##    Detection Prevalence : 0.2863          
##       Balanced Accuracy : 0.7539          
##                                           
##        'Positive' Class : 1               
##

Accuracy = 77.7% F1 score of 48%

Run Random Forest on full Training Data.

this error occurs when using the full data set. We will shorten the data to make sure we can run the RF Error: cannot allocate vector of size 1.1 Gb

set.seed(123)
train_rf.idx <- sample(1:nrow(hotel_data), size = 0.50 * nrow(hotel_data))
train_rf_data <- hotel_data[train_rf.idx,]
test_rf_data <- hotel_data[-train_rf.idx,]

set.seed(123)
forest2 <- randomForest(is_canceled ~.-deposit_type -days_in_waiting_list -required_car_parking_spaces -children -babies,
                             data = train_rf_data, 
                             importance = TRUE, 
                             ntree = 1000)

forest2

## 
## Call:
##  randomForest(formula = is_canceled ~ . - deposit_type - days_in_waiting_list -      required_car_parking_spaces - children - babies, data = train_rf_data,      importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 16.73%
## Confusion matrix:
##       0    1 class.error
## 0 34832 2712  0.07223524
## 1  6053 8802  0.40747223

F1 score of 66.8%. The higher the F1 score, the better. The random forest has better accuracy at 84% and 66.8% F1 score. This is important if there is a cost associated with contacting guest who we think might cancel their bookings. For example, if we send every guest a promo to keep them from canceling, this could be costly if we have a high amount of False Positives (positive meaning that our model gives a positive score of ‘1’ that the guest will cancel). But since these guests were likely to come anyways, we now give guests promos, even when we didn’t need to. Therefore F1 score could be an important metric.

For this reason, we choose the Random Forest Model. In the future, it may be useful to run a RF with hyperparmeters.

Hotel_Booking