Authors: Sreeparna Chatterjee | Abhishek Yadav | Deepak Narang

Introduction

The hotel industry is highly competitive and unpredictable, so clean, authentic, real-time data is crucial to it. Predicting customer behavior helps a hotel maintain a steady flow of revenue and plan the investment and effort required for marketing and advertising. The hospitality industry in the US is valued at close to $200 billion; the figure dipped slightly during the pandemic but is expected to return to normal as the pandemic fades. With intense competition in the online bidding market and several booking agencies targeting the same customers, models built on reliable, clean data can set a hotel apart. For this project, we use the dataset from Antonio, Almeida, and Nunes (2019).

Booking cancellation is an unavoidable but major challenge for the hotel industry because it directly impacts a hotel’s revenue. A model that predicts cancellations would let management plan for them and forecast demand more accurately, which can increase revenue significantly.

Our objectives are:

  1. To explore the data and identify the variables that affect the cancellation of a booking.
  2. To create visualizations that show the relationship between pairs of variables more clearly.
  3. To develop a model which can predict the cancellation of bookings.

We begin by importing the data, cleaning it by handling null or missing values, checking it for completeness and consistency, and examining the relationships between variables. To understand the data better, we then carry out basic EDA with visualizations such as bar charts, box plots, and other graphs to examine correlations between factors.

Packages Installed

The R packages used for data analysis in this project are:

  • data.table - To import data
  • tidyverse - For tidy data, visualisation, transformation
  • tibble - To create tibbles
  • plotrix - For 3D exploded pie chart
  • tidyr - For tidy data
  • dplyr - For data analysis
  • DT - To display data set
  • ggplot2 - For data visualization
  • magrittr - For pipe operator
  • knitr - To display tables
  • moments - For moments
  • caret - For partitioning the data
  • ROCR - For creating ROC curve
  • scales - To demonstrate ggplot2 style scales for specific types of data
  • pander - To produce simple tables from summary() output
  • plotly - For creating interactive charts

options(warn=-1)
library(data.table)   #import data
library(tidyverse)    #tidy data, visualisation, transformation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(tibble)       #create tibbles
library(plotrix)      #3D Exploded Pie Chart
library(tidyr)        #tidy data
library(dplyr)        #data analysis
library(DT)           #display data set
library(ggplot2)      #data visualization
library(magrittr)     #pipe operator
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(knitr)        #display tables
library(moments)      #moments
library(caret)        #partitioning the data
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:plotrix':
## 
##     rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(pander)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Data Pre-processing

Data Source

The hotel data for this analysis is sourced from the rfordatascience/tidytuesday GitHub repository. It contains the booking information we need to address the objectives listed above.

Data Import

The imported hotel data contains 119390 rows and 32 columns, which together hold all the information about each booking.

Here we read the data from the CSV file and check the number of rows and columns in the hotel data.

hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
data<-hotels
dim(data)
## [1] 119390     32
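The shape check above can also be made explicit, so that a changed or truncated download fails loudly instead of passing silently. The following is a minimal base R sketch under stated assumptions: the inline CSV is a made-up three-column stand-in for hotels.csv, and `check_shape()` is a hypothetical helper, not part of any package used in this report.

```r
# Hypothetical inline CSV standing in for hotels.csv (three columns only)
csv_text <- "hotel,is_canceled,lead_time
Resort Hotel,0,342
City Hotel,1,88
City Hotel,0,7"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical helper: stop with an error if the imported shape is unexpected
check_shape <- function(df, n_rows, n_cols) {
  stopifnot(nrow(df) == n_rows, ncol(df) == n_cols)
  invisible(df)
}

check_shape(toy, n_rows = 3, n_cols = 3)
dim(toy)
```

For the real data, the equivalent call would be `check_shape(data, 119390, 32)`, using the dimensions reported above.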

Data Dictionary

The description for Hotel Data variables is given below.

data.type <- lapply(data, class)
Description <- c("Hotel (H1 = Resort Hotel or H2 = City Hotel)",
                    "Value indicating if the booking was canceled (1) or not (0)",
                    "Number of days that elapsed between the entering date of the booking into the PMS and the arrival date",
                    "Year of arrival date",
                    "Month of arrival date",
                    "Week number of year for arrival date",
                    "Day of arrival date",
                    "Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel",
                    "Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel",
                    "Number of adults",
                    "Number of children",
                    "Number of babies",
                    "Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)",
                    "Country of origin. Categories are represented in the ISO 3155–3:2013 format",
                    "Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
                    "Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”",
                    "Value indicating if the booking name was from a repeated guest (1) or not (0)",
                    "Number of previous bookings that were cancelled by the customer prior to the current booking",
                    "Number of previous bookings not cancelled by the customer prior to the current booking",
                    "Code of room type reserved. Code is presented instead of designation for anonymity reasons",
                    "Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons",
                    "Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation",
                    "Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.",
                    "ID of the travel agency that made the booking",
                    "ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons",
                    "Number of days the booking was in the waiting list before it was confirmed to the customer",
                    "Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking",
                    "Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights",
                    "Number of car parking spaces required by the customer",
                    "Number of special requests made by the customer (e.g. twin bed or high floor)",
                    "Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why",
                    "Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel"
)
Variable <- colnames(data)
data.description <- as.data.frame(cbind(Variable, Description))
options(warn=-1)
data.description
##                          Variable
## 1                           hotel
## 2                     is_canceled
## 3                       lead_time
## 4               arrival_date_year
## 5              arrival_date_month
## 6        arrival_date_week_number
## 7       arrival_date_day_of_month
## 8         stays_in_weekend_nights
## 9            stays_in_week_nights
## 10                         adults
## 11                       children
## 12                         babies
## 13                           meal
## 14                        country
## 15                 market_segment
## 16           distribution_channel
## 17              is_repeated_guest
## 18         previous_cancellations
## 19 previous_bookings_not_canceled
## 20             reserved_room_type
## 21             assigned_room_type
## 22                booking_changes
## 23                   deposit_type
## 24                          agent
## 25                        company
## 26           days_in_waiting_list
## 27                  customer_type
## 28                            adr
## 29    required_car_parking_spaces
## 30      total_of_special_requests
## 31             reservation_status
## 32        reservation_status_date
##                                                                                                                                                                                                                                                                                                                                                                                                                Description
## 1                                                                                                                                                                                                                                                                                                                                                                             Hotel (H1 = Resort Hotel or H2 = City Hotel)
## 2                                                                                                                                                                                                                                                                                                                                                              Value indicating if the booking was canceled (1) or not (0)
## 3                                                                                                                                                                                                                                                                                                                   Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
## 4                                                                                                                                                                                                                                                                                                                                                                                                     Year of arrival date
## 5                                                                                                                                                                                                                                                                                                                                                                                                    Month of arrival date
## 6                                                                                                                                                                                                                                                                                                                                                                                     Week number of year for arrival date
## 7                                                                                                                                                                                                                                                                                                                                                                                                      Day of arrival date
## 8                                                                                                                                                                                                                                                                                                                            Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
## 9                                                                                                                                                                                                                                                                                                                                 Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
## 10                                                                                                                                                                                                                                                                                                                                                                                                        Number of adults
## 11                                                                                                                                                                                                                                                                                                                                                                                                      Number of children
## 12                                                                                                                                                                                                                                                                                                                                                                                                        Number of babies
## 13                                                                                                                                                                                                                                 Type of meal booked. SC for no meal package, BB for Bed & Breakfast, HB for Half board (breakfast and one other meal – usually dinner), FB for Full board (breakfast, lunch and dinner)
## 14                                                                                                                                                                                                                                                                                                                                             Country of origin. Categories are represented in the ISO 3155–3:2013 format
## 15                                                                                                                                                                                                                                                                                                          Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
## 16                                                                                                                                                                                                                                                                                                                       Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
## 17                                                                                                                                                                                                                                                                                                                                           Value indicating if the booking name was from a repeated guest (1) or not (0)
## 18                                                                                                                                                                                                                                                                                                                            Number of previous bookings that were cancelled by the customer prior to the current booking
## 19                                                                                                                                                                                                                                                                                                                                  Number of previous bookings not cancelled by the customer prior to the current booking
## 20                                                                                                                                                                                                                                                                                                                              Code of room type reserved. Code is presented instead of designation for anonymity reasons
## 21                                                                                                                                                        Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
## 22                                                                                                                                                                                                                                                                        Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
## 23                                                                                                                     Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
## 24                                                                                                                                                                                                                                                                                                                                                                           ID of the travel agency that made the booking
## 25                                                                                                                                                                                                                                                                      ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
## 26                                                                                                                                                                                                                                                                                                                              Number of days the booking was in the waiting list before it was confirmed to the customer
## 27 Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
## 28                                                                                                                                                                                                                                                                                                     Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
## 29                                                                                                                                                                                                                                                                                                                                                                   Number of car parking spaces required by the customer
## 30                                                                                                                                                                                                                                                                                                                                           Number of special requests made by the customer (e.g. twin bed or high floor)
## 31                                                                                                                                                                    Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
## 32                                                                                                                                                                                                                Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel
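The chunk above computes `data.type` but does not include it in the displayed table. A small sketch of how the column classes could sit next to the descriptions is shown below. It uses a made-up three-column data frame and shortened, illustrative descriptions (not the real dictionary text), and it checks that there is exactly one description per column before binding them.

```r
# Illustrative stand-in for the hotel data (values are made up)
toy <- data.frame(
  hotel = c("City Hotel", "Resort Hotel"),
  is_canceled = c(0, 1),
  lead_time = c(88, 342),
  stringsAsFactors = FALSE
)

# Shortened, illustrative descriptions; one per column, in column order
description <- c(
  "Hotel type",
  "1 if the booking was canceled, 0 otherwise",
  "Days between booking and arrival"
)

# Guard against a description vector that has drifted out of sync with the data
stopifnot(length(description) == ncol(toy))

dictionary <- data.frame(
  Variable = names(toy),
  Type = unname(vapply(toy, function(col) class(col)[1], character(1))),
  Description = description,
  stringsAsFactors = FALSE
)
dictionary
```

Building the table with `data.frame()` rather than `cbind()` keeps each column's own type, and the length check catches a missing or extra description early.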

Data Manipulation

Converting variables to factors

data<-data%>%
  mutate(
         hotel=as.factor(hotel),      
         is_canceled=as.factor(is_canceled),
         meal=as.factor(meal),
         country=as.factor(country),
         market_segment=as.factor(market_segment),
         distribution_channel=as.factor(distribution_channel),
         is_repeated_guest=as.factor(is_repeated_guest),
         reserved_room_type=as.factor(reserved_room_type),
         assigned_room_type=as.factor(assigned_room_type),
         deposit_type=as.factor(deposit_type),
         customer_type=as.factor(customer_type),
         reservation_status=as.factor(reservation_status),
         agent=as.factor(agent),
         company=as.factor(company),
         arrival_date_day_of_month=as.factor(arrival_date_day_of_month),
         arrival_date_month=as.factor(arrival_date_month),
         arrival_date_year=as.factor(arrival_date_year))
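As a compact alternative to listing every column inside `mutate()`, the same conversion can be driven by a vector of column names. This is a base R sketch on a made-up data frame, swapped in for the dplyr version above so that it runs without extra packages; the column subset is illustrative only.

```r
# Illustrative stand-in for the hotel data (values are made up)
toy <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time = c(88, 342, 7),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical
factor_cols <- c("hotel", "is_canceled")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)

# Caveat: as.numeric() on a factor returns the internal level codes,
# so go through as.character() to recover the original 0/1 values
codes <- as.numeric(toy$is_canceled)
values <- as.numeric(as.character(toy$is_canceled))
```

The caveat matters for `is_canceled`: once it is a factor, any later arithmetic on it should convert through `as.character()` first.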

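New variables “arrival_date”, “nights_stay” and “total_guests” were created from existing variables. The columns used to build “nights_stay” and “total_guests”, along with “arrival_date_week_number”, were then dropped.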

#creating new variable: arrival_date
data$arrival_date <- paste(data$arrival_date_month, 
                            data$arrival_date_day_of_month,
                            data$arrival_date_year,sep="-")
data$arrival_date <-as.Date(data$arrival_date, format="%B-%d-%Y")

#creating new variable: nights_stay and total_guests

data <- data %>% 
          dplyr::mutate(nights_stay = stays_in_weekend_nights + stays_in_week_nights,
                  total_guests = adults + children + babies)%>% 
          dplyr::select(-c(stays_in_weekend_nights, stays_in_week_nights, adults, children, babies,
                        arrival_date_week_number ))
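One caveat about the `arrival_date` step above: the `%B` format code matches month names in the current locale, so the parse can return `NA` on a machine that is not set to an English locale. A locale-independent sketch, assuming the month column always holds full English month names, maps the names to numbers with the base R constant `month.name`. The data frame here is a made-up stand-in with factor columns, mirroring the factor conversion above.

```r
# Illustrative stand-in for the three arrival_date_* factor columns
toy <- data.frame(
  arrival_date_year = factor(c(2015, 2016, 2017)),
  arrival_date_month = factor(c("July", "February", "August")),
  arrival_date_day_of_month = factor(c(1, 29, 31))
)

# Factors must go through as.character() before numeric conversion
year <- as.integer(as.character(toy$arrival_date_year))
day <- as.integer(as.character(toy$arrival_date_day_of_month))

# month.name holds the twelve English month names, independent of locale
month <- match(as.character(toy$arrival_date_month), month.name)

# Build ISO 8601 strings (YYYY-MM-DD), which as.Date() parses by default
toy$arrival_date <- as.Date(sprintf("%d-%02d-%02d", year, month, day))
toy$arrival_date
```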

The summary below shows the distribution of each variable after these transformations.

summary(data)
##           hotel       is_canceled   lead_time   arrival_date_year
##  City Hotel  :79330   0:75166     Min.   :  0   2015:21996       
##  Resort Hotel:40060   1:44224     1st Qu.: 18   2016:56707       
##                                   Median : 69   2017:40687       
##                                   Mean   :104                    
##                                   3rd Qu.:160                    
##                                   Max.   :737                    
##                                                                  
##  arrival_date_month arrival_date_day_of_month        meal          country     
##  August :13877      17     : 4406             BB       :92310   PRT    :48590  
##  July   :12661      5      : 4317             FB       :  798   GBR    :12129  
##  May    :11791      15     : 4196             HB       :14463   FRA    :10415  
##  October:11160      25     : 4160             SC       :10650   ESP    : 8568  
##  April  :11089      26     : 4147             Undefined: 1169   DEU    : 7287  
##  June   :10939      9      : 4096                               ITA    : 3766  
##  (Other):47873      (Other):94068                               (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      0:115580         
##  Offline TA/TO:24219   Direct   :14645      1:  3810         
##  Groups       :19811   GDS      :  193                       
##  Direct       :12606   TA/TO    :97870                       
##  Corporate    : 5295   Undefined:    5                       
##  Complementary:  743                                         
##  (Other)      :  239                                         
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                A      :85994     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                D      :19201     
##  Median : 0.00000       Median : 0.0000                E      : 6535     
##  Mean   : 0.08712       Mean   : 0.1371                F      : 2897     
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                G      : 2094     
##  Max.   :26.00000       Max.   :72.0000                B      : 1118     
##                                                        (Other): 1551     
##  assigned_room_type booking_changes       deposit_type        agent      
##  A      :74053      Min.   : 0.0000   No Deposit:104641   9      :31961  
##  D      :25322      1st Qu.: 0.0000   Non Refund: 14587   NULL   :16340  
##  E      : 7806      Median : 0.0000   Refundable:   162   240    :13922  
##  F      : 3751      Mean   : 0.2211                       1      : 7191  
##  G      : 2553      3rd Qu.: 0.0000                       14     : 3640  
##  C      : 2375      Max.   :21.0000                       7      : 3539  
##  (Other): 3530                                            (Other):42797  
##     company       days_in_waiting_list         customer_type  
##  NULL   :112593   Min.   :  0.000      Contract       : 4076  
##  40     :   927   1st Qu.:  0.000      Group          :  577  
##  223    :   784   Median :  0.000      Transient      :89613  
##  67     :   267   Mean   :  2.321      Transient-Party:25124  
##  45     :   250   3rd Qu.:  0.000                             
##  153    :   215   Max.   :391.000                             
##  (Other):  4354                                               
##       adr          required_car_parking_spaces total_of_special_requests
##  Min.   :  -6.38   Min.   :0.00000             Min.   :0.0000           
##  1st Qu.:  69.29   1st Qu.:0.00000             1st Qu.:0.0000           
##  Median :  94.58   Median :0.00000             Median :0.0000           
##  Mean   : 101.83   Mean   :0.06252             Mean   :0.5714           
##  3rd Qu.: 126.00   3rd Qu.:0.00000             3rd Qu.:1.0000           
##  Max.   :5400.00   Max.   :8.00000             Max.   :5.0000           
##                                                                         
##  reservation_status reservation_status_date  arrival_date       
##  Canceled :43017    Min.   :2014-10-17      Min.   :2015-07-01  
##  Check-Out:75166    1st Qu.:2016-02-01      1st Qu.:2016-03-13  
##  No-Show  : 1207    Median :2016-08-07      Median :2016-09-06  
##                     Mean   :2016-07-30      Mean   :2016-08-28  
##                     3rd Qu.:2017-02-08      3rd Qu.:2017-03-18  
##                     Max.   :2017-09-14      Max.   :2017-08-31  
##                                                                 
##   nights_stay      total_guests   
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 3.000   Median : 2.000  
##  Mean   : 3.428   Mean   : 1.968  
##  3rd Qu.: 4.000   3rd Qu.: 2.000  
##  Max.   :69.000   Max.   :55.000  
##                   NA's   :4

Bookings with an “Undefined” meal are recoded as “SC”, because both values mean the customer chose no meal package at the hotel. The result is stored in a new variable, “new_meal”, and the existing variable “meal” is dropped.

data<- data %>% 
         dplyr::mutate(new_meal = fct_collapse(meal, SC = c("Undefined" , "SC"),
                                       BB = "BB",
                                       FB = "FB",
                                       HB = "HB"),
         new_meal = fct_relevel(new_meal, "FB", "HB", "BB", "SC"),
         new_meal = fct_explicit_na(new_meal)) %>% 
        dplyr::select(-meal) 
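
As a quick sanity check on this kind of recoding, the same merge can be sketched in base R on a toy factor. The vector `meal_toy` is made up for illustration and is not part of the dataset; assigning an existing level name to another level merges the two.

```r
# Toy factor standing in for the meal column (values are illustrative only)
meal_toy <- factor(c("BB", "SC", "Undefined", "HB", "FB", "Undefined"))

# Renaming the "Undefined" level to "SC" merges the two levels into one
new_meal_toy <- meal_toy
levels(new_meal_toy)[levels(new_meal_toy) == "Undefined"] <- "SC"

# Compare the counts before and after the merge
table(meal_toy)
table(new_meal_toy)
```

After the merge, the "SC" count should equal the sum of the original "SC" and "Undefined" counts.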

Missing Values

From the summary we can see that there are no missing values except in “total_guests”. Its 4 missing values are replaced by 0.

data$total_guests[is.na(data$total_guests)]  <- 0
colSums(is.na(data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month      arrival_date_day_of_month 
##                              0                              0 
##                        country                 market_segment 
##                              0                              0 
##           distribution_channel              is_repeated_guest 
##                              0                              0 
##         previous_cancellations previous_bookings_not_canceled 
##                              0                              0 
##             reserved_room_type             assigned_room_type 
##                              0                              0 
##                booking_changes                   deposit_type 
##                              0                              0 
##                          agent                        company 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##                            adr    required_car_parking_spaces 
##                              0                              0 
##      total_of_special_requests             reservation_status 
##                              0                              0 
##        reservation_status_date                   arrival_date 
##                              0                              0 
##                    nights_stay                   total_guests 
##                              0                              0 
##                       new_meal 
##                              0

Exploratory Data Analysis

Monthly Cancellations

The hotel data runs from July 2015 to August 2017, so most months are covered by only two years of data. Seasonality affects the number of bookings: bookings drop around March to May, and more bookings were made in November, December, January and February.

library(ggplot2)
data %>% ggplot(aes(x=arrival_date_month, fill=hotel)) + 
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("City Hotel", "Resort Hotel")) +
  scale_y_continuous(name = "Bookings",labels = scales::comma) +
  guides(fill=guide_legend(title=NULL))  +
  facet_grid(arrival_date_year ~ .) + 
  theme(legend.position="bottom", axis.text.x=element_text(angle=0, hjust=1, vjust=0.5)) +
  scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))+
  labs(x="", y="Bookings") +
  ggtitle("Bookings Data", subtitle="7/1/2015 - 8/31/2017")

The chart below shows canceled and non-canceled bookings for City Hotel and Resort Hotel, with each bar labelled by its share of all bookings. City Hotel had more cancellations than Resort Hotel.

data %>% ggplot( aes(x=hotel, 
                       fill=is_canceled)) + 
  geom_bar(position="dodge") +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label=scales::percent(after_stat(count)/sum(after_stat(count)))),position=position_dodge(0.9), vjust=1.5) +
  scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("Not Canceled", "Canceled")) +
  guides(fill=guide_legend(title=NULL))  +
  ggtitle("City and Resort Hotels Booking with/without Cancellation") 
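
The bar labels in this chart are computed as `count / sum(count)`, which as far as we can tell is taken over all bookings rather than within each hotel. To read off a cancellation rate per hotel type, one option is a row-wise proportion table. A minimal sketch on made-up data; the `toy` data frame is an assumption for illustration, and with the real data the same call would take `data$hotel` and `data$is_canceled`.

```r
# Made-up bookings: 4 per hotel type (illustrative values only)
toy <- data.frame(
  hotel = c("City Hotel", "City Hotel", "City Hotel", "City Hotel",
            "Resort Hotel", "Resort Hotel", "Resort Hotel", "Resort Hotel"),
  is_canceled = c(1, 1, 0, 0, 1, 0, 0, 0)
)

# Cross-tabulate hotel type against cancellation status
counts <- table(hotel = toy$hotel, is_canceled = toy$is_canceled)

# margin = 1 normalises each row, so every hotel's shares sum to 1
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

The column for `is_canceled = 1` then holds the within-hotel cancellation rate, which is the quantity to compare across hotel types.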

Lead Time

Lead time is the number of days between the date on which a booking was made and the arrival date.

We want to see the effect of Lead Time on Cancellations.

# Calculating average lead time for every month
avg_monthly_lead <- hotels %>%  group_by(arrival_date_month) %>% 
  summarise(avg_lead_time=mean(lead_time))
data %>%  group_by(arrival_date_month, is_canceled) %>% 
  summarize(avg_lead_time=mean(lead_time)) %>% 
  ggplot(aes(x=arrival_date_month, y=avg_lead_time, fill=is_canceled)) +
  geom_col(position="dodge") + 
  geom_text(aes(label=round(avg_lead_time)), position=position_dodge(0.9), vjust=1.5, size=3) +
  scale_fill_manual(values=c("skyblue4", "plum4"),labels=c("Not Canceled", "Canceled")) +
  guides(fill=guide_legend(title=NULL)) +
  theme(legend.position="bottom", axis.text.x=element_text(angle=45, hjust=1, vjust=0.5)) +
  scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))+
  labs(x="", y="Lead Time in Days", title = "Average Lead Time per Month")
## `summarise()` has grouped output by 'arrival_date_month'. You can override using the `.groups` argument.

The chart above shows that the average lead time is longer for canceled bookings than for bookings that were not canceled.

Modelling Exercise

Why did we do Modelling?

We wanted a more robust way of identifying the patterns or drivers behind booking cancellations.

We decided to fit a logistic regression model whose response variable is whether a booking is canceled (1) or not (0).

Logistic Regression

Split the dataset into training and test sets

We randomly assign 70% of the rows to the training set and the remaining 30% to the test set:

dt = sort(sample(nrow(data), nrow(data)*.7))
train<-data[dt,]
test<-data[-dt,]
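
Because `sample()` draws a different set of rows on every run, the fitted coefficients and performance figures change slightly each time the report is knitted. Fixing the random seed first makes the split repeatable. A minimal sketch on a toy data frame; the seed value 123 and the `toy` data frame are arbitrary choices for illustration, not the settings used for the results in this report.

```r
# Fix the random number generator so the same rows are drawn every time
set.seed(123)

# Toy data frame standing in for the bookings data (illustrative only)
toy <- data.frame(id = 1:10, y = rep(c(0, 1), 5))

# 70/30 split: draw the training row indices, the rest form the test set
n_train   <- floor(0.7 * nrow(toy))
train_idx <- sort(sample(nrow(toy), n_train))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```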

Model

Fitting the logistic regression model:

# The option that controls warnings is `warn`; `warning` is not a recognised option
options(warn = -1)
model <- glm(is_canceled ~ hotel + lead_time  + adr +   total_guests + 
                   total_of_special_requests + distribution_channel +
                 is_repeated_guest + previous_cancellations + booking_changes +
                 deposit_type + days_in_waiting_list + required_car_parking_spaces +
                 customer_type, family = "binomial" , data=train)

Summary of the logistic regression model

summary(model)
## 
## Call:
## glm(formula = is_canceled ~ hotel + lead_time + adr + total_guests + 
##     total_of_special_requests + distribution_channel + is_repeated_guest + 
##     previous_cancellations + booking_changes + deposit_type + 
##     days_in_waiting_list + required_car_parking_spaces + customer_type, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.6445  -0.7905  -0.4655   0.2011   3.6080  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.944e+00  7.882e-02 -37.355  < 2e-16 ***
## hotelResort Hotel              1.057e-02  1.966e-02   0.538 0.590622    
## lead_time                      3.744e-03  1.011e-04  37.041  < 2e-16 ***
## adr                            5.550e-03  2.102e-04  26.401  < 2e-16 ***
## total_guests                   1.264e-01  1.443e-02   8.758  < 2e-16 ***
## total_of_special_requests     -5.876e-01  1.275e-02 -46.086  < 2e-16 ***
## distribution_channelDirect    -3.644e-01  5.670e-02  -6.426 1.31e-10 ***
## distribution_channelGDS       -3.090e-01  2.276e-01  -1.358 0.174501    
## distribution_channelTA/TO      5.079e-01  4.949e-02  10.264  < 2e-16 ***
## distribution_channelUndefined  1.370e+01  6.862e+01   0.200 0.841802    
## is_repeated_guest1            -1.408e+00  8.733e-02 -16.123  < 2e-16 ***
## previous_cancellations         1.930e+00  5.705e-02  33.826  < 2e-16 ***
## booking_changes               -3.700e-01  1.815e-02 -20.380  < 2e-16 ***
## deposit_typeNon Refund         4.990e+00  1.308e-01  38.147  < 2e-16 ***
## deposit_typeRefundable         3.844e-01  2.560e-01   1.502 0.133165    
## days_in_waiting_list          -1.903e-03  5.555e-04  -3.426 0.000613 ***
## required_car_parking_spaces   -2.158e+01  9.521e+01  -0.227 0.820698    
## customer_typeGroup            -1.336e-01  1.895e-01  -0.705 0.480797    
## customer_typeTransient         1.141e+00  5.825e-02  19.592  < 2e-16 ***
## customer_typeTransient-Party   2.310e-01  6.067e-02   3.808 0.000140 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110022  on 83572  degrees of freedom
## Residual deviance:  75017  on 83553  degrees of freedom
## AIC: 75057
## 
## Number of Fisher Scoring iterations: 17
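
The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below copies three estimates from the summary output by hand (the vector `coefs` is built manually for illustration; with the fitted object one would call `exp(coef(model))` instead). The estimates for `required_car_parking_spaces` and `distribution_channelUndefined` have very large standard errors, so their odds ratios are best left uninterpreted.

```r
# Estimates copied from the model summary (log-odds scale)
coefs <- c(
  lead_time                 = 3.744e-03,
  total_of_special_requests = -5.876e-01,
  previous_cancellations    = 1.930e+00
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds ratio for 30 additional days of lead time:
# the coefficient is multiplied by 30 before exponentiating
or_30_days <- exp(30 * coefs[["lead_time"]])
round(or_30_days, 3)
```

Under this model, each additional special request multiplies the odds of cancellation by roughly 0.56, and 30 extra days of lead time multiply them by roughly 1.12, holding the other predictors fixed.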

Model Performance : Train Data set

model$deviance
## [1] 75016.92
AIC(model)
## [1] 75056.92
BIC(model)
## [1] 75243.59

In sample Prediction

#Considering cut-off probability=0.5
table(predict(model,type="response") > 0.5)
## 
## FALSE  TRUE 
## 64892 18681
#confusion matrix
# predict() takes `newdata`, not `data`; an unknown `data` argument is silently ignored
pp <- predict(model, newdata = train, type = "response")
p_val=1*(pp>0.5)
a_val <-train$is_canceled
confusion_matrix <- table(a_val, p_val)
confusion_matrix
##      p_val
## a_val     0     1
##     0 50134  2629
##     1 14758 16052
#misclassification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2080457
#In-sample prediction
pred.glm0.train<- predict(model, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(train$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8337266
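
The misclassification rate hides how the errors are distributed between the two classes. The sketch below re-enters the in-sample confusion matrix printed above by hand (the object `cm` is typed in for illustration; in the report it would be `confusion_matrix`) and derives accuracy, sensitivity and specificity from it.

```r
# In-sample confusion matrix copied from the output above
# rows = actual class (a_val), columns = predicted class (p_val)
cm <- matrix(
  c(50134, 14758,   # predicted 0: actual 0, actual 1
     2629, 16052),  # predicted 1: actual 0, actual 1
  nrow = 2,
  dimnames = list(a_val = c("0", "1"), p_val = c("0", "1"))
)

# Share of all bookings classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual cancellations that the model flags as canceled
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of non-canceled bookings that the model leaves unflagged
specificity <- cm["0", "0"] / sum(cm["0", ])

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

At the 0.5 cut-off the model identifies about 95% of non-canceled bookings correctly but only about 52% of cancellations, so a lower cut-off may be worth considering if missed cancellations are the more costly error.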

Out of sample Prediction

#out-of-sample prediction
pred.glm0.test<- predict(model, newdata = test, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(test$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8329665
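
The AUC can be read as the probability that the model assigns a randomly chosen canceled booking a higher predicted probability than a randomly chosen non-canceled one. A small sketch of that rank-based calculation on made-up scores; `auc_rank` is a hypothetical helper written for illustration and is not part of ROCR.

```r
# Rank-based (Mann-Whitney) AUC: the share of positive/negative pairs
# in which the positive case receives the higher score
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # tied scores receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities and true labels (illustrative only)
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

auc_toy <- auc_rank(scores, labels)
auc_toy  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```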

Summary

  1. The logistic regression model separates canceled from non-canceled bookings well, with an area under the ROC curve (AUC) of about 0.83.
  2. Model performance on the training data: Deviance 75016.92 | AIC 75056.92 | BIC 75243.59
  3. In-sample misclassification rate (cut-off 0.5): 20.80% | In-sample AUC: 83.4%
  4. Out-of-sample AUC: 83.3%
  5. Overall, using the variables present in our data, the model ranks a randomly chosen canceled booking above a randomly chosen non-canceled booking about 83% of the time. At a 0.5 cut-off it classifies about 79% of the training bookings correctly.