MidTermProject-1.knit

Introduction

Background

Hospitality generates revenue for local economies directly when tourists spend money in hotels, restaurants and entertainment venues. Hospitality industry is growing, with more and more people spending their money for vacation and leisure activities. People may only lodge into a hotel when it’s a holiday season or a special event, thus the demand for staying in room is not equally distributed across the year. Hotel industry is a very volatile industry and the bookings depend on variety of factors such as type of hotels, seasonality, days of week and many more. This makes analyzing the patterns available in the past data more important to help the hotels plan better. Using the historical data, hotels can perform various campaigns to boost the business. The data consists of around 119,390 booking transactions from 2 hotel: an anonymous city hotel from Lisbon and a resort hotel from Algarve. The dataset comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. There is so much to explore from this data, but we will only focus on demand forecasting.

Proposed Analytical Methodology

We will tackle this problem statement in three segments:

We will be analyzing some key metrics for hotel bookings like: Most preferred meal types, country wise bookings, new customers acquired, the number of cancellations, number of bookings on weekday vs weekends, customer lifetime value of the existing customers, type of rooms preferred by customers, booking types, hotels available for booking and the revenue of the hotels
Further, we will be using various angles to look through the data to analyze patterns associated with each segment such as: The type of hotel, day of week, type of customers,type of rooms
Furthermore, we will also try to to develop a predictive model of bookings with and without cancellation using logistic regression.

Benefit of Analysis

The project aims to gain interesting insight into customers’ behavior when booking a hotel. To maximize the revenue gained by the hotel, the management often employed a pricing strategy, one of them being raising the room rate when the demand is high and making a promo when the demand is low. Thus, the ability to accurately forecast the future demand is very important and became a vital part on the pricing scheme. The demand for different segment of customer may differ and forecasting become harder as it may requires different model for different segment.These insights can guide hotels to adjust their customer strategies and make preparation for unknown.

Packages Required

Rpart, rpart.plot and ROCR : These packages are used for building classification and regression models using decision trees. Further, we can visualize the tree structure and evaluate the performance of the models.
Tidyverse : This package consists of 6 core packages out of which the below 3 are most important for this project: dplyr: Used for data manipulation tidyr: Used for data modifications ggplot2: Used for creating powerful visualizations.
Forecast, tseries and sarima : These packages are used to model the time-series data including the seasonal component in the series (if any).

#Importing the packages required

library(rpart) #used for classification trees
library(rpart.plot) #used for plotting the trees
library(tidyverse) #used for data manipulation

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(rmarkdown) #used for formatting the markdown file
library(lubridate) #Used for manipulating dates

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tseries) #used for time-series forecasting

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(sarima) #used for time-series forecasting including seasonal components

## Loading required package: stats4

library(caret) #used for createDataPartition

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ROCR) #used for predict function

## Warning: package 'ROCR' was built under R version 4.1.2

Data Dictionary

The data set contains the following variables:

hotel : Type of Hotel whether Resort Hotel(H1) or City Hotel(H2)
is_canceled : Value indicating if the booking was canceled (1) or not (0)
lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year : Year of arrival date
arrival_date_month : Month of arrival date
arrival_date_week_number : week number of year for arrival date
arrival_date_day_of_month : Day of arrival date
stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults : Number of adults
children : Number of children
babies : Number of babies
meal : Type of meal booked. Undefined/SC – no meal package, BB – Bed & Breakfast, HB – Half board (breakfast and one other meal – usually dinner), FB – Full board (breakfast, lunch and dinner)
country : Country of origin.
market_segment : Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel : Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request.
booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type : Type of deposit made for booking: No Deposit – no deposit was made,Non Refund – a deposit was made in the value of the total stay cost, Refundable – a deposit was made with a value under the total cost of stay.
agent : ID of the travel agency that made the booking
company : ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type : Type of booking :Contract - when the booking has an allotment or other type of contract associated to it, Group – when the booking is associated to a group, Transient – when the booking is not part of a group or contract, and is not associated to other transient booking, Transient-party – when the booking is transient, but is associated to at least other transient booking
adr : Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces : Number of car parking spaces required by the customer
total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status : Reservation last status:Canceled – booking was canceled by the customer,Check-Out – customer has checked in but already departed, No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with the Reservation Status to understand when was the booking canceled or when did the customer checked-out of the hotel

Data Preparation

The source data can be found here: Hotel Data

Data Structure

This data set contains 119390 observations and 32 variables. Some of the variables are integer, others are character. It includes information about hotel names, lead time, number of children, agent, and much more.

#Importing the dataset
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')

## Rows: 119390 Columns: 32

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(hotels)

## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

#Structure of the dataset
str(hotels)

## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Missing values

By checking missing values for each variables, we can see that there are only 4 missing value for variable “children”, no missing value in other variables.

colSums(is.na(hotels))

##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

Taking care of the missing value

hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)

Exploratory Data Analysis

Total bookings across years

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count(is_repeated_guest) %>% 
  summarise(percent = round(n/nrow(hotels)*100,2), total = n)

## # A tibble: 2 x 3
##   is_repeated_guest percent  total
##               <dbl>   <dbl>  <int>
## 1                 0   96.8  115580
## 2                 1    3.19   3810

Leading time difference between two groups of guests

guestbehabar <- function(behavior){
   hotels %>%
    ggplot(aes(is_repeated_guest, fill = behavior)) +
    geom_bar(position = "fill") +
    labs(title = "Behavior feature by guest type", 
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "Percentage of group") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}
behaviourboxplot <- function(behavior){
   hotels %>%
    ggplot(aes(x = is_repeated_guest, y = behavior)) +
    geom_boxplot() +
    geom_jitter(width = .15, alpha = .2) +
    labs(title = "Behavior feature by guest type",
         subtitle = behavior,
         x = "Guest type (1 for repeated guests)", 
         y = "count") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black"))
}
behaviourboxplot(hotels$lead_time)

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Type of hotel booked by repeated guests

guestbehabar(hotels$hotel)

It turns out that repeated guests prefer City Hotel other than Resort Hotel. However, the percentage of repeated guests that book Resort Hotel is 46.7%, while only 33.1% of unrepeated guests book Resort Hotel.

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(hotel == "Resort Hotel") %>% 
  count() -> filhot       
hotels %>% 
  group_by(is_repeated_guest) %>% 
  count() -> total

as.data.frame(filhot/total)

##   is_repeated_guest         n
## 1               NaN 0.3312165
## 2                 1 0.4666667

Repeated guests cancelling booking

hotels$is_repeated_guest <- as.factor(hotels$is_repeated_guest)
hotels$is_canceled <- as.factor(hotels$is_canceled)
guestbehabar(hotels$is_canceled)

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(is_canceled == "1") %>% 
  count() -> filcan

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count() -> total

as.data.frame(filcan/total)

## Warning in Ops.factor(left, right): '/' not meaningful for factors

##   is_repeated_guest         n
## 1                NA 0.3778508
## 2                NA 0.1448819

Meal choice of repeated guests

guestbehabar(hotels$meal)

table(hotels$meal, hotels$is_repeated_guest)

##            
##                 0     1
##   BB        88837  3473
##   FB          789     9
##   HB        14277   186
##   SC        10540   110
##   Undefined  1137    32

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(meal == "BB") %>% 
  count() -> filmea     

as.data.frame(filmea/total)

## Warning in Ops.factor(left, right): '/' not meaningful for factors

##   is_repeated_guest         n
## 1                NA 0.7686191
## 2                NA 0.9115486

Percentage of repeated guests making deposit

guestbehabar(hotels$deposit_type)

Normally 98.2% of repeated guests don’t make deposit. A possible reason is that they are reliable guests, so that they don’t need to make deposit. The percentage for unrepeated guests is 87.3%.

hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(deposit_type == "No Deposit") %>% 
  count() -> fildep       

as.data.frame(fildep/total)

## Warning in Ops.factor(left, right): '/' not meaningful for factors

##   is_repeated_guest         n
## 1                NA 0.8729798
## 2                NA 0.9821522

Customer type of repeated guests

guestbehabar(hotels$customer_type)

80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking. 4.2% of the repeated guests make bookings that are associated to a group. Both of these two numbers are higher than that of unrepeated guests making the bookings.

custype <- function(type){
  hotels %>% 
  group_by(is_repeated_guest) %>% 
  filter(customer_type == type) %>% 
  count()
}

custype("Transient") -> filtra
custype("Group") -> filgro

as.data.frame(filtra/total)

## Warning in Ops.factor(left, right): '/' not meaningful for factors

##   is_repeated_guest         n
## 1                NA 0.7487455
## 2                NA 0.8065617

Logistic Regression

Splitting the dataset into training and test sets

The data was Randomly splitted to training (80%) and testing (20%) datasets.

i <- createDataPartition(hotels$is_canceled, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- hotels[-i,]
# select 80% of data to train the models
trainset <- hotels[i,]

Training a Logistic Regression Model

Fitting the full model with some explanatory variables, after excluding some lesser interesting variables such as date, month, year, country etc.

fullmodel <- glm(is_canceled ~ hotel + lead_time  + adr  
                 + total_of_special_requests + distribution_channel +
                 is_repeated_guest + previous_cancellations + booking_changes +
                 deposit_type + days_in_waiting_list + required_car_parking_spaces +
                 customer_type, family = "binomial" , data=trainset)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Further find the best model using backward elimination technique.

# Backward Elimination
model_step_b <- step(fullmodel,direction='backward')

## Start:  AIC=86045.78
## is_canceled ~ hotel + lead_time + adr + total_of_special_requests + 
##     distribution_channel + is_repeated_guest + previous_cancellations + 
##     booking_changes + deposit_type + days_in_waiting_list + required_car_parking_spaces + 
##     customer_type

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                               Df Deviance   AIC
## - hotel                        1    86009 86045
## <none>                              86008 86046
## - days_in_waiting_list         1    86020 86056
## - is_repeated_guest            1    86369 86405
## - booking_changes              1    86589 86625
## - distribution_channel         4    87072 87102
## - adr                          1    87301 87337
## - lead_time                    1    87769 87805
## - previous_cancellations       1    87954 87990
## - customer_type                3    88124 88156
## - total_of_special_requests    1    88747 88783
## - required_car_parking_spaces  1    89483 89519
## - deposit_type                 2    94884 94918

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Step:  AIC=86045.32
## is_canceled ~ lead_time + adr + total_of_special_requests + distribution_channel + 
##     is_repeated_guest + previous_cancellations + booking_changes + 
##     deposit_type + days_in_waiting_list + required_car_parking_spaces + 
##     customer_type

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                               Df Deviance   AIC
## <none>                              86009 86045
## - days_in_waiting_list         1    86022 86056
## - is_repeated_guest            1    86370 86404
## - booking_changes              1    86589 86623
## - distribution_channel         4    87081 87109
## - adr                          1    87308 87342
## - lead_time                    1    87786 87820
## - previous_cancellations       1    87956 87990
## - customer_type                3    88130 88160
## - total_of_special_requests    1    88747 88781
## - required_car_parking_spaces  1    89557 89591
## - deposit_type                 2    94950 94982

#Remove hotel and days_in_waiting_list, Lower AIC or BOC value indicates a better fit.
Finalmodel <- glm(is_canceled ~  lead_time  + adr + total_of_special_requests + distribution_channel +
                    is_repeated_guest + previous_cancellations + booking_changes +
                    deposit_type + required_car_parking_spaces +
                    customer_type, family = "binomial" , data=trainset)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(Finalmodel)

## 
## Call:
## glm(formula = is_canceled ~ lead_time + adr + total_of_special_requests + 
##     distribution_channel + is_repeated_guest + previous_cancellations + 
##     booking_changes + deposit_type + required_car_parking_spaces + 
##     customer_type, family = "binomial", data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.6016  -0.7944  -0.4589   0.2252   3.6189  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.823e+00  7.110e-02 -39.708  < 2e-16 ***
## lead_time                      3.902e-03  9.285e-05  42.021  < 2e-16 ***
## adr                            6.460e-03  1.806e-04  35.761  < 2e-16 ***
## total_of_special_requests     -5.879e-01  1.191e-02 -49.358  < 2e-16 ***
## distribution_channelDirect    -2.735e-01  5.365e-02  -5.099 3.42e-07 ***
## distribution_channelGDS       -4.749e-01  2.248e-01  -2.113   0.0346 *  
## distribution_channelTA/TO      5.895e-01  4.706e-02  12.526  < 2e-16 ***
## distribution_channelUndefined  1.737e+01  3.319e+02   0.052   0.9582    
## is_repeated_guest1            -1.383e+00  8.096e-02 -17.084  < 2e-16 ***
## previous_cancellations         1.906e+00  5.263e-02  36.208  < 2e-16 ***
## booking_changes               -3.709e-01  1.688e-02 -21.977  < 2e-16 ***
## deposit_typeNon Refund         4.817e+00  1.144e-01  42.115  < 2e-16 ***
## deposit_typeRefundable         5.075e-01  2.308e-01   2.199   0.0279 *  
## required_car_parking_spaces   -2.857e+01  5.736e+01  -0.498   0.6184    
## customer_typeGroup             5.388e-02  1.643e-01   0.328   0.7429    
## customer_typeTransient         1.102e+00  5.307e-02  20.774  < 2e-16 ***
## customer_typeTransient-Party   1.347e-01  5.512e-02   2.444   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 125920  on 95512  degrees of freedom
## Residual deviance:  86022  on 95496  degrees of freedom
## AIC: 86056
## 
## Number of Fisher Scoring iterations: 11

Finalmodel$deviance

## [1] 86022.41

AIC(Finalmodel)

## [1] 86056.41

BIC(Finalmodel)

## [1] 86217.35

In-Sample Prediction and ROC curve

In-sample prediction and Confusion Matrix with cut-off probability of 0.5 is calculated.

#Setting cut-off probability=0.5
table(predict(Finalmodel,type="response") > 0.5)

## 
## FALSE  TRUE 
## 73804 21709

#confusion matrix
pred_prob=predict(Finalmodel, data=trainset, type="response")
pred_value=1*(pred_prob>0.5)
actual_value <-trainset$is_canceled
confusion_matrix <- table(actual_value, pred_value)
confusion_matrix

##             pred_value
## actual_value     0     1
##            0 57017  3116
##            1 16787 18593

#misclasscification or error rate
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate #0.21

## [1] 0.20838

#In-sample prediction
pred.glm0.train<- predict(Finalmodel, type="response")
##ROC Curve
pred <- prediction(pred.glm0.train, as.numeric(trainset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8336302

Out-of-Sample Prediction and ROC curve Similarly, out-of-sample prediction was evaluated and ROC curve for testing data set was created. The Receiver Operating Characteristic (ROC) curve was created.

#out-of-sample prediction
pred.glm0.test<- predict(Finalmodel, newdata = testset, type="response")
##ROC Curve
pred <- prediction(pred.glm0.test, as.numeric(testset$is_canceled))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8306352

The logistic regression model correctly predicts the cancellation with 83% of accuracy.

Summary

The probability of a repeated guest cancel a booking is much lower than that of a unrepeated guest does. This informs that repeated guests are much loyal.
81.1% of repeated guests don’t make change of booking.
98.2% of repeated guests don’t make deposit. Higher than unrepeated guests do.
80.7% of the repeated guests make bookings that are not part of a group or contract, and are not associated to other transient booking.
Only 3.19% of the guests are repeated guests.
Repeated guests tend to book the hotel one month ahead of visiting, which is much shorter than that made by unrepeated guests. This indicates repeated guests don’t rush to book hotels may because they always know which hotel to book if visiting that place.
The logistic regression model correctly predicts the cancellation with 83% of accuracy.

Data Wrangling Final Project

Akash Khanvilkar, Guna Shanmukha Gedela, Sindhu Herle

Hotel Bookings Analysis