Hotel Booking Data Analysis

1: Introduction

Hotel cancellations are a bad news for the business yet it’s unavoidable. The over and under estimation of guest can lead to wastage of resources for eg. if a large group of individuals were to cancel their booking, at the last moment, this may lead to wastage of raw food materials and it is often difficult to find new guests at such a short notice

If, hotel’s registration were to have an algorithm, which can predict the possibility of cancellation, it is easier on the hotel manager to make desicions. We are trying to analyse the booking data of 3 years of multiple hotels and trying to find a common trait which can be seen in the case of cancellation. In short, we are trying to determine, what are some of the common characteristics which indicate possibility of a cancellation.

We will identify which parameters are correlated with cancellation. After deciding the correlated variables, we will see whether they positevely influence the cancellation or negatively influence the cancellation. For eg, whether more number of children/babies indicate there’s higher probability of a cancellation or a lower probability of a cancellation. Finally, we plan to come up with a logistic regression model which can predict cancellation.

The consumer will be benefitted in the following ways:

  1. If the analysis were to determine, that there’s a possibility of cancellation, then the system can auto-generate an email 2 weeks prior the booking date, asking the guests to verify their reservation. The auto generated email can have a voting feature which the guests can use, to send their response.
  2. If no response, is received within a week, another auto generated email can be sent to the guests informing them, to confirm their decision. This email can also let the guests know that if no response is received within 3 days, there will be auto cancellation of the reservation
  3. If our analysis is right, and if the guest now wish to cancel their registration, they can let the hotel staff know via email response that they wish to cancel.
  4. This will help the hotel to adjust the room avialabity well in advance so that registration can be reopened on those specific days.
  5. This will ensure the balance of supply and demand equation, and day to day activities can be run in seamless manner
  6. If our analysis is wrong, and the guests do not wish to cancel their reservation, they can confirm their reservation status to let the hotel staff know
  7. The cancellation algorithm may not alwayse be right in prediction, however with this system in place, emergency situations can definately be avoided

2: Packages Required

library(plyr)
library(dplyr)
library(psych)
library(leaps)
library(ggplot2)
library(tibble)
library(tidyverse)
library(ROCR)
library(verification)

The purpose of the packages:

  1. Package psych is used to plot multiple histograms.
  2. Package plyr and dplyr are used to use functionality %>% which takes the output from previous command and treated as input for the next command
  3. Package leaps is used for forward selection
  4. Package ggplot2 is used to plot graphs
  5. Package ROCR is used for calculating accuracy of the model
  6. Package Verification is used for plotting ROC curve

3: Data Preparation

The data is obtained from sciencedirect.com website which can be accessed by clicking here The authors of this dataset are Nuno Antonio, Ana de Almeida, and Luis Nunes. The following extract is taken from the sciencedirect website.

This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can have an important role for research and education in revenue management, machine learning, or data mining, as well as in other fields

Below table gives us more information about the dataset like how it was colleceted and for which industry this data is useful

Title Description
Subject area Hospitality Management
More specific subject area Revenue Management
How data was acquired Extraction from hotels’ Property Management System (PMS) SQL databases
Data format Mixed (raw and preprocessed)
Experimental factors Some of the variables were engineered from other variables from different database tables. The data point time for each observation was defined as the day prior to each booking׳s arrival
Experimental features Data was extracted via TSQL queries executed directly in the hotels’ PMS databases and R was employed to perform data analysis
Data source location Both hotels are located in Portugal: H1 at the resort region of Algarve and H2 at the city of Lisbon
Subject area Hospitality Management
Data accessibility Data is supplied with the paper



Here’s the table which contains all the information about the columns present in the data.


Variable Type Description
ADR Numeric Average Daily Rate as defined by [5]
Adults Integer Number of adults
Agent Categorical ID of the travel agency that made the bookinga
ArrivalDateDayOfMonth Integer Day of the month of the arrival date
ArrivalDateMonth Categorical Month of arrival date with 12 categories: “January” to “December”
ArrivalDateWeekNumber Integer Week number of the arrival date
ArrivalDateYear Integer Year of arrival date
AssignedRoomType Categorical Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
Babies Integer Number of babies
BookingChanges Integer Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
Children Integer Number of children
Company Categorical ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
Country Categorical Country of origin. Categories are represented in the ISO 3155–3:2013 format [6]
CustomerType Categorical Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
DaysInWaitingList Integer Number of days the booking was in the waiting list before it was confirmed to the customer
DepositType Categorical Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
DistributionChannel Categorical Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
IsCanceled Categorical Value indicating if the booking was canceled (1) or not (0)
IsRepeatedGuest Categorical Value indicating if the booking name was from a repeated guest (1) or not (0)
LeadTime Integer Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
MarketSegment Categorical Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
Meal Categorical Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
PreviousBookingsNotCanceled Integer Number of previous bookings not cancelled by the customer prior to the current booking
PreviousCancellations Integer Number of previous bookings that were cancelled by the customer prior to the current booking
RequiredCardParkingSpaces Integer Number of car parking spaces required by the customer
ReservationStatus Categorical Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
ReservationStatusDate Date Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel
ReservedRoomType Categorical Code of room type reserved. Code is presented instead of designation for anonymity reasons
StaysInWeekendNights Integer Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
StaysInWeekNights Integer Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
TotalOfSpecialRequests Integer Number of special requests made by the customer (e.g. twin bed or high floor)

The below table gives mean, standard deviation, and all the quantiles for resort hotels



The below table gives the same information City hotels


# Read the dataset from CSV file
hotels <- read.csv("hotels.csv")
# To get number of NA values per column
colSums(is.na(hotels))  
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
# To see how many observations have at least a single NA value
sum(complete.cases(hotels) == FALSE)
## [1] 4
# To see how many rows do not have a single NA value
sum(complete.cases(hotels) == TRUE)
## [1] 119386

Interpretation: There are hardly any missing values in the dataset. There are only 4 NA’s in the children column which can safely be imputed as 0

# Replacing the 4 NA values by imputing 0
hotels$children[is.na(hotels$children)] = 0
# Verifying whether there are any rows with missing values
sum(complete.cases(hotels) == FALSE)
## [1] 0

Interpretation: After imputing, now there are no missing values in our dataset

# To show 50 rows of the data
head(hotels, 10)
##           hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1  Resort Hotel           0       342              2015               July
## 2  Resort Hotel           0       737              2015               July
## 3  Resort Hotel           0         7              2015               July
## 4  Resort Hotel           0        13              2015               July
## 5  Resort Hotel           0        14              2015               July
## 6  Resort Hotel           0        14              2015               July
## 7  Resort Hotel           0         0              2015               July
## 8  Resort Hotel           0         9              2015               July
## 9  Resort Hotel           1        85              2015               July
## 10 Resort Hotel           1        75              2015               July
##    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                        27                         1                       0
## 2                        27                         1                       0
## 3                        27                         1                       0
## 4                        27                         1                       0
## 5                        27                         1                       0
## 6                        27                         1                       0
## 7                        27                         1                       0
## 8                        27                         1                       0
## 9                        27                         1                       0
## 10                       27                         1                       0
##    stays_in_week_nights adults children babies meal country market_segment
## 1                     0      2        0      0   BB     PRT         Direct
## 2                     0      2        0      0   BB     PRT         Direct
## 3                     1      1        0      0   BB     GBR         Direct
## 4                     1      1        0      0   BB     GBR      Corporate
## 5                     2      2        0      0   BB     GBR      Online TA
## 6                     2      2        0      0   BB     GBR      Online TA
## 7                     2      2        0      0   BB     PRT         Direct
## 8                     2      2        0      0   FB     PRT         Direct
## 9                     3      2        0      0   BB     PRT      Online TA
## 10                    3      2        0      0   HB     PRT  Offline TA/TO
##    distribution_channel is_repeated_guest previous_cancellations
## 1                Direct                 0                      0
## 2                Direct                 0                      0
## 3                Direct                 0                      0
## 4             Corporate                 0                      0
## 5                 TA/TO                 0                      0
## 6                 TA/TO                 0                      0
## 7                Direct                 0                      0
## 8                Direct                 0                      0
## 9                 TA/TO                 0                      0
## 10                TA/TO                 0                      0
##    previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                               0                  C                  C
## 2                               0                  C                  C
## 3                               0                  A                  C
## 4                               0                  A                  A
## 5                               0                  A                  A
## 6                               0                  A                  A
## 7                               0                  C                  C
## 8                               0                  C                  C
## 9                               0                  A                  A
## 10                              0                  D                  D
##    booking_changes deposit_type agent company days_in_waiting_list
## 1                3   No Deposit  NULL    NULL                    0
## 2                4   No Deposit  NULL    NULL                    0
## 3                0   No Deposit  NULL    NULL                    0
## 4                0   No Deposit   304    NULL                    0
## 5                0   No Deposit   240    NULL                    0
## 6                0   No Deposit   240    NULL                    0
## 7                0   No Deposit  NULL    NULL                    0
## 8                0   No Deposit   303    NULL                    0
## 9                0   No Deposit   240    NULL                    0
## 10               0   No Deposit    15    NULL                    0
##    customer_type   adr required_car_parking_spaces total_of_special_requests
## 1      Transient   0.0                           0                         0
## 2      Transient   0.0                           0                         0
## 3      Transient  75.0                           0                         0
## 4      Transient  75.0                           0                         0
## 5      Transient  98.0                           0                         1
## 6      Transient  98.0                           0                         1
## 7      Transient 107.0                           0                         0
## 8      Transient 103.0                           0                         1
## 9      Transient  82.0                           0                         1
## 10     Transient 105.5                           0                         0
##    reservation_status reservation_status_date
## 1           Check-Out              2015-07-01
## 2           Check-Out              2015-07-01
## 3           Check-Out              2015-07-02
## 4           Check-Out              2015-07-02
## 5           Check-Out              2015-07-03
## 6           Check-Out              2015-07-03
## 7           Check-Out              2015-07-03
## 8           Check-Out              2015-07-03
## 9            Canceled              2015-05-06
## 10           Canceled              2015-04-22
# Plot histograms of numeric columns
multi.hist(hotels[,sapply(hotels, is.numeric)])

4: Proposed Exploratory Data Analysis

To find the total number of guests for each observation we can add 3 columns adults, children and babies

hotels$no_of_guests = rowSums(cbind(hotels$adults,hotels$children, hotels$babies),na.rm = TRUE)

We can combine 3 columns to get date in a single column

hotels$arrival_date_month <- factor(hotels$arrival_date_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"), labels = c(1,2,3,4,5,6,7,8,9,10,11,12))
hotels$arrival_date <- as.Date(with(hotels, paste(hotels$arrival_date_year, hotels$arrival_date_month, hotels$arrival_date_day_of_month,sep = "-")), "%Y-%m-%d")

Now the date is acquired in one column, we can delete date of the day and year columns as they are not needed, and if required can be extracted from date column

#hotels <- hotels %>% select(-hotels$arrival_date_year, hotels$arrival_date_day_of_month)

We can see how many times reserved room type was assigned and how many times it was changed. We can calculate this by matching the two columns

hotels$binary_room_type <-  as.character(hotels$assigned_room_type) == as.character(hotels$reserved_room_type)
round(sum(hotels$binary_room_type) / nrow(hotels), 2)
## [1] 0.88

As we can see, 88% of the times, reserved room was assigned and 12% of the times a different room was assigned.

To calculate total number of nights spent, we can add up two columns to get that result.

hotels$total_nights <- hotels$stays_in_weekend_nights + hotels$stays_in_week_nights

We are using barplots to determine how many cancellations were made monthly and based on hotel type.

count <- table(hotels$is_canceled, hotels$arrival_date_month)

barplot(count, main = "Number of cancellations made monthly",horiz = TRUE, names.arg = c("Jan", "Feb","March","April","May","June","July","Aug","Sept","Oct","Nov","Dec"),cex.names = 0.4)

Interpretation : We conclude that cancellations in August month were highest.Thus interpreting that the hotel owners should specifically focus at this time of the year to atttract customers.

counts <- table(hotels$is_canceled, hotels$hotel)

barplot(counts, main = "Cancellations made based on hotel type",
xlab =  "Number of Cancellations",col = c("darkblue","red"),legend = rownames(counts), beside = TRUE)

Interpretation : From the above graph we conclude that the most of the booking cancelation is in city hotel.

ggplot(data = hotels, aes( x = hotel, y = lead_time, fill = factor(is_canceled))) +
geom_boxplot(position = position_dodge()) +
labs( title = "Cancellation Boxplot by Hotel Type", x = "Hotel Type", y = "Lead Time in days") +
scale_fill_discrete( name = "Booking Status", breaks = c("0", "1"), labels = c("Cancelled", "Not Cancelled")) + theme_light()

Interpretation: We can see from the box plot that the cancellation normally occurs soon after the booking is done. If sufficient time has elapsed after the booking, then there are less chances of cancellation.

hotels %>% group_by(arrival_date_month) %>% summarize(Mean_Adults = mean(adults)) -> ADULTS
hotels %>% group_by(arrival_date_month) %>% summarize(Mean_Children = mean(children)) -> CHILDREN
ADULTS <- data.frame(ADULTS)
CHILDREN <- data.frame(CHILDREN)
Adults_Children <- merge(ADULTS, CHILDREN, by = "arrival_date_month")
   
ggplot(data = Adults_Children) + 
  geom_line(aes(y = Mean_Adults, x = arrival_date_month, group = 1, col = "Adults")) + 
  geom_line(aes(y = Mean_Children, x = arrival_date_month, group = 1, col = "Children")) +
  ggtitle("Average Adults and Children Reserved for by Arrival Month") + xlab("Month") + ylab("Average Days") +
  theme(axis.text.x = element_text(angle = 40))

Interpretation: As we can see, in the months June and July, there are highest number of visitors. This peak can be seen in both adults as well as children. The next two peaks are in the month of May (due to Summer vacations) and in the month of December (Christmas & Thanksgiving period)

ggplot(hotels, aes(x = arrival_date_month, y = days_in_waiting_list, group = 1)) + stat_summary(fun = "mean", geom = "line", col = "black") + 
ggtitle("Average days on Waiting List by Arrival Month") + ylab("Average Days for waiting") + xlab("Month Number") + theme(axis.text.x = element_text(angle = 40))
## Warning: Ignoring unknown parameters: fun
## No summary function supplied, defaulting to `mean_se()

Interpretation: Waiting period for the months May and October is the highest and waiting period is least on September, february and november respectively.

To see which numeric columns are correlated with our cancellation column, we subset the numeric dataset and only see the correlation with cancellation column

hotels_corr <- hotels[unlist(lapply(hotels, is.numeric))]
cor(hotels_corr)[,1] * 100
##                    is_canceled                      lead_time 
##                    100.0000000                     29.3123356 
##              arrival_date_year       arrival_date_week_number 
##                      1.6659860                      0.8148065 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                     -0.6130079                     -0.1791078 
##           stays_in_week_nights                         adults 
##                      2.4764629                      6.0017213 
##                       children                         babies 
##                      0.5036255                     -3.2491089 
##              is_repeated_guest         previous_cancellations 
##                     -8.4793418                     11.0132808 
## previous_bookings_not_canceled                booking_changes 
##                     -5.7357723                    -14.4380991 
##           days_in_waiting_list                            adr 
##                      5.4185824                      4.7556598 
##    required_car_parking_spaces      total_of_special_requests 
##                    -19.5497817                    -23.4657774 
##                   no_of_guests                   total_nights 
##                      4.6521756                      1.7779269

We do not find any strong correlation of cancellation with other variables. There are some weak correlations like lead_time (29%), required_car_parking_spaces (-19%) and total special requests (-23%)

5: Logistic Regression to predict cancellation

We plan to use logistic regression which can predict the cancellation

#We are creating a null model which will help us in forward variable selection

nullmodel = glm(is_canceled~1, data = hotels, family = "binomial")

model_1 <- glm(is_canceled ~ lead_time  + hotel + no_of_guests + deposit_type + adr + total_of_special_requests + total_nights + binary_room_type + required_car_parking_spaces + booking_changes + days_in_waiting_list + customer_type, data = hotels , family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_1)
## 
## Call:
## glm(formula = is_canceled ~ lead_time + hotel + no_of_guests + 
##     deposit_type + adr + total_of_special_requests + total_nights + 
##     binary_room_type + required_car_parking_spaces + booking_changes + 
##     days_in_waiting_list + customer_type, family = "binomial", 
##     data = hotels)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0823  -0.8275  -0.3440   0.2023   3.6969  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -4.055e+00  6.489e-02 -62.492  < 2e-16 ***
## lead_time                     4.009e-03  8.558e-05  46.847  < 2e-16 ***
## hotelResort Hotel            -5.394e-02  1.699e-02  -3.174 0.001502 ** 
## no_of_guests                  1.812e-01  1.269e-02  14.276  < 2e-16 ***
## deposit_typeNon Refund        5.086e+00  1.061e-01  47.948  < 2e-16 ***
## deposit_typeRefundable        1.613e-02  2.073e-01   0.078 0.937980    
## adr                           3.763e-03  1.771e-04  21.255  < 2e-16 ***
## total_of_special_requests    -5.509e-01  1.052e-02 -52.352  < 2e-16 ***
## total_nights                  1.830e-02  3.052e-03   5.996 2.03e-09 ***
## binary_room_typeTRUE          1.966e+00  3.954e-02  49.732  < 2e-16 ***
## required_car_parking_spaces  -1.651e+01  4.139e+01  -0.399 0.690066    
## booking_changes              -4.043e-01  1.521e-02 -26.590  < 2e-16 ***
## days_in_waiting_list         -2.669e-03  4.807e-04  -5.553 2.81e-08 ***
## customer_typeGroup           -5.947e-01  1.630e-01  -3.648 0.000265 ***
## customer_typeTransient        8.373e-01  4.608e-02  18.170  < 2e-16 ***
## customer_typeTransient-Party  6.630e-02  4.868e-02   1.362 0.173200    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157398  on 119389  degrees of freedom
## Residual deviance: 107028  on 119374  degrees of freedom
## AIC: 107060
## 
## Number of Fisher Scoring iterations: 16
## To find the probability of the cancellation, 
hotels$predict <- predict(model_1, type = 'response')
prob = mean(hotels$is_canceled)

hotels$predictions <- ifelse(hotels$predict > prob , 1,0)
#This operation show us the model accuracy
mean(hotels$is_canceled == hotels$predictions) * 100
## [1] 75.23829

Interpretation: This model is more accurate 77.65%

#We are using forward selection method to identify whether we can build a better model with greater accuracy but with fewer variables

model_step_f <- step(nullmodel, scope = list(lower = nullmodel, upper = model_1), direction = 'forward')
## Start:  AIC=157400.2
## is_canceled ~ 1
## 
##                               Df Deviance    AIC
## + deposit_type                 2   126130 126136
## + lead_time                    1   147158 147162
## + binary_room_type             1   148090 148094
## + total_of_special_requests    1   150096 150100
## + required_car_parking_spaces  1   150249 150253
## + booking_changes              1   154057 154061
## + customer_type                3   155059 155067
## + hotel                        1   155120 155124
## + days_in_waiting_list         1   157056 157060
## + adr                          1   157115 157119
## + no_of_guests                 1   157128 157132
## + total_nights                 1   157361 157365
## <none>                             157398 157400
## 
## Step:  AIC=126135.6
## is_canceled ~ deposit_type
## 
##                               Df Deviance    AIC
## + binary_room_type             1   119998 120006
## + required_car_parking_spaces  1   120953 120961
## + lead_time                    1   123754 123762
## + total_of_special_requests    1   124375 124383
## + booking_changes              1   124779 124787
## + adr                          1   124875 124883
## + no_of_guests                 1   125006 125014
## + customer_type                3   125156 125168
## + total_nights                 1   125492 125500
## + hotel                        1   125694 125702
## + days_in_waiting_list         1   126127 126135
## <none>                             126130 126136
## 
## Step:  AIC=120005.6
## is_canceled ~ deposit_type + binary_room_type
## 
##                               Df Deviance    AIC
## + required_car_parking_spaces  1   115180 115190
## + total_of_special_requests    1   118068 118078
## + lead_time                    1   118208 118218
## + booking_changes              1   118962 118972
## + no_of_guests                 1   119133 119143
## + customer_type                3   119136 119150
## + adr                          1   119358 119368
## + total_nights                 1   119674 119684
## + hotel                        1   119842 119852
## <none>                             119998 120006
## + days_in_waiting_list         1   119996 120006
## 
## Step:  AIC=115190.1
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces
## 
##                             Df Deviance    AIC
## + total_of_special_requests  1   113467 113479
## + lead_time                  1   113763 113775
## + no_of_guests               1   114050 114062
## + customer_type              3   114052 114068
## + adr                        1   114259 114271
## + booking_changes            1   114328 114340
## + total_nights               1   114901 114913
## + days_in_waiting_list       1   115174 115186
## <none>                           115180 115190
## + hotel                      1   115179 115191
## 
## Step:  AIC=113478.9
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests
## 
##                        Df Deviance    AIC
## + customer_type         3   111758 111776
## + no_of_guests          1   111850 111864
## + lead_time             1   112004 112018
## + adr                   1   112046 112060
## + booking_changes       1   112630 112644
## + total_nights          1   113109 113123
## + days_in_waiting_list  1   113441 113455
## + hotel                 1   113465 113479
## <none>                      113467 113479
## 
## Step:  AIC=111776.3
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type
## 
##                        Df Deviance    AIC
## + lead_time             1   109070 109090
## + no_of_guests          1   110586 110606
## + adr                   1   110842 110862
## + booking_changes       1   111086 111106
## + total_nights          1   111426 111446
## + hotel                 1   111755 111775
## + days_in_waiting_list  1   111756 111776
## <none>                      111758 111776
## 
## Step:  AIC=109090
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time
## 
##                        Df Deviance    AIC
## + adr                   1   108185 108207
## + booking_changes       1   108284 108306
## + no_of_guests          1   108337 108359
## + days_in_waiting_list  1   109043 109065
## + total_nights          1   109043 109065
## + hotel                 1   109052 109074
## <none>                      109070 109090
## 
## Step:  AIC=108207.3
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr
## 
##                        Df Deviance    AIC
## + booking_changes       1   107329 107353
## + no_of_guests          1   107928 107952
## + days_in_waiting_list  1   108160 108184
## + total_nights          1   108164 108188
## + hotel                 1   108182 108206
## <none>                      108185 108207
## 
## Step:  AIC=107353.4
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr + 
##     booking_changes
## 
##                        Df Deviance    AIC
## + no_of_guests          1   107099 107125
## + total_nights          1   107289 107315
## + days_in_waiting_list  1   107297 107323
## <none>                      107329 107353
## + hotel                 1   107329 107355
## 
## Step:  AIC=107125
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr + 
##     booking_changes + no_of_guests
## 
##                        Df Deviance    AIC
## + days_in_waiting_list  1   107066 107094
## + total_nights          1   107069 107097
## <none>                      107099 107125
## + hotel                 1   107097 107125
## 
## Step:  AIC=107094.5
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr + 
##     booking_changes + no_of_guests + days_in_waiting_list
## 
##                Df Deviance    AIC
## + total_nights  1   107038 107068
## + hotel         1   107064 107094
## <none>              107066 107094
## 
## Step:  AIC=107068.5
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr + 
##     booking_changes + no_of_guests + days_in_waiting_list + total_nights
## 
##         Df Deviance    AIC
## + hotel  1   107028 107060
## <none>       107038 107068
## 
## Step:  AIC=107060.4
## is_canceled ~ deposit_type + binary_room_type + required_car_parking_spaces + 
##     total_of_special_requests + customer_type + lead_time + adr + 
##     booking_changes + no_of_guests + days_in_waiting_list + total_nights + 
##     hotel

Interpretation: As we can see, AIC is decreasing every time we are adding a variable. This concludes that, we cannot build a better model with fewer variables. This is the reason we have decided to add more covariates in our second model.

#Partition the data into train & test into 70-30 combination
index <- sample(nrow(hotels),nrow(hotels)*0.70)
train <- hotels[index,]
test <- hotels[-index,]

#Creating second model with more varibles
set.seed(10641207)
Model_2 <- glm(is_canceled ~ market_segment + hotel + lead_time + arrival_date_month + children + is_repeated_guest + adults + babies + previous_cancellations + deposit_type + booking_changes  + reserved_room_type + adr + days_in_waiting_list + customer_type + total_of_special_requests, data = train , family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(Model_2)
## 
## Call:
## glm(formula = is_canceled ~ market_segment + hotel + lead_time + 
##     arrival_date_month + children + is_repeated_guest + adults + 
##     babies + previous_cancellations + deposit_type + booking_changes + 
##     reserved_room_type + adr + days_in_waiting_list + customer_type + 
##     total_of_special_requests, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1932  -0.7519  -0.4848   0.2058   3.8111  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.004e+00  2.063e-01  -9.714  < 2e-16 ***
## market_segmentComplementary  -3.573e-01  2.489e-01  -1.436 0.151074    
## market_segmentCorporate      -5.906e-01  1.992e-01  -2.965 0.003027 ** 
## market_segmentDirect         -6.757e-01  1.948e-01  -3.469 0.000522 ***
## market_segmentGroups         -2.015e-01  1.966e-01  -1.025 0.305375    
## market_segmentOffline TA/TO  -7.839e-01  1.943e-01  -4.034 5.49e-05 ***
## market_segmentOnline TA       5.316e-01  1.928e-01   2.758 0.005819 ** 
## market_segmentUndefined       1.447e+01  1.970e+02   0.073 0.941423    
## hotelResort Hotel            -1.058e-01  2.117e-02  -4.996 5.85e-07 ***
## lead_time                     4.819e-03  1.111e-04  43.390  < 2e-16 ***
## arrival_date_month2           7.910e-02  5.476e-02   1.445 0.148597    
## arrival_date_month3          -1.357e-01  5.318e-02  -2.551 0.010745 *  
## arrival_date_month4          -4.101e-02  5.236e-02  -0.783 0.433459    
## arrival_date_month5          -8.792e-02  5.243e-02  -1.677 0.093580 .  
## arrival_date_month6          -1.785e-01  5.404e-02  -3.304 0.000954 ***
## arrival_date_month7          -1.874e-01  5.281e-02  -3.549 0.000387 ***
## arrival_date_month8          -1.131e-01  5.334e-02  -2.120 0.033977 *  
## arrival_date_month9          -2.196e-01  5.526e-02  -3.975 7.04e-05 ***
## arrival_date_month10         -7.374e-02  5.342e-02  -1.381 0.167420    
## arrival_date_month11          1.688e-02  5.781e-02   0.292 0.770237    
## arrival_date_month12          1.236e-01  5.705e-02   2.166 0.030301 *  
## children                      1.692e-01  2.813e-02   6.016 1.79e-09 ***
## is_repeated_guest            -1.351e+00  8.749e-02 -15.443  < 2e-16 ***
## adults                        1.440e-01  1.942e-02   7.418 1.19e-13 ***
## babies                        2.348e-01  1.021e-01   2.301 0.021409 *  
## previous_cancellations        1.698e+00  5.328e-02  31.879  < 2e-16 ***
## deposit_typeNon Refund        5.420e+00  1.287e-01  42.115  < 2e-16 ***
## deposit_typeRefundable       -5.264e-02  2.403e-01  -0.219 0.826610    
## booking_changes              -4.325e-01  1.826e-02 -23.689  < 2e-16 ***
## reserved_room_typeB           8.610e-02  8.705e-02   0.989 0.322633    
## reserved_room_typeC           1.398e-01  1.018e-01   1.374 0.169470    
## reserved_room_typeD          -2.775e-02  2.498e-02  -1.111 0.266608    
## reserved_room_typeE           7.695e-02  4.050e-02   1.900 0.057427 .  
## reserved_room_typeF          -3.872e-01  6.614e-02  -5.854 4.81e-09 ***
## reserved_room_typeG          -3.013e-01  7.674e-02  -3.926 8.64e-05 ***
## reserved_room_typeH          -3.333e-01  1.185e-01  -2.812 0.004926 ** 
## reserved_room_typeL           2.478e-01  1.253e+00   0.198 0.843214    
## reserved_room_typeP           1.354e+01  6.234e+01   0.217 0.828091    
## adr                           4.030e-03  2.708e-04  14.878  < 2e-16 ***
## days_in_waiting_list         -1.935e-03  5.671e-04  -3.411 0.000646 ***
## customer_typeGroup           -2.874e-01  1.985e-01  -1.448 0.147743    
## customer_typeTransient        5.897e-01  6.035e-02   9.772  < 2e-16 ***
## customer_typeTransient-Party  1.125e-01  6.414e-02   1.754 0.079478 .  
## total_of_special_requests    -7.493e-01  1.352e-02 -55.404  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110138  on 83572  degrees of freedom
## Residual deviance:  75955  on 83529  degrees of freedom
## AIC: 76043
## 
## Number of Fisher Scoring iterations: 10
#To find the accuracy of the second model with training data

train_pred <- predict(Model_2, train,type = 'response')

pred <- prediction(train_pred,train$is_canceled)
perform <- performance(pred,"acc")
max <- which.max(slot(perform,"y.values")[[1]])
prob <- slot(perform,"x.values")[[1]][max]
prob
##     72954 
## 0.4774827
train_pred1 <- ifelse(train_pred >  prob, 1,0)
mean(train$is_canceled == train_pred1)
## [1] 0.8019935

Interpretation: Accuracy of this model is 80.26% which is 3% more than our previous model

#Plotting Confusion Matrix for Training Dataset

train_table <- table(Actual = train$is_canceled,Predicted = train_pred1 )
train_table
##       Predicted
## Actual     0     1
##      0 48749  3906
##      1 12642 18276
# We can also calculate Missclassification error of training dataset

1 - sum(diag(train_table))/sum(train_table)
## [1] 0.1980065

Interpretation: From the cunfusion matrix, we can see that 67,035 (48476 + 18559) times, our model could identify the cancellation correctly. And 16,538 times our model failed to identify cancellation

#Plotting ROC and AUC for the training dataset

roc.plot( train$is_canceled, train_pred, threshold = seq(0,max(train_pred),0.01))

auc_train <- performance(pred,"auc")
auc_train <- unlist(slot(auc_train,"y.values"))
auc_train
## [1] 0.8312023

Interpretation: For the ROC curve, the more curved the graph towards the top left corner, the better. Here, since this ROC curve is significantly away from the 45 degree line, we can consider this as acceptable

#Checking accuracy on the test dataset
test_pred <- predict(Model_2, newdata = test,type = 'response')
test_pred1 <- ifelse(test_pred > prob , 1,0)
mean(test$is_canceled == test_pred1)
## [1] 0.8002066

Interpretation: Accuracy is 80% even on the testing dataset

#Creating Confusion matrix for testing dataste

Test_Table <- table(Actual = test$is_canceled,Predicted = test_pred1 )
Test_Table
##       Predicted
## Actual     0     1
##      0 20852  1659
##      1  5497  7809
#Calculating missclassification error for the testing dataset

1 - sum(diag(Test_Table))/sum(Test_Table)
## [1] 0.1997934

Interpretation: From the cunfusion matrix, we can see that, 28,691 (20,741 + 7,950) times, our model could identify the cancellation correctly. And 7,126 times our model failed to identify cancellation

#Plotting the ROC curve and AUC for the testing dataset

roc.plot(test$is_canceled, test_pred, threshold = seq(0,max(test_pred),0.01))

pred_test <- prediction(test_pred,test$is_canceled)
auc_test <- performance(pred_test,"auc")
auc_test <- unlist(slot(auc_test,"y.values"))
auc_test
## [1] 0.829879

Interpretation: For the ROC curve, the more curved the graph towards the top left corner, the better. Here, since this ROC curve is significantly away from the 45 degree line, we can consider this as acceptable

6: Summary

Hotels can experience a great deal of inconvinience if they were to have a cancellations in reservations. It is dificult to get new guests at short notices. This causes adverse effect on the hotels’ inventory and will result in financial losses (Assuming hotel isn’t charging cancellation fee)

The algorithm will identify the chances of cancellation and will auto generated email to the guests if there are chances of reservation.

Following are the insights that we got of our analysis:

1.The model correctly predicts the cancellation with the accuracy of 80%.
2.By distinguishing between City Hotel and Resort hotel our insights provided that most of the booking cancellation was done in Resort hotel. People plan ahead for vacations and make room and set aside money for booking a resort hotel, whereas city hotel bookings could be made merely a few days prior for work or general travel which could be cancelled for poor weather or last minute changes.
3.Month wise analysis helped us to conclude that the cancellations were highest in the month of August. Thus, resort hotels should focus at this time of the year to attract more customers.
4.There is 29% correlation between lead time and whether the booking is cancelled attribute, this concludes that there are high chances that the customer is planning to cancel his booking when his lead time is greater as it is the Number of days that elapsed between the entering date of the booking into the premise and the arrival date.

Based on the customer’s email response, the hotel can decide whether they need to up the availability of the rooms or not. This gives the hotel staff enough time to protect their business.

Below are the limitations of our analysis that we could recognize:

  1. The accurate prediction rate of our analysis is 80%, with advanced regression techniques this rate can be increased.
  2. To predict whether or not a booking will be canceled, the number of booking changes is a possible source of leakage, because this information can change over time. However by making use of forward selection method,we concluded that we require booking_changes variable for accurate prediction.