Initial setup: configure the data set and load the data file into the
variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand
data set for hotels such as City Hotel and Resort Hotel.
# Load the Data
hotel_data <- read.csv(file.choose())
Question:
#Explore the structure of the dataset
str(hotel_data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
#Summary statistics
summary(hotel_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
# Explore the first 6 rows of the hotel data (head() returns 6 rows by default)
head(hotel_data)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
# Check for missing values
colSums(is.na(hotel_data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
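Note that is.na() only counts true NA values. The str() output above shows agent and company storing the literal string "NULL", which colSums(is.na(...)) reports as 0 missing. Below is a minimal sketch of counting such placeholder strings and recoding them as NA. It uses a small made-up data frame (toy is hypothetical and is not the hotel data).

```r
# Toy data frame that mimics how agent/company store "NULL" as text.
# (Hypothetical values for illustration only.)
toy <- data.frame(
  agent    = c("NULL", "304", "240", "NULL"),
  company  = c("NULL", "NULL", "NULL", "110"),
  children = c(0, NA, 1, 0),
  stringsAsFactors = FALSE
)

# is.na() sees only the real NA in children
na_counts <- colSums(is.na(toy))

# Count literal "NULL" strings per column
null_counts <- colSums(toy == "NULL", na.rm = TRUE)

# Recode the placeholder as NA in character columns so later checks pick it up
toy_clean <- toy
toy_clean[] <- lapply(toy, function(col) {
  if (is.character(col)) replace(col, col == "NULL", NA) else col
})
clean_counts <- colSums(is.na(toy_clean))

print(na_counts)
print(null_counts)
print(clean_counts)
```

If the placeholders were recoded like this on the real data, na.omit() would drop every row with a missing agent or company, so it would probably need to be restricted to the columns used in the models.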
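# Remove rows with missing values so that both models are fitted to the same
# observations; anova() cannot compare models fitted to data sets of different sizes
hotel_data <- na.omit(hotel_data)
# Treat the response as a two-level factor ("0" = not canceled, "1" = canceled)
hotel_data$is_canceled <- factor(hotel_data$is_canceled)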
#Model building with is_canceled and lead_time
glm_morel_0 <- glm(is_canceled ~ lead_time, data = hotel_data, family = "binomial")
#Model building with is_canceled as the response and additive predictors lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, booking_changes, adr and total_of_special_requests
glm_morel_1 <- glm(is_canceled ~ lead_time + stays_in_weekend_nights + stays_in_week_nights + adults + children + babies + booking_changes + adr + total_of_special_requests, data = hotel_data, family = "binomial")
#Print summary of the model
summary_glm_morel_0 <- summary(glm_morel_0)
summary_glm_morel_1 <- summary(glm_morel_1)
summary_glm_morel_0
##
## Call:
## glm(formula = is_canceled ~ lead_time, family = "binomial", data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.166e+00 9.183e-03 -126.96 <2e-16 ***
## lead_time 5.856e-03 6.137e-05 95.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157390 on 119385 degrees of freedom
## Residual deviance: 147146 on 119384 degrees of freedom
## AIC: 147150
##
## Number of Fisher Scoring iterations: 4
summary_glm_morel_1
##
## Call:
## glm(formula = is_canceled ~ lead_time + stays_in_weekend_nights +
## stays_in_week_nights + adults + children + babies + booking_changes +
## adr + total_of_special_requests, family = "binomial", data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.418e+00 2.816e-02 -50.363 < 2e-16 ***
## lead_time 5.952e-03 6.687e-05 89.008 < 2e-16 ***
## stays_in_weekend_nights -2.185e-02 7.602e-03 -2.874 0.00406 **
## stays_in_week_nights 3.236e-03 4.055e-03 0.798 0.42483
## adults 1.106e-01 1.434e-02 7.711 1.25e-14 ***
## children 1.993e-02 1.761e-02 1.131 0.25788
## babies 5.974e-02 8.243e-02 0.725 0.46856
## booking_changes -7.739e-01 1.592e-02 -48.620 < 2e-16 ***
## adr 5.483e-03 1.574e-04 34.847 < 2e-16 ***
## total_of_special_requests -7.638e-01 1.011e-02 -75.520 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157390 on 119385 degrees of freedom
## Residual deviance: 136010 on 119376 degrees of freedom
## AIC: 136030
##
## Number of Fisher Scoring iterations: 5
lead_time_coef <- summary_glm_morel_1$coefficients["lead_time", "Estimate"]
lead_time_se <- summary_glm_morel_1$coefficients["lead_time", "Std. Error"]
# Print the coefficient and standard error of lead_time (model glm_morel_1)
cat("1. Coefficient for lead_time (glm_morel_1):", lead_time_coef, "\n")
## 1. Coefficient for lead_time (glm_morel_1): 0.005951949
cat("2. Standard Error for lead_time:", lead_time_se, "\n")
## 2. Standard Error for lead_time: 6.686953e-05
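A logistic regression coefficient is on the log-odds scale, so it is often easier to read as an odds ratio. The sketch below re-enters the two printed values as literals (so it runs without the data) and derives the Wald z statistic, its p-value, the odds ratio per extra day of lead time, and an approximate 95% Wald interval. The 30-day step is an arbitrary illustrative choice, not something taken from the analysis.

```r
# Values copied from the printed output above (glm_morel_1, lead_time row)
lead_time_coef <- 0.005951949
lead_time_se   <- 6.686953e-05

# Wald z statistic and two-sided p-value
z_value <- lead_time_coef / lead_time_se
p_value <- 2 * pnorm(-abs(z_value))

# Odds ratio per extra day of lead time, with an approximate 95% Wald interval
odds_ratio <- exp(lead_time_coef)
ci_95 <- exp(lead_time_coef + c(-1, 1) * qnorm(0.975) * lead_time_se)

# Odds ratio for a 30-day increase in lead time (illustrative step size)
odds_ratio_30 <- exp(30 * lead_time_coef)

cat("z value:", z_value, " p-value:", p_value, "\n")
cat("Odds ratio per day:", odds_ratio, " 95% CI:", ci_95, "\n")
cat("Odds ratio per 30 days:", odds_ratio_30, "\n")
```

Under this reading, each additional day of lead time multiplies the odds of cancellation by roughly 1.006, with the other predictors held fixed.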
levels(hotel_data$is_canceled)
## [1] "0" "1"
probs <- predict(glm_morel_1, type = "response")
mean(probs[hotel_data$is_canceled == "0"])
## [1] 0.3061183
mean(probs[hotel_data$is_canceled == "1"])
## [1] 0.4796542
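The two group means above can be combined into one summary. Their difference is sometimes reported as Tjur's coefficient of discrimination, which ranges from 0 (no separation between the groups) to 1 (perfect separation). A minimal sketch with the printed means re-entered as literals:

```r
# Mean fitted probabilities copied from the output above
mean_prob_not_canceled <- 0.3061183  # bookings with is_canceled == "0"
mean_prob_canceled     <- 0.4796542  # bookings with is_canceled == "1"

# Difference in mean fitted probability between the two outcome groups
tjur_d <- mean_prob_canceled - mean_prob_not_canceled

cat("Coefficient of discrimination:", tjur_d, "\n")
```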
# Test the difference between the two models with a chi-square (likelihood ratio) test; rows with missing values were removed above so that both models are fitted to the same observations
anova(glm_morel_1, glm_morel_0, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: is_canceled ~ lead_time + stays_in_weekend_nights + stays_in_week_nights +
## adults + children + babies + booking_changes + adr + total_of_special_requests
## Model 2: is_canceled ~ lead_time
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 119376 136010
## 2 119384 147146 -8 -11137 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
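The test in the table can be reproduced by hand from the two model summaries, which helps when reading its direction. The drop in residual deviance from the smaller to the larger model is compared with a chi-square distribution whose degrees of freedom equal the number of added parameters. The sketch re-enters the printed, rounded values as literals, so the statistic differs from the table's 11137 by rounding.

```r
# Residual deviances and degrees of freedom copied from the summaries above
dev_small <- 147146; df_small <- 119384  # glm_morel_0: lead_time only
dev_large <- 136010; df_large <- 119376  # glm_morel_1: nine predictors

# Likelihood ratio statistic: how much the extra predictors reduce the deviance
lr_stat <- dev_small - dev_large
lr_df   <- df_small - df_large

# Upper-tail chi-square probability of a reduction at least this large
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC values copied from the summaries; lower is better
aic_diff <- 147150 - 136030

cat("LR statistic:", lr_stat, "on", lr_df, "df, p-value:", p_value, "\n")
cat("AIC reduction from adding the predictors:", aic_diff, "\n")
```

A small p-value here favours the larger model, since it means the added predictors reduce the deviance by more than chance alone would explain.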
## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: algorithm did not converge
## Warning: Computation failed in `stat_smooth()`
## Caused by error:
## ! y values must be 0 <= y <= 1
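The plotting code that produced these warnings is not shown, so the following is only a guess at the cause. The messages are consistent with is_canceled being passed to a binomial geom_smooth() after it was converted to a factor, because a factor's numeric codes are 1 and 2 rather than 0 and 1. The sketch uses simulated data (sim and its columns are hypothetical stand-ins for hotel_data) and converts the factor back to 0/1 before plotting. It draws the plot only if ggplot2 is installed.

```r
set.seed(1)

# Simulated stand-in for hotel_data (hypothetical values, not the real data)
sim <- data.frame(lead_time = round(runif(500, 0, 400)))
sim$is_canceled <- factor(rbinom(500, 1, plogis(-1.2 + 0.006 * sim$lead_time)))

# as.numeric() on a factor returns the level codes 1 and 2, which fall outside
# the 0..1 range a binomial fit expects; go through as.character() to get 0 and 1
sim$canceled_num <- as.numeric(as.character(sim$is_canceled))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sim, aes(x = lead_time, y = canceled_num)) +
    geom_point(alpha = 0.1) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                formula = y ~ x)
  print(p)
}
```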
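Conclusion: In the above results, the coefficients, their significance,
and the goodness of fit indicate that lead time is a significant
predictor of booking cancellations in the data set. In both models the
lead_time coefficient is positive (about 0.006 per day) with p < 2e-16,
so longer lead times are associated with higher odds of cancellation.
The plot cannot be used as supporting evidence, because geom_smooth()
failed with the warnings shown above.
The Analysis of Deviance Table shows that the model glm_morel_1, which
has multiple predictor variables, fits the data significantly better
than the model glm_morel_0, which has only lead time as a predictor: the
residual deviance falls from 147146 to 136010 for 8 additional
parameters (p < 2.2e-16), and the AIC falls from 147150 to 136030. Lead
time alone is therefore not sufficient for predicting the cancellation
status of hotel bookings. Adults, booking changes, adr, and total of
special requests are also highly significant predictors, and
stays_in_weekend_nights is significant at the 0.01 level, while
stays_in_week_nights, children, and babies are not significant. The mean
fitted probability is 0.31 for bookings that were not canceled and 0.48
for canceled bookings, so even the larger model separates the two groups
only moderately.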