Initial setup: configure the data set by loading the data file into the
variable hotel_data.
Data set - Hotels: the data come from an open hotel booking demand
data set covering hotels such as City Hotel and Resort Hotel.
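library(ggplot2) # needed for the ggplot() calls further down
# file.choose() opens an interactive file picker for the CSV file
hotel_data <- read.csv(file.choose())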
summary(hotel_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
head(hotel_data)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
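The head() output shows the literal string NULL in the agent and company columns, which is why summary() reports both as character. A minimal sketch of one way to handle this at load time, assuming the file path is known instead of being picked interactively: the na.strings argument of read.csv() can map those strings to NA. The three-row CSV and the names csv_path and demo are made up for illustration.

```r
# Hypothetical three-row stand-in for the booking file, written to a temp path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,lead_time,agent,company",
  "Resort Hotel,342,NULL,NULL",
  "Resort Hotel,13,304,NULL",
  "City Hotel,14,240,NULL"
), csv_path)

# Treat the literal string "NULL" as missing, in addition to the default "NA".
demo <- read.csv(csv_path, na.strings = c("NA", "NULL"))

str(demo)             # agent is read as integer; company is all NA
colSums(is.na(demo))  # number of missing values per column
```

Whether agent and company should then be treated as identifiers (factors) or dropped depends on the analysis; this sketch only shows the missing-value handling.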
Variable Selection: There are two binary columns -
“is_repeated_guest” and “is_canceled” - which can be used as the
response in a logistic regression model. “market_segment” is
categorical with more than two levels, so it is not binary.
“is_repeated_guest” is a useful response variable for this analysis
because modeling it gives insight into the factors that influence
guest loyalty and repeat bookings. That is a valuable input for hotel
customer retention strategies and for improving the customer
experience, so it is worth modeling.
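One way to check which columns are actually binary is to count their distinct values. The sketch below uses a few made-up rows that mimic three of the columns; the data frame demo and the helper logic are illustrative assumptions, not part of the original analysis. In the head() output above, market_segment already shows three different values (Direct, Corporate, Online TA).

```r
# Made-up rows that mimic three of the columns discussed above.
demo <- data.frame(
  is_repeated_guest = c(0, 0, 1, 0),
  is_canceled       = c(0, 1, 1, 0),
  market_segment    = c("Direct", "Corporate", "Online TA", "Direct")
)

# A column counts as binary here if it has exactly two distinct non-missing values.
is_binary <- sapply(demo, function(x) length(unique(na.omit(x))) == 2)
is_binary
```

The same sapply() call can be run on hotel_data to list every two-valued column before choosing a response variable.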
hotel_data$is_repeated_guest<-factor(hotel_data$is_repeated_guest)
#Model 1 : Build a logistic regression model for the binary variable - is_repeated_guest
reg_lm_model <- glm(is_repeated_guest ~ lead_time, data = hotel_data,family = binomial)
# Summary of the model
summary_reg_lm_model <- summary(reg_lm_model)
summary_reg_lm_model
##
## Call:
## glm(formula = is_repeated_guest ~ lead_time, family = binomial,
## data = hotel_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5175516 0.0209558 -120.14 <2e-16 ***
## lead_time -0.0157767 0.0004027 -39.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33746 on 119389 degrees of freedom
## Residual deviance: 30717 on 119388 degrees of freedom
## AIC: 30721
##
## Number of Fisher Scoring iterations: 8
lead_time_coef <- summary_reg_lm_model$coefficients["lead_time", "Estimate"]
lead_time_se <- summary_reg_lm_model$coefficients["lead_time", "Std. Error"]
#Confidence interval calculation
ci_lower <- lead_time_coef - 1.96 * lead_time_se
ci_upper <- lead_time_coef + 1.96 * lead_time_se
# Extract p-value
lead_time_p_value <- summary_reg_lm_model$coefficients["lead_time", "Pr(>|z|)"]
# Print coefficient , standard error and p-value
cat("1. Coefficient for lead_time-:", lead_time_coef, "\n")
## 1. Coefficient for lead_time-: -0.01577667
cat("2. Standard Error for lead_time-:", lead_time_se, "\n")
## 2. Standard Error for lead_time-: 0.0004026524
cat("P-value for lead_time:", lead_time_p_value, "\n")
## P-value for lead_time: 0
# Print the confidence interval
cat("Confidence interval (CI) for the coefficient of lead_time :[", ci_lower, " - ", ci_upper, "]\n")
## Confidence interval (CI) for the coefficient of lead_time :[ -0.01656587 - -0.01498747 ]
#coef(reg_lm_model)
#Odds ratios
#exp(coef(reg_lm_model))
levels(hotel_data$is_repeated_guest)
## [1] "0" "1"
probs <- predict(reg_lm_model, type = "response")
mean(probs[hotel_data$is_repeated_guest == "0"])
## [1] 0.03094604
mean(probs[hotel_data$is_repeated_guest == "1"])
## [1] 0.06122226
# Plot lead_time against is_repeated_guest with a fitted logistic curve.
# is_repeated_guest is a factor, so convert it back to numeric 0/1 for the y axis;
# a factor on y makes geom_smooth() fail with "y values must be 0 <= y <= 1".
ggplot(data = hotel_data, aes(x = lead_time, y = as.numeric(as.character(is_repeated_guest)))) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Lead Time", y = "Probability of being a repeated guest") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Confidence interval (CI): point estimate ± 1.96 × std. error, where
1.96 is the critical value for a 95% interval.
In the result above, Estimate(lead_time) = -0.0157767 and
Std. Error = 0.0004027, so
CI = (-0.0157767 ± 1.96 × 0.0004027)
CI = (−0.01656587, −0.01498747)
meaning that we have 95% confidence that the coefficient of lead time
falls between −0.01656587 and −0.01498747. The interval lies entirely
below zero: each additional day of lead time lowers the log-odds of
being a repeated guest by between 0.0150 and 0.0166. On the odds scale,
exp(−0.0157767) ≈ 0.984, i.e. about 1.6% lower odds per extra day.
The direction of this effect is not obvious at first, but one reading
is that guests who book with little or no lead time are more likely to
be repeated guests. The hotel could offer these guests promotions or a
loyalty program to encourage them to book their stays further in advance.
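The manual interval can be cross-checked with confint.default(), which computes the same Wald interval (estimate ± normal quantile × standard error), and the result can be moved to the odds-ratio scale with exp(). A minimal sketch on simulated data, since the booking file is not bundled here; the data frame sim and the coefficients used to generate it are arbitrary assumptions.

```r
set.seed(1)
# Simulated stand-in for the hotel data; the generating coefficients are arbitrary.
n <- 5000
lead_time <- rexp(n, rate = 1 / 100)
p <- plogis(-2.5 - 0.015 * lead_time)
sim <- data.frame(lead_time = lead_time,
                  is_repeated_guest = rbinom(n, 1, p))

fit <- glm(is_repeated_guest ~ lead_time, data = sim, family = binomial)

# Wald interval on the log-odds scale, same formula as the manual calculation.
wald_ci <- confint.default(fit, parm = "lead_time", level = 0.95)

# Exponentiating moves from the log-odds scale to the odds-ratio scale.
odds_ratio    <- exp(coef(fit)["lead_time"])
odds_ratio_ci <- exp(wald_ci)

wald_ci
odds_ratio
odds_ratio_ci
```

confint.default() uses qnorm(0.975), about 1.959964, in place of the rounded 1.96, so its bounds can differ from the manual ones in the last digits.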
A square root transformation could be useful in this analysis to make
the relationship between lead time and the log-odds of being a repeated
guest closer to linear, because it dampens the effect of extreme lead
time values.
The square root transformation replaces the lead time variable with its
square root. It is useful when the rate of change in the probability of
being a repeated guest with respect to lead time is not constant.
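One way to judge whether the transformation helps is to fit the model with and without it and compare AIC. The sketch below does this on simulated data in which the log-odds are, by construction, linear in sqrt(lead_time); the data frame sim, the generating coefficients, and the choice of AIC as the criterion are all assumptions for illustration.

```r
set.seed(42)
n <- 5000
lead_time <- rexp(n, rate = 1 / 100)
# Assumption for illustration: the true log-odds are linear in sqrt(lead_time).
p <- plogis(-1.5 - 0.25 * sqrt(lead_time))
sim <- data.frame(lead_time = lead_time,
                  is_repeated_guest = rbinom(n, 1, p))

# Same response, two versions of the predictor.
fit_raw  <- glm(is_repeated_guest ~ lead_time,       data = sim, family = binomial)
fit_sqrt <- glm(is_repeated_guest ~ sqrt(lead_time), data = sim, family = binomial)

# Lower AIC indicates the better trade-off between fit and complexity.
aic_table <- AIC(fit_raw, fit_sqrt)
aic_table
```

On the real data the same comparison would use hotel_data and the lead_time_sqrt column; which model wins there is an empirical question this sketch does not answer.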
# Compute the square root of lead_time
hotel_data$lead_time_sqrt <- sqrt(hotel_data$lead_time)
# Plot the transformed relationship with a fitted logistic curve.
# The factor response is converted to numeric 0/1 so geom_smooth() can fit the glm.
ggplot(data = hotel_data, aes(x = lead_time_sqrt, y = as.numeric(as.character(is_repeated_guest)))) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Square Root of Lead Time", y = "Probability of being a repeated guest") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Thank you!!!