May 29th, 2020
Abstract: In our regression project, our purpose is to estimate the causal effect of lead time on hotel booking cancellation. Because our dependent variable is binary, we use a logit regression with several control variables. The major finding of our “best” model is that, holding all else constant, a 10% increase in lead time raises the average probability of cancellation by about 0.008. The pseudo R-squared of our “best” model is 0.24, meaning the regressors explain about 24% of the variation in cancellation outcomes.
Data source: click here
Research Question: What are the causal effects of lead time on hotel booking cancellations?
Hypothesis: We hypothesize that our dependent variable (is_canceled) is related to our variable of interest (lead_time), in conjunction with the control variables we have identified. That is, our hypothesis is that the cancellation of a hotel booking is partly caused by the lead time of the booking. While brainstorming about this causal relationship, we also identified control variables for the different categories that may influence it.
Data: This dataset from Kaggle contains booking information for city hotels and resort hotels, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
Dependent Variable:
- is_canceled: value indicating if the booking was canceled (1) or not (0)
Variables of Interest:
- lead_time: number of days that elapsed between the entering date of the booking into the PMS (property management system) and the arrival date
Control Variables:
These control variables all enter the model as dummy (indicator) variables:
- hotel: indicates whether the booking is for a Resort Hotel or a City Hotel
- is_repeated_guest: whether the booking was made by a repeat guest (1) or not (0)
- deposit_type: No Deposit, Refundable, or Non Refund
- distribution_channel: Corporate, Direct, GDS (global distribution system) or TA/TO (Travel Agents/Tour Operators)
- is_holiday: 1 for holiday months (Nov, Dec, Jan) and 0 for other months
# Load data
data <- read.csv('hotel_bookings.csv')
head(data)
# Clean data
# Replace 'Undefined' distribution_channel with NA (5 undefined rows)
data$distribution_channel[data$distribution_channel == 'Undefined'] <- NA
# Check
#summary(data$distribution_channel)
# Make is_holiday column: holiday season is November through January
data$is_holiday <- as.integer(data$arrival_date_month %in%
                                c('November', 'December', 'January'))
# Check
#head(data)
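A quick sanity check on the two cleaning steps above (a minimal sketch; output not shown):
# Sanity checks for the cleaning steps
sum(is.na(data$distribution_channel))   # expected: the 5 former 'Undefined' rows
table(data$is_holiday)                  # bookings in holiday (1) vs. non-holiday (0) months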
Descriptive Statistics and Scatter Plots:
Below we have descriptive stats for the dependent variable and the variable of interest.
# Descriptive statistics for is_canceled (dependent var) and lead_time (variable of interest)
library(stargazer)
stargazer(data[c('is_canceled', 'lead_time')], type="text", digits=2,
summary.stat=c("n", "mean", "sd"),
title="Descriptive Stats for Dependent and Interest Variables", flip=FALSE,
covariate.labels=c('Cancelled', 'Lead Time'))
##
## Descriptive Stats for Dependent and Interest Variables
## =================================
## Statistic N Mean St. Dev.
## ---------------------------------
## Cancelled 119,390 0.37 0.48
## Lead Time 119,390 104.01 106.86
## ---------------------------------
corrm=cor(data[c('is_canceled', 'lead_time')])
paste("Correlation between cancelled and lead time:", round(corrm[1,2], digits=2))
## [1] "Correlation between cancelled and lead time: 0.29"
We decided to take the log of our variable of interest, lead_time, because it has a much wider range of values than the other variables. Using lead_time in levels would impose the wrong functional form and bias our estimates. Additionally, as the two charts below show, taking the log removes most of the outliers. From the second boxplot, we can see that, on average, hotel bookings with a longer lead time were more likely to be cancelled.
We use log(lead_time + 1) to handle bookings where someone shows up at the hotel without booking in advance; in that case lead_time equals zero and log(lead_time) would be undefined.
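A quick way to see why the +1 is needed is to count the bookings with a lead time of zero; a minimal check (output not shown):
# Walk-in / same-day bookings, for which log(lead_time) would be undefined
sum(data$lead_time == 0)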
# plots to see correlations
library(ggplot2)
ggplot(data, aes(x = factor(is_canceled), y = lead_time, fill = factor(is_canceled))) +
  geom_boxplot() +
  facet_wrap(~is_canceled) + ggtitle("lead_time")
ggplot(data, aes(x = factor(is_canceled), y = log(lead_time + 1), fill = factor(is_canceled))) +
  geom_boxplot() +
  facet_wrap(~is_canceled) + ggtitle("log(lead_time + 1)")
Logit Regressions (base and alternatives w/ control vars):
We used a logistic regression because our dependent variable is binary. We started with a base model that included only the dependent variable (is_canceled) and the variable of interest (lead_time), and then added one control variable with each new regression. Model 5 (fit3) is the “best” model we found. The coefficient on the variable of interest is relatively stable across specifications, which suggests it is not very sensitive to the choice of controls and is close to the true Beta.
To address problems such as omitted variable bias, we keep adding control variables and watch how the coefficient on the variable of interest responds. Once that coefficient is close to the true Beta, it should not change much no matter which additional variables we add. From model 5 to model 6 the coefficient increases by only 0.001, a very small change compared to earlier steps, where it moved by almost 0.2; the change from model 2 to model 3 is also very small (see the quick check after the regression table below).
# Fit models
base <- glm(is_canceled ~ log(lead_time+1),
family = binomial, x = TRUE, data = data)
fit0 <- glm(is_canceled ~ log(lead_time+1) + deposit_type,
family = binomial, x = TRUE, data = data)
fit1 <- glm(is_canceled ~ log(lead_time+1) + deposit_type + hotel,
family = binomial, x = TRUE, data = data)
fit2 <- glm(is_canceled ~ log(lead_time+1) + deposit_type + hotel
+ is_repeated_guest,
family = binomial, x = TRUE, data = data)
fit3 <- glm(is_canceled ~ log(lead_time+1) + deposit_type+ hotel
+ is_repeated_guest + distribution_channel,
family = binomial, x = TRUE, data = data)
fit4 <- glm(is_canceled ~ log(lead_time+1) + deposit_type+ hotel
+ is_repeated_guest + distribution_channel + is_holiday,
family = binomial, x = TRUE, data = data)
# Display results
stargazer(base, fit0, fit1, fit2, fit3, fit4, se = list(NULL, NULL, NULL, NULL, NULL, NULL),
title = "Logit: Hotel Cancellation", type = "text",
star.cutoffs = c(0.05, 0.01, 0.001), df = FALSE, digits = 3, omit.stat = c("adj.rsq"))
##
## Logit: Hotel Cancellation
## ==================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## is_canceled
## (1) (2) (3) (4) (5) (6)
## --------------------------------------------------------------------------------------------------
## log(lead_time + 1) 0.510*** 0.348*** 0.347*** 0.340*** 0.311*** 0.312***
## (0.005) (0.005) (0.005) (0.005) (0.005) (0.005)
##
## deposit_typeNon Refund 5.591*** 5.538*** 5.540*** 5.552*** 5.551***
## (0.104) (0.104) (0.104) (0.104) (0.104)
##
## deposit_typeRefundable -0.620** -0.478* -0.478* -0.180 -0.178
## (0.191) (0.192) (0.192) (0.192) (0.192)
##
## hotelResort Hotel -0.263*** -0.260*** -0.205*** -0.205***
## (0.015) (0.015) (0.015) (0.015)
##
## is_repeated_guest -0.359*** -0.202*** -0.202***
## (0.051) (0.053) (0.053)
##
## distribution_channelDirect -0.163*** -0.162***
## (0.044) (0.044)
##
## distribution_channelGDS 0.171 0.169
## (0.189) (0.189)
##
## distribution_channelTA/TO 0.433*** 0.434***
## (0.038) (0.038)
##
## is_holiday 0.021
## (0.020)
##
## Constant -2.591*** -2.271*** -2.174*** -2.139*** -2.398*** -2.406***
## (0.022) (0.022) (0.022) (0.023) (0.041) (0.042)
##
## --------------------------------------------------------------------------------------------------
## Observations 119,390 119,390 119,390 119,390 119,385 119,385
## Log Likelihood -71,828.830 -60,342.250 -60,184.530 -60,158.030 -59,813.760 -59,813.230
## Akaike Inf. Crit. 143,661.700 120,692.500 120,379.100 120,328.100 119,645.500 119,646.500
## ==================================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
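As a quick check of the coefficient-stability argument above, we can pull the log(lead_time + 1) estimate from each model directly; a minimal sketch (the values should match the first row of the table above):
# Coefficient on log(lead_time + 1) in each model
# (it is always the second coefficient, right after the intercept)
models <- list(base = base, fit0 = fit0, fit1 = fit1,
               fit2 = fit2, fit3 = fit3, fit4 = fit4)
round(sapply(models, function(m) coef(m)[[2]]), 3)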
F-tests:
First, we run an F-test on model 3 to decide whether to include the hotel type variable. The p-value is very small, so we use the unrestricted model and keep the hotel type variable.
library(AER)
lht(fit1, c('hotelResort Hotel = 0'))
Next, we run an F-test on model 6 to decide whether to include the holiday variable. The p-value is large, so we do not include the holiday variable as a control.
lht(fit4, c('is_holiday = 0'))
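As a cross-check on the restriction tests above, the same comparisons can be made with likelihood-ratio tests on the nested models; a sketch (output not shown, but we would expect the same conclusions):
# Likelihood-ratio cross-check of the same restrictions
anova(fit0, fit1, test = "LRT")   # model 2 vs. model 3: does adding hotel help?
anova(fit3, fit4, test = "LRT")   # model 5 vs. model 6: does adding is_holiday help?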
Pseudo R2:
Our base model, with only the variable of interest, has a pseudo R-squared of 0.087; that is, the regressors explain about 8.7% of the variation in cancellation outcomes. Our “best” and final model has a pseudo R-squared of 0.24, explaining about 24% of the variation. Including the control variables therefore raised the pseudo R-squared by 0.153 (from 0.087 to 0.24).
# Compare models with pseudo R squared
pseudoR2base = (base$null.deviance - base$deviance) / base$null.deviance
paste('Pseudo R-squared for base model: ', round(pseudoR2base, 3))
## [1] "Pseudo R-squared for base model: 0.087"
pseudoR2fit0 = (fit0$null.deviance - fit0$deviance) / fit0$null.deviance
paste('Pseudo R-squared for fit0 model: ', round(pseudoR2fit0, 3))
## [1] "Pseudo R-squared for fit0 model: 0.233"
pseudoR2fit1 = (fit1$null.deviance - fit1$deviance) / fit1$null.deviance
paste('Pseudo R-squared for fit1 model: ', round(pseudoR2fit1, 3))
## [1] "Pseudo R-squared for fit1 model: 0.235"
pseudoR2fit2 = (fit2$null.deviance - fit2$deviance) / fit2$null.deviance
paste('Pseudo R-squared for fit2 model: ', round(pseudoR2fit2, 3))
## [1] "Pseudo R-squared for fit2 model: 0.236"
pseudoR2fit3 = (fit3$null.deviance - fit3$deviance) / fit3$null.deviance
paste('Pseudo R-squared for fit3 model: ', round(pseudoR2fit3, 3))
## [1] "Pseudo R-squared for fit3 model: 0.24"
pseudoR2fit4 = (fit4$null.deviance - fit4$deviance) / fit4$null.deviance
paste('Pseudo R-squared for fit4 model: ', round(pseudoR2fit4, 3))
## [1] "Pseudo R-squared for fit4 model: 0.24"
Analysis of Results:
Holding all else constant, a 10% increase in lead time raises the average probability of cancellation by about 0.008: the marginal effect of log(lead_time + 1) in the “best” model is 0.076, and a 10% increase in lead time raises log(lead_time + 1) by roughly 0.1, so the change in probability is about 0.076 × 0.1 ≈ 0.008.
# Interpretation: marginal effects
library(erer)
basema <- maBina(base, x.mean = TRUE, rev.dum = TRUE, digits = 4)
fit0ma <- maBina(fit0, x.mean = TRUE, rev.dum = TRUE, digits = 4)
fit1ma <- maBina(fit1, x.mean = TRUE, rev.dum = TRUE, digits = 4)
fit2ma <- maBina(fit2, x.mean = TRUE, rev.dum = TRUE, digits = 4)
fit3ma <- maBina(fit3, x.mean = TRUE, rev.dum = TRUE, digits = 4)
fit4ma <- maBina(fit4, x.mean = TRUE, rev.dum = TRUE, digits = 4)
# Marginal effects for all models
#stargazer(basema, fit1ma, fit2ma, fit3ma, fit4ma, se=list(NULL, NULL, NULL, NULL, NULL),
# title="Marginal Effects", type="text",
# star.cutoffs = c(0.05, 0.01, 0.001), df=FALSE, digits=4, omit.stat=c("adj.rsq"))
# Marginal effects for "best" model
stargazer(fit3ma, se = list(NULL),
title = "Marginal Effects: Best Model", type = "text",
star.cutoffs = c(0.05, 0.01, 0.001), df = FALSE, digits = 3, omit.stat = c("adj.rsq"))
##
## Marginal Effects: Best Model
## ========================================================
## Dependent variable:
## -----------------------------
## is_canceled
## --------------------------------------------------------
## log(lead_time + 1) 0.076***
## (0.001)
##
## deposit_typeNon Refund 0.710***
## (0.002)
##
## deposit_typeRefundable -0.044
## (0.046)
##
## hotelResort Hotel -0.050***
## (0.004)
##
## is_repeated_guest -0.049***
## (0.012)
##
## distribution_channelDirect -0.040***
## (0.010)
##
## distribution_channelGDS 0.042
## (0.047)
##
## distribution_channelTA/TO 0.104***
## (0.009)
##
## Constant -0.589***
## (0.010)
##
## --------------------------------------------------------
## Observations 119,385
## Akaike Inf. Crit. 119,645.500
## ========================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
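The 10% interpretation follows from the marginal effect of the logged variable; a back-of-envelope check using the 0.076 estimate from the table above:
# A 10% increase in lead time raises log(lead_time + 1) by log(1.1), roughly 0.1,
# so the implied change in P(cancel) is approximately the marginal effect times 0.1
me_log_lead <- 0.076          # marginal effect of log(lead_time + 1) from the table
me_log_lead * 0.1             # ~0.008 increase in the probability of cancellation
me_log_lead * log(1.1)        # same calculation without the 0.1 approximation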
Internal/External Validity:
Internal Validity Concerns:
- Omitted variable bias: There are likely variables not included in our dataset that would help explain hotel cancellations. For example, the hotel rating (e.g., out of 5 stars) or the reason for the booking (e.g., vacation vs. necessity) would be good control variables to include if available.
- Sample selection bias and missing data: We are unsure how the bookings in this dataset were selected, so selection bias is possible.
External Validity Concerns:
- This dataset contains hotel bookings from around the world, with the majority from Portugal. Therefore, we are unsure whether our causal finding can be generalized to all hotels everywhere.