This R Notebook analyzes hotel reservation data using logistic regression. I select a binary response variable, fit a logistic regression model, interpret its coefficients, compute a confidence interval, consider whether any variables need transforming, and discuss the insights gained from the analysis.
# Load necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(broom)
library(tidyr)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
# Load the dataset
data <- read.csv("G:/semester_1/4_Statistics_R/syllabus/lab/week10/hotel_bookings.csv")
For my logistic regression analysis, I have selected the binary
column is_canceled as the response variable. This column records
whether a hotel booking was canceled. It takes two values: 1
(the booking was canceled) and 0 (the booking was not canceled).
Understanding the factors that influence hotel booking
cancellations is of significant importance to the hotel industry. By
modeling the is_canceled variable, I aim to uncover the
predictors that contribute to cancellations. This knowledge can assist
hotels in optimizing their booking management, revenue strategies, and
resource allocation.
In the subsequent sections, I build a logistic regression
model with is_canceled as the response variable and examine how
it relates to four explanatory variables in the dataset:
lead_time, total_of_special_requests, previous_cancellations,
and booking_changes.
# Fit the logistic regression model
model <- glm(is_canceled ~ lead_time + total_of_special_requests + previous_cancellations + booking_changes,
             data = data, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summarize the model
summary(model)
##
## Call:
## glm(formula = is_canceled ~ lead_time + total_of_special_requests +
## previous_cancellations + booking_changes, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.352e-01 1.061e-02 -69.27 <2e-16 ***
## lead_time 5.431e-03 6.588e-05 82.43 <2e-16 ***
## total_of_special_requests -6.539e-01 9.713e-03 -67.32 <2e-16 ***
## previous_cancellations 1.475e+00 3.410e-02 43.25 <2e-16 ***
## booking_changes -7.289e-01 1.573e-02 -46.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157398 on 119389 degrees of freedom
## Residual deviance: 135055 on 119385 degrees of freedom
## AIC: 135065
##
## Number of Fisher Scoring iterations: 6
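The deviance lines at the bottom of the summary can be turned into a rough overall check of the model. The sketch below uses only the numbers printed above. It computes the drop in deviance relative to the intercept-only model, the corresponding likelihood-ratio test, and McFadden's pseudo-R-squared. The variable names are illustrative, and treating the deviance difference as chi-squared assumes the usual large-sample approximation.

```r
# Values copied from the summary() output above
null_deviance     <- 157398
residual_deviance <- 135055
df_null           <- 119389
df_residual       <- 119385

# Likelihood-ratio statistic: drop in deviance from adding the four predictors
lr_stat <- null_deviance - residual_deviance
lr_df   <- df_null - df_residual

# Upper-tail chi-squared p-value for the likelihood-ratio test
lr_p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R-squared; for 0/1 responses the deviance equals
# -2 * log-likelihood, so this ratio of deviances is equivalent
mcfadden_r2 <- 1 - residual_deviance / null_deviance

print(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = round(mcfadden_r2, 3)))
```

Inside the notebook the same inputs are available from the fitted object as model$null.deviance, model$deviance, model$df.null, and model$df.residual.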
lead_time: Holding the other predictors fixed, a one-unit increase in lead time raises the log-odds of cancellation by 0.00543, which multiplies the odds by exp(0.00543) ≈ 1.005 (about a 0.5% increase).
total_of_special_requests: The coefficient is negative, so each additional special request lowers the log-odds of cancellation by 0.654, multiplying the odds by about 0.52.
previous_cancellations: Each additional previous cancellation raises the log-odds of the current booking being canceled by 1.475, multiplying the odds by about 4.37.
booking_changes: Each additional booking change lowers the log-odds of cancellation by 0.729, multiplying the odds by about 0.48.
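The coefficients in the summary are on the log-odds scale, which is hard to read directly. A minimal sketch of the conversion to odds ratios follows. It uses the estimates printed in the summary rather than the fitted object, so it runs on its own, and the variable names are illustrative.

```r
# Coefficient estimates copied from the summary() output above
estimates <- c(
  lead_time                 =  5.431e-03,
  total_of_special_requests = -6.539e-01,
  previous_cancellations    =  1.475e+00,
  booking_changes           = -7.289e-01
)

# Exponentiating a log-odds coefficient gives an odds ratio:
# values above 1 raise the odds of cancellation, values below 1 lower them
odds_ratios <- exp(estimates)

# Percentage change in the odds per one-unit increase in each predictor
percent_change <- 100 * (odds_ratios - 1)

print(round(cbind(estimate = estimates,
                  odds_ratio = odds_ratios,
                  percent_change = percent_change), 3))
```

With the fitted model in memory, exp(coef(model)) gives the same odds ratios (plus one for the intercept) without retyping the numbers.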
# Calculate the confidence interval for the lead_time coefficient
conf_interval <- confint(model, "lead_time")
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Print the confidence interval
conf_interval
## 2.5 % 97.5 %
## 0.005301857 0.005560112
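The interval above comes from profiling the likelihood, as the "Waiting for profiling to be done..." message indicates. As a cross-check, the sketch below computes the simpler Wald interval by hand from the estimate and standard error in the summary. With a sample this large the two should nearly coincide. The variable names are illustrative.

```r
# Estimate and standard error for lead_time, copied from summary(model)
estimate  <- 5.431e-03
std_error <- 6.588e-05

# Wald interval: estimate +/- z * standard error, with z = qnorm(0.975)
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)

# The same interval expressed as an odds ratio per one-unit increase
wald_ci_odds_ratio <- exp(wald_ci)

print(wald_ci)
print(wald_ci_odds_ratio)
```

In the notebook, confint.default(model, "lead_time") returns this Wald interval directly from the fitted object.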
# Scatter plots
par(mfrow = c(2, 2))
# Scatter plot of lead_time
plot(data$lead_time, model$fitted.values, xlab = "Lead Time", ylab = "Fitted Values")
# Scatter plot of total_of_special_requests
plot(data$total_of_special_requests, model$fitted.values, xlab = "Total Special Requests", ylab = "Fitted Values")
# Scatter plot of previous_cancellations
plot(data$previous_cancellations, model$fitted.values, xlab = "Previous Cancellations", ylab = "Fitted Values")
# Scatter plot of booking_changes
plot(data$booking_changes, model$fitted.values, xlab = "Booking Changes", ylab = "Fitted Values")
# Reset par settings
par(mfrow = c(1, 1))
The scatter plots show the model's fitted probabilities against each explanatory variable (model$fitted.values is on the probability scale, not the log-odds scale). I see no pattern in them that clearly calls for a transformation, so I leave the variables untransformed. Two caveats apply. First, the model is linear in each predictor on the log-odds scale by construction, so plots of its own fitted values cannot confirm that this assumption holds in the data. Second, the glm warning that some fitted probabilities are numerically 0 or 1 suggests that a few bookings have extreme predictor values. Interactions or polynomial terms remain worth exploring to improve the model’s predictive performance.
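A common way to check linearity on the log-odds scale is to plot the empirical logit: group a predictor into bins, compute the observed log-odds of the outcome in each bin, and see whether the points fall roughly on a line. The sketch below is one possible implementation. The helper empirical_logit() is hypothetical (it is not part of any package), and the data are simulated only so that the example runs on its own.

```r
# Empirical logit by bin: a rough check of linearity on the log-odds scale
empirical_logit <- function(x, y, n_bins = 10) {
  # Bin x at its quantiles; duplicated break points are dropped
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  bin <- cut(x, breaks = breaks, include.lowest = TRUE)
  events <- tapply(y, bin, sum)     # number of 1s in each bin
  n <- tapply(y, bin, length)       # number of observations in each bin
  data.frame(
    x_mean = as.numeric(tapply(x, bin, mean)),
    n = as.numeric(n),
    # Adding 0.5 keeps the logit finite when a bin has 0 or n events
    logit = as.numeric(log((events + 0.5) / (n - events + 0.5)))
  )
}

# Simulated data (NOT the hotel data): the true log-odds are linear in x
set.seed(1)
sim_x <- runif(5000, min = 0, max = 300)
sim_y <- rbinom(5000, size = 1, prob = plogis(-0.7 + 0.005 * sim_x))

el <- empirical_logit(sim_x, sim_y)
plot(el$x_mean, el$logit,
     xlab = "Bin mean of x", ylab = "Empirical log-odds")
```

On the hotel data the call would be empirical_logit(data$lead_time, data$is_canceled). For the count predictors, which take few distinct values, many quantile breaks coincide, so grouping by the raw value may be more informative than quantile bins.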
The analysis identifies several factors associated with hotel booking cancellations, which can inform decisions in the hotel industry. The main findings are:
Significant Predictors: All four predictors in the logistic
regression model (lead_time, total_of_special_requests,
previous_cancellations, and booking_changes) are statistically
significant, each with p < 2e-16. With 119,390 bookings, even
small effects reach significance, so the size and sign of each
coefficient say more about its practical role than the p-value
does.
Lead Time: The 95% confidence interval for the
lead_time coefficient (0.00530 to 0.00556) excludes zero, so the
effect is statistically significant and positive: as lead time
increases, the odds of a booking being canceled also increase. Hotel
management should consider this when planning their booking strategies.
Relationships with Explanatory Variables: The scatter plots show how the fitted probability of cancellation varies with each explanatory variable, giving a visual summary of the direction of each effect.
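To make the lead-time finding concrete, the fitted coefficients can be converted into predicted cancellation probabilities. The sketch below does this by hand for a few hypothetical lead times, holding the other three predictors at 0. The lead-time values are illustrative choices, not values taken from the analysis.

```r
# Coefficient estimates copied from the summary() output
intercept      <- -7.352e-01
beta_lead_time <-  5.431e-03

# Hypothetical lead times; the other three predictors are held at 0
lead_time <- c(0, 30, 90, 180, 365)

# Linear predictor (log-odds), then the inverse logit to get a probability
log_odds    <- intercept + beta_lead_time * lead_time
probability <- plogis(log_odds)

print(data.frame(lead_time, probability = round(probability, 3)))
```

With the fitted model available, predict(model, newdata = ..., type = "response") gives the same probabilities for any combination of predictor values supplied in a data frame.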
These findings are a starting point. Further investigation is needed to refine my understanding and improve model performance. Here are some questions and areas for future exploration:
Variable Interactions: Are there interactions between the predictor variables that amplify or mitigate their effects on booking cancellations? Analyzing interactions can lead to more accurate predictions.
Model Assessment: How well does the logistic regression model predict cancellations? Its accuracy, sensitivity, and specificity at a chosen probability threshold would show its overall quality.
Data Collection: Are there additional data sources or variables that can enhance the predictive power of the model? Consider collecting more data to improve the model’s performance.
Booking Management Strategies: How can the insights from this analysis be translated into effective booking management strategies? Hotel management should develop and test strategies to reduce cancellations and optimize revenue.
Customer Behavior Analysis: Can we explore the reasons behind special requests, previous cancellations, and booking changes? Understanding customer behavior can help hotels tailor their services and policies.
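The Model Assessment question above can be approached with a simple confusion-matrix summary. The sketch below is one way to compute accuracy, sensitivity, and specificity. The helper classification_metrics() is hypothetical, the 0.5 threshold is an assumption that should be tuned to the cost of each type of error, and the toy vectors are made up purely to show the call.

```r
# Accuracy, sensitivity, and specificity at a given probability threshold
classification_metrics <- function(actual, predicted_prob, threshold = 0.5) {
  predicted <- as.integer(predicted_prob >= threshold)
  tp <- sum(predicted == 1 & actual == 1)  # canceled, predicted canceled
  tn <- sum(predicted == 0 & actual == 0)  # kept, predicted kept
  fp <- sum(predicted == 1 & actual == 0)  # kept, predicted canceled
  fn <- sum(predicted == 0 & actual == 1)  # canceled, predicted kept
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Toy example (made-up values, not from the hotel data)
actual         <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted_prob <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

metrics <- classification_metrics(actual, predicted_prob)
print(metrics)
```

In the notebook the call would be classification_metrics(data$is_canceled, fitted(model)), assuming no rows were dropped for missing values when the model was fitted. Metrics computed on the training data tend to be optimistic, so a held-out test set would give a fairer picture.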
In conclusion, this analysis provides a starting point for decisions about hotel booking management. Addressing the questions above would let me refine the model and the strategies built on it, and ultimately improve the customer experience.
This concludes the analysis of hotel reservation data using logistic regression, but it’s just the beginning of the journey toward more effective hotel booking management.