Introduction

This R Notebook analyzes hotel reservation data with logistic regression. It selects a binary response variable, fits a logistic regression model, interprets the coefficients, builds a confidence interval for one of them, considers transformations of the explanatory variables, and discusses the insights gained from the analysis.

Data Description

# Load necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
library(tidyr)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
# Load the dataset
data <- read.csv("G:/semester_1/4_Statistics_R/syllabus/lab/week10/hotel_bookings.csv")

Choose a binary variable worth modeling
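
The model fitted in the next section uses is_canceled as its response. A quick way to confirm that a candidate column really is binary, and to look at its class balance, might be sketched as follows. The helper is_binary and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: TRUE when a column has exactly two distinct non-missing values.
is_binary <- function(x) {
  length(unique(x[!is.na(x)])) == 2
}

# Toy stand-in for the bookings data (made-up values, for illustration only).
toy <- data.frame(
  is_canceled = c(0, 1, 0, 0, 1),
  lead_time   = c(12, 340, 7, 45, 120)
)

# Which columns qualify as binary responses?
binary_cols <- names(toy)[sapply(toy, is_binary)]
binary_cols

# Class balance of the candidate response.
prop.table(table(toy$is_canceled))
```

On the real data, the equivalent calls would be is_binary(data$is_canceled) and prop.table(table(data$is_canceled)).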

Build a logistic regression model

# Fit the logistic regression model
model <- glm(is_canceled ~ lead_time + total_of_special_requests + previous_cancellations + booking_changes,
             data = data, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Summarize the model
summary(model)
## 
## Call:
## glm(formula = is_canceled ~ lead_time + total_of_special_requests + 
##     previous_cancellations + booking_changes, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -7.352e-01  1.061e-02  -69.27   <2e-16 ***
## lead_time                  5.431e-03  6.588e-05   82.43   <2e-16 ***
## total_of_special_requests -6.539e-01  9.713e-03  -67.32   <2e-16 ***
## previous_cancellations     1.475e+00  3.410e-02   43.25   <2e-16 ***
## booking_changes           -7.289e-01  1.573e-02  -46.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157398  on 119389  degrees of freedom
## Residual deviance: 135055  on 119385  degrees of freedom
## AIC: 135065
## 
## Number of Fisher Scoring iterations: 6

Interpret the coefficients
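
The estimates in the summary are on the log-odds scale: each one is the change in the log-odds of cancellation per one-unit increase in that predictor, holding the others fixed. Exponentiating gives odds ratios, which are usually easier to read. The sketch below copies the estimates from the summary output rather than refitting the model; the object names are illustrative.

```r
# Estimates copied from the summary(model) output above.
estimates <- c(
  "(Intercept)"             = -0.7352,
  lead_time                 =  0.005431,
  total_of_special_requests = -0.6539,
  previous_cancellations    =  1.475,
  booking_changes           = -0.7289
)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(estimates[-1])
round(odds_ratios, 3)

# Percent change in the odds per one-unit increase.
pct_change <- 100 * (odds_ratios - 1)
round(pct_change, 1)

# Predicted cancellation probability when every predictor is zero.
baseline_prob <- unname(plogis(estimates["(Intercept)"]))
round(baseline_prob, 3)
```

Read this way, each extra day of lead time raises the odds of cancellation by roughly 0.5%, each additional special request lowers them by roughly 48%, each previous cancellation multiplies them by about 4.4, and each booking change lowers them by roughly 52%. A booking with all four predictors at zero has a predicted cancellation probability of about 0.32. With the fitted model available, exp(coef(model)) gives the same odds ratios directly.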

Build a Confidence Interval for one coefficient

# Calculate the confidence interval for the lead_time coefficient
conf_interval <- confint(model, "lead_time")
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Print the confidence interval
conf_interval
##       2.5 %      97.5 % 
## 0.005301857 0.005560112

Interpretation of Confidence Interval
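
The interval printed above is on the log-odds scale and lies entirely above zero, so the data are consistent only with a positive effect of lead time on cancellation. A minimal sketch of how the interval could be translated to the odds-ratio scale, and compared with a simple Wald interval built from the summary table, is given below. The 30-day horizon is an arbitrary illustrative choice.

```r
# Profile-likelihood interval printed above (log-odds per day of lead time).
ci_log_odds <- c(lower = 0.005301857, upper = 0.005560112)

# Same interval on the odds-ratio scale, per extra day of lead time.
ci_odds_ratio <- exp(ci_log_odds)

# Interval for a 30-day increase (30 is an arbitrary illustrative horizon).
ci_odds_ratio_30 <- exp(30 * ci_log_odds)

# Wald interval from the summary table, for comparison with the profile interval.
estimate  <- 5.431e-03
std_error <- 6.588e-05
ci_wald <- estimate + c(lower = -1, upper = 1) * qnorm(0.975) * std_error

signif(ci_odds_ratio, 6)
round(ci_odds_ratio_30, 3)
signif(ci_wald, 4)
```

A booking made 30 days earlier than an otherwise identical one has odds of cancellation roughly 17% to 18% higher. The Wald and profile intervals agree closely, which is expected with a sample this large.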

Warnings
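
The message "fitted probabilities numerically 0 or 1 occurred" means that at least one booking received a fitted probability indistinguishable from 0 or 1 in floating point. The model still converged in 6 Fisher scoring iterations with small standard errors, so this looks like a few extreme observations rather than complete separation. One plausible source, not verified against the data, is a booking with a very large number of previous cancellations. The sketch below uses hypothetical counts and assumes the tolerance is 10 * .Machine$double.eps, which is my understanding of the check in glm.fit.

```r
# Coefficients copied from the summary(model) output above.
intercept <- -0.7352
b_prev    <-  1.475

# Hypothetical counts of previous cancellations (other predictors held at zero).
prev_cancellations <- c(0, 5, 10, 20, 24, 26)

# Linear predictor and fitted probability at each count.
eta <- intercept + b_prev * prev_cancellations
p   <- plogis(eta)

# Tolerance assumed to mirror the check inside glm.fit.
eps <- 10 * .Machine$double.eps
numerically_one <- p > 1 - eps

data.frame(prev_cancellations, eta, p, numerically_one)
```

On the fitted model, range(fitted(model)) and summary(data$previous_cancellations) would show whether such extreme values actually occur.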

Consider a transformation for explanatory variables

# Scatter plots
par(mfrow = c(2, 2))

# Scatter plot of lead_time
plot(data$lead_time, model$fitted.values, xlab = "Lead Time", ylab = "Fitted Values")

# Scatter plot of total_of_special_requests
plot(data$total_of_special_requests, model$fitted.values, xlab = "Total Special Requests", ylab = "Fitted Values")

# Scatter plot of previous_cancellations
plot(data$previous_cancellations, model$fitted.values, xlab = "Previous Cancellations", ylab = "Fitted Values")

# Scatter plot of booking_changes
plot(data$booking_changes, model$fitted.values, xlab = "Booking Changes", ylab = "Fitted Values")

# Reset par settings
par(mfrow = c(1, 1))
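
Plots of fitted values against a predictor show the model's own output, so they cannot reveal whether a transformation is needed. A more direct check is to bin the predictor, compute the empirical log-odds of the response in each bin, and see whether the points fall on a straight line. The sketch below demonstrates the idea on synthetic data. The helper empirical_logit, the simulated variables, and the choice of log1p as the candidate transformation are all assumptions made for illustration.

```r
# Hypothetical helper: empirical log-odds of y within quantile bins of x.
empirical_logit <- function(x, y, n_bins = 10) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  bin <- cut(x, breaks = breaks, include.lowest = TRUE)
  events <- tapply(y, bin, sum)
  totals <- tapply(y, bin, length)
  data.frame(
    x_mid = as.numeric(tapply(x, bin, median)),
    # Adding 0.5 keeps the logit finite when a bin has no events or no non-events.
    logit = as.numeric(log((events + 0.5) / (totals - events + 0.5)))
  )
}

# Synthetic data in which the log-odds are linear in log1p(x), not in x.
set.seed(1)
x <- rexp(5000, rate = 1 / 100)
y <- rbinom(5000, size = 1, prob = plogis(-3 + 0.6 * log1p(x)))

binned <- empirical_logit(x, y)

# Left: raw scale. Right: transformed scale.
par(mfrow = c(1, 2))
plot(binned$x_mid, binned$logit, xlab = "x", ylab = "Empirical logit")
plot(log1p(binned$x_mid), binned$logit, xlab = "log1p(x)", ylab = "Empirical logit")
par(mfrow = c(1, 1))
```

On the bookings data the analogous call would be empirical_logit(data$lead_time, data$is_canceled). If the points curve on the raw scale but straighten on the transformed scale, refitting with log1p(lead_time) in the formula would be worth trying.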

Insights and Further Questions

The analysis identifies several factors associated with hotel booking cancellations. The main findings and some open questions are summarized here.

Insights:

  1. Significant Predictors: All four predictors (lead_time, total_of_special_requests, previous_cancellations, and booking_changes) have p-values below 2e-16. Longer lead times and more previous cancellations are associated with higher odds of cancellation, while more special requests and more booking changes are associated with lower odds. With 119,390 bookings, even small effects reach statistical significance, so the size of each effect matters more than its p-value.

  2. Lead Time: The 95% confidence interval for the lead_time coefficient, (0.00530, 0.00556), excludes zero, so the effect is statistically significant and positive. Each additional day of lead time multiplies the odds of cancellation by about exp(0.0054) ≈ 1.005, an increase of roughly 0.5% per day. Hotel management should take this into account when planning booking strategies.

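  3. Relationships with Explanatory Variables: The scatter plots show the model's fitted cancellation probabilities, not odds, against each explanatory variable. Because the fitted values come from the model itself, the plots show how the predicted probability varies with each variable rather than independent evidence of the relationship in the raw data.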

Further Questions:

These results are a starting point, and further investigation is needed to refine the analysis and improve model performance. Some questions and areas for future exploration:

  1. Variable Interactions: Are there interactions between the predictor variables that amplify or mitigate their effects on booking cancellations? Analyzing interactions can lead to more accurate predictions.

  2. Model Assessment: How well does the logistic regression model predict cancellations? Its accuracy, sensitivity, and specificity can be computed to evaluate its overall quality.

  3. Data Collection: Are there additional data sources or variables that can enhance the predictive power of the model? Consider collecting more data to improve the model’s performance.

  4. Booking Management Strategies: How can the insights from this analysis be translated into effective booking management strategies? Hotel management should develop and test strategies to reduce cancellations and optimize revenue.

  5. Customer Behavior Analysis: Can we explore the reasons behind special requests, previous cancellations, and booking changes? Understanding customer behavior can help hotels tailor their services and policies.
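
The model assessment question above could be approached with a small helper like the one sketched here. The function name classification_metrics, the 0.5 cutoff, and the toy vectors are illustrative assumptions; nothing in this sketch was run on the bookings data.

```r
# Hypothetical helper: accuracy, sensitivity and specificity at a probability cutoff.
classification_metrics <- function(actual, predicted_prob, cutoff = 0.5) {
  predicted <- as.integer(predicted_prob >= cutoff)
  tp <- sum(predicted == 1 & actual == 1)  # canceled, predicted canceled
  tn <- sum(predicted == 0 & actual == 0)  # kept, predicted kept
  fp <- sum(predicted == 1 & actual == 0)  # kept, predicted canceled
  fn <- sum(predicted == 0 & actual == 1)  # canceled, predicted kept
  c(
    accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp)
  )
}

# Toy example with made-up outcomes and probabilities.
actual         <- c(1, 0, 1, 0, 0, 1, 0, 0)
predicted_prob <- c(0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.2)
metrics <- classification_metrics(actual, predicted_prob)
metrics
```

For the fitted model the call would be classification_metrics(data$is_canceled, fitted(model)). Metrics computed on the same data used for fitting tend to be optimistic, so a held-out test set would give a fairer picture.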

In conclusion, the analysis provides a foundation for decision-making and optimization of hotel booking management. Addressing the further questions above would refine these strategies and enhance the customer experience.

This concludes the analysis of hotel reservation data using logistic regression, but it’s just the beginning of the journey toward more effective hotel booking management.