Initial setup and data set configuration.
Load the data file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand data set covering two hotel types, City Hotel and Resort Hotel.

hotel_data <- read.csv(file.choose())

Hotel data summary

summary(hotel_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
head(hotel_data)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03

Variable Selection: There are a few binary columns, such as “is_repeated_guest” and “is_canceled”, that can serve as the response in our regression model. Note that “market_segment” is categorical with more than two levels (the head() output already shows Direct, Corporate and Online TA), so it is not binary.
“is_repeated_guest” is an interesting response variable for this analysis: modeling it gives insight into the factors that influence guest loyalty and repeat bookings, which is valuable for hotel customer retention strategies and for improving the customer experience. This makes it worth modeling.
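
A quick way to check which columns are actually binary is to count distinct non-missing values per column. The sketch below uses a small made-up data frame named toy, loosely modelled on the head() output above; the values are illustrative assumptions, not rows of the real data set.

```r
# Toy rows loosely modelled on head(hotel_data); values are illustrative only.
toy <- data.frame(
  is_canceled       = c(0, 0, 1, 0, 1, 0),
  is_repeated_guest = c(0, 0, 0, 1, 0, 0),
  market_segment    = c("Direct", "Direct", "Direct",
                        "Corporate", "Online TA", "Online TA"),
  lead_time         = c(342, 737, 7, 13, 14, 14),
  stringsAsFactors  = FALSE
)

# A column counts as binary if it has exactly two distinct non-missing values.
is_binary <- vapply(
  toy,
  function(x) length(unique(na.omit(x))) == 2,
  logical(1)
)
binary_cols <- names(toy)[is_binary]
print(binary_cols)
```

Running the same vapply() call on hotel_data instead of toy would list the candidate binary responses in the full data set.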

Fit a logistic regression of the binary variable is_repeated_guest on lead_time and examine the relationship between these variables.


hotel_data$is_repeated_guest <- factor(hotel_data$is_repeated_guest)
# Model 1: Build a logistic regression model with the binary response is_repeated_guest
reg_lm_model <- glm(is_repeated_guest ~ lead_time, data = hotel_data, family = binomial)

# Summary of the model
summary_reg_lm_model <- summary(reg_lm_model)
summary_reg_lm_model
## 
## Call:
## glm(formula = is_repeated_guest ~ lead_time, family = binomial, 
##     data = hotel_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.5175516  0.0209558 -120.14   <2e-16 ***
## lead_time   -0.0157767  0.0004027  -39.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33746  on 119389  degrees of freedom
## Residual deviance: 30717  on 119388  degrees of freedom
## AIC: 30721
## 
## Number of Fisher Scoring iterations: 8
lead_time_coef <- summary_reg_lm_model$coefficients["lead_time", "Estimate"]
lead_time_se <- summary_reg_lm_model$coefficients["lead_time", "Std. Error"]

#Confidence interval calculation
ci_lower <- lead_time_coef - 1.96 * lead_time_se
ci_upper <- lead_time_coef + 1.96 * lead_time_se

# Extract p-value
lead_time_p_value <- summary_reg_lm_model$coefficients["lead_time", "Pr(>|z|)"]


# Print coefficient ,  standard error and  p-value
cat("1. Coefficient for lead_time-:", lead_time_coef, "\n")
## 1. Coefficient for lead_time-: -0.01577667
cat("2. Standard Error for lead_time-:", lead_time_se, "\n")
## 2. Standard Error for lead_time-: 0.0004026524
cat("P-value for lead_time:", lead_time_p_value, "\n")
## P-value for lead_time: 0
# Print the confidence interval
cat("Confidence interval (CI) for the coefficient of lead_time :[", ci_lower, " - ", ci_upper, "]\n")
## Confidence interval (CI) for the coefficient of lead_time :[ -0.01656587  -  -0.01498747 ]
#coef(reg_lm_model)
#Odds ratios
#exp(coef(reg_lm_model))


levels(hotel_data$is_repeated_guest) 
## [1] "0" "1"
probs <- predict(reg_lm_model, type = "response")

mean(probs[hotel_data$is_repeated_guest == "0"])
## [1] 0.03094604
mean(probs[hotel_data$is_repeated_guest == "1"])
## [1] 0.06122226
# Plot lead_time against is_repeated_guest with a fitted logistic curve.
# The binomial smoother needs a numeric 0/1 response, so the factor is
# converted back to numeric inside aes().
library(ggplot2)
ggplot(data = hotel_data,
       aes(x = lead_time, y = as.numeric(as.character(is_repeated_guest)))) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Lead Time", y = "Probability of being a repeated guest") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'


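Explanation: The variable “lead_time” is a statistically significant predictor of whether a guest is a repeated guest (z = -39.18, p < 2e-16). This means that changes in lead time are associated with a change in the likelihood of the guest being a repeated guest.
The mean predicted probability is 0.0612 for actual repeated guests and 0.0309 for non-repeated guests, so the model assigns a higher probability to repeated guests on average. Both values are small because only about 3.2% of bookings in the data come from repeated guests (mean of is_repeated_guest = 0.03191). The plot above is meant to show the relationship between lead time and is_repeated_guest visually.
Interpret the coefficients: The intercept, -2.5175516, is the log odds of being a repeated guest when lead_time is 0, which corresponds to a probability of about 0.075 for a last-minute booking. The lead_time coefficient, -0.0157767, is negative, so the log odds fall as lead time grows: guests who book far in advance are less likely to be repeated guests.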
Build a C.I. for this coefficient:

Confidence interval (CI): estimate ± 1.96 × std. error, where 1.96 is the critical value for a 95% interval.
In the result above, Estimate (lead_time) = -0.0157767 and Std. Error = 0.0004027.
CI = -0.0157767 ± 1.96 × 0.0004027
CI = (−0.01656587, −0.01498747), as computed in R from the unrounded estimate and standard error.
This means we are 95% confident that the coefficient of lead_time lies between −0.01656587 and −0.01498747. The whole interval is below zero, so each additional day of lead time lowers the log odds of being a repeated guest by between about 0.0150 and 0.0166. The effect of a single day is small, but a practical reading is that guests who book with little or no lead time are more likely to be repeated guests. The hotel could offer these guests promotions or a loyalty program, which might encourage them to book their stays further in advance.
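
The interval above is on the log-odds scale, which is hard to read directly. As a minimal sketch, the estimate and standard error reported in the summary can be exponentiated to give an odds ratio and its interval. The variable names below are introduced only for this illustration.

```r
# Estimate and standard error copied from the model summary above.
est <- -0.0157767
se  <- 0.0004027

# 95% Wald interval on the log-odds scale.
z <- qnorm(0.975)                      # critical value, about 1.96
ci_log_odds <- est + c(-1, 1) * z * se

# Exponentiate to move from log odds to odds ratios.
odds_ratio    <- exp(est)
ci_odds_ratio <- exp(ci_log_odds)

print(round(ci_log_odds, 5))
print(round(odds_ratio, 3))
print(round(ci_odds_ratio, 3))
```

The odds ratio comes out at roughly 0.984, meaning each extra day of lead time multiplies the odds of being a repeated guest by about 0.984 (a decrease of around 1.6%). With the fitted model available, confint.default(reg_lm_model) should give the same kind of Wald interval without manual arithmetic.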

Consider a transformation for any explanatory variable, and illustrate why you need the transformation.
Square Root Transformation

A square root transformation could be useful in this analysis. lead_time is strongly right-skewed (median 69, maximum 737), and taking its square root compresses the largest values, which reduces the influence of extreme lead times and can make the relationship between lead time and the log odds of being a repeated guest closer to linear.
The transformation replaces lead_time with its square root. It is appropriate when the rate of change in the probability of being a repeated guest with respect to lead time is not constant, for example when one extra day matters more for short lead times than for long ones.

#compute Square Root of lead_time
hotel_data$lead_time_sqrt <- sqrt(hotel_data$lead_time)


# Plot the transformed relationship with a fitted logistic curve.
# The response is converted from factor to numeric 0/1 so that the
# binomial smoother can be fitted.
ggplot(data = hotel_data,
       aes(x = lead_time_sqrt, y = as.numeric(as.character(is_repeated_guest)))) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "Square Root of Lead Time", y = "Probability of being a repeated guest") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The plot on the square-root scale is meant to show the relationship between lead_time and is_repeated_guest more clearly, since large lead times are pulled closer together, and to support the statement made above.
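
One way to check whether a transformation is worthwhile is to fit the model with and without it and compare AIC. The sketch below does this on simulated data, not on the hotel data: the data frame sim, the sample size, and the assumed coefficients are all hypothetical and chosen only so that the example runs on its own.

```r
set.seed(42)
n <- 5000

# Simulated, right-skewed lead times (hypothetical; not the hotel data).
lead_time <- round(rexp(n, rate = 1 / 100))

# Assumption for the simulation: log odds are linear in sqrt(lead_time).
p <- plogis(-1 - 0.15 * sqrt(lead_time))
repeated <- rbinom(n, size = 1, prob = p)
sim <- data.frame(lead_time, repeated)

# Same logistic model, once on the raw scale and once on the square-root scale.
fit_raw  <- glm(repeated ~ lead_time,       data = sim, family = binomial)
fit_sqrt <- glm(repeated ~ sqrt(lead_time), data = sim, family = binomial)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(fit_raw, fit_sqrt)
print(aic_table)
```

On the real data, the equivalent comparison would be between reg_lm_model and a second glm() fitted on lead_time_sqrt. A clearly lower AIC for the transformed model would be evidence that the transformation is needed.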



Thank you!!!