Initial setup: configure the data set.
Load the data file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand data set for hotels such as City Hotel and Resort Hotel.

# Load the Data
hotel_data <- read.csv(file.choose())
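
file.choose() opens an interactive file picker, so the load step above cannot run unattended (for example when the report is knitted). A minimal sketch of a reproducible alternative follows; the helper name load_hotel_data and the file name in the usage comment are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: read the bookings CSV from a fixed path when one is
# given, and fall back to the interactive picker otherwise.
load_hotel_data <- function(path = NULL) {
  if (is.null(path)) {
    path <- file.choose()  # only works in an interactive session
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Usage (the file name is an assumption, not taken from this report):
# hotel_data <- load_hotel_data("hotel_bookings.csv")
```

Keeping the path as an argument means the same code works both interactively and in a scripted run.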

Question:

Data summary and analysis
#Explore the structure of the dataset
str(hotel_data)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
#Summary statistics
summary(hotel_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
# Explore the first 6 rows of the hotel data (head() returns 6 rows by default)
head(hotel_data)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
# Check for missing values
colSums(is.na(hotel_data))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
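
The counts above report 0 missing values for agent and company, yet str() shows the literal string "NULL" in both columns. is.na() does not treat that string as missing. The sketch below, on a small made-up data frame (the values are illustrative only), shows one way to count such placeholders separately.

```r
# Toy data frame standing in for hotel_data (values are illustrative only).
toy <- data.frame(
  agent    = c("NULL", "304", "240", "NULL"),
  company  = c("NULL", "NULL", "NULL", "110"),
  children = c(0, NA, 1, 0),
  stringsAsFactors = FALSE
)

# is.na() only sees true NA values, so the "NULL" strings are not counted.
na_counts <- colSums(is.na(toy))

# Count the literal "NULL" strings per column; which() ignores NA comparisons.
null_counts <- sapply(toy, function(col) length(which(col == "NULL")))

print(na_counts)
print(null_counts)
```

Neither agent nor company is used in the models below, so this does not change the fitted results; it only matters if those columns are analysed later.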
Build a linear (or generalized linear) model:
Variable selection: To build the model, I selected the variables listed below.
  1. is_canceled
  2. lead_time
  3. stays_in_weekend_nights
  4. stays_in_week_nights
  5. adults
  6. children
  7. babies
  8. booking_changes
  9. adr
  10. total_of_special_requests
where is_canceled is the response variable and the others (items 2 to 10 in the list above, such as lead_time and stays_in_weekend_nights) are the explanatory variables.
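
The selection above can be written down as a character vector, which also makes it possible to drop incomplete rows based on the modelling columns only. This is a sketch under that assumption; the names model_vars and select_model_data are hypothetical and do not appear in the original code.

```r
# The ten variables listed above: response first, then the nine predictors.
model_vars <- c("is_canceled", "lead_time", "stays_in_weekend_nights",
                "stays_in_week_nights", "adults", "children", "babies",
                "booking_changes", "adr", "total_of_special_requests")

# Hypothetical helper: keep only the modelling columns and drop rows that
# have a missing value in any of them.
select_model_data <- function(df, vars = model_vars) {
  subset_df <- df[, vars, drop = FALSE]
  subset_df[complete.cases(subset_df), , drop = FALSE]
}

# Usage: model_data <- select_model_data(hotel_data)
```

In this data set only children contains NA values, so the result has the same rows as calling na.omit() on the full data frame.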
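# Remove rows with missing values so that both models are fit to the same observations;
# anova() cannot compare models that were fit to data sets of different sizes
hotel_data <- na.omit(hotel_data)
# Treat the 0/1 response as a factor with levels "0" (not canceled) and "1" (canceled)
hotel_data$is_canceled <- factor(hotel_data$is_canceled)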
#Model building with is_canceled and lead_time
glm_morel_0 <- glm(is_canceled ~ lead_time, data = hotel_data, family = "binomial")

#Model building with is_canceled and all nine predictors: lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, booking_changes, adr, total_of_special_requests
glm_morel_1 <- glm(is_canceled ~ lead_time + stays_in_weekend_nights + stays_in_week_nights + adults + children + babies + booking_changes + adr + total_of_special_requests, data = hotel_data, family = "binomial")

#Print summary of the model
summary_glm_morel_0 <- summary(glm_morel_0)
summary_glm_morel_1 <- summary(glm_morel_1)
summary_glm_morel_0
## 
## Call:
## glm(formula = is_canceled ~ lead_time, family = "binomial", data = hotel_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.166e+00  9.183e-03 -126.96   <2e-16 ***
## lead_time    5.856e-03  6.137e-05   95.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157390  on 119385  degrees of freedom
## Residual deviance: 147146  on 119384  degrees of freedom
## AIC: 147150
## 
## Number of Fisher Scoring iterations: 4
summary_glm_morel_1
## 
## Call:
## glm(formula = is_canceled ~ lead_time + stays_in_weekend_nights + 
##     stays_in_week_nights + adults + children + babies + booking_changes + 
##     adr + total_of_special_requests, family = "binomial", data = hotel_data)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.418e+00  2.816e-02 -50.363  < 2e-16 ***
## lead_time                  5.952e-03  6.687e-05  89.008  < 2e-16 ***
## stays_in_weekend_nights   -2.185e-02  7.602e-03  -2.874  0.00406 ** 
## stays_in_week_nights       3.236e-03  4.055e-03   0.798  0.42483    
## adults                     1.106e-01  1.434e-02   7.711 1.25e-14 ***
## children                   1.993e-02  1.761e-02   1.131  0.25788    
## babies                     5.974e-02  8.243e-02   0.725  0.46856    
## booking_changes           -7.739e-01  1.592e-02 -48.620  < 2e-16 ***
## adr                        5.483e-03  1.574e-04  34.847  < 2e-16 ***
## total_of_special_requests -7.638e-01  1.011e-02 -75.520  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157390  on 119385  degrees of freedom
## Residual deviance: 136010  on 119376  degrees of freedom
## AIC: 136030
## 
## Number of Fisher Scoring iterations: 5
lead_time_coef <- summary_glm_morel_1$coefficients["lead_time", "Estimate"]
lead_time_se <- summary_glm_morel_1$coefficients["lead_time", "Std. Error"]


# Print the coefficient and standard error for lead_time (from glm_morel_1)
cat("1. Coefficient for lead_time (model -:", lead_time_coef, "\n")
## 1. Coefficient for lead_time (model -: 0.005951949
cat("2. Standard Error for lead_time-:", lead_time_se, "\n")
## 2. Standard Error for lead_time-: 6.686953e-05
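
The coefficient printed above is on the log-odds scale, which is hard to read directly. As a rough sketch, the two printed values can be turned into an odds ratio with an approximate 95% Wald interval. The variable names below are new, and the interval assumes the usual normal approximation for the estimate.

```r
# Values copied from the printed output for lead_time in glm_morel_1.
b_lead  <- 0.005951949
se_lead <- 6.686953e-05

# Odds ratio for one extra day of lead time.
odds_ratio <- exp(b_lead)

# Approximate 95% Wald interval, built on the log-odds scale and then
# exponentiated.
z <- qnorm(0.975)
ci_odds_ratio <- exp(b_lead + c(-1, 1) * z * se_lead)

# Odds ratio for a 30-day increase in lead time.
odds_ratio_30 <- exp(30 * b_lead)

cat("Odds ratio per day:", round(odds_ratio, 4), "\n")
cat("95% Wald CI:", round(ci_odds_ratio, 4), "\n")
cat("Odds ratio per 30 days:", round(odds_ratio_30, 3), "\n")
```

An odds ratio slightly above 1 per day compounds over longer lead times, which is why the 30-day figure is easier to interpret than the per-day one.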
levels(hotel_data$is_canceled) 
## [1] "0" "1"
probs <- predict(glm_morel_1, type = "response")

mean(probs[hotel_data$is_canceled == "0"])
## [1] 0.3061183
mean(probs[hotel_data$is_canceled == "1"])
## [1] 0.4796542
# Test the difference between the two nested models with a chi-square test
# (rows with missing values were removed above so that both models use the same observations)
anova(glm_morel_1, glm_morel_0, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: is_canceled ~ lead_time + stays_in_weekend_nights + stays_in_week_nights + 
##     adults + children + babies + booking_changes + adr + total_of_special_requests
## Model 2: is_canceled ~ lead_time
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1    119376     136010                          
## 2    119384     147146 -8   -11137 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
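
The table above can be checked by hand from the two model summaries. The sketch below recomputes the likelihood ratio statistic from the printed residual deviances; because those printed values are rounded, the difference comes out as 11136 rather than the 11137 shown in the table.

```r
# Residual deviances and degrees of freedom copied from the model summaries.
dev_small <- 147146; df_small <- 119384   # is_canceled ~ lead_time
dev_full  <- 136010; df_full  <- 119376   # all nine predictors

lr_stat <- dev_small - dev_full   # drop in deviance from the extra predictors
lr_df   <- df_small - df_full     # number of extra coefficients

# Upper-tail chi-square probability of a drop at least this large.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df, p-value:", p_value, "\n")
```

A small p-value here favours the larger model: the extra predictors reduce the deviance by far more than chance alone would explain.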

Explanation: Both models above, glm_morel_0 and glm_morel_1, are logistic regressions that predict the probability of a booking being canceled. The objective is to understand how each predictor (explanatory variable) influences the likelihood of cancellation.
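Interpret the coefficients: The estimates are on the log-odds scale. In glm_morel_0 the intercept is -1.166 and the lead_time coefficient is 0.005856. In glm_morel_1 (with multiple predictors) the intercept is -1.418 and the lead_time coefficient is 0.005951949, so each additional day of lead time multiplies the odds of cancellation by about exp(0.00595) ≈ 1.006, holding the other predictors fixed. The other rows in the table above (stays_in_weekend_nights, adults, and so on) are slope coefficients, not intercepts: a positive estimate raises the odds of cancellation and a negative one lowers them.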
Akaike Information Criterion (AIC): The AIC is 147150 for glm_morel_0 and 136030 for glm_morel_1, i.e. AIC(glm_morel_0) > AIC(glm_morel_1). Lower AIC is better, so glm_morel_1 is preferred.
Null Deviance: This measures the fit of a model that contains only the intercept; lower values indicate a better fit. Its value is 157390 in both models because both are fit to the same data.
Residual Deviance: This measures the fit after adding the predictor variables. It is 147146 for glm_morel_0 and 136010 for glm_morel_1. A large reduction from the null deviance suggests that the predictors explain a substantial part of the variability in the response variable.
+ The mean predicted probability of cancellation from glm_morel_1 is 0.306 for bookings that were not canceled (0) and 0.480 for bookings that were canceled (1).
+ The coefficient and standard error for lead_time in glm_morel_1 are approximately 0.00595 and 6.69e-05 respectively.
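
The AIC and residual deviance figures quoted above are linked. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the printed values; the helper name aic_from_deviance is hypothetical, and the inputs are the rounded numbers from the summaries.

```r
# AIC for a binary-response logistic regression, from its residual deviance.
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

aic_0 <- aic_from_deviance(147146, 2)    # intercept + lead_time
aic_1 <- aic_from_deviance(136010, 10)   # intercept + nine predictors

cat("AIC glm_morel_0:", aic_0, "\n")
cat("AIC glm_morel_1:", aic_1, "\n")
cat("Difference:", aic_0 - aic_1, "\n")
```

The penalty for the eight extra coefficients is only 16, which is tiny next to the drop in deviance, so the larger model keeps the lower AIC.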
Plots

## `geom_smooth()` using formula = 'y ~ x'
## Warning: glm.fit: algorithm did not converge
## Warning: Computation failed in `stat_smooth()`
## Caused by error:
## ! y values must be 0 <= y <= 1
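
The plotting code is not shown, so the cause of the warnings above is an assumption. A likely one is that is_canceled was passed to the smoother as a factor, whose numeric codes are 1 and 2 rather than 0 and 1. The sketch below reproduces that conversion issue on simulated data (the coefficients are illustrative only) and draws the fitted curve with base R graphics instead of geom_smooth().

```r
set.seed(1)

# Simulated stand-in for hotel_data; the coefficients are illustrative only.
n <- 500
toy <- data.frame(lead_time = round(runif(n, 0, 400)))
toy$is_canceled <- factor(rbinom(n, 1, plogis(-1.2 + 0.006 * toy$lead_time)))

# as.numeric() on a factor returns the level codes (1, 2), not the labels.
codes <- as.numeric(toy$is_canceled)

# Converting through character recovers the original 0/1 values.
toy$canceled_num <- as.numeric(as.character(toy$is_canceled))

# Fit on the 0/1 response and predict over a grid of lead times.
fit <- glm(canceled_num ~ lead_time, data = toy, family = binomial)
grid <- data.frame(lead_time = seq(0, 400, by = 5))
grid$prob <- predict(fit, newdata = grid, type = "response")

# Jittered observations with the fitted probability curve on top.
plot(toy$lead_time, jitter(toy$canceled_num, amount = 0.03),
     pch = 16, col = "grey60",
     xlab = "lead_time", ylab = "Probability of cancellation")
lines(grid$lead_time, grid$prob, lwd = 2)
```

With a 0/1 numeric response the binomial fit stays within the valid range, which is the condition the error message above complains about.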

Conclude: The coefficients, their significance, and the goodness-of-fit measures above suggest that lead time is a significant predictor of booking cancellations in the dataset. Both models converged in 4 to 5 Fisher scoring iterations.
The Analysis of Deviance Table shows that glm_morel_1, which has multiple predictor variables, fits the data significantly better than glm_morel_0, which has only lead time as a predictor: the residual deviance falls by 11137 on 8 degrees of freedom (p < 2.2e-16), and the AIC is lower as well. lead_time alone is therefore not sufficient for predicting the cancellation status of hotel bookings. Among the additional predictors, adults, booking_changes, adr, total_of_special_requests, and stays_in_weekend_nights are significant, while stays_in_week_nights, children, and babies are not significant at the 0.05 level.



Thank you!!!