Initial setup and data set configuration.
Load the data file into the variable hotel_data.
Data set - Hotels: This data comes from an open hotel booking demand data set for hotels such as City Hotel and Resort Hotel.

knitr::opts_chunk$set(echo = TRUE)

#library(dplyr)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
# Load the data into hotel_data for further use
#hotel_data <- read.csv(file.choose())
hotel_data <- read.csv('C:/Users/amitg/Documents/workspaceR/data/hotels.csv')
Question: Refer to the simple linear regression model you built last week. Include 1-3 more variables in your regression model.

Evaluate this model

Simple Linear Regression Model: A simple linear regression model describes the relationship between two variables and is used to decide whether that relationship is statistically significant.
The simplest version of the linear regression model (with a single independent variable) can be expressed as follows:

y = x ∗ βx + β0 + ϵ

  1. βx : The slope, which tells us how much we would expect y to change given a one-unit change in x.
  2. β0 : The intercept, an overall offset, which tells us what value we would expect y to have when x = 0.
  3. y : Dependent variable.
  4. x : Independent variable.
  5. ϵ : Error term, the part of y that the line does not explain.
(Reference: section 14.1, https://statsthinking21.github.io/statsthinking21-core-site/the-general-linear-model.html#)
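
To make the roles of β0 and βx concrete, here is a minimal sketch on simulated data. The parameter values (beta_0 = 100, beta_x = -0.5) and the noise level are made up for illustration and are not taken from the hotel data set; the point is only that lm() recovers values close to the ones used to generate the data.

```r
# Simulated data with made-up parameter values (not from the hotel data set)
set.seed(42)
n      <- 500
beta_0 <- 100    # assumed intercept
beta_x <- -0.5   # assumed slope
x      <- runif(n, min = 0, max = 200)
y      <- beta_0 + beta_x * x + rnorm(n, mean = 0, sd = 10)  # error term ~ N(0, 10^2)

# Fit y = x * beta_x + beta_0 + error
toy_fit <- lm(y ~ x)
coef(toy_fit)   # estimates should land near beta_0 and beta_x
```

The estimated slope is read the same way as in the list above: the expected change in y for a one-unit change in x.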

In my hotel dataset, during last week's data dive #8 analysis, the column “adr” (Average Daily Rate) was taken as the continuous response variable, and the columns “reserved_room_type” and “lead_time” were taken as explanatory variables.
Continuing the same analysis, the regression model can be extended with additional predictors: “lead_time” + “arrival_date_month” (additive), and “lead_time” * “arrival_date_month” + “reserved_room_type” (lead time interacted with month, plus room type).
Assumptions
1. lead_time alone has a small impact on ADR.
2. lead_time + arrival_date_month has a larger impact on ADR.
3. lead_time * arrival_date_month + reserved_room_type has a significant impact on ADR.
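
The formula operators `+`, `:` and `*` produce different sets of model terms, which matters for how the assumptions above map onto the fitted models. The sketch below uses a tiny made-up data frame (the column names mirror the hotel data set, the values do not) and model.matrix() to list the columns each formula generates.

```r
# Toy data; column names mirror the hotel data set but the values are made up
toy <- data.frame(
  lead_time          = c(10, 40, 5, 90, 30, 60),
  arrival_date_month = factor(c("April", "August", "April", "August", "July", "July"))
)

# `+` : main effects only
cols_add  <- colnames(model.matrix(~ lead_time + arrival_date_month, data = toy))
# `:` : interaction columns only, no separate main effects
cols_int  <- colnames(model.matrix(~ lead_time:arrival_date_month, data = toy))
# `*` : shorthand for main effects plus their interaction (a + b + a:b)
cols_star <- colnames(model.matrix(~ lead_time * arrival_date_month, data = toy))

cols_add
cols_int
cols_star
```

With `*`, each non-reference month gets its own adjustment to the lead_time slope, which is what the `lead_time:arrival_date_month...` rows in a model summary represent.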

hotel_data$reserved_room_type <- factor(hotel_data$reserved_room_type)
#Model 1 : Build a simple linear regression model of adr on lead_time
simple_lm_model_0 <- lm(adr ~ lead_time, data = hotel_data)
#Model 2 : Build a multiple linear regression model of adr with lead_time and arrival_date_month as predictors
simple_lm_model_1 <- lm(adr ~ lead_time + arrival_date_month, data = hotel_data)

#Model 3 : Build a multiple regression model of adr with lead_time, arrival_date_month, their interaction, and reserved_room_type as predictors
simple_lm_model_2 <- lm(adr ~ lead_time*arrival_date_month + reserved_room_type, data = hotel_data)

# Summary of the model
summary(simple_lm_model_0)
## 
## Call:
## lm(formula = adr ~ lead_time, data = hotel_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -105.5  -31.4   -7.2   23.9 5296.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.933697   0.203692  515.16   <2e-16 ***
## lead_time    -0.029829   0.001366  -21.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.44 on 119388 degrees of freedom
## Multiple R-squared:  0.003979,   Adjusted R-squared:  0.00397 
## F-statistic: 476.9 on 1 and 119388 DF,  p-value: < 2.2e-16
# Summary of the model
summary(simple_lm_model_1)
## 
## Call:
## lm(formula = adr ~ lead_time + arrival_date_month, data = hotel_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -150.1  -26.2   -2.4   20.6 5316.4 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 108.222060   0.440476  245.69   <2e-16 ***
## lead_time                    -0.082694   0.001261  -65.56   <2e-16 ***
## arrival_date_monthAugust     41.906807   0.569585   73.57   <2e-16 ***
## arrival_date_monthDecember  -21.062212   0.688744  -30.58   <2e-16 ***
## arrival_date_monthFebruary  -30.682092   0.655922  -46.78   <2e-16 ***
## arrival_date_monthJanuary   -34.113960   0.720929  -47.32   <2e-16 ***
## arrival_date_monthJuly       29.838991   0.582973   51.18   <2e-16 ***
## arrival_date_monthJune       19.055394   0.603053   31.60   <2e-16 ***
## arrival_date_monthMarch     -21.743983   0.619811  -35.08   <2e-16 ***
## arrival_date_monthMay        10.202071   0.591237   17.25   <2e-16 ***
## arrival_date_monthNovember  -28.094336   0.688166  -40.83   <2e-16 ***
## arrival_date_monthOctober   -10.142281   0.599626  -16.91   <2e-16 ***
## arrival_date_monthSeptember   8.129796   0.610045   13.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.64 on 119377 degrees of freedom
## Multiple R-squared:  0.2197, Adjusted R-squared:  0.2197 
## F-statistic:  2802 on 12 and 119377 DF,  p-value: < 2.2e-16
summary(simple_lm_model_2)
## 
## Call:
## lm(formula = adr ~ lead_time * arrival_date_month + reserved_room_type, 
##     data = hotel_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -218.9  -23.7    0.7   21.6 5327.4 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            9.398e+01  6.019e-01 156.148  < 2e-16
## lead_time                             -3.605e-02  4.775e-03  -7.549 4.43e-14
## arrival_date_monthAugust               4.839e+01  7.881e-01  61.403  < 2e-16
## arrival_date_monthDecember            -2.304e+01  8.571e-01 -26.884  < 2e-16
## arrival_date_monthFebruary            -2.873e+01  8.046e-01 -35.713  < 2e-16
## arrival_date_monthJanuary             -3.258e+01  8.527e-01 -38.207  < 2e-16
## arrival_date_monthJuly                 3.470e+01  8.265e-01  41.987  < 2e-16
## arrival_date_monthJune                 2.315e+01  8.562e-01  27.037  < 2e-16
## arrival_date_monthMarch               -2.073e+01  7.931e-01 -26.139  < 2e-16
## arrival_date_monthMay                  1.360e+01  8.184e-01  16.612  < 2e-16
## arrival_date_monthNovember            -2.930e+01  8.559e-01 -34.233  < 2e-16
## arrival_date_monthOctober             -9.021e+00  8.006e-01 -11.268  < 2e-16
## arrival_date_monthSeptember            1.770e+01  8.330e-01  21.244  < 2e-16
## reserved_room_typeB                   -9.385e-02  1.220e+00  -0.077    0.939
## reserved_room_typeC                    5.190e+01  1.337e+00  38.804  < 2e-16
## reserved_room_typeD                    2.433e+01  3.259e-01  74.654  < 2e-16
## reserved_room_typeE                    2.941e+01  5.209e-01  56.451  < 2e-16
## reserved_room_typeF                    6.920e+01  7.672e-01  90.201  < 2e-16
## reserved_room_typeG                    7.656e+01  8.973e-01  85.319  < 2e-16
## reserved_room_typeH                    8.746e+01  1.658e+00  52.739  < 2e-16
## reserved_room_typeL                   -8.581e+00  1.653e+01  -0.519    0.604
## reserved_room_typeP                   -8.901e+01  1.169e+01  -7.616 2.65e-14
## lead_time:arrival_date_monthAugust    -8.700e-02  5.742e-03 -15.152  < 2e-16
## lead_time:arrival_date_monthDecember   5.199e-02  6.968e-03   7.461 8.63e-14
## lead_time:arrival_date_monthFebruary   3.900e-02  7.903e-03   4.935 8.03e-07
## lead_time:arrival_date_monthJanuary    7.312e-02  8.349e-03   8.757  < 2e-16
## lead_time:arrival_date_monthJuly      -6.817e-02  5.791e-03 -11.770  < 2e-16
## lead_time:arrival_date_monthJune      -4.238e-02  6.065e-03  -6.988 2.80e-12
## lead_time:arrival_date_monthMarch      1.881e-02  6.702e-03   2.807    0.005
## lead_time:arrival_date_monthMay       -3.653e-02  5.970e-03  -6.119 9.44e-10
## lead_time:arrival_date_monthNovember   7.527e-02  6.803e-03  11.064  < 2e-16
## lead_time:arrival_date_monthOctober    3.987e-04  5.665e-03   0.070    0.944
## lead_time:arrival_date_monthSeptember -7.008e-02  5.716e-03 -12.260  < 2e-16
##                                          
## (Intercept)                           ***
## lead_time                             ***
## arrival_date_monthAugust              ***
## arrival_date_monthDecember            ***
## arrival_date_monthFebruary            ***
## arrival_date_monthJanuary             ***
## arrival_date_monthJuly                ***
## arrival_date_monthJune                ***
## arrival_date_monthMarch               ***
## arrival_date_monthMay                 ***
## arrival_date_monthNovember            ***
## arrival_date_monthOctober             ***
## arrival_date_monthSeptember           ***
## reserved_room_typeB                      
## reserved_room_typeC                   ***
## reserved_room_typeD                   ***
## reserved_room_typeE                   ***
## reserved_room_typeF                   ***
## reserved_room_typeG                   ***
## reserved_room_typeH                   ***
## reserved_room_typeL                      
## reserved_room_typeP                   ***
## lead_time:arrival_date_monthAugust    ***
## lead_time:arrival_date_monthDecember  ***
## lead_time:arrival_date_monthFebruary  ***
## lead_time:arrival_date_monthJanuary   ***
## lead_time:arrival_date_monthJuly      ***
## lead_time:arrival_date_monthJune      ***
## lead_time:arrival_date_monthMarch     ** 
## lead_time:arrival_date_monthMay       ***
## lead_time:arrival_date_monthNovember  ***
## lead_time:arrival_date_monthOctober      
## lead_time:arrival_date_monthSeptember ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.47 on 119357 degrees of freedom
## Multiple R-squared:  0.3587, Adjusted R-squared:  0.3586 
## F-statistic:  2087 on 32 and 119357 DF,  p-value: < 2.2e-16
coef(simple_lm_model_0)
##  (Intercept)    lead_time 
## 104.93369683  -0.02982918
ggplot(hotel_data, aes(x = lead_time, y = adr)) +
  geom_point() +  # Add scatter plot points
  # Fitted line from simple_lm_model_0; passing constants outside aes() draws a single line
  geom_abline(intercept = coef(simple_lm_model_0)[1], slope = coef(simple_lm_model_0)[2],
                colour = "red") +
  geom_smooth(method = "lm", se = FALSE) +  # Add linear regression line
  labs(x = "Lead Time", y = "Average Daily Rate (ADR)") +  # Add axis labels
  theme_minimal()  # Use minimal theme for cleaner look
## `geom_smooth()` using formula = 'y ~ x'

ggplot(hotel_data, aes(x = lead_time, y = adr, color = arrival_date_month)) +
  geom_point() +  # Add scatter plot points
  geom_smooth(method = "lm", se = FALSE) +  # Add linear regression line
  labs(x = "Lead Time", y = "Average Daily Rate (ADR)", color = "Arrival Month") +  # Add axis labels
  theme_minimal()  # Use minimal theme for cleaner look
## `geom_smooth()` using formula = 'y ~ x'

ncvTest(simple_lm_model_0)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4554.749, Df = 1, p = < 2.22e-16
ncvTest(simple_lm_model_1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5846.116, Df = 1, p = < 2.22e-16
#To see the multicollinearity in model simple_lm_model_1
#vif(simple_lm_model_1)
#To see the multicollinearity in model simple_lm_model_2
#vif(simple_lm_model_2)
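
The ncvTest() calls above test whether the residual variance changes with the fitted values; a very small p-value points to non-constant variance (heteroscedasticity). The following is a hand-rolled sketch of a score test of that kind (Breusch-Pagan style) on simulated data, using base R only. It is meant to show the idea, not to reproduce the internals of car::ncvTest() exactly; the data and all variable names are made up.

```r
set.seed(1)
n <- 1000
x <- runif(n, 1, 10)
# Error spread grows with x, so the constant-variance assumption is violated by design
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Scale the squared residuals by their mean, then regress them on the fitted values
u       <- residuals(fit)^2 / mean(residuals(fit)^2)
aux     <- lm(u ~ fitted(fit))
reg_ss  <- sum((fitted(aux) - mean(u))^2)   # regression sum of squares of the auxiliary fit
chi_sq  <- reg_ss / 2                        # score statistic, 1 degree of freedom
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
c(chisq = chi_sq, p = p_value)
```

A large statistic with a tiny p-value, as in the hotel models, means the standard errors reported by summary() should be read with some caution.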
  1. The first model (simple_lm_model_0) suggests that lead time has a statistically significant but very small effect on ADR: each additional day of lead time is associated with a decrease of about 0.03 in ADR. However, the adjusted R-squared (0.00397) shows that lead_time explains only about 0.4% of the variance in ADR, suggesting that other factors not included in the model are more influential.
  2. In the second model (simple_lm_model_1), arrival_date_month is added to lead_time as an additive predictor, without an interaction term. The model suggests that both lead_time and arrival_date_month significantly influence ADR. The adjusted R-squared indicates that these variables together explain around 21.97% of the variance in ADR. The significant coefficients for arrival_date_month indicate that ADR in each month differs from ADR in the reference month (April, the level that does not appear in the coefficient table).
  3. An interaction term between two variables captures a combined effect that is not explained by the individual effects of each variable alone. In the third model (simple_lm_model_2), lead_time is interacted with arrival_date_month (lead_time * arrival_date_month) and reserved_room_type is added as a further predictor. The model suggests that lead time, arrival month, reserved room type, and most of the lead_time-by-month interaction terms significantly influence ADR; the exceptions are room types B and L and the lead_time:October interaction, whose coefficients are not significant. The adjusted R-squared indicates that these variables together explain around 35.86% of the variance in ADR.
  4. In a regression model, multicollinearity is a situation where predictor variables are highly correlated with each other. It can cause issues such as unstable coefficient estimates and inflated standard errors, making it difficult to interpret the individual effects of predictors.
    In the models above, lead_time and arrival_date_month are unlikely to be strongly collinear as main effects: lead_time is a continuous variable representing the number of days between booking and arrival, while arrival_date_month is a categorical variable representing the month of arrival, and the vif() output for simple_lm_model_1 below confirms this.
    reserved_room_type is a categorical variable and may exhibit some multicollinearity if room type is strongly associated with the other predictors, for example if certain room types are booked mainly in particular months or with particular lead times.
    The interaction terms (lead_time:arrival_date_month) could exacerbate multicollinearity, because they tend to be highly correlated with their main effects (lead_time and arrival_date_month).
    A GVIF (“Generalized Variance Inflation Factor”) value near 1 indicates little multicollinearity, while higher values indicate increasing multicollinearity.
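
For numeric predictors, the ordinary variance inflation factor can be computed by hand as VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below does this on simulated data with made-up variables x1, x2 and x3; it illustrates the plain VIF only, not the generalized version (GVIF) that vif() reports for factors with several levels.

```r
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the remaining predictors
vif_by_hand <- c(
  x1 = 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared),
  x2 = 1 / (1 - summary(lm(x2 ~ x1 + x3))$r.squared),
  x3 = 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
)
round(vif_by_hand, 2)
```

The correlated pair x1 and x2 gets large values, while the independent x3 stays close to 1, which is the pattern to look for in the vif() tables.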
# GVIF means - Generalized Variance Inflation Factor

#To see the multicollinearity in model simple_lm_model_1
vif(simple_lm_model_1)
##                        GVIF Df GVIF^(1/(2*Df))
## lead_time          1.088277  1        1.043205
## arrival_date_month 1.088277 11        1.003853

In model simple_lm_model_1, the GVIF values are close to 1, which means there is little multicollinearity.

#To see the multicollinearity in model simple_lm_model_2
vif(simple_lm_model_2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                                      GVIF Df GVIF^(1/(2*Df))
## lead_time                       18.978114  1        4.356388
## arrival_date_month            1408.865335 11        1.390370
## reserved_room_type               1.042284  9        1.002303
## lead_time:arrival_date_month 11210.666844 11        1.527827

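In model simple_lm_model_2, however, the GVIF values for lead_time, arrival_date_month, and their interaction are far greater than 1, which indicates substantial multicollinearity between the interaction terms and their main effects. reserved_room_type stays close to 1.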
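
Much of the collinearity between an interaction term and its main effects comes from the numeric predictor not being centered. The sketch below, on made-up data with a two-level month factor, compares the correlation between the month dummy and the interaction column before and after centering lead_time. Centering is a common remedy that I am assuming could help here; it was not applied in the models above.

```r
set.seed(3)
n <- 2000
toy <- data.frame(
  lead_time = rexp(n, rate = 1 / 100),                                # made-up, right-skewed
  month     = factor(sample(c("April", "August"), n, replace = TRUE)) # two levels for brevity
)
is_august <- as.numeric(toy$month == "August")

# Correlation between the month dummy and the interaction column, raw vs centered
raw_cor      <- cor(is_august, toy$lead_time * is_august)
centered_lt  <- toy$lead_time - mean(toy$lead_time)
centered_cor <- cor(is_august, centered_lt * is_august)
c(raw = raw_cor, centered = centered_cor)
```

Centering changes how the main-effect coefficients are read (they then describe the effect at the average lead time) but leaves the fitted values of the model unchanged.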

Evaluate the models

Let's plot simple_lm_model_0

#Diagnostic plots given by applying plot to our fitted model simple_lm_model_0
plot(simple_lm_model_0)

Analyzing the plots for model simple_lm_model_0, the highlights are:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage plots, the trend line is roughly horizontal and parallel to the x-axis, which suggests that lead_time has a linear relationship with ADR.


Let's plot simple_lm_model_1

plot(simple_lm_model_1)

Analyzing the plots for model simple_lm_model_1, the highlights are:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage plots, the trend line is roughly horizontal and parallel to the x-axis (a slight improvement over the plots for simple_lm_model_0), which suggests that lead_time + arrival_date_month has a linear relationship with, and a small impact on, ADR.

Let's plot simple_lm_model_2

plot(simple_lm_model_2)

Analyzing the plots for model simple_lm_model_2, the highlights are:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage plots, the trend line slopes upward (more steeply than in the plots for simple_lm_model_0 and simple_lm_model_1), which suggests that lead_time * arrival_date_month + reserved_room_type has a linear relationship with, and a significant impact on, ADR.
Conclusion: Based on the quantitative analysis, visual inspection, and comparison between the plots above, my assumptions about the impact of lead_time, lead_time + arrival_date_month, and lead_time * arrival_date_month + reserved_room_type on ADR hold. The quantitative values (in the analysis above, a p-value < 2.2e-16 means close to zero) and the plots both support the same assumptions, which gives me confidence in the derived models.
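
Because each model above extends the previous one, the comparison can also be made with sequential F-tests via anova(), alongside the adjusted R-squared values. The sketch below uses simulated data with made-up month effects and hypothetical model names m0, m1, m2; it shows the mechanics only and says nothing about the hotel data itself.

```r
set.seed(11)
n <- 600
toy <- data.frame(
  lead_time = runif(n, 0, 300),
  month     = factor(sample(c("April", "August", "January"), n, replace = TRUE))
)
month_shift <- c(April = 0, August = 40, January = -30)   # made-up month effects
toy$adr <- 100 - 0.05 * toy$lead_time +
  unname(month_shift[as.character(toy$month)]) + rnorm(n, sd = 20)

# Nested models: each one adds terms to the previous one
m0 <- lm(adr ~ lead_time, data = toy)
m1 <- lm(adr ~ lead_time + month, data = toy)
m2 <- lm(adr ~ lead_time * month, data = toy)

adj_r2 <- sapply(list(m0 = m0, m1 = m1, m2 = m2), function(m) summary(m)$adj.r.squared)
adj_r2

# Sequential F-tests: does each added block of terms reduce the residual error?
cmp <- anova(m0, m1, m2)
cmp
```

A small p-value in a row of the anova() table indicates that the terms added in that step improve the fit beyond what the smaller model already explains.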

Thank you!!!