Initial setup: configure the data set and load the data file into the
variable hotel_data.
Data set - Hotels: This data comes from an
open hotel booking demand data set for hotels such as City Hotel and Resort
Hotel.
knitr::opts_chunk$set(echo = TRUE)
#library(dplyr)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
# Load the data into hotel_data for further use
#hotel_data <- read.csv(file.choose())
hotel_data <- read.csv('C:/Users/amitg/Documents/workspaceR/data/hotels.csv')
Question: Refer to the simple linear regression model you built
last week. Include 1-3 more variables in your regression model.
Evaluate this model.
y = x ∗ βx + β0 + ϵ
In my hotel data set, during last week's data dive #8 analysis, the
column “adr” (Average Daily Rate) was taken as the continuous response
variable, and the columns “reserved_room_type” and “lead_time” were
taken as explanatory variables.
Continuing that analysis, we can extend the regression model with more
predictors: first “lead_time” + “arrival_date_month”, and then the
interaction “lead_time” * “arrival_date_month” together with
“reserved_room_type”.
Assumptions
1. lead_time has a small impact on ADR.
2. lead_time + arrival_date_month has a larger impact on ADR.
3. lead_time * arrival_date_month + reserved_room_type has a significant impact on ADR.
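The three assumptions rely on R's formula operators, so it may help to see which terms each operator creates. The sketch below uses a made-up four-row data frame (the column names mirror the hotel data, the values are hypothetical) and inspects the design-matrix columns with model.matrix().

```r
# Toy data; column names mirror the hotel data set, the values are invented.
toy <- data.frame(
  lead_time = c(10, 20, 30, 40),
  arrival_date_month = factor(c("April", "August", "April", "August"))
)

# Column names of the design matrix show which terms each formula creates.
additive <- colnames(model.matrix(~ lead_time + arrival_date_month, data = toy))
crossed  <- colnames(model.matrix(~ lead_time * arrival_date_month, data = toy))

print(additive)
print(crossed)

# The only term that `*` adds on top of `+` is the interaction column.
print(setdiff(crossed, additive))
```

In other words, `a * b` is shorthand for `a + b + a:b`, which is why Model 3 below contains both the main effects and the lead_time:arrival_date_month terms.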
hotel_data$reserved_room_type <- factor(hotel_data$reserved_room_type)
# Model 1: simple linear regression of adr on lead_time
simple_lm_model_0 <- lm(adr ~ lead_time, data = hotel_data)
# Model 2: multiple linear regression of adr on lead_time and arrival_date_month
simple_lm_model_1 <- lm(adr ~ lead_time + arrival_date_month, data = hotel_data)
# Model 3: regression of adr on lead_time, arrival_date_month, their interaction, and reserved_room_type
simple_lm_model_2 <- lm(adr ~ lead_time*arrival_date_month + reserved_room_type, data = hotel_data)
# Summary of the model
summary(simple_lm_model_0)
##
## Call:
## lm(formula = adr ~ lead_time, data = hotel_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.5 -31.4 -7.2 23.9 5296.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.933697 0.203692 515.16 <2e-16 ***
## lead_time -0.029829 0.001366 -21.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.44 on 119388 degrees of freedom
## Multiple R-squared: 0.003979, Adjusted R-squared: 0.00397
## F-statistic: 476.9 on 1 and 119388 DF, p-value: < 2.2e-16
# Summary of the model
summary(simple_lm_model_1)
##
## Call:
## lm(formula = adr ~ lead_time + arrival_date_month, data = hotel_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.1 -26.2 -2.4 20.6 5316.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.222060 0.440476 245.69 <2e-16 ***
## lead_time -0.082694 0.001261 -65.56 <2e-16 ***
## arrival_date_monthAugust 41.906807 0.569585 73.57 <2e-16 ***
## arrival_date_monthDecember -21.062212 0.688744 -30.58 <2e-16 ***
## arrival_date_monthFebruary -30.682092 0.655922 -46.78 <2e-16 ***
## arrival_date_monthJanuary -34.113960 0.720929 -47.32 <2e-16 ***
## arrival_date_monthJuly 29.838991 0.582973 51.18 <2e-16 ***
## arrival_date_monthJune 19.055394 0.603053 31.60 <2e-16 ***
## arrival_date_monthMarch -21.743983 0.619811 -35.08 <2e-16 ***
## arrival_date_monthMay 10.202071 0.591237 17.25 <2e-16 ***
## arrival_date_monthNovember -28.094336 0.688166 -40.83 <2e-16 ***
## arrival_date_monthOctober -10.142281 0.599626 -16.91 <2e-16 ***
## arrival_date_monthSeptember 8.129796 0.610045 13.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.64 on 119377 degrees of freedom
## Multiple R-squared: 0.2197, Adjusted R-squared: 0.2197
## F-statistic: 2802 on 12 and 119377 DF, p-value: < 2.2e-16
summary(simple_lm_model_2)
##
## Call:
## lm(formula = adr ~ lead_time * arrival_date_month + reserved_room_type,
## data = hotel_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.9 -23.7 0.7 21.6 5327.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.398e+01 6.019e-01 156.148 < 2e-16
## lead_time -3.605e-02 4.775e-03 -7.549 4.43e-14
## arrival_date_monthAugust 4.839e+01 7.881e-01 61.403 < 2e-16
## arrival_date_monthDecember -2.304e+01 8.571e-01 -26.884 < 2e-16
## arrival_date_monthFebruary -2.873e+01 8.046e-01 -35.713 < 2e-16
## arrival_date_monthJanuary -3.258e+01 8.527e-01 -38.207 < 2e-16
## arrival_date_monthJuly 3.470e+01 8.265e-01 41.987 < 2e-16
## arrival_date_monthJune 2.315e+01 8.562e-01 27.037 < 2e-16
## arrival_date_monthMarch -2.073e+01 7.931e-01 -26.139 < 2e-16
## arrival_date_monthMay 1.360e+01 8.184e-01 16.612 < 2e-16
## arrival_date_monthNovember -2.930e+01 8.559e-01 -34.233 < 2e-16
## arrival_date_monthOctober -9.021e+00 8.006e-01 -11.268 < 2e-16
## arrival_date_monthSeptember 1.770e+01 8.330e-01 21.244 < 2e-16
## reserved_room_typeB -9.385e-02 1.220e+00 -0.077 0.939
## reserved_room_typeC 5.190e+01 1.337e+00 38.804 < 2e-16
## reserved_room_typeD 2.433e+01 3.259e-01 74.654 < 2e-16
## reserved_room_typeE 2.941e+01 5.209e-01 56.451 < 2e-16
## reserved_room_typeF 6.920e+01 7.672e-01 90.201 < 2e-16
## reserved_room_typeG 7.656e+01 8.973e-01 85.319 < 2e-16
## reserved_room_typeH 8.746e+01 1.658e+00 52.739 < 2e-16
## reserved_room_typeL -8.581e+00 1.653e+01 -0.519 0.604
## reserved_room_typeP -8.901e+01 1.169e+01 -7.616 2.65e-14
## lead_time:arrival_date_monthAugust -8.700e-02 5.742e-03 -15.152 < 2e-16
## lead_time:arrival_date_monthDecember 5.199e-02 6.968e-03 7.461 8.63e-14
## lead_time:arrival_date_monthFebruary 3.900e-02 7.903e-03 4.935 8.03e-07
## lead_time:arrival_date_monthJanuary 7.312e-02 8.349e-03 8.757 < 2e-16
## lead_time:arrival_date_monthJuly -6.817e-02 5.791e-03 -11.770 < 2e-16
## lead_time:arrival_date_monthJune -4.238e-02 6.065e-03 -6.988 2.80e-12
## lead_time:arrival_date_monthMarch 1.881e-02 6.702e-03 2.807 0.005
## lead_time:arrival_date_monthMay -3.653e-02 5.970e-03 -6.119 9.44e-10
## lead_time:arrival_date_monthNovember 7.527e-02 6.803e-03 11.064 < 2e-16
## lead_time:arrival_date_monthOctober 3.987e-04 5.665e-03 0.070 0.944
## lead_time:arrival_date_monthSeptember -7.008e-02 5.716e-03 -12.260 < 2e-16
##
## (Intercept) ***
## lead_time ***
## arrival_date_monthAugust ***
## arrival_date_monthDecember ***
## arrival_date_monthFebruary ***
## arrival_date_monthJanuary ***
## arrival_date_monthJuly ***
## arrival_date_monthJune ***
## arrival_date_monthMarch ***
## arrival_date_monthMay ***
## arrival_date_monthNovember ***
## arrival_date_monthOctober ***
## arrival_date_monthSeptember ***
## reserved_room_typeB
## reserved_room_typeC ***
## reserved_room_typeD ***
## reserved_room_typeE ***
## reserved_room_typeF ***
## reserved_room_typeG ***
## reserved_room_typeH ***
## reserved_room_typeL
## reserved_room_typeP ***
## lead_time:arrival_date_monthAugust ***
## lead_time:arrival_date_monthDecember ***
## lead_time:arrival_date_monthFebruary ***
## lead_time:arrival_date_monthJanuary ***
## lead_time:arrival_date_monthJuly ***
## lead_time:arrival_date_monthJune ***
## lead_time:arrival_date_monthMarch **
## lead_time:arrival_date_monthMay ***
## lead_time:arrival_date_monthNovember ***
## lead_time:arrival_date_monthOctober
## lead_time:arrival_date_monthSeptember ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.47 on 119357 degrees of freedom
## Multiple R-squared: 0.3587, Adjusted R-squared: 0.3586
## F-statistic: 2087 on 32 and 119357 DF, p-value: < 2.2e-16
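Because the three models are nested, they can also be compared directly rather than by reading three separate summaries. The following is a minimal sketch on simulated data (the data frame `sim`, its effect sizes, and the model names `m0`, `m1`, `m2` are all assumptions for illustration, not the hotel data) that uses anova() for sequential F-tests and collects adjusted R-squared values side by side.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  lead_time = runif(n, 0, 300),
  month = factor(sample(c("April", "August", "January"), n, replace = TRUE))
)
# Simulated response: month shifts the level and changes the lead_time slope.
month_shift <- unname(c(April = 0, August = 40, January = -30)[as.character(sim$month)])
month_slope <- unname(c(April = -0.03, August = -0.10, January = 0.04)[as.character(sim$month)])
sim$adr <- 100 + month_shift + month_slope * sim$lead_time + rnorm(n, sd = 10)

m0 <- lm(adr ~ lead_time, data = sim)
m1 <- lm(adr ~ lead_time + month, data = sim)
m2 <- lm(adr ~ lead_time * month, data = sim)

# Sequential F-tests: does each added set of terms reduce the residual variance?
comparison <- anova(m0, m1, m2)
print(comparison)

# Adjusted R-squared for each model, side by side.
models <- list(m0 = m0, m1 = m1, m2 = m2)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
r2     <- sapply(models, function(m) summary(m)$r.squared)
print(round(adj_r2, 3))
```

Applied to the hotel models, the equivalent call would be anova(simple_lm_model_0, simple_lm_model_1, simple_lm_model_2); its output is not shown here because it was not part of the original run.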
coef(simple_lm_model_0)
## (Intercept) lead_time
## 104.93369683 -0.02982918
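To make the size of the lead_time effect concrete, the coefficients printed above can be plugged into the fitted equation by hand. This is a small worked example; the three lead times are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the coef(simple_lm_model_0) output.
intercept <- 104.93369683
slope     <- -0.02982918

# Hypothetical lead times (in days), chosen only for illustration.
lead_times <- c(0, 100, 365)

# Fitted ADR = intercept + slope * lead_time
predicted_adr <- intercept + slope * lead_times
print(round(predicted_adr, 2))
```

Under this model, booking 100 days earlier lowers the predicted ADR by about 3 units, which fits the very small R-squared reported for simple_lm_model_0.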
ggplot(hotel_data, aes(x = lead_time, y = adr)) +
  geom_point() + # Add scatter plot points
  # Fitted line from simple_lm_model_0; constants go outside aes() so a single line is drawn
  geom_abline(intercept = coef(simple_lm_model_0)[1], slope = coef(simple_lm_model_0)[2],
              colour = "red") +
  geom_smooth(method = "lm", se = FALSE) + # Add linear regression line
  labs(x = "Lead Time", y = "Average Daily Rate (ADR)") + # Add axis labels
  theme_minimal() # Use minimal theme for cleaner look
## `geom_smooth()` using formula = 'y ~ x'
ggplot(hotel_data, aes(x = lead_time, y = adr, color = arrival_date_month)) +
  geom_point() + # Add scatter plot points
  geom_smooth(method = "lm", se = FALSE) + # Add one linear regression line per month
  labs(x = "Lead Time", y = "Average Daily Rate (ADR)", color = "Arrival Month") + # Add axis and legend labels
  theme_minimal() # Use minimal theme for cleaner look
## `geom_smooth()` using formula = 'y ~ x'
ncvTest(simple_lm_model_0)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4554.749, Df = 1, p = < 2.22e-16
ncvTest(simple_lm_model_1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5846.116, Df = 1, p = < 2.22e-16
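The very small p-values from ncvTest() reject the hypothesis of constant error variance, meaning the residual spread changes with the fitted values. To show the idea behind such a test, here is a sketch of a studentized Breusch-Pagan-style check written in base R. It is related to, but not identical to, car::ncvTest(), and it runs on simulated data whose noise is made to grow with the predictor; all names and values are assumptions.

```r
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
# Simulated data whose noise grows with x (heteroscedastic by construction).
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values.
aux_data <- data.frame(sq_res = residuals(fit)^2, fitted_val = fitted(fit))
aux <- lm(sq_res ~ fitted_val, data = aux_data)

# Studentized (Koenker) statistic: n * R^2, compared to chi-square with 1 df.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
cat("statistic:", round(bp_stat, 2), " p-value:", signif(bp_p, 3), "\n")
```

A small p-value here has the same reading as in the ncvTest() output: the constant-variance assumption of ordinary least squares does not hold, so standard errors from summary() should be treated with caution.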
# GVIF: Generalized Variance Inflation Factor
# Check multicollinearity in model simple_lm_model_1
vif(simple_lm_model_1)
## GVIF Df GVIF^(1/(2*Df))
## lead_time 1.088277 1 1.043205
## arrival_date_month 1.088277 11 1.003853
In model simple_lm_model_1, the GVIF values are close to 1, which means there is little multicollinearity.
#To see the multicollinearity in model simple_lm_model_2
vif(simple_lm_model_2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## lead_time 18.978114 1 4.356388
## arrival_date_month 1408.865335 11 1.390370
## reserved_room_type 1.042284 9 1.002303
## lead_time:arrival_date_month 11210.666844 11 1.527827
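In model simple_lm_model_2, by contrast, the GVIF values for lead_time (18.98), arrival_date_month (1408.87), and their interaction (11210.67) are far greater than 1, which indicates strong multicollinearity. Much of this is expected, because an interaction term is built from its own main effects; the vif() output itself suggests setting type = 'predictor' for models with interactions.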
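The collinearity introduced by an interaction can be illustrated on a small scale. The sketch below computes an ordinary VIF by hand, as 1 / (1 - R^2), for a 0/1 indicator before and after centering the continuous predictor. Centering is a common remedy and was not used in this analysis; the data are simulated, `vif_of` is a hypothetical helper, and this is the plain VIF rather than the GVIF that car::vif() reports for factors.

```r
set.seed(7)
n <- 300
lead_time <- runif(n, 0, 300)
is_august <- rbinom(n, 1, 0.5)   # hypothetical 0/1 month indicator

# VIF of one predictor: 1 / (1 - R^2) from regressing it on the other predictors.
vif_of <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

# Uncentered interaction: strongly related to the indicator itself.
raw_inter <- lead_time * is_august
vif_raw <- vif_of(is_august, data.frame(lead_time, raw_inter))

# Centered interaction: the overlap with the indicator largely disappears.
lead_centered <- lead_time - mean(lead_time)
cen_inter <- lead_centered * is_august
vif_centered <- vif_of(is_august, data.frame(lead_centered, cen_inter))

cat("VIF uncentered:", round(vif_raw, 2), " VIF centered:", round(vif_centered, 2), "\n")
```

The fitted values of the model are the same either way; centering only changes how the coefficients are parameterized, which is why large VIFs caused purely by interaction terms are usually less worrying than large VIFs between unrelated predictors.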
Let's plot simple_lm_model_0.
#Diagnostic plots given by applying plot to our fitted model simple_lm_model_0
plot(simple_lm_model_0)
Analyzing the diagnostic plots for model simple_lm_model_0 gives the
following highlights:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and
Residuals vs Leverage plots, the trend lines are roughly horizontal and
parallel to the x-axis, which suggests that lead_time has a linear
relationship with ADR.
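The model summaries report a maximum residual above 5000 while the third quartile is about 24, so at least one booking sits far from the rest and can dominate the diagnostic plots. The quantities behind those plots can also be inspected numerically. This is a sketch on simulated data with one planted outlier (row 17 and the value 5400 are arbitrary choices for illustration), using the base R functions rstandard() and cooks.distance().

```r
set.seed(3)
n <- 100
x <- runif(n, 0, 300)
y <- 100 - 0.03 * x + rnorm(n, sd = 20)
y[17] <- 5400                      # planted outlier, purely for illustration

fit <- lm(y ~ x)
std_res <- rstandard(fit)          # standardized residuals (Q-Q and Scale-Location plots)
cooks   <- cooks.distance(fit)     # influence measure (Residuals vs Leverage plot)

# Row with the largest standardized residual, and all rows beyond 3 in absolute value.
worst   <- unname(which.max(abs(std_res)))
flagged <- unname(which(abs(std_res) > 3))

print(worst)
print(flagged)
print(round(cooks[worst], 3))
```

On the hotel models the same calls would take simple_lm_model_0 (or the other two fits) in place of `fit`; the flagged rows could then be checked before deciding whether they are data errors or genuine bookings.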
Plot simple_lm_model_1
plot(simple_lm_model_1)
Analyzing the diagnostic plots for model simple_lm_model_1 gives the
following highlights:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and
Residuals vs Leverage plots, the trend lines are roughly horizontal and
parallel to the x-axis (a slight improvement over the plots for
simple_lm_model_0), which suggests that lead_time + arrival_date_month
has a linear relationship with, and a small impact on, ADR.
plot(simple_lm_model_2)
Analyzing the diagnostic plots for model simple_lm_model_2 gives the
following highlights:
In the Residuals vs Fitted, Q-Q Residuals, Scale-Location, and
Residuals vs Leverage plots, the trend lines slope upward (more steeply
than in the plots for simple_lm_model_0 and simple_lm_model_1), which
suggests that lead_time * arrival_date_month + reserved_room_type has a
linear relationship with, and a significant impact on, ADR.
Conclusion: Based on the quantitative analysis, visual inspection, and
comparison between plots above, my assumptions about the impact of
lead_time, lead_time + arrival_date_month, and
lead_time * arrival_date_month + reserved_room_type on ADR are in line
with the results. Adjusted R-squared rises from 0.004
(simple_lm_model_0) to 0.22 (simple_lm_model_1) to 0.36
(simple_lm_model_2), and each model's F-test p-value is < 2.2e-16, that
is, close to zero. The quantitative values and the graphs point in the
same direction, which gives me confidence in the derived model.
Thank you!