# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car) # For VIF
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(broom) # For tidying model output and diagnostics
# Load your data (adjust path as needed)
data <- read_csv("AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date (1): last_review
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
extended_var <- lm(price ~ number_of_reviews + room_type + availability_365 + room_type:neighbourhood_group, data = data)
# Summarize the model
summary(extended_var)
##
## Call:
## lm(formula = price ~ number_of_reviews + room_type + availability_365 +
## room_type:neighbourhood_group, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271.1 -61.0 -22.3 14.1 9947.2
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 108.435444 11.822470
## number_of_reviews -0.292373 0.023697
## room_typePrivate room -64.738931 14.770147
## room_typeShared room -73.899533 31.773637
## availability_365 0.177472 0.008077
## room_typeEntire home/apt:neighbourhood_groupBrooklyn 60.811315 11.985117
## room_typePrivate room:neighbourhood_groupBrooklyn 21.236182 9.255543
## room_typeShared room:neighbourhood_groupBrooklyn -11.498399 31.590440
## room_typeEntire home/apt:neighbourhood_groupManhattan 125.223174 11.919191
## room_typePrivate room:neighbourhood_groupManhattan 62.666420 9.330258
## room_typeShared room:neighbourhood_groupManhattan 36.105118 31.311002
## room_typeEntire home/apt:neighbourhood_groupQueens 23.600678 12.764170
## room_typePrivate room:neighbourhood_groupQueens 9.695375 9.784009
## room_typeShared room:neighbourhood_groupQueens 4.429815 33.696558
## room_typeEntire home/apt:neighbourhood_groupStaten Island 43.537702 20.856826
## room_typePrivate room:neighbourhood_groupStaten Island -12.759020 18.932959
## room_typeShared room:neighbourhood_groupStaten Island 11.867079 81.735083
## t value Pr(>|t|)
## (Intercept) 9.172 < 2e-16 ***
## number_of_reviews -12.338 < 2e-16 ***
## room_typePrivate room -4.383 1.17e-05 ***
## room_typeShared room -2.326 0.0200 *
## availability_365 21.973 < 2e-16 ***
## room_typeEntire home/apt:neighbourhood_groupBrooklyn 5.074 3.91e-07 ***
## room_typePrivate room:neighbourhood_groupBrooklyn 2.294 0.0218 *
## room_typeShared room:neighbourhood_groupBrooklyn -0.364 0.7159
## room_typeEntire home/apt:neighbourhood_groupManhattan 10.506 < 2e-16 ***
## room_typePrivate room:neighbourhood_groupManhattan 6.716 1.88e-11 ***
## room_typeShared room:neighbourhood_groupManhattan 1.153 0.2489
## room_typeEntire home/apt:neighbourhood_groupQueens 1.849 0.0645 .
## room_typePrivate room:neighbourhood_groupQueens 0.991 0.3217
## room_typeShared room:neighbourhood_groupQueens 0.131 0.8954
## room_typeEntire home/apt:neighbourhood_groupStaten Island 2.087 0.0369 *
## room_typePrivate room:neighbourhood_groupStaten Island -0.674 0.5004
## room_typeShared room:neighbourhood_groupStaten Island 0.145 0.8846
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 228.6 on 48878 degrees of freedom
## Multiple R-squared: 0.09384, Adjusted R-squared: 0.09354
## F-statistic: 316.3 on 16 and 48878 DF, p-value: < 2.2e-16
Details of Every Variable
Number of Reviews: retained from last week's model because it had a weak
but statistically significant relationship with price.
room_type: a categorical term with three levels (Entire home/apt,
Private room, Shared room), entered as two dummy variables with Entire
home/apt as the reference level. Price is expected to differ by type of
listing.
availability_365: the number of days per year the listing is available,
which may be related to pricing (for example, hosts may adjust prices
depending on demand). The fitted coefficient is positive, about 0.18 per
available day.
Interaction (room_type:neighbourhood_group): captures the combined effect
of borough and room type on price, on the assumption that the cost of
each room type differs by borough. Because neighbourhood_group has no
main effect in the formula, R estimates a separate borough effect within
each room type, relative to the omitted reference borough (Bronx).
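To illustrate how a `room_type:neighbourhood_group` style term is parameterised, here is a minimal sketch on a made-up toy data frame. The names `toy`, `room`, and `borough` and all values are hypothetical and are not part of the analysis. It suggests that, when every room/borough combination is observed, a main effect plus this interaction gives the same fitted values as the full crossing `room * borough`; only the coefficient labels differ.

```r
# Toy data (hypothetical): 3 room types x 2 boroughs, 5 listings per cell
set.seed(1)
toy <- expand.grid(
  room    = c("Entire", "Private", "Shared"),
  borough = c("Bronx", "Manhattan"),
  rep     = 1:5,
  stringsAsFactors = TRUE
)
toy$price <- 100 + 20 * as.integer(toy$room) +
  50 * as.integer(toy$borough) + rnorm(nrow(toy), sd = 10)

# Parameterisation used in the model above: main effect plus interaction only
fit_nested <- lm(price ~ room + room:borough, data = toy)
# Full crossing: room + borough + room:borough
fit_crossed <- lm(price ~ room * borough, data = toy)

names(coef(fit_nested))   # one borough contrast per room type
names(coef(fit_crossed))  # borough main effect plus deviations from it

# Both formulas estimate one mean per cell, so the fitted values coincide
same_fit <- isTRUE(all.equal(unname(fitted(fit_nested)),
                             unname(fitted(fit_crossed))))
same_fit
```

Reading the coefficients of the first form is often easier, since each interaction coefficient is the borough difference for one specific room type.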
Multicollinearity can be evaluated using the Variance Inflation Factor (VIF). High VIF values indicate strongly correlated predictors, which inflate the standard errors of the affected coefficients and make their individual estimates unstable.
vif(extended_var)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## number_of_reviews 1.042340 1 1.020950
## room_type 990.716320 2 5.610316
## availability_365 1.056938 1 1.028075
## room_type:neighbourhood_group 1021.173517 12 1.334686
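Program output: For terms with more than one degree of freedom, vif() reports a generalized VIF (GVIF), and the GVIF^(1/(2*Df)) column is the one that is comparable across terms. room_type has GVIF^(1/(2*Df)) = 5.61, which corresponds to roughly 31.5 on the ordinary VIF scale (5.61 squared), well above the common threshold of 5. number_of_reviews and availability_365 are both close to 1 and show no sign of multicollinearity. Much of the inflation for room_type is structural, because room_type also appears in the room_type:neighbourhood_group interaction; the vif() message itself suggests setting type = 'predictor' for models with interactions.
Interpretation: High multicollinearity complicates the assessment of each
predictor variable’s individual effect on the outcome, because the
affected coefficient estimates become less reliable. Given the VIF
threshold of 5 used in this analysis, possible remedies include
re-checking the VIFs with type = 'predictor', removing or combining
room_type with related variables, or employing techniques such as
principal component analysis (PCA) if multicollinearity persists among
other predictors as well.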
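As a rough illustration of what a VIF measures, the sketch below computes it by hand in base R on simulated predictors (the data frame `sim` and the variables `x1`, `x2`, `x3` are hypothetical). It also squares the GVIF^(1/(2*Df)) value reported for room_type above, which is a common way to compare it with an ordinary VIF cut-off; treat that rescaling as a rule of thumb rather than a formal test.

```r
# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)

# Squaring GVIF^(1/(2*Df)) puts it on the usual VIF scale
gvif_scaled <- 5.610316          # room_type value from the vif() output above
gvif_as_vif <- gvif_scaled^2
gvif_as_vif
```

In this simulation `x1` and `x2` receive large VIFs because each is almost fully explained by the other, while `x3` stays near 1.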
# 1. Residuals vs Fitted Plot
plot(extended_var, which = 1)
# 2. Normal Q-Q Plot
plot(extended_var, which = 2)
# 3. Scale-Location Plot
plot(extended_var, which = 3)
# 4. Residuals vs Leverage Plot
plot(extended_var, which = 5)
# 5. Cook’s Distance Plot to identify influential points
cooks_dist <- cooks.distance(extended_var)
plot(cooks_dist, type = "h", main = "Cook's Distance", ylab = "Cook's distance")
# Highlight influential points based on Cook's Distance threshold
influential_points <- which(cooks_dist > (4 / nrow(data)))
data[influential_points, ]
## # A tibble: 415 × 16
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 174966 Luxury 2… 836168 Henry Manhattan Upper West S… 40.8
## 2 273190 6 Bedroo… 605463 West Vil… Manhattan West Village 40.7
## 3 279857 #1 Yello… 1420300 Gordy Brooklyn Bedford-Stuy… 40.7
## 4 363673 Beautifu… 256239 Tracey Manhattan Upper West S… 40.8
## 5 468613 $ (Phone… 2325861 Cynthia Manhattan Lower East S… 40.7
## 6 598612 Most bre… 2960326 Fabio Brooklyn Williamsburg 40.7
## 7 634353 Luxury 1… 836168 Henry Manhattan Upper West S… 40.8
## 8 639199 Beautifu… 1483081 Marina Staten Island Tottenville 40.5
## 9 664047 Lux 2Bed… 836168 Henry Manhattan Upper West S… 40.8
## 10 738588 Wedding … 1360198 Marina Staten Island Arrochar 40.6
## # ℹ 405 more rows
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>
Residuals vs Fitted plot:
Program output: A random scatter of residuals around the horizontal zero line indicates that the model fits the data without systematic bias. A visible pattern (e.g., a curve) indicates non-linearity, meaning the model does not adequately capture the relationship.
Interpretation: The shape of any pattern identifies the problem. For
instance, a funnel shape indicates heteroscedasticity (the variance of
the residuals changes with the fitted values), which violates the
constant-variance assumption of linear regression.
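A funnel pattern can also be checked numerically. The following is a minimal base-R sketch of a studentized Breusch-Pagan style check on simulated data (the variables `x`, `y`, and `fit` are hypothetical, not the Airbnb model): the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution.

```r
# Simulated data (hypothetical) whose error spread grows with x, i.e. a funnel
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with chi-squared on 1 df (one auxiliary predictor here)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value is evidence against constant variance, which agrees with what a funnel shape in the plot would suggest.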
Normal Q-Q Plot: Checks whether the residuals are approximately normal; departures from the diagonal line indicate non-normal residuals.
Program Output: The normal Q-Q plot compares the quantiles of the residuals with those of a normal distribution. If the points closely follow the line, the residuals are approximately normally distributed, which supports this model assumption. Deviations, particularly at the tails, indicate non-normality.
Interpretation: Note where the points diverge from the line. Large
deviations suggest that the residuals are not normally distributed,
which can affect the validity of hypothesis tests and confidence
intervals for the regression coefficients. The residual summary above
(median -22.3, maximum 9947.2) already points to a long right tail, so
divergence is to be expected at the upper end.
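One simple way to put a number on how straight a Q-Q plot is: correlate the sample quantiles with the theoretical normal quantiles. The sketch below does this on simulated residuals (`normal_res`, `skewed_res`, and the helper `qq_cor` are hypothetical names); it is only an informal summary, not a formal normality test.

```r
set.seed(3)
normal_res <- rnorm(1000)
skewed_res <- rlnorm(1000) - exp(0.5)  # right-skewed, centred near zero

# Correlation between theoretical and sample quantiles
# (values close to 1 mean the Q-Q points lie close to a straight line)
qq_cor <- function(res) {
  qq <- qqnorm(res, plot.it = FALSE)
  cor(qq$x, qq$y)
}
c(normal = qq_cor(normal_res), skewed = qq_cor(skewed_res))
```

The right-skewed sample gives a visibly lower correlation, which is the numeric counterpart of points bending away from the line in the upper tail.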
Scale-location plot: The scale-location plot is used to check homoscedasticity. Variance is probably constant if the points are spread evenly around a roughly horizontal line.
Program output: A roughly horizontal line with evenly spread points indicates homoscedasticity (constant variance). An upward or downward trend suggests heteroscedasticity.
Interpretation: If the plot shows a trend, the error variance is unequal across the range of fitted values, which may warrant the use of robust standard errors or a transformation of the dependent variable to stabilize the variance.
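For readers curious what robust standard errors involve, here is a minimal base-R sketch of the White (HC0) sandwich estimator on simulated data. The variables `x`, `y`, and `fit` are hypothetical; in practice a dedicated package would usually be used, and this hand computation is only meant to show the idea.

```r
set.seed(11)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 3 * x + rnorm(n, sd = x)   # error variance grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- resid(fit)

# White (HC0) sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)           # equals X' diag(e^2) X
robust_se <- sqrt(diag(bread %*% meat %*% bread))

cbind(classical = sqrt(diag(vcov(fit))), robust_HC0 = robust_se)
```

The coefficient estimates are unchanged; only the standard errors, and therefore the t-tests and confidence intervals, are adjusted.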
Residuals vs. Leverage Plot: The Residuals vs. Leverage
Plot helps identify influential points that could have an excessive
impact on the model. Points that fall outside the Cook’s distance
contour lines are potentially influential.
Program Output: This plot helps identify influential data points that may disproportionately affect the model. Points with high leverage (far from the others in terms of predictor values) and large residuals can strongly influence the model’s coefficients and predictions.
Interpretation: Any high-leverage points or outliers should be examined more closely. Cook’s distance can be used to quantify the impact of these points, and their removal can be considered if they substantially compromise model stability.
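Leverage can also be inspected directly through the hat values. The sketch below uses simulated data with one planted extreme predictor value (all names are hypothetical) and a common rule of thumb, leverage above 2p/n, which is an assumption of this example rather than something used elsewhere in the analysis.

```r
set.seed(5)
n <- 100
x <- c(rnorm(n - 1), 15)      # the last observation sits far from the rest
y <- 2 + x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)           # leverage of each observation
p <- length(coef(fit))        # number of estimated coefficients

# Common rule of thumb: flag leverage above 2 * p / n
high_leverage <- which(h > 2 * p / n)
high_leverage
```

High leverage alone does not make a point harmful; it becomes influential when it also has a large residual, which is what Cook's distance combines.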
Cook’s Distance Plot: Influential points are displayed directly
in the Cook’s Distance Plot. Look for observations with high values and
investigate how they influence the model.
The Cook’s distance plot illustrates the influence of each data point on the model. Conventional cut-offs are 0.5 or 1; the code above uses the much lower screening threshold 4/n, which flags 415 of the 48,895 listings. Observations with a high Cook’s distance significantly influence the model’s parameter estimates and predictions.
Interpretation: Points that exceed the chosen Cook’s distance threshold
may disproportionately affect the model’s outcomes, potentially
distorting predictions or coefficients. Any such points should be
investigated further to understand their influence. For instance, if
they arise from data entry mistakes or are genuine outliers, they can be
excluded, or their influence can be evaluated by re-estimating the model
both with and without these observations.
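The with-and-without comparison could look roughly like this. The sketch uses simulated data with one planted extreme observation (the data frame `sim` and the model names are hypothetical) and applies the same 4/n screening rule as the code above.

```r
set.seed(9)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 10 + 2 * sim$x + rnorm(n)
sim$x[1] <- 6
sim$y[1] <- 60                      # plant one extreme observation

fit_all <- lm(y ~ x, data = sim)
cd <- cooks.distance(fit_all)

# 4/n screening rule, with n taken from the rows the model actually used
flagged <- which(cd > 4 / nobs(fit_all))

# Refit without the flagged rows and compare coefficients side by side
fit_trim <- lm(y ~ x, data = sim[-flagged, ])
rbind(all_rows = coef(fit_all), without_flagged = coef(fit_trim))
```

If the coefficients barely move, the flagged points matter little in practice; a large shift means conclusions depend on a handful of observations and both fits are worth reporting.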
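Summary: Each plot checks a different model assumption: linearity (Residuals vs Fitted), normality of residuals (Normal Q-Q), constant variance (Scale-Location), and the absence of unduly influential observations (Residuals vs Leverage and Cook’s distance). Any violation reduces the reliability of the coefficient tests and confidence intervals. With a Multiple R-squared of 0.094, the model also explains less than 10% of the variation in price.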
Further Investigation: Consider additional steps, such as transforming variables, removing outliers, or adjusting the model if assumptions are violated or influential points affect the results.
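As one example of a variable transformation, the sketch below compares residual skewness before and after log-transforming a right-skewed response. The data are simulated, and the names `sim`, `reviews`, `price`, and `skew` are hypothetical stand-ins; `log1p()` is used on the assumption that some prices could be zero, where a plain `log()` would return `-Inf`.

```r
set.seed(21)
n <- 1000
sim <- data.frame(reviews = rpois(n, 20))
# Hypothetical right-skewed prices with multiplicative noise
sim$price <- exp(4.5 - 0.005 * sim$reviews + rnorm(n, sd = 0.7))

fit_raw <- lm(price ~ reviews, data = sim)
fit_log <- lm(log1p(price) ~ reviews, data = sim)

# Simple moment-based skewness; values near 0 indicate symmetric residuals
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3
c(raw = skew(resid(fit_raw)), log = skew(resid(fit_log)))
```

Note that coefficients of the log model describe approximate proportional changes in price rather than changes in price units, so the two fits are not directly comparable coefficient by coefficient.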