1. Extended Regression Model

# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)      # For VIF
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(broom)    # For tidying model output and diagnostics

# Load your data (adjust path as needed)
data <- read_csv("AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
extended_var <- lm(price ~ number_of_reviews + room_type + availability_365 + room_type:neighbourhood_group, data = data)

# Summarize the model
summary(extended_var)
## 
## Call:
## lm(formula = price ~ number_of_reviews + room_type + availability_365 + 
##     room_type:neighbourhood_group, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -271.1  -61.0  -22.3   14.1 9947.2 
## 
## Coefficients:
##                                                             Estimate Std. Error
## (Intercept)                                               108.435444  11.822470
## number_of_reviews                                          -0.292373   0.023697
## room_typePrivate room                                     -64.738931  14.770147
## room_typeShared room                                      -73.899533  31.773637
## availability_365                                            0.177472   0.008077
## room_typeEntire home/apt:neighbourhood_groupBrooklyn       60.811315  11.985117
## room_typePrivate room:neighbourhood_groupBrooklyn          21.236182   9.255543
## room_typeShared room:neighbourhood_groupBrooklyn          -11.498399  31.590440
## room_typeEntire home/apt:neighbourhood_groupManhattan     125.223174  11.919191
## room_typePrivate room:neighbourhood_groupManhattan         62.666420   9.330258
## room_typeShared room:neighbourhood_groupManhattan          36.105118  31.311002
## room_typeEntire home/apt:neighbourhood_groupQueens         23.600678  12.764170
## room_typePrivate room:neighbourhood_groupQueens             9.695375   9.784009
## room_typeShared room:neighbourhood_groupQueens              4.429815  33.696558
## room_typeEntire home/apt:neighbourhood_groupStaten Island  43.537702  20.856826
## room_typePrivate room:neighbourhood_groupStaten Island    -12.759020  18.932959
## room_typeShared room:neighbourhood_groupStaten Island      11.867079  81.735083
##                                                           t value Pr(>|t|)    
## (Intercept)                                                 9.172  < 2e-16 ***
## number_of_reviews                                         -12.338  < 2e-16 ***
## room_typePrivate room                                      -4.383 1.17e-05 ***
## room_typeShared room                                       -2.326   0.0200 *  
## availability_365                                           21.973  < 2e-16 ***
## room_typeEntire home/apt:neighbourhood_groupBrooklyn        5.074 3.91e-07 ***
## room_typePrivate room:neighbourhood_groupBrooklyn           2.294   0.0218 *  
## room_typeShared room:neighbourhood_groupBrooklyn           -0.364   0.7159    
## room_typeEntire home/apt:neighbourhood_groupManhattan      10.506  < 2e-16 ***
## room_typePrivate room:neighbourhood_groupManhattan          6.716 1.88e-11 ***
## room_typeShared room:neighbourhood_groupManhattan           1.153   0.2489    
## room_typeEntire home/apt:neighbourhood_groupQueens          1.849   0.0645 .  
## room_typePrivate room:neighbourhood_groupQueens             0.991   0.3217    
## room_typeShared room:neighbourhood_groupQueens              0.131   0.8954    
## room_typeEntire home/apt:neighbourhood_groupStaten Island   2.087   0.0369 *  
## room_typePrivate room:neighbourhood_groupStaten Island     -0.674   0.5004    
## room_typeShared room:neighbourhood_groupStaten Island       0.145   0.8846    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.6 on 48878 degrees of freedom
## Multiple R-squared:  0.09384,    Adjusted R-squared:  0.09354 
## F-statistic: 316.3 on 16 and 48878 DF,  p-value: < 2.2e-16

Details of Every Variable

number_of_reviews: Retained from last week’s model because it showed a weak but statistically significant relationship with price.
room_type: A categorical predictor with three levels (Entire home/apt, Private room, Shared room), entered as two dummy variables with Entire home/apt as the reference level. Price is expected to differ by type of listing.
availability_365: The number of days per year the listing is available, which may be related to price (for example, hosts of listings that are often vacant may cut prices to attract more guests).
Interaction (room_type:neighbourhood_group): Lets the price of each room type differ by borough. Because the formula uses : rather than * and has no neighbourhood_group main effect, the 12 interaction coefficients are differences from the Bronx (the reference borough) within each room type.
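
To make the role of the : operator concrete, here is a minimal sketch on made-up data. The toy data frame, its borough column, and the simulated prices are assumptions for illustration, not the Airbnb data. It compares the coefficients produced by room_type + room_type:borough with those produced by room_type * borough.

```r
# Toy data: every room type appears in every borough (prices are simulated)
set.seed(1)
toy <- expand.grid(
  room_type = c("Entire home/apt", "Private room", "Shared room"),
  borough   = c("Bronx", "Brooklyn", "Manhattan"),
  rep       = 1:20
)
toy$price <- 100 + rnorm(nrow(toy), sd = 20)

# Same structure as the report's formula: one main effect plus the ':' term
fit_colon <- lm(price ~ room_type + room_type:borough, data = toy)
# Full crossing: both main effects plus the interaction
fit_star  <- lm(price ~ room_type * borough, data = toy)

names(coef(fit_colon))  # borough contrasts nested within each room type
names(coef(fit_star))   # borough main effect plus deviations from it

# Both formulas estimate one mean per room type and borough cell, so they
# use the same number of coefficients and give the same fitted values
same_size <- length(coef(fit_colon)) == length(coef(fit_star))
same_fit  <- isTRUE(all.equal(unname(fitted(fit_colon)),
                              unname(fitted(fit_star))))
c(same_size = same_size, same_fit = same_fit)
```

The two fits differ only in how the coefficients are labelled and interpreted, which matters when reading the interaction rows of a summary table.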

Multicollinearity check:

Multicollinearity can be evaluated with the Variance Inflation Factor (VIF). A high VIF means that a predictor is strongly correlated with the other predictors, which inflates the standard errors of its coefficients and makes the estimates unstable.
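
For a single numeric predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated predictors (x1, x2, and x3 are hypothetical names, not columns of the Airbnb data), as a way to see what the number measures.

```r
# Simulated predictors: x2 is built from x1, x3 is independent of both
set.seed(123)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other columns
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(manual_vif, 2)  # x1 and x2 are inflated, x3 stays near 1
```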

vif(extended_var)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                                      GVIF Df GVIF^(1/(2*Df))
## number_of_reviews                1.042340  1        1.020950
## room_type                      990.716320  2        5.610316
## availability_365                 1.056938  1        1.028075
## room_type:neighbourhood_group 1021.173517 12        1.334686

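Program output: For terms with more than one degree of freedom, vif() reports a generalized VIF (GVIF), and the value to compare across terms is GVIF^(1/(2*Df)), which is on the scale of the square root of an ordinary VIF. For room_type this value is 5.61, or about 31.5 when squared, well above the common VIF threshold of 5. number_of_reviews and availability_365 are close to 1 and show no sign of multicollinearity. The room_type inflation is largely structural: the interaction term is built from the same room_type dummy variables, which is what the warning about higher-order terms points out.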
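
The adjusted column can be reproduced from the GVIF and Df columns with a few lines of arithmetic. This sketch only re-derives the numbers printed above; the variable names gvif, df, adj, and adj_sq are chosen here for illustration.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(number_of_reviews               = 1.042340,
          room_type                       = 990.716320,
          availability_365                = 1.056938,
          "room_type:neighbourhood_group" = 1021.173517)
df <- c(1, 2, 1, 12)

adj    <- gvif^(1 / (2 * df))  # the GVIF^(1/(2*Df)) column
adj_sq <- adj^2                # squared, for comparison with VIF-style cutoffs

round(cbind(GVIF = gvif, Df = df, adjusted = adj, adjusted_squared = adj_sq), 3)
```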


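Interpretation: High multicollinearity makes it hard to separate each predictor’s individual effect on the outcome, because the affected coefficients have inflated standard errors and are therefore less reliable. Here the collinearity arises between room_type and its own interaction with neighbourhood_group, which is expected whenever a main effect and its interaction appear in the same model, so a sensible first step is to rerun vif() with type = 'predictor', as the warning suggests. Against the VIF threshold of 5 used in this analysis, removing or combining variables, or techniques such as principal component analysis (PCA), become relevant mainly if multicollinearity also appears among other, distinct predictors.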

2. Model Diagnostics

# 1. Residuals vs Fitted Plot
plot(extended_var, which = 1)

# 2. Normal Q-Q Plot
plot(extended_var, which = 2)

# 3. Scale-Location Plot
plot(extended_var, which = 3)

# 4. Residuals vs Leverage Plot
plot(extended_var, which = 5)

# 5. Cook’s Distance Plot to identify influential points
cooks_dist <- cooks.distance(extended_var)
plot(cooks_dist, type = "h", main = "Cook's Distance", ylab = "Cook's distance")

# Highlight influential points based on Cook's Distance threshold
influential_points <- which(cooks_dist > (4 / nrow(data)))
data[influential_points, ]
## # A tibble: 415 × 16
##        id name      host_id host_name neighbourhood_group neighbourhood latitude
##     <dbl> <chr>       <dbl> <chr>     <chr>               <chr>            <dbl>
##  1 174966 Luxury 2…  836168 Henry     Manhattan           Upper West S…     40.8
##  2 273190 6 Bedroo…  605463 West Vil… Manhattan           West Village      40.7
##  3 279857 #1 Yello… 1420300 Gordy     Brooklyn            Bedford-Stuy…     40.7
##  4 363673 Beautifu…  256239 Tracey    Manhattan           Upper West S…     40.8
##  5 468613 $ (Phone… 2325861 Cynthia   Manhattan           Lower East S…     40.7
##  6 598612 Most bre… 2960326 Fabio     Brooklyn            Williamsburg      40.7
##  7 634353 Luxury 1…  836168 Henry     Manhattan           Upper West S…     40.8
##  8 639199 Beautifu… 1483081 Marina    Staten Island       Tottenville       40.5
##  9 664047 Lux 2Bed…  836168 Henry     Manhattan           Upper West S…     40.8
## 10 738588 Wedding … 1360198 Marina    Staten Island       Arrochar          40.6
## # ℹ 405 more rows
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

Analysis of Every Diagnostic Plot


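Residuals vs Fitted plot: Checks linearity and constant variance. The residuals should scatter randomly around zero with no visible trend or funnel shape. The residual summary above (minimum -271.1, maximum 9947.2) already points to a few listings priced far above what the model predicts.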

Normal Q-Q Plot: Checks whether the residuals are approximately normal. Points that depart from the diagonal reference line, especially in the tails, indicate non-normal residuals.

Scale-location plot: Checks homoscedasticity (constant residual variance). The variance is probably constant if the points are spread evenly around a roughly horizontal line; an upward or downward trend suggests that the variance changes with the fitted values.
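
A numerical check can complement the visual reading of the scale-location plot. One option, sketched here in base R on simulated data, is the studentized (Koenker) form of the Breusch-Pagan test: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The toy data and its growing error variance are assumptions for illustration, and this is not a step taken in the report itself.

```r
# Simulated data whose error standard deviation grows with x
set.seed(1)
n   <- 500
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = toy$x)

fit <- lm(y ~ x, data = toy)

# Auxiliary regression: squared residuals on the same predictor
toy$e2 <- resid(fit)^2
aux    <- lm(e2 ~ x, data = toy)

# Test statistic n * R^2, chi-squared with one df per auxiliary predictor
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)  # small p-value: variance is not constant
```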


Residuals vs. Leverage Plot: Helps identify influential points that could have an excessive impact on the model. Points that combine high leverage with a large standardized residual, and so fall outside the dashed Cook’s distance contours, are the most influential.
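
Leverage can also be inspected numerically with hatvalues(). The sketch below uses simulated data with one observation placed far from the rest in x. The cutoff of twice the average leverage (2p/n) is a common rule of thumb and an assumption here, not a threshold used in the report.

```r
# Simulated data: the last observation sits far from the others in x
set.seed(7)
n   <- 100
toy <- data.frame(x = c(rnorm(n - 1), 8))
toy$y <- 1 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)

h <- hatvalues(fit)       # leverage of each observation
p <- length(coef(fit))    # number of estimated coefficients

# Leverages sum to p, so the average leverage is p / n
high_leverage <- which(h > 2 * p / n)
high_leverage
```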


Cook’s Distance Plot: Displays the influence of each observation directly. Observations with high values should be inspected to see how they affect the model; the code above flags those with a Cook’s distance greater than 4/n, which gives 415 listings.
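
One way to investigate that influence is to refit the model without the flagged observations and compare the coefficients. The sketch below does this on simulated data with a single injected outlier; the toy data frame and its values are assumptions for illustration, and the same pattern could be tried on extended_var and data.

```r
# Simulated data with one artificially extreme observation in row 1
set.seed(42)
n   <- 200
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 50 + 3 * toy$x + rnorm(n, sd = 5)
toy$x[1] <- 10
toy$y[1] <- 500

fit <- lm(y ~ x, data = toy)

# Flag observations with the same 4/n rule used above
cd      <- cooks.distance(fit)
flagged <- which(cd > 4 / nobs(fit))

# Refit without the flagged rows and put the coefficients side by side
fit_trim <- lm(y ~ x, data = toy[-flagged, ])
round(cbind(full = coef(fit), trimmed = coef(fit_trim)), 3)
```

Large shifts between the two columns indicate that the flagged observations drive the estimates. Dropping them permanently is a separate decision that depends on whether they are data errors or genuine listings.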

Insights and Questions for Further Analysis