Week 9 Data Dive

1. Extended Regression Model

Start by expanding on last week’s model with additional predictors to gain further insight into factors affecting price.

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)      # For VIF

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(broom)    # For tidying model output and diagnostics

# Load your data (adjust path as needed)
data <- read_csv("AB_NYC_2019.csv")

## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

extended_var <- lm(price ~ number_of_reviews + room_type + availability_365 + room_type:neighbourhood_group, data = data)

# Summarize the model
summary(extended_var)

## 
## Call:
## lm(formula = price ~ number_of_reviews + room_type + availability_365 + 
##     room_type:neighbourhood_group, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -271.1  -61.0  -22.3   14.1 9947.2 
## 
## Coefficients:
##                                                             Estimate Std. Error
## (Intercept)                                               108.435444  11.822470
## number_of_reviews                                          -0.292373   0.023697
## room_typePrivate room                                     -64.738931  14.770147
## room_typeShared room                                      -73.899533  31.773637
## availability_365                                            0.177472   0.008077
## room_typeEntire home/apt:neighbourhood_groupBrooklyn       60.811315  11.985117
## room_typePrivate room:neighbourhood_groupBrooklyn          21.236182   9.255543
## room_typeShared room:neighbourhood_groupBrooklyn          -11.498399  31.590440
## room_typeEntire home/apt:neighbourhood_groupManhattan     125.223174  11.919191
## room_typePrivate room:neighbourhood_groupManhattan         62.666420   9.330258
## room_typeShared room:neighbourhood_groupManhattan          36.105118  31.311002
## room_typeEntire home/apt:neighbourhood_groupQueens         23.600678  12.764170
## room_typePrivate room:neighbourhood_groupQueens             9.695375   9.784009
## room_typeShared room:neighbourhood_groupQueens              4.429815  33.696558
## room_typeEntire home/apt:neighbourhood_groupStaten Island  43.537702  20.856826
## room_typePrivate room:neighbourhood_groupStaten Island    -12.759020  18.932959
## room_typeShared room:neighbourhood_groupStaten Island      11.867079  81.735083
##                                                           t value Pr(>|t|)    
## (Intercept)                                                 9.172  < 2e-16 ***
## number_of_reviews                                         -12.338  < 2e-16 ***
## room_typePrivate room                                      -4.383 1.17e-05 ***
## room_typeShared room                                       -2.326   0.0200 *  
## availability_365                                           21.973  < 2e-16 ***
## room_typeEntire home/apt:neighbourhood_groupBrooklyn        5.074 3.91e-07 ***
## room_typePrivate room:neighbourhood_groupBrooklyn           2.294   0.0218 *  
## room_typeShared room:neighbourhood_groupBrooklyn           -0.364   0.7159    
## room_typeEntire home/apt:neighbourhood_groupManhattan      10.506  < 2e-16 ***
## room_typePrivate room:neighbourhood_groupManhattan          6.716 1.88e-11 ***
## room_typeShared room:neighbourhood_groupManhattan           1.153   0.2489    
## room_typeEntire home/apt:neighbourhood_groupQueens          1.849   0.0645 .  
## room_typePrivate room:neighbourhood_groupQueens             0.991   0.3217    
## room_typeShared room:neighbourhood_groupQueens              0.131   0.8954    
## room_typeEntire home/apt:neighbourhood_groupStaten Island   2.087   0.0369 *  
## room_typePrivate room:neighbourhood_groupStaten Island     -0.674   0.5004    
## room_typeShared room:neighbourhood_groupStaten Island       0.145   0.8846    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.6 on 48878 degrees of freedom
## Multiple R-squared:  0.09384,    Adjusted R-squared:  0.09354 
## F-statistic: 316.3 on 16 and 48878 DF,  p-value: < 2.2e-16
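
The coefficients above can be combined into expected prices for specific listing types. The sketch below does that arithmetic with four coefficients copied by hand from the summary, holding number_of_reviews and availability_365 at 0. It assumes the reference cell is an entire home/apt in the Bronx, the borough that does not appear in the table. The vector `b` and its short labels are illustrative names, not the names R assigns.

```r
# Coefficients copied by hand from the summary(extended_var) output above.
# The short names are illustrative labels, not the names R assigns.
b <- c(
  intercept         = 108.435444,  # entire home/apt in the Bronx (reference cell)
  private_room      = -64.738931,  # private room vs. entire home, within the Bronx
  entire_manhattan  = 125.223174,  # Manhattan vs. Bronx, entire homes
  private_manhattan =  62.666420   # Manhattan vs. Bronx, private rooms
)

# Expected price with number_of_reviews = 0 and availability_365 = 0
entire_bronx      <- b[["intercept"]]
private_bronx     <- b[["intercept"]] + b[["private_room"]]
entire_manhattan  <- b[["intercept"]] + b[["entire_manhattan"]]
private_manhattan <- b[["intercept"]] + b[["private_room"]] + b[["private_manhattan"]]

round(c(entire_bronx = entire_bronx, private_bronx = private_bronx,
        entire_manhattan = entire_manhattan, private_manhattan = private_manhattan), 2)
```

Read this way, the Manhattan premium over the Bronx is about 125 for entire homes but only about 63 for private rooms, which is the kind of borough-by-room-type difference the interaction term is meant to capture.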

Details of Every Variable

number_of_reviews: Retained from last week’s model, where it had a weak but statistically significant relationship with price.
room_type: A categorical predictor with three levels (entire home/apt, private room, shared room), entered as two dummy variables against the entire home/apt baseline, since price may depend on the type of listing.
availability_365: The listing’s availability over the year, which may be related to price (for example, listings in low demand may reduce prices to attract more guests).
Interaction (room_type:neighbourhood_group): Captures the combined effect of borough and room type on price, on the assumption that the price of each room type differs by borough. Because the formula uses : rather than *, neighbourhood_group has no separate main effect, so each interaction coefficient is a difference from the reference borough (the Bronx, which does not appear in the coefficient table) within one room type.
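
The difference between the two interaction operators in an R formula can be checked without any data. A minimal sketch, where `y`, `a`, and `b` are placeholder names standing in for price, room_type, and neighbourhood_group:

```r
# Compare how R expands the two interaction operators.
# `y`, `a`, and `b` are placeholder names; no data is needed to inspect terms.
colon_terms <- attr(terms(y ~ a + a:b), "term.labels")
star_terms  <- attr(terms(y ~ a * b), "term.labels")

colon_terms  # "a" "a:b"      (no main effect for b)
star_terms   # "a" "b" "a:b"  (both main effects plus the interaction)
```

The model in this report follows the first pattern, which is why neighbourhood_group appears only inside interaction terms in the summary.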

Multicollinearity check:

Multicollinearity can be evaluated using the Variance Inflation Factor (VIF). High VIF values indicate strongly correlated predictors, which inflate the standard errors of the affected coefficients and make the estimates unstable.

vif(extended_var)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

##                                      GVIF Df GVIF^(1/(2*Df))
## number_of_reviews                1.042340  1        1.020950
## room_type                      990.716320  2        5.610316
## availability_365                 1.056938  1        1.028075
## room_type:neighbourhood_group 1021.173517 12        1.334686

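Program output: For terms with more than one degree of freedom, vif() reports a generalized VIF (GVIF). room_type has a GVIF of about 991 and a GVIF^(1/(2*Df)) of 5.61. The latter is comparable to the square root of an ordinary VIF, so it corresponds to a VIF of roughly 31, well above the usual threshold of 5. Much of this inflation is likely structural: room_type also appears in the room_type:neighbourhood_group interaction (GVIF about 1021), so the two terms necessarily overlap, which is what the message printed above the table warns about.

Interpretation: High multicollinearity makes it hard to separate each predictor’s individual effect on the outcome and yields less reliable coefficient estimates. With a VIF threshold of 5, a reasonable first step is to rerun vif() with type = 'predictor', as the message suggests, to see whether the inflation persists once room_type and its interaction are assessed together. If it does, options include removing room_type or combining it with related variables, or using techniques such as principal component analysis (PCA) if multicollinearity also persists among the other predictors.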
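
To make the VIF idea concrete, the sketch below computes ordinary VIFs by hand on simulated data (not the Airbnb data), using VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the remaining predictors. The helper `vif_by_hand` and the variables `x1`, `x2`, `x3` are hypothetical names for this illustration. The last two lines convert the reported GVIF^(1/(2*Df)) for room_type to the ordinary VIF scale by squaring it.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)                  # unrelated to the others

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
round(vifs, 2)

# Squaring GVIF^(1/(2*Df)) puts it on the usual VIF scale
room_type_adj <- 5.610316        # copied from the vif() output above
room_type_adj^2                  # compare with the threshold of 5
```

In the simulation, x1 and x2 should show large VIFs while x3 stays close to 1, mirroring how only the overlapping terms are inflated in the model above.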

2. Model Diagnostics

The five standard diagnostic plots provide insights into whether the model meets regression assumptions and where potential issues may lie.

# 1. Residuals vs Fitted Plot
plot(extended_var, which = 1)

# 2. Normal Q-Q Plot
plot(extended_var, which = 2)

# 3. Scale-Location Plot
plot(extended_var, which = 3)

# 4. Residuals vs Leverage Plot (which = 5 in plot.lm; which = 4 is Cook's distance)
plot(extended_var, which = 5)

# 5. Cook’s Distance Plot to identify influential points
cooks_dist <- cooks.distance(extended_var)
plot(cooks_dist, type = "h", main = "Cook's Distance", ylab = "Cook's distance")

# Highlight influential points based on Cook's Distance threshold
influential_points <- which(cooks_dist > (4 / nrow(data)))
data[influential_points, ]

## # A tibble: 415 × 16
##        id name      host_id host_name neighbourhood_group neighbourhood latitude
##     <dbl> <chr>       <dbl> <chr>     <chr>               <chr>            <dbl>
##  1 174966 Luxury 2…  836168 Henry     Manhattan           Upper West S…     40.8
##  2 273190 6 Bedroo…  605463 West Vil… Manhattan           West Village      40.7
##  3 279857 #1 Yello… 1420300 Gordy     Brooklyn            Bedford-Stuy…     40.7
##  4 363673 Beautifu…  256239 Tracey    Manhattan           Upper West S…     40.8
##  5 468613 $ (Phone… 2325861 Cynthia   Manhattan           Lower East S…     40.7
##  6 598612 Most bre… 2960326 Fabio     Brooklyn            Williamsburg      40.7
##  7 634353 Luxury 1…  836168 Henry     Manhattan           Upper West S…     40.8
##  8 639199 Beautifu… 1483081 Marina    Staten Island       Tottenville       40.5
##  9 664047 Lux 2Bed…  836168 Henry     Manhattan           Upper West S…     40.8
## 10 738588 Wedding … 1360198 Marina    Staten Island       Arrochar          40.6
## # ℹ 405 more rows
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>

Analysis of Every Diagnostic Plot

Residuals vs Fitted plot:

Program output: A random scatter of residuals around the horizontal zero line indicates that the model fits the data without systematic bias. A visible pattern (e.g., a curve) points to non-linearity, meaning the model does not adequately capture the relationship.
Interpretation: The shape of any pattern indicates the problem. For instance, a funnel shape suggests heteroscedasticity (the variance of the residuals increases with the fitted values), which violates one of the assumptions of linear regression.

Normal Q-Q Plot: Checks whether the residuals are normally distributed; departures from the diagonal line indicate non-normal residuals.

Program Output: The normal Q-Q plot shows how closely the residuals conform to a normal distribution. If the points closely follow the line, the residuals are approximately normal, which supports this model assumption. Deviations, particularly in the tails, indicate possible non-normality. The residual summary in the model output (median -22.3, maximum 9947.2) already points to a long right tail.
Interpretation: Where the points diverge substantially from the line, the residuals may not be normally distributed, which can affect the validity of hypothesis tests and confidence intervals for the regression coefficients.
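
The tail behaviour a Q-Q plot displays can also be read off numerically. The sketch below uses simulated data with right-skewed errors (an assumption chosen for illustration, not the Airbnb data) and compares the 1st and 99th percentiles of the standardized residuals with the corresponding normal quantiles. The names `fit`, `z`, and `tail_check` are hypothetical.

```r
set.seed(42)
n <- 5000
x <- runif(n)
y <- 2 + 3 * x + rlnorm(n)         # right-skewed errors, simulated

fit <- lm(y ~ x)
z   <- rstandard(fit)              # standardized residuals, as used in the Q-Q plot

probs <- c(0.01, 0.99)
tail_check <- rbind(
  sample = quantile(z, probs, names = FALSE),
  normal = qnorm(probs)
)
colnames(tail_check) <- c("p01", "p99")
tail_check
```

With right-skewed residuals, the upper sample quantile lies well above the normal one while the lower quantile is compressed toward zero, which is the pattern that bends the Q-Q points away from the line at the right end.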

Scale-location plot: The scale-location plot checks for homoscedasticity. Variance is probably constant if the points are spread evenly around a horizontal line.

Program output: A roughly horizontal line with evenly spread points indicates homoscedasticity (constant variance). An upward or downward trend suggests heteroscedasticity.
Interpretation: A trend in this plot means the error variance is unequal across the range of fitted values, which may warrant robust standard errors or a transformation of the dependent variable to stabilize the variance.
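
As a rough illustration of the transformation remedy, the sketch below simulates a response with multiplicative noise (an assumption, not the Airbnb data), fits the model on the raw and log scales, and summarizes the scale-location trend as a single correlation. The helper `spread_trend` is a hypothetical, crude stand-in for reading the plot. Applying the same idea to price would require checking for zero prices first, since log(0) is undefined.

```r
set.seed(7)
n <- 2000
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))   # multiplicative noise, simulated

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# Crude numeric stand-in for the scale-location plot:
# correlation between sqrt(|standardized residual|) and the fitted values
spread_trend <- function(fit) {
  cor(sqrt(abs(rstandard(fit))), fitted(fit))
}

spread_raw <- spread_trend(fit_raw)
spread_log <- spread_trend(fit_log)
c(raw = spread_raw, log = spread_log)
```

A clearly positive value on the raw scale and a value near zero on the log scale would indicate that the transformation has stabilized the variance in this simulated setting.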

Residuals vs. Leverage Plot: This plot helps identify influential points that could have an excessive impact on the model. Points that fall outside the Cook’s distance contour lines are the most influential.

Program Output: This plot helps identify influential data points that may disproportionately affect the model. Points with high leverage (far from the others in terms of predictor values) and large residuals can strongly influence the model’s predictions.
Interpretation: Any high-leverage points or outliers should be examined more closely. Cook’s distance quantifies the impact of these points, and removing them is worth considering if they significantly compromise model stability.
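
Leverage itself can be extracted with hatvalues(). The sketch below uses simulated data with one planted extreme predictor value and flags observations whose leverage exceeds 2p/n, a common rule of thumb that is an assumption here rather than something used in this report. The names `h`, `p`, and `high_leverage` are hypothetical.

```r
set.seed(3)
n <- 200
x <- c(rnorm(n - 1), 8)            # last point sits far from the rest: high leverage
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
h   <- hatvalues(fit)              # leverage of each observation
p   <- length(coef(fit))           # number of estimated coefficients

# Common rule of thumb (an assumption here, not from the report): flag h > 2p/n
high_leverage <- which(h > 2 * p / n)
high_leverage
sum(h)                             # hat values always sum to p
```

High leverage alone does not make a point harmful; it becomes influential only when it is also paired with a large residual, which is what the Residuals vs. Leverage plot shows jointly.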

Cook’s Distance Plot: Influential points are displayed directly in the Cook’s distance plot. Identify the points with high values and investigate how they influence the model.

Program output: The Cook’s distance plot shows the influence of each data point on the model. Points with a high Cook’s distance are observations that substantially influence the model’s parameter estimates and predictions. Common cutoffs are 0.5 or 1; the code above uses the stricter 4/n rule (about 0.00008 for the 48,895 rows here), which flags 415 listings.
Interpretation: Points that exceed the Cook’s distance threshold may disproportionately affect the model’s results, potentially distorting predictions or coefficients. Such points deserve further investigation. For instance, if they arise from data entry mistakes or are genuine outliers, they can be excluded, or their influence can be assessed by re-estimating the model with and without them.
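
The with-and-without comparison can be sketched as follows. The data are simulated with one planted outlier (an assumption for illustration), and the names `sim`, `fit_all`, `fit_trim`, and `flagged` are hypothetical. The same 4/n cutoff as in the code above is used.

```r
set.seed(11)
n <- 300
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n)
x[n] <- 6; y[n] <- -20             # one planted high-leverage outlier (simulated)
sim <- data.frame(x, y)

fit_all <- lm(y ~ x, data = sim)
cd      <- cooks.distance(fit_all)
flagged <- which(cd > 4 / nrow(sim))      # same 4/n cutoff as the code above

# Refit without the flagged rows and compare coefficients side by side
fit_trim <- lm(y ~ x, data = sim[-flagged, ])
rbind(all_rows = coef(fit_all), without_flagged = coef(fit_trim))
```

If the coefficients move substantially between the two fits, the flagged observations are driving the estimates. For the model in this report, the analogous comparison would refit extended_var on data[-influential_points, ].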

Insights and Questions for Further Analysis

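Summary: The extended model explains only about 9% of the variance in price (adjusted R-squared 0.094), so most price variation remains unexplained. Each diagnostic plot checks a separate assumption (linearity, normal residuals, constant variance, absence of unduly influential points), and any violation reduces the reliability of the coefficient estimates and their standard errors.
Further Investigation: Possible next steps include transforming variables, removing outliers, or respecifying the model if assumptions are violated or influential points affect the results.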