week11

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load your data
data <- read.csv("AB_NYC_2019.csv")  # Replace with your actual data file
str(data)

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" "" "2019-07-05" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

head(data)

##     id                                             name host_id   host_name
## 1 2539               Clean & quiet apt home by the park    2787        John
## 2 2595                            Skylit Midtown Castle    2845    Jennifer
## 3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632   Elisabeth
## 4 3831                  Cozy Entire Floor of Brownstone    4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park    7192       Laura
## 6 5099        Large Cozy 1 BR Apartment In Midtown East    7322       Chris
##   neighbourhood_group neighbourhood latitude longitude       room_type price
## 1            Brooklyn    Kensington 40.64749 -73.97237    Private room   149
## 2           Manhattan       Midtown 40.75362 -73.98377 Entire home/apt   225
## 3           Manhattan        Harlem 40.80902 -73.94190    Private room   150
## 4            Brooklyn  Clinton Hill 40.68514 -73.95976 Entire home/apt    89
## 5           Manhattan   East Harlem 40.79851 -73.94399 Entire home/apt    80
## 6           Manhattan   Murray Hill 40.74767 -73.97500 Entire home/apt   200
##   minimum_nights number_of_reviews last_review reviews_per_month
## 1              1                 9  2018-10-19              0.21
## 2              1                45  2019-05-21              0.38
## 3              3                 0                            NA
## 4              1               270  2019-07-05              4.64
## 5             10                 9  2018-11-19              0.10
## 6              3                74  2019-06-22              0.59
##   calculated_host_listings_count availability_365
## 1                              6              365
## 2                              2              355
## 3                              1              365
## 4                              1              194
## 5                              1                0
## 6                              1              129

# Load necessary libraries
library(tidyverse)

# Load the data (assuming it has already been read in the variable `data` from previous steps)
# Convert factors where needed
data <- data %>%
  mutate(
    room_type = as.factor(room_type),
    neighbourhood_group = as.factor(neighbourhood_group)
  )

# Build the linear regression model
linear_model <- lm(price ~ room_type + number_of_reviews + neighbourhood_group + availability_365, data = data)

# View the summary of the model to see coefficients and p-values
summary(linear_model)

## 
## Call:
## lm(formula = price ~ room_type + number_of_reviews + neighbourhood_group + 
##     availability_365, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -265.4  -63.6  -22.8   15.9 9958.6 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.396e+02  7.194e+00  19.408  < 2e-16 ***
## room_typePrivate room            -1.108e+02  2.131e+00 -51.987  < 2e-16 ***
## room_typeShared room             -1.439e+02  6.895e+00 -20.870  < 2e-16 ***
## number_of_reviews                -3.057e-01  2.362e-02 -12.945  < 2e-16 ***
## neighbourhood_groupBrooklyn       3.284e+01  7.138e+00   4.601 4.22e-06 ***
## neighbourhood_groupManhattan      8.745e+01  7.134e+00  12.257  < 2e-16 ***
## neighbourhood_groupQueens         1.323e+01  7.568e+00   1.748   0.0805 .  
## neighbourhood_groupStaten Island  7.885e+00  1.373e+01   0.574   0.5658    
## availability_365                  1.807e-01  8.063e-03  22.408  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 228.8 on 48886 degrees of freedom
## Multiple R-squared:  0.0925, Adjusted R-squared:  0.09235 
## F-statistic: 622.8 on 8 and 48886 DF,  p-value: < 2.2e-16

Model Interpretation

This linear model examines the relationship between the price of an Airbnb listing (price) and four predictors: room_type, number_of_reviews, neighbourhood_group, and availability_365. The results are examined below.

Key Findings from the Model Output

Intercept (139.6):

The intercept is the predicted average price for a listing in the baseline categories: an Entire home/apt in the baseline neighbourhood group (the Bronx, the one level without its own coefficient in the output), with zero reviews and zero days of availability.
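The baseline category is whichever factor level comes first, which for as.factor() on character data is the first in alphabetical order. A minimal sketch of how to check and change it, using a small made-up vector in place of data$neighbourhood_group (the values below are illustrative, not read from the CSV):

```r
# Toy stand-in for data$neighbourhood_group; the values are illustrative only.
borough <- factor(c("Manhattan", "Brooklyn", "Bronx", "Queens",
                    "Staten Island", "Brooklyn"))

# The first level is the reference category that lm() absorbs into the intercept.
baseline <- levels(borough)[1]
print(baseline)

# relevel() chooses a different reference level without changing the data.
borough_mn <- relevel(borough, ref = "Manhattan")
print(levels(borough_mn)[1])
```

Releveling changes only which category the other coefficients are compared against; the fitted values of the model stay the same.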

Room Type:

Private Room (-110.8): Listings categorized as Private room are priced approximately $110.80 lower than the baseline category of Entire home/apt, holding the other predictors constant. This difference is statistically significant (p-value < 2e-16).
Shared Room (-143.9): Listings classified as Shared room are priced approximately $143.90 lower than Entire home/apt. This difference is also statistically significant (p-value < 2e-16).

This shows that room_type is a strong predictor of price, with Entire home/apt typically fetching higher prices.

Number of Reviews (-0.3057):

For every additional review, the predicted price falls by about $0.31, holding the other predictors constant. This coefficient is statistically significant, although its impact is relatively small.

This result may indicate that properties with a higher number of reviews are located in less premium areas or cater to budget-conscious travelers.

Neighbourhood Group:

Brooklyn (32.84): Listings in Brooklyn are priced on average $32.84 higher than listings in the baseline neighbourhood group (the Bronx). This result is statistically significant (p-value = 4.22e-06).
Manhattan (87.45): Listings in Manhattan show the largest premium, on average $87.45 above the baseline, which is highly significant (p-value < 2e-16).
Queens (13.23): Listings in Queens are priced on average $13.23 above the baseline; however, this difference is only marginally significant (p-value = 0.0805).
Staten Island (7.885): Listings in Staten Island are priced on average $7.88 above the baseline, but this difference is not statistically significant (p-value = 0.5658).

Interpretation: Manhattan has the largest positive coefficient among the neighbourhood groups, consistent with it being the most expensive location.

Availability (0.1807):

For each additional day a property is available in a year, the predicted price increases by roughly $0.18. This effect is statistically significant (p-value < 2e-16).
The positive coefficient is an association, not a causal effect. It may reflect listings aimed at longer-term stays, but high availability can also mean that a listing is booked less often precisely because it is priced higher.

Model Diagnostics

Residual Standard Error (228.8): A typical prediction misses the actual price by roughly $229, which is large relative to the fitted coefficients and points to considerable unexplained variability, likely from factors not included in the model.
R-squared (0.0925): The predictors explain only 9.25% of the variance in price. Although they are statistically significant, additional variables may be needed to improve the model’s explanatory power.
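Both diagnostics come directly from the residuals. The sketch below recomputes them by hand on simulated data (the variables x and y and the model toy_model are made up for illustration and are not the Airbnb data), to show where the numbers in summary() come from:

```r
# Simulated data: a weak signal buried in a lot of noise (illustrative only).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- 100 + 5 * x + rnorm(n, sd = 50)

toy_model <- lm(y ~ x)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
rss <- sum(resid(toy_model)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss / tss

# Residual standard error = sqrt(RSS / residual degrees of freedom).
rse <- sqrt(rss / df.residual(toy_model))

print(c(r_squared = r_squared, rse = rse))
```

A highly significant slope and a low R-squared can coexist, as they do in the model above: significance says the effect is distinguishable from zero, while R-squared says how much of the spread in the response it accounts for.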

Coefficient Interpretation Example

Let’s analyze the coefficient for neighbourhood_groupManhattan:

Estimate (87.45): The coefficient for neighbourhood_groupManhattan suggests that, with other factors held constant, a listing in Manhattan is priced $87.45 higher than a comparable listing in the baseline neighbourhood group (the Bronx).
Statistical Significance: The p-value < 2e-16 indicates that this effect is statistically significant, implying that location is a strong predictor of listing price.
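As a worked example of "holding other factors constant", the sketch below plugs the reported coefficient estimates into the regression equation for one hypothetical listing (an Entire home/apt with 10 reviews and 100 available days; these listing values are assumptions chosen for illustration) placed first in the baseline borough and then in Manhattan:

```r
# Coefficient estimates copied from the summary(linear_model) output.
b <- c(intercept = 139.6300092, manhattan = 87.4490862,
       reviews = -0.3057198, availability = 0.1806665)

# Hypothetical listing: Entire home/apt, 10 reviews, available 100 days a year.
reviews <- 10
availability <- 100

# Baseline borough: no borough term is added to the intercept.
price_bronx <- b[["intercept"]] + b[["reviews"]] * reviews +
  b[["availability"]] * availability

# Same listing in Manhattan: only the borough coefficient changes.
price_manhattan <- price_bronx + b[["manhattan"]]

print(round(c(bronx = price_bronx, manhattan = price_manhattan), 2))
# bronx 154.64, manhattan 242.09
```

The gap between the two predictions is exactly the Manhattan coefficient, whatever values are chosen for the other predictors.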

Model Issues

Low R-squared: The model explains only 9.25% of the variance, suggesting it may be overlooking significant predictors.
Skewed Residuals and Possible Heteroscedasticity: The residuals range from -265.4 to 9958.6 with a median of -22.8, so a small number of very expensive listings are badly underpredicted. Such a long right tail suggests that the linear model does not handle extreme prices well and that the error variance may not be constant.

# Load the car package for VIF function
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

# Check VIF values
vif(linear_model)

##                         GVIF Df GVIF^(1/(2*Df))
## room_type           1.037338  2        1.009207
## number_of_reviews   1.034015  1        1.016865
## neighbourhood_group 1.052963  4        1.006472
## availability_365    1.051919  1        1.025631

Here is a summary of the terms and values presented in the table:

GVIF (Generalized Variance Inflation Factor): GVIF generalizes the VIF to model terms that span more than one coefficient, such as categorical variables. It measures how much the variance of the estimated regression coefficients is inflated by correlation among the predictors. For instance, the GVIF for room_type is 1.037338, meaning the variance associated with the room_type coefficients is inflated by a factor of about 1.04 because of correlation with the other variables in the model.
Df (Degrees of Freedom): The Df column gives the number of coefficients estimated for each variable. For continuous variables Df is 1; for a categorical variable it equals the number of levels minus one. Here room_type has 2 degrees of freedom because it has three levels (Entire home/apt, Private room, and Shared room), while neighbourhood_group has 4 degrees of freedom because it has five levels (Bronx, Brooklyn, Manhattan, Queens, and Staten Island).
GVIF^(1/(2*Df)): This is the GVIF raised to the power $1/(2 \times \mathrm{Df})$, that is, $\mathrm{GVIF}^{1/(2 \times \mathrm{Df})}$. It normalizes the GVIF so that variables with different degrees of freedom can be compared, which matters when categorical variables are present. For example, the normalized GVIF for room_type is 1.009207, which is near 1, indicating a very low degree of multicollinearity for this variable.

Explanation of the Values:

room_type: GVIF = 1.037338, a small value that suggests minimal inflation in the variance of the room_type coefficients due to collinearity with other variables. Normalized GVIF = 1.009207, close to 1, indicating minimal multicollinearity.

number_of_reviews: GVIF = 1.034015, which indicates a small degree of variance inflation. Normalized GVIF = 1.016865, also close to 1, suggesting minimal multicollinearity.

neighbourhood_group: GVIF = 1.052963, the largest raw value of the four, indicating slight variance inflation due to correlation with other predictors. Normalized GVIF = 1.006472, the smallest normalized value of the four because the inflation is spread over 4 degrees of freedom, indicating very low multicollinearity.

availability_365: GVIF = 1.051919, suggesting slight inflation in the variance of the availability_365 coefficient. Normalized GVIF = 1.025631, which is somewhat higher than the others yet still near 1, indicating that multicollinearity is not a significant concern for this variable.

Interpretation:

GVIF values close to 1: The closer the GVIF is to 1 (after normalization), the less likely multicollinearity affects the variable’s coefficient estimation. A value near 1 typically signifies that the variable exhibits low correlation with other predictors in the model.

GVIF > 1: A greater value suggests that the variance of the variable’s coefficient is being inflated by correlation with other predictors. In this model, however, all the values remain very close to 1, indicating that multicollinearity is not a significant concern.

Normalizing the GVIF: The GVIF^(1/(2*Df)) is frequently employed to compare variables with varying degrees of freedom, ensuring that variables with more levels (e.g., neighbourhood_group) are not unfairly penalized due to having additional categories.
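The normalized column can be reproduced from the other two. The sketch below copies the GVIF and Df values from the vif(linear_model) output into plain vectors (no model or package is needed) and applies the exponent 1/(2*Df):

```r
# GVIF and Df values copied from the vif(linear_model) output.
gvif <- c(room_type = 1.037338, number_of_reviews = 1.034015,
          neighbourhood_group = 1.052963, availability_365 = 1.051919)
df <- c(room_type = 2, number_of_reviews = 1,
        neighbourhood_group = 4, availability_365 = 1)

# Normalized GVIF: raise each GVIF to the power 1 / (2 * Df).
normalized <- gvif^(1 / (2 * df))
print(round(normalized, 6))

# Third column as printed by vif(), for comparison.
reported <- c(1.009207, 1.016865, 1.006472, 1.025631)
print(max(abs(normalized - reported)))
```

For a term with Df = 1 the normalized value is simply the square root of the ordinary VIF, which is why squaring this column makes it comparable to the usual VIF rules of thumb.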

# Plot residuals vs. fitted values
plot(linear_model, which = 1)

The “Residuals vs Fitted” plot helps evaluate how well the regression model fits and identify potential issues. Here is a more straightforward summary:

Residuals (Y-Axis): These represent the “errors” or variations between the actual values and the predictions made by our model. Ideally, the values should be distributed around zero, indicating that our predictions are in close proximity to the actual values.
Fitted Values (X-Axis): These values represent the predictions made by our model for each observation, derived from the input variables.
Pattern in Residuals:

In an ideal model, residuals (errors) should be uniformly distributed around zero, exhibiting no discernible pattern.
Here, we observe that as predictions increase, the distribution of residuals widens (creating a fan shape), indicating that the model’s accuracy is variable and may not be consistent across all values. This phenomenon is referred to as heteroscedasticity, indicating that the model may not be an optimal fit for every case.

Labeled Points: Points such as 9152, 12343, and 17693 are distant from the primary cluster. These represent outliers or instances where the model’s predictions were significantly inaccurate, warranting further investigation into these specific data points.
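A fan shape can also be checked numerically. The sketch below is a simple auxiliary regression in the spirit of the Breusch-Pagan test, run on simulated data whose noise grows with the predictor (x, y, and toy_model are made up for illustration; this is not the packaged test from any library, and the statistic is assumed to follow a chi-squared distribution with 1 degree of freedom under constant variance):

```r
# Simulated data whose error spread widens as x grows (illustrative only).
set.seed(11)
n <- 2000
x <- runif(n, 0, 10)
y <- 50 + 20 * x + rnorm(n, sd = 5 + 10 * x)

toy_model <- lm(y ~ x)

# Auxiliary regression: do squared residuals depend on the fitted values?
aux <- lm(resid(toy_model)^2 ~ fitted(toy_model))

# n * R-squared of the auxiliary regression, compared to chi-squared(1).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value indicates that the residual variance changes with the fitted values, which is the numerical counterpart of the widening band in the plot. The same steps could be applied to linear_model in place of toy_model.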

# Q-Q plot for residuals
plot(linear_model, which = 2)

Interpretation: Major deviations from the line could mean that the residuals are not normally distributed. Transforming the response variable (e.g., using log(price)) might help address this issue.
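To illustrate the effect of such a transformation, the sketch below fits the same predictor to a raw and a log-transformed response on simulated right-skewed prices (the data and the skewness helper are made up for illustration). It uses log1p(), i.e. log(1 + price), as one possible workaround in case any price is 0, since log(0) is -Inf; whether that applies to the actual data should be checked first:

```r
# Simulated right-skewed prices (illustrative only, not the Airbnb data).
set.seed(3)
n <- 2000
x <- runif(n)
price <- round(exp(4 + 1.2 * x + rnorm(n, sd = 0.7)))

# Hypothetical helper: sample skewness (0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

raw_fit <- lm(price ~ x)
log_fit <- lm(log1p(price) ~ x)  # log1p(price) = log(1 + price)

skew_raw <- skewness(resid(raw_fit))
skew_log <- skewness(resid(log_fit))
print(round(c(raw = skew_raw, log = skew_log), 2))
```

In a log-scale model the coefficients are read as approximate proportional changes in price rather than dollar amounts, so the interpretation of every term changes along with the fit.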

# Cook's distance plot
plot(linear_model, which = 4)

The Cook’s distance plot identifies influential data points in a regression model by showing how much each observation changes the fitted coefficients when it is left out. Observations with high Cook’s distances (such as 9152, 22354, and 40434) strongly influence the model and may be outliers or anomalies. Examining these observations can improve the model by revealing data errors or by informing the decision to retain, modify, or remove them.
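The labeled observations can also be pulled out programmatically. A minimal sketch on simulated data with one injected extreme point (x, y, and toy_model are made up for illustration; the 4/n cutoff is a common rule of thumb rather than a formal test):

```r
# Simulated data with one deliberately extreme observation (illustrative only).
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
x[n] <- 6     # high leverage
y[n] <- -40   # far from the trend

toy_model <- lm(y ~ x)
cd <- cooks.distance(toy_model)

# Rule-of-thumb cutoff: flag observations with Cook's distance above 4 / n.
flagged <- which(cd > 4 / n)

# Row indices of the three most influential observations.
top3 <- head(order(cd, decreasing = TRUE), 3)
print(top3)
```

With the real model, indexing the data frame by these row numbers (for example data[top3, ]) would show the listings behind the labels so they can be inspected before deciding how to treat them.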

# Model summary output shows coefficient estimates
summary(linear_model)$coefficients

##                                      Estimate   Std. Error     t value
## (Intercept)                       139.6300092  7.194393354  19.4081700
## room_typePrivate room            -110.8012904  2.131309177 -51.9874318
## room_typeShared room             -143.8931760  6.894584665 -20.8704633
## number_of_reviews                  -0.3057198  0.023617401 -12.9446861
## neighbourhood_groupBrooklyn        32.8414524  7.138473592   4.6006267
## neighbourhood_groupManhattan       87.4490862  7.134424040  12.2573435
## neighbourhood_groupQueens          13.2280059  7.567719428   1.7479514
## neighbourhood_groupStaten Island    7.8846126 13.729662674   0.5742758
## availability_365                    0.1806665  0.008062759  22.4075279
##                                       Pr(>|t|)
## (Intercept)                       1.360197e-83
## room_typePrivate room             0.000000e+00
## room_typeShared room              2.618488e-96
## number_of_reviews                 2.910703e-38
## neighbourhood_groupBrooklyn       4.222773e-06
## neighbourhood_groupManhattan      1.724475e-34
## neighbourhood_groupQueens         8.047872e-02
## neighbourhood_groupStaten Island  5.657838e-01
## availability_365                 1.202151e-110

Insights and Further Investigation

Model Insights:

Model Performance: The diagnostic checks give a mixed picture. The VIF values rule out multicollinearity, but the residual plot shows heteroscedasticity, the residuals are strongly right-skewed, Cook’s distance flags several influential listings, and the R-squared is low. The model therefore needs revision, for example by transforming price or investigating the influential points, before its predictions can be relied on.
Coefficient Interpretation: The room_type coefficients show that room type is associated with large price differences, consistent with expectations (an entire home or apartment is likely to cost more than a private or shared room).

Significance:

The analysis of room_type offers valuable insight into how different types of accommodation (an entire home vs. a private room) affect pricing. This may benefit businesses or individuals aiming to set competitive prices.

Further Questions:

Interactions: Would the interactions between room_type and neighbourhood_group enhance the model? For instance, does the price premium for specific room types fluctuate based on the neighborhood?
Model Improvement: Could employing a generalized linear model (GLM) or a regularization technique such as Lasso or Ridge regression enhance predictive accuracy, particularly when dealing with numerous predictors?
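The interaction question can be tested by comparing nested models. The sketch below does this on simulated data in which entire homes carry an extra premium in one borough (the data, effect sizes, and two-level factors are assumptions made for illustration, not estimates from the Airbnb data):

```r
# Simulated listings with a built-in room-by-borough interaction (illustrative).
set.seed(7)
n <- 1000
room <- factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))
borough <- factor(sample(c("Bronx", "Manhattan"), n, replace = TRUE))

# Extra premium only for entire homes in Manhattan.
premium <- ifelse(room == "Entire home/apt" & borough == "Manhattan", 60, 0)
price <- 80 + 50 * (room == "Entire home/apt") +
  30 * (borough == "Manhattan") + premium + rnorm(n, sd = 40)

additive <- lm(price ~ room + borough)   # main effects only
interact <- lm(price ~ room * borough)   # main effects plus interaction

# F-test: does adding the interaction term significantly reduce the RSS?
cmp <- anova(additive, interact)
p_interaction <- cmp[["Pr(>F)"]][2]
print(p_interaction)
```

On the real data the analogous comparison would be between the existing formula and one with room_type * neighbourhood_group; a small p-value would suggest that the room-type premium differs across neighbourhood groups.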

Mounya

2024-11-12