library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("AB_NYC_2019.csv") # Replace with your actual data file
str(data)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
head(data)
## id name host_id host_name
## 1 2539 Clean & quiet apt home by the park 2787 John
## 2 2595 Skylit Midtown Castle 2845 Jennifer
## 3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth
## 4 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
## neighbourhood_group neighbourhood latitude longitude room_type price
## 1 Brooklyn Kensington 40.64749 -73.97237 Private room 149
## 2 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225
## 3 Manhattan Harlem 40.80902 -73.94190 Private room 150
## 4 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89
## 5 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80
## 6 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200
## minimum_nights number_of_reviews last_review reviews_per_month
## 1 1 9 2018-10-19 0.21
## 2 1 45 2019-05-21 0.38
## 3 3 0 NA
## 4 1 270 2019-07-05 4.64
## 5 10 9 2018-11-19 0.10
## 6 3 74 2019-06-22 0.59
## calculated_host_listings_count availability_365
## 1 6 365
## 2 2 355
## 3 1 365
## 4 1 194
## 5 1 0
## 6 1 129
# Load necessary libraries
library(tidyverse)
# Load the data (assuming it has already been read in the variable `data` from previous steps)
# Convert factors where needed
data <- data %>%
  mutate(
    room_type = as.factor(room_type),
    neighbourhood_group = as.factor(neighbourhood_group)
  )
# Build the linear regression model
linear_model <- lm(price ~ room_type + number_of_reviews + neighbourhood_group + availability_365, data = data)
# View the summary of the model to see coefficients and p-values
summary(linear_model)
##
## Call:
## lm(formula = price ~ room_type + number_of_reviews + neighbourhood_group +
## availability_365, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -265.4 -63.6 -22.8 15.9 9958.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.396e+02 7.194e+00 19.408 < 2e-16 ***
## room_typePrivate room -1.108e+02 2.131e+00 -51.987 < 2e-16 ***
## room_typeShared room -1.439e+02 6.895e+00 -20.870 < 2e-16 ***
## number_of_reviews -3.057e-01 2.362e-02 -12.945 < 2e-16 ***
## neighbourhood_groupBrooklyn 3.284e+01 7.138e+00 4.601 4.22e-06 ***
## neighbourhood_groupManhattan 8.745e+01 7.134e+00 12.257 < 2e-16 ***
## neighbourhood_groupQueens 1.323e+01 7.568e+00 1.748 0.0805 .
## neighbourhood_groupStaten Island 7.885e+00 1.373e+01 0.574 0.5658
## availability_365 1.807e-01 8.063e-03 22.408 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 228.8 on 48886 degrees of freedom
## Multiple R-squared: 0.0925, Adjusted R-squared: 0.09235
## F-statistic: 622.8 on 8 and 48886 DF, p-value: < 2.2e-16
Model Interpretation
This linear model examines the relationship between the price of an Airbnb listing (price) and several predictors: room_type, number_of_reviews, neighbourhood_group, and availability_365. Let us examine the results.
Key Findings from the Model Output
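Intercept: The intercept of about $139.63 is the predicted price for an Entire home/apt within the baseline neighbourhood group (the Bronx, the only group without its own coefficient in the output), with no reviews and no availability.
room_type: Private room listings show a price reduction of approximately $110.80 when compared to the baseline category of Entire home/apt. This difference is statistically significant (p-value < 2e-16). Shared room listings are linked to a more substantial decrease of approximately $143.89 in comparison to Entire home/apt. This result is also highly significant (p-value < 2e-16). It demonstrates that room_type is a robust predictor of price, with Entire home/apt typically fetching higher prices.
number_of_reviews: Each additional review is associated with a price about $0.31 lower (p-value < 2e-16). This result may indicate that properties with a higher number of reviews are located in less premium areas or cater to budget-conscious travelers.
neighbourhood_group: Relative to the baseline group, Brooklyn (+$32.84) and Manhattan (+$87.45) are significantly more expensive, while the estimates for Queens (+$13.23, p = 0.08) and Staten Island (+$7.88, p = 0.57) are not significant at the 5% level.
Interpretation: Manhattan has the most substantial positive effect on price among neighbourhood groups, marking it as the most premium location.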
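The baseline categories mentioned in the findings follow from how R orders factor levels: as.factor() sorts the distinct values alphabetically and lm() treats the first level as the reference. The sketch below uses a small hand-typed vector of the five borough names (not the real data) to show which level becomes the baseline and how relevel() could change it if a different comparison group were preferred.

```r
# Toy vector containing the five neighbourhood groups (illustrative, not the real data)
groups <- c("Brooklyn", "Manhattan", "Bronx", "Queens", "Staten Island", "Manhattan")

# as.factor() sorts levels alphabetically; the first level is the lm() baseline
neighbourhood_group <- as.factor(groups)
baseline_default <- levels(neighbourhood_group)[1]
print(levels(neighbourhood_group))

# relevel() moves a chosen level to the front so it becomes the new baseline
releveled <- relevel(neighbourhood_group, ref = "Manhattan")
baseline_new <- levels(releveled)[1]
print(levels(releveled))
```

Changing the baseline does not change the fitted values of a model, only which group each coefficient is compared against.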
Model Diagnostics
Coefficient Interpretation Example
Let’s analyze the coefficient for neighbourhood_groupManhattan: the estimate for neighbourhood_groupManhattan suggests that, with other factors held constant, a listing in Manhattan corresponds to an $87.45 increase in price compared to a listing in the baseline neighbourhood group.
Model Issues
# Load the car package for VIF function
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
# Check VIF values
vif(linear_model)
## GVIF Df GVIF^(1/(2*Df))
## room_type 1.037338 2 1.009207
## number_of_reviews 1.034015 1 1.016865
## neighbourhood_group 1.052963 4 1.006472
## availability_365 1.051919 1 1.025631
Here is a summary of the terms and values presented in the table:
GVIF (Generalized Variance Inflation Factor): GVIF serves as a more generalized version of the VIF, primarily utilized when the model includes categorical variables or interactions. It assesses the extent to which the variance of the estimated regression coefficients is increased due to the correlation among the predictors. For instance, the GVIF for room_type is 1.037338, indicating that the variance of the estimated coefficients for room_type is increased by a factor of about 1.037338 because of its correlation with other variables in the model.
Df (Degrees of Freedom): The Df column presents the number of degrees of freedom related to each variable. For continuous variables, Df is 1; for categorical variables it equals the number of levels minus one, because one level serves as the baseline. For instance, room_type has 2 degrees of freedom because it has three categories (“Entire home/apt”, “Private room”, and “Shared room”), while neighbourhood_group has 4 degrees of freedom because it has five groups, one of which is the baseline.
GVIF^(1/(2Df)): This represents the GVIF raised to the power $1/(2 \times \mathrm{Df})$, that is, $\mathrm{GVIF}^{1/(2 \times \mathrm{Df})}$. It serves as a normalization of the GVIF, allowing for comparison across variables with varying degrees of freedom, particularly in the presence of categorical variables. For example, the normalized GVIF for room_type is 1.009207, which is near 1, indicating a very low degree of multicollinearity for this variable.
Explanation of the Values:
room_type: GVIF = 1.037338, a small value that suggests minimal inflation in the variance of the room_type coefficients due to collinearity with other variables. Normalized GVIF = 1.009207, close to 1, indicating minimal multicollinearity.
number_of_reviews: GVIF = 1.034015, which indicates a small degree of variance inflation. Normalized GVIF = 1.016865, also close to 1, suggesting minimal multicollinearity.
neighbourhood_group: GVIF = 1.052963, the highest raw GVIF in the table but still only slight variance inflation. Normalized GVIF = 1.006472, the lowest normalized value of the four, indicating very low multicollinearity once its 4 degrees of freedom are taken into account.
availability_365: GVIF = 1.051919, suggesting slight inflation in the variance of the availability_365 coefficient. Normalized GVIF = 1.025631, which is somewhat higher than the others yet still near 1, indicating that multicollinearity is not a significant concern for this variable.
Interpretation:
GVIF values close to 1: The closer the GVIF is to 1 (after normalization), the less likely multicollinearity affects the variable’s coefficient estimation. A value near 1 typically signifies that the variable exhibits low correlation with other predictors in the model.
GVIF > 1: A greater value suggests that the variable’s coefficient is being distorted due to correlation with other predictors. However, in your situation, all the values remain comparatively low, indicating that multicollinearity is not a significant concern in your model.
Normalizing the GVIF: The GVIF^(1/(2Df)) is frequently employed to compare variables with varying degrees of freedom, ensuring that variables with more levels (e.g., neighbourhood_group) are not unfairly penalized due to having additional categories.
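As a quick check of the normalization described above, the third column of the vif() table can be reproduced from the first two. This is a minimal sketch that retypes the GVIF and Df values from the printed output rather than calling car::vif() again; the object names are illustrative.

```r
# GVIF and Df values copied from the vif(linear_model) output above
gvif <- c(room_type = 1.037338,
          number_of_reviews = 1.034015,
          neighbourhood_group = 1.052963,
          availability_365 = 1.051919)
df <- c(room_type = 2,
        number_of_reviews = 1,
        neighbourhood_group = 4,
        availability_365 = 1)

# Normalized value: GVIF^(1 / (2 * Df))
gvif_adjusted <- gvif^(1 / (2 * df))
print(round(gvif_adjusted, 6))
```

For the two numeric predictors (Df = 1) the normalized value is simply the square root of the GVIF, which is why their third-column values sit closer to the raw GVIF than those of the categorical predictors.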
# Plot residuals vs. fitted values
plot(linear_model, which = 1)
The “Residuals vs Fitted” plot helps evaluate how well the regression model fits and identify potential issues. Here is a more straightforward summary:
Residuals (Y-Axis): These represent the “errors” or variations between the actual values and the predictions made by our model. Ideally, the values should be distributed around zero, indicating that our predictions are in close proximity to the actual values.
Fitted Values (X-Axis): These values represent the predictions made by our model for each observation, derived from the input variables.
Pattern in Residuals:
# Q-Q plot for residuals
plot(linear_model, which = 2)
Interpretation: Major deviations from the line could mean that the residuals are not normally distributed. Transforming the response variable (e.g., using log(price)) might help address this issue.
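The effect of a log transformation can be sketched on simulated data. The example below does not use the Airbnb file: it assumes a hypothetical multiplicative price model with made-up room-type effects, fits the same predictor against price and against log(price), and compares the skewness of the residuals. The names toy, fit_raw, fit_log, and skewness are illustrative.

```r
set.seed(42)
n <- 500

# Simulated listings: prices are log-normal, so raw prices are right-skewed
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
log_base <- c("Entire home/apt" = 5.0, "Private room" = 4.3, "Shared room" = 3.9)
price <- unname(exp(log_base[as.character(room_type)] + rnorm(n, sd = 0.6)))
toy <- data.frame(price = price, room_type = room_type)

# Same predictor, two response scales
fit_raw <- lm(price ~ room_type, data = toy)
fit_log <- lm(log(price) ~ room_type, data = toy)

# Simple moment-based skewness; values near 0 indicate roughly symmetric residuals
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
print(c(raw = skew_raw, log = skew_log))
```

On the real data, log(price) is undefined for any listing with a price of 0, so such rows would need to be filtered out or handled with something like log1p(price) before refitting.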
# Cook's distance plot
plot(linear_model, which = 4)
The Cook’s distance plot identifies influential data points in a regression model, illustrating the impact of each observation on the model’s predictions. Observations with high Cook’s distances (such as 9152, 22354, and 40434) significantly influence the model and may be considered outliers or anomalies. Examining these observations can improve the model by revealing data errors or by informing whether to retain or modify them.
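Influential observations can also be listed numerically rather than read off the plot. The sketch below uses simulated data with one deliberately extreme price and flags points whose Cook’s distance exceeds 4/n. That cutoff is a common rule of thumb, not something the original analysis specifies, and the variable names are illustrative.

```r
set.seed(1)
n <- 100

# Simulated availability and price with one injected extreme listing
availability <- runif(n, 0, 365)
price <- 100 + 0.2 * availability + rnorm(n, sd = 20)
availability[50] <- 360
price[50] <- 10000  # artificial outlier

fit <- lm(price ~ availability)

# Cook's distance for every observation
cooks_d <- cooks.distance(fit)

# Flag observations above the 4/n rule of thumb (an assumed cutoff)
flagged <- unname(which(cooks_d > 4 / n))
most_influential <- unname(which.max(cooks_d))
print(flagged)
print(most_influential)
```

Applied to the fitted model in this report, cooks.distance(linear_model) would return one value per listing, and the same filter could be used to pull the flagged rows for inspection.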
# Model summary output shows coefficient estimates
summary(linear_model)$coefficients
## Estimate Std. Error t value
## (Intercept) 139.6300092 7.194393354 19.4081700
## room_typePrivate room -110.8012904 2.131309177 -51.9874318
## room_typeShared room -143.8931760 6.894584665 -20.8704633
## number_of_reviews -0.3057198 0.023617401 -12.9446861
## neighbourhood_groupBrooklyn 32.8414524 7.138473592 4.6006267
## neighbourhood_groupManhattan 87.4490862 7.134424040 12.2573435
## neighbourhood_groupQueens 13.2280059 7.567719428 1.7479514
## neighbourhood_groupStaten Island 7.8846126 13.729662674 0.5742758
## availability_365 0.1806665 0.008062759 22.4075279
## Pr(>|t|)
## (Intercept) 1.360197e-83
## room_typePrivate room 0.000000e+00
## room_typeShared room 2.618488e-96
## number_of_reviews 2.910703e-38
## neighbourhood_groupBrooklyn 4.222773e-06
## neighbourhood_groupManhattan 1.724475e-34
## neighbourhood_groupQueens 8.047872e-02
## neighbourhood_groupStaten Island 5.657838e-01
## availability_365 1.202151e-110
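The coefficient table can be turned into a worked prediction. As a sketch, the code below retypes the relevant estimates from the output above and computes the predicted price for a hypothetical listing: a private room in Manhattan with 10 reviews and 100 available days. The listing itself is an invented example.

```r
# Estimates copied from summary(linear_model)$coefficients
coefs <- c("(Intercept)" = 139.6300092,
           "room_typePrivate room" = -110.8012904,
           "neighbourhood_groupManhattan" = 87.4490862,
           "number_of_reviews" = -0.3057198,
           "availability_365" = 0.1806665)

# Hypothetical listing: 1 for the intercept and the two dummy variables,
# then 10 reviews and 100 available days
listing <- c(1, 1, 1, 10, 100)

# Linear predictor: sum of coefficient * value
predicted_price <- sum(coefs * listing)
print(round(predicted_price, 2))
```

With the fitted model in memory, predict(linear_model, newdata = ...) on a one-row data frame holding the same values gives the equivalent result without retyping coefficients.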
Insights and Further Investigation
Model Insights:
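Model Performance: The VIF values indicate that multicollinearity is not a concern. However, the model explains only about 9% of the variance in price (R-squared = 0.0925), and the residuals range from -265.4 to 9958.6 around a median of -22.8, which points to strong right skew and a few extreme prices. The residual, Q-Q, and Cook’s distance plots should therefore be checked for heteroscedasticity, non-normality, and influential points, and the model may need revising, for example by transforming price or by removing or combining variables.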
Coefficient Interpretation: The coefficient for room_type indicates that various room types are linked to a significant price variation, consistent with expectations (larger homes or entire apartments are likely to be more expensive than a single room).
Significance:
Further Questions:
Interactions: Would the interactions between room_type and neighbourhood_group enhance the model? For instance, does the price premium for specific room types fluctuate based on the neighborhood?
Model Improvement: Could employing a generalized linear model (GLM) or a regularization technique such as Lasso or Ridge regression enhance predictive accuracy, particularly when dealing with numerous predictors?
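The interaction question can be tested by comparing nested models. The following sketch uses simulated data in which entire homes carry an extra premium in Manhattan; all effect sizes are invented for illustration. It fits an additive model and an interaction model, then compares them with an F-test via anova().

```r
set.seed(7)
n <- 600

# Simulated listings with two room types and two neighbourhood groups
toy <- data.frame(
  room_type = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE)),
  neighbourhood_group = factor(sample(c("Brooklyn", "Manhattan"), n, replace = TRUE))
)

# Invented effects: an extra premium only for entire homes in Manhattan
is_entire <- toy$room_type == "Entire home/apt"
is_manhattan <- toy$neighbourhood_group == "Manhattan"
toy$price <- 80 + 50 * is_entire + 30 * is_manhattan +
  60 * (is_entire & is_manhattan) + rnorm(n, sd = 25)

# Additive model versus model with the room_type x neighbourhood_group interaction
fit_additive <- lm(price ~ room_type + neighbourhood_group, data = toy)
fit_interaction <- lm(price ~ room_type * neighbourhood_group, data = toy)

# F-test: does the interaction term significantly reduce residual error?
comparison <- anova(fit_additive, fit_interaction)
p_value <- comparison[["Pr(>F)"]][2]
print(comparison)
```

A small p-value here suggests the room-type premium differs by neighbourhood group; on the real data the same comparison could be run by adding room_type * neighbourhood_group to the existing formula.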