Consider the data set ‘airbnbdata.csv’. This data is a simplified version of a Kaggle data set (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data).
We are trying to predict the per-night price of Airbnb listings in NYC.
Variables are as follows:
- id: listing id
- neighbourhood_group: neighborhood in NYC
- room_type: listing space type
- price: price in dollars
- minimum_nights: minimum number of nights
- number_of_reviews: number of reviews

Please separate the testing and training sets:
- set a seed
- separate 5% of your data into testing set
airbnb <- read.csv("C:/Users/cmart/Downloads/airbnbdata.csv", header = TRUE)
#Hint: airbnb=read.table("C:/Downloads/airbnbdata.csv", header = TRUE, sep=",", dec=".")
head(airbnb)
## id neighbourhood_group room_type price minimum_nights
## 1 2539 Brooklyn Private room 149 1
## 2 2595 Manhattan Entire home/apt 225 1
## 3 3647 Manhattan Private room 150 3
## 4 3831 Brooklyn Entire home/apt 89 1
## 5 5022 Manhattan Entire home/apt 80 10
## 6 5099 Manhattan Entire home/apt 200 3
## number_of_reviews
## 1 9
## 2 45
## 3 0
## 4 270
## 5 9
## 6 74
#Specify neighborhood_group and room_type variables as categorical variables
airbnb$neighbourhood_group <- as.factor(airbnb$neighbourhood_group)
airbnb$room_type <- as.factor(airbnb$room_type)
#Total number of rows in airbnbdata - test and train set rows should sum to 1800
nrow(airbnb)
## [1] 1800
## 95% of the sample size
smp_size <- floor(0.95 * nrow(airbnb))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(airbnb)), size = smp_size) #seq_len ensures vector starts at 1
train <- airbnb[train_ind, ]
test <- airbnb[-train_ind, ]
nrow(train)
## [1] 1710
nrow(test)
## [1] 90
head(train)
## id neighbourhood_group room_type price minimum_nights
## 415 140133 Brooklyn Entire home/apt 249 3
## 463 163627 Brooklyn Private room 89 3
## 179 45393 Manhattan Entire home/apt 150 26
## 526 190968 Brooklyn Entire home/apt 215 3
## 195 51485 Manhattan Private room 83 1
## 938 353317 Manhattan Entire home/apt 129 3
## number_of_reviews
## 415 150
## 463 205
## 179 38
## 526 33
## 195 285
## 938 132
head(test)
## id neighbourhood_group room_type price minimum_nights
## 3 3647 Manhattan Private room 150 3
## 21 7801 Brooklyn Entire home/apt 299 3
## 43 12303 Brooklyn Private room 120 7
## 66 16421 Manhattan Private room 52 30
## 107 25235 Brooklyn Entire home/apt 125 90
## 109 25696 Manhattan Private room 100 2
## number_of_reviews
## 3 0
## 21 9
## 43 25
## 66 191
## 107 162
## 109 170
#Note that number of rows in train(1710) and test(90) reconcile to total airbnb rows (1800)
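The split above can be sanity-checked beyond row counts. The following is a minimal sketch on a made-up data frame (`toy`, `toy_train`, and `toy_test` are hypothetical names, not objects from the assignment) showing the same seed-and-sample pattern and a check that no listing lands in both sets.

```r
# Toy stand-in for the listings data (hypothetical values, not the real file)
toy <- data.frame(id = 1:200, price = seq(50, 249))

set.seed(123)                                   # same seed => same partition
n_train   <- floor(0.95 * nrow(toy))            # 95% of rows go to training
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]                  # negative index keeps the rest

# Sanity checks: sizes add up and no id appears in both sets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)
```

The same two `stopifnot()` checks could be run on `train` and `test` directly.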
For questions 2 to 6, use the training set.
price (dependent variable) and
number_of_reviews and between price and
minimum_nights by obtaining the scatter plots and the
respective correlations.
# Scatter plot of price vs number_of_reviews
plot(train$number_of_reviews, train$price,
main = "Scatter Plot of Price vs Number of Reviews",
xlab = "Number of Reviews",
ylab = "Price",
col = "blue", pch = 19)
NOTE -> There is no noticeable relationship between number of reviews
and price in the above scatter plot. This agrees with the correlation
matrix generated below: number of reviews and price have a correlation
coefficient of -0.091, a weak negative relationship. The number of
reviews of a listing is therefore a poor linear predictor of its price.
This relationship will be important to be mindful of as simple linear
and multiple linear regression models are formulated.
# Scatter plot of price vs minimum_nights
plot(train$minimum_nights, train$price,
main = "Scatter Plot of Price vs Minimum Nights",
xlab = "Minimum Nights",
ylab = "Price",
col = "green", pch = 19)
NOTE -> There is no clear relationship between minimum nights and
price in the airbnb training dataset. This is confirmed by the
correlation matrix and plot below. Price and minimum nights have a
correlation coefficient of -0.0081, which indicates essentially no
linear relationship between the two variables. The minimum number of
nights required cannot be used to accurately predict the price of a
listing. This preliminary finding will be important as simple and
multiple regression models are constructed and analyzed. Note that
price and number of reviews, while weakly associated (-0.091), have a
stronger correlation than price and minimum nights.
# Selecting numeric columns and creating the correlation matrix
numeric_airbnb_train <- train[, sapply(train, is.numeric)]
cor_matrix <- cor(numeric_airbnb_train, use = "complete.obs")
print(cor_matrix)
## id price minimum_nights number_of_reviews
## id 1.00000000 0.015717237 0.028183216 -0.15206104
## price 0.01571724 1.000000000 -0.008132282 -0.09072152
## minimum_nights 0.02818322 -0.008132282 1.000000000 -0.13004988
## number_of_reviews -0.15206104 -0.090721516 -0.130049884 1.00000000
#create correlation matrix plot visualization
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(RColorBrewer)
corrplot(cor_matrix)
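The values in the matrix are correlation coefficients (r), not coefficients of determination (R^2). In a simple regression the two are linked by R^2 = r^2. The sketch below illustrates this on simulated data (the vectors `x` and `y` are hypothetical) and then checks it against the price/minimum_nights correlation printed above.

```r
set.seed(1)
x <- rnorm(300)
y <- 0.1 * x + rnorm(300)            # hypothetical weakly related pair

r  <- cor(x, y)                      # Pearson correlation, between -1 and 1
r2 <- summary(lm(y ~ x))$r.squared   # coefficient of determination

# In simple linear regression R^2 equals the squared correlation
same <- isTRUE(all.equal(r^2, r2))
same

# cor.test() adds a significance test of H0: true correlation is 0
ct <- cor.test(x, y)
ct$p.value

# Squaring the reported correlation of price and minimum_nights
r_reported <- -0.008132282
r_reported^2                         # about 6.6e-05
```

Squaring -0.008132282 gives roughly 6.6e-05, which is the scale of R^2 one would expect from a simple regression of price on minimum_nights.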
price (dependent variable) and
neighbourhood_group by obtaining the box plot.
# Box plot of price by neighbourhood_group (note that ~ denotes relationship among 2 variables)
boxplot(train$price ~ train$neighbourhood_group,
main = "Box Plot of Price by Neighbourhood Group",
xlab = "Neighbourhood Group",
ylab = "Price",
col = "lightblue",
outline = FALSE)
# Box plot of price by room_type (note that ~ denotes relationship among 2 variables)
boxplot(train$price ~ train$room_type,
main = "Box Plot of Price by Room Type",
xlab = "Room Type",
ylab = "Price",
col = "grey",
outline = FALSE)
NOTE -> Neighborhood group is a categorical variable, so a correlation
coefficient between neighborhood group and price cannot be read from
the correlation matrix and plot above unless the training data set is
modified to include dummy variables that encode the neighborhood
group. A box plot can instead be used to examine how a categorical
variable relates to price, a numerical dependent variable. There
appears to be a moderate association between price and neighborhood
group, given the different median price and spread (range and
interquartile range) across the box plots for the five boroughs in
NYC. The relationship between both variables will need to be examined
carefully when constructing and analyzing any regression model.
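The dummy variables mentioned in the note do not have to be built by hand: `lm()` creates them from a factor automatically. A minimal sketch on made-up values (`borough` and `price` here are hypothetical vectors, not the assignment data) shows the indicator columns and the per-group medians a box plot is built around.

```r
borough <- factor(c("Bronx", "Brooklyn", "Manhattan",
                    "Brooklyn", "Bronx", "Manhattan"))
price   <- c(60, 120, 200, 100, 80, 180)   # hypothetical prices

# model.matrix() shows the 0/1 indicator columns lm() would build;
# the first level ("Bronx") is the reference and gets no column of its own
dummies <- model.matrix(~ borough)
colnames(dummies)

# Per-group medians are what a box plot draws as its centre lines
group_medians <- tapply(price, borough, median)
group_medians
```

Factor levels are sorted alphabetically by default, which is why the Bronx ends up as the reference category.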
price (as the dependent variable) and the following
variables: minimum_nights,
neighbourhood_group, and room_type (recall to
specify categorical variables as a factor) separately. Report your
regression models. Comment on the significance of the
slopes.
# Simple linear regression - continuous independent variable (minimum_nights)
reg_minimum_nights <- lm(price ~ minimum_nights, data=train)
summary(reg_minimum_nights)
##
## Call:
## lm(formula = price ~ minimum_nights, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137.38 -72.38 -31.21 37.62 2842.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 157.50968 3.55944 44.251 <2e-16 ***
## minimum_nights -0.04345 0.12928 -0.336 0.737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 138.3 on 1708 degrees of freedom
## Multiple R-squared: 6.613e-05, Adjusted R-squared: -0.0005193
## F-statistic: 0.113 on 1 and 1708 DF, p-value: 0.7368
confint(reg_minimum_nights)
## 2.5 % 97.5 %
## (Intercept) 150.5283701 164.4909958
## minimum_nights -0.2970118 0.2101103
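The interval for the minimum_nights slope can be reproduced by hand from the summary table, which also makes clear why it contains 0 when the p-value is 0.737. This is a small sketch using only the rounded numbers printed above, so the result matches `confint()` to about three decimals.

```r
est <- -0.04345   # slope estimate for minimum_nights (from the summary above)
se  <- 0.12928    # its standard error
df  <- 1708       # residual degrees of freedom

t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci     <- est + c(-1, 1) * t_crit * se  # lower and upper bound
ci

# The interval spans 0, consistent with a slope that is not significant
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```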
# Plot price against minimum_nights (predictor on x, response on y) so that
# the line from lm(price ~ minimum_nights) is drawn on matching axes
plot(train$minimum_nights, train$price,
main = "Price vs. Minimum Nights",
xlab = "Minimum Nights",
ylab = "Price",
col = "red", pch = 19)
abline(reg_minimum_nights, col = "green") # Add the regression line to the plot to analyze fit
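# Simple linear regression - categorical independent variables (neighbourhood_group, room_type)
# NOTE: both models are fit on the full airbnb data rather than train, so the
# summaries below describe all 1800 rows; use data = train to follow the instructions
reg_neighborhood_group <- lm(price ~ neighbourhood_group, data = airbnb) # neighbourhood_group
reg_room_type <- lm(price ~ room_type, data = airbnb) # room_type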
summary(reg_neighborhood_group)
##
## Call:
## lm(formula = price ~ neighbourhood_group, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.81 -70.81 -27.37 28.68 2819.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.57 29.48 2.428 0.015295 *
## neighbourhood_groupBrooklyn 74.75 29.89 2.501 0.012482 *
## neighbourhood_groupManhattan 109.24 29.88 3.656 0.000264 ***
## neighbourhood_groupQueens 25.81 32.49 0.794 0.427143
## neighbourhood_groupStaten Island 41.03 43.23 0.949 0.342581
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 141.4 on 1795 degrees of freedom
## Multiple R-squared: 0.03109, Adjusted R-squared: 0.02893
## F-statistic: 14.4 on 4 and 1795 DF, p-value: 1.416e-11
summary(reg_room_type)
##
## Call:
## lm(formula = price ~ room_type, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.32 -55.32 -21.93 18.07 2899.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.319 4.121 47.395 < 2e-16 ***
## room_typePrivate room -94.394 6.601 -14.300 < 2e-16 ***
## room_typeShared room -88.319 33.226 -2.658 0.00793 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135.9 on 1797 degrees of freedom
## Multiple R-squared: 0.1033, Adjusted R-squared: 0.1023
## F-statistic: 103.5 on 2 and 1797 DF, p-value: < 2.2e-16
price (as the dependent variable) and the following
variables: minimum_nights,
neighbourhood_group, and room_type. Report
your regression model. Comment on the significance of the
partial effects.
The global (model-wide) F-statistic of 34.75 on 7 and 1702 degrees of freedom, with a p-value below 2.2e-16, indicates that minimum nights, neighborhood group, and room type are jointly statistically significant. H0 states that all slope coefficients are zero, in other words that a change in neighborhood, room type, or minimum nights has no effect on the listed price. H0 is rejected in favor of Ha: at least one coefficient is nonzero. The adjusted R^2 of 0.1214 shows that the model nonetheless explains only about 12% of the variation in price, so the fit is weak even though it is statistically significant. The individual p-values below suggest that room type and, to a lesser extent, neighborhood group account for most of the explained variation.
To interpret the partial effects in the output below, note that because room_type and neighbourhood_group are categorical, the intercept corresponds to the reference category of each: the Bronx for neighborhood group and entire home/apt for room type. The coefficients for the other neighborhood groups (Brooklyn, Manhattan, Queens, Staten Island) and room types (private room, shared room) are differences in price relative to those reference categories. The intercept is statistically significant (p-value 8.17e-06) with an estimate of 124.91: the predicted price of an entire home/apt in the Bronx with minimum_nights equal to 0 is 124.91 USD. A minimum stay of 0 nights has no practical meaning, but the intercept serves as the baseline from which the other effects are measured. The partial effects can be further analyzed as follows:
- Brooklyn: the coefficient of 57.394 means that listings in Brooklyn are estimated to cost 57.39 USD more than listings in the Bronx (the reference category), holding minimum_nights and room_type constant. With a p-value of 0.04099, the difference is statistically significant at the 5% level and is unlikely to be due to chance alone, despite the weak adjusted R^2 of the model.
- Manhattan: the coefficient of 87.513 means that listings in Manhattan are estimated to cost 87.51 USD more than listings in the Bronx, holding minimum_nights and room_type constant. With a p-value of 0.00186, the difference is statistically significant at the 1% level.
- Queens: the coefficient of 29.658 means that listings in Queens are estimated to cost 29.66 USD more than listings in the Bronx, holding minimum_nights and room_type constant. With a p-value of 0.33214, the difference is not statistically significant. H0 (no price difference from the Bronx) cannot be rejected, so the estimated premium of 29.66 USD over the 124.91 USD baseline should not be relied upon.
- Staten Island: the coefficient of 54.871 means that listings in Staten Island are estimated to cost 54.87 USD more than listings in the Bronx, holding minimum_nights and room_type constant. With a p-value of 0.17103, the difference is not statistically significant. H0 cannot be rejected, and the estimated premium of 54.87 USD over the 124.91 USD baseline should not be relied upon.
- Private room: the coefficient of -88.074 means that private rooms are estimated to cost 88.07 USD less than entire homes/apts, holding minimum_nights and neighborhood group constant. The p-value is below 2e-16, so H0 (no price difference between a private room and an entire home/apt) is rejected.
- Shared room: the coefficient of -89.853 means that shared rooms are estimated to cost 89.85 USD less than entire homes/apts, holding minimum_nights and neighborhood group constant. With a p-value of 0.00608, H0 (no price difference between a shared room and an entire home/apt) is rejected.
- minimum_nights: the coefficient of -0.209 means that each additional required night is associated with a price about 0.21 USD lower, holding neighborhood group and room type constant. With a p-value of 0.08608, this effect is not significant at the 5% level.
mlr <- lm(price~minimum_nights+neighbourhood_group+room_type, data=train)
summary(mlr)
##
## Call:
## lm(formula = price ~ minimum_nights + neighbourhood_group + room_type,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -162.01 -53.29 -22.89 15.41 2877.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.9139 27.9181 4.474 8.17e-06 ***
## minimum_nights -0.2090 0.1217 -1.717 0.08608 .
## neighbourhood_groupBrooklyn 57.3945 28.0630 2.045 0.04099 *
## neighbourhood_groupManhattan 87.5130 28.0756 3.117 0.00186 **
## neighbourhood_groupQueens 29.6584 30.5732 0.970 0.33214
## neighbourhood_groupStaten Island 54.8705 40.0669 1.369 0.17103
## room_typePrivate room -88.0743 6.5575 -13.431 < 2e-16 ***
## room_typeShared room -89.8527 32.7126 -2.747 0.00608 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129.6 on 1702 degrees of freedom
## Multiple R-squared: 0.125, Adjusted R-squared: 0.1214
## F-statistic: 34.75 on 7 and 1702 DF, p-value: < 2.2e-16
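Each neighbourhood_group row in the table compares one borough with the Bronx, so a non-significant Queens or Staten Island row does not by itself say whether the factor as a whole matters. One way to test the factor jointly is a partial F-test comparing the model with and without it. The sketch below uses simulated data (`toy`, `nights`, `borough`, and the effect sizes are all hypothetical), not the assignment data.

```r
set.seed(42)
n <- 300
toy <- data.frame(
  nights  = sample(1:30, n, replace = TRUE),
  borough = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
# Hypothetical data-generating process: borough shifts the mean price
toy$price <- 100 + 40 * (toy$borough == "B") +
  80 * (toy$borough == "C") + rnorm(n, sd = 30)

full    <- lm(price ~ nights + borough, data = toy)
reduced <- lm(price ~ nights, data = toy)   # same model without the factor

# Partial F-test: H0 is that every borough coefficient is zero
ft <- anova(reduced, full)
ft
```

Applied to the assignment, the analogous comparison would be `mlr` against a model that drops `neighbourhood_group`.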
Not all assumptions of the multiple linear regression model in part 5 are met. The assumptions of an MLR model are that (1) the error term has mean 0 given any combination of the independent variables, (2) the errors are normally distributed, (3) the errors have constant variance, (4) there is no perfect multicollinearity between the independent variables, and (5) the error terms are uncorrelated. The diagnostics give a mixed picture. The GVIF values below are all close to 1, so multicollinearity is not a concern, and the Breusch-Godfrey test (p-value 0.3708) does not reject the hypothesis of no serial correlation, although the checkresiduals plots show several values outside the 95% bounds. The distributional assumptions, however, are clearly violated: the plot of residuals against fitted values for the training data set shows a disproportionate number of residuals located well above 0, and the residual distribution has a noticeable right tail (the largest residual is 2877.11, while the median is -22.89). Taken together with the low adjusted R^2 of 0.1214, these patterns indicate that the model should be adjusted using transformations or alternative modeling techniques, and that there could be nonlinear patterns in the underlying data or significant predictors omitted. Further analysis and trial and error will be needed to construct a model that predicts airbnb listing price in NYC well.
library(forecast) # provides checkresiduals()
checkresiduals(mlr) # Examine residuals to check that the regression assumptions are met
##
## Breusch-Godfrey test for serial correlation of order up to 11
##
## data: Residuals
## LM test = 11.905, df = 11, p-value = 0.3708
# Plot residuals against fitted values for the training dataset
plot(fitted(mlr), residuals(mlr),
main="Residuals vs Fitted Values (Training Set)",
xlab="Fitted Values", ylab="Residuals")
abline(h = 0, col = "red")
library(car) # provides vif()
vif(mlr) # Check multicollinearity among the independent variables in mlr
## GVIF Df GVIF^(1/(2*Df))
## minimum_nights 1.009196 1 1.004588
## neighbourhood_group 1.026154 4 1.003232
## room_type 1.033131 2 1.008182
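A common remedy for right-skewed residuals like those described above is to model log(price) instead of price. Whether this helps on the Airbnb data is not shown here. The following is only a sketch on simulated, deliberately skewed prices (`room`, `price`, and the `skew` helper are hypothetical), comparing residual skewness before and after the transformation.

```r
set.seed(7)
n <- 500
room <- factor(sample(c("Entire", "Private"), n, replace = TRUE))
# Hypothetical right-skewed prices: log(price) is normal around a room mean
price <- exp(5 - 0.7 * (room == "Private") + rnorm(n, sd = 0.8))

fit_raw <- lm(price ~ room)        # price on its original scale
fit_log <- lm(log(price) ~ room)   # price on the log scale

# Sample skewness of residuals; values near 0 indicate symmetry
skew <- function(e) mean((e - mean(e))^3) / sd(e)^3

skew_raw <- skew(residuals(fit_raw))
skew_log <- skew(residuals(fit_log))
c(raw = skew_raw, log = skew_log)
```

Predictions from a log model are on the log scale and need to be transformed back with `exp()` before computing errors in dollars.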
price better?
Using the testing data and comparing MSE and MAPE among the 4 models (3 SLR and 1 MLR), the multiple linear regression model has the lowest MAPE and MSE, 7497.632 and 42792.07, respectively. Note that the MAPE column is computed as 100 times the mean absolute error (the errors are not divided by the actual price), so 7497.632 corresponds to an average miss of about 74.98 USD per listing rather than a percentage. The MLR model predicts price best and should be preferred, but as the residual analysis shows, it should be adjusted. The SLR models perform worse than MLR: individual predictors do not capture enough information to predict price accurately.
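# MLR accuracy test
mlr <- lm(price ~ minimum_nights + neighbourhood_group + room_type, data = train)
pred2 <- predict(mlr, newdata = test)
ape2 <- abs(test$price - pred2) # absolute error; divide by test$price for a percentage error
se2 <- (test$price - pred2)^2 # squared error
mape_MLR <- mean(ape2) * 100 # as written this is 100 * MAE, not a true MAPE
mse_MLR <- mean(se2)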
# neighbourhood_group SLR model accuracy test
# NOTE: fit on the full airbnb data, which includes the test rows;
# use data = train for an out-of-sample comparison
reg_neighborhood_group <- lm(price ~ neighbourhood_group, data = airbnb)
pred3 <- predict(reg_neighborhood_group, newdata = test)
ape3 <- abs(test$price - pred3)
se3 <- (test$price - pred3)^2
mape_neighborhood_group <- mean(ape3) * 100
mse_neighborhood_group <- mean(se3)
# room_type SLR model accuracy test (same caveat: fit on the full airbnb data)
reg_room_type <- lm(price ~ room_type, data = airbnb)
pred4 <- predict(reg_room_type, newdata = test)
ape4 <- abs(test$price - pred4)
se4 <- (test$price - pred4)^2
mape_room_type <- mean(ape4) * 100
mse_room_type <- mean(se4)
#minimum_nights SLR model accuracy test
reg_minimum_nights <- lm(price ~ minimum_nights, data=train)
pred5 <- predict(reg_minimum_nights,data.frame(test))
ape5 <- abs(test$price-pred5)
se5 <- (test$price-pred5)^2
mape_minimum_nights <- mean(ape5)*100
mse_minimum_nights <- mean(se5)
# Create data frame with model names and calculated MAPE and MSE values
accuracy_comparison <- data.frame(
Model = c("Multiple Linear Regression (MLR)",
"Simple Linear Regression - Neighbourhood Group",
"Simple Linear Regression - Room Type",
"Simple Linear Regression - Minimum Nights"),
MAPE = c(mape_MLR, mape_neighborhood_group, mape_room_type, mape_minimum_nights),
MSE = c(mse_MLR, mse_neighborhood_group, mse_room_type, mse_minimum_nights)
)
# Display table using kable
library(knitr) # provides kable()
kable(accuracy_comparison, caption = "Comparison of MAPE and MSE for Different Models")
| Model | MAPE | MSE |
|---|---|---|
| Multiple Linear Regression (MLR) | 7497.632 | 42792.07 |
| Simple Linear Regression - Neighbourhood Group | 8971.379 | 46524.06 |
| Simple Linear Regression - Room Type | 7509.803 | 43583.36 |
| Simple Linear Regression - Minimum Nights | 9526.168 | 48232.67 |
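For reference, the three usual error measures differ only in how each error is scaled. The sketch below computes them on three made-up observations (`actual` and `predicted` are hypothetical), and the percentage version divides each absolute error by the actual value before averaging.

```r
actual    <- c(100, 200, 400)   # hypothetical observed prices
predicted <- c(110, 180, 400)   # hypothetical model predictions

err  <- actual - predicted
mae  <- mean(abs(err))                  # mean absolute error, in dollars
mape <- mean(abs(err) / actual) * 100   # mean absolute percentage error, in %
mse  <- mean(err^2)                     # mean squared error, in dollars^2

c(MAE = mae, MAPE = mape, MSE = mse)
```

Here the absolute errors are 10, 20 and 0, so MAE is 10, MAPE is (10% + 10% + 0%) / 3, about 6.67%, and MSE is 500 / 3, about 166.67.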
Would you recommend Airbnb to use your model for predicting prices of their listings? No. Not all of the underlying assumptions are met for MLR, the adjusted R^2 is low, MAPE and MSE (although lower than for the SLR models) are relatively high, and several predictors are not statistically significant (for example, the Queens and Staten Island neighborhood groups do not differ significantly from the Bronx).
Do you believe your predictions are accurate? No, I do not believe the predictions are accurate, because the underlying error assumptions are not all met, the R^2 is low, and the MAPE and MSE on the held-out test data are high.
If you could have any data you wanted, which other variable(s) you would have liked to include in your regression? I would like to incorporate number of reviews, which was not used in the models above. In the training data its correlation with price (-0.091), although weak, is stronger than that of minimum nights (-0.008). Typically, more reviews indicate extremely good or bad experiences with an airbnb listing.