Data

Consider the data set ‘airbnbdata.csv’. This data is a simplified version of a Kaggle data set (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data).

We are trying to predict the per-night price of Airbnb listings in NYC.

Variables are as follows:

Questions

  1. (10 points) The data is imported into R for you and named airbnb.

Please separate the testing and training sets:

- set a seed
- separate 5% of your data into testing set
airbnb <- read.csv("C:/Users/cmart/Downloads/airbnbdata.csv", header = TRUE)

#Hint: airbnb=read.table("C:/Downloads/airbnbdata.csv", header = TRUE, sep=",", dec=".")


head(airbnb)
##     id neighbourhood_group       room_type price minimum_nights
## 1 2539            Brooklyn    Private room   149              1
## 2 2595           Manhattan Entire home/apt   225              1
## 3 3647           Manhattan    Private room   150              3
## 4 3831            Brooklyn Entire home/apt    89              1
## 5 5022           Manhattan Entire home/apt    80             10
## 6 5099           Manhattan Entire home/apt   200              3
##   number_of_reviews
## 1                 9
## 2                45
## 3                 0
## 4               270
## 5                 9
## 6                74
#Specify neighborhood_group and room_type variables as categorical variables
airbnb$neighbourhood_group <- as.factor(airbnb$neighbourhood_group)
airbnb$room_type <- as.factor(airbnb$room_type)
#Total number of rows in airbnbdata - test and train set rows should sum to 1800
nrow(airbnb)
## [1] 1800
## 95% of the sample size
smp_size <- floor(0.95 * nrow(airbnb))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(airbnb)), size = smp_size) #seq_len ensures vector starts at 1

train <- airbnb[train_ind, ]
test <- airbnb[-train_ind, ]

nrow(train)
## [1] 1710
nrow(test)
## [1] 90
head(train)
##         id neighbourhood_group       room_type price minimum_nights
## 415 140133            Brooklyn Entire home/apt   249              3
## 463 163627            Brooklyn    Private room    89              3
## 179  45393           Manhattan Entire home/apt   150             26
## 526 190968            Brooklyn Entire home/apt   215              3
## 195  51485           Manhattan    Private room    83              1
## 938 353317           Manhattan Entire home/apt   129              3
##     number_of_reviews
## 415               150
## 463               205
## 179                38
## 526                33
## 195               285
## 938               132
head(test)
##        id neighbourhood_group       room_type price minimum_nights
## 3    3647           Manhattan    Private room   150              3
## 21   7801            Brooklyn Entire home/apt   299              3
## 43  12303            Brooklyn    Private room   120              7
## 66  16421           Manhattan    Private room    52             30
## 107 25235            Brooklyn Entire home/apt   125             90
## 109 25696           Manhattan    Private room   100              2
##     number_of_reviews
## 3                   0
## 21                  9
## 43                 25
## 66                191
## 107               162
## 109               170
#Note that number of rows in train(1710) and test(90) reconcile to total airbnb rows (1800)

For questions 2 to 6, use the training set.

  2. (10 points) Explore if a linear relationship is viable between price (dependent variable) and number_of_reviews and between price and minimum_nights by obtaining the scatter plots and the respective correlations.
# Scatter plot of price vs number_of_reviews
plot(train$number_of_reviews, train$price,
     main = "Scatter Plot of Price vs Number of Reviews",
     xlab = "Number of Reviews",
     ylab = "Price",
     col = "blue", pch = 19)

NOTE: There is no noticeable relationship between number of reviews and price in the scatterplot above. This agrees with the correlation matrix generated below: number of reviews and price have a correlation coefficient of -0.091, a weak negative linear association, so the number of reviews on a listing is a poor predictor of its price. This is worth keeping in mind as the simple and multiple linear regression models are formulated.

# Scatter plot of price vs minimum_nights
plot(train$minimum_nights, train$price,
     main = "Scatter Plot of Price vs Minimum Nights",
     xlab = "Minimum Nights",
     ylab = "Price",
     col = "green", pch = 19)

NOTE: There is no clear relationship between minimum nights and price in the Airbnb training data set. This is confirmed by the correlation matrix and plot below. Price and minimum nights have a correlation coefficient of -0.0081, which is essentially no linear association, so the minimum number of nights required cannot be used to predict a listing's price with any accuracy. This preliminary finding matters as the simple and multiple regression models are constructed and analyzed. Note that price and number of reviews, while only weakly associated (-0.091), are more strongly correlated than price and minimum nights.

# Selecting numeric columns and creating the correlation matrix
numeric_airbnb_train <- train[, sapply(train, is.numeric)]
cor_matrix <- cor(numeric_airbnb_train, use = "complete.obs")
print(cor_matrix)
##                            id        price minimum_nights number_of_reviews
## id                 1.00000000  0.015717237    0.028183216       -0.15206104
## price              0.01571724  1.000000000   -0.008132282       -0.09072152
## minimum_nights     0.02818322 -0.008132282    1.000000000       -0.13004988
## number_of_reviews -0.15206104 -0.090721516   -0.130049884        1.00000000
#create correlation matrix plot visualization
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(RColorBrewer)

corrplot(cor_matrix)
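The values in the correlation matrix are correlation coefficients (r), which should not be confused with the coefficient of determination (R-squared). In a simple linear regression the two are linked by R-squared = r squared: for example, the correlation of -0.008132282 between price and minimum_nights squares to about 6.6e-05, which is the Multiple R-squared that summary(reg_minimum_nights) reports in question 4. A minimal sketch on made-up numbers (the vectors x and y are hypothetical and are not taken from airbnbdata.csv):

```r
# Hypothetical toy data (not from airbnbdata.csv)
x <- c(1, 2, 3, 5, 8, 13, 21, 30)
y <- c(150, 90, 210, 120, 95, 180, 60, 110)

r <- cor(x, y)                       # Pearson correlation coefficient, between -1 and 1
fit <- lm(y ~ x)                     # simple linear regression of y on x
r_squared <- summary(fit)$r.squared  # coefficient of determination, between 0 and 1

# In simple linear regression, R^2 equals r^2
all.equal(r^2, r_squared)

# cor.test() additionally reports a p-value for H0: the true correlation is 0
ct <- cor.test(x, y)
ct$estimate
ct$p.value
```

On the training data, the same check would be cor.test(train$price, train$minimum_nights).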

  3. (10 points) Explore if a relationship is viable between price (dependent variable) and neighbourhood_group by obtaining the box plot.
# Box plot of price by neighborhood_group (note that ~ denotes relationship among 2 variables)
boxplot(train$price ~ train$neighbourhood_group,
        main = "Box Plot of Price by Neighbourhood Group",
        xlab = "Neighbourhood Group",
        ylab = "Price",
        col = "lightblue",
        outline = FALSE)

# Box plot of price by room_type (note that ~ denotes relationship among 2 variables)
boxplot(train$price ~ train$room_type,
        main = "Box Plot of Price by Room Type",
        xlab = "Room Type",
        ylab = "Price",
        col = "grey",
        outline = FALSE)

NOTE: Neighbourhood group is a categorical variable, so a correlation coefficient between it and price cannot be computed in the correlation matrix above unless the training data is recoded with dummy variables for each borough. A box plot is used instead to examine how a categorical variable relates to price, a numerical dependent variable. There appears to be a moderate association between price and neighbourhood group: the median price and the spread (interquartile range and whisker range) differ across the five NYC boroughs. Because outline = FALSE is set, outliers are hidden in these plots. The relationship will need to be handled with care (factor coding and a reference category) when constructing and analyzing any regression model.
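The comparison read off the box plots can also be put into numbers by computing the median and interquartile range per group. A minimal sketch, assuming a small made-up data frame (toy is hypothetical and is not the airbnb data):

```r
# Hypothetical toy data (not from airbnbdata.csv)
toy <- data.frame(
  price = c(60, 80, 70, 120, 150, 135, 200, 260, 230),
  neighbourhood_group = factor(rep(c("Bronx", "Brooklyn", "Manhattan"), each = 3))
)

# Median price per group: the thick line inside each box
group_median <- tapply(toy$price, toy$neighbourhood_group, median)

# Interquartile range per group: approximately the height of each box
group_iqr <- tapply(toy$price, toy$neighbourhood_group, IQR)

print(group_median)
print(group_iqr)
```

On the training data, the equivalent call would be tapply(train$price, train$neighbourhood_group, median).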

  4. (20 points) Estimate 3 simple linear regression models between price (as the dependent variable) and the following variables: minimum_nights, neighbourhood_group, and room_type (recall to specify categorical variables as a factor) separately. Report your regression models. Comment on the significance of the slopes.
#Simple Linear Regression - Continuous Indp. Variable (minimum_nights)
reg_minimum_nights <- lm(price ~ minimum_nights, data=train)
summary(reg_minimum_nights)
## 
## Call:
## lm(formula = price ~ minimum_nights, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137.38  -72.38  -31.21   37.62 2842.79 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    157.50968    3.55944  44.251   <2e-16 ***
## minimum_nights  -0.04345    0.12928  -0.336    0.737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.3 on 1708 degrees of freedom
## Multiple R-squared:  6.613e-05,  Adjusted R-squared:  -0.0005193 
## F-statistic: 0.113 on 1 and 1708 DF,  p-value: 0.7368
confint(reg_minimum_nights)
##                      2.5 %      97.5 %
## (Intercept)    150.5283701 164.4909958
## minimum_nights  -0.2970118   0.2101103
# Plot the relationship between minimum_nights and price
# (the predictor goes on the x-axis so that abline() draws the fitted line correctly)
plot(train$minimum_nights, train$price,
     main = "Price vs. Minimum Nights",
     xlab = "Minimum Nights",
     ylab = "Price",
     col = "red", pch = 19)
abline(reg_minimum_nights, col = "green") #Add the regression line to the plot to analyze fit

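#Simple Linear Regression - Categorical Indp. Variables (neighbourhood_group, room_type)
#NOTE: these two models are fit on the full airbnb data, not on train, and the output below reflects that.
#The instructions ask for the training set; fitting with data = train would give somewhat different estimates.

reg_neighborhood_group <- lm(price ~ neighbourhood_group, data = airbnb) #neighbourhood_group
reg_room_type <- lm(price ~ room_type, data = airbnb) #room_type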

summary(reg_neighborhood_group)
## 
## Call:
## lm(formula = price ~ neighbourhood_group, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144.81  -70.81  -27.37   28.68 2819.19 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         71.57      29.48   2.428 0.015295 *  
## neighbourhood_groupBrooklyn         74.75      29.89   2.501 0.012482 *  
## neighbourhood_groupManhattan       109.24      29.88   3.656 0.000264 ***
## neighbourhood_groupQueens           25.81      32.49   0.794 0.427143    
## neighbourhood_groupStaten Island    41.03      43.23   0.949 0.342581    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.4 on 1795 degrees of freedom
## Multiple R-squared:  0.03109,    Adjusted R-squared:  0.02893 
## F-statistic:  14.4 on 4 and 1795 DF,  p-value: 1.416e-11
summary(reg_room_type)
## 
## Call:
## lm(formula = price ~ room_type, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146.32  -55.32  -21.93   18.07 2899.07 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            195.319      4.121  47.395  < 2e-16 ***
## room_typePrivate room  -94.394      6.601 -14.300  < 2e-16 ***
## room_typeShared room   -88.319     33.226  -2.658  0.00793 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135.9 on 1797 degrees of freedom
## Multiple R-squared:  0.1033, Adjusted R-squared:  0.1023 
## F-statistic: 103.5 on 2 and 1797 DF,  p-value: < 2.2e-16
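In the two categorical models above, R uses the first factor level as the reference category, so the intercept is the mean price of that level (71.57 for the Bronx, 195.319 for Entire home/apt) and every other coefficient is a difference in mean price from it. A minimal sketch on made-up data, with relevel() shown as one way to change the reference category (the toy data frame and the room_type2 column are hypothetical):

```r
# Hypothetical toy data (not from airbnbdata.csv)
toy <- data.frame(
  price = c(200, 180, 220, 100, 90, 110, 60, 70),
  room_type = factor(c(rep("Entire home/apt", 3), rep("Private room", 3),
                       rep("Shared room", 2)))
)

fit <- lm(price ~ room_type, data = toy)
coef(fit)

# Intercept = mean price of the reference level (first level alphabetically)
ref_mean <- mean(toy$price[toy$room_type == "Entire home/apt"])

# Coefficient for "Private room" = its mean price minus the reference mean
private_diff <- mean(toy$price[toy$room_type == "Private room"]) - ref_mean

# relevel() changes the reference category; the fitted values are unchanged
toy$room_type2 <- relevel(toy$room_type, ref = "Private room")
fit2 <- lm(price ~ room_type2, data = toy)
coef(fit2)
```

Changing the reference only changes which comparisons the t-tests report; the overall F-test and R-squared stay the same.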
  5. (15 points) Estimate a multiple linear regression model between price (as the dependent variable) and the following variables: minimum_nights, neighbourhood_group, and room_type. Report your regression model. Comment on the significance of the partial effects.

The global (model-wide) F-test for the mlr model, with an F-statistic of 34.75 and a p-value below 2.2e-16, indicates that minimum nights, neighbourhood group, and room type are jointly statistically significant. H0 states that none of the three independent variables in mlr is related to price, i.e. that all slope coefficients are zero. In other words, a change in neighbourhood, room type, or the minimum number of nights required would have no effect on the listed price. Ha states that at least one coefficient is nonzero. Although the adjusted R^2 (0.1214) shows that the model explains only about 12% of the variation in price, the F-statistic demonstrates that the overall model is statistically significant. H0 is therefore rejected in favor of Ha: price can be explained, albeit weakly, by minimum nights, room type, and neighbourhood group taken together. The low adjusted R^2 is consistent with only one or two variables driving most of the explained variation in price.

To interpret the partial effects in the mlr output below, note that because room_type and neighbourhood_group are categorical, the intercept corresponds to the reference category of each factor: the Bronx for neighbourhood group and entire home/apt for room type. The coefficients for the other neighbourhood groups (Brooklyn, Manhattan, Queens, Staten Island) and room types (private room, shared room) represent the difference in price relative to these reference categories. The intercept is statistically significant, with a p-value of 8.17e-06 and an estimate of 124.91. It is the predicted price of an entire home/apt in the Bronx with minimum_nights equal to 0. While this has little practical meaning (a listing cannot require 0 nights), the intercept anchors the model and is the starting point for interpreting the other coefficients. The partial effects can be further analyzed as follows:

  1. The coefficient of 57.394 suggests that listings in Brooklyn are estimated to cost 57.394 USD more than listings in the Bronx (the reference category), holding minimum_nights and room_type constant. The p-value of 0.04099 is below 0.05, so the difference from the Bronx is statistically significant at the 5% level and is unlikely to be due to chance alone, despite the low adjusted R^2 of both the mlr model and the SLR model of price on neighbourhood_group.

  2. The coefficient of 87.513 suggests that listings in Manhattan are estimated to cost 87.513 USD more than listings in the Bronx (the reference category), holding minimum_nights and room_type constant. The p-value of 0.00186 shows that the difference from the Bronx is statistically significant and is unlikely to be due to chance alone, despite the low adjusted R^2 of both the mlr model and the SLR model of price on neighbourhood_group.

  3. The coefficient of 29.658 suggests that listings in Queens are estimated to cost 29.658 USD more than listings in the Bronx (the reference category), holding minimum_nights and room_type constant. The p-value of 0.33214 shows that this difference is not statistically significant. H0 cannot be rejected, and the increment of 29.658 USD over the 124.9139 USD baseline for a Queens listing should not be relied upon.

  4. The coefficient of 54.871 suggests that listings in Staten Island are estimated to cost 54.871 USD more than listings in the Bronx (the reference category), holding minimum_nights and room_type constant. The p-value of 0.17103 shows that this difference is not statistically significant. H0 cannot be rejected, and the increment of 54.8705 USD over the 124.9139 USD baseline for a Staten Island listing should not be relied upon.

  5. The coefficient of -88.074 suggests that listings classified as private room are estimated to cost 88.074 USD less than listings classified as entire home/apt, holding minimum_nights and neighbourhood_group constant. The p-value is below 2e-16, so H0 (no price difference between a private room and an entire home/apt, all else equal) is rejected. While the adjusted R^2 of both the mlr model and the SLR model of price on room_type is low, the effect is statistically significant.

  6. The coefficient of -89.853 suggests that listings classified as shared room are estimated to cost 89.853 USD less than listings classified as entire home/apt, holding minimum_nights and neighbourhood_group constant. The p-value of 0.00608 means that H0 (no price difference between a shared room and an entire home/apt, all else equal) is rejected. While the adjusted R^2 of both the mlr model and the SLR model of price on room_type is low, the effect is statistically significant.

  7. The coefficient of -0.209 suggests that each additional required night is associated with a price about 0.21 USD lower, holding neighbourhood_group and room_type constant. The p-value of 0.08608 is above 0.05, so this effect is not statistically significant at the 5% level (it is significant only at the 10% level), and H0 cannot be rejected.

mlr <- lm(price~minimum_nights+neighbourhood_group+room_type, data=train)
summary(mlr)
## 
## Call:
## lm(formula = price ~ minimum_nights + neighbourhood_group + room_type, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.01  -53.29  -22.89   15.41 2877.11 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      124.9139    27.9181   4.474 8.17e-06 ***
## minimum_nights                    -0.2090     0.1217  -1.717  0.08608 .  
## neighbourhood_groupBrooklyn       57.3945    28.0630   2.045  0.04099 *  
## neighbourhood_groupManhattan      87.5130    28.0756   3.117  0.00186 ** 
## neighbourhood_groupQueens         29.6584    30.5732   0.970  0.33214    
## neighbourhood_groupStaten Island  54.8705    40.0669   1.369  0.17103    
## room_typePrivate room            -88.0743     6.5575 -13.431  < 2e-16 ***
## room_typeShared room             -89.8527    32.7126  -2.747  0.00608 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129.6 on 1702 degrees of freedom
## Multiple R-squared:  0.125,  Adjusted R-squared:  0.1214 
## F-statistic: 34.75 on 7 and 1702 DF,  p-value: < 2.2e-16
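Written out with the estimates from the summary(mlr) output, the fitted model and one worked prediction look as follows. The indicator names (Brooklyn, Manhattan, and so on) are shorthand introduced here for the dummy variables R creates; each equals 1 for a listing in that category and 0 otherwise.

```latex
\begin{aligned}
\widehat{\text{price}} ={}& 124.9139 - 0.2090\,\text{minimum\_nights} \\
 &+ 57.3945\,\text{Brooklyn} + 87.5130\,\text{Manhattan}
  + 29.6584\,\text{Queens} + 54.8705\,\text{StatenIsland} \\
 &- 88.0743\,\text{PrivateRoom} - 89.8527\,\text{SharedRoom}
\end{aligned}

% Worked example: a private room in Manhattan with minimum_nights = 3
\widehat{\text{price}} = 124.9139 - 0.2090(3) + 87.5130 - 88.0743 \approx 123.73
```

For a listing in the reference categories (entire home/apt in the Bronx), all indicators are 0 and only the intercept and the minimum_nights term remain.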
  6. (10 points) Check the residuals of your model in part 5. Do you believe that the assumptions of regression are met?

Not all assumptions of the multiple linear regression model in part 5 are met. The assumptions of an MLR model are that (1) the error term has a mean of 0 for every combination of the independent variables, (2) the errors follow a normal distribution, (3) the errors have constant variance, (4) there is no perfect multicollinearity among the independent variables, and (5) the error terms are uncorrelated. Assumption (4) appears to hold: the GVIF values below are all close to 1. For assumption (5), the Breusch-Godfrey test (p-value = 0.3708) does not reject the null of no serial correlation, although several autocorrelations fall outside the 95% bounds for white noise. The remaining assumptions are doubtful. A disproportionate number of large residuals lie well above 0 in the plot of residuals against fitted values for the training data set, and the histogram of residuals has a noticeable right tail, so normality and constant variance are not credible. Taken together with a disappointingly low adjusted R^2 of 0.1214, these patterns indicate that the model should be adjusted using transformations or alternative modeling techniques, and that there may be nonlinear patterns in the underlying data or significant predictors omitted. Further analysis and trial and error will be needed to construct a model that predicts Airbnb listing prices in NYC well.

library(forecast) #provides checkresiduals()
checkresiduals(mlr) #Examine residuals to ensure assumptions of regression are met

## 
##  Breusch-Godfrey test for serial correlation of order up to 11
## 
## data:  Residuals
## LM test = 11.905, df = 11, p-value = 0.3708
# Plot residuals against fitted values for the training dataset

plot(fitted(mlr), residuals(mlr), 
     main="Residuals vs Fitted Values (Training Set)", 
     xlab="Fitted Values", ylab="Residuals")
abline(h = 0, col = "red")

library(car) #provides vif()
vif(mlr) #Determine multicollinearity between independent variables in mlr
##                         GVIF Df GVIF^(1/(2*Df))
## minimum_nights      1.009196  1        1.004588
## neighbourhood_group 1.026154  4        1.003232
## room_type           1.033131  2        1.008182
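One transformation commonly tried when residuals show a long right tail is to model the log of the response. The sketch below uses simulated data only, so it illustrates the idea rather than a result for the airbnb data; the variables x and y and the skewness() helper are hypothetical, and applying this to price would assume all prices are strictly positive.

```r
# Simulated, hypothetical data (not from airbnbdata.csv)
set.seed(1)
n <- 200
x <- runif(n, 1, 30)
y <- exp(4.5 - 0.01 * x + rnorm(n, sd = 0.8))  # right-skewed, strictly positive response

fit_raw <- lm(y ~ x)       # model on the original scale
fit_log <- lm(log(y) ~ x)  # model on the log scale

# Sample skewness: near 0 for symmetric residuals, positive for a right tail
skewness <- function(e) mean((e - mean(e))^3) / sd(e)^3

skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
c(raw = skew_raw, log = skew_log)

# Shapiro-Wilk test of normality on each set of residuals
shapiro.test(residuals(fit_raw))$p.value
shapiro.test(residuals(fit_log))$p.value
```

Predictions from a log-scale model are on the log scale and need to be back-transformed with exp() before being compared with actual prices.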
  7. (20 points) Calculate the accuracy measures for all four models you have estimated, using the testing data. Which model predicts the price better?

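Using the testing data and comparing MSE and MAPE among the four models (3 SLR and 1 MLR), the multiple linear regression model has the lowest MAPE and MSE, 7497.632 and 42792.07, respectively. Note that the MAPE column as computed below is the mean absolute error multiplied by 100 (the absolute errors are not divided by the actual price), so it is not a true percentage; the ranking of the models is still informative. Also note that the neighbourhood_group and room_type models are fit on the full airbnb data, which includes the test rows, so their test errors are if anything optimistic, yet the MLR model still scores lowest. The MLR model best predicts price and should be used but, as seen in the residual analysis, should be adjusted. The SLR models perform worse than MLR: individual predictors do not capture enough information to predict price accurately.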

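#MLR accuracy test
mlr <- lm(price~minimum_nights+neighbourhood_group+room_type, data=train)
pred2 <- predict(mlr, newdata = test)
ape2 <- abs(test$price-pred2) #absolute error (not divided by test$price, so not a percentage error)
se2 <- (test$price-pred2)^2   #squared error

mape_MLR <- mean(ape2)*100 #NOTE: this is the mean absolute error x 100 rather than a true MAPE
mse_MLR <- mean(se2)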

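#neighbourhood_group SLR model accuracy test
#NOTE: fit on the full airbnb data (which includes the test rows); data = train would avoid this leakage
reg_neighborhood_group <- lm(price ~ neighbourhood_group, data = airbnb)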
pred3 <- predict(reg_neighborhood_group,data.frame(test))
ape3 <- abs(test$price-pred3)
se3 <- (test$price-pred3)^2

mape_neighborhood_group <- mean(ape3)*100
mse_neighborhood_group <- mean(se3)

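#room_type SLR model accuracy test
#NOTE: fit on the full airbnb data (which includes the test rows); data = train would avoid this leakage
reg_room_type <- lm(price ~ room_type, data = airbnb)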
pred4 <- predict(reg_room_type,data.frame(test))
ape4 <- abs(test$price-pred4)
se4<- (test$price-pred4)^2

mape_room_type <- mean(ape4)*100
mse_room_type <- mean(se4)

#minimum_nights SLR model accuracy test
reg_minimum_nights <- lm(price ~ minimum_nights, data=train)
pred5 <- predict(reg_minimum_nights,data.frame(test))
ape5 <- abs(test$price-pred5)
se5 <- (test$price-pred5)^2

mape_minimum_nights <- mean(ape5)*100
mse_minimum_nights <- mean(se5)

# Create data frame with model names and calculated MAPE and MSE values
accuracy_comparison <- data.frame(
  Model = c("Multiple Linear Regression (MLR)", 
            "Simple Linear Regression - Neighbourhood Group", 
            "Simple Linear Regression - Room Type", 
            "Simple Linear Regression - Minimum Nights"),
  MAPE = c(mape_MLR, mape_neighborhood_group, mape_room_type, mape_minimum_nights),
  MSE = c(mse_MLR, mse_neighborhood_group, mse_room_type, mse_minimum_nights)
)

# Display table using kable
library(knitr) #provides kable()
kable(accuracy_comparison, caption = "Comparison of MAPE and MSE for Different Models")
Comparison of MAPE and MSE for Different Models

Model                                               MAPE       MSE
Multiple Linear Regression (MLR)                7497.632  42792.07
Simple Linear Regression - Neighbourhood Group  8971.379  46524.06
Simple Linear Regression - Room Type            7509.803  43583.36
Simple Linear Regression - Minimum Nights       9526.168  48232.67
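For reference, the accuracy measures can be written as small functions. In this sketch the MAPE divides each absolute error by the actual value, which is the step missing from the mape_* calculations above; the helper names mae, mse and mape and the toy vectors are illustrative and not part of the original code.

```r
# Accuracy measures (helper names are illustrative, not from the original code)
mae  <- function(actual, pred) mean(abs(actual - pred))
mse  <- function(actual, pred) mean((actual - pred)^2)
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100  # requires actual != 0

# Hypothetical toy values (not from airbnbdata.csv)
actual <- c(100, 200, 50)
pred   <- c(110, 180, 50)

mae(actual, pred)   # (10 + 20 + 0) / 3 = 10
mse(actual, pred)   # (100 + 400 + 0) / 3 = 166.67
mape(actual, pred)  # (10% + 10% + 0%) / 3 = 6.67
```

With the objects defined earlier, the MLR figure would be obtained as mape(test$price, pred2), provided no test price is 0.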
  8. (5 points) Comment on your findings in terms of: