1. Abstract

Our final project covers exploratory data analysis, data visualization, data transformation, modeling, and model selection for New York City Airbnb data. Airbnb is a company that allows individuals to rent out their own real estate to tourists and other customers for lodging. Our data set consists of just under 49,000 Airbnb listings within the five boroughs, with 16 variables (12 of which are used in our analysis). Using techniques learned in this course, we conducted a thorough analysis of these listings and built a model that seeks to predict the price of any Airbnb listing in New York City.

 

1.1 Problem Statement:

What are the most important factors that determine the price of an Airbnb in NYC, and how can we use those factors to predict the price of any Airbnb in NYC?

 

1.2 Research Questions:

How much does the borough in which an Airbnb is located affect its price?

Are customers willing to pay more for an entire house/apartment?

How much do reviews (specifically the sample size of reviews) impact the price of an Airbnb?

 
 

2. Key Words

Pricing

Modeling

Stepwise

 
 

3. Introduction

In addressing this problem, we aim to glean insight into Airbnb operations in NYC. The findings will likely be most useful to property owners who are considering how best to use their assets. They will also be valuable to consumers looking for a house or apartment for a short stay, by indicating what prices they can expect to pay. However, since most Airbnb consumers are likely price takers who depend almost entirely on the locations available, the answer to this problem will generate more value for property owners. The main economic motivation for building this model stems from the potential business venture of renting or purchasing a home or apartment to use as an Airbnb. An accurate way of predicting the income of a property on Airbnb before purchasing it would be very useful to a person looking to enter the market. Another motivation is rooted in opportunity-cost comparisons. If a property owner has a spare room and is unsure whether to rent it out full-time or to list it on Airbnb, our solution would provide insight that could help guide that decision.

 
 
 
 

4. Literature Review

Many researchers have used linear regression models in applications similar to our project. Stephen Mak, Lennon Choy, and Winky Ho used a quantile regression to model real estate prices in Hong Kong (2010). Quantile regression complements least squares, which is the method we use in our project.

While a multiple linear regression estimates coefficients that represent the change in the mean of the response for a one-unit change in a predictor variable, a quantile regression works differently. A quantile regression estimates the change in a chosen quantile of the response variable (here, the price of a piece of real estate in Hong Kong), such as the median or the 90th percentile, for a one-unit change in a predictor. The two types of models are very similar, and although fitting different quantiles is not something we have typically done in this class, we as a group found it easy to understand. Quantile regression handles outliers well even when the data are not transformed, because an extreme observation affects the fit only through which side of the fitted quantile it falls on, not through how far away it is. However, it can also make interpretation harder: the same predictor can have a very different coefficient at one quantile than at another, so two properties with similar characteristics can be described by quite different coefficients depending on where in the price distribution they sit.
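To make the contrast with least squares concrete, the following is a minimal R sketch on simulated data (a hypothetical illustration, not taken from the cited paper). It fits a single constant to a skewed sample by minimizing the check (pinball) loss that quantile regression uses, and compares how much one extreme value moves the median versus the mean. The names `check_loss` and `fit_quantile` are our own.

```r
# Check (pinball) loss used by quantile regression for quantile level tau
check_loss <- function(u, tau) u * (tau - (u < 0))

# Simulated right-skewed "prices" (hypothetical data), plus one extreme outlier
set.seed(1)
y0 <- rexp(200, rate = 1 / 100)
y  <- c(y0, 2000)

# Among the observed values, find the constant that minimises the total check loss
fit_quantile <- function(y, tau) {
  total_loss <- sapply(y, function(m) sum(check_loss(y - m, tau)))
  y[which.min(total_loss)]
}

median_hat <- fit_quantile(y, 0.5)  # tau = 0.5 recovers the sample median
q90_hat    <- fit_quantile(y, 0.9)  # tau = 0.9 recovers the 90th percentile

# The least-squares constant is the mean, which the outlier shifts far more
shift_mean   <- mean(y) - mean(y0)
shift_median <- median(y) - median(y0)
```

In a full quantile regression the constant is replaced by a linear function of the predictors, but the loss being minimized is the same.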

Similar to our models, a group in Kansas used ordinary least squares to determine the value of land based on a number of variables (Taylor et al., 2015). Their project uses the same type of model that we use; the only difference is that they predict the price of the land itself rather than the price of a stay on it. The advantage of the quantile approach over linear regression is that it can be used to explain the determinants of the dependent variable at any point of its distribution (Mak et al., 2010).

Regression models have been used to predict prices for far more than real estate, but to understand how our topic relates to the state of the art, we wanted to compare the data sets. The Hong Kong group sampled their data from City One (Sha Tin), an area in which properties share similar location-specific characteristics (Mak et al., 2010). The data came from a real estate firm that compiled it from government real estate records. This is similar to our own data set, which was taken from the five boroughs of NYC. The data from the Kansas group was collected over several years (2012-2014) and comprises various types of land, unlike the Hong Kong and NYC data sets (Taylor et al., 2015). This means that factors other than the characteristics of the land itself can play into the Kansas data set. Many outside factors cannot be accounted for in a regression model, as real estate prices can fluctuate greatly.

 
 
 
 

5. Methodology

Our project seeks to predict the price of an Airbnb based on a number of factors, including the neighborhood, latitude and longitude, room type, reviews, minimum nights, host listings count, and availability. We sought to create a model that uses these 12 variables to predict a price for an Airbnb. As a group, we discussed how we thought each variable would affect the modeled price. For example, a listing in Manhattan would almost certainly be priced higher than a listing in Staten Island or Queens. Similar to the table in our assignments, we created one for all the variables in our data set and their theoretical effects on the models that we will build:

 
 
 
 

6. Experimentation and Results

 

6.1 Data Exploration

 
 

 
 

 
 

 
 

 
 
 

6.2 Data Preparation

As seen in the R code appendix, the data preparation involved several steps. First, several variables were created. total_months was created by dividing the number of reviews by the number of reviews per month, which estimates how long the Airbnb had been posted online. total_days expresses the same duration in days rather than months and was obtained by multiplying total_months by 30. An opportunity variable was created to show how long an Airbnb was posted relative to the number of nights it was actually available to be rented out. A variable, yr_avbl_prop, was created to show the percentage of time per year an Airbnb was available for rent, along with quantity supplied (Qs) and quantity demanded (Qd) variables. Quantity supplied was derived by multiplying yr_avbl_prop by quantity demanded. The group also converted last_review from a date in the form “2011-03-28” to the number of days between that date and the most recent date, to make it easier to work with. Finally, avgQs was created as the average of quantity supplied and quantity demanded, (Qs+Qd)/2.

Filters were used to remove outliers in price (above 2500) and minimum nights (above 180). With our new variables created and outliers taken care of, we moved on to indicator variables. Indicator variables were created for the 5 boroughs, as well as for the different room types: private room, entire house/apt, and shared room. In short, the focus of the data preparation was to create a few extra variables to help improve the fit of our model, to create indicator variables for our categorical variables, and to eliminate outliers.
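The steps described above can be sketched in base R as follows. This is a minimal illustration on a four-row toy data frame, not the code from the appendix. The column names `neighbourhood_group` and `room_type`, their labels, and the definition of `yr_avbl_prop` as `availability_365 / 365` are assumptions. The opportunity, Qd, Qs and avgQs variables are left out because their exact definitions are not given in the text.

```r
# Toy stand-in for the listings data. last_review is named in the text; the
# neighbourhood_group and room_type column names and labels are assumptions.
listings <- data.frame(
  price               = c(149, 225, 3000, 89),
  minimum_nights      = c(1, 1, 3, 200),
  number_of_reviews   = c(9, 45, 10, 270),
  reviews_per_month   = c(0.21, 0.38, 0.50, 4.64),
  availability_365    = c(365, 355, 100, 194),
  last_review         = as.Date(c("2018-10-19", "2019-05-21",
                                  "2019-01-01", "2019-07-05")),
  neighbourhood_group = c("Brooklyn", "Manhattan", "Queens", "Brooklyn"),
  room_type           = c("Private room", "Entire home/apt",
                          "Shared room", "Entire home/apt")
)

# Derived duration and availability variables
listings$total_months <- listings$number_of_reviews / listings$reviews_per_month
listings$total_days   <- listings$total_months * 30
listings$yr_avbl_prop <- listings$availability_365 / 365  # assumed definition

# Days between each listing's last review and the most recent review date
listings$days_since_last_review <-
  as.numeric(max(listings$last_review) - listings$last_review)

# Indicator (0/1) variables for borough and room type (two of each shown)
listings$manhattan <- as.integer(listings$neighbourhood_group == "Manhattan")
listings$brooklyn  <- as.integer(listings$neighbourhood_group == "Brooklyn")
listings$home      <- as.integer(listings$room_type == "Entire home/apt")
listings$pvt_room  <- as.integer(listings$room_type == "Private room")

# Drop outliers: price above 2500 or minimum nights above 180
clean <- listings[listings$price <= 2500 & listings$minimum_nights <= 180, ]
```

In this toy example the third and fourth rows are removed by the filters, leaving two listings.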

 
 
 

6.3 Model Building

In building our models, we selected each model’s variables via backward stepwise selection. The coefficients shown for each of the following models are those of the final variables selected for that model. The models compare various data transformations within a multiple linear regression framework. Of all model types considered, linear regression was judged the best for predicting our continuous target variable, price. The models that follow are:

6.3.1—Multiple Linear Regression (MLR)

6.3.2—MLR with BoxCox transformation on the Predictor variables

6.3.3—MLR with BoxCox transformation on the Response variable

6.3.4—MLR with BoxCox transformation on both the Predictor and Response variables

6.3.5—MLR with log transformation on the Predictor variables

6.3.6—MLR with log transformation on the Response variable

6.3.7—MLR with log transformation on both the Predictor and Response variables
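As an illustration of backward stepwise selection, the sketch below uses `step()` from base R's stats package on simulated data. The text does not say which routine was used, so `step()` (which drops terms as long as AIC improves) is an assumption. The toy data frame and its `noise` column are hypothetical.

```r
# Toy data standing in for the listings; values are simulated, not real
set.seed(42)
n <- 500
toy <- data.frame(
  minimum_nights   = rpois(n, 3) + 1,
  availability_365 = sample(0:365, n, replace = TRUE),
  home             = rbinom(n, 1, 0.5),
  noise            = rnorm(n)  # unrelated to price by construction
)
toy$price <- 60 + 80 * toy$home + 0.1 * toy$availability_365 -
  2 * toy$minimum_nights + rnorm(n, sd = 30)

# Start from the full model and remove predictors one at a time by AIC
full    <- lm(price ~ ., data = toy)
reduced <- step(full, direction = "backward", trace = 0)

# Variables that survived the selection
formula(reduced)
```

Setting `trace = 0` suppresses the step-by-step log. Leaving it at the default prints the AIC of each candidate model, which can help explain why a variable was dropped.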

 

6.3.1 Multiple Linear Regression

## 
## Call:
## lm(formula = price ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + days_since_last_review + reviews_per_month + 
##     availability_365 + total_months + opportunity + Qd + manhattan + 
##     brooklyn + staten.island + home + pvt_room, data = data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -255.20  -54.24  -18.58   19.25 2102.11 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.645e+04  1.659e+03 -15.939  < 2e-16 ***
## latitude               -1.666e+02  1.517e+01 -10.983  < 2e-16 ***
## longitude              -4.503e+02  1.930e+01 -23.337  < 2e-16 ***
## minimum_nights         -1.315e+00  6.068e-02 -21.675  < 2e-16 ***
## number_of_reviews      -4.668e-02  2.302e-02  -2.028   0.0425 *  
## days_since_last_review  4.765e-03  2.023e-03   2.355   0.0185 *  
## reviews_per_month      -3.915e+00  5.930e-01  -6.602 4.10e-11 ***
## availability_365        1.795e-01  4.713e-03  38.091  < 2e-16 ***
## total_months           -5.885e-01  4.257e-02 -13.824  < 2e-16 ***
## opportunity            -2.009e+01  2.765e+00  -7.263 3.83e-13 ***
## Qd                      1.263e-02  2.835e-03   4.455 8.41e-06 ***
## manhattan               3.083e+01  2.604e+00  11.840  < 2e-16 ***
## brooklyn               -2.315e+01  2.367e+00  -9.776  < 2e-16 ***
## staten.island          -1.406e+02  7.997e+00 -17.585  < 2e-16 ***
## home                    1.335e+02  3.739e+00  35.704  < 2e-16 ***
## pvt_room                3.100e+01  3.729e+00   8.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.2 on 48667 degrees of freedom
## Multiple R-squared:  0.2468, Adjusted R-squared:  0.2465 
## F-statistic:  1063 on 15 and 48667 DF,  p-value: < 2.2e-16

Despite a relatively low R-squared value, the coefficients of this model showed some relationships that we expected. For example, the negative coefficient on “total_months” indicates that the longer a property has been listed, the lower the price it commands. However, some other coefficients didn’t make sense. These mixed results led us to continue on this path, tweak the data, and consider further transformations.

 
 

6.3.2 MLR 2 - Predictor Transformation

## 
## Call:
## lm(formula = price ~ latitude + longitude + minimum_nights + 
##     days_since_last_review + reviews_per_month + availability_365 + 
##     total_months + total_days + opportunity + yr_avbl_prop + 
##     Qd + Qs + avgQs + manhattan + brooklyn + staten.island + 
##     home + pvt_room, data = data.train.bc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -279.54  -54.05  -17.72   20.54 2096.32 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -27306.876   1645.462 -16.595  < 2e-16 ***
## latitude                 -161.875     15.064 -10.746  < 2e-16 ***
## longitude                -455.602     19.144 -23.799  < 2e-16 ***
## minimum_nights            149.072      7.503  19.868  < 2e-16 ***
## days_since_last_review    -47.684      6.698  -7.119 1.10e-12 ***
## reviews_per_month         102.073     11.370   8.978  < 2e-16 ***
## availability_365          -24.609      6.078  -4.049 5.16e-05 ***
## total_months              148.849     17.491   8.510  < 2e-16 ***
## total_days                 84.087     15.100   5.569 2.58e-08 ***
## opportunity               530.680     40.217  13.195  < 2e-16 ***
## yr_avbl_prop             -485.110     39.021 -12.432  < 2e-16 ***
## Qd                       -559.046     83.097  -6.728 1.74e-11 ***
## Qs                        -25.389      6.699  -3.790 0.000151 ***
## avgQs                     395.483     85.486   4.626 3.73e-06 ***
## manhattan                  31.821      2.586  12.304  < 2e-16 ***
## brooklyn                  -20.967      2.353  -8.911  < 2e-16 ***
## staten.island            -139.256      7.943 -17.532  < 2e-16 ***
## home                      138.604      3.720  37.258  < 2e-16 ***
## pvt_room                   33.687      3.704   9.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122.3 on 48664 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2571 
## F-statistic: 936.7 on 18 and 48664 DF,  p-value: < 2.2e-16

The coefficients for this model also showed mixed results. One that matched our expectations was “days_since_last_review,” which had a negative coefficient: the fewer days since the property’s last review, the higher the price.

 
 

6.3.3 MLR 3 - Response Transformation

## 
## Call:
## lm(formula = price^(bcVal) ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + reviews_per_month + availability_365 + 
##     total_months + opportunity + Qd + manhattan + queens + staten.island + 
##     home + pvt_room, data = data.train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.248036 -0.019658  0.002199  0.022483  0.275866 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.624e+01  4.976e-01  32.645  < 2e-16 ***
## latitude           3.681e-02  4.144e-03   8.882  < 2e-16 ***
## longitude          2.331e-01  5.565e-03  41.896  < 2e-16 ***
## minimum_nights     6.481e-04  1.735e-05  37.346  < 2e-16 ***
## number_of_reviews  1.044e-05  6.107e-06   1.709 0.087484 .  
## reviews_per_month  1.145e-03  1.702e-04   6.726 1.77e-11 ***
## availability_365  -6.193e-05  1.302e-06 -47.571  < 2e-16 ***
## total_months       1.264e-04  1.111e-05  11.378  < 2e-16 ***
## opportunity        5.486e-03  7.930e-04   6.919 4.61e-12 ***
## Qd                -2.990e-06  8.057e-07  -3.711 0.000207 ***
## manhattan         -2.268e-02  5.135e-04 -44.165  < 2e-16 ***
## queens            -8.674e-03  6.541e-04 -13.261  < 2e-16 ***
## staten.island      6.144e-02  2.037e-03  30.167  < 2e-16 ***
## home              -9.445e-02  1.073e-03 -88.014  < 2e-16 ***
## pvt_room          -3.541e-02  1.070e-03 -33.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03535 on 48668 degrees of freedom
## Multiple R-squared:  0.5363, Adjusted R-squared:  0.5361 
## F-statistic:  4020 on 14 and 48668 DF,  p-value: < 2.2e-16

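The coefficients for this model made considerably more sense than those of the previous two, which suggests that this was a comparatively more robust model. For instance, the coefficient on number_of_reviews was positive (though significant only at the 10% level), which we read as more reviews commanding a higher price. This reading assumes that the transformed response, price^(bcVal), moves in the same direction as price; if bcVal is negative, the direction of every effect is reversed, and the negative coefficients on manhattan and home here (both positive in the untransformed models) suggest that this may be the case.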
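The sketch below shows one way an exponent such as bcVal could be estimated and used, with `boxcox()` from the MASS package on simulated data. It is an illustration under stated assumptions, not the appendix code: the data are hypothetical and built so that the true exponent is -0.3, and the response is transformed as price^bcVal to mirror the Call above rather than the textbook (price^lambda - 1)/lambda form.

```r
library(MASS)  # provides boxcox(); MASS is one of R's recommended packages

# Simulated prices (hypothetical) built so that price^(-0.3) is linear in `home`
set.seed(7)
n    <- 2000
home <- rbinom(n, 1, 0.5)
z    <- 0.25 - 0.05 * home + rnorm(n, sd = 0.03)
toy  <- data.frame(home = home, price = z^(-1 / 0.3))

# Profile the Box-Cox log-likelihood over a grid of exponents and keep the best
fit   <- lm(price ~ home, data = toy, y = TRUE)  # y = TRUE stores the response
bc    <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
bcVal <- bc$x[which.max(bc$y)]

# Refit with the power-transformed response
toy$price_bc <- toy$price^bcVal
fit_bc <- lm(price_bc ~ home, data = toy)

# Back-transform predictions to dollars by inverting the power
new_listings <- data.frame(home = c(0, 1))
pred_price   <- predict(fit_bc, newdata = new_listings)^(1 / bcVal)
```

Because the estimated exponent is negative in this example, the coefficient on `home` in `fit_bc` is negative even though entire homes have higher prices. The back-transformed predictions restore the expected ordering.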

 
 

6.3.4 MLR 4 - Both Transformed

## 
## Call:
## lm(formula = price^(bcVal) ~ latitude + longitude + minimum_nights + 
##     days_since_last_review + reviews_per_month + calculated_host_listings_count + 
##     availability_365 + total_months + total_days + opportunity + 
##     yr_avbl_prop + Qd + avgQs + manhattan + queens + staten.island + 
##     home + pvt_room, data = data.train.bc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.238984 -0.019621  0.002126  0.022105  0.276349 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    16.9276476  0.4906546  34.500  < 2e-16 ***
## latitude                        0.0363964  0.0040940   8.890  < 2e-16 ***
## longitude                       0.2394870  0.0054874  43.643  < 2e-16 ***
## minimum_nights                 -0.0685785  0.0021823 -31.425  < 2e-16 ***
## days_since_last_review          0.0092901  0.0018635   4.985 6.20e-07 ***
## reviews_per_month              -0.0331827  0.0032315 -10.269  < 2e-16 ***
## calculated_host_listings_count -0.0040737  0.0013526  -3.012   0.0026 ** 
## availability_365                0.0194957  0.0012308  15.840  < 2e-16 ***
## total_months                   -0.0522117  0.0049005 -10.654  < 2e-16 ***
## total_days                     -0.0356181  0.0040770  -8.736  < 2e-16 ***
## opportunity                    -0.2499638  0.0115379 -21.665  < 2e-16 ***
## yr_avbl_prop                    0.1321449  0.0102729  12.863  < 2e-16 ***
## Qd                              0.1655164  0.0187800   8.813  < 2e-16 ***
## avgQs                          -0.0839344  0.0188435  -4.454 8.44e-06 ***
## manhattan                      -0.0223943  0.0005087 -44.020  < 2e-16 ***
## queens                         -0.0083596  0.0006464 -12.933  < 2e-16 ***
## staten.island                   0.0628206  0.0020103  31.249  < 2e-16 ***
## home                           -0.0960416  0.0010647 -90.206  < 2e-16 ***
## pvt_room                       -0.0363709  0.0010577 -34.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03489 on 48664 degrees of freedom
## Multiple R-squared:  0.5484, Adjusted R-squared:  0.5482 
## F-statistic:  3283 on 18 and 48664 DF,  p-value: < 2.2e-16

The coefficients for this model suggested that the higher the opportunity to rent, the lower the price, and that the higher the number of minimum nights, the lower the price. However, the relationship between reviews per month and price was negative, contrary to what one would intuitively expect.

 
 

6.3.5 MLR 5 - Log Transformation on Predictors

## 
## Call:
## lm(formula = price ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + days_since_last_review + reviews_per_month + 
##     calculated_host_listings_count + total_months + total_days + 
##     opportunity + yr_avbl_prop + Qd + Qs + avgQs + manhattan + 
##     brooklyn + staten.island + home + pvt_room, data = data.train.log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.74  -54.06  -18.20   20.26 2091.99 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.691e+04  1.650e+03 -16.311  < 2e-16 ***
## latitude                       -1.627e+02  1.514e+01 -10.751  < 2e-16 ***
## longitude                      -4.544e+02  1.923e+01 -23.635  < 2e-16 ***
## minimum_nights                 -1.030e+00  8.359e-02 -12.320  < 2e-16 ***
## number_of_reviews               5.378e+00  1.939e+00   2.774  0.00554 ** 
## days_since_last_review          2.520e-02  3.017e-03   8.352  < 2e-16 ***
## reviews_per_month              -2.700e+01  3.660e+00  -7.378 1.64e-13 ***
## calculated_host_listings_count -5.913e-02  1.847e-02  -3.202  0.00137 ** 
## total_months                   -1.752e+01  4.103e+00  -4.270 1.96e-05 ***
## total_days                     -8.205e+00  1.648e+00  -4.980 6.38e-07 ***
## opportunity                    -2.226e+01  8.737e+00  -2.548  0.01083 *  
## yr_avbl_prop                    1.053e+02  4.883e+00  21.567  < 2e-16 ***
## Qd                              7.194e+01  6.932e+00  10.378  < 2e-16 ***
## Qs                              7.901e+00  6.925e-01  11.409  < 2e-16 ***
## avgQs                          -6.816e+01  7.418e+00  -9.188  < 2e-16 ***
## manhattan                       3.161e+01  2.592e+00  12.196  < 2e-16 ***
## brooklyn                       -2.168e+01  2.364e+00  -9.173  < 2e-16 ***
## staten.island                  -1.398e+02  7.976e+00 -17.529  < 2e-16 ***
## home                            1.367e+02  3.729e+00  36.668  < 2e-16 ***
## pvt_room                        3.255e+01  3.713e+00   8.765  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122.5 on 48663 degrees of freedom
## Multiple R-squared:  0.2544, Adjusted R-squared:  0.2541 
## F-statistic: 873.7 on 19 and 48663 DF,  p-value: < 2.2e-16

The coefficients for this model were similarly mixed. The number of reviews per month was again negatively related to price. However, when it came to location, the model pointed to a higher price if the property was located in Manhattan.

 
 

6.3.6 MLR 6 - Log Transformation on Response

## 
## Call:
## lm(formula = log(price) ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + reviews_per_month + total_months + opportunity + 
##     yr_avbl_prop + Qd + manhattan + brooklyn + staten.island + 
##     home + pvt_room, data = data.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0603 -0.3032 -0.0480  0.2414  3.7295 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.851e+02  6.392e+00 -28.953  < 2e-16 ***
## latitude          -7.297e-01  5.844e-02 -12.485  < 2e-16 ***
## longitude         -2.957e+00  7.434e-02 -39.784  < 2e-16 ***
## minimum_nights    -8.403e-03  2.329e-04 -36.081  < 2e-16 ***
## number_of_reviews -1.842e-04  8.197e-05  -2.247   0.0247 *  
## reviews_per_month -1.592e-02  2.285e-03  -6.969 3.23e-12 ***
## total_months      -1.902e-03  1.491e-04 -12.749  < 2e-16 ***
## opportunity       -8.061e-02  1.064e-02  -7.573 3.72e-14 ***
## yr_avbl_prop       3.079e-01  6.391e-03  48.185  < 2e-16 ***
## Qd                 4.616e-05  1.081e-05   4.269 1.97e-05 ***
## manhattan          2.106e-01  1.003e-02  20.993  < 2e-16 ***
## brooklyn          -1.084e-01  9.120e-03 -11.883  < 2e-16 ***
## staten.island     -9.048e-01  3.080e-02 -29.372  < 2e-16 ***
## home               1.171e+00  1.440e-02  81.322  < 2e-16 ***
## pvt_room           4.088e-01  1.437e-02  28.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4745 on 48668 degrees of freedom
## Multiple R-squared:  0.5183, Adjusted R-squared:  0.5181 
## F-statistic:  3740 on 14 and 48668 DF,  p-value: < 2.2e-16

The coefficients for this model made sense for the most part. In terms of location, this model suggested that a property in Staten Island would command a lower price. However, the number of reviews was negatively related to price.
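Because the response in this model is log(price), each coefficient is easier to read after exponentiating it. The short R sketch below does this for four indicator coefficients copied from the summary above. The variable names `b`, `multiplier` and `pct_change` are our own.

```r
# Coefficients copied from the 6.3.6 summary (response is log(price))
b <- c(manhattan = 0.2106, staten.island = -0.9048,
       home = 1.171, pvt_room = 0.4088)

# Multiplicative effect on price of switching each indicator from 0 to 1
multiplier <- exp(b)

# The same effect expressed as a percentage change in price
pct_change <- 100 * (multiplier - 1)
round(pct_change, 1)
```

Read this way, the home coefficient corresponds to a price roughly 3.2 times that of the omitted room type, and the staten.island coefficient to a price roughly 60% lower than in the omitted boroughs, holding the other variables fixed.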

 
 

6.3.7 MLR 7 - Log Transformation on Both Predictors and Response

## 
## Call:
## lm(formula = log(price) ~ latitude + longitude + minimum_nights + 
##     number_of_reviews + days_since_last_review + reviews_per_month + 
##     availability_365 + total_months + total_days + opportunity + 
##     yr_avbl_prop + Qd + Qs + avgQs + manhattan + queens + staten.island + 
##     home + pvt_room, data = data.train.log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1547 -0.3008 -0.0461  0.2398  3.6059 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.011e+02  6.614e+00 -30.412  < 2e-16 ***
## latitude               -5.060e-01  5.513e-02  -9.179  < 2e-16 ***
## longitude              -3.050e+00  7.397e-02 -41.234  < 2e-16 ***
## minimum_nights         -7.283e-03  3.162e-04 -23.034  < 2e-16 ***
## number_of_reviews       2.230e-02  7.440e-03   2.997 0.002727 ** 
## days_since_last_review  9.216e-05  1.161e-05   7.940 2.07e-15 ***
## reviews_per_month      -1.139e-01  1.405e-02  -8.108 5.26e-16 ***
## availability_365        1.930e-02  3.312e-03   5.828 5.64e-09 ***
## total_months           -5.910e-02  1.586e-02  -3.726 0.000194 ***
## total_days             -4.595e-02  6.531e-03  -7.036 2.01e-12 ***
## opportunity            -1.124e-01  3.356e-02  -3.348 0.000814 ***
## yr_avbl_prop            3.497e-01  3.422e-02  10.219  < 2e-16 ***
## Qd                      2.793e-01  3.432e-02   8.138 4.12e-16 ***
## Qs                      2.591e-02  3.965e-03   6.534 6.45e-11 ***
## avgQs                  -2.494e-01  3.754e-02  -6.643 3.10e-11 ***
## manhattan               2.952e-01  6.839e-03  43.165  < 2e-16 ***
## queens                  1.085e-01  8.709e-03  12.453  < 2e-16 ***
## staten.island          -7.977e-01  2.710e-02 -29.441  < 2e-16 ***
## home                    1.183e+00  1.430e-02  82.710  < 2e-16 ***
## pvt_room                4.130e-01  1.424e-02  28.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4703 on 48663 degrees of freedom
## Multiple R-squared:  0.5268, Adjusted R-squared:  0.5266 
## F-statistic:  2851 on 19 and 48663 DF,  p-value: < 2.2e-16

One of the coefficients that did not make much sense in this model was the number of days since the last review. This coefficient was positive, suggesting that the longer a property went without a review, the higher the price. Although the effect was small in magnitude, it was statistically significant and went against our intuition.

 
 
 
 
 

6.4 Model Selection

 

Comparing the summaries of the various models, it is apparent that the choice comes down to the four models with a transformed response:

 

6.3.3—MLR with BoxCox transformation on the Response variable,
R^2= 0.5363, F-stat= 4020

6.3.4—MLR with BoxCox transformation on both the Predictor and Response variables,
R^2= 0.5484, F-stat = 3283

6.3.6—MLR with log transformation on the Response variable,
R^2= 0.5183, F-stat= 3740

6.3.7—MLR with log transformation on both the Predictor and Response variables,
R^2= 0.5268, F-stat= 2851

 

In comparing the two Box-Cox transformations, the models are close, but the better model is Model 6.3.3, as determined by its higher F-statistic. The R^2 values of the two models are nearly equal (0.5363 versus 0.5484), and that slight difference does not outweigh the larger difference in the F-statistic.

In comparing the two log transformations, the better model, determined with the same metrics as above, is Model 6.3.6.

With the final choice to be made between these two models, Model 6.3.3 is clearly better than Model 6.3.6 in both metrics. The selected model is Model 6.3.3, Multiple Linear Regression (Response Transformation).
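The two metrics used above are linked: for a linear model with k predictors, the overall F-statistic can be recovered from R^2 and the residual degrees of freedom. The R sketch below checks this against the values reported in the four summaries. The helper name `f_from_r2` is our own.

```r
# F = (R^2 / k) / ((1 - R^2) / df_resid), with k predictors and df_resid
# residual degrees of freedom
f_from_r2 <- function(r2, k, df_resid) (r2 / k) / ((1 - r2) / df_resid)

# R^2, number of predictors, residual df and reported F from each summary
models <- data.frame(
  model    = c("6.3.3", "6.3.4", "6.3.6", "6.3.7"),
  r2       = c(0.5363, 0.5484, 0.5183, 0.5268),
  k        = c(14, 18, 14, 19),
  df_resid = c(48668, 48664, 48668, 48663),
  f_report = c(4020, 3283, 3740, 2851)
)
models$f_check <- with(models, f_from_r2(r2, k, df_resid))
models
```

The recomputed values agree with the reported ones up to rounding. They also show why Model 6.3.3 has a higher F-statistic than Model 6.3.4 despite its slightly lower R^2: it reaches nearly the same fit with 14 predictors rather than 18.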

 
 
 
 

6.5 Model Evaluation

id          final_pred   price
2.54e+03    68           149
3.65e+03    76           150
5.8e+03     71           89
6.85e+03    132          140
8.5e+03     69           60
9.36e+03    194          150
9.66e+03    198          180
9.78e+03    79           50
1.22e+04    90           68
1.23e+04    147          120
1.29e+04    68           130
1.3e+04     128          115
1.43e+04    125          228
1.65e+04    148          225
1.66e+04    139          275
1.68e+04    90           99
1.71e+04    89           51
1.77e+04    76           65
1.86e+04    67           95
1.87e+04    94           150
1.92e+04    202          285
1.93e+04    140          130
2.08e+04    66           98
2.09e+04    134          100
2.16e+04    77           89
2.29e+04    122          125
2.29e+04    78           60
2.6e+04     150          200
2.68e+04    151          120
2.75e+04    71           99
2.83e+04    66           75
3.16e+04    210          115
3.86e+04    174          219
3.87e+04    145          475
4.15e+04    167          80
4.4e+04     57           50
4.42e+04    67           110
4.45e+04    168          165
4.59e+04    134          200
4.67e+04    63           90
4.74e+04    129          175
4.79e+04    203          275
4.87e+04    202          299
5.14e+04    80           130
5.32e+04    60           80
5.35e+04    117          98
5.55e+04    126          140
5.6e+04     63           69
5.85e+04    80           120
5.91e+04    125          140
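One way to summarize how close the predictions are to the actual prices is with the mean absolute error (MAE) and root mean squared error (RMSE). The R sketch below computes both for the first five rows of the table above (predictions 68, 76, 71, 132, 69 against prices 149, 150, 89, 140, 60). It is an illustration on a small subset only, not an evaluation of the full model.

```r
# First five rows of the table above: model prediction and actual listed price
final_pred <- c(68, 76, 71, 132, 69)
price      <- c(149, 150, 89, 140, 60)

err  <- final_pred - price
mae  <- mean(abs(err))     # mean absolute error, in dollars
rmse <- sqrt(mean(err^2))  # root mean squared error, penalises large misses more
```

On these five rows the MAE is 38 dollars and the RMSE is about 50 dollars. The gap between the two comes from the two large misses on the first two listings.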

 
 
 
 

6.6 Interpretation and Discussion

The predictions made by our model are sufficient for rough estimation. The model should not be fully relied upon; however, it does a good job of estimating close to the actual value a majority of the time. Given the nature of Airbnb prices in NYC and the data we had to begin with, we are pleased with the model that has been created. Should it be the only tool used by those we had in mind when creating it? No. But it is one tool among many that can provide valuable insight to whoever may be looking.

 
 
 
 

7. Discussion and Conclusion

Throughout this project, the group worked cohesively to analyze the data and understand the problem, and all group members contributed valuable insights and approaches. Working as a team, the group decided how to approach the problem, first breaking the data down by borough and by room type. The model we ended up selecting used a Box-Cox power transformation of price as the target variable, which is closely related to the log transformation that Angel uses frequently in his professional analysis. The findings of our model show that we can, with a fair amount of accuracy, predict the price of an Airbnb in NYC based on what the group considered to be a limited amount of data.

The group unanimously agreed that many outside factors and unaccounted-for variables could have affected the price of any given Airbnb. Factors such as the season, the host’s ratings, the view from the property, and proximity to tourist attractions and restaurants were all discussed as things that could have a positive or negative impact on the price, but they were not included in, or could not be accounted for from, the data set that we used.

The underlying pillars of the model that the group built, however, could be applied to a variety of similar pricing problems. As we saw in the literature review, the same approach could just as easily be used to price a piece of land, a house, or a building. It could also be applied to other large items such as cars, planes, trucks, and other durable goods that are available for purchase. Asset pricing models are incredibly common in the business world, and our project builds on that idea by bringing in real estate and giving a smaller investor a way to predict the potential return from an Airbnb.

 
 
 
 
 
 

8. References

Mak, S., Choy, L., & Ho, W. (2010). Quantile Regression Estimates of Hong Kong Real Estate Prices. Urban Studies, 47(11), 2461–2472. http://www.jstor.org/stable/43080240

Taylor, M., Schurle, B., Rundel, B., & Wilson, B. (2015). Determining Land Values Using Ordinary Least Squares Regression. Journal of ASFMRA, 75–86. http://www.jstor.org/stable/jasfmra.2015.75