Estimating AirBnB Prices

Background Information on the Dataset

AirBnB is an online marketplace that allows members to offer or arrange lodging (primarily homestays) or tourism experiences. There are millions of listings in cities across the world, such as London, Paris, and New York. In this problem, we would like to understand the factors that influence the price of a listing.

To answer this question, we examine listing data released by AirBnB (downloaded in September 2018 from http://insideairbnb.com/get-the-data.html). We focus specifically on apartments listed in six representative neighborhoods of Boston, MA. Our data has a total of 12 columns and 1693 observations, split across a training set (1187 observations) and a test set (506 observations). Each observation corresponds to a different listing.

Training data: airbnb-train.csv

Test data: airbnb-test.csv

Here is a detailed description of the variables:

  • id: A number that uniquely identifies the listing.

  • host_is_superhost: Whether a host is a “superhost,” meaning they satisfy AirBnB’s criteria for high-quality listings, high response rate, and reliability.

  • host_identity_verified: Whether the host has verified their identity with AirBnB, which is intended to promote trust between hosts and guests.

  • neighborhood: The neighborhood that the listing is located in (Allston, Back Bay, Beacon Hill, Brighton, Downtown, or South End).

  • room_type: The type of room provided in the listing (Entire home/apt, Private room, or Shared room).

  • accommodates: The number of people that the listing can accommodate.

  • bathrooms: The number of bathrooms in the listing.

  • bedrooms: The number of bedrooms in the listing.

  • beds: The number of beds in the listing.

  • price: The price to stay in the listing for one night.

  • logprice: The natural logarithm of the price variable.

  • logacc: The natural logarithm of the accommodates variable.
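As a minimal sketch of how the last two columns relate to the raw ones, the snippet below builds a small made-up data frame (the name toy and its values are hypothetical, not taken from the AirBnB files) and derives the log columns with R's log(), which is the natural logarithm by default.

```r
# Hypothetical toy listings; the values are made up for illustration
toy <- data.frame(price = c(100, 250, 999), accommodates = c(1, 2, 6))

# log() with no base argument is the natural logarithm
toy$logprice <- log(toy$price)
toy$logacc <- log(toy$accommodates)

# exp() undoes the transformation and recovers the original prices
recovered <- exp(toy$logprice)
toy
```

A listing that accommodates one person has logacc equal to 0, so in any model that uses logacc the intercept describes a one-person listing.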

Exploratory Data Analysis

Load airbnb-train.csv into a data frame called train.

# Read in the training dataset
train = read.csv("airbnb-train.csv")

How many rows are in the training dataset?

# Calculate the number of rows in the training dataset
nrow(train)
## [1] 1187

1187 rows.

What is the mean price in the training dataset?

# Find the mean price in the training set
mean(train$price)
## [1] 212.0868

212.0868 is the mean price in the training dataset.

What is the maximum price in the training dataset?

# Find the max price in the training set
max(train$price)
## [1] 999

999 is the max price in the training dataset.

What is the neighborhood with the highest number of listings in the training dataset?

# Tabulate the number of listings for each neighborhood
library(knitr)  # provides kable()
z = table(train$neighborhood)
kable(z)

Var1          Freq
Allston        176
BackBay        279
BeaconHill     155
Brighton       135
Downtown       208
SouthEnd       234

Back Bay has the highest number of listings in the training dataset.

What is the neighborhood with the highest average price in the training dataset?

# Compute the mean price for each neighborhood in the training dataset
tapply(train$price, train$neighborhood, mean)
##    Allston    BackBay BeaconHill   Brighton   Downtown   SouthEnd 
##   142.7330   248.5699   187.2903   113.8444   289.4663   225.0726

Downtown has the highest average price in the training dataset.
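Both neighborhood answers can also be extracted programmatically rather than read off the printed tables. The sketch below re-enters the counts and mean prices reported above as named vectors (the vector names are illustrative) and uses which.max() to pick the largest entry.

```r
# Listing counts and mean prices per neighborhood, copied from the output above
counts <- c(Allston = 176, BackBay = 279, BeaconHill = 155,
            Brighton = 135, Downtown = 208, SouthEnd = 234)
mean_price <- c(Allston = 142.7330, BackBay = 248.5699, BeaconHill = 187.2903,
                Brighton = 113.8444, Downtown = 289.4663, SouthEnd = 225.0726)

# which.max() returns the position of the maximum; names() gives its label
most_listings <- names(which.max(counts))
highest_price <- names(which.max(mean_price))
most_listings
highest_price

# Sanity check: the counts should add up to the number of training rows
sum(counts)
```

On the real data the same idea would apply directly to the objects returned by table() and tapply().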

Simple Linear Regression

For the rest of this problem, we work with log(price) and log(accommodates), which reduces the influence of outliers with very large prices or capacities. These values are stored in the columns logprice and logacc, respectively.

Load airbnb-test.csv into a data frame called test.

# Load testing dataset
test = read.csv("airbnb-test.csv")

What is our “baseline” linear regression model?

Our baseline model predicts the mean log(price) of the training set for every listing.
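One way to see why the mean counts as a "linear regression" baseline: a regression with only an intercept and no predictors estimates that intercept as the sample mean. The sketch below checks this on a few made-up response values (y is hypothetical, not the AirBnB data).

```r
# Hypothetical response values (not from the AirBnB data)
y <- c(4.6, 5.0, 5.3, 5.9)

# An intercept-only linear model: no predictors on the right-hand side
intercept_only <- lm(y ~ 1)

# The fitted intercept coincides with the sample mean
b0 <- unname(coef(intercept_only))
b0
mean(y)
```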

What is the value of log(price) that our baseline model predicts?

# Baseline model prediction
baseline = mean(train$logprice)
baseline
## [1] 5.158113

5.158113 is the value our baseline model predicts.
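Because the baseline lives on the log scale, it can help to translate it back to dollars. A small sketch using only the numbers reported above: exponentiating the mean of logprice gives the geometric mean of the training prices, assuming logprice is exactly log(price).

```r
# Mean of logprice in the training set, as reported above
baseline_log <- 5.158113

# Back-transform to dollars; this is the geometric mean of price,
# assuming logprice = log(price) exactly
baseline_dollars <- exp(baseline_log)
baseline_dollars

# Arithmetic mean price reported earlier in the exploratory analysis
mean_price <- 212.0868
baseline_dollars < mean_price
```

The back-transformed baseline (roughly 174 dollars) sits below the arithmetic mean of 212.09 dollars, which is what one would expect when a few very expensive listings pull the arithmetic mean upward.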

What is the correlation between log(price) and log(accommodates) in the training set?

# Compute the correlation between log(price) and log(accommodates)
cor(train$logacc, train$logprice)
## [1] 0.5366265

0.5366265 is the correlation between log(price) and log(accommodates).
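In a regression with a single predictor, the squared correlation equals the R² on the data used for fitting. The sketch below illustrates this on simulated data (x, y, and the seed are arbitrary choices, not the AirBnB data) and then squares the correlation reported above, which comes to about 0.288.

```r
# Simulated data for illustration only
set.seed(42)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)

# Squared correlation versus R-squared of the one-predictor fit
r <- cor(x, y)
fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared
c(r_squared_from_cor = r^2, r_squared_from_lm = r2)

# Squared correlation between logacc and logprice reported above
0.5366265^2
```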

Create a linear model that predicts log(price) using log(accommodates). What is the coefficient of log(accommodates)?

# Linear regression model
lreg1 = lm(logprice ~ logacc, data = train)
# Summary of linear regression model
summary(lreg1)
## 
## Call:
## lm(formula = logprice ~ logacc, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3773 -0.3531  0.0431  0.3902  2.1317 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.41943    0.03738  118.22   <2e-16 ***
## logacc       0.70348    0.03213   21.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5544 on 1185 degrees of freedom
## Multiple R-squared:  0.288,  Adjusted R-squared:  0.2874 
## F-statistic: 479.3 on 1 and 1185 DF,  p-value: < 2.2e-16

0.70348 is the coefficient of log(accommodates).
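Since both sides of this model are in logs, the coefficient can be read as an elasticity. The sketch below uses only the two estimates printed in the summary above; the variable names are illustrative, and the dollar prediction ignores any retransformation bias from exponentiating a log-scale prediction.

```r
# Estimates copied from the summary of lreg1 above
intercept <- 4.41943
b_logacc <- 0.70348

# Multiplicative change in predicted price when accommodates doubles
ratio_doubling <- 2^b_logacc
ratio_doubling

# Predicted price (in dollars) for a listing that accommodates 2 people
pred_price_2 <- exp(intercept + b_logacc * log(2))
pred_price_2
```

Under this model, doubling the number of guests a listing accommodates is associated with a predicted price about 1.63 times as high.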

What is the R² on the test set?

# Make predictions using the linear regression model on the test set
predTest = predict(lreg1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 148.3648
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.5414897
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.2956862

R² = 0.2956862.
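The same SSE, RMSE, and R² calculation recurs for every model in this document, so it could be wrapped in a small helper. The function below is a sketch (oos_metrics is a hypothetical name, and the toy vectors are made up); the last two lines simply redo the arithmetic with the SSE values and test-set size reported above.

```r
# Hypothetical helper: out-of-sample SSE, RMSE, and R-squared
# relative to a constant baseline prediction
oos_metrics <- function(pred, actual, baseline) {
  sse <- sum((pred - actual)^2)
  sse_base <- sum((baseline - actual)^2)
  c(SSE = sse,
    RMSE = sqrt(sse / length(actual)),
    R2 = 1 - sse / sse_base)
}

# Toy example with made-up numbers
actual <- c(5.0, 5.5, 4.8, 6.0)
pred <- c(5.1, 5.4, 5.0, 5.7)
m <- oos_metrics(pred, actual, baseline = 5.2)
m

# Arithmetic check against the values reported above (506 test rows)
r2_reported <- 1 - 148.3648 / 210.6516
rmse_reported <- sqrt(148.3648 / 506)
```

With the objects defined in this document, a call might look like oos_metrics(predict(lreg1, newdata = test), test$logprice, mean(train$logprice)).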

Adding More Variables

It is good practice to check for multicollinearity before fitting larger models.

Examine the correlation between the following variables:

# Compute pairwise correlations among selected predictors
cor(train$host_is_superhost, train$host_identity_verified)
## [1] 0.1735286
cor(train$host_is_superhost, train$bedrooms)
## [1] -0.01513584
cor(train$bedrooms, train$logacc)
## [1] 0.6515139
cor(train$beds, train$logacc)
## [1] 0.7781562
cor(train$bathrooms, train$logacc)
## [1] 0.3858965
cor(train$bedrooms, train$bathrooms)
## [1] 0.5154763
cor(train$bedrooms, train$beds)
## [1] 0.7245359
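The pairwise checks above can also be done in one call, since cor() accepts a data frame of numeric columns and returns the full correlation matrix. The sketch below uses simulated columns (the data, the seed, and the 0.7 cutoff are all assumptions for illustration) and then lists the pairs whose absolute correlation exceeds the cutoff.

```r
# Simulated predictors for illustration; not the AirBnB data
set.seed(1)
n <- 200
bedrooms <- sample(0:4, n, replace = TRUE)
beds <- bedrooms + sample(0:1, n, replace = TRUE)  # built to track bedrooms
bathrooms <- 1 + rbinom(n, 1, 0.3)
toy <- data.frame(bedrooms, beds, bathrooms)

# Full correlation matrix in one call
cm <- round(cor(toy), 2)
cm

# Pairs above a hypothetical 0.7 cutoff (upper triangle only, no duplicates)
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(cm)[high[, "row"]],
                      var2 = colnames(cm)[high[, "col"]])
flagged
```

On the real data, the equivalent call would be cor() applied to the numeric predictor columns of train.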

Create a linear model that predicts log(price) using the following variables:

log(accommodates), host_identity_verified, host_is_superhost, bedrooms, bathrooms, room_type, and neighborhood.

We exclude beds because of multicollinearity concerns: in the training set it is strongly correlated with both log(accommodates) (0.78) and bedrooms (0.72).

# Create linear regression model
lreg2 = lm(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train)
# Summary of linear regression model
summary(lreg2)
## 
## Call:
## lm(formula = logprice ~ logacc + host_identity_verified + host_is_superhost + 
##     bedrooms + bathrooms + room_type + neighborhood, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85657 -0.24703  0.01275  0.26600  2.10264 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.45230    0.06094  73.057  < 2e-16 ***
## logacc                  0.04371    0.03963   1.103 0.270240    
## host_identity_verified -0.09107    0.02632  -3.460 0.000559 ***
## host_is_superhost      -0.09784    0.03568  -2.742 0.006190 ** 
## bedrooms                0.21591    0.02205   9.791  < 2e-16 ***
## bathrooms               0.19670    0.03696   5.323 1.22e-07 ***
## room_typePrivateroom   -0.64155    0.04028 -15.926  < 2e-16 ***
## room_typeSharedroom    -0.82158    0.15035  -5.464 5.67e-08 ***
## neighborhoodBackBay     0.52385    0.04311  12.151  < 2e-16 ***
## neighborhoodBeaconHill  0.31858    0.04886   6.520 1.04e-10 ***
## neighborhoodBrighton   -0.04643    0.04851  -0.957 0.338716    
## neighborhoodDowntown    0.59520    0.04610  12.912  < 2e-16 ***
## neighborhoodSouthEnd    0.44098    0.04444   9.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4173 on 1174 degrees of freedom
## Multiple R-squared:  0.6003, Adjusted R-squared:  0.5962 
## F-statistic: 146.9 on 12 and 1174 DF,  p-value: < 2.2e-16

What is the value of the intercept?

Intercept = 4.45230

What is the R² on the test set?

# Make predictions using the linear regression model on the test set
predTest = predict(lreg2, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 82.72458
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.4043356
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6072919

R² = 0.6072919.

Interpreting Linear Regression

Which of the following variables are significant at a level of 0.001 (p-value below 0.001)?

# Summary of linear regression model
summary(lreg2)
## 
## Call:
## lm(formula = logprice ~ logacc + host_identity_verified + host_is_superhost + 
##     bedrooms + bathrooms + room_type + neighborhood, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85657 -0.24703  0.01275  0.26600  2.10264 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.45230    0.06094  73.057  < 2e-16 ***
## logacc                  0.04371    0.03963   1.103 0.270240    
## host_identity_verified -0.09107    0.02632  -3.460 0.000559 ***
## host_is_superhost      -0.09784    0.03568  -2.742 0.006190 ** 
## bedrooms                0.21591    0.02205   9.791  < 2e-16 ***
## bathrooms               0.19670    0.03696   5.323 1.22e-07 ***
## room_typePrivateroom   -0.64155    0.04028 -15.926  < 2e-16 ***
## room_typeSharedroom    -0.82158    0.15035  -5.464 5.67e-08 ***
## neighborhoodBackBay     0.52385    0.04311  12.151  < 2e-16 ***
## neighborhoodBeaconHill  0.31858    0.04886   6.520 1.04e-10 ***
## neighborhoodBrighton   -0.04643    0.04851  -0.957 0.338716    
## neighborhoodDowntown    0.59520    0.04610  12.912  < 2e-16 ***
## neighborhoodSouthEnd    0.44098    0.04444   9.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4173 on 1174 degrees of freedom
## Multiple R-squared:  0.6003, Adjusted R-squared:  0.5962 
## F-statistic: 146.9 on 12 and 1174 DF,  p-value: < 2.2e-16

host_identity_verified, bedrooms, and bathrooms are significant at the 0.001 level. host_is_superhost (p = 0.00619) and logacc (p = 0.27024) are not.

How would you interpret the coefficient of “host_is_superhost”?

All else being equal, being a superhost is associated with a 0.09784 decrease in log(price).
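A change in log(price) can be converted to an approximate percentage change in price by exponentiating the coefficient. The sketch below does this for the superhost estimate from the summary above (the variable names are illustrative).

```r
# Coefficient of host_is_superhost from the summary of lreg2
b_superhost <- -0.09784

# Multiplicative effect on price, expressed as a fractional change
pct_change <- exp(b_superhost) - 1

# As a percentage, rounded to one decimal place
round(100 * pct_change, 1)
```

Under this model, superhost status is associated with a price roughly 9% lower, holding the other predictors fixed.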

Which of the three room types will be predicted to have the highest price, all else being equal?

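Entire home/apt. It is the reference level, and the coefficients for Private room (-0.64155) and Shared room (-0.82158) are both negative.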

How would you interpret the coefficient of “neighborhoodBackBay”?

Compared to a listing that is in Allston but is otherwise identical, a Back Bay apartment is predicted to have a log(price) that is higher by 0.52385.
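Allston is the comparison group because R, by default, uses the first factor level (alphabetical order) as the reference and creates dummy variables for the rest. The sketch below shows this on a small factor with the six neighborhood labels, and how relevel() would change the reference; the vector nb is made up for illustration.

```r
# One entry per neighborhood label, in arbitrary order
nb <- factor(c("SouthEnd", "Allston", "BackBay",
               "Downtown", "Brighton", "BeaconHill"))

# The first level is the reference category in lm()
levels(nb)[1]

# Dummy columns generated for a model: no column for the reference level
dummy_names <- colnames(model.matrix(~ nb))
dummy_names

# Choosing a different reference level
nb2 <- relevel(nb, ref = "Downtown")
levels(nb2)[1]

# Back Bay coefficient from lreg2 as a fractional price difference vs. Allston
exp(0.52385) - 1
```

The last line suggests that, under this model, a Back Bay listing is priced roughly 69% above an otherwise identical Allston listing.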

CART and Random Forest

Create a simple CART model using bedrooms to predict log(price), with a cp value of 0.001.

# Implement CART model
library(rpart)
library(rpart.plot)
CARTmodel1 = rpart(logprice ~ bedrooms, data = train, cp =0.001)
prp(CARTmodel1)


# Make predictions
predTest = predict(CARTmodel1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 163.5809
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.5685793
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.2234527

What value of log(price) would this model predict for a two-bedroom listing?

5.6, read from the plotted CART tree.

What is the R² of this model on the test set?

R² = 0.2234527.
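A regression tree predicts the mean training response of the leaf that an observation falls into, which is why the answer above can be read straight off the plot. The sketch below fits rpart to a made-up data set (toy and its values are hypothetical; rpart is assumed to be installed, as it is used elsewhere in this document) and confirms that the prediction equals the leaf mean.

```r
library(rpart)

# Made-up data: 20 one-bedroom and 20 two-bedroom listings
toy <- data.frame(
  bedrooms = rep(c(1, 2), each = 20),
  logprice = rep(c(4.8, 5.4), each = 20) + rep(c(-0.1, 0.1), times = 20)
)

# Same call pattern as the CART model in this document
fit <- rpart(logprice ~ bedrooms, data = toy, cp = 0.001)

# Prediction for a two-bedroom listing versus the mean of that group
pred_two_bed <- unname(predict(fit, newdata = data.frame(bedrooms = 2)))
leaf_mean <- mean(toy$logprice[toy$bedrooms == 2])
c(prediction = pred_two_bed, leaf_mean = leaf_mean)
```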

Create a CART model that predicts log(price) using the following variables: log(accommodates), host_identity_verified, host_is_superhost, bedrooms, bathrooms, room_type, and neighborhood.

Again, use cp = 0.001.

# Implement CART model
library(rpart)
library(rpart.plot)
CARTmodel2 = rpart(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train, cp =0.001)
prp(CARTmodel2)


# Make predictions
predTest = predict(CARTmodel2, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 82.47188
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.4037176
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6084916

What is the R² of this new model on the test set?

R² = 0.6084916.

Create a random forest model that predicts log(price) using the same variables as the CART model, with nodesize = 20 and ntree = 200. Set the random seed to 1.

# Implement random forest model
library(randomForest)
set.seed(1)
RFmodel1 = randomForest(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train, nodesize = 20, ntree = 200)

# Make predictions
predTest = predict(RFmodel1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 74.71968
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.3842751
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6452926

What is the R² of this new model on the test set?

R² = 0.6452926.
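To compare the five models side by side, the test-set R² values reported throughout this document can be collected in one table. The sketch below only re-enters those reported numbers; the data frame name and model labels are illustrative.

```r
# Test-set R-squared values copied from the results above
results <- data.frame(
  model = c("Linear: logacc only", "Linear: all predictors",
            "CART: bedrooms only", "CART: all predictors",
            "Random forest"),
  test_r2 = c(0.2956862, 0.6072919, 0.2234527, 0.6084916, 0.6452926)
)

# Sort from best to worst out-of-sample fit
results <- results[order(-results$test_r2), ]
results

best <- as.character(results$model[1])
best
```

The random forest has the highest test-set R², while the full linear model and the full CART model are nearly tied behind it.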