AirBnB is an online marketplace that allows members to offer or arrange lodging (primarily homestays) or tourism experiences. There are millions of listings in cities across the world, such as London, Paris, and New York. In this problem, we would like to understand the factors that influence the price of a listing.
To derive insights and answer these questions, we take a look at listing data released by AirBnB (downloaded in September 2018 from http://insideairbnb.com/get-the-data.html). We specifically focus on apartments listed in six representative neighborhoods of Boston, MA. Our data has a total of 12 columns and 1693 observations, split across a training set (1187 observations) and a test set (506 observations). Each observation corresponds to a different listing.
Training data: airbnb-train.csv
Test data: airbnb-test.csv
Here is a detailed description of the variables:
id: A number that uniquely identifies the listing.
host_is_superhost: Whether a host is a “superhost,” meaning they satisfy AirBnB’s criteria for high-quality listings, high response rate, and reliability.
host_identity_verified: Whether the host has verified their identity with AirBnB, which is intended to promote trust between hosts and guests.
neighborhood: The neighborhood that the listing is located in (Allston, Back Bay, Beacon Hill, Brighton, Downtown, or South End).
room_type: The type of room provided in the listing (Entire home/apt, Private room, or Shared room).
accommodates: The number of people that the listing can accommodate.
bathrooms: The number of bathrooms in the listing.
bedrooms: The number of bedrooms in the listing.
beds: The number of beds in the listing.
price: The price to stay in the listing for one night.
logprice: The natural logarithm of the price variable.
logacc: The natural logarithm of the accommodates variable.
Load airbnb-train.csv into a data frame called train.
# Read in the training dataset
train = read.csv("airbnb-train.csv")
# Calculate the number of rows in the training dataset
nrow(train)
## [1] 1187
The training dataset has 1187 rows.
# Find the mean price in the training set
mean(train$price)
## [1] 212.0868
212.0868 is the mean price in the training dataset.
# Find the max price in the training set
max(train$price)
## [1] 999
999 is the max price in the training dataset.
# Tabulate the number of listings for each neighborhood
library(knitr)  # provides kable()
z = table(train$neighborhood)
kable(z)
| Var1 | Freq |
|---|---|
| Allston | 176 |
| BackBay | 279 |
| BeaconHill | 155 |
| Brighton | 135 |
| Downtown | 208 |
| SouthEnd | 234 |
Back Bay has the highest number of listings in the training dataset.
# Compute the average price for each neighborhood in the training dataset
tapply(train$price, train$neighborhood, mean)
##    Allston    BackBay BeaconHill   Brighton   Downtown   SouthEnd
##   142.7330   248.5699   187.2903   113.8444   289.4663   225.0726
Downtown has the highest average price in the training dataset.
For the rest of this problem, we will work with log(price) and log(accommodates), which reduces the influence of outliers with unusually large prices or guest capacities. The values of log(price) and log(accommodates) are found in the columns logprice and logacc, respectively.
Load airbnb-test.csv into a data frame called test.
# Load testing dataset
test = read.csv("airbnb-test.csv")
Our baseline model is the mean log(price) of the training set.
# Baseline model prediction
baseline = mean(train$logprice)
baseline
## [1] 5.158113
5.158113 is the value our baseline model predicts.
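Because the baseline is an average on the log scale, exponentiating it gives the geometric mean of the prices rather than their arithmetic mean. The sketch below illustrates this with made-up prices (the vector `price` and the names `geoMean` and `arithMean` are hypothetical and not taken from the dataset).

```r
# Hypothetical prices with one large outlier (not from the AirBnB data)
price = c(80, 120, 150, 200, 999)

# Baseline on the log scale: the mean of log(price)
logBaseline = mean(log(price))

# Back-transforming gives the geometric mean, in dollars
geoMean = exp(logBaseline)

# The arithmetic mean is pulled upward by the outlier
arithMean = mean(price)

c(logBaseline = logBaseline, geoMean = geoMean, arithMean = arithMean)
```

Applied to the value reported above, exp(5.158113) is roughly 174, which is lower than the mean price of 212.0868. This is consistent with the log transform damping the effect of very expensive listings.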
# Compute the correlation between log(price) and log(accommodates)
cor(train$logacc, train$logprice)
## [1] 0.5366265
0.5366265 is the correlation between log(price) and log(accommodates).
# Linear regression model
lreg1 = lm(logprice ~ logacc, data = train)
# Summary of linear regression model
summary(lreg1)
##
## Call:
## lm(formula = logprice ~ logacc, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3773 -0.3531 0.0431 0.3902 2.1317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.41943 0.03738 118.22 <2e-16 ***
## logacc 0.70348 0.03213 21.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5544 on 1185 degrees of freedom
## Multiple R-squared: 0.288, Adjusted R-squared: 0.2874
## F-statistic: 479.3 on 1 and 1185 DF, p-value: < 2.2e-16
0.70348 is the coefficient of log(accommodates).
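Since both sides of this regression are logged, the coefficient can be read as an elasticity. As a rough worked interpretation of the fitted equation (using only the estimates printed above; the symbol a for a listing's capacity is introduced here for illustration):

```latex
\begin{align*}
\log(\text{price}) &= 4.41943 + 0.70348\,\log(\text{accommodates}) \\
\text{price} &= e^{4.41943}\cdot \text{accommodates}^{0.70348} \\
\frac{\text{price}(2a)}{\text{price}(a)} &= 2^{0.70348} \approx 1.63
\end{align*}
```

Under this model, doubling the number of guests a listing accommodates is associated with a predicted price about 63% higher, not twice as high.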
# Make predictions using the linear regression model on the test set
predTest = predict(lreg1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 148.3648
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.5414897
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.2956862
R^2 = 0.2956862 on the test set.
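The same SSE, RMSE, and R^2 steps are repeated for every model in this problem, so they could be wrapped in a small helper. The function below is a sketch only: the name `evalMetrics` and its arguments are hypothetical, and the toy numbers are made up to check the arithmetic.

```r
# Hypothetical helper: test-set metrics for predictions on the log scale.
# pred     = model predictions on the test set
# actual   = observed test-set values (e.g. test$logprice)
# baseline = mean of the training-set response (e.g. mean(train$logprice))
evalMetrics = function(pred, actual, baseline) {
  SSE  = sum((pred - actual)^2)       # model sum of squared errors
  SSEb = sum((baseline - actual)^2)   # baseline sum of squared errors
  c(SSE = SSE,
    RMSE = sqrt(SSE / length(actual)),
    Rsquared = 1 - SSE / SSEb)
}

# Toy check with made-up numbers
actual = c(4.8, 5.0, 5.4, 5.6)
pred   = c(4.9, 5.1, 5.3, 5.5)
m = evalMetrics(pred, actual, baseline = 5.1)
m
```

With the objects from this section, a call such as `evalMetrics(predTest, test$logprice, mean(train$logprice))` would be expected to reproduce the values above. As a consistency check on the reported numbers, sqrt(148.3648 / 506) gives the RMSE of about 0.5415.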
As good practice, it is always helpful to first check for multicollinearity before running larger models.
Examine the correlation between the following variables:
# Compute pairwise correlations between selected variables
cor(train$host_is_superhost, train$host_identity_verified)
## [1] 0.1735286
cor(train$host_is_superhost, train$bedrooms)
## [1] -0.01513584
cor(train$bedrooms, train$logacc)
## [1] 0.6515139
cor(train$beds, train$logacc)
## [1] 0.7781562
cor(train$bathrooms, train$logacc)
## [1] 0.3858965
cor(train$bedrooms, train$bathrooms)
## [1] 0.5154763
cor(train$bedrooms, train$beds)
## [1] 0.7245359
Create a linear model that predicts log(price) using the following variables:
log(accommodates), host_identity_verified, host_is_superhost, bedrooms, bathrooms, room_type, and neighborhood.
We have removed beds because of concerns about multicollinearity.
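The pairwise checks above can also be done in one call by passing several numeric columns to `cor()`, which returns the full correlation matrix. The sketch below uses simulated data so that it runs on its own. The data frame `toy` and the 0.7 cutoff are assumptions for illustration, not values taken from the problem.

```r
# Toy data, simulated for illustration only (not the AirBnB listings)
set.seed(1)
n = 200
bedrooms  = rpois(n, 1.5)
beds      = bedrooms + rpois(n, 0.5)   # beds built to track bedrooms closely
bathrooms = 1 + rbinom(n, 1, 0.3)
toy = data.frame(bedrooms, beds, bathrooms)

# Full pairwise correlation matrix in a single call
corMat = cor(toy)
round(corMat, 2)

# List variable pairs whose absolute correlation exceeds a hypothetical 0.7 cutoff
highPairs = which(abs(corMat) > 0.7 & upper.tri(corMat), arr.ind = TRUE)
highPairs
```

On the real data, something like `cor(train[, c("logacc", "bedrooms", "beds", "bathrooms")])` should give the same pairwise values as the individual calls above. In those results, beds has a correlation above 0.7 with both logacc and bedrooms, which matches the decision to drop it.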
# Create linear regression model
lreg2 = lm(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train)
# Summary of linear regression model
summary(lreg2)
##
## Call:
## lm(formula = logprice ~ logacc + host_identity_verified + host_is_superhost +
## bedrooms + bathrooms + room_type + neighborhood, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.85657 -0.24703 0.01275 0.26600 2.10264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45230 0.06094 73.057 < 2e-16 ***
## logacc 0.04371 0.03963 1.103 0.270240
## host_identity_verified -0.09107 0.02632 -3.460 0.000559 ***
## host_is_superhost -0.09784 0.03568 -2.742 0.006190 **
## bedrooms 0.21591 0.02205 9.791 < 2e-16 ***
## bathrooms 0.19670 0.03696 5.323 1.22e-07 ***
## room_typePrivateroom -0.64155 0.04028 -15.926 < 2e-16 ***
## room_typeSharedroom -0.82158 0.15035 -5.464 5.67e-08 ***
## neighborhoodBackBay 0.52385 0.04311 12.151 < 2e-16 ***
## neighborhoodBeaconHill 0.31858 0.04886 6.520 1.04e-10 ***
## neighborhoodBrighton -0.04643 0.04851 -0.957 0.338716
## neighborhoodDowntown 0.59520 0.04610 12.912 < 2e-16 ***
## neighborhoodSouthEnd 0.44098 0.04444 9.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4173 on 1174 degrees of freedom
## Multiple R-squared: 0.6003, Adjusted R-squared: 0.5962
## F-statistic: 146.9 on 12 and 1174 DF, p-value: < 2.2e-16
Intercept = 4.45230
# Make predictions using the linear regression model on the test set
predTest = predict(lreg2, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 82.72458
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.4043356
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6072919
R^2 = 0.6072919 on the test set.
# Summary of linear regression model
summary(lreg2)
##
## Call:
## lm(formula = logprice ~ logacc + host_identity_verified + host_is_superhost +
## bedrooms + bathrooms + room_type + neighborhood, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.85657 -0.24703 0.01275 0.26600 2.10264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45230 0.06094 73.057 < 2e-16 ***
## logacc 0.04371 0.03963 1.103 0.270240
## host_identity_verified -0.09107 0.02632 -3.460 0.000559 ***
## host_is_superhost -0.09784 0.03568 -2.742 0.006190 **
## bedrooms 0.21591 0.02205 9.791 < 2e-16 ***
## bathrooms 0.19670 0.03696 5.323 1.22e-07 ***
## room_typePrivateroom -0.64155 0.04028 -15.926 < 2e-16 ***
## room_typeSharedroom -0.82158 0.15035 -5.464 5.67e-08 ***
## neighborhoodBackBay 0.52385 0.04311 12.151 < 2e-16 ***
## neighborhoodBeaconHill 0.31858 0.04886 6.520 1.04e-10 ***
## neighborhoodBrighton -0.04643 0.04851 -0.957 0.338716
## neighborhoodDowntown 0.59520 0.04610 12.912 < 2e-16 ***
## neighborhoodSouthEnd 0.44098 0.04444 9.922 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4173 on 1174 degrees of freedom
## Multiple R-squared: 0.6003, Adjusted R-squared: 0.5962
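## F-statistic: 146.9 on 12 and 1174 DF, p-value: < 2.2e-16
host_identity_verified, bedrooms, and bathrooms are significant at the 0.001 level (marked *** above). host_is_superhost is significant only at the 0.01 level, and logacc is not significant.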
All else being equal, being a superhost is associated with a 0.09784 decrease in log(price).
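Entire home/apt is the reference level for room_type. Both other room types have negative coefficients, so it is the most expensive room type, all else being equal.
Compared to a listing that is in Allston but is otherwise identical, a Back Bay apartment will have a higher log(price) by 0.52385.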
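Differences in log(price) can be converted to approximate percentage differences in price by exponentiating the coefficient. A rough worked version for the two coefficients discussed above, using only the estimates from the summary:

```latex
\begin{align*}
\text{superhost: } & e^{-0.09784} \approx 0.907 \quad (\text{about } 9.3\% \text{ lower price}) \\
\text{Back Bay vs. Allston: } & e^{0.52385} \approx 1.689 \quad (\text{about } 69\% \text{ higher price})
\end{align*}
```

These are multiplicative effects on the predicted price with all other variables held fixed. They describe associations in this model rather than causal effects.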
# Implement CART model
library(rpart)
library(rpart.plot)
CARTmodel1 = rpart(logprice ~ bedrooms, data = train, cp = 0.001)
prp(CARTmodel1)
# Make predictions
predTest = predict(CARTmodel1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 163.5809
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.5685793
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.2234527
5.6 from the CART tree.
R^2 = 0.2234527 on the test set.
Again, use cp = 0.001.
# Implement CART model
library(rpart)
library(rpart.plot)
CARTmodel2 = rpart(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train, cp = 0.001)
# Plot the second tree (the original call plotted CARTmodel1 again)
prp(CARTmodel2)
# Make predictions
predTest = predict(CARTmodel2, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 82.47188
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.4037176
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6084916
R^2 = 0.6084916 on the test set.
# Implement random forest model
library(randomForest)
set.seed(1)
RFmodel1 = randomForest(logprice ~ logacc + host_identity_verified + host_is_superhost + bedrooms + bathrooms + room_type + neighborhood, data = train, nodesize = 20, ntree = 200)
# Make predictions
predTest = predict(RFmodel1, newdata = test)
# SSE
SSE = sum((predTest - test$logprice)^2)
SSE
## [1] 74.71968
# RMSE
RMSE = sqrt(mean((predTest - test$logprice)^2))
RMSE
## [1] 0.3842751
# Baseline
baseline = mean(train$logprice)
baseline
## [1] 5.158113
# SSE of baseline model on testing set
SSEb = sum((baseline - test$logprice)^2)
SSEb
## [1] 210.6516
# R^2
Rsquared = 1 - SSE/SSEb
Rsquared
## [1] 0.6452926
R^2 = 0.6452926 on the test set.
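To compare the five models side by side, the test-set metrics reported above can be collected into one data frame. This is only a convenience sketch. The data frame name `results` is hypothetical, and the numbers are copied from the outputs in this problem rather than recomputed.

```r
# Test-set metrics copied from the outputs reported above
results = data.frame(
  model    = c("lreg1", "lreg2", "CARTmodel1", "CARTmodel2", "RFmodel1"),
  RMSE     = c(0.5414897, 0.4043356, 0.5685793, 0.4037176, 0.3842751),
  Rsquared = c(0.2956862, 0.6072919, 0.2234527, 0.6084916, 0.6452926)
)

# Sort from best to worst test-set R^2
results[order(-results$Rsquared), ]
```

On these numbers the random forest has the lowest RMSE and the highest R^2. The multi-variable linear regression and CART models are close to each other, and both single-variable models trail well behind.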