We will be going over Airbnb listings data scraped from Airbnb.com. The data was scraped by insideairbnb.com.
The primary analyses performed are:
This dataset is very rich, but here we will explore “the expected return” of investing in property to be used for occasional renting (through Airbnb). Our analysis will focus on 2 bedroom listings. To understand better what the expected return is with respect to property value itself, we will also use housing price data from Zillow.
We will start with looking at data from San Diego. First we specify the paths of our datasets:
airbnb_data_path = "./sd/listings.csv"
num_bedrooms = "2"
zillow_data_path = "~/Downloads/Neighborhood_Zhvi_2bedroom.csv"
zillow_city = "San Diego"
zillow_rent_data = "~/Downloads/Neighborhood_MedianRentalPrice_2Bedroom.csv"
While there are over 90 attributes in this dataset, in this analysis we will only use the variables “neighbourhood”, “bedrooms”, and “price.” Here is what a random sample of these variables looks like:
df.listings[sample(nrow(df.listings), 10),c("neighbourhood", "bedrooms", "price")]
## neighbourhood bedrooms price
## 3700 3 $389.00
## 6213 2 $200.00
## 47 1 $59.00
## 5682 1 $200.00
## 2865 Harborview 2 $330.00
## 6124 Golden Hill 2 $100.00
## 781 College West 2 $110.00
## 4971 Old Town 1 $99.00
## 3894 Mission Beach 1 $150.00
## 89 1 $59.00
Note that there may be some neighborhoods that are blank. This may be because either the scraper didn’t parse out the neighborhood field from the HTML properly or that it was hidden by Airbnb.com or the owner of the listing. We will eventually drop neighborhoods that are missing or have too few listings (since too few listings do not yield reliable statistics).
We next load housing price data on 2 bedroom units from Zillow.
z.df = read.csv(zillow_data_path, header = TRUE, stringsAsFactors = FALSE)
city.df = subset(z.df, City == zillow_city)
We are assuming that what Airbnb refers to as “neighbourhood” is called “RegionName” in Zillow. In the Zillow dataset, for San Diego we have a total of 82 neighbourhoods. On the other hand, in the Airbnb dataset, we have a total of 105 neighbourhoods in San Diego. In addition to the number of neighbourhoods not being the same, there are also neighbourhoods that don’t appear in one data or the other. We next merge the Zillow and Airbnb datasets:
airbnb.zillow = merge(df.listings, city.df[, c("RegionName","X2016.08")], by.x = "neighbourhood", by.y = "RegionName")
After merging the Zillow with Airbnb datasets, we have 0 neighbourhoods and so have “lost”" several neighbourhoods. This may be due to different naming/spelling conventions Airbnb and Zillow uses (for example, one may refer to the UTC/UCSD by the Gilman Dr 5 Freeway exit area as “La Jolla Village Colony” while the other as just “La Jolla Village”). In this quick analysis we have not taken care to deal with these issues, but we can refine and include these cases for future work.
Still, we have a large number of neighbourhoods at our disposal to perform some analyses.
We are interested in knowing what is our expected yearly return. But the Airbnb listing price is based on a nightly basis. Since we do not know how often a listing is occupied (number of times per year a listing is actually used), it is difficult/impossible to determine a yearly return on a unit (unless, of course, one has company insider access to Airbnb data).
Currently, there are city governments that are regulating the use of Airbnb. This is often done by preventing a listing to be listed on Airbnb year round. In light of such regulations, we will make the assumption that a listing is occupied for half of the year. This may be an optimistic upper bound, but many Airbnb advocates (namely, those that are investing in property solely for the purposes of short-term renting as opposed to traditional renting models) argue that any such limits hurts their returns.
In other words, from the point of view of the Airbnb investor, a listing being occupied for half a year may actually be overtly pessimistic. What matters at the end of the day, however, is how does the yearly return from Airbnb compare to traditional renting (which we will view shortly). Note that the Zillow housing prices we have are from August 2016.
percent.occupied = .5 # how often of 365 days listing is occupied
airbnb.zillow$expected.nightly.return = 100*percent.occupied*365*airbnb.zillow$price.numeric/airbnb.zillow$X2016.08
Having estimated the yearly return on Airbnb properties (with respect to median 2 Bedrooms housing prices as obtained from Zillow), we next compute the expected return for each neighbourhood and the associated risk (only for 2 Bedroom listings).
We compute the expected return by simply taking the median of the yearly returns in each neighbourhood. We compute risk by computing the standard deviation of the yearly returns for each neighbourhood.
Charts like this are often used in portfolio theory for determining the distribution of allocating funds and computing the “efficient frontier.” The way to read this is like this: you can see that Marina and Mission Valley East have roughly the same risk (same x-value of 4.27 and 4.29 respectively). But, Mission Valley East has a significantly higher return of 10.6% compared to Marina with a return of 6.08%. So this means that if you are willing to tolerate a risk of ~4.28, then you should definitely go with investing in the Mission Valley East neighborhood since it has much higher return.
There is a huge caveat here however. If you hover your for each neighborhood, you’ll note that we have also included the number of listings (referred to as units in the chart) that were used to calculate expected return and risk. So you can see in our Mission Valley East and Marina example that there were “only” 25 and 14 listings used respectively. If you are skeptical that we do not have enough units to get reliable estimates, then consider Ocean Beach and Mission Beach for example (both having over 50 listings).
It would be nice to have a way of ranking neighbourhoods that both maximize return and minimize risk simultaneously. One way of doing is by computing the ratio between return and risk (this is analogous to the Sharpe ratio used in portfolio optimization). Thus, ranking the neighbourhoods this way using the return-to-risk ratio yields the following neighborhoods from highest to lowest:
## Neighbourhood Return Risk Return.to.Risk
## 277 University Heights 6.302616 1.780899 3.5390074
## 197 Ocean Beach 6.131851 1.829928 3.3508695
## 174 Normal Heights 4.311363 1.334131 3.2315894
## 137 Midtown 7.382255 2.317153 3.1859161
## 18 Bay Park 7.966111 2.510854 3.1726697
## 271 University City 6.672180 2.475312 2.6954904
## 157 Mission Beach 4.909081 1.830866 2.6812893
## 251 South Park 5.617671 2.209856 2.5420976
## 163 Mission Valley East 10.604730 4.286891 2.4737577
## 84 Golden Hill 9.387860 3.854253 2.4357143
## 68 Del Mar Heights 4.155563 2.092764 1.9856818
## 188 North Park 6.772159 3.529343 1.9188155
## 203 Pacific Beach 6.136516 3.312763 1.8523860
## 61 Core-Columbia 7.970979 4.407653 1.8084409
## 103 La Jolla 4.386597 2.603161 1.6851042
## 98 Kensington 5.314674 3.255741 1.6324009
## 216 Point Loma Heights 6.187839 3.798243 1.6291321
## 114 Linda Vista 8.027167 5.392047 1.4887049
## 134 Marina 6.084822 4.271964 1.4243618
## 74 East Village 7.419250 5.306229 1.3982152
## 93 Hillcrest 6.398859 6.044677 1.0585940
## 210 Park West 6.784513 6.603938 1.0273436
## 120 Little Italy 9.256441 9.541240 0.9701508
##
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return,
## data = hood.return.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8128 -1.0594 -0.1978 0.5261 3.6155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9781 0.6471 7.693 1.53e-07 ***
## sd.expected.nightly.return 0.4691 0.1565 2.998 0.00686 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.419 on 21 degrees of freedom
## Multiple R-squared: 0.2997, Adjusted R-squared: 0.2664
## F-statistic: 8.987 on 1 and 21 DF, p-value: 0.006855
The coefficient of this model tells us for every “unit” or risk that we are willing to tolerate, how much return should we expect. In the case of San Diego, this coefficient is roughly 0.4691 with a p value that is small enough for publication. This suggests, at the very least, that as we increase our risk we should also expect our return to go up. It would be interesting to compare how this coefficient changes for different cities (higher means we should expect more return, and lower means we should expect less).
It would be nice to know how do returns from Airbnb compare to traditional renting. Zillow also provides some rent data that we can use to get an understanding of what typical returns are from renting.
While the Zillow housing price data has 82 neighbourhoods listed for San Diego, the Zillow rent data has only 11 neighbourhoods.
The Zillow rent data gives us the median monthly rent data for each neighbourhood. Since we would like to understand yearly return, we again have to extrapolate for an entire. We will assume that the unit is rented for the whole year so the extrapolation simply multiplies the rent data by 12
house.rent$rent.return = 100*12*house.rent$rent.value/house.rent$price.value
Having estimated what the expected yearly return on rental properties are, let’s now visualize how this has changed in the course of the last few years:
As this chart shows, the return of rental properties is comparable to Airbnb returns. More over, while we do not explicitly compute it here, the variance (or risk) of these returns is relatively stable over time (of course, notion of stability depends on your tolerance of risk).
Risk model for San Diego
##
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return,
## data = sd.risk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8128 -1.0594 -0.1978 0.5261 3.6155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9781 0.6471 7.693 1.53e-07 ***
## sd.expected.nightly.return 0.4691 0.1565 2.998 0.00686 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.419 on 21 degrees of freedom
## Multiple R-squared: 0.2997, Adjusted R-squared: 0.2664
## F-statistic: 8.987 on 1 and 21 DF, p-value: 0.006855
Risk model for Los Angeles:
##
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return,
## data = la.risk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0838 -0.4672 -0.1161 0.6777 2.4919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8131 0.4984 7.651 1.23e-07 ***
## sd.expected.nightly.return 0.2526 0.1794 1.408 0.173
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 22 degrees of freedom
## Multiple R-squared: 0.08271, Adjusted R-squared: 0.04101
## F-statistic: 1.984 on 1 and 22 DF, p-value: 0.173
Risk model for San Francisco:
##
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return,
## data = sf.risk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64858 -0.82316 0.01408 0.69900 2.02440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.18658 0.43629 9.596 6.3e-09 ***
## sd.expected.nightly.return 0.09933 0.10216 0.972 0.343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.021 on 20 degrees of freedom
## Multiple R-squared: 0.04513, Adjusted R-squared: -0.002608
## F-statistic: 0.9454 on 1 and 20 DF, p-value: 0.3425