We will be going over Airbnb listings data scraped from Airbnb.com. The data was scraped by insideairbnb.com.

The primary analyses performed are:

  1. Estimating/extrapolating annual expected returns and risks from Airbnb listing for 2 bedroom units for each neighborhood
  2. Exploring the relationship between expected returns and risks with Return-vs-Risk chart and a regression model
  3. Comparing return and risk relationships between several cities

Preparing the Airbnb dataset

This dataset is very rich, but here we will explore “the expected return” of investing in property to be used for occasional renting (through Airbnb). Our analysis will focus on 2 bedroom listings. To understand better what the expected return is with respect to property value itself, we will also use housing price data from Zillow.

We will start with looking at data from San Diego. First we specify the paths of our datasets:

airbnb_data_path = "./sd/listings.csv"
num_bedrooms = "2"
zillow_data_path = "~/Downloads/Neighborhood_Zhvi_2bedroom.csv"
zillow_city = "San Diego"
zillow_rent_data = "~/Downloads/Neighborhood_MedianRentalPrice_2Bedroom.csv"

While there are over 90 attributes in this dataset, in this analysis we will only use the variables “neighbourhood”, “bedrooms”, and “price.” Here is what a random sample of these variables looks like:

df.listings[sample(nrow(df.listings), 10),c("neighbourhood", "bedrooms", "price")]
##      neighbourhood bedrooms   price
## 3700                      3 $389.00
## 6213                      2 $200.00
## 47                        1  $59.00
## 5682                      1 $200.00
## 2865    Harborview        2 $330.00
## 6124   Golden Hill        2 $100.00
## 781   College West        2 $110.00
## 4971      Old Town        1  $99.00
## 3894 Mission Beach        1 $150.00
## 89                        1  $59.00

Note that there may be some neighborhoods that are blank. This may be because either the scraper didn’t parse out the neighborhood field from the HTML properly or that it was hidden by Airbnb.com or the owner of the listing. We will eventually drop neighborhoods that are missing or have too few listings (since too few listings do not yield reliable statistics).

Preparing the Zillow housing dataset and merging with Airbnb data

We next load housing price data on 2 bedroom units from Zillow.

z.df = read.csv(zillow_data_path, header = TRUE, stringsAsFactors = FALSE)
city.df = subset(z.df, City == zillow_city)

We are assuming that what Airbnb refers to as “neighbourhood” is called “RegionName” in Zillow. In the Zillow dataset, for San Diego we have a total of 82 neighbourhoods. On the other hand, in the Airbnb dataset, we have a total of 105 neighbourhoods in San Diego. In addition to the number of neighbourhoods not being the same, there are also neighbourhoods that don’t appear in one data or the other. We next merge the Zillow and Airbnb datasets:

airbnb.zillow = merge(df.listings, city.df[, c("RegionName","X2016.08")], by.x = "neighbourhood", by.y = "RegionName")

After merging the Zillow with Airbnb datasets, we have 0 neighbourhoods and so have “lost”" several neighbourhoods. This may be due to different naming/spelling conventions Airbnb and Zillow uses (for example, one may refer to the UTC/UCSD by the Gilman Dr 5 Freeway exit area as “La Jolla Village Colony” while the other as just “La Jolla Village”). In this quick analysis we have not taken care to deal with these issues, but we can refine and include these cases for future work.

Still, we have a large number of neighbourhoods at our disposal to perform some analyses.

Estimating expected annual returns and risks for 2 bedroom units by neighborhood

We are interested in knowing what is our expected yearly return. But the Airbnb listing price is based on a nightly basis. Since we do not know how often a listing is occupied (number of times per year a listing is actually used), it is difficult/impossible to determine a yearly return on a unit (unless, of course, one has company insider access to Airbnb data).

Currently, there are city governments that are regulating the use of Airbnb. This is often done by preventing a listing to be listed on Airbnb year round. In light of such regulations, we will make the assumption that a listing is occupied for half of the year. This may be an optimistic upper bound, but many Airbnb advocates (namely, those that are investing in property solely for the purposes of short-term renting as opposed to traditional renting models) argue that any such limits hurts their returns.

In other words, from the point of view of the Airbnb investor, a listing being occupied for half a year may actually be overtly pessimistic. What matters at the end of the day, however, is how does the yearly return from Airbnb compare to traditional renting (which we will view shortly). Note that the Zillow housing prices we have are from August 2016.

percent.occupied = .5 # how often of 365 days listing is occupied
airbnb.zillow$expected.nightly.return = 100*percent.occupied*365*airbnb.zillow$price.numeric/airbnb.zillow$X2016.08

Having estimated the yearly return on Airbnb properties (with respect to median 2 Bedrooms housing prices as obtained from Zillow), we next compute the expected return for each neighbourhood and the associated risk (only for 2 Bedroom listings).

We compute the expected return by simply taking the median of the yearly returns in each neighbourhood. We compute risk by computing the standard deviation of the yearly returns for each neighbourhood.

Charts like this are often used in portfolio theory for determining the distribution of allocating funds and computing the “efficient frontier.” The way to read this is like this: you can see that Marina and Mission Valley East have roughly the same risk (same x-value of 4.27 and 4.29 respectively). But, Mission Valley East has a significantly higher return of 10.6% compared to Marina with a return of 6.08%. So this means that if you are willing to tolerate a risk of ~4.28, then you should definitely go with investing in the Mission Valley East neighborhood since it has much higher return.

There is a huge caveat here however. If you hover your for each neighborhood, you’ll note that we have also included the number of listings (referred to as units in the chart) that were used to calculate expected return and risk. So you can see in our Mission Valley East and Marina example that there were “only” 25 and 14 listings used respectively. If you are skeptical that we do not have enough units to get reliable estimates, then consider Ocean Beach and Mission Beach for example (both having over 50 listings).

Neighbourhoods ranked by Return-to-Risk ratio:

It would be nice to have a way of ranking neighbourhoods that both maximize return and minimize risk simultaneously. One way of doing is by computing the ratio between return and risk (this is analogous to the Sharpe ratio used in portfolio optimization). Thus, ranking the neighbourhoods this way using the return-to-risk ratio yields the following neighborhoods from highest to lowest:

##           Neighbourhood    Return     Risk Return.to.Risk
## 277  University Heights  6.302616 1.780899      3.5390074
## 197         Ocean Beach  6.131851 1.829928      3.3508695
## 174      Normal Heights  4.311363 1.334131      3.2315894
## 137             Midtown  7.382255 2.317153      3.1859161
## 18             Bay Park  7.966111 2.510854      3.1726697
## 271     University City  6.672180 2.475312      2.6954904
## 157       Mission Beach  4.909081 1.830866      2.6812893
## 251          South Park  5.617671 2.209856      2.5420976
## 163 Mission Valley East 10.604730 4.286891      2.4737577
## 84          Golden Hill  9.387860 3.854253      2.4357143
## 68      Del Mar Heights  4.155563 2.092764      1.9856818
## 188          North Park  6.772159 3.529343      1.9188155
## 203       Pacific Beach  6.136516 3.312763      1.8523860
## 61        Core-Columbia  7.970979 4.407653      1.8084409
## 103            La Jolla  4.386597 2.603161      1.6851042
## 98           Kensington  5.314674 3.255741      1.6324009
## 216  Point Loma Heights  6.187839 3.798243      1.6291321
## 114         Linda Vista  8.027167 5.392047      1.4887049
## 134              Marina  6.084822 4.271964      1.4243618
## 74         East Village  7.419250 5.306229      1.3982152
## 93            Hillcrest  6.398859 6.044677      1.0585940
## 210           Park West  6.784513 6.603938      1.0273436
## 120        Little Italy  9.256441 9.541240      0.9701508

Regression model quantifying relationship between risk and returns:

## 
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return, 
##     data = hood.return.subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8128 -1.0594 -0.1978  0.5261  3.6155 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.9781     0.6471   7.693 1.53e-07 ***
## sd.expected.nightly.return   0.4691     0.1565   2.998  0.00686 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.419 on 21 degrees of freedom
## Multiple R-squared:  0.2997, Adjusted R-squared:  0.2664 
## F-statistic: 8.987 on 1 and 21 DF,  p-value: 0.006855

The coefficient of this model tells us for every “unit” or risk that we are willing to tolerate, how much return should we expect. In the case of San Diego, this coefficient is roughly 0.4691 with a p value that is small enough for publication. This suggests, at the very least, that as we increase our risk we should also expect our return to go up. It would be interesting to compare how this coefficient changes for different cities (higher means we should expect more return, and lower means we should expect less).

Returns from traditional rental units

It would be nice to know how do returns from Airbnb compare to traditional renting. Zillow also provides some rent data that we can use to get an understanding of what typical returns are from renting.

While the Zillow housing price data has 82 neighbourhoods listed for San Diego, the Zillow rent data has only 11 neighbourhoods.

The Zillow rent data gives us the median monthly rent data for each neighbourhood. Since we would like to understand yearly return, we again have to extrapolate for an entire. We will assume that the unit is rented for the whole year so the extrapolation simply multiplies the rent data by 12

house.rent$rent.return = 100*12*house.rent$rent.value/house.rent$price.value

Having estimated what the expected yearly return on rental properties are, let’s now visualize how this has changed in the course of the last few years:

As this chart shows, the return of rental properties is comparable to Airbnb returns. More over, while we do not explicitly compute it here, the variance (or risk) of these returns is relatively stable over time (of course, notion of stability depends on your tolerance of risk).

TL;DR Returns and Risks from San Diego

Risk model for San Diego

## 
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return, 
##     data = sd.risk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8128 -1.0594 -0.1978  0.5261  3.6155 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.9781     0.6471   7.693 1.53e-07 ***
## sd.expected.nightly.return   0.4691     0.1565   2.998  0.00686 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.419 on 21 degrees of freedom
## Multiple R-squared:  0.2997, Adjusted R-squared:  0.2664 
## F-statistic: 8.987 on 1 and 21 DF,  p-value: 0.006855

TL;DR Returns and Risks from Los Angeles

Risk model for Los Angeles:

## 
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return, 
##     data = la.risk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0838 -0.4672 -0.1161  0.6777  2.4919 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.8131     0.4984   7.651 1.23e-07 ***
## sd.expected.nightly.return   0.2526     0.1794   1.408    0.173    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.087 on 22 degrees of freedom
## Multiple R-squared:  0.08271,    Adjusted R-squared:  0.04101 
## F-statistic: 1.984 on 1 and 22 DF,  p-value: 0.173

TL;DR Returns and Risks from San Francisco

Risk model for San Francisco:

## 
## Call:
## lm(formula = median.expected.nightly.return ~ sd.expected.nightly.return, 
##     data = sf.risk)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64858 -0.82316  0.01408  0.69900  2.02440 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.18658    0.43629   9.596  6.3e-09 ***
## sd.expected.nightly.return  0.09933    0.10216   0.972    0.343    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.021 on 20 degrees of freedom
## Multiple R-squared:  0.04513,    Adjusted R-squared:  -0.002608 
## F-statistic: 0.9454 on 1 and 20 DF,  p-value: 0.3425