Executive Summary

Introduction

In the following case study, I am consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.

The following case study helps them understand which zip codes would generate the most profit on short term rentals within New York City.

Datasets

1. AirBnb(Revenue Data) AirBnB is the medium through which the investor plans to lease out their investment property

2. Zillow(Cost Data) Zillow provides us an estimate of value for two-bedroom properties

Assumptions

  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
  • All properties and all square feet within each locale can be assumed to be homogeneous.
  • The occupancy rate is assumed to be 75%.
  • The average number of days for which the property will be rented is taken as 3. (In a recent survey done by AirBnb, it was found that majority of their stays last about 3 days)

Procedure

  1. Data Cleaning and Merging:
  • Both Revenue and Cost Data have a very large number of columns - 106 and 262 respectively. They were filtered based on predefined assumptions, data quality metrics(low percent of missing values) and keeping the end goal in mind.
  • The datasets were then merged on zipcodes. Any possible outliers were identified and removed and it was ensured all the datatypes and names of the fields were consistent with the documentation.
  • All the steps were carried out in a scalable way so that we don’t have to redo everything in our code if the dataset changes or the company approaches a new market.
  1. Exploratory Data Analysis: The final data was used to create insightful visualizations showing the zipcodes and boroughs where our investor could possibly profit from. The zipcodes were analysed and ranked according to various metrics like Payback period, ROI%(Return on Investment), Number of properties etc. Similar analysis was done for boroughs comparing their prices and property costs.

Key Insights

The zipcodes were analysed using a number of parameters: Payback period, ROI%, Appreciation Rate of Property Price(for the past 5 years) and Number of properties in that zipcode. The following were the key insights:

  1. The zipcodes with least amount of payback period are:

    • 10306(Staten Island) with Payback period of 9.5 years
    • 10303(Staten Island) with Payback period of 11.9 years
    • 11234(Brooklyn) with Payback period of 12.07 years
    • 11434(Queens) with Payback period of 12.77 years
    • 10305(Staten Island) with Payback period of 13.36 years
  2. Coming to Return on Investement(ROI%), these zipcodes performed really well: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 11434(Queens)

  3. The average growth rate in property price for the 5 year period between 2012-17 has been highest for following: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 10128(Manhattan)

  4. Out of all the above zipcodes, 11217 and 11215(Brooklyn) have a high number of listings indicating the popularity of the area

  5. Other key insights include :

    • Both prices and property costs are comparatively high in Manhattan.
    • There is no correlation between price(per night) and the population of the area(SizeRank).
    • Queens and Staten Island are cheaper in terms of both property and rental prices as compared to the other two.

Recommendations

I would recommend investing in Staten Island, particularly in zipcodes 10306,10303 and 10305 because each of them has very less payback period and at the same time they guarantee ROI% higher than the average. 11434(Queens) is also a great choice in terms of the above metrics. These zipcodes will involve lower risks as the investor will be able to break even quickly. In case the investor wants to hold the property for certain period of time(suppose for the next 5 years) and have high returns, Brooklyn(Zipcodes : 11217,11215) would be a great pick as these two zipcodes have high returns and they have seen highest increase rate in their property price from 2012-17. I would not personally recommend investing in Manhattan as both property cost and price rents are too high.

To Summarise my Recommendations:

  • Staten Island(Zipcodes 10303,10305 and 10306): Lower Payback Period, Decent ROI%
  • Brooklyn(Zipcodes 11217,11215): High ROI%, High Appreciation Rate in Property Prices(2012-17), High number of properties
  • Queens(Zipcode 11434): Lower Payback Period, Good ROI%
  • Manhattan: Not recommended

Out of all the above, Zipcode 10306 would be my first choice since it has the lowest payback period.

What’s Next

  1. The data for other expenses like maintenance, taxes and repairs is very important since it could help us to predict metrics like Payback period,ROI% etc. with greater accuracy.
  2. The analysis could be further improved if we can have more data for weekly price and monthly price since many people tend to stay for a week or a month.
  3. We can use Time series and Forecasting to better predict the future prices and bringing both the data on the same time scale.
  4. Further variables and factors like Crime rate, distance from tourist places/economic centers and other socio-economic factors could help us further in our analysis.

Installing and Loading required packages

Data Preparation and Analysis

Analysis of Revenue Data (Airbnb Dataset)

## [1] 48895   106

There are a total number of 48895 observations and 106 columns in the bnb dataset.

After analyzing the column names, we have to select the columns which seem relevant to our study. After selecting these columns, we have to check whether of any these columns have a large percent of missing values(greater than 50). Such a large number of missing values can be a problem for our analysis and lead to wrong results. So in this analysis, these columns have been removed accordingly.

The variables which were considered important for analysis are:

“id”,“neighbourhood_cleansed”,“neighbourhood_group_cleansed”,“city”,“zipcode”, “state” “latitude”,“longitude”,“room_type”,“price”,“bedrooms”,“cleaning_fee”,“square_fee t”,“weekly_price”,“monthly_price”,“review_scores_location”,“review_scores_rating”

Common functionality is developed for checking missing values of any dataframe so that it can be used later for other datasets in our analysis.

Checking for the percent of missing values for columns

Calling the above function

Variables Missing_Values
square_feet 99.17
monthly_price 89.27
weekly_price 87.72
review_scores_location 22.66
review_scores_rating 22.54
cleaning_fee 21.77
zipcode 1.06
city 0.13
bedrooms 0.04
state 0.01

Looking at the above values, three columns namely “square_feet”,“weekly_price” and “monthly_price” have a high percent of missing values (each greater than 70). So we have to remove them in our analysis.

Removing Redundant Columns

Since we are looking only for 2 bedroom properties in NYC, so columns like “city”,“state” and “bedrooms” will no longer be required after filtering.

Similar analysis has to be done for “Reviews” related columns to see if they are adding any value to our dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20.00   92.00   96.00   93.93  100.00  100.00   11022
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   9.000  10.000   9.565  10.000  10.000   11082

Apart from 1 or 2 outliers in the lower range both overall ratings and location reviews seem to be positive with both median and mean lying close to the max values in each case. Since all these reviews are positive, they are not adding any value in comparing two listings. Therefore, they are also removed.

Filtering for final Revenue Dataset

Creating a function so that it becomes easy for the client in case they want to have data for different bedroom properties.

Calling the above function

Since the data is going to be merged on zipcodes, we have to check the number of missing values for zipcode column. There are still 50 missing zipcode values in our dataset.

Furthermore it should be checked whether all the zipcodes have consistent length.

## 
##    5 
## 6447

All the zipcodes (except those with missing values) have 5 digits.

Similar to the checking missing values functionality, a common process has been followed for replacing missing values of important fields like zipcodes. These values are imputed by using the median values in the same neighbourhood and the borough as that of missing value.

Calling the above function

Final Revenue Dataset

id neighbourhood_cleansed neighbourhood_group_cleansed zipcode latitude longitude room_type price cleaning_fee
7750 East Harlem Manhattan 10029 40.79685 -73.94872 Entire home/apt $190.00 NA
13050 Bedford-Stuyvesant Brooklyn 11221 40.68554 -73.94090 Entire home/apt $115.00 $85.00
14290 Williamsburg Brooklyn 11206 40.70420 -73.93560 Entire home/apt $228.00 $128.00
15396 Chelsea Manhattan 10001 40.74623 -73.99530 Entire home/apt $375.00 $120.00
15711 Upper East Side Manhattan 10162 40.77065 -73.95269 Entire home/apt $250.00 $200.00
16458 Park Slope Brooklyn 11215 40.67343 -73.98338 Entire home/apt $225.00 NA

Data Quality Insights for Revenue Data: High percent of missing values were present in large number of columns.Data type for price fields is not numeric since they contain the $ sign.State and Zipcodes had many inconsistent/wrong entries but most of them were automatically removed after filtering for 2 bedroom properties.

Analysis of Cost Data (Zillow Dataset)

## [1] 8946  262

The columns contained in the above dataset are:

  • RegionID
  • RegionName
  • Location Details (City,State,Metro,CountyName)
  • SizeRank
  • Property price from 1996-04 to 2017-06

Before merging the datasets we need to bring both the prices on the same time scale so as to make even comparison.

Average Price Trends across the years

Data Quality Insights for Cost Data: As seen in the above graph, a lot of data is missing from 1996 to 2004. So it will not be a good idea to incorporate them for calculating future prices. A better picture can be portrayed by using the data from the last 5 years. (2012-2017)

Cleaning Zillow Dataset

Calling the above function for suitable parameters : for New York City

Since there are a lot of fluctuations in the monthly data, we have to average out the prices for past each year. Using these prices, then we compute the average appreciation rate for the past 5 years. The function has been created in such a way that we can get growth rate for any past ‘n’ number of years no matter the latest data. So it would be useful in case the zillow data gets available even for the next two/three years after 2017.

Creating Final Cost Data

A look at the Final Cost data

zipcode SizeRank propertyCost avg_growth
10025 1 1431000 0.0965309
10023 3 2142300 0.0892191
10128 14 1787100 0.0975620
10011 15 2480400 0.0920633
10003 21 2147000 0.0850568
11201 32 1420700 0.1261907

Merging the two datasets

## 'data.frame':    1569 obs. of  12 variables:
##  $ zipcode                     : chr  "10003" "10003" "10003" "10003" ...
##  $ id                          : int  19288937 13866112 31835260 5207973 17843998 4126452 35415489 2150328 31408487 3984168 ...
##  $ neighbourhood_cleansed      : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ latitude                    : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude                   : num  -74 -74 -74 -74 -74 ...
##  $ room_type                   : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ price                       : Factor w/ 674 levels "$0.00","$1,000.00",..: 244 129 208 222 244 189 244 635 640 134 ...
##  $ cleaning_fee                : Factor w/ 206 levels "$0.00","$1,000.00",..: 12 192 21 162 181 25 25 4 4 43 ...
##  $ SizeRank                    : int  21 21 21 21 21 21 21 21 21 21 ...
##  $ propertyCost                : int  2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
##  $ avg_growth                  : num  0.0851 0.0851 0.0851 0.0851 0.0851 ...

Data Cleaning for the final dataset

Checking Missing Values in the final data

##              Missing_Values
## cleaning_fee           13.7

Cleaning Fee has about 13.7% missing values. Since it is an important parameter in determining the returns for a listing so we have to impute it.

Impute Missing values of the Cleaning fee

##              Missing_Values
## cleaning_fee           0.13

Checking outliers in rent

From the boxplot, it is visible that there are 3 listings with quite a high price(greater than 2000$/day): 2 in Manhattan and 1 in Brooklyn. Such high prices could affect our analysis as they could be a result of some data entry error and there are not many listings with such astronomically high charges so it would be better to remove those observations.

Removing outliers in rent

Checking whether zipcodes are unique for a Neighbourhood group

## # A tibble: 1 x 2
##   zipcode total
##   <fct>   <int>
## 1 10013       2

10013 zipcode is present in two neighbourhoods - Brooklyn and Manhattan. However it is an area belonging to Lower Manhattan. So updating the neighbourhood of that observation from Brooklyn to Manhattan.

A glimpse at the final dataset

zipcode id neighbourhood_cleansed neighbourhood_group_cleansed latitude longitude room_type price cleaning_fee SizeRank propertyCost avg_growth
10003 19288937 East Village Manhattan 40.72784 -73.98883 Entire home/apt 250 110 21 2147000 0.0850568
10003 13866112 East Village Manhattan 40.72394 -73.98872 Entire home/apt 165 85 21 2147000 0.0850568
10003 31835260 Gramercy Manhattan 40.73782 -73.98547 Entire home/apt 218 120 21 2147000 0.0850568
10003 5207973 East Village Manhattan 40.72950 -73.98623 Entire home/apt 230 60 21 2147000 0.0850568
10003 17843998 East Village Manhattan 40.72571 -73.98789 Entire home/apt 250 75 21 2147000 0.0850568
10003 4126452 Gramercy Manhattan 40.73327 -73.98503 Entire home/apt 200 125 21 2147000 0.0850568

Structure of the final data

## 'data.frame':    1566 obs. of  12 variables:
##  $ zipcode                     : Factor w/ 24 levels "10003","10011",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id                          : chr  "19288937" "13866112" "31835260" "5207973" ...
##  $ neighbourhood_cleansed      : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ latitude                    : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude                   : num  -74 -74 -74 -74 -74 ...
##  $ room_type                   : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ price                       : num  250 165 218 230 250 200 250 850 88 170 ...
##  $ cleaning_fee                : num  110 85 120 60 75 125 125 100 100 150 ...
##  $ SizeRank                    : Factor w/ 24 levels "1","3","14","15",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ propertyCost                : int  2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
##  $ avg_growth                  : num  0.0851 0.0851 0.0851 0.0851 0.0851 ...

Visualization & Insights

Zipcode Analysis

Payback period

The zipcodes with less payback period will be better for investment as the investor will be able to quickly recover cash from their initial investment.

Zipcodes with least payback period
zipcode years
10306 9.478526
10303 11.966507
11234 12.077044
11434 12.778073
10305 13.361443

Zipcodes : 10306,10303,11234,11434 and 10305 have the least amount of payback period.(Dashed blue line shows the average)

Return on Investment(ROI%)

ROI = (Gain from Investment - Cost of Investment)/Cost of Investment

We can use simple ROI% calculation here since the investor will be paying for the property in cash. Suppose they are going to hold the property for 5 years and want to get high ROI%.

Zipcodes : 11217,11231,11201,11215 and 11434 have the highest ROI% among all the others.(Dashed blue line shows the average)

Number of properties by zipcodes

Zipcodes : 11215,10036,10003,10025 and 11217 have the highest number of property listings indicating the popularity of the area.

Zipcodes with Highest Property Appreciation Rate(from 2012-17)
zipcode neighbourhood_group_cleansed growthPercent number_of_properties
11217 Brooklyn 14.18 123
11231 Brooklyn 13.78 93
11201 Brooklyn 12.62 86
11215 Brooklyn 11.82 190
10128 Manhattan 9.76 65

Out of the above, both 11217 and 11215 had high property appreciation rates and also large number of properties.

Borough Analysis

Price(per night) distribution across neighbourhood

Property Cost distribution across neighbourhood

  • Manhattan has high property prices along with high rental prices as compared to the other three.
  • Queens and Staten Island have comparatively low prices in terms of both property and rent as compared to Brooklyn and Manhattan.

Further Insights

Does population of an area(SizeRank) affect Price(per night)?

Most of the prices lie in the region below 1000$ mark indicating there is not any noticeable relation between price per night and the population of that area.A more visible relation was seen in the earlier graph when the price distribution of all the boroughs was compared showing that Manhattan and Brooklyn had higher prices as compared to the other two.

Do Queens and Staten Island have lower prices because of higher number of private rooms?

Since private rooms are much cheaper as compared to renting an entire home, we wanted to know whether low prices in Staten Island and Queens are because they contain higher proportion of private rooms. However, from the graph it is easily visible that prices of both room types tend to follow same pattern across the neighbourhoods. Hence, we can conclude that rental prices are low in Staten Island and Queens irrespective of whether it’s a private room or an entire home.