Exploring Property Listings within New York City

Introduction


Brief Description

We have been provided Airbnb and Zillow datasets to anlayse and generate meanigful insights regarding properties price and rent within New York City as part of Capital One Data Challenge. Airbnb is based out of San Francisco and provides an online platform where users can either book or offer lodging services. Airbnb takes a commission for letting the user list any property on their platform. As a consequence, Airbnb collects all the property related data that gets listed on their platform. The data is publicly available and contains all the information that a user sees related to a any property on Airbnb webiste. The details include the daily rent that property is going to charge, number of bedrooms, number of bathrooms, amenities, number of reviews, review score, decsription about the property, etc. Next is the dataset that is made publicly available by Zillow. Zillow is the real estate online platform that provides information about property value and other property related information. It is headquartered in Seattle. Zillow also comes up with Zestimate which is Zillow’s estimated market value for an individual home. Cool! Data provided by Zillow contains zipcode wise prices of the properties for the last 20 years. And, so my task will be analyse the wonderful data that is made available thanks to Airbnb and Zillow and try to figure out the ideal plcaes to make an investment. Simple, right? Let the analysis begin. The programming languaage/ tools used to analyse the data were R and R Shiny.


Problem Statement

To analyse the datasets of Airbnb and Zillow to answer the following question:

  • Which zip codes are the most profitable to invest in within New York City for two bed room properties

Approach

Before we begin with our analysis, it is always important to have a look at the data and get a feel of it. By feeling, I mean to say one should be extremely clear what the each and every variable means. Then, I’ll be by taking a look at the structure and summary of the data after importing the required datasets in R. It is also improtant visualize the data to see if the distribution of the variables, relationship among the variables and see if there any outliers. Looking at the data helps you see what is wrong with the data and the methods that need to be applied to clean the data. Cleaning the data is an essential part of coming up with meaningful insights. Once the data is cleaned, we will move on to the exploratory data analysis and apply two different approaches to calculate the most profiatble zipcodes to invest in. I have used both static and interactive visuals to convey the property landscape in NYC.


Packages Required

Following are the packages required to analyse the given datasets:

  • readr - It provides a fast and friendly way to read rectangular (csv, tsv) data
  • dplyr - It provides verbs that helps in data manipulation of objects like data frame
  • ggplot2 - It is a package to to visualize the data
  • ggrepel - It is used for character manipulation, pattern matching in strings, etc.
  • DT - Helps in displaying data objects as tables on HTML pages
  • tidyverse - tidyverse is a set of packages. It is designed to tidy up the data and provides various functioanalities
  • corrplot - Provides functionlaity to plot correlation among the the predictors
  • ggalt - Includes functionality to plot lollipop charts, dumbbell charts, the ability to encircle points and coordinate-system-based text annotations
  • ggmap - It includes collection of functions to visualize spatial data and models
  • sp - Includes methods for spatial data
  • rgdal - Provides bindings to the ‘Geospatial’ Data Abstraction Library (‘GDAL’)
  • tmap - It offers a flexible, layer-based, and easy to use approach to create thematic maps, such as choropleths and bubble maps
  • scales - Provides the internal scaling infrastructure to ggplot2
  • RColorBrewer - Provides color schemes for maps
  • viridisLite - Provides color schemes for maps

Data Description & Preparation

Data Description

Airbnb dataset - It contains 96 features providing property related information within New York City. My task is to only look for 2 bedroom peroperties in NYC.

Some of the important Variables and their description

  • host_id - Uniquely identfies host. One host can list multiple properties
  • id - Uniquely identifies property listed
  • Zipcode - Zipcode where each property is listed by the host
  • neighbourhood_group_cleansed - Identifies the five boroughs within NYC
  • neighbourhood_cleansed - Identifies neighbourhood within one of the 5 boroughs
  • price - Daily rent charged by the host for the prorperty
  • availability_365 - Number of days the property is available in the coming 365 days
  • review_scores_location - Score provided by the guest to the host out of 10 based on the location

Zillow dataset - It contains monthly information of the property price (cost) for the past 20 years

Some of the important Variables and their description

  • RegionName - Zipcode where the property is present
  • 1996-04:2017-06 - Price of each property updated monthly from 1996-04 to 2017-06

Glimpses of the data

  • 4894 - Number of 2 bedroom properties listed on Airbnb in NYC
  • 531 - Total number of superhosts
  • 92508 - Total number of reviews posted by the guest

Data Importing and Cleaning

After improting the data, it is necessary to visulaize the data to see if there are any outliers or mistakes due to incorrect entry of the data. I found out that the daily price had incorrect entries/ outliers which had to be removed like 9999. There was also a property which was listed under the neighboruhood Manhattan but the state was mentioned incorrectly as ‘New Jersey’. Also, the price conatined $ and , which had to removed. Finally, I didn’t consider the variables which I felt were not required like varibles containing text, State, etc. After all the necessary transformations, it was time to do some analysis.

Data Analysis

Neighborhood Clustering

The below plot depicts the clustering of neighborhoods based on ‘average yearly booking count’ for each neighbourhood, ‘average daily rent’ for each neighbourhood and ‘average location review score’ for each neighbourhood. Booking count was arrived at by subtracting the availability of the property in the next 365 days by 365. Booking count indicates the number of days the property is booked in the coming year. There were few other variables that were similar to booking count for next 365 days like booking count for 30 days, 60 days, etc. but they were highly correlated. I also considered location review score instead of overall review score as the focus is on the location the investment needs to be done. Overall review score considers a lot of factors which are specific to host and host’s property other than the location. The algorithm used here is K-means clustering and the variables have been scaled so that comparison can be made on a similar scale. We can see three clusteres being formed; one cluster represented by blue color has high average booking count and high average daily price. Another one represented by green color has high average booking count and low average daily price and the last one represented by red color has low average daily price and low average booking count. Blue circles represent neighbourhoods like Flatiron District, SoHo, Midtown, Chelsea, etc. Green circles represent neighborhoods like Concourse Village, Clifton, Fort Hamilton, Highbridge, etc. Finally, blue circles represent South Ozone Park, Unionport, Springfield Gardens, etc.

You can click on the below link to see interactive clustering using RShiny.


Appreciation of Properties

Property appreciation is one of the most important factors that is considered when anyone wants to take a decison on purchasing the property. The plot below shows the 5 year annualized return for holding the property. The return of the properties have been averaged based on the neighborhood they belong. The annualized return has been calculated based on the past 5 years since returns of the past 10 years might not have portrayed a true picture given the financial crisis that happened in the year 2008. The mean annualized 5 year return of all the neighbourhhods displayed in the plot came out to be 11.4%. Approximately 20 out of 50 neighborhoods generated above average return. Prospect Heights, Fort Greene and Boerum Hill were the top neighborhoods.


Properties, Properties and more Properties!

It might be fair to say that no one would like to buy a property in a deserted area. But at the same time if we expect any neighbourhood is gaining popularity due to some reason then we can see a rapid rise in the property count in that particular neighbourhood. We are reading in the news daily that companies are thinking of moving their headquarters to places like Austin where the property costs are not sky high like San Francisco. Here, we take a look at the borough wise neighbourhoods having the highest poperty counts. We won’t be able to figure if any particular area is gaining popularity as we don’t have count of properties from the past years in the dataset provided. But, we will use the cost of the property and the rent it is generating to see how much time it will take for a property to breakeven later.


Where are the Super Hosts?

Competition can be good and bad at the same time. If there is too much competition it might be good for no player. It is important to look where the property hosts are and specially the super hosts. As per Airbnb, super hosts are someone who provide a shining example for other hosts, and provide extraordinary experiences for their guests. So, it is important to look where our competition is before we start investing. The map below shows the superhost count segregated based on the zip code. The darker shade of blue indicates higher number of super hosts. We can clearly see a lot of areas in Brookyln and Manhattan having high number of super hosts.


Price vs Demand!

High price, low demand. We have heard and seen about it all our lives. But, it isn’t the case always. You might have heard about Giffen goods in Economics which are goods for which the demand increases as the price is increased. It means there are people out there who are willing to pay higher as price increases. This behavious can be because of numerous factors. The factor might be quality of good which in our case is property or another factor can be necessity, etc. Here, we will look at the average rent charged for the properties based on the zip code. Blue color represents the zip codes having the highest property rents and yellow indicates the lowest. Manhattan obviously comes on top. At the same time, we will look at the demand i.e average yearly booking of the properties by zip code. Blue represents high booking and yellow represents least booking. Again, we see Manhattan does very good.


What’s the review count?

The below chart depicts the number of reviews each property has received. More the reviews, more popular the place is and hopefully for good reasons. We can always find it out through word cloud for different neighborhoods whether the review is positive or negative. But, for that additional data is required which we don’t possess. At this moment, we will assume that the reviews are psoitive if the review count is higher. High review count is indicated by red and purple color.


Approach 1

Property purchased to be sold after 5 years

Calculating Profitability Index (PI) & Net Present Value (NPV) for each neighbourhood


In this approach, I am assuming that property will be sold after 5 years so both revenue (rent) and terminal price of the property at the end of 5 years is being considered. The discount factor has been assumed to be 0%. Future Growth rate is the 5 year annualized growth rate of the past for appraising the proeprty price. The profitablity index (PI) tells us whether the property is profitable or not. If it is greater than 1 then it means the property is profitable else not. Profitability Index is arrived at by calcualing the total incoming and outgoing cashflow (including termianl price of the property) divided by the initial investment outlay. Outgoing cashflows are assumed to be zero. Revenue has been calculated by using daily rent charged multiplied by the occupancy rate i.e. 75% (assumed same for all the proeprties) and multiplying it by 365 days. After calculating the profiatbility index, I calculated the mean of the profitability index for each neighbourhood. PI helps to make apples to apples comparison as it is a ratio rather than actual figure. From the below chart we can see that Prospect Heights has highest PI and so on. Below the PI plot, I have also calculated Net Present Value (NPV) which is the total cash flows from year 1 minus the initial outlay. It gives the actual figures rather than a ratio. The approach again was to calculate NPV of each property and then mean NPV for each neighbourhood. We can see that Chelsea has highest NPV and so on. So, we can see that order NPV order is different from PI.


Net Present Value by Zipcode

Similary, as above I calculated NPV and PI of each property and then based on that I calcualted the mean of NPV and PI by zipcode.


Approach 2

Holding the property indefinitely

Payback Period for each zipcode


The first approach considered not only considered the revenue (rent) generated from the property but it also captured the terminal price of the property after 5 years. In this approach, we consider that property is going to be held indefinitely and so the focus will only be on the revenues getting generated and the initial outlay. Discount factor willa gain be 0%. Our approach here is to calculate the number of years it will take for the property to breakeven i.e. the number of years it will take to recover the initial outlay.In finance parlance, it is also called payback period. The payback period is being calculated based on the mean property price of each zipcode. And, obviously lesser the payback period better it is. We can see from the plot, maximum amount of time taken by a zipcode to breakeven is 34 years.


Finally, I have tried to show the net cumulative sum of revenues year on year for all the zipcodes. So, you will see some zipcodes breaking even earlier than another zipcodes. Red color inidcates loss because initial outlay is more than the cumulative sum of revenues (rent). Once the amount is fully recovered and the zipcode breaks even, the color of the bar will change to blue.