Identifying Profitable Areas in NYC for Short-term Rental Investments

Executive Summary

Introduction

In the following case study, I am consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.

The following case study helps them understand which zip codes would generate the most profit on short term rentals within New York City.

Datasets

1. AirBnb(Revenue Data) AirBnB is the medium through which the investor plans to lease out their investment property

2. Zillow(Cost Data) Zillow provides us an estimate of value for two-bedroom properties

Assumptions

The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous.
The occupancy rate is assumed to be 75%.
The average number of days for which the property will be rented is taken as 3. (In a recent survey done by AirBnb, it was found that majority of their stays last about 3 days)

Procedure

Data Cleaning and Merging:

Both Revenue and Cost Data have a very large number of columns - 106 and 262 respectively. They were filtered based on predefined assumptions, data quality metrics(low percent of missing values) and keeping the end goal in mind.
The datasets were then merged on zipcodes. Any possible outliers were identified and removed and it was ensured all the datatypes and names of the fields were consistent with the documentation.
All the steps were carried out in a scalable way so that we don’t have to redo everything in our code if the dataset changes or the company approaches a new market.

Exploratory Data Analysis: The final data was used to create insightful visualizations showing the zipcodes and boroughs where our investor could possibly profit from. The zipcodes were analysed and ranked according to various metrics like Payback period, ROI%(Return on Investment), Number of properties etc. Similar analysis was done for boroughs comparing their prices and property costs.

Key Insights

The zipcodes were analysed using a number of parameters: Payback period, ROI%, Appreciation Rate of Property Price(for the past 5 years) and Number of properties in that zipcode. The following were the key insights:

The zipcodes with least amount of payback period are:
- 10306(Staten Island) with Payback period of 9.5 years
- 10303(Staten Island) with Payback period of 11.9 years
- 11234(Brooklyn) with Payback period of 12.07 years
- 11434(Queens) with Payback period of 12.77 years
- 10305(Staten Island) with Payback period of 13.36 years
Coming to Return on Investement(ROI%), these zipcodes performed really well: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 11434(Queens)
The average growth rate in property price for the 5 year period between 2012-17 has been highest for following: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 10128(Manhattan)
Out of all the above zipcodes, 11217 and 11215(Brooklyn) have a high number of listings indicating the popularity of the area
Other key insights include :
- Both prices and property costs are comparatively high in Manhattan.
- There is no correlation between price(per night) and the population of the area(SizeRank).
- Queens and Staten Island are cheaper in terms of both property and rental prices as compared to the other two.

Recommendations

I would recommend investing in Staten Island, particularly in zipcodes 10306,10303 and 10305 because each of them has very less payback period and at the same time they guarantee ROI% higher than the average. 11434(Queens) is also a great choice in terms of the above metrics. These zipcodes will involve lower risks as the investor will be able to break even quickly. In case the investor wants to hold the property for certain period of time(suppose for the next 5 years) and have high returns, Brooklyn(Zipcodes : 11217,11215) would be a great pick as these two zipcodes have high returns and they have seen highest increase rate in their property price from 2012-17. I would not personally recommend investing in Manhattan as both property cost and price rents are too high.

To Summarise my Recommendations:

Staten Island(Zipcodes 10303,10305 and 10306): Lower Payback Period, Decent ROI%
Brooklyn(Zipcodes 11217,11215): High ROI%, High Appreciation Rate in Property Prices(2012-17), High number of properties
Queens(Zipcode 11434): Lower Payback Period, Good ROI%
Manhattan: Not recommended

Out of all the above, Zipcode 10306 would be my first choice since it has the lowest payback period.

What’s Next

The data for other expenses like maintenance, taxes and repairs is very important since it could help us to predict metrics like Payback period,ROI% etc. with greater accuracy.
The analysis could be further improved if we can have more data for weekly price and monthly price since many people tend to stay for a week or a month.
We can use Time series and Forecasting to better predict the future prices and bringing both the data on the same time scale.
Further variables and factors like Crime rate, distance from tourist places/economic centers and other socio-economic factors could help us further in our analysis.

Installing and Loading required packages

## prerequisites (Please install these packages before loading)
## The commented code below install the list of packages 
#pkgs <- c("ggplot2","dplyr","tidyverse","plotly","knitr","kableExtra")
 
## Install required (CRAN) packages
#for (pkg in pkgs) {
#   if (!(pkg %in% installed.packages()[, "Package"])) {
#     install.packages(pkg)
#   }
# }

# Load the above packages
library(ggplot2)
library(dplyr)
library(tidyverse)
library(plotly)
library(knitr)
library(kableExtra)

Data Preparation and Analysis

Reading the data

fil <- gzfile("listings.csv.gz")                           # load the .gz file
bnbData <- read.csv(fil,header = T,na.strings=c("", "NA")) # read csv file
zillowData <- read.csv("Zip_Zhvi_2bedroom.csv",na.strings=c("", "NA"))

Analysis of Revenue Data (Airbnb Dataset)

dim(bnbData)

## [1] 48895   106

There are a total number of 48895 observations and 106 columns in the bnb dataset.

After analyzing the column names, we have to select the columns which seem relevant to our study. After selecting these columns, we have to check whether of any these columns have a large percent of missing values(greater than 50). Such a large number of missing values can be a problem for our analysis and lead to wrong results. So in this analysis, these columns have been removed accordingly.

The variables which were considered important for analysis are:

“id”,“neighbourhood_cleansed”,“neighbourhood_group_cleansed”,“city”,“zipcode”, “state” “latitude”,“longitude”,“room_type”,“price”,“bedrooms”,“cleaning_fee”,“square_fee t”,“weekly_price”,“monthly_price”,“review_scores_location”,“review_scores_rating”

bnbColNames <- c("id","neighbourhood_cleansed","neighbourhood_group_cleansed","city","state","zipcode","latitude",
                 "longitude","room_type","bedrooms","price","cleaning_fee","square_feet",
                 "weekly_price","monthly_price","review_scores_location","review_scores_rating")

Common functionality is developed for checking missing values of any dataframe so that it can be used later for other datasets in our analysis.

Checking for the percent of missing values for columns

checkMiss <- function(columnNames,df) {
missData <-  df %>%
             select(columnNames) %>%
            sapply(function(x) sum(is.na(x))*100/length(x))
missDf <- as.data.frame(round(missData[missData > 0],2))
colnames(missDf) <- "Missing_Values"
return(missDf)
}

Calling the above function

bnbMissDf <- checkMiss(bnbColNames,bnbData)
kable(bnbMissDf %>%
  rownames_to_column(var = "Variables") %>%
  arrange(desc(Missing_Values)))

Variables	Missing_Values
square_feet	99.17
monthly_price	89.27
weekly_price	87.72
review_scores_location	22.66
review_scores_rating	22.54
cleaning_fee	21.77
zipcode	1.06
city	0.13
bedrooms	0.04
state	0.01

Looking at the above values, three columns namely “square_feet”,“weekly_price” and “monthly_price” have a high percent of missing values (each greater than 70). So we have to remove them in our analysis.

bnbMissDf1 <- rownames_to_column(bnbMissDf,var = "Variables")
bnbColMiss <- bnbMissDf1[bnbMissDf1$Missing_Values > 70,"Variables"]  # Cols with > 70% missing values are stored

Removing Redundant Columns

Since we are looking only for 2 bedroom properties in NYC, so columns like “city”,“state” and “bedrooms” will no longer be required after filtering.

Similar analysis has to be done for “Reviews” related columns to see if they are adding any value to our dataset.

summary(bnbData$review_scores_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   20.00   92.00   96.00   93.93  100.00  100.00   11022

summary(bnbData$review_scores_location)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   9.000  10.000   9.565  10.000  10.000   11082

Apart from 1 or 2 outliers in the lower range both overall ratings and location reviews seem to be positive with both median and mean lying close to the max values in each case. Since all these reviews are positive, they are not adding any value in comparing two listings. Therefore, they are also removed.

bnbColsRedundant <- c("city","state","review_scores_location","review_scores_rating","bedrooms")
bnbColsRemoved <- c(bnbColsRedundant,bnbColMiss) # Final Cols which are getting removed

Filtering for final Revenue Dataset

Creating a function so that it becomes easy for the client in case they want to have data for different bedroom properties.

filterRevData <- function(bed,cols) {
  data <- bnbData %>%
            filter(bedrooms == bed) %>%
            select(cols)
  return(data)
}

Calling the above function

noBedroom <- 2
cols <- bnbColNames[-match(bnbColsRemoved,bnbColNames)]  #Removing Columns containing high missing values% and those are no longer required 
filteredBnbData <- filterRevData(noBedroom,cols)

Since the data is going to be merged on zipcodes, we have to check the number of missing values for zipcode column. There are still 50 missing zipcode values in our dataset.

Furthermore it should be checked whether all the zipcodes have consistent length.

table(sapply(as.character(filteredBnbData$zipcode),function(x) nchar(x)))

## 
##    5 
## 6447

All the zipcodes (except those with missing values) have 5 digits.

Similar to the checking missing values functionality, a common process has been followed for replacing missing values of important fields like zipcodes. These values are imputed by using the median values in the same neighbourhood and the borough as that of missing value.

replaceMissing <- function(df,missCol) {
sum_df <- data.frame(df)

data <- sum_df %>%
  select("id","neighbourhood_group_cleansed","neighbourhood_cleansed",missCol) %>%
  group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>%
  mutate(replaceMissing = median(as.numeric(get(missCol)),na.rm = TRUE))

for(i in 1:nrow(df))
{
  if(is.na(df[i,missCol]))
  {
    df[i,missCol] = ifelse(is.numeric(df[i,missCol]),data[i,"replaceMissing"],as.character(data[i,"replaceMissing"]))
  }
}
return(df)
}

Calling the above function

filteredBnbData$zipcode <- as.character(filteredBnbData$zipcode) 
zipReplacedData <- replaceMissing(filteredBnbData,'zipcode')

Final Revenue Dataset

finalRevData <- data.frame(zipReplacedData)
kable(head(finalRevData), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

id	neighbourhood_cleansed	neighbourhood_group_cleansed	zipcode	latitude	longitude	room_type	price	cleaning_fee
7750	East Harlem	Manhattan	10029	40.79685	-73.94872	Entire home/apt	$190.00	NA
13050	Bedford-Stuyvesant	Brooklyn	11221	40.68554	-73.94090	Entire home/apt	$115.00	$85.00
14290	Williamsburg	Brooklyn	11206	40.70420	-73.93560	Entire home/apt	$228.00	$128.00
15396	Chelsea	Manhattan	10001	40.74623	-73.99530	Entire home/apt	$375.00	$120.00
15711	Upper East Side	Manhattan	10162	40.77065	-73.95269	Entire home/apt	$250.00	$200.00
16458	Park Slope	Brooklyn	11215	40.67343	-73.98338	Entire home/apt	$225.00	NA

Data Quality Insights for Revenue Data: High percent of missing values were present in large number of columns.Data type for price fields is not numeric since they contain the $ sign.State and Zipcodes had many inconsistent/wrong entries but most of them were automatically removed after filtering for 2 bedroom properties.

Analysis of Cost Data (Zillow Dataset)

dim(zillowData)

## [1] 8946  262

The columns contained in the above dataset are:

RegionID
RegionName
Location Details (City,State,Metro,CountyName)
SizeRank
Property price from 1996-04 to 2017-06

Before merging the datasets we need to bring both the prices on the same time scale so as to make even comparison.

Average Price Trends across the years

## The following code was used to produce time series line chart in Tableau
zTime <- zillowData %>%
  filter(City == 'New York',State == 'NY') %>%
gather(key = "time",value ="price",-c("RegionID","RegionName","City","State","Metro","CountyName","SizeRank"))

zTime$time = gsub("[X]","",zTime$time)
zTime$time =gsub("[.]","-",zTime$time)
zTime$time = as.Date(paste(zTime$time,"01",sep = "-"))
# Write 'zTime' dataframe in csv and use it in Tableau
# write.csv(zTime,file = "timeData.csv")

Data Quality Insights for Cost Data: As seen in the above graph, a lot of data is missing from 1996 to 2004. So it will not be a good idea to incorporate them for calculating future prices. A better picture can be portrayed by using the data from the last 5 years. (2012-2017)

Cleaning Zillow Dataset

cleanZillow <- function(cityname,statename){
  
data <- zillowData %>%
        filter(City == cityname,State == statename)
return(data)
}

Calling the above function for suitable parameters : for New York City

city <- 'New York'
state <- 'NY'
filteredZillow <- cleanZillow(city,state)

Since there are a lot of fluctuations in the monthly data, we have to average out the prices for past each year. Using these prices, then we compute the average appreciation rate for the past 5 years. The function has been created in such a way that we can get growth rate for any past ‘n’ number of years no matter the latest data. So it would be useful in case the zillow data gets available even for the next two/three years after 2017.

growthRate <- function(df,y) {                                      # df = dataset,y = no. of years
costdf <- df %>%
               select((ncol(df) - y*12):(ncol(df)-1))               # Selecting price columns for previous 'y' years
for(i in 1:y){
  costdf[,paste('cost',i,sep = '_')] = rowMeans(costdf[,(12*(i-1)+1):(12*i)],na.rm = TRUE)  # calculating price means
}
costdf <- costdf[,-c(1:(y*12))]                                     # Retaining only price means for 'y' years
for(j in 1:(y-1)){
costdf[,paste('growth',j,sep = '_')] <- (costdf[,paste('cost',j+1,sep = '_')])/(costdf[,paste('cost',j,sep = '_')])-1 #Calculating appreciation rate from j to j+1 year
}
avg_growth <- rowMeans(costdf[grepl("(growth)", colnames(costdf))]) # taking means of appreciation rate calculated above
return(avg_growth)
}

Creating Final Cost Data

colnames(filteredZillow)[colnames(filteredZillow)=="RegionName"] <- "zipcode"  # Rename to zipcode
colnames(filteredZillow)[ncol(filteredZillow)] <- "propertyCost" # Renaming the most recent cost as the propertyCost
zillowCols <- c('zipcode','SizeRank','propertyCost')             # Columns considered important for our analysis

finalCostData <- filteredZillow %>%
                select(zillowCols)

n <- 5      # Number of past years data we are considering for analysis
finalCostData$avg_growth <- growthRate(filteredZillow,n)  # Combining the avg growth rate computed for the past 'n' years

A look at the Final Cost data

kable(head(finalCostData))

zipcode	SizeRank	propertyCost	avg_growth
10025	1	1431000	0.0965309
10023	3	2142300	0.0892191
10128	14	1787100	0.0975620
10011	15	2480400	0.0920633
10003	21	2147000	0.0850568
11201	32	1420700	0.1261907

Merging the two datasets

finalData <- merge(finalRevData,finalCostData,by="zipcode")
str(finalData)

## 'data.frame':    1569 obs. of  12 variables:
##  $ zipcode                     : chr  "10003" "10003" "10003" "10003" ...
##  $ id                          : int  19288937 13866112 31835260 5207973 17843998 4126452 35415489 2150328 31408487 3984168 ...
##  $ neighbourhood_cleansed      : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ latitude                    : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude                   : num  -74 -74 -74 -74 -74 ...
##  $ room_type                   : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ price                       : Factor w/ 674 levels "$0.00","$1,000.00",..: 244 129 208 222 244 189 244 635 640 134 ...
##  $ cleaning_fee                : Factor w/ 206 levels "$0.00","$1,000.00",..: 12 192 21 162 181 25 25 4 4 43 ...
##  $ SizeRank                    : int  21 21 21 21 21 21 21 21 21 21 ...
##  $ propertyCost                : int  2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
##  $ avg_growth                  : num  0.0851 0.0851 0.0851 0.0851 0.0851 ...

Data Cleaning for the final dataset

finalData$price = as.numeric(gsub("[$,]","",as.character(finalData$price)))
finalData$cleaning_fee = as.numeric(gsub("[$,]","",as.character(finalData$cleaning_fee)))
finalData$zipcode = as.factor(finalData$zipcode)
finalData$id = as.character(finalData$id)
finalData$SizeRank = as.factor(finalData$SizeRank)

Checking Missing Values in the final data

checkMiss(colnames(finalData),finalData)    #Using the function that was created earlier

##              Missing_Values
## cleaning_fee           13.7

Cleaning Fee has about 13.7% missing values. Since it is an important parameter in determining the returns for a listing so we have to impute it.

Impute Missing values of the Cleaning fee

finalData_V1 <- replaceMissing(finalData,'cleaning_fee') #Using the function we created earlier for replacing zipcodes
checkMiss('cleaning_fee',finalData_V1)

##              Missing_Values
## cleaning_fee           0.13

finalData_V1[which(finalData_V1$cleaning_fee %in% NA),'cleaning_fee'] <- 0 # Since there are only 1.3% missing values we
                                                                           # can impute them as 0

Checking outliers in rent

p1 <- ggplot(finalData_V1,aes(x = neighbourhood_group_cleansed,y = price,fill = neighbourhood_group_cleansed)) +
      geom_boxplot() + theme(legend.position = "none")

ggplotly(p1)

From the boxplot, it is visible that there are 3 listings with quite a high price(greater than 2000$/day): 2 in Manhattan and 1 in Brooklyn. Such high prices could affect our analysis as they could be a result of some data entry error and there are not many listings with such astronomically high charges so it would be better to remove those observations.

Removing outliers in rent

finalData_V2 <- finalData_V1 %>%
                filter(price <= 2000)

Checking whether zipcodes are unique for a Neighbourhood group

finalData_V2 %>%
  select("zipcode","neighbourhood_group_cleansed") %>% group_by(zipcode) %>%
  summarise(total = n_distinct(neighbourhood_group_cleansed)) %>% filter(total > 1)

## # A tibble: 1 x 2
##   zipcode total
##   <fct>   <int>
## 1 10013       2

10013 zipcode is present in two neighbourhoods - Brooklyn and Manhattan. However it is an area belonging to Lower Manhattan. So updating the neighbourhood of that observation from Brooklyn to Manhattan.

index <- which(finalData_V2$zipcode %in% 10013 & finalData_V2$neighbourhood_group_cleansed %in% 'Brooklyn')
finalData_V2[index,'neighbourhood_group_cleansed'] <- 'Manhattan'

A glimpse at the final dataset

kable(head(finalData_V2), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")

zipcode	id	neighbourhood_cleansed	neighbourhood_group_cleansed	latitude	longitude	room_type	price	cleaning_fee	SizeRank	propertyCost	avg_growth
10003	19288937	East Village	Manhattan	40.72784	-73.98883	Entire home/apt	250	110	21	2147000	0.0850568
10003	13866112	East Village	Manhattan	40.72394	-73.98872	Entire home/apt	165	85	21	2147000	0.0850568
10003	31835260	Gramercy	Manhattan	40.73782	-73.98547	Entire home/apt	218	120	21	2147000	0.0850568
10003	5207973	East Village	Manhattan	40.72950	-73.98623	Entire home/apt	230	60	21	2147000	0.0850568
10003	17843998	East Village	Manhattan	40.72571	-73.98789	Entire home/apt	250	75	21	2147000	0.0850568
10003	4126452	Gramercy	Manhattan	40.73327	-73.98503	Entire home/apt	200	125	21	2147000	0.0850568

Structure of the final data

str(finalData_V2)

## 'data.frame':    1566 obs. of  12 variables:
##  $ zipcode                     : Factor w/ 24 levels "10003","10011",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ id                          : chr  "19288937" "13866112" "31835260" "5207973" ...
##  $ neighbourhood_cleansed      : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ latitude                    : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ longitude                   : num  -74 -74 -74 -74 -74 ...
##  $ room_type                   : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ price                       : num  250 165 218 230 250 200 250 850 88 170 ...
##  $ cleaning_fee                : num  110 85 120 60 75 125 125 100 100 150 ...
##  $ SizeRank                    : Factor w/ 24 levels "1","3","14","15",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ propertyCost                : int  2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
##  $ avg_growth                  : num  0.0851 0.0851 0.0851 0.0851 0.0851 ...

Visualization & Insights

Zipcode Analysis

Payback period

The zipcodes with less payback period will be better for investment as the investor will be able to quickly recover cash from their initial investment.

occupancy <- 0.75
nodays <- 3 # Number of days the apartment will be rented  

finalData_v3 <- finalData_V2 %>%
                  mutate(presentCost = propertyCost*(1 + avg_growth)**2) %>%
                  mutate(paybackPeriod = presentCost/(occupancy*price + ((occupancy*cleaning_fee)/nodays)))

  

kable(finalData_v3 %>%
  group_by(zipcode) %>%
  summarise(years = mean(paybackPeriod)/365) %>%
  arrange(years) %>%
  head(n = 5),caption = "Zipcodes with least payback period") %>% kable_styling('striped')

Zipcodes with least payback period
zipcode	years
10306	9.478526
10303	11.966507
11234	12.077044
11434	12.778073
10305	13.361443

finalData_v3 %>%
  group_by(zipcode,neighbourhood_group_cleansed) %>%
  summarise(years = mean(paybackPeriod)/365) %>%
  arrange(years) %>%
  ggplot(aes(x = reorder(zipcode,-years),y = years,fill=neighbourhood_group_cleansed)) + geom_col() + labs(y = "Payback Period",x =  "zipcodes")+ coord_flip() + geom_hline(aes(yintercept = mean(years)), color="blue",linetype="dashed") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())

Zipcodes : 10306,10303,11234,11434 and 10305 have the least amount of payback period.(Dashed blue line shows the average)

Return on Investment(ROI%)

ROI = (Gain from Investment - Cost of Investment)/Cost of Investment

We can use simple ROI% calculation here since the investor will be paying for the property in cash. Suppose they are going to hold the property for 5 years and want to get high ROI%.

noYears <- 5 #The period for which we are calculating ROI%
finalData_v4 <- finalData_v3 %>%
                 mutate(ROI = (((presentCost*(1+avg_growth)**noYears) + (occupancy*price*365*noYears)                                                +((cleaning_fee/nodays)*occupancy*365*noYears) - presentCost)/presentCost)*100)

finalData_v4 %>%
  group_by(zipcode,neighbourhood_group_cleansed) %>%
  summarise(avg_ROI = mean(ROI)) %>%
  arrange(desc(avg_ROI)) %>%
  ggplot(aes(x = reorder(zipcode,avg_ROI),y = avg_ROI,fill = neighbourhood_group_cleansed)) + geom_col() + coord_flip() +   labs(y = "ROI%",x = "zipcodes") + geom_hline(aes(yintercept = mean(avg_ROI)), color="blue",linetype="dashed") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())

Zipcodes : 11217,11231,11201,11215 and 11434 have the highest ROI% among all the others.(Dashed blue line shows the average)

Number of properties by zipcodes

finalData_V2 %>%
  ggplot(aes(x = zipcode,fill = neighbourhood_group_cleansed)) +
  geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
  theme_bw() + labs(y = "Number of Properties") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),panel.grid.major = element_blank(),panel.grid.minor = element_blank(),axis.ticks.y = element_blank(),axis.text.y = element_blank(),panel.border = element_blank()
        ,axis.ticks.x = element_blank())

Zipcodes : 11215,10036,10003,10025 and 11217 have the highest number of property listings indicating the popularity of the area.

kable(finalData_V2 %>%
  group_by(zipcode,neighbourhood_group_cleansed) %>%
  summarise(growthPercent = round(mean(avg_growth),4)*100,number_of_properties = n()) %>%
  arrange(desc(growthPercent)) %>%
  head(n = 5),caption = "Zipcodes with Highest Property Appreciation Rate(from 2012-17)")

Zipcodes with Highest Property Appreciation Rate(from 2012-17)
zipcode	neighbourhood_group_cleansed	growthPercent	number_of_properties
11217	Brooklyn	14.18	123
11231	Brooklyn	13.78	93
11201	Brooklyn	12.62	86
11215	Brooklyn	11.82	190
10128	Manhattan	9.76	65

Out of the above, both 11217 and 11215 had high property appreciation rates and also large number of properties.

Borough Analysis

Price(per night) distribution across neighbourhood

p2 <- ggplot(finalData_v3,aes(price,fill = neighbourhood_group_cleansed)) + geom_density(alpha = 0.2) + 
  scale_x_log10() + ggtitle("Price(per night) distribution")
ggplotly(p2)

Property Cost distribution across neighbourhood

finalData_v3 %>%
  group_by(neighbourhood_group_cleansed) %>%
  summarise(avg_cost = mean(presentCost)) %>%
ggplot(aes(neighbourhood_group_cleansed,avg_cost,fill = neighbourhood_group_cleansed)) + geom_col(position = "dodge") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank()) + labs(y = "Avg Property Cost")

Manhattan has high property prices along with high rental prices as compared to the other three.
Queens and Staten Island have comparatively low prices in terms of both property and rent as compared to Brooklyn and Manhattan.

Further Insights

Does population of an area(SizeRank) affect Price(per night)?

p3 <- ggplot(finalData_V2,aes(x=SizeRank,y = price,col = neighbourhood_group_cleansed)) +
      geom_point() +
      geom_jitter() + theme(legend.title = element_blank(),axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
ggplotly(p3)

Most of the prices lie in the region below 1000$ mark indicating there is not any noticeable relation between price per night and the population of that area.A more visible relation was seen in the earlier graph when the price distribution of all the boroughs was compared showing that Manhattan and Brooklyn had higher prices as compared to the other two.

Do Queens and Staten Island have lower prices because of higher number of private rooms?

p4 <- finalData_V2 %>%
ggplot(aes(neighbourhood_group_cleansed,price,fill = room_type)) + geom_col(position = "dodge",width = 0.8) + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank()) + labs(y = "Price(per night)")
ggplotly(p4)

Since private rooms are much cheaper as compared to renting an entire home, we wanted to know whether low prices in Staten Island and Queens are because they contain higher proportion of private rooms. However, from the graph it is easily visible that prices of both room types tend to follow same pattern across the neighbourhoods. Hence, we can conclude that rental prices are low in Staten Island and Queens irrespective of whether it’s a private room or an entire home.

Identifying Profitable Areas in NYC for Short-term Rental Investments

Ashutosh Singh

1/27/2020

Executive Summary

Introduction

Datasets

Assumptions

Procedure

Key Insights

Recommendations

What’s Next

Data Preparation and Analysis

Reading the data

Analysis of Revenue Data (Airbnb Dataset)

Analysis of Cost Data (Zillow Dataset)

A look at the Final Cost data

Merging the two datasets

Data Cleaning for the final dataset

Visualization & Insights

Zipcode Analysis

Borough Analysis

Further Insights