Introduction and Approach

My goal here is to recommend the zip codes in New York City that will generate the most profit on short-term rentals for a real estate company interested in investing in two-bedroom properties. My recommendations are based on publicly available data from Zillow and AirBnB. The assumptions and approach underlying the analysis are stated below:

Assumptions and Known Information

Given Assumptions:

  • Cost data: Zillow provides an estimate of value for two-bedroom properties
  • Revenue data: AirBnB is the medium through which the investor plans to lease out their investment property. We are able to see how much properties in certain neighbourhoods rent out for in New York City
  • Assuming an occupancy rate of 75%
  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for)
  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now)
  • All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)

Additional Assumptions

I have made a few additional assumptions beyond the given ones. Those that are very specific to a dataset are mentioned in the respective Data Processing tabs. Below are the assumptions which are critical and decide the overall approach of the project:

  • Time Value of Money:

    • It is given that $1 today is worth the same 100 years from now. I am starting the time frame of the 0% discount rate at August 2019, as that is when the AirBnB data was scraped.
    • The Zillow dataset gives house prices only up to June 2017, so I am forecasting the price for August 2019 based on previous trends.
  • Occupancy Assumptions:

    • To calculate profitability, I will compute the median net operating income for a year, so I am interested in the number of days a property is occupied in a year. For this purpose, I will use the availability_365 column, which indicates how many days the property is available for rent within the next 365 days
    • A large number of listings have zero or near-zero values in the availability_365 column. I am assuming these properties are pre-booked for the rest of the year, so any given listing is treated as pre-booked for 365 - availability_365 days
    • The number of days the property is occupied in a year is calculated based on the following formula (see the sketch after this list):
      \[ occupancy = availability\_365*0.75 + (365 - availability\_365)\]
    • Whenever a 2 BR property is occupied, the number of guests in that property is 3
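
To make the occupancy assumption concrete, here is a minimal sketch of the calculation for a few illustrative availability_365 values (the input vector is made up for demonstration):

# Occupancy formula: pre-booked days count fully, open days count at 75% occupancy
availability_365 <- c(0, 45, 180, 365)  # hypothetical example values
occupancy <- availability_365*0.75 + (365 - availability_365)
round(occupancy)  # 365 354 320 274 days occupied per year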

Approach

I will first clean and explore both datasets and then merge them in subsequent tabs. Based on the findings of that exploration, I will decide on the metrics to gauge profitability and suggest the most profitable zipcodes in New York City.

Profitability Metrics

Please find below a brief description of each profitability metric; a small worked example follows these definitions. More explanation is given in the Data Merge and Processing tab, where they are calculated.

Yearly Earning:
\[ yearly\_earning = 3*rent\_per\_day*med\_occupancy \]

Capitalization Rate: Ratio of net yearly income to the property's market value. \[cap\_rate = (yearly\_earning/cost)*100 \]

Profitability Segment: Based on the capitalization rate, properties are divided into High, Medium and Low segments

Breakeven Period: Number of years required to recover the initial investment, assuming prices remain constant \[breakeven\_period = cost/yearly\_earning\]
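
As a quick illustration of how these metrics relate, the following sketch computes all three for a single hypothetical zipcode (all input numbers are made up):

# Hypothetical inputs: $150/night median rent, 300 occupied days, $900,000 property cost
rent_per_day <- 150
med_occupancy <- 300
cost <- 900000

yearly_earning <- 3*rent_per_day*med_occupancy  # 3 guests assumed -> $135,000
cap_rate <- (yearly_earning/cost)*100           # 15 -> "Medium" profitability segment
breakeven_period <- cost/yearly_earning         # ~6.7 years to recover the investment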

Please review the tabs from left to right. Also, have a look at the Interactive Map section, where you can view all the profitability information about the zipcodes on a map of New York City.

Initial Setup

Packages Required

The following packages were used to arrive at the recommendations:

  • tidyverse : Used in data processing and data transformation as well as for data visualization
  • readr : Used for importing CSV data files
  • DT : Used for displaying a table in HTML
  • leaflet : Used to add Interactive Map
  • glue : Used for Concatenating Name and Count in Graphs
  • forecast : Used for time-series forecasting and visualizing time-series
  • scales : Scaling variables for better visualization
# Checking for missing dependencies and installing any required R packages
if (!require(tidyverse)) {install.packages('tidyverse')}
if (!require(readr))     {install.packages('readr')}
if (!require(DT))        {install.packages('DT')}
if (!require(leaflet))   {install.packages('leaflet')}
if (!require(glue))      {install.packages('glue')}
if (!require(forecast))  {install.packages('forecast')}
if (!require(scales))    {install.packages('scales')}

# Loading the Libraries 
library(tidyverse) # Used in data processing and data transformation as well as for data visualization 
library(readr)    # Used for importing data CSV files
library(DT)       # Used for displaying table in HTML
library(leaflet)  # Used to add Interactive Map
library(glue)     # Used for Concatenating Name and Count in Graphs
library(forecast) # For computing and visualizing basic time series components 
library(scales)   # Scales function for visualization

Setting Up Working Directory

Setting the working directory to the project location on your device. Please edit the path below to match the location where you saved the files.

# Setting Working Directory to the local path. Please change it according to your device location
#setwd('')

Data Import

Cost Data: Zillow Dataset

# Importing Data Set
zillow_dataset <- readr::read_csv('Data/Zip_Zhvi_2bedroom.csv')

Revenue Data: AirBnb Dataset

While importing, there was a warning about parsing failures. Checking with the problems() function showed that the failures were in the zipcode, jurisdiction_names and license variables. I won't be using jurisdiction_names or license anywhere in my analysis, but zipcode is a critical variable. Hence, while importing, the datatype of zipcode is explicitly specified as character.
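
For reference, below is a minimal sketch of that diagnostic step on a first-pass import; the object name airbnb_raw is hypothetical, and readr::problems() simply lists the rows and columns that failed to parse:

# First pass with default column-type guessing; parsing warnings are expected here
airbnb_raw <- readr::read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz')
# Inspecting the reported parsing failures
readr::problems(airbnb_raw)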

# Importing Data Set
airbnb <- readr::read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz',col_types = cols(zipcode = col_character()))

NYC: Zipcode and Neighbourhood mapping

The ZIP-to-Neighbourhood mapping, scraped from the website nycbynatives.com, is imported here.

I have not imported it as a .csv file, as I felt it would be inconvenient to import an additional .csv file every time you run the code.

zip_by_neighbourhood <- data.frame("zipcode" = c("10001","10002","10003","10004","10005","10006","10007","10009","10010","10011","10012","10013","10014","10015","10016","10017","10018","10019","10020","10021","10022","10023","10024","10025","10026","10027","10028","10029","10030","10031","10032","10033","10034","10035","10036","10037","10038","10039","10040","10041","10044","10045","10048","10055","10060","10069","10090","10095","10098","10099","10103","10104","10105","10106","10107","10110","10111","10112","10115","10118","10119","10120","10121","10122","10123","10128","10151","10152","10153","10154","10155","10158","10161","10162","10165","10166","10167","10168","10169","10170","10171","10172","10173","10174","10175","10176","10177","10178","10199","10270","10271","10278","10279","10280","10281","10282","10301","10302","10303","10304","10305","10306","10307","10308","10309","10310","10311","10312","10314","10451","10452","10453","10454","10455","10456","10457","10458","10459","10460","10461","10462","10463","10464","10465","10466","10467","10468","10469","10470","10471","10472","10473","10474","10475","11004","11101","11102","11103","11104","11105","11106","11109","11201","11203","11204","11205","11206","11207","11208","11209","11210","11211","11212","11213","11214","11215","11216","11217","11218","11219","11220","11221","11222","11223","11224","11225","11226","11228","11229","11230","11231","11232","11233","11234","11235","11236","11237","11238","11239","11241","11242","11243","11249","11252","11256","11351","11354","11355","11356","11357","11358","11359","11360","11361","11362","11363","11364","11365","11366","11367","11368","11369","11370","11371","11372","11373","11374","11375","11377","11378","11379","11385","11411","11412","11413","11414","11415","11416","11417","11418","11419","11420","11421","11422","11423","11426","11427","11428","11429","11430","11432","11433","11434","11435","11436","11691","11692","11693","11694","11697"),
"neighbourhood" = c("Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens"))

# Data Preview
head(zip_by_neighbourhood,10)

Functions used for reproducibility

This section will contain functions which are created for the purpose of reproducibility.

Please note that this function was written only after the data cleaning was complete, because earlier I had no visibility of the data. Due to time constraints, I was only able to create one such function, but care was taken while coding to make it user-friendly and easy to tweak.

In both datasets, viz. Zillow and AirBnB, zipcode is the most important identifier. Since the end goal is to identify the most profitable zipcodes, cleaning the zip column is an important task that will be performed in every such assignment. Here, I am writing a function to clean the zip column.

This function is inspired by a similar function in the zipcode package in R, which is now archived. The link to the package can be found here

zipclean <- function(zipcode){
  # Converting zipcode to character variable type
  zipcode <- as.character(zipcode)

  # Removing all whitespace
  zipcode <- gsub('\\s', '', zipcode)

  # For ZIP+4 codes containing a "-", keeping only the digits before it
  zipcode <- gsub('^([0-9]+)-[0-9]+', "\\1", zipcode)

  # Converting empty strings to NA
  zipcode[grepl('^$', zipcode)] <- NA

  # Restoring leading zeroes to codes shorter than 5 digits
  zipcode <- str_pad(zipcode, width = 5, side = "left", pad = "0")

  # Keeping only the first 5 digits
  zipcode <- str_sub(zipcode, start = 1, end = 5)

  return(zipcode)
}
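
A quick usage sketch on some made-up messy inputs shows the intended behaviour:

# Hypothetical messy zipcodes: whitespace, ZIP+4, empty string, missing leading zero
zipclean(c(" 10019 ", "11217-3044", "", "983"))
# Expected result: "10019" "11217" NA "00983"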

More such functions can be written; for example, quality checking is done at several steps, so a function like the one below could perform the basic quality checks.

quality_check <- function(dataset){
  cat("The structure of the dataset is as follows:\n")
  str(dataset)
  cat("The dimensions of the dataset are:", dim(dataset), "\n")
  cat("The summary of the dataset is as follows:\n")
  print(summary(dataset))
  cat("The number of duplicate records present in the dataset is:",
      sum(duplicated(dataset)), "\n")
}

Please note that I have not used the above function in my code, as the need for it was realized only after the data cleaning was complete. Also, the function is suitable for small datasets with few variables, since the structure and summary become difficult to read when there are many variables.
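
For illustration only, the function could be called on a small slice of the neighbourhood mapping created earlier:

# Example call on a small dataset (not part of the main analysis)
quality_check(head(zip_by_neighbourhood, 20))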

Similarly, more such functions could be created, but because of the time constraint I was not able to focus much on this part of the process.

Data Processing

Cost Data

Data Cleaning

From the initial exploration of Zillow dataset, it can be seen that:

  • There are 8946 observations of 262 variables
  • The dataset contains the historical median price of 2 bedroom properties from 1996-04 to 2017-06 for areas whose zipcode is given by the RegionName column. Additionally, it contains information about City, State, Metro and CountyName. It also contains SizeRank, which indicates the population of the area: the lower the number, the greater the population. RegionID is Zillow’s unique identifier
  • There are missing values present in 2,978 observations
  • The data is unique at both the RegionID and RegionName levels, and there are 25 unique zipcodes for New York City

First 10 rows of Zillow dataset

# Checking the top 10 rows of the dataset
DT::datatable(head(zillow_dataset,10), extensions = 'FixedColumns',
  options = list(
  dom = 't',
  scrollX = TRUE,
  fixedColumns = TRUE
))
#structure of the dataset along with dimensions
str(zillow_dataset)

#High level Summary of data
summary(zillow_dataset)

#Checking missing values in each column
colSums(is.na(zillow_dataset))
# Checking Record having missing value
count(zillow_dataset[rowSums(is.na(zillow_dataset)) > 0,])

# Checking count of unique records in data
count(unique(zillow_dataset))
# Checking number of duplicate records
nrow(zillow_dataset[which(duplicated(zillow_dataset) == TRUE),]) 

# Checking Unique number of region ID's and Region name's
length(unique(zillow_dataset$RegionID))
length(unique(zillow_dataset$RegionName))

# Checking the number of records for each city
arrange(as.data.frame(table(zillow_dataset$City)),desc(Freq))
  • All zipcodes have 5 characters and none of them contains any non-digit characters
# Checking length of ZIP Codes
unique(zillow_dataset %>% transmute(num_digits_zip = str_length(RegionName)))
# Checking for any non digit zipcode
count(zillow_dataset %>% filter(str_detect(RegionName, "[^0-9]")))
# Cleaning Zipcode
# There is no need to clean the zipcode in this case, but I am cleaning it anyway so that it is part of the standard process
zillow_dataset$RegionName <- zipclean(zillow_dataset$RegionName)

Now, as our firm is specifically interested in renting out properties in New York City, we will only keep the properties of New York City for our further analysis.

# Removing RegionID, State, Metro and CountyName, as I don't expect to need them later; if I do, I can easily pull them back in on the basis of RegionName
# I am keeping SizeRank because I may use it again later in my analysis
zillow_dataset2 <- subset(zillow_dataset,select = -c(RegionID,State,Metro, CountyName))

# QCing new dataset
head(zillow_dataset2)
str(zillow_dataset2)

# The number of observations is the same but the number of variables is reduced by 4, which is the expected result

# Filtering Data for New York City
nyczillow <- filter(zillow_dataset2,City == "New York")

# QC: Checking the count of filtered dataset
count(unique(nyczillow))

# As the count matches with the count in my above observation we can safely proceed forward 

# QC: Checking missing values
colSums(is.na(nyczillow))

# There are missing values present

The data is a monthly time series. Let us plot it; for that, it is necessary to convert the existing dataset into a time series object in R.

# Removing City and SizeRank as we don't need them for the time series analysis
temp_nyczillow <- subset(nyczillow, select = -c(City, SizeRank))
dim(temp_nyczillow)

# Converting shape of dataframe from wide to long
longts <- gather(temp_nyczillow,month,median_cost,2:256)
head(longts)
dim(longts)

# Now I want to have different zipcodes as the column names for that I will need to spread the above dataset
widets <- spread(longts,RegionName,median_cost)
head(widets)
dim(widets)


# Converting dataframe to timeseries
nyczillow.ts <- ts(widets[,2:26], start = c(1996, 4), frequency = 12)
head(nyczillow.ts)

Time Series Plot for Cost Data

The following plot shows a lot of missing data in the earlier years.

# Time Series Plot
forecast::autoplot(nyczillow.ts) +
  scale_y_continuous(labels = comma) +
labs(title = "How does the historical median prices of two bedroom properties in New York City look?",
     subtitle = "The graph shows variation in Median Prices of 2 bedroom propeties for 25 zipcodes in New York City",
       y = "Median Price of 2 Bedroom Property",
       x = NULL) +
labs(colour = "Zipcodes")

For predicting house prices for August 2019 (the period in which the AirBnB data was scraped), I will consider the median price of properties starting from the year 2010.

Time Series Plot for Cost Data from year 2010

# Filtering data only after year 2010
nyczillow.ts.2010 <- window(nyczillow.ts, start = c(2010, 1))
# QC
head(nyczillow.ts.2010)
dim(nyczillow.ts.2010)
# Time Series Plot after 2010
forecast::autoplot(nyczillow.ts.2010) +
  scale_y_continuous(labels = comma) +
labs(title = "Variation in the median prices of two bedroom properties in New York City since 2010",
       y = "Median Price of 2 Bedroom Property",
       x = NULL) +
labs(colour = "Zipcodes")

Predicting housing prices for August 2019

Due to the time constraint, I want a quick method to forecast future values, so I will use auto.arima() to forecast housing prices up to August 2019. As auto.arima() only works on a univariate time series, I will fit a model to each series separately and then predict the housing price for August 2019. These values are added to a manually created cost column in the NYC Zillow dataset created earlier.

# Creating new dataset for adding property value for August 2019
zillow_dataset3 <- filter(zillow_dataset2,City == "New York")
# Adding blank column for cost 
zillow_dataset3$cost <- as.integer(NA)
#QC
names(zillow_dataset3)
dim(zillow_dataset3)
# Keeping only relevant columns (cost is the last column, so it must be excluded from the drop)
zillow_dataset3 <- subset(zillow_dataset3,select = -c(4:(ncol(zillow_dataset3) - 1)))
# Separating the widets dataset into individual time series
for (i in 2:ncol(widets))
{
  temp.nyczillow.ts <- ts(widets[,i], start = c(1996, 4), frequency = 12)
  temp.nyczillow.ts.2010 <- window(temp.nyczillow.ts, start = c(2010, 1))
  model <- auto.arima(temp.nyczillow.ts.2010)
  # August forecast i.e. 26 months ahead
  nycforecast2 <- forecast(model, h = 26)
  # Property Cost for August
  predictedvalue2 <- nycforecast2$mean[26]
  # adding property values for August-2019 in Zillow dataset
  zillow_dataset3$cost[zillow_dataset3$RegionName == colnames(widets[,i])] <- predictedvalue2
}
#QC
dim(zillow_dataset3)
head(zillow_dataset3)
names(zillow_dataset3)

Changing the name of the RegionName column to zipcode, as we will use this dataset later, when we merge it with the AirBnB dataset to find the most profitable zipcodes in New York City.

# Changing the column name
colnames(zillow_dataset3)[colnames(zillow_dataset3) == "RegionName"] <- "zipcode"

Rounding off the property cost for future convenience

zillow_dataset3$cost <- round(zillow_dataset3$cost, digits = 0)
# Renaming Dataset
zillow_final <- zillow_dataset3

Data Preview

Our final cleaned cost data with the latest property value looks as follows:

DT::datatable(zillow_final,class = "stripe hover")

EDA

The key takeaway from this section is that, for zipcodes within New York City with missing cost price, the median cost price of the neighbourhood cannot be imputed because the sample size is too small (see the box plot below).

Let us look at how these 25 zipcodes in New York City compare with each other in terms of median property cost. I am also adding the size rank metric to the following plot to see the impact of an area's population on property cost. The bubble size is directly proportional to the population of the area.

  • The 10013 zipcode has the highest median cost for a 2 bedroom property. The difference between the costliest and the 2nd costliest zipcode is $875,055
  • The cost of a 2 bedroom property in 10025, the most populous zipcode in New York City (as well as in the whole US!), is $1,599,344, which is very close to the median 2 bedroom house price in NYC
  • Zipcodes like 10025 seem more useful to us than 10013, as they combine a much lower housing price with a larger population. Hence we may get a less expensive property in the 10025 area with a high occupancy rate or high demand because of the large population
  • I am also particularly interested in the properties highlighted in the graph. Most of them have a low house price and a high population
zillow_final %>%
  mutate(zipcode = fct_reorder(zipcode, cost)) %>%
  ggplot(aes(cost, zipcode, size = SizeRank)) +
  scale_size(trans = 'reverse', range = c(0.1,8), breaks = c(5,50,1000,4000)) +
  geom_point(alpha = 0.8) +
  geom_rect(xmin = 0,xmax = 700000, ymin = 0, ymax = 11, alpha = 0.01) +
  scale_x_continuous(labels = comma) +
  labs(x = "Median cost of 2 bedroom property",
       y = "",
       size = "SizeRank",
       title = "How does the median price of 2 bedroom properties compare for different zipcodes in NYC?",
       subtitle = "Cost price is only available for 25 zipcodes in New York City in Zillow dataset")

Neighbourhoodwise cost analysis

Here, let us see the variation in the cost of 2 bedroom prices across the neighbourhoods.

  • There is no cost data available for the Bronx neighbourhood
  • Although the cost price in various Manhattan zipcodes is near the median, the variation is considerable, so I cannot impute the median value for the Manhattan zipcodes whose cost price is not available.
  • The cost price is available for only 25 zipcodes, and the sample size shrinks further when split across the different neighbourhoods; hence, as stated above, I will not impute the neighbourhood median cost price for zipcodes that have no cost data
zillow_neighbourhood <- merge(x = zillow_final, y = zip_by_neighbourhood, by.x = "zipcode", by.y = "zipcode", all.x = TRUE)


zillow_neighbourhood %>%
  add_count(neighbourhood) %>%
  mutate(neighbourhood = glue::glue("{ neighbourhood } ({ n })")) %>%
  ggplot(aes(neighbourhood, cost)) +
  geom_boxplot() +
  scale_y_continuous(labels = comma) +
  geom_jitter() +
  labs(title = "How does cost of 2 bedroom houses differs accross different neighbourhoods of New York City?",
       subtitle = "Black dots represents median cost price in each zipcode within neighbourhood")

Moving forward, I will identify the profitability of only those zipcodes for which the cost price is available.

Revenue data

Data Cleaning

From initial exploration of AirBnB dataset, it can be seen that:

  • There are 48895 observations of 106 variables
# For verification
dim(airbnb)

The top 10 rows of the dataset are:

DT::datatable(head(airbnb, 10),extensions = 'FixedColumns',
  options = list(
  dom = 't',
  scrollX = TRUE,
  scrollY = "300px",
  fixedColumns = TRUE
))
  • There are no duplicate records present when checked at all levels
# Checking number of duplicate records
nrow(airbnb[which(duplicated(airbnb) == TRUE),])
## [1] 0
  • There are 48895 records in total that contain a missing value, i.e. every record has at least 1 missing value. Please find below the number of missing values per column
#Checking missing values in each column
DT::datatable(as.data.frame(colSums(is.na(airbnb))),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
# Checking count of records having missing value
count(airbnb[rowSums(is.na(airbnb)) > 0,])

There are 106 columns, of which only some will be of use to me. Hence, before further exploration, I will start deleting columns based on the descriptions provided in the metadata.

The 106 columns are:

DT::datatable(as.data.frame(colnames(airbnb)),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))

The main interest is in recommending zipcodes, not individual properties; hence I will first clean out the columns that are specific to individual listings.

There is a group of fields providing information about the host of the listing, like host_id, host_url, host_name, host_response_rate, etc. Similarly, there are a few fields for the guest profile. I will remove these fields first, as I will not need host or guest information for my analysis.

# Identifying column names starting with "host_"
airbnb[,str_detect(colnames(airbnb),"^host_")]
# Identifying column names containing "guest"
airbnb[,str_detect(colnames(airbnb),"guest")]
# Dropping these columns
airbnb_2 <- airbnb[,!str_detect(colnames(airbnb),"^host_")]
airbnb_3 <- airbnb_2[,!str_detect(colnames(airbnb_2),"guest")]
# Checking the dimensions
dim(airbnb_3)

Similarly, columns which contain any URL are not important to us, hence I am removing them.

# Identifying column names containing "_url"
airbnb_3[,str_detect(colnames(airbnb_3),"_url")]
# Dropping these columns
airbnb_4 <- airbnb_3[,!str_detect(colnames(airbnb_3),"_url")]
# Checking the dimensions
dim(airbnb_4)

I want to keep only columns that I can roll up to the zipcode level and use to draw conclusions at that level. For this, I will check the number of unique values present in each column and remove columns accordingly, i.e. I will remove columns that have zero variance (only 1 unique value) or extremely high variance (> 40000 unique values).

# Number of Unique values per column
unique_values <- airbnb_4 %>% summarise_all(n_distinct)
DT::datatable(as.data.frame(t(unique_values)),class = "stripe hover")
# Removing columns with zero variance and extreme high variance
column_names <- colnames(unique_values[colSums(unique_values) != 1 & colSums(unique_values) < 40000])

Removing the columns detected in the above code chunk

# Checking the dataset after removing above identified columns
airbnb_4[,names(airbnb_4) %in% column_names]
# Creating new database
airbnb_5 <- airbnb_4[,names(airbnb_4) %in% column_names]

Now I will remove columns manually based on the metadata and the structure of the existing dataset. The columns being removed are either specific to a particular AirBnB listing or won't add much value to our zip-level analysis, like the state column.

# Checking the structure of existing dataset. It will aid me to identify the columns to drop
str(airbnb_5)
dim(airbnb_5)
# Creating vector of column names which I want to remove
column_names_2 <- c("space",
                  "neighborhood_overview",
                  "notes", 
                  "transit",
                  "access",
                  "interaction",
                  "house_rules",
                  "street",
                  "neighbourhood",
                  "neighbourhood_cleansed",
                  "city",
                  "state",
                  "market",
                  "smart_location",
                  "is_location_exact",
                  "room_type",
                  "bathrooms",
                  "beds",
                  "extra_people",
                  "instant_bookable",
                  "cancellation_policy"
                  )
# Removing above columns
airbnb_6 <- (airbnb_5[,!(colnames(airbnb_5) %in% column_names_2)])
# Checking Dimensions
dim(airbnb_6)

As mentioned earlier, I don't want to delve deeper into individual listings, as that doesn't serve our analysis right now. Please find below some pointers regarding the retention and deletion of columns:

  • Not interested in property type, bed type, or how many people the host allows
  • Only the price per night is considered; all other price fields are dropped
  • availability_365 is retained, as it may be used later in calculating the yearly occupancy rate
  • Only review_scores_location is of interest; all other review fields are dropped
  • The reviews_per_month column is kept, as it may be used in estimating the occupancy rate
  • Location information: only zipcode, neighbourhood_group_cleansed, latitude and longitude are retained
  • The bedrooms column is retained, as it will be used to filter out 2 bedroom apartments
  • square_feet is retained, as it might be used to normalize house prices across the zipcode
# columns to be removed
column_names_3 <- c("last_scraped",
                  "property_type",
                  "accommodates",
                  "bed_type",
                  "weekly_price",
                  "monthly_price",
                  "security_deposit",
                  "cleaning_fee",
                  "minimum_nights",
                  "maximum_nights",
                  "minimum_minimum_nights",
                  "maximum_minimum_nights",
                  "minimum_maximum_nights",
                  "maximum_maximum_nights",
                  "minimum_nights_avg_ntm",
                  "maximum_nights_avg_ntm",
                  "calendar_updated",
                  "availability_30",
                  "availability_60",
                  "availability_90",
                  "calendar_last_scraped",
                  "number_of_reviews",
                  "number_of_reviews_ltm",
                  "first_review",
                  "last_review",
                  "review_scores_rating",
                  "review_scores_accuracy",
                  "review_scores_cleanliness",
                  "review_scores_checkin",
                  "review_scores_communication",
                  "review_scores_value",
                  "calculated_host_listings_count",
                  "calculated_host_listings_count_entire_homes",
                  "calculated_host_listings_count_private_rooms",
                  "calculated_host_listings_count_shared_rooms")
# Removing above columns
airbnb_7 <- (airbnb_6[,!(colnames(airbnb_6) %in% column_names_3)])
# Checking Dimensions
dim(airbnb_7)

Now, let us further explore the quality of these 10 variables to arrive at a dataset that can be merged with the Zillow dataset.

Let me check the high-level summary of this dataset.

  • There are 523 missing values in zipcode, which is a small portion of the total number of observations, hence I’ll be removing those records
  • There is a very high number of missing values in square_feet, which leads me to conclude that it won't make much sense to keep the square_feet column for the final data processing. Instead, I will assume that the price is constant throughout a zipcode and that the cost of an apartment is the median cost of a 2 bedroom property irrespective of its area
  • The price variable is a character, which I need to convert to numeric in order to process it
summary(airbnb_7)
##  neighbourhood_group_cleansed   zipcode             latitude    
##  Length:48895                 Length:48895       Min.   :40.50  
##  Class :character             Class :character   1st Qu.:40.69  
##  Mode  :character             Mode  :character   Median :40.72  
##                                                  Mean   :40.73  
##                                                  3rd Qu.:40.76  
##                                                  Max.   :40.91  
##                                                                 
##    longitude         bedrooms       square_feet        price          
##  Min.   :-74.24   Min.   : 0.000   Min.   :   0.0   Length:48895      
##  1st Qu.:-73.98   1st Qu.: 1.000   1st Qu.: 315.0   Class :character  
##  Median :-73.96   Median : 1.000   Median : 700.0   Mode  :character  
##  Mean   :-73.95   Mean   : 1.178   Mean   : 708.1                     
##  3rd Qu.:-73.94   3rd Qu.: 1.000   3rd Qu.: 950.0                     
##  Max.   :-73.71   Max.   :14.000   Max.   :5000.0                     
##                   NA's   :22       NA's   :48487                      
##  availability_365 review_scores_location reviews_per_month
##  Min.   :  0.0    Min.   : 2.000         Min.   : 0.010   
##  1st Qu.:  0.0    1st Qu.: 9.000         1st Qu.: 0.190   
##  Median : 45.0    Median :10.000         Median : 0.720   
##  Mean   :112.8    Mean   : 9.565         Mean   : 1.373   
##  3rd Qu.:227.0    3rd Qu.:10.000         3rd Qu.: 2.020   
##  Max.   :365.0    Max.   :10.000         Max.   :58.500   
##                   NA's   :11082          NA's   :10052

Removing square_feet column

# Removing Square_feet column
airbnb_8 <- airbnb_7 %>% subset(select = -(square_feet))

Quality Checking and Cleaning of ZIPCODE:

# Creating backup of previous dataset
airbnb_9 <- airbnb_8 
# Checking Length of Zipcodes
unique(airbnb_9 %>% transmute(num_digits_zip = str_length(zipcode)))

It can be seen that there are entries whose zipcode has more than 5 digits. Please find these entries below:

DT::datatable(filter(mutate(airbnb_9,num_digits_zip = str_length(zipcode)),num_digits_zip > 5),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))

Checking the count of non-digit zipcodes and viewing them. These are the same zips presented above.

# Checking count of non digit zipcodes
count(airbnb_9 %>% filter(str_detect(zipcode, "[^0-9]")))
# Viewing non-digit zipcode
DT::datatable(airbnb_9 %>% filter(str_detect(zipcode, "[^0-9]")),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))

Cleaning of zipcode:

# In our case the class of the zipcode column is already character, but I am still converting it to character, as that is good general practice if the variable type differs in a future dataset
#airbnb_9$zipcode <- as.character(airbnb_9$zipcode)

# Removing all the whitespaces
#airbnb_9$zipcode = gsub('\\s', '', airbnb_9$zipcode)

# For cases where there is a "-" in zipcode keeping only digits which are before it
#airbnb_9$zipcode = gsub('^([0-9]+)-[0-9]+', "\\1", airbnb_9$zipcode)

# Delete empty strings
#airbnb_9$zipcode[grepl('^$', airbnb_9$zipcode)] = NA
    
# restore leading zeroes to non 5 digit codes:
#airbnb_9$zipcode <- str_pad(airbnb_9$zipcode, width = 5, side = "left", pad = "0")

# Keeping only first 5 digits from starting:
#airbnb_9$zipcode <- str_sub(airbnb_9$zipcode,start = 1,end = 5)

# The steps above are now handled by the zipclean function

airbnb_9$zipcode <- zipclean(airbnb_9$zipcode)
# dim of cleaned dataset
dim(airbnb_9[!is.na(airbnb_9$zipcode),])
## [1] 48378     9
#Storing this value in new dataset
airbnb_10 <- airbnb_9[!is.na(airbnb_9$zipcode),]

It can be seen above that only 517 NA values were removed. Let us repeat the QC that we performed on the zipcodes earlier. From the QC below, it can be seen that the zipcode column is now clean.

# Checking count of non digit zipcodes
count(airbnb_10 %>% filter(str_detect(zipcode, "[^0-9]")))
# Checking Length of Zipcodes
unique(airbnb_10 %>% transmute(num_digits_zip = str_length(zipcode)))

Exploring the bedrooms variable

There are no ambiguous values in the bedrooms column, and the handful of missing values will be dropped automatically when we apply the bedrooms == 2 filter to keep only 2-bedroom properties.

table(airbnb_10$bedrooms)
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    14 
##  4516 34601  6447  2049   518   146    42    21     8     5     3     1     1

Cleaning Price column

Here, I remove the $ prefix and commas and convert the column to numeric type so that I can perform analysis on it. The price column may contain some unusual values, but I will clean those only after filtering for 2 bedroom apartments.

airbnb_10$price <- as.numeric(gsub('[$,]','', airbnb_10$price))

Availability of Listings

Please find below the frequency table of availability_365. As per my initial assumption, a large number of listings have very high pre-booking. Now let us add the Occupancy column.

DT::datatable(as.data.frame(table(airbnb_10$availability_365)))

Occupancy:

The number of days the property is occupied in a year is calculated based on the following formula: \[ occupancy = availability\_365*0.75 + (365 - availability\_365) \]

airbnb_11 <- mutate(airbnb_10, occupancy = round(availability_365*0.75 + (365 - availability_365),0))
# QCing occupancy column:
DT::datatable(as.data.frame(table(airbnb_11$occupancy)))

Retaining Only 2 Bedroom Properties

Now we will filter the dataset based on the number of bedrooms. Please find below the final cleaned AirBnB dataset:

airbnb_final <- filter(airbnb_11, bedrooms == 2)

Quality Check of Price column

There is only one record with price == 0. It won’t affect my analysis, hence I am removing it.

# QCing Price column in cleaned dataset
DT::datatable(as.data.frame(table(airbnb_final$price)),class = "stripe hover")
# Filtering out that record
airbnb_final <- filter(airbnb_final,price > 0)

Please find the data preview on the next tab.

Data Preview

Please find below the preview of the final AirBnb cleaned dataset:

# Final Dataset
DT::datatable(airbnb_final,class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))

EDA

This section just gives a high-level overview of the distribution of listings. I am not focusing much here, because our main goal is to identify the most profitable zipcodes, which will be done in subsequent sections.

New York City is divided into 5 boroughs, and here we look at the distribution of listings across them. This gives a general idea of which borough has a high number of listings and how the nightly price is distributed in each borough. The plot is zoomed to per-night prices between $20 and $500, as anything below 20 dollars or above 500 dollars seems unlikely for a property in New York. Listings with such values must either be data entry errors or special cases like a luxury apartment in Manhattan.

  • Manhattan and Brooklyn have the maximum number of listings
  • Prices of listings in Manhattan are higher than in any other borough
airbnb_final %>%
  add_count(neighbourhood_group_cleansed) %>%
  mutate(Boroughs = glue::glue("{ neighbourhood_group_cleansed } ({ n })")) %>%
  ggplot(aes(Boroughs, price)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(20, 500)) +
  labs(title = "How does price of AirBnB listing differs accross different neighbourhoods of New York City?")

Rolling-Up Dataset at ZIPCODE level

Here, I will first roll up the AirBnB dataset to the zipcode level and then merge it with the Zillow dataset. While rolling up, I will take the mean of latitude and longitude and the median of price, occupancy, review_scores_location and reviews_per_month. For the neighbourhood group, I will check whether there are any duplicates at the zipcode and neighbourhood-group level and update the neighbourhood after inspecting those observations.

Checking the duplicate at Zipcode and Neighbourhood level

Here, I check for duplicate records at the zipcode and neighbourhood level.

# Checking the values
zip_neighbour <- airbnb_final %>% distinct(zipcode,neighbourhood_group_cleansed)

# Checking Duplicate Zips
dup_zips <- zip_neighbour[which(duplicated(zip_neighbour$zipcode) == TRUE),]$zipcode

# Manually Identifying neighbourhood of these zips

zip_neighbour <- subset(airbnb_final,zipcode %in% dup_zips)

table(zip_neighbour$zipcode,zip_neighbour$neighbourhood_group_cleansed)
##        
##         Bronx Brooklyn Manhattan Queens
##   10013     0        1       104      0
##   10463     8        0         1      0
##   11385     0        1         0     60

Now we can see that each duplicate arises from only one record of the zip being in a different neighbourhood. I will manually replace the neighbourhood of each zip, i.e. the neighbourhood of 10013 will be Manhattan and not Brooklyn; similarly, the neighbourhoods of 10463 and 11385 will be Bronx and Queens respectively.

Updating the neighbourhood group

# Creating Backup of final airbnb dataset
airbnb_backup <- airbnb_final

airbnb_final[airbnb_final$zipcode == "10013",]$neighbourhood_group_cleansed <- "Manhattan"

airbnb_final[airbnb_final$zipcode == "10463",]$neighbourhood_group_cleansed <- "Bronx"

airbnb_final[airbnb_final$zipcode == "11385",]$neighbourhood_group_cleansed <- "Queens"

# For QC
airbnb_final %>% distinct(zipcode,neighbourhood_group_cleansed)

Rolling Up AirBnB dataset

airbnb_zip_lvl <- airbnb_final %>%
  group_by(zipcode, neighbourhood_group_cleansed) %>%
  summarise(lat = mean(latitude),
            long = mean(longitude),
            rent_per_day = median(price),
            med_loc_review = median(review_scores_location, na.rm = T),
            avg_rev_per_year = median(12*reviews_per_month, na.rm = T),
            med_occupancy = median(occupancy),
            no_of_listings = n())

# Rolled Up Data preview
head(airbnb_zip_lvl,10)
# QC
dim(airbnb_zip_lvl)
## [1] 169   9

Data Merge and Processing

In this section, I will merge both the final dataset and then calculate profitability metrics.

Merging Two Datasets by ZIPCODE

First, I will merge the two datasets so that, if there is a change in the business decision later and median house prices are imputed for more zipcodes, we can reuse this dataset. For the current analysis, I will only keep the zips for which both cost and rent_per_day values are available.

  • rent_per_day for a 2BR property was available for 169 unique zipcodes in NYC
  • The cost price of a 2BR property was available for 25 zipcodes
  • After merging the two datasets and keeping only records having both cost and rent_per_day values, 24 zipcodes remain
zipcode_pricing <- merge(x = airbnb_zip_lvl, y = zillow_final, by.x = "zipcode", by.y = "zipcode", all = TRUE)

final_data <- filter(zipcode_pricing, !is.na(cost)) %>% 
  filter(!is.na(rent_per_day))

# Data Preview of Final Dataset
DT::datatable(final_data,class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))

Before proceeding further, let us review final_data to see whether any outlier values are present in the dataset.

  • The minimum and maximum values of rent_per_day are $73 and $339, which do not seem odd and are plausible
  • The minimum and maximum values of cost are $436,761 and $3,706,455, which is also quite possible in a big city like New York, as house prices may vary drastically across different neighbourhoods
str(final_data)
summary(final_data)

Calculating Yearly Earning

I am calculating earnings based on my earlier assumption that only the nightly rent contributes to earnings. I also assume an average occupancy of 3 guests per 2BR property whenever the property is occupied.

Hence, the yearly earning is given by the following formula: \[ yearly\_earning = 3*rent\_per\_day*med\_occupancy \]

final_data_2 <- mutate(final_data, yearly_earning = round(3*rent_per_day*med_occupancy,0))
# QC
final_data_2[c("zipcode","yearly_earning")]

Capitalization Rate

The Capitalization Rate, or Cap Rate, will be our most important metric for profitability. The cap rate is the ratio of net yearly income to the property's market value. To simplify the calculation, and in line with our given assumptions, I assume that the market value of the property after 1 year is the same as its current value. Therefore the capitalization rate is given by: \[cap\_rate = (yearly\_earning/cost)*100 \]

final_data_3 <- mutate(final_data_2, cap_rate = round((100*yearly_earning)/cost,2))
# QC
final_data_3[c("zipcode","cap_rate")]

Profitability Segment

Based on the above capitalization rate, I will divide the properties into three categories, High, Medium, and Low, where High profitability means a cap rate above 20%, Medium a cap rate between 10% and 20%, and Low a cap rate below 10%.

final_data_4 <- mutate(final_data_3, 
                       profit_segment = if_else(cap_rate > 20,"High",if_else(cap_rate < 10,"Low","Medium"))) 
                         
# QC
final_data_4[c("zipcode","profit_segment")]

Break Even Period

My final metric for deciding which zipcodes to invest in is the break-even period: the number of years required to recover the initial investment, assuming prices remain constant. This is just the reciprocal of the capitalization rate, but it provides a better visualization of the recovery period. The formula for the break-even period is: \[breakeven\_period = cost/yearly\_earning\]

final_data_5 <- mutate(final_data_4, breakeven_period = round(cost/yearly_earning,1))
# QC
final_data_5[c("zipcode","breakeven_period")]

Data Visualization

Data Preview

Please find below the final data with all the profitability metrics. You can sort the data by clicking on a column header to find the best and worst zipcodes according to that metric. The default sorting order is from best to worst zipcodes based on cap_rate followed by yearly_earning. You can also view the top 5 or top 10 entries by selecting the number of entries in the Show drop-down box.

final_data_5 <- arrange(final_data_5, desc(cap_rate), desc(yearly_earning))
colnames(final_data_5)[colnames(final_data_5) == "neighbourhood_group_cleansed"] <- "neighbourhood"
DT::datatable(final_data_5[c("zipcode","neighbourhood","rent_per_day","cost","no_of_listings","cap_rate","profit_segment","yearly_earning","breakeven_period")], options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))

Data Analysis

We can already see the top zipcodes to invest in as per the Capitalization Rate, Breakeven Period, and Yearly Earnings. Here I will visualize these metrics and see how the zipcodes stand with respect to each other.

Ranking Zipcodes as per Capitalization Rate

The top 5 zipcodes to invest in, along with their neighbourhood and cap_rate, are:

  • 11434 in Queens with 30.14% cap. rate
  • 10304 in Staten Island with 23.07% cap. rate
  • 10303 in Staten Island with 22.66% cap. rate
  • 10306 in Staten Island with 21.07% cap. rate
  • 11234 in Brooklyn with 21.04% cap. rate
final_data_5 %>%
  mutate(zipcode = fct_reorder(zipcode, cap_rate)) %>%
  ggplot(aes(x = zipcode, y = cap_rate, fill = neighbourhood)) +
  scale_fill_manual(values = c("#E69F00", "#0072B2", "#009E73", "#D55E00")) +
  geom_bar(stat = "identity") + 
  geom_text(aes(label = cap_rate), color = "black",size = 3) +
  coord_flip() +
  theme(legend.title = element_blank()) + 
  labs(y = "Capitalization Rate",
       x = "",
       title = "Zipcodes arranged in order of Capitalization Rate",
       subtitle = "")

The number of listings was not considered in the above visualization. Let us look at the top zipcodes having at least 20 listings, as they seem a safer investment than zipcodes with fewer listings, where we don’t have enough data to calculate profitability reliably.

The top 5 zipcodes are:

  • 10036 in Manhattan with 15.68% cap. rate
  • 10025 in Manhattan with 15.23% cap. rate
  • 11231 in Brooklyn with 14.93% cap. rate
  • 11215 in Brooklyn with 14.61% cap. rate
  • 11201 in Brooklyn with 14.59% cap. rate
final_data_5 %>%
  filter(no_of_listings >= 20) %>% 
  mutate(zipcode = fct_reorder(zipcode, cap_rate)) %>%
  ggplot(aes(x = zipcode, y = cap_rate, fill = neighbourhood)) +
  scale_fill_manual(values = c("#E69F00", "#0072B2", "#009E73", "#D55E00")) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = cap_rate), color = "black",size = 3) +
  coord_flip() +
  theme(legend.title = element_blank()) +
  labs(y = "Capitalization Rate",
       x = "",
       title = "Zipcodes arranged in order of Capitalization Rate",
       subtitle = "Only zipcodes having atleast 20 listings")

Yearly Earnings vs Property Cost

Here is the distribution of zipcodes with respect to Yearly Earning and Property Cost. We are interested in zipcodes with a low property cost and high yearly earning, i.e. a high capitalization rate.

final_data_5 %>% 
  ggplot(aes(x = cost, y = yearly_earning, color = neighbourhood)) +
  geom_point(size = 2) +
  theme(legend.title = element_blank()) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  geom_text(aes(label = zipcode), size = 3) +
  geom_rect(xmin = 0,xmax = 2000000, ymin = 200000, ymax = 400000, alpha = 0.01) +
  labs(y = "Yearly Earning (in $)",
       x = "Property Cost (in $)",
       title = "Which are the best zipcodes to invest?",
       subtitle = "Shaded area represent Zipcodes which have relatively low property cost and high yearly earning"
       )

Capitalization Rate vs Location Review

Below is a comparison of our main profitability metric, the capitalization rate, with the median location review score. No definite conclusion can be drawn here, as all of our zipcodes have a very high median location review.

final_data_5 %>% 
  ggplot(aes(x = med_loc_review, y = cap_rate, color = neighbourhood)) +
  geom_point(size = 2) +
  theme(legend.title = element_blank()) +
  geom_text(aes(label = zipcode), size = 3) +
  labs(y = "Capitalization Rate (in %)",
       x = "Location Review",
       title = "Relationship between Capitalization Rate and Location review",
       subtitle = ""
       )

Interactive Map

Below is the location of properties in which we can invest along with their profile information.
Click on the map pointer to view the zipcode along with its median property cost, median yearly earning, breakeven period, and capitalization rate.

map <- leaflet(final_data_5) %>% addTiles()

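# Note: factor() orders levels alphabetically (High, Low, Medium), so the colours
# map to High = blue, Low = red, Medium = green (matching the legend below the map)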
profit_segment_color <- c("blue","red","green")[factor(final_data_5$profit_segment)]

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = profit_segment_color
)


map %>% addAwesomeMarkers(
  lng = ~ final_data_5$long, lat = ~ final_data_5$lat,
  popup = paste("Zipcode: ", final_data_5$zipcode, "<br>",
                "Neighbourhood: ", final_data_5$neighbourhood, "<br>",
                "Property Cost: ", dollar(final_data_5$cost), "<br>",
                "Yearly Earning: ", dollar(final_data_5$yearly_earning), "<br>",
                "Breakeven Period (in y): ", final_data_5$breakeven_period, "<br>",
                "Capitalization Rate: ", final_data_5$cap_rate, "<br>",
                "Profitability Segment: ", final_data_5$profit_segment),
  icon = icons)
  • Blue color for High Profit Segment
  • Green color for Medium Profit Segment
  • Red color for Low Profit Segment

Summary

Executive Summary

Below are the top 5 overall zipcodes and top 5 zipcodes having at least 20 listings, in terms of capitalization rate, along with top 5 highest earning zipcodes and top 5 least expensive zipcodes.

Which zipcodes to invest in will ultimately depend on the real estate company's philosophy and budget. I would recommend investing in the top zipcodes from Queens and Staten Island, as they have a very high return on investment, along with the top 2 zipcodes each from Manhattan and Brooklyn, because they have very high yearly earnings and their data is based on a high number of listings, so the estimates are more reliable and investing in them is a safer bet.

The top 5 overall zipcodes to invest in are:

final_data_5 %>%
  top_n(5,cap_rate) %>%
  arrange(desc(cap_rate), desc(yearly_earning)) %>%
  mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
  select(c("zipcode","neighbourhood","cap_rate","breakeven_period","yearly_earning","cost","no_of_listings"))

Top 5 Zipcodes having at least 20 listings:

final_data_5 %>%
  filter(no_of_listings >= 20) %>% 
  top_n(5,cap_rate) %>%
  arrange(desc(cap_rate), desc(yearly_earning)) %>%
  mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
  select(c("zipcode","neighbourhood","cap_rate","breakeven_period","yearly_earning","cost","no_of_listings"))

Top 5 Zipcodes by Yearly Earnings:

final_data_5 %>%
  top_n(5,yearly_earning) %>%
  arrange(desc(yearly_earning)) %>%
  mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
  select(c("zipcode","neighbourhood","yearly_earning","cap_rate","breakeven_period","cost","no_of_listings"))

Top 5 least expensive zipcodes:

final_data_5 %>%
  top_n(-5,cost) %>%
  arrange(cost) %>%
  mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
  select(c("zipcode","neighbourhood","cost","yearly_earning","cap_rate","breakeven_period","no_of_listings"))

Scope for improvement

  • As I was looking at the data for the first time, I had to manually inspect each column and decide what action to take on it. If this were recurring work, most of the process could be automated based on past experience
  • The Zillow cost data was available only till June 2017 while the AirBnB listing data was scraped in August 2019, hence auto.arima() was used to forecast housing values for August 2019. This increased the run time of the code, so a better and quicker forecasting approach could be developed
  • I have provided a map which shows all the details about each zipcode, but due to the time constraint I was not able to make it more interactive. A Shiny dashboard could be built to let the end user filter the zipcodes based on the desired metrics and show only the relevant zipcodes on the map
  • Right now the cost data is available for only 24 zipcodes, so the final analysis is based on these 24 zipcodes alone. A model could be built to extend property costs to neighbouring zipcodes; in this way we could expand the scope of our analysis