My goal here is to recommend the zip codes in New York City that will generate the most profit on short-term rentals for a real estate company interested in investing in two-bedroom properties. My recommendations are based on publicly available data from Zillow and AirBnB. The assumptions and instructions underlying the analysis are listed below:
Assumptions and Known Information
Given Assumptions:
Additional Assumptions
I have made a few additional assumptions apart from the given ones. Those that are very specific to a dataset are mentioned in their respective Data Processing tabs. Below, I am stating the assumptions that are critical and decide the overall approach of the project:
Time Value of Money:
Occupancy Assumptions:
365 - availability_365 days
Approach
I will first clean and explore both datasets and then merge them in subsequent tabs. Depending on the findings of this exploration, I will decide on a metric to gauge profitability and suggest the most profitable zipcodes in New York City.
Profitability Metrics
Please find below a brief description of each profitability metric. A fuller explanation is given in the Data Merge and Processing tab, where they are calculated.
Yearly Earning: Yearly revenue from nightly rent, assuming an average of 3 paying guests per occupied 2-bedroom property.
\[ yearly\_earning = 3*rent\_per\_day*med\_occupancy \]
Capitalization Rate: Ratio of net yearly income to the property's market value. \[cap\_rate = (yearly\_earning/cost)*100 \]
Profitability Segment: Based on capitalization rate, properties are divided into High, Medium and Low segments.
Breakeven Period: Number of years required to recover the initial investment, assuming prices remain constant. \[breakeven\_period = cost/yearly\_earning\]
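As a quick worked example of how these metrics interact (hypothetical numbers, not drawn from the data):
# Hypothetical zipcode: $150/night median rent, 270 occupied days, $500,000 property cost
rent_per_day <- 150
med_occupancy <- 270
cost <- 500000
yearly_earning <- 3*rent_per_day*med_occupancy # 121,500
cap_rate <- (yearly_earning/cost)*100 # 24.3
breakeven_period <- cost/yearly_earning # ~4.1 years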
Please review the tabs from left to right. Also, have a look at the Interactive Map section, where you can view all the profitability information for the zipcodes on a map of New York City.
The following packages were used to arrive at the recommendations:
# Check for missing dependencies and install the packages if absent
if (!require(tidyverse)) {install.packages('tidyverse')}
if (!require(readr)) {install.packages('readr')}
if (!require(DT)) {install.packages('DT')}
if (!require(leaflet)) {install.packages('leaflet')}
if (!require(glue)) {install.packages('glue')}
if (!require(forecast)) {install.packages('forecast')}
if (!require(scales)) {install.packages('scales')}
# Loading the Libraries
library(tidyverse) # Used in data processing and data transformation as well as for data visualization
library(readr) # Used for importing data CSV files
library(DT) # Used for displaying table in HTML
library(leaflet) # Used to add Interactive Map
library(glue) # Used for Concatenating Name and Count in Graphs
library(forecast) # For computing and visualizing basic time series components
library(scales) # Scales function for visualization
Setting the working directory to the project location on your device. Please edit the path below to run the files on your machine.
# Setting Working Directory to the local path. Please change it according to your device location
#setwd('')
Cost Data: Zillow Dataset
# Importing Data Set
zillow_dataset <- readr::read_csv('Data/Zip_Zhvi_2bedroom.csv')
Revenue Data: AirBnb Dataset
While importing, there was a warning about parsing failures. Checking with the problems() function showed that the issues were with the zipcode, jurisdiction_names and license variables. I won't be using jurisdiction_names or license anywhere in my analysis, but zipcode is a critical variable. Hence, while importing, the datatype of zipcode is specified as character.
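For reference, a minimal sketch of how those failures can be inspected, assuming an initial read without the col_types override (airbnb_raw is a hypothetical name):
# Read once without forcing column types, then list the parsing failures
airbnb_raw <- readr::read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz')
readr::problems(airbnb_raw) # one row per failure, with row, column and expected/actual values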
# Importing Data Set
airbnb <- readr::read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2019-07-08/data/listings.csv.gz',col_types = cols(zipcode = col_character()))
NYC: Zipcode and Neighbourhood mapping
The ZIP-neighbourhood data scraped from the website nycbynatives.com is imported below.
I have not imported a .csv file here, as I felt it would be inconvenient to import an additional .csv file every time you run the code.
zip_by_neighbourhood <- data.frame("zipcode" = c("10001","10002","10003","10004","10005","10006","10007","10009","10010","10011","10012","10013","10014","10015","10016","10017","10018","10019","10020","10021","10022","10023","10024","10025","10026","10027","10028","10029","10030","10031","10032","10033","10034","10035","10036","10037","10038","10039","10040","10041","10044","10045","10048","10055","10060","10069","10090","10095","10098","10099","10103","10104","10105","10106","10107","10110","10111","10112","10115","10118","10119","10120","10121","10122","10123","10128","10151","10152","10153","10154","10155","10158","10161","10162","10165","10166","10167","10168","10169","10170","10171","10172","10173","10174","10175","10176","10177","10178","10199","10270","10271","10278","10279","10280","10281","10282","10301","10302","10303","10304","10305","10306","10307","10308","10309","10310","10311","10312","10314","10451","10452","10453","10454","10455","10456","10457","10458","10459","10460","10461","10462","10463","10464","10465","10466","10467","10468","10469","10470","10471","10472","10473","10474","10475","11004","11101","11102","11103","11104","11105","11106","11109","11201","11203","11204","11205","11206","11207","11208","11209","11210","11211","11212","11213","11214","11215","11216","11217","11218","11219","11220","11221","11222","11223","11224","11225","11226","11228","11229","11230","11231","11232","11233","11234","11235","11236","11237","11238","11239","11241","11242","11243","11249","11252","11256","11351","11354","11355","11356","11357","11358","11359","11360","11361","11362","11363","11364","11365","11366","11367","11368","11369","11370","11371","11372","11373","11374","11375","11377","11378","11379","11385","11411","11412","11413","11414","11415","11416","11417","11418","11419","11420","11421","11422","11423","11426","11427","11428","11429","11430","11432","11433","11434","11435","11436","11691","11692","11693","11694","11697"),
"neighbourhood" = c("Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Manhattan","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Staten","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Bronx","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Brooklyn","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens","Queens"))
# Data Preview
head(zip_by_neighbourhood,10)
This section contains functions created for the purpose of reproducibility.
Please note that the need for this function was only realized after the data cleaning was done, because earlier I had no visibility of the data. Due to time constraints, I was only able to create one such function, but throughout the code, care was taken to keep it user-friendly and easy to tweak.
In both datasets, viz. Zillow and Airbnb, zipcode is the most important identifier. The end goal is also to identify the most profitable zipcodes, hence cleaning the zip column is a task that will be performed in every such assignment. Here, I am writing a function to clean the zip column.
This function is inspired by a similar function in the zipcode package for R, which is now archived. The link to the package can be found here.
zipclean <- function(zipcode){
  # Converting zipcode to character type
  zipcode <- as.character(zipcode)
  # Removing all whitespace
  zipcode <- gsub('\\s', '', zipcode)
  # Where the zipcode contains a "-", keeping only the digits before it
  zipcode <- gsub('^([0-9]+)-[0-9]+', "\\1", zipcode)
  # Converting empty strings to NA
  zipcode[grepl('^$', zipcode)] <- NA
  # Restoring leading zeroes to codes shorter than 5 digits
  zipcode <- str_pad(zipcode, width = 5, side = "left", pad = "0")
  # Keeping only the first 5 digits
  zipcode <- str_sub(zipcode, start = 1, end = 5)
  return(zipcode)
}
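A quick sanity check of the function on a few representative malformed inputs:
# Expected output: "10013" "11201" "00983" NA
zipclean(c(" 10013 ", "11201-3456", "983", ""))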
More such functions can be made; for e.g., quality checking is done at several steps, hence a function can be created as shown below to perform basic quality checks.
quality_check <- function(dataset){
  cat("The structure of the dataset is as follows:\n")
  str(dataset)
  cat("\nThe dimensions of the dataset are:", dim(dataset), "\n")
  cat("\nThe summary of the dataset is as follows:\n")
  print(summary(dataset))
  cat("\nThe number of duplicate records present in the dataset:",
      sum(duplicated(dataset)), "\n")
}
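For illustration, the function can be run on a small built-in dataset:
quality_check(head(mtcars, 20))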
Please note that I have not used the above function in my code, as the idea came only after the data cleaning was complete. Also, the function is suitable for small data with few variables, as it gets difficult to view the structure and summary of data with a large number of variables.
Similarly, more functions could be created, but because of the time constraint, I was not able to focus much on this part of the process.
From the initial exploration of the Zillow dataset, it can be seen that:
First 10 rows of Zillow dataset
# Checking the top 10 rows of the dataset
DT::datatable(head(zillow_dataset,10), extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
fixedColumns = TRUE
))
#structure of the dataset along with dimensions
str(zillow_dataset)
#High level Summary of data
summary(zillow_dataset)
#Checking missing values in each column
colSums(is.na(zillow_dataset))
# Checking Record having missing value
count(zillow_dataset[rowSums(is.na(zillow_dataset)) > 0,])
# Checking count of unique records in data
count(unique(zillow_dataset))
# Checking number of duplicate records
nrow(zillow_dataset[which(duplicated(zillow_dataset) == TRUE),])
# Checking Unique number of region ID's and Region name's
length(unique(zillow_dataset$RegionID))
length(unique(zillow_dataset$RegionName))
# Checking the number of records for each city
arrange(as.data.frame(table(zillow_dataset$City)),desc(Freq))
# Checking length of ZIP Codes
unique(zillow_dataset %>% transmute(num_digits_zip = str_length(RegionName)))
# Checking for any non digit zipcode
count(zillow_dataset %>% filter(str_detect(RegionName, "[^0-9]")))
# Cleaning Zipcode
# There is no need to clean the zipcode in this case, but I am doing it so that it is part of the process
zillow_dataset$RegionName <- zipclean(zillow_dataset$RegionName)
Now, as our firm is specifically interested in renting out properties in New York City, we will only keep the properties of New York City for our further analysis.
# Removing RegionID, State, Metro and CountyName, as I don't expect to need them later; if I do, I can easily pull them based on RegionName
# I am keeping SizeRank because I may use it again later in my analysis
zillow_dataset2 <- subset(zillow_dataset,select = -c(RegionID,State,Metro, CountyName))
# QCing new dataset
head(zillow_dataset2)
str(zillow_dataset2)
# The number of observations is the same but the number of variables is reduced by 4, which is the expected result
# Filtering Data for New York City
nyczillow <- filter(zillow_dataset2,City == "New York")
# QC: Checking the count of filtered dataset
count(unique(nyczillow))
# As the count matches the count in my observation above, we can safely proceed
# QC: Checking missing values
colSums(is.na(nyczillow))
# There are missing values present
It is given that this is time series data. Let us plot the time series. For that, it is necessary to convert the existing dataset into a time series datatype in R.
# Removing City and Sizerank as we don't need them now for timeseries analysis
temp_nyczillow <- subset(nyczillow, select = -c(City, SizeRank))
dim(temp_nyczillow)
# Converting shape of dataframe from wide to long
longts <- gather(temp_nyczillow,month,median_cost,2:256)
head(longts)
dim(longts)
# Now I want the different zipcodes as column names; for that I will need to spread the above dataset
widets <- spread(longts,RegionName,median_cost)
head(widets)
dim(widets)
# Converting dataframe to timeseries
nyczillow.ts <- ts(widets[,2:26], start = c(1996, 4), frequency = 12)
head(nyczillow.ts)
Time Series Plot for Cost Data
The following plot shows a lot of missing data for the earlier years.
# Time Series Plot
forecast::autoplot(nyczillow.ts) +
scale_y_continuous(labels = comma) +
labs(title = "How does the historical median prices of two bedroom properties in New York City look?",
subtitle = "The graph shows variation in Median Prices of 2 bedroom propeties for 25 zipcodes in New York City",
y = "Median Price of 2 Bedroom Property",
x = NULL) +
labs(colour = "Zipcodes")
For predicting house prices for August 2019 (the period in which the AirBnB data was scraped), I will consider the median price of properties starting from the year 2010.
Time Series Plot for Cost Data from year 2010
# Filtering data only after year 2010
nyczillow.ts.2010 <- window(nyczillow.ts, start = c(2010, 1))
# QC
head(nyczillow.ts.2010)
dim(nyczillow.ts.2010)
# Time Series Plot after 2010
forecast::autoplot(nyczillow.ts.2010) +
scale_y_continuous(labels = comma) +
labs(title = "Variation in the median prices of two bedroom properties in New York City since 2010",
y = "Median Price of 2 Bedroom Property",
x = NULL) +
labs(colour = "Zipcodes")
Predicting housing prices for August 2019
Due to time constraints, I want a quick method to forecast future values, hence I will be using auto.arima() to forecast the housing price up to August 2019. As auto.arima() only works on univariate time series, I will fit each series separately and forecast its housing price for August 2019. I will add these values to the manually created cost column in the NYC Zillow dataset created earlier.
# Creating new dataset for adding property value for August 2019
zillow_dataset3 <- filter(zillow_dataset2,City == "New York")
# Adding blank column for cost
zillow_dataset3$cost <- as.integer(NA)
#QC
names(zillow_dataset3)
dim(zillow_dataset3)
# Keeping only relevant columns
zillow_dataset3 <- subset(zillow_dataset3, select = c(RegionName, City, SizeRank, cost))
# Separating the widets dataset into different time series
for (i in 2:ncol(widets))
{
temp.nyczillow.ts <- ts(widets[,i], start = c(1996, 4), frequency = 12)
temp.nyczillow.ts.2010 <- window(temp.nyczillow.ts, start = c(2010, 1))
model <- auto.arima(temp.nyczillow.ts.2010)
# August forecast i.e. 26 months ahead
nycforecast2 <- forecast(model, h = 26)
# Property Cost for August
predictedvalue2 <- nycforecast2$mean[26]
# adding property values for August-2019 in Zillow dataset
zillow_dataset3$cost[zillow_dataset3$RegionName == colnames(widets[,i])] <- predictedvalue2
}
#QC
dim(zillow_dataset3)
head(zillow_dataset3)
names(zillow_dataset3)
Changing the name of the RegionName column to zipcode, as we will later merge this dataset with the AirBnB dataset to find the most profitable zipcodes in New York City.
# Changing the column name
colnames(zillow_dataset3)[colnames(zillow_dataset3) == "RegionName"] <- "zipcode"
Rounding off the property cost for future convenience
zillow_dataset3$cost <- round(zillow_dataset3$cost, digits = 0)
# Renaming Dataset
zillow_final <- zillow_dataset3
Our final cleaned cost data with the latest property value looks as follows:
DT::datatable(zillow_final,class = "stripe hover")
The key takeaway from this section is that the missing cost prices for zipcodes within New York City cannot be imputed with the median cost price of their neighbourhood, because the sample size is very small (see box plot below).
Let us look at how these 25 zipcodes in New York City compare with each other in terms of median property cost. I am also adding the size rank metric to the following plot to see the impact of an area's population on property cost. The bubble size is directly proportional to the population of the area.
zillow_final %>%
mutate(zipcode = fct_reorder(zipcode, cost)) %>%
ggplot(aes(cost, zipcode, size = SizeRank)) +
scale_size(trans = 'reverse', range = c(0.1,8), breaks = c(5,50,1000,4000)) +
geom_point(alpha = 0.8) +
geom_rect(xmin = 0,xmax = 700000, ymin = 0, ymax = 11, alpha = 0.01) +
scale_x_continuous(labels = comma) +
labs(x = "Median cost of 2 bedroom property",
y = "",
size = "SizeRank",
title = "How does the median price of 2 bedroom properties compare for different zipcodes in NYC?",
subtitle = "Cost price is only available for 25 zipcodes in New York City in Zillow dataset")
Neighbourhood-wise cost analysis
Here, let us see the variation in the prices of 2 bedroom properties across the neighbourhoods.
zillow_neighbourhood <- merge(x = zillow_final, y = zip_by_neighbourhood, by.x = "zipcode", by.y = "zipcode", all.x = TRUE)
zillow_neighbourhood %>%
add_count(neighbourhood) %>%
mutate(neighbourhood = glue::glue("{ neighbourhood } ({ n })")) %>%
ggplot(aes(neighbourhood, cost)) +
geom_boxplot() +
scale_y_continuous(labels = comma) +
geom_jitter() +
labs(title = "How does cost of 2 bedroom houses differs accross different neighbourhoods of New York City?",
subtitle = "Black dots represents median cost price in each zipcode within neighbourhood")
Moving forward, I will identify the profitability of only those zipcodes for which the cost price is available.
From the initial exploration of the AirBnB dataset, it can be seen that:
# For verification
dim(airbnb)
The top 10 rows of the dataset are:
DT::datatable(head(airbnb, 10),extensions = 'FixedColumns',
options = list(
dom = 't',
scrollX = TRUE,
scrollY = "300px",
fixedColumns = TRUE
))
# Checking number of duplicate records
nrow(airbnb[which(duplicated(airbnb) == TRUE),])
## [1] 0
#Checking missing values in each column
DT::datatable(as.data.frame(colSums(is.na(airbnb))),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
# Checking count of records having missing value
count(airbnb[rowSums(is.na(airbnb)) > 0,])
There are 106 columns, of which only some will be of use to me. Hence, before further exploration, I will start deleting columns based on the descriptions provided in the metadata.
The 106 columns are:
DT::datatable(as.data.frame(colnames(airbnb)),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
The main interest is in recommending zipcodes, not individual properties, so I will first remove those columns which are specific to individual listings.
There are a group of fields providing information about the host of the listing, like host_id, host_url, host_name, host_response_rate, etc. Similarly, there are a few fields for the guest profile. I will remove these fields first, as I will not need host information for my analysis.
# Identifying column names starting with "host_"
airbnb[,str_detect(colnames(airbnb),"^host_")]
# Identifying column names containing "guest"
airbnb[,str_detect(colnames(airbnb),"guest")]
# Dropping these columns
airbnb_2 <- airbnb[,!str_detect(colnames(airbnb),"^host_")]
airbnb_3 <- airbnb_2[,!str_detect(colnames(airbnb_2),"guest")]
# Checking the dimensions
dim(airbnb_3)
Similarly, columns which contain any URL are not important to us, hence I am removing them.
# Identifying column names containing "_url"
airbnb_3[,str_detect(colnames(airbnb_3),"_url")]
# Dropping these columns
airbnb_4 <- airbnb_3[,!str_detect(colnames(airbnb_3),"_url")]
# Checking the dimensions
dim(airbnb_4)
I want to keep only those columns which I can roll up to the zipcode level to arrive at conclusions at the zip level. For this, I will check the number of unique values present in each column and then remove columns accordingly, i.e. I will remove columns which have zero variance (only 1 unique value) or extremely high variance (> 40000 unique values).
# Number of Unique values per column
unique_values <- airbnb_4 %>% summarise_all(n_distinct)
DT::datatable(as.data.frame(t(unique_values)),class = "stripe hover")
# Removing columns with zero variance and extreme high variance
column_names <- colnames(unique_values[colSums(unique_values) != 1 & colSums(unique_values) < 40000])
Removing the columns detected in above code chunk
# Checking the dataset after removing above identified columns
airbnb_4[,names(airbnb_4) %in% column_names]
# Creating new database
airbnb_5 <- airbnb_4[,names(airbnb_4) %in% column_names]
Now I will remove columns manually based on the metadata and the structure of the existing dataset. The columns being removed are either specific to a particular airbnb listing or won't add much value to our zip-level analysis, like the state column.
# Checking the structure of existing dataset. It will aid me to identify the columns to drop
str(airbnb_5)
dim(airbnb_5)
# Creating vector of column names which I want to remove
column_names_2 <- c("space",
"neighborhood_overview",
"notes",
"transit",
"access",
"interaction",
"house_rules",
"street",
"neighbourhood",
"neighbourhood_cleansed",
"city",
"state",
"market",
"smart_location",
"is_location_exact",
"room_type",
"bathrooms",
"beds",
"extra_people",
"instant_bookable",
"cancellation_policy"
)
# Removing above columns
airbnb_6 <- (airbnb_5[,!(colnames(airbnb_5) %in% column_names_2)])
# Checking Dimensions
dim(airbnb_6)
As mentioned earlier, I don't want to delve deeper into the individual listings, as that doesn't serve any purpose in our analysis right now. Please find below some pointers regarding the retention and deletion of columns:
# columns to be removed
column_names_3 <- c("last_scraped",
"property_type",
"accommodates",
"bed_type",
"weekly_price",
"monthly_price",
"security_deposit",
"cleaning_fee",
"minimum_nights",
"maximum_nights",
"minimum_minimum_nights",
"maximum_minimum_nights",
"minimum_maximum_nights",
"maximum_maximum_nights",
"minimum_nights_avg_ntm",
"maximum_nights_avg_ntm",
"calendar_updated",
"availability_30",
"availability_60",
"availability_90",
"calendar_last_scraped",
"number_of_reviews",
"number_of_reviews_ltm",
"first_review",
"last_review",
"review_scores_rating",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_value",
"calculated_host_listings_count",
"calculated_host_listings_count_entire_homes",
"calculated_host_listings_count_private_rooms",
"calculated_host_listings_count_shared_rooms")
# Removing above columns
airbnb_7 <- (airbnb_6[,!(colnames(airbnb_6) %in% column_names_3)])
# Checking Dimensions
dim(airbnb_7)
Now, let us further explore the quality of these 10 variables to arrive at the dataset which will be merged with the Zillow dataset.
Let me check the high-level summary of this dataset.
summary(airbnb_7)
## neighbourhood_group_cleansed zipcode latitude
## Length:48895 Length:48895 Min. :40.50
## Class :character Class :character 1st Qu.:40.69
## Mode :character Mode :character Median :40.72
## Mean :40.73
## 3rd Qu.:40.76
## Max. :40.91
##
## longitude bedrooms square_feet price
## Min. :-74.24 Min. : 0.000 Min. : 0.0 Length:48895
## 1st Qu.:-73.98 1st Qu.: 1.000 1st Qu.: 315.0 Class :character
## Median :-73.96 Median : 1.000 Median : 700.0 Mode :character
## Mean :-73.95 Mean : 1.178 Mean : 708.1
## 3rd Qu.:-73.94 3rd Qu.: 1.000 3rd Qu.: 950.0
## Max. :-73.71 Max. :14.000 Max. :5000.0
## NA's :22 NA's :48487
## availability_365 review_scores_location reviews_per_month
## Min. : 0.0 Min. : 2.000 Min. : 0.010
## 1st Qu.: 0.0 1st Qu.: 9.000 1st Qu.: 0.190
## Median : 45.0 Median :10.000 Median : 0.720
## Mean :112.8 Mean : 9.565 Mean : 1.373
## 3rd Qu.:227.0 3rd Qu.:10.000 3rd Qu.: 2.020
## Max. :365.0 Max. :10.000 Max. :58.500
## NA's :11082 NA's :10052
Removing the square_feet column, as it is almost entirely missing (48,487 NAs in the summary above).
# Removing Square_feet column
airbnb_8 <- airbnb_7 %>% subset(select = -(square_feet))
Quality Checking and Cleaning of ZIPCODE:
# Creating backup of previous dataset
airbnb_9 <- airbnb_8
# Checking length of zipcodes
unique(airbnb_9 %>% transmute(num_digits_zip = str_length(zipcode)))
It can be seen that there are entries with more-than-5-digit zipcodes. Please find these entries below:
DT::datatable(filter(mutate(airbnb_9,num_digits_zip = str_length(zipcode)),num_digits_zip > 5),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
Checking the count of non-digit zipcodes and viewing them. These are the same zips presented above.
# Checking count of non digit zipcodes
count(airbnb_9 %>% filter(str_detect(zipcode, "[^0-9]")))
# Viewing non-digit zipcode
DT::datatable(airbnb_9 %>% filter(str_detect(zipcode, "[^0-9]")),class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
Cleaning of zipcode:
# The cleaning steps (character conversion, whitespace removal, truncating at "-",
# converting empty strings to NA, zero-padding, and keeping the first 5 digits)
# are exactly what the zipclean() function defined earlier performs, so the
# function is applied directly
airbnb_9$zipcode <- zipclean(airbnb_9$zipcode)
# dim of cleaned dataset
dim(airbnb_9[!is.na(airbnb_9$zipcode),])
## [1] 48378 9
#Storing this value in new dataset
airbnb_10 <- airbnb_9[!is.na(airbnb_9$zipcode),]
It can be seen above that only 517 NA values were removed. Let us repeat the QC that we performed on the zipcodes earlier. From the QC below, it can be seen that the zipcode column is now clean.
# Checking count of non digit zipcodes
count(airbnb_10 %>% filter(str_detect(zipcode, "[^0-9]")))
# Checking length of zipcodes
unique(airbnb_10 %>% transmute(num_digits_zip = str_length(zipcode)))
Exploring the bedrooms variable
No ambiguous values are present in the bedrooms column, and the handful of missing values will be dropped automatically by the filter. Now we will apply the filter bedrooms == 2 in order to keep 2-bedroom properties only.
table(airbnb_10$bedrooms)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 14
## 4516 34601 6447 2049 518 146 42 21 8 5 3 1 1
Cleaning the Price column
Here, I remove the $ prefix and convert the column to numeric type so that I can perform analysis on it. The price column may contain some unusual values, but I will clean those only after filtering for 2 bedroom apartments.
airbnb_10$price <- as.numeric(gsub('[$,]','', airbnb_10$price))
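A one-line illustration of what the substitution does (hypothetical value):
as.numeric(gsub('[$,]', '', "$1,250.00")) # returns 1250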
Availability of Listing
Please find below the frequency table of availability_365. In line with my initial assumption, a large number of listings are heavily pre-booked (very low availability). Now let us add the occupancy column.
DT::datatable(as.data.frame(table(airbnb_10$availability_365)))
Occupancy:
The number of days the property is occupied in a year is calculated based on the following formula: \[ occupancy = availability\_365*0.75 + (365 - availability\_365) \]
airbnb_11 <- mutate(airbnb_10, occupancy = round(availability_365*0.75 + (365 - availability_365),0))
# QCing occupancy column:
DT::datatable(as.data.frame(table(airbnb_11$occupancy)))
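For example, a listing with availability_365 = 100 gets occupancy = 0.75 × 100 + (365 − 100) = 340 days.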
Retaining Only 2 Bedroom Properties
Now we will filter the dataset based on the number of bedrooms. Please find below the final cleaned AirBnB dataset:
# Keeping only 2 bedroom listings
airbnb_final <- filter(airbnb_11, bedrooms == 2)
Quality Check of Price column
There is only one record with price == 0. It won't affect my analysis, hence I am removing it.
# QCing Price column in cleaned dataset
DT::datatable(as.data.frame(table(airbnb_final$price)),class = "stripe hover")
# Filtering out that record
airbnb_final <- filter(airbnb_final,price > 0)
Please find the data preview on the next tab.
Please find below the preview of the final AirBnb cleaned dataset:
# Final Dataset
DT::datatable(airbnb_final,class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
This section just gives a high-level overview of the distribution of listings. I am not focusing much here because our main goal is to identify the most profitable zipcodes, which will be done in subsequent sections.
New York City is divided into 5 neighbourhoods (boroughs), and here we can look at the distribution of listings across them. It gives us a general idea of which neighbourhood has a high number of listings and of the distribution of nightly prices in each borough. The plot is zoomed to per-night prices between $20 and $500, as anything below $20 or above $500 seems unlikely for a property in New York. Listings with such values must either be data entry errors or special cases like a luxury apartment in Manhattan.
airbnb_final %>%
add_count(neighbourhood_group_cleansed) %>%
mutate(Boroughs = glue::glue("{ neighbourhood_group_cleansed } ({ n })")) %>%
ggplot(aes(Boroughs, price)) +
geom_boxplot() +
coord_cartesian(ylim = c(20, 500)) +
labs(title = "How does price of AirBnB listing differs accross different neighbourhoods of New York City?")
Here, I will first roll up the AirBnB dataset to the zipcode level and then merge it with the Zillow dataset. While rolling up, I will take the mean of latitude and longitude and the median of price, occupancy, review_scores_location and reviews_per_month. For the neighbourhood group, I will check whether there are duplicates at the zipcode and neighbourhood-group level and will update the neighbourhood after reviewing the observations.
Checking duplicates at the Zipcode and Neighbourhood level
Here, I found duplicate records at the zipcode and neighbourhood level.
# Checking the values
zip_neighbour <- airbnb_final %>% distinct(zipcode,neighbourhood_group_cleansed)
# Checking Duplicate Zips
dup_zips <- zip_neighbour[which(duplicated(zip_neighbour$zipcode) == TRUE),]$zipcode
# Manually Identifying neighbourhood of these zips
zip_neighbour <- subset(airbnb_final,zipcode %in% dup_zips)
table(zip_neighbour$zipcode,zip_neighbour$neighbourhood_group_cleansed)
##
## Bronx Brooklyn Manhattan Queens
## 10013 0 1 104 0
## 10463 8 0 1 0
## 11385 0 1 0 60
Now, we can see that each duplicate arises because only one record of the zip falls in a different neighbourhood. Here, I will manually replace the neighbourhood of each zip, i.e. the neighbourhood of 10013 will be Manhattan and not Brooklyn; similarly, the neighbourhoods of 10463 and 11385 will be Bronx and Queens respectively.
Updating the neighbourhood group
# Creating Backup of final airbnb dataset
airbnb_backup <- airbnb_final
airbnb_final[airbnb_final$zipcode == "10013",]$neighbourhood_group_cleansed <- "Manhattan"
airbnb_final[airbnb_final$zipcode == "10463",]$neighbourhood_group_cleansed <- "Bronx"
airbnb_final[airbnb_final$zipcode == "11385",]$neighbourhood_group_cleansed <- "Queens"
# For QC
airbnb_final %>% distinct(zipcode,neighbourhood_group_cleansed)
Rolling Up AirBnB dataset
airbnb_zip_lvl <- airbnb_final %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(lat = mean(latitude), long = mean(longitude), rent_per_day = median(price), med_loc_review = median(review_scores_location,na.rm = T), avg_rev_per_year = median(12*reviews_per_month, na.rm = T), med_occupancy = median(occupancy), no_of_listings = n())
# Rolled Up Data preview
head(airbnb_zip_lvl,10)
# QC
dim(airbnb_zip_lvl)
## [1] 169 9
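Since the neighbourhood fix above guarantees one neighbourhood per zip, a one-line assertion can guard the roll-up (a sketch):
# Each zipcode should now appear exactly once in the rolled-up data
stopifnot(!any(duplicated(airbnb_zip_lvl$zipcode)))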
In this section, I will merge both the final dataset and then calculate profitability metrics.
Merging Two Datasets by ZIPCODE
First, I will merge both datasets keeping all rows, so that if there is a change in a business decision later and median house prices for missing zipcodes are imputed, we can reuse this dataset. For the current analysis, I will only keep those zips for which both cost and rent_per_day values are available.
zipcode_pricing <- merge(x = airbnb_zip_lvl, y = zillow_final, by.x = "zipcode", by.y = "zipcode", all = TRUE)
final_data <- filter(zipcode_pricing, !is.na(cost)) %>%
filter(!is.na(rent_per_day))
# Data Preview of Final Dataset
DT::datatable(final_data,class = "stripe hover",options = list(autoWidth = TRUE, scrollX = TRUE,scrollY = "200px"))
Before proceeding further let us review the final_data to see if there are any outlier value present in the dataset.
str(final_data)
summary(final_data)
Calculating Yearly Earning
I am calculating earnings based on my earlier assumption that only the nightly rent contributes to earnings. Also, I assume an average occupancy of 3 guests per 2BR property whenever the property is occupied.
Hence, the yearly earning is given by the following formula: \[ yearly\_earning = 3*rent\_per\_day*med\_occupancy \]
final_data_2 <- mutate(final_data, yearly_earning = round(3*rent_per_day*med_occupancy,0))
# QC
final_data_2[c("zipcode","yearly_earning")]
Capitalization Rate
Capitalization Rate, or Cap Rate, is going to be our most important metric for calculating profitability. The cap rate is the ratio of net yearly income to the property's market value. To simplify the calculation, and as per the given assumptions, I assume that the market value of the property after 1 year is the same as its current value. Therefore the capitalization rate is given by: \[cap\_rate = (yearly\_earning/cost)*100 \]
final_data_3 <- mutate(final_data_2, cap_rate = round((100*yearly_earning)/cost,2))
# QC
final_data_3[c("zipcode","cap_rate")]
Profitability Segment
Based on the above capitalization rate, I will divide the properties into three categories: High, Medium, and Low, where High-profitability properties have a cap rate above 20%, Medium-profitability properties a cap rate between 10% and 20%, and Low-profitability properties a cap rate below 10%.
final_data_4 <- mutate(final_data_3,
profit_segment = if_else(cap_rate > 20,"High",if_else(cap_rate < 10,"Low","Medium")))
# QC
final_data_4[c("zipcode","profit_segment")]
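The nested if_else can equivalently be written with dplyr::case_when, which stays readable if more segments are added later; a sketch (final_data_4_alt is a hypothetical name used for comparison):
# Equivalent segmentation using case_when (same logic as the nested if_else)
final_data_4_alt <- mutate(final_data_3,
                           profit_segment = case_when(
                             cap_rate > 20 ~ "High",
                             cap_rate < 10 ~ "Low",
                             TRUE ~ "Medium"
                           ))
# identical(final_data_4$profit_segment, final_data_4_alt$profit_segment) should be TRUE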
Break Even Period
My final metric for deciding which zipcodes to invest in will be the break-even period. It is the number of years required to recover the initial investment, assuming prices remain constant. This is essentially the reciprocal of the capitalization rate, but it provides a better visualization of the recovery period. The formula for the break-even period is: \[breakeven\_period = cost/yearly\_earning\]
final_data_5 <- mutate(final_data_4, breakeven_period = round(cost/yearly_earning,1))
# QC
final_data_5[c("zipcode","breakeven_period")]
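Note that breakeven_period = 100/cap_rate; for example, a zipcode with a 25% cap rate recovers its cost in 100/25 = 4 years.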
Please find below the final data with all the profitability metrics. You can sort the data by clicking on a column header to find the best and worst zipcodes according to that metric. The default sorting order is from best to worst zipcodes based on cap_rate, followed by yearly_earning. You can also view the top 5 or top 10 entries by selecting the number of entries in the Show drop-down box.
final_data_5 <- arrange(final_data_5, desc(cap_rate), desc(yearly_earning))
colnames(final_data_5)[colnames(final_data_5) == "neighbourhood_group_cleansed"] <- "neighbourhood"
DT::datatable(final_data_5[c("zipcode","neighbourhood","rent_per_day","cost","no_of_listings","cap_rate","profit_segment","yearly_earning","breakeven_period")], options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
We can already see the top zipcodes to invest in, whether by capitalization rate and breakeven period or by yearly earnings. Here I will visualize these metrics and see how the zipcodes compare with respect to each other.
Ranking Zipcodes as per Capitalization Rate
The top 5 zipcodes to invest in, along with their neighbourhood and cap_rate, are:
final_data_5 %>%
mutate(zipcode = fct_reorder(zipcode, cap_rate)) %>%
ggplot(aes(x = zipcode, y = cap_rate, fill = neighbourhood)) +
scale_fill_manual(values = c("#E69F00", "#0072B2", "#009E73", "#D55E00")) +
geom_bar(stat = "identity") +
geom_text(aes(label = cap_rate), color = "black",size = 3) +
coord_flip() +
theme(legend.title = element_blank()) +
labs(y = "Capitalization Rate",
x = "",
title = "Zipcodes arranged in order of Capitalization Rate",
subtitle = "")
The number of listings was not considered in the above visualization. Let us see the top zipcodes having at least 20 listings, as they seem to be a better investment than zipcodes with fewer listings, where we don't have enough data to calculate profitability reliably.
The top 5 zipcodes are:
final_data_5 %>%
filter(no_of_listings >= 20) %>%
mutate(zipcode = fct_reorder(zipcode, cap_rate)) %>%
ggplot(aes(x = zipcode, y = cap_rate, fill = neighbourhood)) +
scale_fill_manual(values = c("#E69F00", "#0072B2", "#009E73", "#D55E00")) +
geom_bar(stat = "identity") +
geom_text(aes(label = cap_rate), color = "black",size = 3) +
coord_flip() +
theme(legend.title = element_blank()) +
labs(y = "Capitalization Rate",
x = "",
title = "Zipcodes arranged in order of Capitalization Rate",
subtitle = "Only zipcodes having atleast 20 listings")
Yearly Earnings vs Property Cost
Here is the distribution of zipcodes with respect to yearly earning and property cost. We are interested in zipcodes with a low property cost and high yearly earning, i.e. a high capitalization rate.
final_data_5 %>%
ggplot(aes(x = cost, y = yearly_earning, color = neighbourhood)) +
geom_point(size = 2) +
theme(legend.title = element_blank()) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
geom_text(aes(label = zipcode), size = 3) +
geom_rect(xmin = 0,xmax = 2000000, ymin = 200000, ymax = 400000, alpha = 0.01) +
labs(y = "Yearly Earning (in $)",
x = "Property Cost (in $)",
title = "Which are the best zipcodes to invest?",
subtitle = "Shaded area represent Zipcodes which have relatively low property cost and high yearly earning"
)
Capitalization Rate vs Location Review
Below is a comparison of our profitability metric, the capitalization rate, with the median location review score. No definite conclusion can be drawn here, as all of our zipcodes have very high median location reviews.
final_data_5 %>%
ggplot(aes(x = med_loc_review, y = cap_rate, color = neighbourhood)) +
geom_point(size = 2) +
theme(legend.title = element_blank()) +
geom_text(aes(label = zipcode), size = 3) +
labs(y = "Capitalization Rate (in %)",
x = "Location Review",
title = "Relationship between Capitalization Rate and Location review",
subtitle = ""
)
Below are the locations of the zipcodes in which we can invest, along with their profile information.
Click on the map pointer to view the zipcode along with its median property cost, median yearly earning, breakeven period, and capitalization rate.
map <- leaflet(final_data_5) %>% addTiles()
# Factor levels sort alphabetically (High, Low, Medium), so blue = High, red = Low, green = Medium
profit_segment_color <- c("blue","red","green")[factor(final_data_5$profit_segment)]
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = profit_segment_color
)
map %>% addAwesomeMarkers(lng = ~ final_data_5$long, lat = ~ final_data_5$lat, popup = paste("Zipcode: ", final_data_5$zipcode, "<br>", "Neighbourhood: ", final_data_5$neighbourhood, "<br>", "Property Cost: ", dollar(final_data_5$cost), "<br>", "Yearly Earning: ", dollar(final_data_5$yearly_earning), "<br>", "Breakeven Period (in y): ", final_data_5$breakeven_period, "<br>", "Capitalization Rate: ", final_data_5$cap_rate, "<br>", "Profitability Segment: ", final_data_5$profit_segment), icon = icons)
Below are the top 5 overall zipcodes and the top 5 zipcodes having at least 20 listings, in terms of capitalization rate, along with the top 5 highest-earning zipcodes and the top 5 least expensive zipcodes.
Which zipcodes to invest in will now mostly depend on the real estate company's philosophy and budget. I would recommend investing in the top zipcodes from Queens and Staten Island, as they have a very high return on investment, along with the top 2 zipcodes each from Manhattan and Brooklyn, because they have very high yearly earnings and their figures are based on a high number of listings, making the estimates more reliable and the investment a safer bet.
The top 5 overall zipcodes to invest in are:
final_data_5 %>%
top_n(5,cap_rate) %>%
arrange(desc(cap_rate), desc(yearly_earning)) %>%
mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
select(c("zipcode","neighbourhood","cap_rate","breakeven_period","yearly_earning","cost","no_of_listings"))
Top 5 Zipcodes having at least 20 listings:
final_data_5 %>%
filter(no_of_listings >= 20) %>%
top_n(5,cap_rate) %>%
arrange(desc(cap_rate), desc(yearly_earning)) %>%
mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
select(c("zipcode","neighbourhood","cap_rate","breakeven_period","yearly_earning","cost","no_of_listings"))
Top 5 Zipcodes by Yearly Earnings:
final_data_5 %>%
top_n(5,yearly_earning) %>%
arrange(desc(yearly_earning)) %>%
mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
select(c("zipcode","neighbourhood","yearly_earning","cap_rate","breakeven_period","cost","no_of_listings"))
Top 5 least expensive zipcodes:
final_data_5 %>%
top_n(-5,cost) %>%
arrange(cost) %>%
mutate(cost = dollar(cost,prefix = "$"), yearly_earning = dollar(yearly_earning, prefix = "$")) %>%
select(c("zipcode","neighbourhood","cost","yearly_earning","cap_rate","breakeven_period","no_of_listings"))
auto.arima() was used to forecast the housing values for August 2019. This increased the run time of the code, hence a faster forecasting approach could be explored.
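As one possibility, a naive drift forecast skips the ARIMA order search entirely. A minimal sketch, assuming the widets data frame from the time series chunk is available (forecast accuracy would need to be compared before adopting it):
# Drift forecast for every zipcode series, 26 months ahead (August 2019)
drift_cost <- sapply(2:ncol(widets), function(i) {
  series <- window(ts(widets[[i]], start = c(1996, 4), frequency = 12),
                   start = c(2010, 1))
  as.numeric(forecast::rwf(series, h = 26, drift = TRUE)$mean[26])
})
names(drift_cost) <- colnames(widets)[-1]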