A real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
The real estate company has engaged your firm to build out a data product and provide your conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.
Following were the assumptions for this project:-
All the data from different data sources (Zillow and Aitbnb) is correct in terms of values, zipcodes and their corresponsing cities
There are no seasonality variations- prices of properties on Zillow and price of Airbnb listings are not affected by seasonality.
There is no wear and tear to properties- all properties are in good contions and will not require any significant maintainence or repair charges
Tax to be paid for bying the property is constant and doesnt change by zipcode
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
Occupancy rate of 75% throughout the year for Airbnb properties
Following packages were used:
data.table: Importing the data
plotly : Used to plot interactive charts
tidyverse : Data manipulation and Plotting Graphs
leaflet : Creatioon of interactive maps
DT : Display the data on the screen in a scrollable form
ggthemes : For themes
#Installing package if it is not present
package_listing <- c('data.table', 'tidyverse','DT' , 'leaflet','plotly','ggthemes')
for (required_package in package_listing) {
if (!require(required_package, character.only = T, quietly = T)) {
install.packages(required_package, repos = "http://cran.us.r-project.org")
library(required_package, character.only = T)
}
}
#Loading the packages
library(data.table)
library(tidyverse)
library(DT)
library(plotly)
library(leaflet)
library(ggthemes)
There are 2 data sources from which data is provided
Zillow data- Cost for two-bedroom properties by zipcode.
AirBnB data- Detailed Information about property listing in New York including location, number of bedrooms, reviews, price, availability, property description.
The data was imported from secure sources provided by the company
The data was imported from secure sources provided by the company
The Zillow dataset has 40753 observations and 95 colums. The Airbnb dataset has 8946 observations and 262 colums.
#Import the Data for Airbnb listings and Zillow
airbnb_listings <- fread("listings.csv")
##
Read 32.2% of 62029 rows
Read 64.5% of 62029 rows
Read 40753 rows and 95 (of 95) columns from 0.144 GB file in 00:00:05
zillow <- fread("Zip_Zhvi_2bedroom.csv")
#Check the dimensions of the datasets
dim(airbnb_listings)
## [1] 40753 95
dim(zillow)
## [1] 8946 262
The zillow and airbnb datasets were merged together on the basis of zipcodes.We are finally left with data from 22 zipcodes in our final dataset which are from 4 counties.
Since the real estate company has already identified 2 bedroom properties, filter was applied on airbnb listings.
For our analysis, we keep the colums which are relevant to the problem statement and remove the rest of the columns.
From the Airbnb Dataset we remove majority of the columns containing information about the host because it is not relevant after purchasing new properties.
From the Zillow Dataset we remove attributes containing information about the cost of property from 1996 till the month before last 12 months. This is because the time value of money is 0. So there is no need for to incorporate net present value of a property.
We keep the following colums from Airbnb data
We keep the following colums from Zillow data
#### Selecting last 12 months data
zillow_last_12 <- zillow[,(ncol(zillow) - (12 - 1)):ncol(zillow)]
zillow_new <- cbind(zillow[,1:7],zillow_last_12)
#### Joining airbnb and Zillow data based on zipcodes
merged_data <- merge(airbnb_listings,zillow_new, by.x = "zipcode", by.y = "RegionName")
#### Selcting only 2 bedroom properties as per business requirement
merged_data <- merged_data %>% filter(bedrooms == 2)
####Dropping non-required columns as per data & business understanding
drop_columns <- function(df, drop) {
df <- df[, !names(df) %in% drop, drop = FALSE]
df
}
#This is to remove the effects of a good orginial host on analysis
non_required_cols <- c('listing_url',
'scrape_id',
'last_scraped',
'experiences_offered',
'notes',
'transit',
'access',
'interaction',
'house_rules',
'thumbnail_url',
'medium_url',
'picture_url',
'xl_picture_url',
'description',
'host_id',
'host_url',
'host_name',
'host_since',
'host_location',
'host_about',
'host_response_time',
'host_response_rate',
'host_acceptance_rate',
'host_is_superhost',
'host_thumbnail_url',
'host_picture_url',
'host_neighbourhood',
'host_listings_count',
'host_total_listings_count',
'host_verifications',
'host_has_profile_pic',
'host_identity_verified',
'neighbourhood',
'smart_location',
'country_code',
'country',
'is_location_exact',
'beds',
'bed_type',
'weekly_price',
'monthly_price',
'cleaning_fee',
'minimum_nights',
'maximum_nights',
'calendar_updated',
'has_availability',
'calendar_last_scraped',
'first_review',
'last_review',
'requires license',
'license',
'jurisdiction_names',
'instant_bookable',
'require_guest_profile_picture',
'require_guest_phone_verification',
'calculated_host_listings_count',
'name',
'summary',
'space',
'experiences_offered',
'transit',
'street',
'room_type',
'amenities',
'security_deposit',
'guests_included',
'extra_people',
'requires_license',
'cancellation_policy',
'City',
'State',
'Metro',
'CountyName')
clean_data <- drop_columns(merged_data, non_required_cols)
For data quality check following steps were performed
#### Check missing values for each column and remove columns with 50% NA values
sum(is.na(clean_data)) #total missing values
apply(is.na(clean_data), 2, sum) #missing values for each column
colnames(clean_data)[apply(is.na(clean_data), 2, any)] # identifying colnames containing missing values
clean_data <- clean_data[,colSums(is.na(clean_data)) < (0.5*nrow(clean_data))] #No of nulls is less than 50%
####Check for negative numeric values
neg_count <- sapply(clean_data, function(y) sum(length(which(y < 0))))
nrow(neg_count[which(neg_count > 0)]) # total 0 variables have negative values
####Check for duplicates
clean_data[duplicated(clean_data$id),] # duplicate rows = 0
#### Price Converstion to proper format and calculating mean value of a property
#Converting price into proper format
clean_data$price <- as.numeric(gsub('[$,]','', clean_data$price))
#Removing all values where price is 0
clean_data <- clean_data %>% subset(price != 0) %>% subset(price != 0) # removing all players' whose Valuation and Wage is 0.
For analyzing how much would a property cost on average, the cost of properties from the last 12 months were averaged
Since the price of an airbnb listing is nightly, this was converted to get price of a property for the whole year by assuming an occupancy rate of 75% throughout the year.
Breakeven point is the point at which total cost and total revenue are equal. There is no net loss or gain. This was calculated in years by diving mean cost by yearly price of a property.We will use this as an indicator of profitablity in our decision making
Most of the categorical attributes were in textual format. These were converted into factors accordingly.
In order to identify the most profitable households, it is essential to identify the yearly mean price and cost of a property by zipcode. Hence a new dataset was formed specifically for this purpose.
#Calculating total mean value of property
clean_data$mean_property_cost <- rowMeans(clean_data[,(ncol(clean_data)-(12-1)):ncol(clean_data)], na.rm = TRUE)
#Calculating price for entire year and itdentifying and removing outliers
occupancy_rate <- 0.75 #Assuming a default occupancy rate for an Airbnb property to 75%
availability <- 365
clean_data$yearly_airbnb_price <- clean_data$price * availability * occupancy_rate
# Breakeven_period
clean_data$breakeven_period <- clean_data$mean_property_cost / clean_data$yearly_airbnb_price
#Renaming columns for zipcode and populationrank
colnames(clean_data)[which(names(clean_data) == "RegionName")] <- "zipcode"
colnames(clean_data)[which(names(clean_data) == "SizeRank")] <- "PopulationRank"
#Converting necessary columns to factors
clean_data$zipcode <- as.factor(clean_data$zipcode)
clean_data$PopulationRank <- as.factor(clean_data$PopulationRank)
clean_data$neighbourhood_group_cleansed <- as.factor(clean_data$neighbourhood_group_cleansed)
clean_data$property_type <- as.factor(clean_data$property_type)
# Creating a new dataset for calculating the mean cost, price and breakeven period for each zipcode
breakeven_data <- clean_data %>%
select(zipcode,neighbourhood_group_cleansed,mean_property_cost,breakeven_period,yearly_airbnb_price) %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(mean_breakeven = mean(breakeven_period),
mean_yearly_price = mean(yearly_airbnb_price),
mean_yearly_cost = mean(mean_property_cost))
#Creating a 5,10,20 and 25 year profitabiltiy of a zipcode
breakeven_data$n_5 <- (breakeven_data$mean_yearly_price * 5 ) - breakeven_data$mean_yearly_cost
breakeven_data$n_10 <- (breakeven_data$mean_yearly_price * 10 ) - breakeven_data$mean_yearly_cost
breakeven_data$n_15 <- (breakeven_data$mean_yearly_price * 15 ) - breakeven_data$mean_yearly_cost
breakeven_data$n_20 <- (breakeven_data$mean_yearly_price * 20 ) - breakeven_data$mean_yearly_cost
breakeven_data$n_25 <- (breakeven_data$mean_yearly_price * 25 ) - breakeven_data$mean_yearly_cost
Our final dataset has 1238 observations and 45 columns.
#Final dataset
final_data <- clean_data
#Dimensions of final data
dim(final_data)
## [1] 1238 45
#Structure of final data
summary_data <- data.table( attribute = names(final_data),
class = sapply(final_data, class),
missing_values = colSums(is.na(final_data)))
datatable(final_data)
Now, that the data cleaning and munging processes are complete, the analysis stage would proceed. The step-by-step analysis is being segemented into 3 categories.
The EDA process starts with visualizing the zipcodes for all 4 neighbourhoods in NYC. From the below graph we can see that Manhattan neighbourhood has the largest number of zipcodes followed by Brookyn.
ggplot(final_data, aes(neighbourhood_group_cleansed)) +
geom_bar(aes(fill=zipcode), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Zipcodes across neighbourhoods",
x = "Neighbourhood",
y ="Num,ber of Zipcodes")
Next, we would visulaize the number of properties present in each of the zipcodes. This would help us understand how the data is spread for property listing and help us corroborate a logical conclusion from the graph above that since Manhattan and Brooklyn have higher number of zipcodes present in them, they would naturally have higher number of properties listed as compared to Queens and Staten Island. the below graph does exactly that. Zipcodes 10003 and 11215 have the highest number of properties listed inside them.
final_data %>% select(zipcode,neighbourhood_group_cleansed) %>%
group_by(neighbourhood_group_cleansed,zipcode) %>% count() %>%
plot_ly(x =~n,y=~zipcode,color = ~neighbourhood_group_cleansed, type = 'bar',
text = ~n,textposition = "auto") %>%
layout(title = "Number of Properties by Zipcodes",
yaxis = list(title = "Zipcode"),
xaxis = list(title = "Number of Airbnb Properties"))
From the boxplot, we can observe that there is huge variation in price per night for properties in Manhattan and have quite a few outlier values. Queens and Staten Island have consistency in price per night for the properties listed.
plot_ly(final_data, x = ~neighbourhood_group_cleansed,y = ~price,type = "box", jitter = 0.7,
color = ~neighbourhood_group_cleansed) %>%
layout(title = "Variation in Price per Night for a Airbnb property by Neighbourhood",
yaxis = list(title = "Price per Night"),
xaxis = list(title = "Neighbourhood"))
General knowledge tells us that tourist spots are scattered in NYC and hence living in a location that is close to many attractions would be a plus for tourists. A property that has good rating review as far as location is concerned would naturally charge a high price per night. This is visualized in the scatter plot below. We can see that as the value of location rating increases so does the price per night of that property.
plot_ly(data = final_data, x = ~review_scores_location, y = ~price,
marker = list(size = 10,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2))) %>%
layout(title = 'Correlation between Location of Property and its Price',
xaxis = list(zeroline = FALSE, title = 'Rating of Airbnb Property'),
yaxis = list(zeroline = FALSE, title = 'Price'))
Population of an area in which the property is present is determined by the Population Rank attribute. The lower the number the greater the population. In order to see if there was any correlation between population of a neighbourhood and the nightly price of propety, the below visulaization was created. As can be seen there is no such correlation.
plot_ly(data = final_data, x = ~PopulationRank, y = ~price,
marker = list(size = 10,
color = 'rgba(155, 152, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2))) %>%
layout(title = 'Correlation between Population Rank and Prices',
caption = 'Lower the number the greater the population',
xaxis = list(zeroline = FALSE, title = 'Population Rank'),
yaxis = list(zeroline = FALSE, title = 'Price'))
In order to better understand the cost structure of each property for the last 12 months, the average cost for a property was plotted for each neighbourhood. As can be seen from the below visualization, the cost of propeties in Manhattan neighbourhood are commanding higher as compared to their counterparts in Queens and Staten Island.
final_data %>% plot_ly(x = ~ mean_property_cost,y = ~zipcode,color = ~neighbourhood_group_cleansed, type = 'scatter', size = ~mean_property_cost,
textposition = 'auto') %>%
layout(title = "Mean cost of of Property by Zipcode and Neighbourhood",
xaxis = list(title = "Mean cost of Airbnb Property"),
yaxis = list(title = "Zipcode"))
It is natural to conclude that neighbourhood that have costs higher would genreally have a higher price per night on Airbnb. To check this, the average price for a property was plotted for each neighbourhood. As we can see from below visualization, zipcodes 10036 and 1003 belonging to Manhattan neighbourhood have properties that have the highest asking price per night. On the other hand 10304, 10305, 10306, 10308 and 10312 belonging to Staten Island have low price per night on Airbnb.
final_data %>% plot_ly(x = ~ yearly_airbnb_price,y = ~zipcode,color = ~neighbourhood_group_cleansed, type = 'scatter',
textposition = 'auto') %>%
layout(title = "Yearly Price of Airbnb Property by Zipcode and Neighbourhood",
xaxis = list(title = "Yearly price"),
yaxis = list(title = "Zipcode"))
Since zipcode have many properties inside them with varying levels of cost and price per night, it is essential to visualize if the pattern for cost of properties and price per night of properties forllows the same pattern across zipcodes and neighbourhoods. The below visualization graphs the average cost and average price per night of zipcodes side by side. We can see that zipcodes 10312 and 10036 have a higher average price trend that the mean cost structure for their respective neighbourhoods.
cost_visualization <- breakeven_data %>% plot_ly(x = ~mean_yearly_cost,y = ~zipcode,color = ~neighbourhood_group_cleansed, type = 'bar',
text = ~round(mean_yearly_cost,1), textposition = 'auto') %>%
layout(title = "Comparison between mean price and mean cost of Airbnb properties across zipcodes",
xaxis = list(title = "Mean cost per zipcode"),
yaxis = list(title = "Zipcode"))
price_visualization <- breakeven_data %>% plot_ly(x = ~mean_yearly_price,y = ~zipcode,color = ~neighbourhood_group_cleansed, type = 'bar',
text = ~round(mean_yearly_price,1), textposition = 'auto') %>%
layout(title = "Comparison between mean price and mean cost of properties across zipcodes",
xaxis = list(title = "Mean price per zipcode"),
yaxis = list(title = "Zipcode"))
subplot(cost_visualization, price_visualization, shareY = TRUE, titleX = TRUE, margin = 0.1)
The below graph identifies the breakeven period for different zipcodes across the 4 neighbourhoods.An interesting insight can be gathered from the below graph. Although the Manhattan neighbourhood commanded higher price per night of stay, its mean property cost was also very high. Hence this neighbourhood has the highest breakeven period.
Contrast this to the Staten Island neighbourhood zipcodes which have lower mean cost for properties on Zillow but have a fiarly high price per night value making it the most profitable neighbourhood.
Brooklyn zipcodes are doing a decent job in terms of breakeven period. Properties have fairly average cost in this zipcodes and the price per night on Airbnb for these properties is neither too high nor too low.
breakeven_data %>% plot_ly(x = ~order(mean_breakeven),y = ~zipcode,color = ~neighbourhood_group_cleansed, type = 'bar',
text = ~round(mean_breakeven,2), textposition = 'auto') %>%
layout(title = "Breakeven Period Airbnb properties across zipcodes",
xaxis = list(title = "Mean breakeven period (in years)"),
yaxis = list(title = "Zipcode"))
The final step in the analysis was to understand when will a zip code be profitable on a 5,10, 20 and 25 year plan basis. this would help the client better visualize their cash flows streams and return on investment. An interactive graph was plotted to visualize this. Number of years can be selected from the drop down menu.
breakeven_data %>% plot_ly(y = ~n_5,x = ~zipcode,type = 'bar',textposition = 'auto') %>%
layout(title = "Breakeven Period Airbnb properties across zipcodes",
caption = "Bars above 0 is profitable and Bars below 0 are in loss",
yaxis = list(title = "Zipcode")) %>%
layout(
title = "Interactive visualization for profitabiltiy of a zipcode",
xaxis = list(domain = c(1, 1)),
yaxis = list(title = "Profit/Loss"),
updatemenus = list(
list(
y = 25,
buttons = list(
list(method = "restyle",
args = list("y", list(breakeven_data$n_5)), # put it in a list
label = "5 years"),
list(method = "restyle",
args = list("y", list(breakeven_data$n_10)), # put it in a list
label = "10 years"),
list(method = "restyle",
args = list("y", list(breakeven_data$n_15)), # put it in a list
label = "15 years"),
list(method = "restyle",
args = list("y", list(breakeven_data$n_20)), # put it in a list
label = "20 years"),
list(method = "restyle",
args = list("y", list(breakeven_data$n_25)), # put it in a list
label = "25 years")))
))
It can be concluded from the above analysis that
Manhattan and Brooklyn have higher number of zipcodes present in them, and have higher number of properties listed as compared to Queens and Staten Island.
There is huge variation in price per night for properties in Manhattan and have quite a few outlier values. Queens and Staten Island have consistency in price per night for the properties listed.
As the value of location rating a property resies in increases so does the price per night of that property.
There is no correlation between population of a neighbourhood and the nightly price of propety.
Zipcodes belonging to Manhattan neighbourhood have properties that have the highest asking price per night.
Although the Manhattan neighbourhood commanded higher price per night of stay, its mean property cost was also very high. Hence this neighbourhood has the highest breakeven period.
Staten Island neighbourhood zipcodes which have lower mean cost for properties on Zillow but have a fiarly high price per night value making it the most profitable neighbourhood.
Brooklyn zipcodes are doing a decent job in terms of breakeven period. Properties have fairly average cost in this zipcodes and the price per night on Airbnb for these properties is neither too high nor too low.
Avoid investing in Manhattan neighbourhood zipcodes- Although the Manhattan neighbourhood commanded higher price per night of stay, its mean property cost was also very high which leadsd to this neighbourhood having the highest breakeven period.
Additional steps that would be needed to perform to make a better decision or deliver a better product to the real estate company are:-
Customer Segmentation- It is crucial before entiring into any new market segment to understand the choice patters of consumer behaviour. Hence segmenting the customer on basis of type of tourist (leisure, business etcs), financial spending power, age etc would be helpful.
Competitor Analysis- Before making any investment decision, analyzing the strategy of competitors is necessary. If competitors are agressive in their pricing or marketing strategies, then the real estate company would have to make additional investments for marketing or move towards long term plan.Companies like Roomorama, Homeaway, and the Expedia (EXPE)-owned Housetrip all have a dedicated following of their own, and are all beginning to eat into the market share.
Economic Risks- Tourism industry is generally booming if the economy is healthy. This needs to be factored in before making large investments in properties since the breakeven period is in years.
Financial health of Airbnb- Although the company appears to be health and has raised more than $3 billion in venture capital since its inception, taking on industry behemoth such as Wyndham and Marriot might lead to them taking an aggressive approach leading to Airbnb loosing its market share. Airbnb also is widely reagrded to have an IPO in 2018 and it might not perform as well as it should. Since the client would be using Airbnb’s platform, this would have adverse effect on the real estate company’s cashflows.
Regulatory Issues Airbnb has a lot of regulartory and legal hurdles. Last fall, the New York State Senate passed a bill that makes online apartment listings for stays shorter than 30 days illegal, which, not surprisingly, thwarts Airbnb in their goal to expand its market. This woudl definitely have effect on the clients’ revenue.
Next steps in this analysis would be to incorporate additional data about
Crime Data- Safety is one of the important factors for toursists and they generally prefer crime free neighbourhoods for renting properties. By incorporating the crime data, we would be able to factor this information.
Weather Data- Weather plays an important role for tourists to decide to visit a particular city during harsh weathers. Hence , it is important to factor in the seasonality of a particular city as any city would have peak and lows tourist periods accordingly.
Social Media Data- The real estate company can scrape through the social media platformsdata to identify the impressions a particular neighbourhood is having and based on popularity can make a better decision.
Traffic and Ease of Public Transport Data- If a zipcode is heavily congested with traffic or if it doesnt have proper public transport channels neareby, then there are high chances that toursits will avoid such zipcodes. So analysis of this data needs to be performed.