A real estate company plans to purchase properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, the company doesnt know which zip codes are the best to invest in.
My firm has been engaged by the real estate company to build a data product and provide conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.
The publicly available data used for the analysis come from Zillow and Airbnb.
Zillow dataset (cost data): Provides selling cost for 2 bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017.
Airbnb dataset (revenue data): Information about property listings in New York including location, number of bedrooms, reviews, price, availability, property description, etc. AirBnB is the medium through which the real estate company plans to lease out their investment property.
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
Occupancy rate of 75% throughout the year for Airbnb properties.
The company will put properties on rent every day throughout the year.
We assume that 40 percent of bookings are single day bookings that last less than a week, 40 percent are weekly bookings and 20 percent of bookings are made for a month. This assumption is cursory and done for the sake of this exercise. [Guessed from basic internet research]
Selling cost information of 2 bedroom properties is available only until June 2017. I perform time series forecasting to obtain the value of properties in December 2018 by considering data from the past 3 years. Also, the reason I didn’t forecast the selling cost at a time point further in 2019 is because I wanted to keep it as close to June 2017 as possible.
The following packages were used:
plyr : For data analysis
data.table : Importing the data
colorspace : Required in system to install dependent package tidyverse
tidyverse : Data manipulation and plotting graphs
forecast : For time series analysis
plotly : Used to plot interactive charts
astsa : For time series analysis
Amelia : For imputing missing values
mice : For imputing missing values
dplyr : For data aggregation
kable : For table generation
highcharter : Creating charts
magrittr : Coding functionalities
broom : Data cleaning
purrr : Used by a function
if(!require("plyr")){
install.packages("tidyverse",dependencies = T)
library(plyr)
}
detach(package:plyr) #remove it because it yields issues with tidyverse; however, still need to make sure it's installed for rbind.fill
#Install packages if it is not present
package_list<- c('data.table', 'colorspace', 'tidyverse', 'plotly', 'forecast', 'astsa', 'Amelia', 'mice', 'dplyr','highcharter','magrittr','broom','purrr','DT','shiny')
for (req_package in package_list) {
if (!require(req_package, character.only = T, quietly = T)) {
install.packages(req_package, repos = "http://cran.us.r-project.org")
library(req_package, character.only = T)
}
}
#Load the packages
library(colorspace)
library(tidyverse)
library(data.table)
library(plotly)
library(forecast)
library(astsa)
library(Amelia)
library(mice)
library(dplyr)
library(magrittr)
library(broom)
library(purrr)
library(highcharter)
library(DT)
library(shiny)
The 2 datasets upon which analysis has been performed to help solve the business case:
Airbnb data : Contains information about property listings in New York. This data would help us understand the amount of money to be gained by offering short-term rentals in New York.
Zillow data : Cost of 2 bedroom properties in different regions over the years (April 1996 - June 2017).
# read csv files
airbnb_listings <- read.csv("listings.csv")
zillow <- read.csv("Zip_Zhvi_2bedroom.csv")
#Dimensions of airbnb_listings
dim(airbnb_listings)
## [1] 40753 95
#Dimensions of zillow data
dim(zillow)
## [1] 8946 262
The listings data contains 40753 rows and 95 columns. While the Zillow dataset contains 8946 rows and 262 columns.
airbnb_listings
colnames(airbnb_listings)
## [1] "id" "listing_url"
## [3] "scrape_id" "last_scraped"
## [5] "name" "summary"
## [7] "space" "description"
## [9] "experiences_offered" "neighborhood_overview"
## [11] "notes" "transit"
## [13] "access" "interaction"
## [15] "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url"
## [19] "xl_picture_url" "host_id"
## [21] "host_url" "host_name"
## [23] "host_since" "host_location"
## [25] "host_about" "host_response_time"
## [27] "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url"
## [31] "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count"
## [35] "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street"
## [39] "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city"
## [43] "state" "zipcode"
## [45] "market" "smart_location"
## [47] "country_code" "country"
## [49] "latitude" "longitude"
## [51] "is_location_exact" "property_type"
## [53] "room_type" "accommodates"
## [55] "bathrooms" "bedrooms"
## [57] "beds" "bed_type"
## [59] "amenities" "square_feet"
## [61] "price" "weekly_price"
## [63] "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included"
## [67] "extra_people" "minimum_nights"
## [69] "maximum_nights" "calendar_updated"
## [71] "has_availability" "availability_30"
## [73] "availability_60" "availability_90"
## [75] "availability_365" "calendar_last_scraped"
## [77] "number_of_reviews" "first_review"
## [79] "last_review" "review_scores_rating"
## [81] "review_scores_accuracy" "review_scores_cleanliness"
## [83] "review_scores_checkin" "review_scores_communication"
## [85] "review_scores_location" "review_scores_value"
## [87] "requires_license" "license"
## [89] "jurisdiction_names" "instant_bookable"
## [91] "cancellation_policy" "require_guest_profile_picture"
## [93] "require_guest_phone_verification" "calculated_host_listings_count"
## [95] "reviews_per_month"
zillow
colnames(zillow)
## [1] "RegionID" "RegionName" "City" "State" "Metro"
## [6] "CountyName" "SizeRank" "X1996.04" "X1996.05" "X1996.06"
## [11] "X1996.07" "X1996.08" "X1996.09" "X1996.10" "X1996.11"
## [16] "X1996.12" "X1997.01" "X1997.02" "X1997.03" "X1997.04"
## [21] "X1997.05" "X1997.06" "X1997.07" "X1997.08" "X1997.09"
## [26] "X1997.10" "X1997.11" "X1997.12" "X1998.01" "X1998.02"
## [31] "X1998.03" "X1998.04" "X1998.05" "X1998.06" "X1998.07"
## [36] "X1998.08" "X1998.09" "X1998.10" "X1998.11" "X1998.12"
## [41] "X1999.01" "X1999.02" "X1999.03" "X1999.04" "X1999.05"
## [46] "X1999.06" "X1999.07" "X1999.08" "X1999.09" "X1999.10"
## [51] "X1999.11" "X1999.12" "X2000.01" "X2000.02" "X2000.03"
## [56] "X2000.04" "X2000.05" "X2000.06" "X2000.07" "X2000.08"
## [61] "X2000.09" "X2000.10" "X2000.11" "X2000.12" "X2001.01"
## [66] "X2001.02" "X2001.03" "X2001.04" "X2001.05" "X2001.06"
## [71] "X2001.07" "X2001.08" "X2001.09" "X2001.10" "X2001.11"
## [76] "X2001.12" "X2002.01" "X2002.02" "X2002.03" "X2002.04"
## [81] "X2002.05" "X2002.06" "X2002.07" "X2002.08" "X2002.09"
## [86] "X2002.10" "X2002.11" "X2002.12" "X2003.01" "X2003.02"
## [91] "X2003.03" "X2003.04" "X2003.05" "X2003.06" "X2003.07"
## [96] "X2003.08" "X2003.09" "X2003.10" "X2003.11" "X2003.12"
## [101] "X2004.01" "X2004.02" "X2004.03" "X2004.04" "X2004.05"
## [106] "X2004.06" "X2004.07" "X2004.08" "X2004.09" "X2004.10"
## [111] "X2004.11" "X2004.12" "X2005.01" "X2005.02" "X2005.03"
## [116] "X2005.04" "X2005.05" "X2005.06" "X2005.07" "X2005.08"
## [121] "X2005.09" "X2005.10" "X2005.11" "X2005.12" "X2006.01"
## [126] "X2006.02" "X2006.03" "X2006.04" "X2006.05" "X2006.06"
## [131] "X2006.07" "X2006.08" "X2006.09" "X2006.10" "X2006.11"
## [136] "X2006.12" "X2007.01" "X2007.02" "X2007.03" "X2007.04"
## [141] "X2007.05" "X2007.06" "X2007.07" "X2007.08" "X2007.09"
## [146] "X2007.10" "X2007.11" "X2007.12" "X2008.01" "X2008.02"
## [151] "X2008.03" "X2008.04" "X2008.05" "X2008.06" "X2008.07"
## [156] "X2008.08" "X2008.09" "X2008.10" "X2008.11" "X2008.12"
## [161] "X2009.01" "X2009.02" "X2009.03" "X2009.04" "X2009.05"
## [166] "X2009.06" "X2009.07" "X2009.08" "X2009.09" "X2009.10"
## [171] "X2009.11" "X2009.12" "X2010.01" "X2010.02" "X2010.03"
## [176] "X2010.04" "X2010.05" "X2010.06" "X2010.07" "X2010.08"
## [181] "X2010.09" "X2010.10" "X2010.11" "X2010.12" "X2011.01"
## [186] "X2011.02" "X2011.03" "X2011.04" "X2011.05" "X2011.06"
## [191] "X2011.07" "X2011.08" "X2011.09" "X2011.10" "X2011.11"
## [196] "X2011.12" "X2012.01" "X2012.02" "X2012.03" "X2012.04"
## [201] "X2012.05" "X2012.06" "X2012.07" "X2012.08" "X2012.09"
## [206] "X2012.10" "X2012.11" "X2012.12" "X2013.01" "X2013.02"
## [211] "X2013.03" "X2013.04" "X2013.05" "X2013.06" "X2013.07"
## [216] "X2013.08" "X2013.09" "X2013.10" "X2013.11" "X2013.12"
## [221] "X2014.01" "X2014.02" "X2014.03" "X2014.04" "X2014.05"
## [226] "X2014.06" "X2014.07" "X2014.08" "X2014.09" "X2014.10"
## [231] "X2014.11" "X2014.12" "X2015.01" "X2015.02" "X2015.03"
## [236] "X2015.04" "X2015.05" "X2015.06" "X2015.07" "X2015.08"
## [241] "X2015.09" "X2015.10" "X2015.11" "X2015.12" "X2016.01"
## [246] "X2016.02" "X2016.03" "X2016.04" "X2016.05" "X2016.06"
## [251] "X2016.07" "X2016.08" "X2016.09" "X2016.10" "X2016.11"
## [256] "X2016.12" "X2017.01" "X2017.02" "X2017.03" "X2017.04"
## [261] "X2017.05" "X2017.06"
To help with scalability and reusability, we use the following function for preprocessing.
The following steps are performed by the zillow_preprocess function before joining the Airbnb and Zillow data:
Select relevant columns such as Region Name (zipcode), the city, the SizeRank and the cost of the property only from the last 5 years.
Filer data to only include rows containing cost information belonging to the city name passed as the function argument (New York in this case).
Forecast and attach the avg. price of 2 bedroom property in Jan 2019 in each zipcode as CurrentPrice.
Assuming that there is seasonality in the price and also that values depend not only on previous values (Auto Regressive AR) but also on diferences between previous values (Moving Average MA), we apply ARIMA model to predict the cost of the properties in Zipcodes from July 2017 to Jan 2019.
The zillow_preprocess Function
zillow_preprocess <- function(tempdf,cityName){ # Zillow data and New york city given as function arguement
# Select zillow cost information from last 5 years and modify zillow data (tempdf) to only hold relevant columns
n <- 60
tempdf <- tempdf[,c(2,3,7,(ncol(tempdf)-n):ncol(tempdf))]
tempdf <- filter(tempdf,City==cityName) # Filter for the required city
colnames(tempdf)[colnames(tempdf)=="RegionName"] <- "zipcode" # Set proper column name to be used for merging later
tempdf$currentPrice <- NULL # Create a new column to store the latest price in January 2019
# we define a for loop to iterate over each zipcode to obtain latest cost of property
for(i in 1:nrow(tempdf)){
tmp = ts(as.vector(t(tempdf[,c(4:64)])[,i]),start = c(2012,6),frequency = 12) # Convert the monthly cost data into time series data
ARIMAfit = arima(tmp, order=c(1,1,1), seasonal=list(order=c(1,0,1),period=NA),
method="ML")# Define ARIMA model to be used for prediction
pred = predict(ARIMAfit, n.ahead = 19)# use the ARIMA model to predict the price from July 2017 to Jan 2019
predval <- pred$pred # Store the predicted values in a variable
tempdf$currentPrice[i] <- predval[length(predval)] # set the value of current price for the specific zipcode as price in Jan 2019
}
return(tempdf[,c(1,2,3,65)]) # return the filtered data containing only relevant columns
}
Call function above by passing zillow data and city name as ‘New York’
Calling the zillow_preprocess Function
city="New York"
zillow_filtered <- zillow_preprocess(zillow,city) # call zillow_preprocess function
str(zillow_filtered) # view the structure of clean and filtered Zillow data
## 'data.frame': 25 obs. of 4 variables:
## $ zipcode : int 10025 10023 10128 10011 10003 11201 11234 10314 11215 10028 ...
## $ City : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
## $ SizeRank : int 1 3 14 15 21 32 52 68 71 109 ...
## $ currentPrice: num 1459633 2147667 1981749 2624939 2171586 ...
The final clean data contains 4 columns i.e. Zipcode, City, SizeRank and the current price of property in the particular zipcode. There are 25 rows where each row describes a unqiue zipcode. Lets look at top five rows from the clean zillow data
head(zillow_filtered,5) # retrieve the first 5 rows of zillow data
## zipcode City SizeRank currentPrice
## 1 10025 New York 1 1459633
## 2 10023 New York 3 2147667
## 3 10128 New York 14 1981749
## 4 10011 New York 15 2624939
## 5 10003 New York 21 2171586
In airbnb_listings data, we are only concerned about 2 bedroom listings. Also, airbnb_listings data has 95 columns and 40753 rows, most of which we will not using during this analysis exercise. We only choose these columns: “id”,“zipcode”,“bedrooms”,“square_feet”,“price”,“weekly_price”,“monthly_price”,“cleaning_fee”,“number_of_reviews”,“review_scores_rating”
listings_filter function
listings_filter <- function(tempdf,numberOfBedroom){
# Select relevant columns
relevantcol <- c("id","zipcode","bedrooms","square_feet","price","weekly_price","monthly_price","cleaning_fee","neighbourhood_group_cleansed","number_of_reviews","review_scores_rating")
tempdf <- tempdf[,relevantcol]
# filter data containing 2 bedrooms
tempdf <- filter(tempdf,bedrooms==numberOfBedroom)
return(tempdf)
}
We call the above function by passing the available listings data and also the number of bedrooms as 2
Calling the above function
noBedroom=2 # set required number of bedrooms
listings_filtered <- listings_filter(airbnb_listings,noBedroom) # call the function
str(listings_filtered) # observe the structure of filtered listings data
## 'data.frame': 4894 obs. of 11 variables:
## $ id : int 9513511 5046189 4357134 16027061 11301089 14855080 16231738 2836845 1841252 1581579 ...
## $ zipcode : Factor w/ 205 levels "","05340","07310",..: 84 91 104 104 108 108 108 108 108 108 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ square_feet : int NA NA NA NA NA NA NA NA NA 1000 ...
## $ price : Factor w/ 583 levels "$1,000.00","$1,021.00",..: 65 87 151 205 516 178 33 98 33 207 ...
## $ weekly_price : Factor w/ 786 levels "","$1,000.00",..: 1 1 1 1 1 1 1 29 1 215 ...
## $ monthly_price : Factor w/ 839 levels "","$1,000.00",..: 1 1 1 1 1 1 1 492 1 724 ...
## $ cleaning_fee : Factor w/ 172 levels "","$0.00","$10.00",..: 1 148 1 1 108 168 144 148 64 159 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 1 1 4 4 4 4 4 4 4 4 ...
## $ number_of_reviews : int 4 31 0 0 1 15 2 107 3 135 ...
## $ review_scores_rating : int 85 95 NA NA 60 87 80 96 90 97 ...
head(listings_filtered) # view first few row of data
## id zipcode bedrooms square_feet price weekly_price monthly_price
## 1 9513511 10462 2 NA $130.00
## 2 5046189 10469 2 NA $150.00
## 3 4357134 11102 2 NA $200.00
## 4 16027061 11102 2 NA $250.00
## 5 11301089 11105 2 NA $79.00
## 6 14855080 11105 2 NA $225.00
## cleaning_fee neighbourhood_group_cleansed number_of_reviews
## 1 Bronx 4
## 2 $75.00 Bronx 31
## 3 Queens 0
## 4 Queens 0
## 5 $400.00 Queens 1
## 6 $95.00 Queens 15
## review_scores_rating
## 1 85
## 2 95
## 3 NA
## 4 NA
## 5 60
## 6 87
The listings_filtered (airbnb) data has 11 columns and 4894 rows where each row describes a unique property
The two filtered datastes - listings_filtered (airbnb) and zillow_filtered (zillow) are joined on the key zipcode. In below code we combine the data sets and have a look at the structure of the combined data.
finalData <- merge(listings_filtered,zillow_filtered,by = c("zipcode")) # merge data sets on zipcode
str(finalData) # observe structure of combined data
## 'data.frame': 1238 obs. of 14 variables:
## $ zipcode : Factor w/ 205 levels "","05340","07310",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ id : int 13561752 4942107 711635 4510857 3799598 568743 8335547 15094880 7664343 8884228 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ square_feet : int NA NA 800 NA NA NA NA NA NA NA ...
## $ price : Factor w/ 583 levels "$1,000.00","$1,021.00",..: 378 579 194 53 194 96 205 182 120 263 ...
## $ weekly_price : Factor w/ 786 levels "","$1,000.00",..: 1 1 110 716 1 147 1 1 2 164 ...
## $ monthly_price : Factor w/ 839 levels "","$1,000.00",..: 1 1 646 452 1 675 1 1 1 659 ...
## $ cleaning_fee : Factor w/ 172 levels "","$0.00","$10.00",..: 37 1 121 133 148 148 159 121 4 4 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 14 37 63 2 144 137 1 26 2 72 ...
## $ review_scores_rating : int 94 100 95 80 88 83 100 95 90 95 ...
## $ City : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
## $ SizeRank : int 21 21 21 21 21 21 21 21 21 21 ...
## $ currentPrice : num 2171586 2171586 2171586 2171586 2171586 ...
summary(finalData)
## zipcode id bedrooms square_feet
## 11215 :141 Min. : 20853 Min. :2 Min. : 0.0
## 10003 :133 1st Qu.: 4218751 1st Qu.:2 1st Qu.: 800.0
## 10025 :112 Median : 9410246 Median :2 Median : 975.0
## 10036 :108 Mean : 9218383 Mean :2 Mean : 921.6
## 10011 :102 3rd Qu.:14476782 3rd Qu.:2 3rd Qu.:1100.0
## 10014 : 95 Max. :18508770 Max. :2 Max. :1600.0
## (Other):547 NA's :1210
## price weekly_price monthly_price cleaning_fee
## $250.00: 93 :981 :1058 :236
## $200.00: 72 $1,200.00: 14 $4,000.00: 12 $100.00:208
## $300.00: 59 $1,500.00: 13 $6,000.00: 8 $50.00 : 87
## $150.00: 56 $1,100.00: 10 $3,000.00: 6 $80.00 : 81
## $350.00: 42 $1,000.00: 9 $5,000.00: 6 $150.00: 80
## $400.00: 32 $1,400.00: 8 $2,500.00: 5 $75.00 : 73
## (Other):884 (Other) :203 (Other) : 143 (Other):473
## neighbourhood_group_cleansed number_of_reviews review_scores_rating
## Bronx : 0 Min. : 0.00 Min. : 20.00
## Brooklyn :374 1st Qu.: 1.00 1st Qu.: 90.00
## Manhattan :840 Median : 5.00 Median : 95.00
## Queens : 9 Mean : 17.43 Mean : 93.12
## Staten Island: 15 3rd Qu.: 20.00 3rd Qu.:100.00
## Max. :306.00 Max. :100.00
## NA's :268
## City SizeRank currentPrice
## New York:1238 Min. : 1.0 Min. : 369831
## Aberdeen: 0 1st Qu.: 15.0 1st Qu.:1329214
## Abilene : 0 Median : 71.0 Median :1885687
## Abingdon: 0 Mean : 485.9 Mean :1835538
## Abington: 0 3rd Qu.: 580.0 3rd Qu.:2247494
## Acton : 0 Max. :4149.0 Max. :3332599
## (Other) : 0
The combined data consists of 14 columns describing information for 1238 unique listings.
The data contains lot of issues such as NAs, incorrect data symbols such as $ and wrong data types. ### Data cleaning
The merged data obtained in the previous step has a lot of data quality issues such as missing values, incorrect data types and inconsistent data values. We perform a few steps to cleanse the data.
We set standard names for the columns in the final merged data (finalData). From the summary of the finalData (previous step), we see that city column had many other levels. So, we correct the number of levels in the data to include only New York city.
# Set the new column names
colnames(finalData) <- c("zipcode","id","bedrooms","square_feet","per_night_price","weekly_price","monthly_price","cleaning_fee","neighbourhood","number_of_reviews","review_scores_rating","city","size_rank","current_price")
# Correct the factor levels in city column
finalData$city <- factor(finalData$city, levels=c("New York"))
Variables per_night_price(daily rent), weekly_price, monthly_price and cleaning_fee contain symbols such as $ attached which would prevent these columns from being used for numerical analysis. So, we clean them in following snippet.
Dealing with the dollar sign
cols <- c("per_night_price", "weekly_price", "monthly_price","cleaning_fee") # selecting columns to be cleaned
# Specify a function to replace characters with whitespace
replace_dollar <- function(x){
price <- as.numeric(gsub("[$,]","",x)) # this function removes $ from data
return(price)
}
# Apply function to replace characters with whitespace
finalData[cols] <- lapply(finalData[cols], replace_dollar) # running the above defined function on cols
# Again check the structure of the data
str(finalData)
## 'data.frame': 1238 obs. of 14 variables:
## $ zipcode : Factor w/ 205 levels "","05340","07310",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ id : int 13561752 4942107 711635 4510857 3799598 568743 8335547 15094880 7664343 8884228 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ square_feet : int NA NA 800 NA NA NA NA NA NA NA ...
## $ per_night_price : num 450 989 240 119 240 159 250 229 180 305 ...
## $ weekly_price : num NA NA 1365 850 NA ...
## $ monthly_price : num NA NA 5460 3200 NA ...
## $ cleaning_fee : num 150 NA 50 60 75 75 85 50 100 100 ...
## $ neighbourhood : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 14 37 63 2 144 137 1 26 2 72 ...
## $ review_scores_rating: int 94 100 95 80 88 83 100 95 90 95 ...
## $ city : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
## $ size_rank : int 21 21 21 21 21 21 21 21 21 21 ...
## $ current_price : num 2171586 2171586 2171586 2171586 2171586 ...
The above result shows that all the variables have proper data types and consistent values except for missing data which we will impute soon.
The algorithm which we later use to obtain KPIs to measure and compare zip codes will require some columns to be normalized on 0-1 scale.
normalize Function
normalize <- function(x){
return((x-min(x))/(max(x)-min(x))) # function to scale variables between 0 and 1
}
# Scale the number of reviews column
finalData["number_of_reviews"] <- lapply(finalData["number_of_reviews"], normalize)
# Check the summary of column to ensure scaling has been applied
summary(finalData$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.003268 0.016340 0.056952 0.065360 1.000000
We see that scaling works fine and that all the values lie between 0 and 1.
Finding the percentage of missing values in columns
# Check number of NA in all the columns
missingValues <- as.data.frame(colSums(sapply(finalData,is.na)))
# Convert rownames to columns
missingValues <- as.data.frame(setDT(missingValues, keep.rownames = TRUE))
# Rename the column names
colnames(missingValues) <- c("columnName","totalNA_values")
# Transform totalNA to percent, add it as column and arrange in descending order on the basis of it
missingValues <- missingValues %>%
mutate_at(vars(totalNA_values),funs(percentNA_values=.*100/nrow(finalData))) %>%
arrange(desc(percentNA_values))
# Check the top columns having maximum NA values
head(missingValues,n=nrow(missingValues))
## columnName totalNA_values percentNA_values
## 1 square_feet 1210 97.73829
## 2 monthly_price 1058 85.46042
## 3 weekly_price 981 79.24071
## 4 review_scores_rating 268 21.64782
## 5 cleaning_fee 236 19.06300
## 6 zipcode 0 0.00000
## 7 id 0 0.00000
## 8 bedrooms 0 0.00000
## 9 per_night_price 0 0.00000
## 10 neighbourhood 0 0.00000
## 11 number_of_reviews 0 0.00000
## 12 city 0 0.00000
## 13 size_rank 0 0.00000
## 14 current_price 0 0.00000
Columns square_feet, monthly_price, weekly_price, review_scores_rating and cleaning_fee contain 97.7, 85.46, 79.24, 21.64 and 19.06 percent NA values respectively.
Although, variables which contain very large percentage of missing data should not be imputed but rather removed, here I still impute them. The most important variable (per_night_price) which is vital in computing revenue doesn’t have any missing values. So, our analysis and the resulting conclusions wouldn’t be impaired by this imputation.
The NAs are removed below by imputing them through the use of mice package and CART algorithm as follows
Imputation
## Missing Data Imputation
# Subset the data to remove the constants
dataSet <- subset(finalData,select = -c(id,city)) # id and city are constant and will not help in imputation
# Impute the Missing Data using Mice Package
cat("Started...Imputing Missing Data using Calssification and Regression Tree 'CART'")
## Started...Imputing Missing Data using Calssification and Regression Tree 'CART'
impdataSet <- mice(dataSet, m=5, method='cart', printFlag=FALSE)
cat("....Finished...Imputing Missing Data")
## ....Finished...Imputing Missing Data
#
# Use the imputed values in the original dataframe
complete_dataSet <- complete(impdataSet)
finaldf_subset <- subset(finalData,select = c(id,city))
finaldf_complete <- cbind(complete_dataSet,finaldf_subset) # combining the imputed dataset to add id and city
# Check if there are any more NA values
sum(sapply(finaldf_complete, function(x) { sum(is.na(x)) }))
## [1] 0
# Check the summary of the dataframe again
summary(finaldf_complete)
## zipcode bedrooms square_feet per_night_price weekly_price
## 11215 :141 Min. :2 Min. : 0 Min. : 28.0 Min. : 310
## 10003 :133 1st Qu.:2 1st Qu.: 900 1st Qu.: 165.0 1st Qu.:1100
## 10025 :112 Median :2 Median :1000 Median : 240.0 Median :1442
## 10036 :108 Mean :2 Mean :1003 Mean : 278.7 Mean :1735
## 10011 :102 3rd Qu.:2 3rd Qu.:1150 3rd Qu.: 325.0 3rd Qu.:2200
## 10014 : 95 Max. :2 Max. :1600 Max. :4700.0 Max. :5950
## (Other):547
## monthly_price cleaning_fee neighbourhood number_of_reviews
## Min. : 1250 Min. : 0.00 Bronx : 0 Min. :0.000000
## 1st Qu.: 3500 1st Qu.: 60.00 Brooklyn :374 1st Qu.:0.003268
## Median : 5000 Median : 90.00 Manhattan :840 Median :0.016340
## Mean : 5792 Mean : 94.82 Queens : 9 Mean :0.056952
## 3rd Qu.: 7425 3rd Qu.:118.75 Staten Island: 15 3rd Qu.:0.065360
## Max. :17100 Max. :350.00 Max. :1.000000
##
## review_scores_rating size_rank current_price
## Min. : 20.00 Min. : 1.0 Min. : 369831
## 1st Qu.: 90.00 1st Qu.: 15.0 1st Qu.:1329214
## Median : 96.00 Median : 71.0 Median :1885687
## Mean : 93.31 Mean : 485.9 Mean :1835538
## 3rd Qu.:100.00 3rd Qu.: 580.0 3rd Qu.:2247494
## Max. :100.00 Max. :4149.0 Max. :3332599
##
## id city
## Min. : 20853 New York:1238
## 1st Qu.: 4218751
## Median : 9410246
## Mean : 9218383
## 3rd Qu.:14476782
## Max. :18508770
##
There are no more missing data values in the final data.
To find out which zipcodes are profitable we need to aggregate the current data at zipcode level; where each zipcode is associated with its average daily/weekly/monthly airbnb rent price, number of 2 bedroom airbnb listings in the zipcode, average number of reviews for those listings, and the estimated cost of property of those 2 bedroom listings (current_price).
We create a new variable called unique_id which holds nothing but the number of unique IDs (unique 2 bedroom listings) from airbnb listings.
unique_id: The number of properties listed in Airbnb in each zip code
This is done using the below code.
# Find average of the required columns
avg_df <- finaldf_complete %>%
group_by(zipcode) %>%
summarise_at(vars(square_feet:current_price),mean) # mean of current price and other price attributes
# Find unique id (no. of properties) for each zipcode
unique_id_df <- finaldf_complete %>% select(zipcode,id) %>%
group_by(zipcode) %>%
mutate(unique_id = n_distinct(id)) %>%
select(zipcode,unique_id) %>% distinct() # count number of properties in each zipcode as unique_id
neighbourhood_df <- finaldf_complete %>% select(zipcode,neighbourhood) %>%
group_by(zipcode) %>%
select(zipcode,neighbourhood) %>% distinct()
# Combine both the dataframes to get final summary of the data
finaldf_complete<-inner_join(unique_id_df,finaldf_complete,by="zipcode")
summary_df <- inner_join(avg_df, unique_id_df, by = "zipcode") # combine data agregations using zipcode
summary_df=inner_join(summary_df,neighbourhood_df,by="zipcode")
make.names(colnames(summary_df))
## [1] "zipcode" "square_feet" "per_night_price"
## [4] "weekly_price" "monthly_price" "cleaning_fee"
## [7] "neighbourhood.x" "number_of_reviews" "review_scores_rating"
## [10] "size_rank" "current_price" "unique_id"
## [13] "neighbourhood.y"
summary_df$neighbourhood.x<-NULL
colnames(summary_df)[colnames(summary_df)=="neighbourhood.y"] <- "neighbourhood"
summary(summary_df)
## zipcode square_feet per_night_price weekly_price
## 10003 : 1 Min. : 400.0 Min. : 65.0 Min. : 350
## 10011 : 1 1st Qu.: 782.8 1st Qu.:128.7 1st Qu.:1009
## 10013 : 1 Median :1000.0 Median :220.2 Median :1499
## 10014 : 1 Mean : 917.7 Mean :228.5 Mean :1439
## 10021 : 1 3rd Qu.:1103.8 3rd Qu.:307.1 3rd Qu.:1871
## 10022 : 1 Max. :1153.4 Max. :367.3 Max. :2430
## (Other):16
## monthly_price cleaning_fee number_of_reviews review_scores_rating
## Min. :1250 Min. : 22.50 Min. :0.00000 Min. : 80.00
## 1st Qu.:3297 1st Qu.: 72.39 1st Qu.:0.04422 1st Qu.: 91.73
## Median :5001 Median : 88.38 Median :0.05049 Median : 93.60
## Mean :4878 Mean : 81.27 Mean :0.05772 Mean : 92.81
## 3rd Qu.:6477 3rd Qu.:101.46 3rd Qu.:0.07016 3rd Qu.: 94.74
## Max. :7544 Max. :114.25 Max. :0.23366 Max. :100.00
##
## size_rank current_price unique_id neighbourhood
## Min. : 1.0 Min. : 369831 Min. : 1.00 Bronx : 0
## 1st Qu.: 37.0 1st Qu.: 462082 1st Qu.: 9.00 Brooklyn : 5
## Median : 479.5 Median :1471799 Median : 59.50 Manhattan :11
## Mean : 805.7 Mean :1436329 Mean : 56.27 Queens : 1
## 3rd Qu.:1389.8 3rd Qu.:2106187 3rd Qu.: 92.75 Staten Island: 5
## Max. :4149.0 Max. :3332599 Max. :141.00
##
Some information gleaned from the above summary:
There are 22 Zipcodes in New York city having 2 bedroom properties listed on Airbnb
At least 1 zipcode has only 1 rental property listed on Airbnb and atleast one zipcode has 141 rental properties listed on Airbnb. Most of the zipcodes have more than 50 properties listed on Airbnb
The minimum average ratings associated with a zipcode is 80 while maximum is 96.69 on scale (0-100)
The cheapest zipcodes average daily rent is $65.0 and the costliest zipcodes average daily rent is $367.3
If the company goes for zipcode with cheapest property then each property there will cost on average $363483 and zipcode with costliest property will have properties listed on average at $3290515
Move to Analysis tab
We need to analyze the cost of purchasing 2 bedroom properties in different zipcodes and the revenues that can be expected from renting those properties through Airbnb.
We compute (or assume) the following variables to help us with the analysis:
Occupancy_Rate = 0.75 (probability that the property is occupied aka booked on any given day)
p_daily = 0.75 (probability that booking is done for a day)
p_weekly = 0.18 (probability that booking is done for a week)
p_month = 0.07 (probability that booking is done for a month)
Est_Annual_Revenue(per_property) = Occupancy_Rate x 365 x ((p_daily x Rent_per_Day )+(p_weekly x Rent_per_Week/7)+(p_month x Rent_per_Month/30))
Revenue_by_Cost_Ratio = Est_Annual_Revenue/Est_Purchase_Cost_of_the_Property
Breakeven_Point = (Est_Purchase_Cost_of_the_Property/(Estimated_Annual_Revenue-(Occupancy Rate x 52 x Cleaning_Fee)))
Year_N_est_net. = ((Est_Annual_Revenue x N)-(N x Occupancy_Rate x 52 x Cleaning_Fee))-Est_Purchase_Cost_of_the_Property
Percent_units: This column defines the percentage of number of 2 bedroom properties which are present in the given zipcode. A higher value means more number of properties available in the respective zipcode
Review_effect: This column denotes the scaled value of the Review_score_ratings column on a scale of 0 to 1
p_daily=.75 # probability that booking is done for a day
p_weekly=.18 # probability that booking is done for a week
p_month=.7 # probability that booking is done for a month
occupancy_rate <- .75
Year_days <- 365 # number of days in a year
# derive the percentage of properties listed in a given zipcode
summary_df$Percent_units <- summary_df$unique_id*100/sum(summary_df$unique_id)
summary_df$Review_effect <- normalize(summary_df$review_scores_rating) # scale the review_scores_rating
summary_df$Review_effect <-ifelse(summary_df$Review_effect>0,summary_df$Review_effect,mean(summary_df$Review_effect)) # impute those properties which had no ratings
# Generate the revenue for a year per property
summary_df$Revenue_year_per_property<-occupancy_rate*Year_days*((p_daily*summary_df$per_night_price)+(p_weekly*summary_df$weekly_price/7)+(p_month*summary_df$monthly_price/30))
# Obtain the Revenue/Cost ratio
summary_df$Revenue_by_Cost_Ratio <-((summary_df$Revenue_year_per_property)/(summary_df$current_price))
# Breakeven period
summary_df$breakeven_point <-(summary_df$current_price/(summary_df$Revenue_year_per_property-(occupancy_rate*52*summary_df$cleaning_fee)))
summary_df$year_5 <- ((summary_df$Revenue_year_per_property*5)-(5*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_10 <- ((summary_df$Revenue_year_per_property*10)-(10*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_15 <- ((summary_df$Revenue_year_per_property*15)-(15*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_20 <- ((summary_df$Revenue_year_per_property*20)-(20*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_25 <- ((summary_df$Revenue_year_per_property*25)-(25*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df<-summary_df %>% mutate_at(vars(per_night_price,square_feet,number_of_reviews,Review_effect,Revenue_year_per_property, weekly_price,monthly_price,cleaning_fee,current_price,review_scores_rating,Percent_units, breakeven_point, year_5,year_10,year_15,year_20,year_25), funs(round(., 2)))
summary_df<-summary_df %>% mutate_at(vars(Revenue_by_Cost_Ratio), funs(round(., 3)))
The aggregated data from the previous steps is as follows:
datatable(summary_df[,c("zipcode","per_night_price","weekly_price","monthly_price","cleaning_fee","review_scores_rating","size_rank","current_price","unique_id","neighbourhood","Percent_units","Revenue_year_per_property","Revenue_by_Cost_Ratio","breakeven_point","year_5","year_10","year_15","year_20","year_25")],rownames=FALSE,options=list(pageLength=22,dom='t',searching=FALSE,paging=FALSE,scroller=TRUE,scrollX=TRUE,scrollY="500px",fixedHeader=TRUE, autoWidth=TRUE),colnames=c("Zipcode","Daily Rent","Weekly Rent","Monthly Rent","Cleaning Fee","Scaled Avg. Review Score","Size Rank","Property Cost","No. of Property","Neighbourhood","Percentage of Units","Rev/Property","Revenue/Cost Ratio","Breakeven Point","Year 5 est. net","year 10 est. net","year 15 est. net","Year 20 est. net","Year 25 est. net"))
Please observe that a few zipcodes among the 22 have very few 2 bedroom airbnb listings. Also, noticing from the previous section, the (unique_id) variable’s minimum value is 1, meaning that at least 1 zipcode has just 1 listed 2 bedroom property in airbnb; and the first quartile equals 9, meaning at least 25% of zipcodes have 9 or less number of 2 bedroom properties listed in airbnb. Important values like revenue might get skewed when a zipcode has very few listings. One possible solution would be to write a function that checks if the percentage differences in avg. airbnb rent between 2 bedroom and other types of listings (1/3/4 bedroom) within each such zipcode are comparable to similar percentage difference measures computed for other New York zipcodes (zipcodes with more number of 2 bedroom airbnb listings). For now, I ignore zipcodes with less than 10 two bedroom airbnb listings.
Before we get into more meaningful analysis charts, we need to make sure that we only consider those zipcodes that have enough listings in the Airbnb data. Else the analysis will be flawed.
colnames(summary_df)[colnames(summary_df)=="unique_id"]<-"no_of_2bed_listings"
# min_no_of_listings - only if the number of 2 bedroom listings is higher than this value in a zipcode, is the zipcode used in revenue analysis
min_no_of_listings=10
zipcodes_with_15orless_2bedroom_listings<-summary_df[(summary_df$no_of_2bed_listings<min_no_of_listings),]
zipcodes_with_16ormore_2bedroom_listings<-summary_df[!(summary_df$no_of_2bed_listings<min_no_of_listings),]
# Displaying zipcodes with less than 10 two bedroom listings
datatable(zipcodes_with_15orless_2bedroom_listings[,c('zipcode','no_of_2bed_listings')],rownames=FALSE,options=list(don='t',searching=FALSE,paging=FALSE),colnames=c("zipcode","No. of 2 bedroom listings"))
swthm <- hc_theme_merge(
hc_theme_darkunica(),
hc_theme(
credits = list(
style = list(
color = "#4bd5ee"
)
),
title = list(
style = list(
color = "#4bd5ee"
)
),
chart = list(
backgroundColor = "transparent",
divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
style = list(fontFamily = "Comic Sans MS")
)
)
)
zipcodes_with_16ormore_2bedroom_listings <- zipcodes_with_16ormore_2bedroom_listings[order(zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings,decreasing = TRUE),]
highchart() %>%
hc_add_theme(swthm) %>%
hc_title(text = "Number of Airbnb 2 bedroom listings in a few NYC Zipcodes ") %>%
hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode),
title = list(text = "Zipcodes")) %>%
hc_yAxis(title = list(text = "Number of 2 bedroom properties listed")) %>%
hc_add_series_labels_values(zipcodes_with_16ormore_2bedroom_listings$zipcode,zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings,name = " ",colorByPoint=FALSE,type="column",dataLabels = list(enabled = TRUE),showInLegend=FALSE)%>%
hc_tooltip(pointFormat = "Number of 2 bedroom listings: {point.y}")
#hc_add_series(data = zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings, name = "Number of 2 bedroom properties listed in Airbnb in the zipcode", color = "#e5b13a") %>%
categories_grouped <- summary_df %>%
group_by(neighbourhood,zipcode)%>%
do(neighbourhood = .$zipcode) %>%
list_parse()
swthm <- hc_theme_merge(
hc_theme_538(),
hc_theme(
credits = list(
style = list(
color = "#2F4F4F"
)
),
title = list(
style = list(
color = "#2F4F4F"
)
),
chart = list(
backgroundColor = "transparent",
divBackgroundImage = "https://www.publicdomainpictures.net/download-picture.php?id=199969&check=2cf9ef5a06ace28f60059bb9a1fa5201",
style = list(fontFamily = "Georgia")
)
)
)
highchart() %>%
hc_add_theme(swthm) %>%
hc_title(text = "Cost of 2 bedroom properties in NYC zipcodes. Hover over the bars to identify neighbourhoods " ) %>%
hc_xAxis(neighbourhood = categories_grouped,categories = as.character(summary_df$zipcode),title = list(text = "Zipcodes")) %>%
hc_yAxis(title = list(text = "Avg. cost of 2 bedroom property")) %>%
hc_add_series(data = summary_df, type = "column", hcaes(x=neighbourhood,y = current_price, color = neighbourhood),showInLegend = FALSE)
We see the positive correlation. Fitting a generic linear model along the path of the scatterplot offers an interesting pattern. I haven’t followed this route in this analysis. Sidenote: Notice the three points that are distinctly seperated above the center of the line. The zipcodes associated with them feature in my final conclusion.
swthm <- hc_theme_merge(
hc_theme_darkunica(),
hc_theme(
credits = list(
style = list(
color = "#4bd5ee"
)
),
title = list(
style = list(
color = "#4bd5ee"
)
),
chart = list(
backgroundColor = "transparent",
divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
style = list(fontFamily = "Comic Sans MS")
)
)
)
lm.model <- augment(lm(per_night_price ~ current_price, data = summary_df))
highchart() %>%
hc_add_theme(swthm) %>%
hc_title(text="Price of property Vs Rent per day in different zipcodes") %>%
hc_xAxis(title=list(text='Avg. Current Price'))%>%
hc_yAxis(title=list(text='Avg. Rent per Day'))%>%
hc_add_series(lm.model, "scatter",zoomType='xy',size=20, hcaes(x = current_price,y = per_night_price),showInLegend=FALSE) %>%
hc_add_series(lm.model, "line", hcaes(x = current_price, y = .fitted),showInLegend=FALSE)%>%
hc_tooltip(crosshairs=TRUE,headerFormat = "Price of property(x) vs Rent per day (y):")
Here’s a colorful visualization of the estimated revenue per year that can be expected from renting out a single 2 bedroom property in different zipcodes.
swthm <- hc_theme_merge(
hc_theme_darkunica(),
hc_theme(
credits = list(
style = list(
color = "#4bd5ee"
)
),
title = list(
style = list(
color = "#4bd5ee"
)
),
chart = list(
backgroundColor = "transparent",
divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
style = list(fontFamily = "Comic Sans MS")
)
)
)
zipcodes_with_16ormore_2bedroom_listings <- zipcodes_with_16ormore_2bedroom_listings[order(zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,decreasing = TRUE),]
zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property<-as.integer(zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property)
highchart() %>%
hc_add_theme(swthm) %>%
hc_title(text = "Comparing estimated Revenue per year per property in Zipcodes ") %>%
hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode),
title = list(text = "Zipcodes")) %>%
hc_yAxis(title = list(text = "Revenue per year per property"), min=65000, max=135000) %>%
hc_add_series_labels_values(zipcodes_with_16ormore_2bedroom_listings$zipcode,zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,name = " ",colorByPoint=TRUE,type="column",dataLabels = list(enabled = FALSE),showInLegend=FALSE) %>%
hc_add_series_labels_values(labels = zipcodes_with_16ormore_2bedroom_listings$zipcode,values = zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,colorBypoint=TRUE,center = c('85%', '30%'),size = 200,type="pie",dataLabels = list(enabled = FALSE))%>%
hc_tooltip(pointFormat = "Revenue/year/property: {point.y}")
As mentioned earlier, I ignore 7 zipcodes for revenue analysis because of insufficient number of 2 bedroom Airbnb listings in those zipcodes. One possible solution to this issue, as mentioned earlier is as follows:
Important values like revenue might get skewed when a zipcode has very few listings. One possible solution would be to write a function that checks if the percentage differences in avg. airbnb rent between 2 bedroom and other types of listings (1/3/4 bedroom) within each such zipcode are comparable to similar percentage difference measures computed for other New York zipcodes (zipcodes with more number of 2 bedroom airbnb listings). For now, I ignore zipcodes with less than 15 two bedroom airbnb listings.
The ignored zipcodes, the number of listings, and the estimated revenue are below:
datatable(zipcodes_with_15orless_2bedroom_listings[,c("zipcode","neighbourhood","no_of_2bed_listings","Revenue_year_per_property","current_price","Revenue_by_Cost_Ratio")],options=list(dom='t',searching=FALSE, paging=FALSE),colnames = c("Zipcode","Neighbourhood","No. of 2 Bed Listings", "Revenue/Year/Property","Cost of Property","Revenue/Cost Ratio"))
In this plot, expenses that will be incurred after the purchase has been made have been neglected. So a plain ratio, Revenue/Forecasted-purchase-cost (Jan 2019), has been used.
A high Revenue/Cost ratio associated with buying and renting 2 bedroom properties in a zipcode means that barring outliers, it is safe to assume that investing in 2 bedroom properties in the zipcode would be safer because the investment would break even sooner than similar 2 bedroom investments in zipcodes which are associated with lower ratios.
The left y-axis holds revenue per year per property and the right y-axis holds the revenue/cost ratio. Revenue/cost value would help the company understand when they’d break even on a purchase.
swthm <- hc_theme_merge(
hc_theme_monokai(),
hc_theme(
credits = list(
style = list(
color = "#4bd5ee"
)
),
title = list(
style = list(
color = "#4bd5ee"
)
),
chart = list(
backgroundColor = "transparent",
divBackgroundImage = "https://www.publicdomainpictures.net/pictures/130000/nahled/charcoal-background.jpg",
style = list(fontFamily = "Comic Sans MS")
)
)
)
highchart()%>%
hc_add_theme(swthm) %>%
hc_title(text = "Revenue per year per property and Revenue/Cost ratio" ) %>%
hc_subtitle(text="(bars-left yaxis) and (line-right yaxis); to view seperately, select the legends on the bottom right") %>%
hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode), title = list(text = "Zipcodes")) %>%
hc_yAxis_multiples(list(title = list(text = "Revenue/year per property"),labels = list(format = "{value}"),min=65000,max=150000),list(title = list(text = "Revenue/Cost ratio"),labels=list(format ='{value}'),dataLabels=list(enabled=FALSE),opposite=TRUE,min=0.04,max=0.080)) %>%
hc_plotOptions(column = list(stacking = "normal")) %>%
hc_add_series(zipcodes_with_16ormore_2bedroom_listings,type="column",hcaes(x=zipcode,y=Revenue_year_per_property),showInLegend=TRUE) %>%
hc_add_series(zipcodes_with_16ormore_2bedroom_listings,type="line",hcaes(x=zipcode,y=Revenue_by_Cost_Ratio),yAxis=1,showInLegend=TRUE) %>%
hc_tooltip(pointFormat = "Stat: {point.y}")
df1<-zipcodes_with_16ormore_2bedroom_listings
df<-zipcodes_with_16ormore_2bedroom_listings[,c("zipcode","current_price","Revenue_year_per_property","year_5","year_10","year_15","year_20","year_25")]
rr=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[2]-range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[1]
r1=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[2]
r0=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[1]
myDT2<-datatable(
df,rownames = FALSE,
options = list(pageLength = 15,dom='Blfrtip',searching=FALSE,paging=FALSE,
rowCallback=JS(paste0("function(row, data) {",
paste(lapply(3:ncol(df),function(i){
paste0("var value=Math.abs(data[",i,"]);
var value2=data[",i,"];
if (value!==null){
if(value2<0){
$(this.api().cell(row,",i,").node()).css({'background':isNaN(parseFloat(value)) || value <=",r0," ? '' : 'linear-gradient(90deg, transparent ' + (",r1," - value)/",rr," * 100 + '%, #d30c3e ' + (",r1," - value)/",rr," * 100 + '%)','background-size':'98% 80%','background-repeat':'no-repeat','background-position':'center'});
}else{
$(this.api().cell(row,",i,").node()).css({'background':isNaN(parseFloat(value)) || value <=",r0," ? '' : 'linear-gradient(90deg, transparent ' + (",r1," - value)/",rr," * 100 + '%, #018c3a ' + (",r1," - value)/",rr," * 100 + '%)','background-size':'98% 80%','background-repeat':'no-repeat','background-position':'center'},10000);
}
}
if(data[0]!==null ){
$(row).css('color', '#fff');
$(row).css('background', '#212121');
}
if(data[0]== 10036 || data[0]== 10025 || data[0]== 11215 || data[0]== 10022 || data[0]== 11231){
$(row).css('color', '#FFff00');
}")
}),collapse="\n"),
"}"))
,initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#121212', 'color': '#fff'});",
"var css = document.createElement('style');
css.type = 'text/css';
css.innerHTML = '.table.dataTable.hover tbody tr:hover, table.dataTable.display tbody tr:hover { background-color: #939393 !important }';
document.body.appendChild(css);",
"}")
),
container = tags$table(
class="cell-border stripe",
tags$thead(tags$tr(lapply(c("Zipcode", "Avg Price","Rev/Property", "Year 5 est. net","Year 10 est. net", "Year 15 est. net","Year 20 est. net", "Year 25 est. net"), tags$th)))
)
)
myDT2 <- formatStyle(myDT2,
columns = c(1),
backgroundColor = '#535353',
borderBottomColor = "rgb(255, 255, 255)",
borderBottomStyle = "solid",
borderBottomWidth = "1px",
borderCollapse = "collapse",
borderRightColor = "rgb(255, 255, 255)",
borderRightStyle = "solid",
borderRightWidth = "1px",
color = "#fff",
fontWeight = "bold",
lineHeight = "normal"
)
myDT2
Zipcode 10036 in Manhattan is the best zipcode to purchase 2 bedroom properties. The Average revenue ($134,335/year per property) is one of the highest. The avg 2 bedroom price is cheaper than all but one zipcode in Manhattan - $1,729,317 (see plot 2). It has 108 two bedroom listings in Airbnb (not the highest and far from the lowest); and it has the highest Revenue/cost ratio, 0.078, meaning the investment has the highest probability of breaking-even the soonest.
Zipcode 10025 in Manhattan is not far behind. The Average revenue ($107,973) ranks among the middle of the 15 zipcodes. But the most important catch is that it has the least avg. purchase price for 2 bedroom properties in Manhattan at $1,459,632 (see plot 2). It has more 2 bedroom properties listed on Airbnb than 10036 at 112 listings and has a Revenue/Cost ratio of 0.074.
Zipcode 10022 in Manhattan is another great option to purchase 2 bedroom properties. Revenue/year/property - $133,291, Avg. cost of a 2 bedroom property - $1,901,116, Revenue/Cost ratio - 0.07. One thing to note is that the number of 2 bedroom Airbnb listings associated with this zipcode is 38, which is the lowest among these top 5 choices.
Zipcode 11215 in Brooklyn is also a very good location to purchase 2 bedroom properties. It is one of the cheapest options (avg. cost of $987,001). Eventhough the Avg. revenue obtained by renting through Airbnb is $70,272/year/property which is not as high as revenues associated with a few other zipcodes, the cheap purchase cost offsets the disadvantage of reaping lesser revenue and makes zipcode 11215 a great location for investing in 2 bedroom properties. Also, it has the highest number of 2 bedrooms listed on Airbnb at 141 listings. It too has a revenue/cost ratio of 0.071.
Zipcode 11231 in Brooklyn is a lot similar to zipcode 11215 in that the avg. cost of a 2 bedroom property and the avg. revenue per year obtained from a property are pretty close.
The following zipcodes shouldn’t be first preferences to purchase 2 bedroom properties in. The reasons are:
The avg. 2 bedroom properties in the zipcodes that follow are not the most ideal for investment because they don’t bring in commensurate revenue like zipcodes 10036 or 10022 to justify the investment.
Two bedroom properties in zipcodes 11217 and 11201 are priced at $1.3 mil and $1.4mil respectively. 2 bedroom Airbnb listings in these two zipcodes have an avg. annual revenue of NA and NA respectively. Their revenue/cost ratios are 0.062 and 0.056 respectively. These 2 zipcodes have 86 and 68 two bedroom listings on Airbnb.
10011 and 10014 have pricey 2 bedroom properties on average ($2.62mil and $2.25mil repectively). They command high revenues (140,211 and 122,007 respectively) through Airbnb rentals. Their revenue/cost ratios are similar (0.053 and 0.054). These 2 zipcodes have 102 and 95 two bedroom properties listed on Airbnb. They can be ‘okay’ investments if the investment horizon is close to 25 years.
Zipcodes 10023 and 10003 have avg. 2 bedroom properties priced at $2.14 mil and $2.17 mil respectively. Their revenue/cost ratios are 0.052 and 0.055 Revenues associated are $112,405 and $120,051 respectively.
Zipcode 10021 has two bedroom properties priced on avg. at $1.88 mil. Associated revenue/cost ratio is 0.055; the issue is it has only 19 two bedroom listings on Airbnb. Est. Revenue - 104,272.
10036
10025
10022
10215
11231
Economic Risks - Tourism generally booms if the economy is healthy. This needs to be factored in before making large investments in properties since the breakeven period is almost always several years.
Regulatory Issues - Airbnb has a lot of regulartory and legal hurdles. Last fall, the New York State Senate passed a bill that makes online apartment listings for stays shorter than 30 days illegal, which, not surprisingly, thwarts Airbnb in their goal to expand its market.
Customer Segmentation - It is crucial before entering a new market segment to understand the choice patters of consumer behaviour. Hence segmenting consumers on the basis of type(leisure, business,..), financial spending power, age etc would be helpful.
Due to missing zip codes in Zillow data, match rate between Zillow data file and Airbnb file is around 15%.
There are missing values for important columns in both datasets.
There are about 53 zipcodes for New York in airbnb file which don’t have information about 2 bedrooms properties.
Crime Data - Many cities have made their crime data available online.
Weather Data - A city’s weather affects mobility.
Social Media Data
Traffic Data