Identifying Zip Codes for 2-bedroom sets in New York City that would generate the most profit on short term rentals
Background:
You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
The objective is to identify the zip codes would generate the most profit on short term rentals within New York City.
Data: Publicly available data sets from Zillow and AirBnB
library(R.utils) #to use lisitngs.csv.gz file
library(tibble) #for better visibility of data frames
library(data.table) #to read .csv file
library(dplyr) #to manipulate data
library(tidyr) #to tidy data
library(ggplot2) #to better visualize data
library(purrr) #to use functions to clean data
library(ggrepel) #for better visualizations
library(DT) #to preview the data sets
library(kableExtra) #to create HTML Table
listings <- fread("listings.csv.gz")
zillow <- fread("Zip_Zhvi_2bedroom.csv")
# creating a copy of the original data frame
listings1 <- as.tibble(listings)
# removing columns that are not used for analysis
cols.remove.list <- c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,46,48,51,
59,66,68:76,81:84,86:93,95)
listings1 <- listings1[,-cols.remove.list]
glimpse(listings1)
## Observations: 40,753
## Variables: 29
## $ id <int> 7949480, 16042478, 1886820, 662...
## $ neighbourhood_group_cleansed <chr> "Bronx", "Bronx", "Bronx", "Bro...
## $ city <chr> "Bronx", "Bronx", "Bronx", "Cit...
## $ state <chr> "NY", "NY", "NY", "NY", "NY", "...
## $ zipcode <chr> "10464", "10464", "10464", "104...
## $ market <chr> "New York", "New York", "New Yo...
## $ country_code <chr> "US", "US", "US", "US", "US", "...
## $ latitude <dbl> 40.85205, 40.85349, 40.84114, 4...
## $ longitude <dbl> -73.78868, -73.78861, -73.78305...
## $ property_type <chr> "House", "Apartment", "House", ...
## $ room_type <chr> "Private room", "Private room",...
## $ accommodates <int> 2, 4, 4, 3, 4, 2, 4, 3, 5, 8, 2...
## $ bathrooms <dbl> 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1...
## $ bedrooms <int> 1, 1, 3, 1, 1, 0, 1, 1, 1, 1, 1...
## $ beds <int> 1, 1, 3, 1, 1, 1, 2, 2, 1, 3, 1...
## $ bed_type <chr> "Real Bed", "Real Bed", "Real B...
## $ square_feet <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ price <chr> "$99.00", "$200.00", "$300.00",...
## $ weekly_price <chr> "", "", "", "$775.00", "$350.00...
## $ monthly_price <chr> "", "", "", "", "$1,200.00", ""...
## $ security_deposit <chr> "$100.00", "", "$800.00", "", "...
## $ cleaning_fee <chr> "", "", "$100.00", "$75.00", "$...
## $ extra_people <chr> "$20.00", "$0.00", "$25.00", "$...
## $ number_of_reviews <int> 25, 0, 0, 12, 86, 41, 74, 114, ...
## $ first_review <chr> "2016-01-18", "", "", "2015-07-...
## $ last_review <chr> "2017-04-23", "", "", "2016-10-...
## $ review_scores_rating <int> 100, NA, NA, 93, 97, 97, 98, 90...
## $ review_scores_location <int> 10, NA, NA, 10, 10, 10, 10, 9, ...
## $ calculated_host_listings_count <int> 1, 1, 1, 1, 1, 1, 1, 4, 3, 4, 4...
Reason for removal:
Some columns only have text. Text has not been analyzed for this project.
Columns that were not been used in the merged file for analysis were removed as well.
# there are "$" signs in front of prices that need to be removed
# getting index of these columns
price.fee.index<- grep("^price$|^weekly_price$|^monthly_price$|^security_deposit$|^cleaning_fee$|^extra_people$", colnames(listings1))
# removing "$" sign
listings1[, price.fee.index] <- as.tibble(apply(listings1[, price.fee.index], 2,
function(x) {as.numeric(gsub(pattern="\\$", replacement = "", x))}))
# converting price and fee cols to numeric
listings1[,price.fee.index] <- listings1[,price.fee.index] %>%
mutate_all(as.numeric)
# cleaned data
listings1[,price.fee.index]
## # A tibble: 40,753 x 6
## price weekly_price monthly_price security_deposit cleaning_fee
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 99 NA NA 100 NA
## 2 200 NA NA NA NA
## 3 300 NA NA 800 100
## 4 125 775 NA NA 75
## 5 69 350 NA NA 17
## 6 125 550 NA 500 35
## 7 85 NA NA NA 30
## 8 39 NA NA NA 20
## 9 95 NA NA 100 40
## 10 125 NA NA 150 50
## # ... with 40,743 more rows, and 1 more variable: extra_people <dbl>
# similarly converting relevant columns to factor
listings1[,c("neighbourhood_group_cleansed", "property_type","room_type","bathrooms")] <-
listings1[,c("neighbourhood_group_cleansed", "property_type","room_type","bathrooms")] %>%
mutate_all(as.factor)
# checking states data
unique(listings1$state)
## [1] "NY" "MP" "New York" "ny" "NJ" "VT"
# changing "New York"and "ny" entry to "NY" to maintain uniformity
listings1$state[which(listings1$state=="New York")] <- "NY"
listings1$state[which(listings1$state=="ny")] <- "NY"
unique(listings1$state)
## [1] "NY" "MP" "NJ" "VT"
# checking city data
unique(listings1$city[grep(pattern="^n|^N", listings1$city)])
## [1] "New York" "new york"
## [3] "NY" "New York City"
## [5] "NEW YORK" "New York, New York, US"
## [7] "New York, US" "New york"
## [9] "NUEVA YORK" "New York, Brooklyn"
## [11] "new york\nnew york" "Nova York"
## [13] "Nyc" "New York city"
## [15] "NYC" "ny"
## [17] "nyc" "New york , Ridgewood "
## [19] "new York" "Nueva York"
# all the cities starting with "n" or "N" are actually New York City. Cleaning those observations to New York City to maintain uniformity
listings1$city[grep(pattern="^n|^N", listings1$city)] <- "New York City"
unique(listings1$city[grep(pattern="^n|^N", listings1$city)])
## [1] "New York City"
# zip codes: looking at unique zip codes in New York City
# unique(listings1$zipcode)
# some missing zip codes and group of zip codes such as "10003-8623". We can impute values of zip codes using latitudes and longitudes. Will do it in another section below.
# We get the 5 Boroughs as the neighbourhood_group_cleansed. Sanity check.
unique(listings1$neighbourhood_group_cleansed)
## [1] Bronx Queens Brooklyn Staten Island Manhattan
## Levels: Bronx Brooklyn Manhattan Queens Staten Island
# Selecting only 2 bedroom properties in New York City
list.2bed.NY <- listings1 %>% filter(bedrooms==2 & state=="NY" & (city == "New York City" | market== "New York") & country_code =="US")
dim(list.2bed.NY)
## [1] 4881 29
# checking room type and property type
unique(list.2bed.NY$property_type)
## [1] Apartment House Townhouse
## [4] Boat Loft Other
## [7] Condominium Bed & Breakfast Guesthouse
## [10] Villa Bungalow Timeshare
## [13] Guest suite Serviced apartment
## 27 Levels: Apartment Bed & Breakfast Boat Boutique hotel ... Villa
unique(list.2bed.NY$room_type)
## [1] Entire home/apt Private room
## Levels: Entire home/apt Private room Shared room
# price and square_feet of private rooms in a 2 bed room property should be doubled. The following assumption takes care of this issue.
Assumption: The real-estate company will purchase complete 2 bed room properties. Room type = “private room” in a 2 bed room property will require the price and square feet to be doubled.
New Variables Created
sq.feet.cor: Room_type = “private room” in a 2 bed room property required the square feet to be doubled. sq.feet.cor has been used for analysis instead of the given “square_feet”.
price_cor: Room_type = “private room” in a 2 bed room property required the price to be doubled. Price_cor has been used for analysis instead of the given “price”.
# checking room type and property type
unique(list.2bed.NY$property_type)
## [1] Apartment House Townhouse
## [4] Boat Loft Other
## [7] Condominium Bed & Breakfast Guesthouse
## [10] Villa Bungalow Timeshare
## [13] Guest suite Serviced apartment
## 27 Levels: Apartment Bed & Breakfast Boat Boutique hotel ... Villa
unique(list.2bed.NY$room_type)
## [1] Entire home/apt Private room
## Levels: Entire home/apt Private room Shared room
# corrected price
list.2bed.NY <- list.2bed.NY %>% mutate(price_cor = ifelse(room_type=="Private room", 2*price,price ))
# the corresponding sq feet will also be doubled
list.2bed.NY <- list.2bed.NY %>% mutate(sq.feet.cor = ifelse(room_type=="Private room", 2*square_feet,square_feet ))
# checikng missing values
colSums(is.na(list.2bed.NY))
## id neighbourhood_group_cleansed
## 0 0
## city state
## 0 0
## zipcode market
## 0 0
## country_code latitude
## 0 0
## longitude property_type
## 0 0
## room_type accommodates
## 0 0
## bathrooms bedrooms
## 8 0
## beds bed_type
## 3 0
## square_feet price
## 4768 41
## weekly_price monthly_price
## 4494 4879
## security_deposit cleaning_fee
## 2373 963
## extra_people number_of_reviews
## 0 0
## first_review last_review
## 0 0
## review_scores_rating review_scores_location
## 1017 1026
## calculated_host_listings_count price_cor
## 0 41
## sq.feet.cor
## 4768
Noteworthy Observations:
# finding the weird zip codes (such as "11426-1175" "10003-8623") and assigning NA to them. We wiil use latitudes and longitudes to find the correct zip codes below
weird.zip.index <- which(nchar(list.2bed.NY$zipcode)==10)
list.2bed.NY$zipcode[weird.zip.index] <- NA
list.2bed.NY$zipcode <- as.numeric(list.2bed.NY$zipcode)
# 64 observations with no zip codes
length(list.2bed.NY$zipcode[list.2bed.NY$zipcode==""])
## [1] 64
list.2bed.NY$zipcode[list.2bed.NY$zipcode==""] <- NA
sum(is.na(list.2bed.NY$zipcode))
## [1] 64
str(list.2bed.NY$zipcode)
## num [1:4881] 10462 10469 11102 11102 11105 ...
# Running a loop to pick up closest zip codes for these obsrvations by matching neighbourhood_group_cleansed and long and lat
for (k in 4:1){
ind <- which(is.na(list.2bed.NY$zipcode))
for(i in seq_along(ind)){
for (j in 1:nrow(list.2bed.NY)){
if(round(list.2bed.NY$latitude[ind[i]],k)==round(list.2bed.NY$latitude[j],k) &&
round(list.2bed.NY$longitude[ind[i]],k)==round(list.2bed.NY$longitude[j],k) &&
list.2bed.NY$neighbourhood_group_cleansed[ind[i]] == list.2bed.NY$neighbourhood_group_cleansed[j])
{
list.2bed.NY$zipcode[ind[i]] = list.2bed.NY$zipcode[j]
}
}}
}
# number of zip codes with missing values
sum(is.na(list.2bed.NY$zipcode))
## [1] 0
# removing outliers in sq feet
summary(list.2bed.NY$sq.feet.cor)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 750.0 900.0 988.9 1100.0 10118.0 4768
boxplot(list.2bed.NY$sq.feet.cor)
# outliers in the boxplot
boxplot.stats(list.2bed.NY$sq.feet.cor)$out
## [1] 32 1800 80 2000 1800 0 0 10118 1800 1700 2200
## [12] 0 0 1800 0 0 11 1
# removing extreme values which definitely seem wrong
# this step has to be manually done after reviewing the outliers and has to be hardcoded
sq.feet.outlier <- which(list.2bed.NY$sq.feet.cor %in% c(0,1,11,32, 10118))
list.2bed.NY <- list.2bed.NY[-sq.feet.outlier,]
sum(!is.na(list.2bed.NY$sq.feet.cor))
## [1] 103
# Only 103 observations have sq footage data
# outliers in the boxplot of price. Looks fine.
summary(list.2bed.NY$price_cor)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 140.0 195.0 224.5 275.0 1140.0 41
boxplot(list.2bed.NY$price_cor)
# Looking at the distribution of available square feet data for neighbourhoods
list.2bed.NY %>% filter(!is.na(sq.feet.cor)) %>% dplyr::select(neighbourhood_group_cleansed) %>% table()
## .
## Bronx Brooklyn Manhattan Queens Staten Island
## 0 50 43 10 0
# Bronx and Staten Island have no values for sq feet cor
# selecting observations with non missing sq.feet.cor value
model.data <- list.2bed.NY %>% filter(!is.na(sq.feet.cor)) %>% dplyr::select(neighbourhood_group_cleansed,
property_type, room_type,sq.feet.cor, bathrooms,beds,accommodates,price_cor, weekly_price,
monthly_price, security_deposit ,cleaning_fee, extra_people)
# Looking at the number of missing observations for other variables
colSums(is.na(model.data))
## neighbourhood_group_cleansed property_type
## 0 0
## room_type sq.feet.cor
## 0 0
## bathrooms beds
## 1 0
## accommodates price_cor
## 0 0
## weekly_price monthly_price
## 81 103
## security_deposit cleaning_fee
## 37 15
## extra_people
## 0
# Removing weekly_price, monthly_price, security_deposit as they have a large number of missing values
model.data <- model.data %>% dplyr::select(-c("weekly_price","monthly_price","security_deposit"))
# checking property type
table(model.data$property_type)
##
## Apartment Bed & Breakfast Boat
## 87 0 0
## Boutique hotel Bungalow Cabin
## 0 0 0
## Castle Cave Chalet
## 0 0 0
## Condominium Dorm Earth House
## 1 0 0
## Guest suite Guesthouse Hostel
## 0 0 0
## House Hut In-law
## 10 0 0
## Lighthouse Loft Other
## 0 3 0
## Serviced apartment Tent Timeshare
## 0 0 0
## Townhouse Vacation home Villa
## 2 0 0
# Only 1 Condominium, 2 Townhouse, and 3 Lofts. These are too few values to correctly predict the missing values for these catgories
model.data <- model.data %>% filter(!property_type %in%c("Condominium","Townhouse","Loft"))
# omitting 1 observation with no value for number of bathrooms
model.data <- model.data[-which(is.na(model.data$bathrooms)),]
# 14 values of cleaning fee missing
colSums(is.na(model.data))
## neighbourhood_group_cleansed property_type
## 0 0
## room_type sq.feet.cor
## 0 0
## bathrooms beds
## 0 0
## accommodates price_cor
## 0 0
## cleaning_fee extra_people
## 14 0
dim(model.data)
## [1] 96 10
# Manhattan looks more pricey than Brooklyn and Queens
model.data %>%
ggplot(aes(x=sq.feet.cor, y=price_cor, col=neighbourhood_group_cleansed)) + geom_point(size=3) +
labs(title = "Price vs Square Feet of Property", subtitle = "Manhattan looks pricier than Brooklyn and Queens",
x = "Square Feet",
y = "Price per Night ($)", color = "Neighbourhood") +
theme_classic()
# price_cor vs square feet by number of bathrooms
model.data %>%
ggplot(aes(x=sq.feet.cor, y=price_cor, shape=bathrooms)) + geom_point(size=3) +
labs(title = "Price vs Square Feet of Property", subtitle = "Higher Square Feet corresponds to higher chances of >1 bathrooms",
x = "Square Feet",
y = "Price per Night ($)", shape = "bathrooms") + theme_classic()
# number of bathrooms and square feet
model.data %>%
ggplot(aes(x=sq.feet.cor, fill=bathrooms)) + geom_histogram(position="dodge", binwidth = 250) +
labs(title = "Distribution of Square Feet of Property by # of Bathrooms", subtitle = "Higher square feet show more occurences of bathrooms>1",
x = "Square Feet",
y = "Count") + theme_classic()
# cleaning fee vs square feet
model.data %>%
ggplot(aes(x=sq.feet.cor, y=cleaning_fee)) + geom_point(size=3) +
labs(title = "Cleaning Fee vs Square Feet of Property", subtitle = "Cleaning fee shows a mild linear relationship with square feet",
x = "Square Feet",
y = "Cleaning Fee ($)") + theme_classic()
# room_type and square feet
model.data %>%
ggplot(aes(x=sq.feet.cor, fill=room_type)) + geom_histogram(position="dodge", binwidth = 250) +
labs(title = "Distribution of Square Feet of Property by type of room", subtitle = "Private Rooms either have very large or small square_feet",
x = "Square Feet",
y = "Count", fill="Room Type") + theme_classic()
# checking correlation among continous variables
sub.set <- na.omit(model.data)
cor(sub.set[,-c(1:3,5)])
## sq.feet.cor beds accommodates price_cor cleaning_fee
## sq.feet.cor 1.00000000 -0.04866056 -0.09844025 0.45263004 0.2929749
## beds -0.04866056 1.00000000 0.70227726 0.03367773 0.1467176
## accommodates -0.09844025 0.70227726 1.00000000 -0.02736240 0.1037853
## price_cor 0.45263004 0.03367773 -0.02736240 1.00000000 0.4842777
## cleaning_fee 0.29297489 0.14671756 0.10378530 0.48427771 1.0000000
## extra_people 0.15190545 0.11911416 -0.02202184 0.34505990 0.2252941
## extra_people
## sq.feet.cor 0.15190545
## beds 0.11911416
## accommodates -0.02202184
## price_cor 0.34505990
## cleaning_fee 0.22529407
## extra_people 1.00000000
price_cor, cleaning_fee, and bathrooms have a noteworthy and positive correlation with sq.feet.cor
Manhattan looks pricier than Brooklyn and Queens
Higher Square Feet corresponds to higher chances of >1 bathrooms
Distribution of Square Feet of Property by type of room = Private Rooms either have very large or small square_feet
# Variable Selection
# I have use backward stepwise selection for variable selection
nullmodel=lm(sq.feet.cor~1, data=model.data)
fullmodel=lm(sq.feet.cor~., data=model.data)
model_step_b <- step(fullmodel,direction='backward', trace=0)
mod.summary <- summary(model_step_b)
# adj.r.sq
mod.summary$adj.r.squared
## [1] 0.3921451
# MSE
(mod.summary$sigma)^2
## [1] 64618.85
# Final Model: sq.feet.cor ~ room_type + bathrooms + price_cor + cleaning_fee
# Adj R sq values and model Model MSE are not that great. But with so less number of observations, this is fine.
final.model <- lm(sq.feet.cor ~ room_type + bathrooms + price_cor + cleaning_fee, data=model.data)
# Residual Analysis
aug.l_model.manh<- final.model %>% broom::augment() %>% mutate(row_num=1:n())
# residuals
ggplot(data=aug.l_model.manh, aes(x=.fitted,y=.std.resid))+geom_point() + geom_smooth(se=FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2", size=1) +
geom_hline(yintercept = c(-2, 2), linetype = "dotted", size=1) +
xlab("Fitted value") + ylab("Standardized residual") + ggtitle("Standardized Residuals vs Fitted Values")+
theme(plot.title = element_text(hjust = 0.5))
# QQ Plot
ggplot(data=aug.l_model.manh, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") + ggtitle("Q-Q Plot")+
theme(plot.title = element_text(hjust = 0.5))
# Number of missing values for predictors: bathrooms and cleaning_Fee
sum(is.na(list.2bed.NY$bathrooms))
## [1] 8
sum(is.na(list.2bed.NY$cleaning_fee))
## [1] 962
# selecting obs with mising values for square feet and non missing values for predictors
# we are selecting property type = Apartment and House because we did not have other property type categories in our model
sq.foot.miss <- list.2bed.NY %>% filter(is.na(sq.feet.cor) & !is.na(price_cor) &
!is.na(room_type) & !is.na(bathrooms)& !(bathrooms == "0.5") & !is.na(cleaning_fee) & property_type %in% c("Apartment","House")) %>%
dplyr::select(id, sq.feet.cor,price_cor, bathrooms, room_type, cleaning_fee)
# checking dimension of this data frame
dim(sq.foot.miss)
## [1] 3552 6
sq.foot.miss$sq.f.prd <- predict(object = final.model, newdata = sq.foot.miss[,-1])
sum(!is.na(sq.foot.miss$sq.f.prd))
## [1] 3552
join.var <- sq.foot.miss %>% dplyr::select(id,sq.f.prd)
# joining join.var with originial data frame list.2bed.NY
list.2bed.NY <- list.2bed.NY %>% left_join(join.var, c("id"))
# merging the given sq feet values in the predicted sq feet column
# index of observations with non missing sq.feet.cor values in original data
x <- which(!is.na(list.2bed.NY$sq.feet.cor))
# merging the original sq.feet.cor data into the predicted sq.f.prd values
list.2bed.NY$sq.f.prd[x] <- list.2bed.NY$sq.feet.cor[x]
\[ medp = Median Price of RegionName for that year \]
Assumptions:
The median of monthly median property prices for a RegionName (zipcode) will correctly reflect the median market cost of property in a zip code
For 2017, the medp has been taken as the median price for the 6 months because the data is only available for 6 months in 2017
zillow1 <- as.tibble(zillow)
# Filtering data for New York in NY
zillow.NY <- zillow1 %>% filter(City == "New York" & State == "NY")
dim(zillow.NY)
## [1] 25 262
# checking how many NAs we have in the data
sum(is.na(zillow.NY))
## [1] 1696
# 1696 NAs
# checking NAs column wise
# colSums(is.na(zillow.NY)) # the output takes too much space
# all the NAs are in property pricing
# tidying year and months in 1 column for each. 1 column for price.
zill.NY.y.m <- zillow.NY %>% gather(year_month, prop.val, -c(1:7)) %>%
separate(year_month, c("year", "month"), sep="-")
head(zill.NY.y.m)
## # A tibble: 6 x 10
## RegionID RegionName City State Metro CountyName SizeRank year month
## <int> <int> <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 61639 10025 New ~ NY New ~ New York 1 1996 04
## 2 61637 10023 New ~ NY New ~ New York 3 1996 04
## 3 61703 10128 New ~ NY New ~ New York 14 1996 04
## 4 61625 10011 New ~ NY New ~ New York 15 1996 04
## 5 61617 10003 New ~ NY New ~ New York 21 1996 04
## 6 62012 11201 New ~ NY New ~ Kings 32 1996 04
## # ... with 1 more variable: prop.val <int>
# taking median price for each year
zill.NY.y.medp <- zill.NY.y.m %>% group_by(RegionName, year,CountyName ,SizeRank) %>% summarise(medp = median(prop.val)) %>%
ungroup()
# removing observations where median price is not available
zill.NY.y.medp <- na.omit(zill.NY.y.medp)
zill.NY.y.medp <- zill.NY.y.medp %>% arrange(RegionName, year)
zill.NY.y.medp$year <- as.numeric(zill.NY.y.medp$year)
head(zill.NY.y.medp)
## # A tibble: 6 x 5
## RegionName year CountyName SizeRank medp
## <int> <dbl> <chr> <int> <dbl>
## 1 10003 2005 New York 21 1130000
## 2 10003 2006 New York 21 1378050
## 3 10003 2007 New York 21 1414900
## 4 10003 2008 New York 21 1532950
## 5 10003 2009 New York 21 1396050
## 6 10003 2010 New York 21 1278700
# checking missing data
miss.data <- zill.NY.y.medp %>% dplyr::select(RegionName, year, medp) %>% spread(year, medp)
colSums(is.na(miss.data))
## RegionName 1996 1997 1998 1999 2000
## 0 17 17 17 16 16
## 2001 2002 2003 2004 2005 2006
## 16 17 17 13 4 1
## 2007 2008 2009 2010 2011 2012
## 1 0 0 0 0 0
## 2013 2014 2015 2016 2017
## 0 0 0 0 0
\[CAGR = ((medp2 / medp1)^n -1)*100\]
medp1: Medp for year t
medp2: Medp for year t+x
n (number of years): t + x - t = x
# unique zip codes
unique.zip <- unique(zill.NY.y.medp$RegionName)
# x: start year; y:end year
cagr.cal <- function(x,y){
n_yr <- y - x + 1
cagr <- rep(NA, length(unique.zip))
for (k in seq_along(unique.zip)){
price_st.yr <- zill.NY.y.medp %>% filter(RegionName==unique.zip[k] & year==x) %>% dplyr::select(medp)
price_end.yr <- zill.NY.y.medp %>% filter(RegionName==unique.zip[k] & year==y) %>% dplyr::select(medp)
price_st.yr <- price_st.yr[[1]]
price_end.yr <- price_end.yr[[1]]
cagr[k] <- ((price_end.yr/price_st.yr)^(1/n_yr) - 1)*100
}
return(cagr)
}
# taking 5 year and 2 year cagr
cagr5_val <- cagr.cal(2012,2017)
cagr2_val <- cagr.cal(2015,2017)
# creating a data frame with the cagr values and zip codes
cagr.df <- tibble(zip = unique.zip, cagr5_val,cagr2_val)
head(cagr.df)
## # A tibble: 6 x 3
## zip cagr5_val cagr2_val
## <int> <dbl> <dbl>
## 1 10003 6.55 3.50
## 2 10011 7.30 4.43
## 3 10013 3.36 1.37
## 4 10014 5.95 3.47
## 5 10021 4.96 5.13
## 6 10022 4.79 2.93
Assumptions:
CAGR would be a decent indicator to forecast property and rent prices
2019 median property costs would be better reflected by using cagr 2 year growth rate and forecasting the present property costs using 2017 median property costs
\[2019 Median Property Cost = (2017 Median Property Cost)*(1 + (CAGR/100))^2\]
# Forecasting median property price in each zip code in 2019 using 2 year CAGR growth
# joining data frame with cagr values to zill.NY.y.medp data frame containing yearly median price
zill.NY.y.medp <- zill.NY.y.medp %>% filter(year==2017) %>% left_join(cagr.df, by = c("RegionName" = "zip"))
# ASSUMPTION
# calculating 2019 median property value using 2year cagr (immediate market sentiment reflected)
zill.NY.y.medp$medp.2019 <- zill.NY.y.medp$medp*((1+(zill.NY.y.medp$cagr2_val/100))^2)
# renaming medp to medp.2017 to maintain consistency in col names
names(zill.NY.y.medp)[names(zill.NY.y.medp) == 'medp'] <- 'medp.2017'
# tidying the final data frame zill.NY.y.medp
zill.NY.y.medp <- zill.NY.y.medp %>% dplyr::select(-c("year","CountyName"))
zill.NY.y.medp <- zill.NY.y.medp[,c(1,2,3,6,4,5)]
head(zill.NY.y.medp)
## # A tibble: 6 x 6
## RegionName SizeRank medp.2017 medp.2019 cagr5_val cagr2_val
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 10003 21 2005500 2148220. 6.55 3.50
## 2 10011 15 2354000 2567311. 7.30 4.43
## 3 10013 1744 3212450 3301149. 3.36 1.37
## 4 10014 379 2476250 2651225. 5.95 3.47
## 5 10021 190 1709950 1889716. 4.96 5.13
## 6 10022 894 1863650 1974295. 4.79 2.93
merge.data.fn <- function(df1,df2){
return_df <- df1 %>% inner_join(df2, c("zipcode"="RegionName"))
return(return_df)
}
# merged file
data.merged <- merge.data.fn(list.2bed.NY,zill.NY.y.medp)
Assumption: It has been given that all properties and all square feet within each locale can be assumed to be homogeneous. Hence, each zipcode in a neighbourhood group will have medp.2019 as the median property price. I have assumed that the per unit square feet property cost for a neighbourhood would be the medp.2019 / median square foot of properties in a neighbourhood.
\[ Square Foot Unit Cost = Median Property Price in Neighbourhood in 2019 / Median Square Feet in Neighbourhood\] \[ Property Price in 2019 = Square Foot Unit Cost * Square Foot of Property \]
New Variables:
medp.neigh.2019: median of medp.2019 values for a given neighborhood
med.sq.ft : median square foot of properties in a given neighborhood
sq.ft.unit.cost: unit cost of sq feet in a neighbourhood in 2019
prop.price.2019: price of the specific rental property. It will be different for all properties
# removing values for which predicted sq ft (sq.f.prd) value is NA
data.merged <- data.merged[-which(is.na(data.merged$sq.f.prd)),]
# finding median sq feet in each zip code, unit cost of sq feet in each neighborhood group cleansed, and property price for each row
data.merged <- data.merged %>% group_by(neighbourhood_group_cleansed) %>%
mutate(med.sq.ft = median(sq.f.prd)) %>% mutate(medp.neigh.2019 = median(medp.2019)) %>% arrange(med.sq.ft) %>%
mutate(sq.ft.unit.cost = medp.neigh.2019/med.sq.ft) %>% mutate(prop.price.2019=sq.ft.unit.cost*sq.f.prd)
Assumption: The increase in rent per night (price) for 2019 would be at the same rate as the increase in property values (2 year CAGR) for that zip code from 2017 to 2019. In case the CAGR is negative then there would be no increase in rent. It will remain the same.
\[ Daily Rent (2019) = Daily Rent in 2017 * (1 + (CAGR/100))^2 \]
New Variables:
data.merged <- data.merged %>% mutate(day.rent.2019 = ifelse(cagr2_val<0,price_cor, price_cor*((1+(cagr2_val/100))^2)))
Assumption:
Since the break even time is for longer periods, I have used the 5 year CAGR calculated from 2012 - 2017 change in property cost value in Zillow file as the rate at which rent will increase in the years following 2019
75% occupancy has been assumed in a year for all neighbourhoods and zip codes
Day rent would be charged for all the days
Cleaning Fee is not included as there would be low margin to earn for this parameter after adjusting for labor charges
Security deposit is not included in the revenue as it would generally be returned or used up for repairs
Logic for calculating break even years (t):
\[ Property Cost in 2019 = (365*0.75* Day rent in 2019)*(1 + (1 + CAGR/100) + (1 + CAGR/100)^2 ...(1+CAGR/100)^t) \]
Solving the above equation gives the following result:
\[ t = log10(1 + (Property Cost in 2019 * (CAGR/100)/(365*0.75*day rent))) / log10(1 + CAGR/100) \]
New Variables:
data.merged <- data.merged %>%
mutate(break.even.yrs =
log10(1+(prop.price.2019*(cagr5_val/100)/(365*0.75*day.rent.2019)))/log10(1 + cagr5_val/100))
Both of the following calculations are at zip code level.
\[ Median Break Even Years = median(break even years of all properties in a zip code) \] \[ Revenue Cost Ratio = Median Property Price in 2019 / Median Break Even Years \]
# calculating median break even years and rev cost ratio of each zip code
zip.med.break.even <- data.merged %>% dplyr::select(zipcode,medp.2019,day.rent.2019, break.even.yrs) %>% group_by(zipcode) %>%
summarise(cost=mean(medp.2019), med.day.rent.2019 = median(day.rent.2019), med.break.even.year = median(break.even.yrs)) %>%
mutate(rev.cost.ratio= round((365*0.75*med.day.rent.2019/cost)*100,2)) %>%
ungroup() %>% dplyr::select(zipcode,med.break.even.year,rev.cost.ratio)
# joining files
data.merged <- data.merged %>% left_join(zip.med.break.even, c("zipcode"))
# Distribution of Time Taken by Properties to Break Even
data.merged %>% ggplot(aes(x=ceiling(break.even.yrs))) +
geom_histogram(fill = "#FF6666") + xlab("Years Taken to Break Even") +
ylab("Frequency") + ggtitle("Distribution of Time Taken by Properties to Break Even") +
theme(plot.title = element_text(hjust = 0.5)) + theme_classic()
# Distribution of Time Taken by Properties to Break Even - by neighborhood
data.merged %>% ggplot(aes(x=ceiling(break.even.yrs), fill=neighbourhood_group_cleansed)) +
geom_histogram() +
labs(title = "Distribution of Time Taken by Properties to Break Even", x = "Years Taken to Break Even",
y = "Frequency", fill = "Neighbourhood")+
theme(plot.title = element_text(hjust = 0.5)) +
theme_classic()
Analysis
The median break even time of all properties in all zipcodes is 15.6 and the mean break even time is 16.2
Let us take a look at median break even time separately for each zip code
data.merged %>% group_by(zipcode) %>% summarise(med.break.even.year = mean(round(med.break.even.year,1))) %>%
ggplot(aes(x=reorder(as.factor(zipcode),med.break.even.year), y=med.break.even.year)) +
geom_bar(stat="identity", fill = "#FF6666") +
labs(title = "Median Break Even Year Comparison", subtitle = "Each bar is the median of break even years of all properties in a zip code", x = "Zip Codes",
y = "Median Break Even Years") +
geom_text( aes(label = med.break.even.year, y = med.break.even.year),position = position_dodge(width = 1),hjust =1.3) +
coord_flip() +
theme_light()
11434, 10304, and 10305 are some of the zipcodes where the median break even time is the lowest. 10304 and 10305 belong to Staten Island. 11434 is in Queens. It is important to find out how many properties are there in these zip codes?
Let us take a look at some details about zipcodes with specific properties with break even time < 10 years
# Median Property Price vs Number of Properties
data.merged %>%
filter(break.even.yrs<=10) %>%
group_by(neighbourhood_group_cleansed, zipcode) %>%
summarise(count=n(), medp.2019 = mean(medp.2019)) %>%
ungroup() %>%
ggplot(aes(x=count, y=medp.2019, col=as.factor(neighbourhood_group_cleansed))) +
geom_point(size=3) +
geom_label_repel(aes(label=zipcode), hjust=-0.25, vjust=0, size=4, show.legend = FALSE) +
scale_y_continuous(labels = scales::comma, limits = c(0,1700000)) +
labs(title = "Median Property Price vs Number of Properties", subtitle = "Each dot is a Zip Code having properties with break even \n time < 10 years", x = "Number of Properties",
y = "Median Property Price (2019)", color = "Neighbourhood") +
xlim(c(0,8))
Analysis:
Zip codes 10305, 10304 in Staten Island show property values < $500,000. However, there is only 1 property in each Zip Code.
Queens has 1 Zip Code that has 4 properties with cost <$500,000. This looks promising.
Brooklyn has higher property costs but it has many zipcodes with a lot of properties with < 10 years break even time.
# creating categories of break even years for zip codes > 20 properties
zipcodes.nprop.20plus <- data.merged %>% group_by(zipcode) %>% summarise(count=n()) %>% filter(count>20) %>%
dplyr::select(zipcode)
stack.fill.plot <- data.merged %>%
group_by(zipcode) %>%
dplyr::select(zipcode, break.even.yrs) %>%
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>%
mutate(BE.catg = if_else(break.even.yrs<=10,"Less Than 10 Years",
ifelse(break.even.yrs<=15,"10-15 Years","Greater Than 15 Years"))) %>% ungroup()
# reordering levels for better visualization
stack.fill.plot$BE.catg <- as.factor(stack.fill.plot$BE.catg)
stack.fill.plot$BE.catg <- factor( stack.fill.plot$BE.catg, levels = c("Greater Than 15 Years","10-15 Years","Less Than 10 Years"))
# Comparison of Break Even Time for Properties in Zip Codes
ggplot(stack.fill.plot,aes(x=as.factor(zipcode), fill = as.factor(BE.catg))) +
geom_bar(position = "fill") +
# geom_text(stat = "count", aes(label =..count..), position = position_fill(vjust = 1), hjust =1.3, size = 3.5) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
labs(title = "Comparison of Break Even Time for Properties in Zip Codes", subtitle = "Each bar displays the % of properties in a zip code belonging \n to a break even time category", x = "Zip Codes", y = "Proportion of Number of Properties", fill = "Break Even Time") + theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
# Revenue to Cost Ratio
data.merged %>% group_by(zipcode) %>% summarise(rev.cost.ratio = mean(rev.cost.ratio)) %>%
ggplot(aes(x=reorder(as.factor(zipcode),rev.cost.ratio), y=rev.cost.ratio)) +
geom_bar(stat="identity", fill = "#FF6666") +
labs(title = "Revenue to Cost Ratio", subtitle = "High ratio is indicative of a lower property cost and higher profit after break even", x = "Zip Codes",
y = "Revenue/Cost Ratio (%)") +
geom_text( aes(label = rev.cost.ratio, y = rev.cost.ratio),position = position_dodge(width = 1),hjust =1.3) +
coord_flip() +
theme_light()
Zip codes such as 10305, 10304 belonging to Staten Island are showing exceptionally high revenue cost ratio because the number of properties in them are only 1 for this comparison.
11434 belonging to Queens looks very promising.
Parameters Used for Selection:
Selection Criteria:
Any zipcode that satisfied any 2 of the above parameters was included in the final list. The final list of zipcodes is arranged in ascending order of median break even years.
A function, best.zip.codes has been created, which only needs to be fed the final file after all the calculations in step 16.
best.zip.codes <- function(x) {
# Median break even time of a zipcode (top 10)
top10.med.break.even.years <- x %>%
group_by(zipcode) %>%
summarise(med.break.even.year = mean(med.break.even.year)) %>%
arrange(med.break.even.year) %>%
top_n(-10) %>% arrange(med.break.even.year)
# Revenue cost ratio of a zipcode (top 10)
top10.rev.cost.ratio <- x %>%
group_by(zipcode) %>%
summarise(rev.cost.ratio = mean(rev.cost.ratio)) %>%
arrange(rev.cost.ratio) %>%
top_n(10) %>%
arrange(desc(rev.cost.ratio))
# Zipcodes with number of properties > 20
zipcodes.nprop.20plus <- x %>%
group_by(zipcode) %>%
summarise(count=n()) %>%
filter(count>20) %>%
dplyr::select(zipcode)
# calculations for break even time categories
break.even.year.cat <- x %>%
group_by(zipcode) %>%
dplyr::select(zipcode, break.even.yrs) %>%
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>%
mutate(BE.catg = if_else(break.even.yrs<=10,"Less Than 10 Years",
ifelse(break.even.yrs<=15,"10-15 Years","Greater Than 15 Years"))) %>%
ungroup()
# Selecting zip codes with atleast 50% of the properties having break even time <15 years
top.break.even.prop.15 <- break.even.year.cat %>%
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>%
group_by(zipcode, BE.catg) %>%
summarise(count.prop = n()) %>%
arrange(zipcode) %>%
spread(BE.catg,count.prop) %>%
mutate(total.properties = sum(`Greater Than 15 Years`, `10-15 Years` ,`Less Than 10 Years`, na.rm=TRUE)) %>%
mutate(proportion=sum(`10-15 Years` ,`Less Than 10 Years`, na.rm=TRUE)/total.properties) %>%
filter(proportion>0.5) %>%
arrange(desc(proportion)) %>%
dplyr::select(zipcode,proportion)
# selecting zip codes that are present in atleast 2 categories
df1 <- top10.med.break.even.years %>% inner_join(top10.rev.cost.ratio, c("zipcode")) %>%
dplyr::select("zipcode")
df2 <- top10.med.break.even.years %>%
inner_join(top.break.even.prop.15, c("zipcode")) %>%
dplyr::select("zipcode")
df3 <- top10.rev.cost.ratio %>%
inner_join(top.break.even.prop.15, c("zipcode")) %>%
dplyr::select("zipcode")
# merging all captured zip codes in 1 data frame: final.zip.codes
merge1 <- merge(df1,df2, all=TRUE)
merge.all <- merge(df3,merge1,all=TRUE)
final.zip.codes <- merge.all %>% left_join(x, c("zipcode")) %>%
dplyr::select(zipcode, neighbourhood_group_cleansed,med.break.even.year,rev.cost.ratio ) %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(med.break.even.year = round(mean(med.break.even.year),1), rev.cost.ratio = mean(rev.cost.ratio)) %>%
arrange(med.break.even.year)
return(final.zip.codes)
}
# calling the function to identify the final list of zip codes to be shown to the client
zipcodes.final <- best.zip.codes(data.merged)
zipcodes.final.table <- head(zipcodes.final, n = 10)
# creating a data table containing the final list of zipcodes arranged in increasing order of break even time.
datatable(zipcodes.final.table, caption = "Final Zip Codes")
Analysis:
The above zipcodes satisfy atleast 2 of the three criteria mentioned.
Queens and Staten Island have the top 3 zip zodes with lowest break even time period and maximum revenue to cost ratio. It is to be noted that (as highlighted in step 18) Queens zipcode, 11434 had the 2019 median property price < $500,000 and had 4 properties with individual break even time <10 years in this zip code. It would be prudent to look at this zip code with maximum priority for short term profits.
3 zipcodes of Staten Island feature in the list - 10304, 10305, 10308. The first two out of these zip codes look very promising because of low break even time.
Brooklyn has 4 zipcodes in the list having atleast 20 properties in each of these zip codes. As seen in the visualization in step 18, all of these zip codes have atleast some properties whose break even time is < 10 years and atleast 50% of the total number of properties have break even time <15 years.
Manhatten has expensive properties and will take a higher time to break even. It is not much suited for short term investment.
Queens’ 11434 zip code seems very promising for short term investment with median property cost ~ $500,000.
Staten Island also has 2 promising zip codes: 10304, 10305 with 1 property in each with ~ 10 years of break even time. The median property cost is < $500,000.
The number of properties in our data for Queens and Staten Island were very few. It would be prudent to look at Brooklyn’s best zip codes, which provides wider choice to purchase property becuase of hugher number of listed properties in each zip code.
Brookyln’s zipcodes: 11217, 11231, 11201, and 11215 may be considered if the client is comfortable in spending a higher amount compared with that for Queens and Staten Island (between $1,000,000 - 1,500,000. Brooklyn’s advantage is that it has many zipcodes that performed well in the metrics discussed in visualizations in step 18 and 19.
Manhattan’s only zipcode that looked good for investment was 10011. This neighbourhood can be ignored for investment if short term profit is the goal.
Yearly Median Price for property values has been taken using the median value of each month for each year in Zillow file
Various functions have been created to automate tasks and make the project scalable.
At the same time I have used built in functions such as apply, gsub, mutate, etc. to automate data cleaning.
I have created detailed visualizations in steps 8.1.3 and 17-21 to clearly depict my analysis to the reader
Utilize review scores in the criteria for zip code selection
Use Amenities column for better prediction of missing square foot area. Additionally, use it to see how price is impacted when there are more amenities. That impact should be considered to scale rent for the same number of amenities for all properties.
Using the initial columns in which host has written a lot of text to identify keywords and their relationship to rent.
Build a better model for square foot prediction. Additionally, I would want to impute missing values for cleaning_fee, price, and number of bathrooms as well (which were used in the linear regression model)
Given Assumptions:
2 bed room sets are most profitable (as identified by the client)
Occupancy rate of 75% has been assumed. Further, it has been assumed that these occupancies would be one day occupancies.
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
New Assumptions:
Step 5: The real-estate company will purchase complete 2 bed room properties. Room type = “private room” in a 2 bed room property will require the price and square feet to be doubled
Step 10: The median of monthly median property prices for a RegionName (zipcode) will correctly reflect the median market cost of property in a zip code
Step 10: For 2017, the medp has been taken as the median price for the 6 months because the data is only available for 6 months in 2017
Step 12: CAGR would be a decent indicator to forecast property and rent prices
Step 12 : 2019 median property costs would be better reflected by using cagr 2 year growth rate and forecasting the present property costs using 2017 median property costs
Step 14: It has been given that all properties and all square feet within each locale can be assumed to be homogeneous. Hence, each zipcode in a neighbourhood group will have medp.2019 as the median property price. I have assumed that the per unit square feet property cost for a neighbourhood would be the medp.2019 / median square foot of properties in a neighbourhood.
Step 15: The increase in rent per night (price) for 2019 would be at the same rate as the increase in property values (2 year CAGR) for that zip code from 2017 to 2019. In case the CAGR is negative then there would be no increase in rent. It will remain the same.
Step 16: Since the break even time is for longer periods, I have used the 5 year CAGR calculated from 2012 - 2017 change in property cost value in Zillow file as the rate at which rent will increase in the years following 2019. Day rent would be charged for all the days. 75% occupancy has been assumed in a year for all neighbourhoods and zip codes.