AirBnB & Zillow Data Analysis

Objective

Identifying Zip Codes for 2-bedroom sets in New York City that would generate the most profit on short term rentals

Background:

You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
The objective is to identify the zip codes would generate the most profit on short term rentals within New York City.
Data: Publicly available data sets from Zillow and AirBnB

Packages Required

library(R.utils)      #to use lisitngs.csv.gz file
library(tibble)       #for better visibility of data frames   
library(data.table)   #to read .csv file
library(dplyr)        #to manipulate data
library(tidyr)        #to tidy data
library(ggplot2)      #to better visualize data
library(purrr)        #to use functions to clean data 
library(ggrepel)      #for better visualizations
library(DT)           #to preview the data sets
library(kableExtra)   #to create HTML Table

Loading Data Files

listings <- fread("listings.csv.gz")
zillow <- fread("Zip_Zhvi_2bedroom.csv")

Data Wrangling

Quality Check

Some of the columns need to be cleaned
- Pricing columns (in Listings) such as “price”, “weekly price”, etc. have “$” sign in front of them and are characters
- Some columns such as “neighbourhood_group_cleansed” in listings need to be converted to factor
- The city column in listings has different formats for “New York City”
- Some Zip Codes in listing file are weirdly formatted (1234-4566)
- The State column in listings has different formats for New York
- The data in zillow file was in wide format. That was tidied up in long format.
Meta data created and attached in the zipped file in the email.

Listings

1) Removing Columns not used for analysis

# creating a copy of the original data frame
listings1 <- as.tibble(listings)

# removing columns that are not used for analysis
cols.remove.list <- c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
                     24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,46,48,51,
                     59,66,68:76,81:84,86:93,95)
listings1 <- listings1[,-cols.remove.list]
glimpse(listings1)

## Observations: 40,753
## Variables: 29
## $ id                             <int> 7949480, 16042478, 1886820, 662...
## $ neighbourhood_group_cleansed   <chr> "Bronx", "Bronx", "Bronx", "Bro...
## $ city                           <chr> "Bronx", "Bronx", "Bronx", "Cit...
## $ state                          <chr> "NY", "NY", "NY", "NY", "NY", "...
## $ zipcode                        <chr> "10464", "10464", "10464", "104...
## $ market                         <chr> "New York", "New York", "New Yo...
## $ country_code                   <chr> "US", "US", "US", "US", "US", "...
## $ latitude                       <dbl> 40.85205, 40.85349, 40.84114, 4...
## $ longitude                      <dbl> -73.78868, -73.78861, -73.78305...
## $ property_type                  <chr> "House", "Apartment", "House", ...
## $ room_type                      <chr> "Private room", "Private room",...
## $ accommodates                   <int> 2, 4, 4, 3, 4, 2, 4, 3, 5, 8, 2...
## $ bathrooms                      <dbl> 1.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1...
## $ bedrooms                       <int> 1, 1, 3, 1, 1, 0, 1, 1, 1, 1, 1...
## $ beds                           <int> 1, 1, 3, 1, 1, 1, 2, 2, 1, 3, 1...
## $ bed_type                       <chr> "Real Bed", "Real Bed", "Real B...
## $ square_feet                    <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ price                          <chr> "$99.00", "$200.00", "$300.00",...
## $ weekly_price                   <chr> "", "", "", "$775.00", "$350.00...
## $ monthly_price                  <chr> "", "", "", "", "$1,200.00", ""...
## $ security_deposit               <chr> "$100.00", "", "$800.00", "", "...
## $ cleaning_fee                   <chr> "", "", "$100.00", "$75.00", "$...
## $ extra_people                   <chr> "$20.00", "$0.00", "$25.00", "$...
## $ number_of_reviews              <int> 25, 0, 0, 12, 86, 41, 74, 114, ...
## $ first_review                   <chr> "2016-01-18", "", "", "2015-07-...
## $ last_review                    <chr> "2017-04-23", "", "", "2016-10-...
## $ review_scores_rating           <int> 100, NA, NA, 93, 97, 97, 98, 90...
## $ review_scores_location         <int> 10, NA, NA, 10, 10, 10, 10, 9, ...
## $ calculated_host_listings_count <int> 1, 1, 1, 1, 1, 1, 1, 4, 3, 4, 4...

There are 40,753 observations and 29 variables

Reason for removal:

Some columns only have text. Text has not been analyzed for this project.
Columns that were not been used in the merged file for analysis were removed as well.

2) Data Cleaning

Some of the columns need to be cleaned
- Pricing columns such as “price”, “weekly price”, etc. have “$” sign in front of them and are characters
- Some columns such as “neighbourhood_group_cleansed” need to be converted to factor
- The city column has different formats for “New York City”
- Some Zip Codes in listing file are weirdly formatted (1234-4566)
- The State column has different formats for New York

# there are "$" signs in front of prices that need to be removed
# getting index of these columns
price.fee.index<- grep("^price$|^weekly_price$|^monthly_price$|^security_deposit$|^cleaning_fee$|^extra_people$", colnames(listings1))

# removing "$" sign
listings1[, price.fee.index] <- as.tibble(apply(listings1[, price.fee.index], 2,
 function(x) {as.numeric(gsub(pattern="\\$", replacement = "", x))}))

# converting price and fee cols to numeric 
listings1[,price.fee.index] <- listings1[,price.fee.index] %>% 
mutate_all(as.numeric) 

# cleaned data
listings1[,price.fee.index]

## # A tibble: 40,753 x 6
##    price weekly_price monthly_price security_deposit cleaning_fee
##    <dbl>        <dbl>         <dbl>            <dbl>        <dbl>
##  1    99           NA            NA              100           NA
##  2   200           NA            NA               NA           NA
##  3   300           NA            NA              800          100
##  4   125          775            NA               NA           75
##  5    69          350            NA               NA           17
##  6   125          550            NA              500           35
##  7    85           NA            NA               NA           30
##  8    39           NA            NA               NA           20
##  9    95           NA            NA              100           40
## 10   125           NA            NA              150           50
## # ... with 40,743 more rows, and 1 more variable: extra_people <dbl>

# similarly converting relevant columns to factor 
listings1[,c("neighbourhood_group_cleansed", "property_type","room_type","bathrooms")] <- 
listings1[,c("neighbourhood_group_cleansed", "property_type","room_type","bathrooms")] %>%
mutate_all(as.factor) 

# checking states data
unique(listings1$state)

## [1] "NY"       "MP"       "New York" "ny"       "NJ"       "VT"

# changing "New York"and "ny" entry to "NY" to maintain uniformity
listings1$state[which(listings1$state=="New York")] <- "NY"
listings1$state[which(listings1$state=="ny")] <- "NY"
unique(listings1$state)

## [1] "NY" "MP" "NJ" "VT"

# checking city data
unique(listings1$city[grep(pattern="^n|^N", listings1$city)])

##  [1] "New York"               "new york"              
##  [3] "NY"                     "New York City"         
##  [5] "NEW YORK"               "New York, New York, US"
##  [7] "New York, US"           "New york"              
##  [9] "NUEVA YORK"             "New York, Brooklyn"    
## [11] "new york\nnew york"     "Nova York"             
## [13] "Nyc"                    "New York city"         
## [15] "NYC"                    "ny"                    
## [17] "nyc"                    "New york , Ridgewood " 
## [19] "new York"               "Nueva York"

# all the cities starting with "n" or "N" are actually New York City. Cleaning those observations to New York City to maintain uniformity
listings1$city[grep(pattern="^n|^N", listings1$city)] <- "New York City"
unique(listings1$city[grep(pattern="^n|^N", listings1$city)])

## [1] "New York City"

#  zip codes: looking at unique zip codes in New York City
# unique(listings1$zipcode)
# some missing zip codes and group of zip codes such as "10003-8623". We can impute values of zip codes using latitudes and longitudes. Will do it in another section below.

# We get the 5 Boroughs as the neighbourhood_group_cleansed. Sanity check.
unique(listings1$neighbourhood_group_cleansed)

## [1] Bronx         Queens        Brooklyn      Staten Island Manhattan    
## Levels: Bronx Brooklyn Manhattan Queens Staten Island

3) Filtering Data: Selecting 2 bed room properties in New York City

# Selecting only 2 bedroom properties in New York City
list.2bed.NY <-  listings1 %>% filter(bedrooms==2 & state=="NY" & (city == "New York City" | market== "New York") & country_code =="US")
dim(list.2bed.NY)

## [1] 4881   29

We get a data frame with 4871 rows and 30 observations

4) Exploring 2 bed room properties

# checking room type and property type
unique(list.2bed.NY$property_type)

##  [1] Apartment          House              Townhouse         
##  [4] Boat               Loft               Other             
##  [7] Condominium        Bed & Breakfast    Guesthouse        
## [10] Villa              Bungalow           Timeshare         
## [13] Guest suite        Serviced apartment
## 27 Levels: Apartment Bed & Breakfast Boat Boutique hotel ... Villa

unique(list.2bed.NY$room_type)

## [1] Entire home/apt Private room   
## Levels: Entire home/apt Private room Shared room

# price and square_feet of private rooms in a 2 bed room property should be doubled. The following assumption takes care of this issue.

5) Price and Square Feet Correction

Assumption: The real-estate company will purchase complete 2 bed room properties. Room type = “private room” in a 2 bed room property will require the price and square feet to be doubled.

New Variables Created

sq.feet.cor: Room_type = “private room” in a 2 bed room property required the square feet to be doubled. sq.feet.cor has been used for analysis instead of the given “square_feet”.
price_cor: Room_type = “private room” in a 2 bed room property required the price to be doubled. Price_cor has been used for analysis instead of the given “price”.

# checking room type and property type
unique(list.2bed.NY$property_type)

##  [1] Apartment          House              Townhouse         
##  [4] Boat               Loft               Other             
##  [7] Condominium        Bed & Breakfast    Guesthouse        
## [10] Villa              Bungalow           Timeshare         
## [13] Guest suite        Serviced apartment
## 27 Levels: Apartment Bed & Breakfast Boat Boutique hotel ... Villa

unique(list.2bed.NY$room_type)

## [1] Entire home/apt Private room   
## Levels: Entire home/apt Private room Shared room

# corrected price
list.2bed.NY <- list.2bed.NY %>% mutate(price_cor = ifelse(room_type=="Private room", 2*price,price ))    

# the corresponding sq feet will also be doubled
list.2bed.NY <- list.2bed.NY %>% mutate(sq.feet.cor = ifelse(room_type=="Private room", 2*square_feet,square_feet ))

6) Missing Values

# checikng missing values
colSums(is.na(list.2bed.NY))

##                             id   neighbourhood_group_cleansed 
##                              0                              0 
##                           city                          state 
##                              0                              0 
##                        zipcode                         market 
##                              0                              0 
##                   country_code                       latitude 
##                              0                              0 
##                      longitude                  property_type 
##                              0                              0 
##                      room_type                   accommodates 
##                              0                              0 
##                      bathrooms                       bedrooms 
##                              8                              0 
##                           beds                       bed_type 
##                              3                              0 
##                    square_feet                          price 
##                           4768                             41 
##                   weekly_price                  monthly_price 
##                           4494                           4879 
##               security_deposit                   cleaning_fee 
##                           2373                            963 
##                   extra_people              number_of_reviews 
##                              0                              0 
##                   first_review                    last_review 
##                              0                              0 
##           review_scores_rating         review_scores_location 
##                           1017                           1026 
## calculated_host_listings_count                      price_cor 
##                              0                             41 
##                    sq.feet.cor 
##                           4768

Noteworthy Observations:

Square_feet has 4768 missing values. This is a very important parameter. Values would need to be imputed.
Price has 41 missing values.
Weekly_price, monthly_price, and security_depsoit have significant number of missing values.

7) Cleaning Zip Codes

# finding the weird zip codes (such as "11426-1175" "10003-8623") and assigning NA to them. We wiil use latitudes and longitudes to find the correct zip codes below

weird.zip.index <- which(nchar(list.2bed.NY$zipcode)==10)
list.2bed.NY$zipcode[weird.zip.index] <- NA

list.2bed.NY$zipcode <- as.numeric(list.2bed.NY$zipcode)

# 64 observations with no zip codes
length(list.2bed.NY$zipcode[list.2bed.NY$zipcode==""])

## [1] 64

list.2bed.NY$zipcode[list.2bed.NY$zipcode==""] <- NA
sum(is.na(list.2bed.NY$zipcode))

## [1] 64

str(list.2bed.NY$zipcode)

##  num [1:4881] 10462 10469 11102 11102 11105 ...

# Running a loop to pick up closest zip codes for these obsrvations by matching neighbourhood_group_cleansed and long and lat
for (k in 4:1){
  ind <- which(is.na(list.2bed.NY$zipcode))
for(i in seq_along(ind)){
  for (j in 1:nrow(list.2bed.NY)){
    if(round(list.2bed.NY$latitude[ind[i]],k)==round(list.2bed.NY$latitude[j],k) && 
       round(list.2bed.NY$longitude[ind[i]],k)==round(list.2bed.NY$longitude[j],k) &&
      list.2bed.NY$neighbourhood_group_cleansed[ind[i]] == list.2bed.NY$neighbourhood_group_cleansed[j])
      {
      list.2bed.NY$zipcode[ind[i]] =  list.2bed.NY$zipcode[j]
    }
    }}
}

# number of zip codes with missing values
sum(is.na(list.2bed.NY$zipcode))

## [1] 0

8) Imputing missing values - Sq Feet

I have created a linear regression model to impute the values for square feet. The following variables can help in prediction of sq feet:
- Price cor
- Cleaning fee
- Bathrooms
- Property type
- Room Type
- Neighbourhood group cleansed
Variables such as weekly, monthly and yearly price and security deposit could have been useful. But those have a significant number of missing values. So, I am not using them.

8.1) Exploratory Data Analysis

8.1.1) Looking at sq.feet.cor and price_cor

# removing outliers in sq feet
summary(list.2bed.NY$sq.feet.cor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   750.0   900.0   988.9  1100.0 10118.0    4768

boxplot(list.2bed.NY$sq.feet.cor)

# outliers in the boxplot
boxplot.stats(list.2bed.NY$sq.feet.cor)$out

##  [1]    32  1800    80  2000  1800     0     0 10118  1800  1700  2200
## [12]     0     0  1800     0     0    11     1

# removing extreme values which definitely seem wrong
# this step has to be manually done after reviewing the outliers and has to be hardcoded
sq.feet.outlier <- which(list.2bed.NY$sq.feet.cor %in% c(0,1,11,32, 10118))
list.2bed.NY <- list.2bed.NY[-sq.feet.outlier,]

sum(!is.na(list.2bed.NY$sq.feet.cor))

## [1] 103

#  Only 103 observations have sq footage data 

# outliers in the boxplot of price. Looks fine.
summary(list.2bed.NY$price_cor)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0   140.0   195.0   224.5   275.0  1140.0      41

boxplot(list.2bed.NY$price_cor)

There are only 103 observations with square feet area. Building a model would be tough with so less observations. But it is very important to differentiate among properties in a zip code. It has been given that property cost and square foot area are homogenous in a given neighbourhood. Hence, it is important to correclty predict square feet area.

8.1.2) Looking at other variables

# Looking at the distribution of available square feet data for neighbourhoods
list.2bed.NY %>% filter(!is.na(sq.feet.cor)) %>% dplyr::select(neighbourhood_group_cleansed) %>% table()

## .
##         Bronx      Brooklyn     Manhattan        Queens Staten Island 
##             0            50            43            10             0

# Bronx and Staten Island have no values for sq feet cor

# selecting observations with non missing sq.feet.cor value
model.data <- list.2bed.NY %>% filter(!is.na(sq.feet.cor)) %>% dplyr::select(neighbourhood_group_cleansed, 
property_type,  room_type,sq.feet.cor, bathrooms,beds,accommodates,price_cor, weekly_price,
monthly_price,  security_deposit ,cleaning_fee, extra_people)

# Looking at the number of missing observations for other variables
colSums(is.na(model.data))

## neighbourhood_group_cleansed                property_type 
##                            0                            0 
##                    room_type                  sq.feet.cor 
##                            0                            0 
##                    bathrooms                         beds 
##                            1                            0 
##                 accommodates                    price_cor 
##                            0                            0 
##                 weekly_price                monthly_price 
##                           81                          103 
##             security_deposit                 cleaning_fee 
##                           37                           15 
##                 extra_people 
##                            0

# Removing weekly_price, monthly_price, security_deposit as they have a large number of missing values
model.data <- model.data %>% dplyr::select(-c("weekly_price","monthly_price","security_deposit"))

# checking property type
table(model.data$property_type)

## 
##          Apartment    Bed & Breakfast               Boat 
##                 87                  0                  0 
##     Boutique hotel           Bungalow              Cabin 
##                  0                  0                  0 
##             Castle               Cave             Chalet 
##                  0                  0                  0 
##        Condominium               Dorm        Earth House 
##                  1                  0                  0 
##        Guest suite         Guesthouse             Hostel 
##                  0                  0                  0 
##              House                Hut             In-law 
##                 10                  0                  0 
##         Lighthouse               Loft              Other 
##                  0                  3                  0 
## Serviced apartment               Tent          Timeshare 
##                  0                  0                  0 
##          Townhouse      Vacation home              Villa 
##                  2                  0                  0

# Only 1 Condominium, 2 Townhouse, and 3 Lofts. These are too few values to correctly predict the missing values for these catgories
model.data <- model.data %>%  filter(!property_type %in%c("Condominium","Townhouse","Loft"))

# omitting 1 observation with no value for number of bathrooms
model.data <- model.data[-which(is.na(model.data$bathrooms)),]

# 14 values of cleaning fee missing
colSums(is.na(model.data))

## neighbourhood_group_cleansed                property_type 
##                            0                            0 
##                    room_type                  sq.feet.cor 
##                            0                            0 
##                    bathrooms                         beds 
##                            0                            0 
##                 accommodates                    price_cor 
##                            0                            0 
##                 cleaning_fee                 extra_people 
##                           14                            0

dim(model.data)

## [1] 96 10

Let us move onto modeling

8.1.3) Plots and Correlation

# Manhattan looks more pricey than Brooklyn and Queens
model.data %>%  
ggplot(aes(x=sq.feet.cor, y=price_cor, col=neighbourhood_group_cleansed)) + geom_point(size=3) +
labs(title = "Price vs Square Feet of Property", subtitle = "Manhattan looks pricier than Brooklyn and Queens", 
x = "Square Feet", 
y = "Price per Night ($)", color = "Neighbourhood") +
theme_classic()

# price_cor vs square feet by number of bathrooms
model.data %>%  
ggplot(aes(x=sq.feet.cor, y=price_cor, shape=bathrooms)) + geom_point(size=3) +
labs(title = "Price vs Square Feet of Property", subtitle = "Higher Square Feet corresponds to higher chances of >1 bathrooms", 
x = "Square Feet", 
y = "Price per Night ($)", shape = "bathrooms") + theme_classic()

# number of bathrooms and square feet 
model.data %>%  
ggplot(aes(x=sq.feet.cor, fill=bathrooms)) + geom_histogram(position="dodge", binwidth = 250) +
labs(title = "Distribution of Square Feet of Property by # of Bathrooms", subtitle = "Higher square feet show more occurences of bathrooms>1", 
x = "Square Feet", 
y = "Count") + theme_classic()

# cleaning fee vs square feet
model.data %>%  
ggplot(aes(x=sq.feet.cor, y=cleaning_fee)) + geom_point(size=3) +
labs(title = "Cleaning Fee vs Square Feet of Property", subtitle = "Cleaning fee shows a mild linear relationship with square feet", 
x = "Square Feet", 
y = "Cleaning Fee ($)") + theme_classic()

# room_type and square feet
model.data %>%  
ggplot(aes(x=sq.feet.cor, fill=room_type)) + geom_histogram(position="dodge", binwidth = 250) +
labs(title = "Distribution of Square Feet of Property by type of room", subtitle = "Private Rooms either have very large or small square_feet", 
x = "Square Feet", 
y = "Count", fill="Room Type") + theme_classic()

# checking correlation among continous variables 
sub.set <- na.omit(model.data)
cor(sub.set[,-c(1:3,5)])

##              sq.feet.cor        beds accommodates   price_cor cleaning_fee
## sq.feet.cor   1.00000000 -0.04866056  -0.09844025  0.45263004    0.2929749
## beds         -0.04866056  1.00000000   0.70227726  0.03367773    0.1467176
## accommodates -0.09844025  0.70227726   1.00000000 -0.02736240    0.1037853
## price_cor     0.45263004  0.03367773  -0.02736240  1.00000000    0.4842777
## cleaning_fee  0.29297489  0.14671756   0.10378530  0.48427771    1.0000000
## extra_people  0.15190545  0.11911416  -0.02202184  0.34505990    0.2252941
##              extra_people
## sq.feet.cor    0.15190545
## beds           0.11911416
## accommodates  -0.02202184
## price_cor      0.34505990
## cleaning_fee   0.22529407
## extra_people   1.00000000

price_cor, cleaning_fee, and bathrooms have a noteworthy and positive correlation with sq.feet.cor
Manhattan looks pricier than Brooklyn and Queens
Higher Square Feet corresponds to higher chances of >1 bathrooms
Distribution of Square Feet of Property by type of room = Private Rooms either have very large or small square_feet

8.2) Linear Regression Model

# Variable Selection
# I have use backward stepwise selection for variable selection 
nullmodel=lm(sq.feet.cor~1, data=model.data)
fullmodel=lm(sq.feet.cor~., data=model.data)
model_step_b <- step(fullmodel,direction='backward', trace=0)

mod.summary <- summary(model_step_b)

# adj.r.sq
mod.summary$adj.r.squared

## [1] 0.3921451

# MSE
(mod.summary$sigma)^2

## [1] 64618.85

# Final Model: sq.feet.cor ~ room_type + bathrooms + price_cor + cleaning_fee
# Adj R sq values and model Model MSE are not that great. But with so less number of observations, this is fine.

final.model <- lm(sq.feet.cor ~ room_type + bathrooms + price_cor + cleaning_fee, data=model.data)


# Residual Analysis
aug.l_model.manh<- final.model %>% broom::augment() %>% mutate(row_num=1:n())

# residuals
ggplot(data=aug.l_model.manh, aes(x=.fitted,y=.std.resid))+geom_point() + geom_smooth(se=FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2", size=1) +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted", size=1) +
  xlab("Fitted value") + ylab("Standardized residual") + ggtitle("Standardized Residuals vs Fitted Values")+
  theme(plot.title = element_text(hjust = 0.5))

# QQ Plot
ggplot(data=aug.l_model.manh, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") + ggtitle("Q-Q Plot")+
  theme(plot.title = element_text(hjust = 0.5))

Final Model: sq.feet.cor ~ room_type + bathrooms + price_cor + cleaning_fee
The residual plot looks fine. The QQ plot for normality deviates from th 45 degree line at ends indicating fat tails. But not something to worry much. We can assume normality assumption is satisfied.
Adjusted R Sq value: 0.3921451
MSE Value: 6.461885510^{4}
Adj R sq values and model MSE are not that great. But with so less number of observations, this is fine.

8.3) Predicting missing square feet values

# Number of missing values for predictors: bathrooms and cleaning_Fee
sum(is.na(list.2bed.NY$bathrooms))

## [1] 8

sum(is.na(list.2bed.NY$cleaning_fee))

## [1] 962

# selecting obs with mising values for square feet and non missing values for predictors
# we are selecting property type = Apartment and House because we did not have other property type categories in our model

sq.foot.miss <- list.2bed.NY %>% filter(is.na(sq.feet.cor) & !is.na(price_cor) & 
!is.na(room_type) & !is.na(bathrooms)& !(bathrooms == "0.5") & !is.na(cleaning_fee) & property_type %in% c("Apartment","House")) %>% 
  dplyr::select(id, sq.feet.cor,price_cor, bathrooms, room_type, cleaning_fee)  

# checking dimension of this data frame
dim(sq.foot.miss)

## [1] 3552    6

sq.foot.miss$sq.f.prd <- predict(object = final.model, newdata = sq.foot.miss[,-1])

sum(!is.na(sq.foot.miss$sq.f.prd))

## [1] 3552

9) Joining predicted values to original table

join.var <- sq.foot.miss %>%  dplyr::select(id,sq.f.prd)

# joining join.var with originial data frame list.2bed.NY
list.2bed.NY <- list.2bed.NY %>% left_join(join.var, c("id")) 

# merging the given sq feet values in the predicted sq feet column

# index of observations with non missing sq.feet.cor values in original data
x <- which(!is.na(list.2bed.NY$sq.feet.cor))

# merging the original sq.feet.cor data into the predicted sq.f.prd values
list.2bed.NY$sq.f.prd[x] <- list.2bed.NY$sq.feet.cor[x]

Now we have 3655 values of square feet compared with 103 earlier.

Zillow

10) Tidying Zillow data

I have created a variable “medp”, which is the yearly median price for a RegionName.

\[ medp = Median Price of RegionName for that year \]

Assumptions:

The median of monthly median property prices for a RegionName (zipcode) will correctly reflect the median market cost of property in a zip code
For 2017, the medp has been taken as the median price for the 6 months because the data is only available for 6 months in 2017

zillow1 <- as.tibble(zillow)

# Filtering data for New York in NY 
zillow.NY <- zillow1 %>% filter(City == "New York" & State == "NY")
dim(zillow.NY)

## [1]  25 262

# checking how many NAs we have in the data
sum(is.na(zillow.NY))

## [1] 1696

# 1696 NAs

# checking NAs column wise
# colSums(is.na(zillow.NY)) # the output takes too much space
# all the NAs are in property pricing

# tidying year and months in 1 column for each. 1 column for price.
zill.NY.y.m <- zillow.NY %>% gather(year_month, prop.val, -c(1:7)) %>% 
                              separate(year_month, c("year", "month"), sep="-")

head(zill.NY.y.m)

## # A tibble: 6 x 10
##   RegionID RegionName City  State Metro CountyName SizeRank year  month
##      <int>      <int> <chr> <chr> <chr> <chr>         <int> <chr> <chr>
## 1    61639      10025 New ~ NY    New ~ New York          1 1996  04   
## 2    61637      10023 New ~ NY    New ~ New York          3 1996  04   
## 3    61703      10128 New ~ NY    New ~ New York         14 1996  04   
## 4    61625      10011 New ~ NY    New ~ New York         15 1996  04   
## 5    61617      10003 New ~ NY    New ~ New York         21 1996  04   
## 6    62012      11201 New ~ NY    New ~ Kings            32 1996  04   
## # ... with 1 more variable: prop.val <int>

# taking median price for each year 
zill.NY.y.medp <- zill.NY.y.m %>%  group_by(RegionName, year,CountyName ,SizeRank) %>% summarise(medp = median(prop.val)) %>% 
  ungroup()

#  removing observations where median price is not available
zill.NY.y.medp <- na.omit(zill.NY.y.medp) 
zill.NY.y.medp <- zill.NY.y.medp %>%  arrange(RegionName, year)
zill.NY.y.medp$year <- as.numeric(zill.NY.y.medp$year)
head(zill.NY.y.medp)

## # A tibble: 6 x 5
##   RegionName  year CountyName SizeRank    medp
##        <int> <dbl> <chr>         <int>   <dbl>
## 1      10003  2005 New York         21 1130000
## 2      10003  2006 New York         21 1378050
## 3      10003  2007 New York         21 1414900
## 4      10003  2008 New York         21 1532950
## 5      10003  2009 New York         21 1396050
## 6      10003  2010 New York         21 1278700

# checking missing data
miss.data <- zill.NY.y.medp %>% dplyr::select(RegionName, year, medp) %>% spread(year, medp)
colSums(is.na(miss.data))

## RegionName       1996       1997       1998       1999       2000 
##          0         17         17         17         16         16 
##       2001       2002       2003       2004       2005       2006 
##         16         17         17         13          4          1 
##       2007       2008       2009       2010       2011       2012 
##          1          0          0          0          0          0 
##       2013       2014       2015       2016       2017 
##          0          0          0          0          0

From 2008 onwards, we don’t have any missing data. This means that we can use the latest market data.

11) Developing a function to calculate CAGR (Compound Annual Growth Rate)

\[CAGR = ((medp2 / medp1)^n -1)*100\]

medp1: Medp for year t
medp2: Medp for year t+x
n (number of years): t + x - t = x

# unique zip codes
unique.zip <- unique(zill.NY.y.medp$RegionName)

# x: start year; y:end year
cagr.cal <- function(x,y){
  n_yr <- y - x + 1 
  cagr <- rep(NA, length(unique.zip))
for (k in seq_along(unique.zip)){
  price_st.yr <- zill.NY.y.medp %>% filter(RegionName==unique.zip[k] & year==x) %>% dplyr::select(medp)
  price_end.yr <- zill.NY.y.medp %>% filter(RegionName==unique.zip[k] & year==y) %>% dplyr::select(medp)
  
  price_st.yr <- price_st.yr[[1]]
  price_end.yr <- price_end.yr[[1]]
  
  cagr[k] <- ((price_end.yr/price_st.yr)^(1/n_yr) - 1)*100
}

return(cagr)
}

# taking 5 year and 2 year cagr
cagr5_val <- cagr.cal(2012,2017)
cagr2_val <- cagr.cal(2015,2017)

# creating a data frame with the cagr values and zip codes
cagr.df <- tibble(zip = unique.zip, cagr5_val,cagr2_val)

head(cagr.df)

## # A tibble: 6 x 3
##     zip cagr5_val cagr2_val
##   <int>     <dbl>     <dbl>
## 1 10003      6.55      3.50
## 2 10011      7.30      4.43
## 3 10013      3.36      1.37
## 4 10014      5.95      3.47
## 5 10021      4.96      5.13
## 6 10022      4.79      2.93

12) Calculating 2019 median property values

Assumptions:

CAGR would be a decent indicator to forecast property and rent prices
2019 median property costs would be better reflected by using cagr 2 year growth rate and forecasting the present property costs using 2017 median property costs

\[2019 Median Property Cost = (2017 Median Property Cost)*(1 + (CAGR/100))^2\]

# Forecasting median property price in each zip code in 2019 using 2 year CAGR growth
# joining data  frame with cagr values to zill.NY.y.medp data frame containing yearly median price
zill.NY.y.medp <- zill.NY.y.medp %>% filter(year==2017) %>% left_join(cagr.df, by = c("RegionName" = "zip"))


# ASSUMPTION
# calculating 2019 median property value using 2year cagr (immediate market sentiment reflected)
zill.NY.y.medp$medp.2019 <- zill.NY.y.medp$medp*((1+(zill.NY.y.medp$cagr2_val/100))^2)

# renaming medp to medp.2017 to maintain consistency in col names 
names(zill.NY.y.medp)[names(zill.NY.y.medp) == 'medp'] <- 'medp.2017'


# tidying the final data frame zill.NY.y.medp
zill.NY.y.medp <- zill.NY.y.medp %>% dplyr::select(-c("year","CountyName"))  
zill.NY.y.medp <- zill.NY.y.medp[,c(1,2,3,6,4,5)]

head(zill.NY.y.medp)

## # A tibble: 6 x 6
##   RegionName SizeRank medp.2017 medp.2019 cagr5_val cagr2_val
##        <int>    <int>     <dbl>     <dbl>     <dbl>     <dbl>
## 1      10003       21   2005500  2148220.      6.55      3.50
## 2      10011       15   2354000  2567311.      7.30      4.43
## 3      10013     1744   3212450  3301149.      3.36      1.37
## 4      10014      379   2476250  2651225.      5.95      3.47
## 5      10021      190   1709950  1889716.      4.96      5.13
## 6      10022      894   1863650  1974295.      4.79      2.93

Merging Data Frames

13) Merging Data Frames

I have created a function that can be used to merge the listings file (list.2bed.NY) and zillow file (zill.NY.y.medp)

merge.data.fn <- function(df1,df2){
return_df <-  df1 %>% inner_join(df2, c("zipcode"="RegionName"))
return(return_df)
}

# merged file
data.merged <- merge.data.fn(list.2bed.NY,zill.NY.y.medp)

Break Even Analysis

Calculations

14) Calculating property prices in 2019

Assumption: It has been given that all properties and all square feet within each locale can be assumed to be homogeneous. Hence, each zipcode in a neighbourhood group will have medp.2019 as the median property price. I have assumed that the per unit square feet property cost for a neighbourhood would be the medp.2019 / median square foot of properties in a neighbourhood.

\[ Square Foot Unit Cost = Median Property Price in Neighbourhood in 2019 / Median Square Feet in Neighbourhood\] \[ Property Price in 2019 = Square Foot Unit Cost * Square Foot of Property \]

New Variables:

medp.neigh.2019: median of medp.2019 values for a given neighborhood
med.sq.ft : median square foot of properties in a given neighborhood
sq.ft.unit.cost: unit cost of sq feet in a neighbourhood in 2019
prop.price.2019: price of the specific rental property. It will be different for all properties

# removing values for which predicted sq ft (sq.f.prd) value is NA
data.merged <- data.merged[-which(is.na(data.merged$sq.f.prd)),]

# finding median sq feet in each zip code, unit cost of sq feet in each neighborhood group cleansed, and property price for each row
data.merged <- data.merged %>% group_by(neighbourhood_group_cleansed) %>%
  mutate(med.sq.ft = median(sq.f.prd)) %>% mutate(medp.neigh.2019 = median(medp.2019)) %>% arrange(med.sq.ft) %>% 
  mutate(sq.ft.unit.cost = medp.neigh.2019/med.sq.ft) %>% mutate(prop.price.2019=sq.ft.unit.cost*sq.f.prd)

15) Forecasting rent for properties in 2019

Assumption: The increase in rent per night (price) for 2019 would be at the same rate as the increase in property values (2 year CAGR) for that zip code from 2017 to 2019. In case the CAGR is negative then there would be no increase in rent. It will remain the same.

\[ Daily Rent (2019) = Daily Rent in 2017 * (1 + (CAGR/100))^2 \]

New Variables:

day.rent.2019: price for staying in the property for 1 day (or night) charged by the host in 2019

data.merged <- data.merged %>% mutate(day.rent.2019 = ifelse(cagr2_val<0,price_cor, price_cor*((1+(cagr2_val/100))^2)))

16) Break Even Years

Assumption:

Since the break even time is for longer periods, I have used the 5 year CAGR calculated from 2012 - 2017 change in property cost value in Zillow file as the rate at which rent will increase in the years following 2019
75% occupancy has been assumed in a year for all neighbourhoods and zip codes
Day rent would be charged for all the days
Cleaning Fee is not included as there would be low margin to earn for this parameter after adjusting for labor charges
Security deposit is not included in the revenue as it would generally be returned or used up for repairs

Logic for calculating break even years (t):

\[ Property Cost in 2019 = (365*0.75* Day rent in 2019)*(1 + (1 + CAGR/100) + (1 + CAGR/100)^2 ...(1+CAGR/100)^t) \]

Solving the above equation gives the following result:

\[ t = log10(1 + (Property Cost in 2019 * (CAGR/100)/(365*0.75*day rent))) / log10(1 + CAGR/100) \]

New Variables:

break.even.yrs: Number of years in which the cost of purchasing the property would be recovered

data.merged <- data.merged %>%
  mutate(break.even.yrs = 
    log10(1+(prop.price.2019*(cagr5_val/100)/(365*0.75*day.rent.2019)))/log10(1 + cagr5_val/100))

16) Median Break Even Years and Revenue to Cost Ratio for zip codes

Both of the following calculations are at zip code level.

\[ Median Break Even Years = median(break even years of all properties in a zip code) \] \[ Revenue Cost Ratio = Median Property Price in 2019 / Median Break Even Years \]

# calculating median break even years and rev cost ratio of each zip code
zip.med.break.even <- data.merged %>%  dplyr::select(zipcode,medp.2019,day.rent.2019, break.even.yrs) %>% group_by(zipcode) %>% 
summarise(cost=mean(medp.2019), med.day.rent.2019 = median(day.rent.2019), med.break.even.year = median(break.even.yrs)) %>% 
mutate(rev.cost.ratio= round((365*0.75*med.day.rent.2019/cost)*100,2)) %>% 
ungroup() %>% dplyr::select(zipcode,med.break.even.year,rev.cost.ratio)

# joining files
data.merged <- data.merged %>%  left_join(zip.med.break.even, c("zipcode"))

Visualizations

17) Distribution of Time Taken by Properties to Break Even

# Distribution of Time Taken by Properties to Break Even
data.merged %>% ggplot(aes(x=ceiling(break.even.yrs))) +
geom_histogram(fill = "#FF6666") + xlab("Years Taken to Break Even") +
ylab("Frequency")  + ggtitle("Distribution of Time Taken by Properties to Break Even") +
theme(plot.title = element_text(hjust = 0.5)) + theme_classic()

# Distribution of Time Taken by Properties to Break Even - by neighborhood
data.merged %>% ggplot(aes(x=ceiling(break.even.yrs), fill=neighbourhood_group_cleansed)) +
geom_histogram() + 
  labs(title = "Distribution of Time Taken by Properties to Break Even",  x = "Years Taken to Break Even", 
  y = "Frequency", fill = "Neighbourhood")+
theme(plot.title = element_text(hjust = 0.5)) + 
  theme_classic()

Analysis

The median break even time of all properties in all zipcodes is 15.6 and the mean break even time is 16.2
Let us take a look at median break even time separately for each zip code

18) Median Break Even Time Comparison of Zip Codes

data.merged %>% group_by(zipcode) %>% summarise(med.break.even.year = mean(round(med.break.even.year,1))) %>% 
ggplot(aes(x=reorder(as.factor(zipcode),med.break.even.year), y=med.break.even.year)) + 
geom_bar(stat="identity", fill = "#FF6666") + 
labs(title = "Median Break Even Year Comparison", subtitle = "Each bar is the median of break even years of all properties in a zip code", x = "Zip Codes", 
y = "Median Break Even Years") + 
geom_text( aes(label = med.break.even.year, y = med.break.even.year),position = position_dodge(width = 1),hjust =1.3) +
  coord_flip() + 
  theme_light()

11434, 10304, and 10305 are some of the zipcodes where the median break even time is the lowest. 10304 and 10305 belong to Staten Island. 11434 is in Queens. It is important to find out how many properties are there in these zip codes?
Let us take a look at some details about zipcodes with specific properties with break even time < 10 years

19) Median Property Price vs Number of Properties

# Median Property Price vs Number of Properties
data.merged %>% 
filter(break.even.yrs<=10) %>% 
group_by(neighbourhood_group_cleansed, zipcode) %>% 
summarise(count=n(), medp.2019 = mean(medp.2019)) %>% 
ungroup() %>% 
ggplot(aes(x=count, y=medp.2019, col=as.factor(neighbourhood_group_cleansed))) + 
geom_point(size=3) +
geom_label_repel(aes(label=zipcode), hjust=-0.25, vjust=0, size=4, show.legend = FALSE) + 
scale_y_continuous(labels = scales::comma, limits = c(0,1700000)) +
labs(title = "Median Property Price vs Number of Properties", subtitle = "Each dot is a Zip Code having properties with break even \n time < 10 years", x = "Number of Properties", 
y = "Median Property Price (2019)", color = "Neighbourhood") + 
xlim(c(0,8))

Analysis:

Zip codes 10305, 10304 in Staten Island show property values < $500,000. However, there is only 1 property in each Zip Code.
Queens has 1 Zip Code that has 4 properties with cost <$500,000. This looks promising.
Brooklyn has higher property costs but it has many zipcodes with a lot of properties with < 10 years break even time.

20) Comparison of Break Even Time for Properties in Zip Codes

# creating categories of break even years for zip codes > 20 properties
zipcodes.nprop.20plus <- data.merged %>% group_by(zipcode) %>% summarise(count=n()) %>% filter(count>20) %>% 
dplyr::select(zipcode)
 
stack.fill.plot <- data.merged %>% 
group_by(zipcode) %>% 
dplyr::select(zipcode, break.even.yrs) %>% 
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>% 
mutate(BE.catg = if_else(break.even.yrs<=10,"Less Than 10 Years",
ifelse(break.even.yrs<=15,"10-15 Years","Greater Than 15 Years"))) %>% ungroup()

# reordering levels for better visualization 
stack.fill.plot$BE.catg <- as.factor(stack.fill.plot$BE.catg)
stack.fill.plot$BE.catg <- factor( stack.fill.plot$BE.catg, levels = c("Greater Than 15 Years","10-15 Years","Less Than 10 Years"))

# Comparison of Break Even Time for Properties in Zip Codes
ggplot(stack.fill.plot,aes(x=as.factor(zipcode), fill = as.factor(BE.catg))) + 
geom_bar(position = "fill") + 
# geom_text(stat = "count", aes(label =..count..), position = position_fill(vjust = 1), hjust =1.3, size = 3.5) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) + 
labs(title = "Comparison of Break Even Time for Properties in Zip Codes", subtitle = "Each bar displays the % of properties in a zip code belonging \n to a break even time category", x = "Zip Codes", y = "Proportion of Number of Properties", fill = "Break Even Time") + theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()

Only zip codes with number of properties > 20 have been considered for this plot
We can see that zip codes such as 11231, 11217, 11215, and 11201 have very high proportion of number of properties with break even time <15 years

21) Revenue to Cost Ratio

# Revenue to Cost Ratio
data.merged %>% group_by(zipcode) %>% summarise(rev.cost.ratio = mean(rev.cost.ratio)) %>% 
ggplot(aes(x=reorder(as.factor(zipcode),rev.cost.ratio), y=rev.cost.ratio)) + 
geom_bar(stat="identity", fill = "#FF6666") + 
labs(title = "Revenue to Cost Ratio", subtitle = "High ratio is indicative of a lower property cost and higher profit after break even", x = "Zip Codes", 
y = "Revenue/Cost Ratio (%)") + 
geom_text( aes(label = rev.cost.ratio, y = rev.cost.ratio),position = position_dodge(width = 1),hjust =1.3) +
coord_flip() + 
theme_light()

Zip codes such as 10305, 10304 belonging to Staten Island are showing exceptionally high revenue cost ratio because the number of properties in them are only 1 for this comparison.
11434 belonging to Queens looks very promising.

Zip Codes Selected

Parameters Used for Selection:

Median break even time of a zipcode (top 10)
Revenue cost ratio of a zipcode (top 10)
Zipcodes with number of properties > 20 and atleast 50% of the properties in that zip code had break even time <15 years

Selection Criteria:

Any zipcode that satisfied any 2 of the above parameters was included in the final list. The final list of zipcodes is arranged in ascending order of median break even years.
A function, best.zip.codes has been created, which only needs to be fed the final file after all the calculations in step 16.

best.zip.codes <- function(x) {
  
# Median break even time of a zipcode (top 10)
top10.med.break.even.years <- x %>% 
group_by(zipcode) %>%
summarise(med.break.even.year = mean(med.break.even.year)) %>% 
arrange(med.break.even.year) %>% 
top_n(-10) %>% arrange(med.break.even.year)
  
# Revenue cost ratio of a zipcode  (top 10)
top10.rev.cost.ratio <- x %>% 
group_by(zipcode) %>%
summarise(rev.cost.ratio = mean(rev.cost.ratio)) %>% 
arrange(rev.cost.ratio) %>% 
top_n(10) %>% 
arrange(desc(rev.cost.ratio))

# Zipcodes with number of properties > 20 
zipcodes.nprop.20plus <- x %>% 
group_by(zipcode) %>% 
summarise(count=n()) %>%
filter(count>20) %>% 
dplyr::select(zipcode)
  
# calculations for break even time categories
break.even.year.cat <- x %>% 
group_by(zipcode) %>% 
dplyr::select(zipcode, break.even.yrs) %>% 
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>% 
mutate(BE.catg = if_else(break.even.yrs<=10,"Less Than 10 Years",
ifelse(break.even.yrs<=15,"10-15 Years","Greater Than 15 Years"))) %>% 
ungroup()

# Selecting zip codes with atleast 50% of the properties having break even time <15 years
top.break.even.prop.15 <- break.even.year.cat %>% 
inner_join(zipcodes.nprop.20plus, c("zipcode")) %>% 
group_by(zipcode, BE.catg) %>% 
summarise(count.prop = n()) %>%  
arrange(zipcode) %>% 
spread(BE.catg,count.prop) %>% 
mutate(total.properties = sum(`Greater Than 15 Years`, `10-15 Years` ,`Less Than 10 Years`, na.rm=TRUE)) %>% 
mutate(proportion=sum(`10-15 Years` ,`Less Than 10 Years`, na.rm=TRUE)/total.properties) %>%
filter(proportion>0.5) %>% 
arrange(desc(proportion)) %>% 
dplyr::select(zipcode,proportion)

  
# selecting zip codes that are present in atleast 2 categories
df1 <- top10.med.break.even.years %>% inner_join(top10.rev.cost.ratio, c("zipcode")) %>% 
dplyr::select("zipcode")
df2 <- top10.med.break.even.years %>%  
inner_join(top.break.even.prop.15, c("zipcode")) %>% 
dplyr::select("zipcode")
df3 <- top10.rev.cost.ratio %>%  
inner_join(top.break.even.prop.15, c("zipcode")) %>% 
dplyr::select("zipcode")
  
# merging all captured zip codes in 1 data frame: final.zip.codes
merge1 <-  merge(df1,df2, all=TRUE)
merge.all <- merge(df3,merge1,all=TRUE)
final.zip.codes <- merge.all %>% left_join(x, c("zipcode")) %>%
dplyr::select(zipcode, neighbourhood_group_cleansed,med.break.even.year,rev.cost.ratio ) %>% 
group_by(zipcode,neighbourhood_group_cleansed) %>% 
summarise(med.break.even.year = round(mean(med.break.even.year),1), rev.cost.ratio = mean(rev.cost.ratio)) %>% 
arrange(med.break.even.year)
  
return(final.zip.codes)
}

# calling the function to identify the final list of zip codes to be shown to the client
zipcodes.final <- best.zip.codes(data.merged)
zipcodes.final.table <- head(zipcodes.final, n = 10)

# creating a data table containing the final list of zipcodes arranged in increasing order of break even time.
datatable(zipcodes.final.table, caption = "Final Zip Codes")

Analysis:

The above zipcodes satisfy atleast 2 of the three criteria mentioned.
Queens and Staten Island have the top 3 zip zodes with lowest break even time period and maximum revenue to cost ratio. It is to be noted that (as highlighted in step 18) Queens zipcode, 11434 had the 2019 median property price < $500,000 and had 4 properties with individual break even time <10 years in this zip code. It would be prudent to look at this zip code with maximum priority for short term profits.
3 zipcodes of Staten Island feature in the list - 10304, 10305, 10308. The first two out of these zip codes look very promising because of low break even time.
Brooklyn has 4 zipcodes in the list having atleast 20 properties in each of these zip codes. As seen in the visualization in step 18, all of these zip codes have atleast some properties whose break even time is < 10 years and atleast 50% of the total number of properties have break even time <15 years.
Manhatten has expensive properties and will take a higher time to break even. It is not much suited for short term investment.

Summary

Observations

Queens’ 11434 zip code seems very promising for short term investment with median property cost ~ $500,000.
Staten Island also has 2 promising zip codes: 10304, 10305 with 1 property in each with ~ 10 years of break even time. The median property cost is < $500,000.
The number of properties in our data for Queens and Staten Island were very few. It would be prudent to look at Brooklyn’s best zip codes, which provides wider choice to purchase property becuase of hugher number of listed properties in each zip code.
Brookyln’s zipcodes: 11217, 11231, 11201, and 11215 may be considered if the client is comfortable in spending a higher amount compared with that for Queens and Staten Island (between $1,000,000 - 1,500,000. Brooklyn’s advantage is that it has many zipcodes that performed well in the metrics discussed in visualizations in step 18 and 19.
Manhattan’s only zipcode that looked good for investment was 10011. This neighbourhood can be ignored for investment if short term profit is the goal.

Data Munging and Visualizations

Yearly Median Price for property values has been taken using the median value of each month for each year in Zillow file
Various functions have been created to automate tasks and make the project scalable.
At the same time I have used built in functions such as apply, gsub, mutate, etc. to automate data cleaning.
I have created functions to:
- calculate CAGR for any 2 years
- Merge the final zillow and listings file
- output the final zip codes to be selected based on criteria decided
I have created detailed visualizations in steps 8.1.3 and 17-21 to clearly depict my analysis to the reader

What’s Next?

Utilize review scores in the criteria for zip code selection
Use Amenities column for better prediction of missing square foot area. Additionally, use it to see how price is impacted when there are more amenities. That impact should be considered to scale rent for the same number of amenities for all properties.
Using the initial columns in which host has written a lot of text to identify keywords and their relationship to rent.
Build a better model for square foot prediction. Additionally, I would want to impute missing values for cleaning_fee, price, and number of bathrooms as well (which were used in the linear regression model)

Assumptions

Given Assumptions:

2 bed room sets are most profitable (as identified by the client)
Occupancy rate of 75% has been assumed. Further, it has been assumed that these occupancies would be one day occupancies.
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)

New Assumptions:

Step 5: The real-estate company will purchase complete 2 bed room properties. Room type = “private room” in a 2 bed room property will require the price and square feet to be doubled
Step 10: The median of monthly median property prices for a RegionName (zipcode) will correctly reflect the median market cost of property in a zip code
Step 10: For 2017, the medp has been taken as the median price for the 6 months because the data is only available for 6 months in 2017
Step 12: CAGR would be a decent indicator to forecast property and rent prices
Step 12 : 2019 median property costs would be better reflected by using cagr 2 year growth rate and forecasting the present property costs using 2017 median property costs
Step 14: It has been given that all properties and all square feet within each locale can be assumed to be homogeneous. Hence, each zipcode in a neighbourhood group will have medp.2019 as the median property price. I have assumed that the per unit square feet property cost for a neighbourhood would be the medp.2019 / median square foot of properties in a neighbourhood.
Step 15: The increase in rent per night (price) for 2019 would be at the same rate as the increase in property values (2 year CAGR) for that zip code from 2017 to 2019. In case the CAGR is negative then there would be no increase in rent. It will remain the same.
Step 16: Since the break even time is for longer periods, I have used the 5 year CAGR calculated from 2012 - 2017 change in property cost value in Zillow file as the rate at which rent will increase in the years following 2019. Day rent would be charged for all the days. 75% occupancy has been assumed in a year for all neighbourhoods and zip codes.

AirBnB & Zillow Data Analysis

Aabhaas Sethi