Case Overview

Business Problem

A real estate company plans to purchase properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, the company doesn’t know which zip codes are the best to invest in.

My firm has been engaged by the real estate company to build a data product and provide conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.

The publicly available data used for the analysis come from Zillow and Airbnb.

  • Zillow dataset (cost data): Provides selling cost for 2 bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017.

  • Airbnb dataset (revenue data): Information about property listings in New York including location, number of bedrooms, reviews, price, availability, property description, etc. AirBnB is the medium through which the real estate company plans to lease out their investment property.

Assumptions

  • The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).

  • The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).

  • All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)

  • Occupancy rate of 75% throughout the year for Airbnb properties.

  • The company will put properties on rent every day throughout the year.

  • We assume that 40 percent of bookings are single day bookings that last less than a week, 40 percent are weekly bookings and 20 percent of bookings are made for a month. This assumption is cursory and done for the sake of this exercise. [Guessed from basic internet research]

Note

Selling cost information of 2 bedroom properties is available only until June 2017. I perform time series forecasting to obtain the value of properties in December 2018 by considering data from the past 3 years. Also, the reason I didn’t forecast the selling cost at a time point further in 2019 is because I wanted to keep it as close to June 2017 as possible.

Packages Used

The following packages were used:

  • plyr : For data analysis

  • data.table : Importing the data

  • colorspace : Required in system to install dependent package tidyverse

  • tidyverse : Data manipulation and plotting graphs

  • forecast : For time series analysis

  • plotly : Used to plot interactive charts

  • astsa : For time series analysis

  • Amelia : For imputing missing values

  • mice : For imputing missing values

  • dplyr : For data aggregation

  • kable : For table generation

  • highcharter : Creating charts

  • magrittr : Coding functionalities

  • broom : Data cleaning

  • purrr : Used by a function

if(!require("plyr")){
    
    install.packages("tidyverse",dependencies = T)
    
    library(plyr)
    
  }
  
detach(package:plyr) #remove it because it yields issues with tidyverse; however, still need to make sure it's installed for rbind.fill

#Install packages if it is not present
package_list<- c('data.table', 'colorspace', 'tidyverse', 'plotly', 'forecast', 'astsa', 'Amelia', 'mice', 'dplyr','highcharter','magrittr','broom','purrr','DT','shiny')

for (req_package in package_list) {
  if (!require(req_package, character.only = T, quietly = T)) {
    install.packages(req_package, repos = "http://cran.us.r-project.org")
    library(req_package, character.only = T)
  }
}

#Load the packages
library(colorspace)
library(tidyverse)
library(data.table)
library(plotly)
library(forecast)
library(astsa)
library(Amelia)
library(mice)
library(dplyr)
library(magrittr)
library(broom)
library(purrr)
library(highcharter)
library(DT)
library(shiny)

Data Preparation

Data Import

The 2 datasets upon which analysis has been performed to help solve the business case:

  • Airbnb data : Contains information about property listings in New York. This data would help us understand the amount of money to be gained by offering short-term rentals in New York.

  • Zillow data : Cost of 2 bedroom properties in different regions over the years (April 1996 - June 2017).

Import data

# read csv files
airbnb_listings <- read.csv("listings.csv") 
zillow <- read.csv("Zip_Zhvi_2bedroom.csv")
#Dimensions of airbnb_listings 
dim(airbnb_listings)
## [1] 40753    95
#Dimensions of zillow data
dim(zillow)
## [1] 8946  262

The listings data contains 40753 rows and 95 columns. While the Zillow dataset contains 8946 rows and 262 columns.

We take a look at the columns of both datasets to familiarize ourselves with them.

airbnb_listings

colnames(airbnb_listings)
##  [1] "id"                               "listing_url"                     
##  [3] "scrape_id"                        "last_scraped"                    
##  [5] "name"                             "summary"                         
##  [7] "space"                            "description"                     
##  [9] "experiences_offered"              "neighborhood_overview"           
## [11] "notes"                            "transit"                         
## [13] "access"                           "interaction"                     
## [15] "house_rules"                      "thumbnail_url"                   
## [17] "medium_url"                       "picture_url"                     
## [19] "xl_picture_url"                   "host_id"                         
## [21] "host_url"                         "host_name"                       
## [23] "host_since"                       "host_location"                   
## [25] "host_about"                       "host_response_time"              
## [27] "host_response_rate"               "host_acceptance_rate"            
## [29] "host_is_superhost"                "host_thumbnail_url"              
## [31] "host_picture_url"                 "host_neighbourhood"              
## [33] "host_listings_count"              "host_total_listings_count"       
## [35] "host_verifications"               "host_has_profile_pic"            
## [37] "host_identity_verified"           "street"                          
## [39] "neighbourhood"                    "neighbourhood_cleansed"          
## [41] "neighbourhood_group_cleansed"     "city"                            
## [43] "state"                            "zipcode"                         
## [45] "market"                           "smart_location"                  
## [47] "country_code"                     "country"                         
## [49] "latitude"                         "longitude"                       
## [51] "is_location_exact"                "property_type"                   
## [53] "room_type"                        "accommodates"                    
## [55] "bathrooms"                        "bedrooms"                        
## [57] "beds"                             "bed_type"                        
## [59] "amenities"                        "square_feet"                     
## [61] "price"                            "weekly_price"                    
## [63] "monthly_price"                    "security_deposit"                
## [65] "cleaning_fee"                     "guests_included"                 
## [67] "extra_people"                     "minimum_nights"                  
## [69] "maximum_nights"                   "calendar_updated"                
## [71] "has_availability"                 "availability_30"                 
## [73] "availability_60"                  "availability_90"                 
## [75] "availability_365"                 "calendar_last_scraped"           
## [77] "number_of_reviews"                "first_review"                    
## [79] "last_review"                      "review_scores_rating"            
## [81] "review_scores_accuracy"           "review_scores_cleanliness"       
## [83] "review_scores_checkin"            "review_scores_communication"     
## [85] "review_scores_location"           "review_scores_value"             
## [87] "requires_license"                 "license"                         
## [89] "jurisdiction_names"               "instant_bookable"                
## [91] "cancellation_policy"              "require_guest_profile_picture"   
## [93] "require_guest_phone_verification" "calculated_host_listings_count"  
## [95] "reviews_per_month"

zillow

colnames(zillow)
##   [1] "RegionID"   "RegionName" "City"       "State"      "Metro"     
##   [6] "CountyName" "SizeRank"   "X1996.04"   "X1996.05"   "X1996.06"  
##  [11] "X1996.07"   "X1996.08"   "X1996.09"   "X1996.10"   "X1996.11"  
##  [16] "X1996.12"   "X1997.01"   "X1997.02"   "X1997.03"   "X1997.04"  
##  [21] "X1997.05"   "X1997.06"   "X1997.07"   "X1997.08"   "X1997.09"  
##  [26] "X1997.10"   "X1997.11"   "X1997.12"   "X1998.01"   "X1998.02"  
##  [31] "X1998.03"   "X1998.04"   "X1998.05"   "X1998.06"   "X1998.07"  
##  [36] "X1998.08"   "X1998.09"   "X1998.10"   "X1998.11"   "X1998.12"  
##  [41] "X1999.01"   "X1999.02"   "X1999.03"   "X1999.04"   "X1999.05"  
##  [46] "X1999.06"   "X1999.07"   "X1999.08"   "X1999.09"   "X1999.10"  
##  [51] "X1999.11"   "X1999.12"   "X2000.01"   "X2000.02"   "X2000.03"  
##  [56] "X2000.04"   "X2000.05"   "X2000.06"   "X2000.07"   "X2000.08"  
##  [61] "X2000.09"   "X2000.10"   "X2000.11"   "X2000.12"   "X2001.01"  
##  [66] "X2001.02"   "X2001.03"   "X2001.04"   "X2001.05"   "X2001.06"  
##  [71] "X2001.07"   "X2001.08"   "X2001.09"   "X2001.10"   "X2001.11"  
##  [76] "X2001.12"   "X2002.01"   "X2002.02"   "X2002.03"   "X2002.04"  
##  [81] "X2002.05"   "X2002.06"   "X2002.07"   "X2002.08"   "X2002.09"  
##  [86] "X2002.10"   "X2002.11"   "X2002.12"   "X2003.01"   "X2003.02"  
##  [91] "X2003.03"   "X2003.04"   "X2003.05"   "X2003.06"   "X2003.07"  
##  [96] "X2003.08"   "X2003.09"   "X2003.10"   "X2003.11"   "X2003.12"  
## [101] "X2004.01"   "X2004.02"   "X2004.03"   "X2004.04"   "X2004.05"  
## [106] "X2004.06"   "X2004.07"   "X2004.08"   "X2004.09"   "X2004.10"  
## [111] "X2004.11"   "X2004.12"   "X2005.01"   "X2005.02"   "X2005.03"  
## [116] "X2005.04"   "X2005.05"   "X2005.06"   "X2005.07"   "X2005.08"  
## [121] "X2005.09"   "X2005.10"   "X2005.11"   "X2005.12"   "X2006.01"  
## [126] "X2006.02"   "X2006.03"   "X2006.04"   "X2006.05"   "X2006.06"  
## [131] "X2006.07"   "X2006.08"   "X2006.09"   "X2006.10"   "X2006.11"  
## [136] "X2006.12"   "X2007.01"   "X2007.02"   "X2007.03"   "X2007.04"  
## [141] "X2007.05"   "X2007.06"   "X2007.07"   "X2007.08"   "X2007.09"  
## [146] "X2007.10"   "X2007.11"   "X2007.12"   "X2008.01"   "X2008.02"  
## [151] "X2008.03"   "X2008.04"   "X2008.05"   "X2008.06"   "X2008.07"  
## [156] "X2008.08"   "X2008.09"   "X2008.10"   "X2008.11"   "X2008.12"  
## [161] "X2009.01"   "X2009.02"   "X2009.03"   "X2009.04"   "X2009.05"  
## [166] "X2009.06"   "X2009.07"   "X2009.08"   "X2009.09"   "X2009.10"  
## [171] "X2009.11"   "X2009.12"   "X2010.01"   "X2010.02"   "X2010.03"  
## [176] "X2010.04"   "X2010.05"   "X2010.06"   "X2010.07"   "X2010.08"  
## [181] "X2010.09"   "X2010.10"   "X2010.11"   "X2010.12"   "X2011.01"  
## [186] "X2011.02"   "X2011.03"   "X2011.04"   "X2011.05"   "X2011.06"  
## [191] "X2011.07"   "X2011.08"   "X2011.09"   "X2011.10"   "X2011.11"  
## [196] "X2011.12"   "X2012.01"   "X2012.02"   "X2012.03"   "X2012.04"  
## [201] "X2012.05"   "X2012.06"   "X2012.07"   "X2012.08"   "X2012.09"  
## [206] "X2012.10"   "X2012.11"   "X2012.12"   "X2013.01"   "X2013.02"  
## [211] "X2013.03"   "X2013.04"   "X2013.05"   "X2013.06"   "X2013.07"  
## [216] "X2013.08"   "X2013.09"   "X2013.10"   "X2013.11"   "X2013.12"  
## [221] "X2014.01"   "X2014.02"   "X2014.03"   "X2014.04"   "X2014.05"  
## [226] "X2014.06"   "X2014.07"   "X2014.08"   "X2014.09"   "X2014.10"  
## [231] "X2014.11"   "X2014.12"   "X2015.01"   "X2015.02"   "X2015.03"  
## [236] "X2015.04"   "X2015.05"   "X2015.06"   "X2015.07"   "X2015.08"  
## [241] "X2015.09"   "X2015.10"   "X2015.11"   "X2015.12"   "X2016.01"  
## [246] "X2016.02"   "X2016.03"   "X2016.04"   "X2016.05"   "X2016.06"  
## [251] "X2016.07"   "X2016.08"   "X2016.09"   "X2016.10"   "X2016.11"  
## [256] "X2016.12"   "X2017.01"   "X2017.02"   "X2017.03"   "X2017.04"  
## [261] "X2017.05"   "X2017.06"

Initial Data Processing

Filter Zillow data and forecast current price.

To help with scalability and reusability, we use the following function for preprocessing.

The following steps are performed by the zillow_preprocess function before joining the Airbnb and Zillow data:

  • Zillow data and the name of the city (in this case New York) are given as the arguments of the zillow_preprocess function.
  1. Select relevant columns such as Region Name (zipcode), the city, the SizeRank and the cost of the property only from the last 5 years.

  2. Filer data to only include rows containing cost information belonging to the city name passed as the function argument (New York in this case).

  3. Forecast and attach the avg. price of 2 bedroom property in Jan 2019 in each zipcode as CurrentPrice.

Assuming that there is seasonality in the price and also that values depend not only on previous values (Auto Regressive AR) but also on diferences between previous values (Moving Average MA), we apply ARIMA model to predict the cost of the properties in Zipcodes from July 2017 to Jan 2019.

The zillow_preprocess Function

zillow_preprocess <- function(tempdf,cityName){ # Zillow data and New york city given as function arguement

  # Select zillow cost information from last 5 years and modify zillow data (tempdf) to only hold relevant columns
  n <- 60
  tempdf <- tempdf[,c(2,3,7,(ncol(tempdf)-n):ncol(tempdf))]
 
  tempdf <- filter(tempdf,City==cityName) # Filter for the required city
  
  colnames(tempdf)[colnames(tempdf)=="RegionName"] <- "zipcode" # Set proper column name to be used for merging later
  
  tempdf$currentPrice <- NULL # Create a new column to store the latest price in January 2019
  # we define a for loop to iterate over each zipcode to obtain latest cost of property
  for(i in 1:nrow(tempdf)){

    tmp = ts(as.vector(t(tempdf[,c(4:64)])[,i]),start = c(2012,6),frequency = 12) # Convert the monthly cost data into time series data 
  
    ARIMAfit = arima(tmp, order=c(1,1,1), seasonal=list(order=c(1,0,1),period=NA), 
                     method="ML")# Define ARIMA model to be used for prediction
    
    pred = predict(ARIMAfit, n.ahead = 19)# use the ARIMA model to predict the price from July 2017 to Jan 2019
    
    predval <- pred$pred # Store the predicted values in a variable 
    
    tempdf$currentPrice[i] <- predval[length(predval)] # set the value of current price for the specific zipcode as price in Jan 2019
  }
  return(tempdf[,c(1,2,3,65)]) # return the filtered data containing only relevant columns

}

Call function above by passing zillow data and city name as ‘New York’

Calling the zillow_preprocess Function

city="New York"
zillow_filtered <- zillow_preprocess(zillow,city) # call zillow_preprocess function
str(zillow_filtered) # view the structure of clean and filtered Zillow data
## 'data.frame':    25 obs. of  4 variables:
##  $ zipcode     : int  10025 10023 10128 10011 10003 11201 11234 10314 11215 10028 ...
##  $ City        : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
##  $ SizeRank    : int  1 3 14 15 21 32 52 68 71 109 ...
##  $ currentPrice: num  1459633 2147667 1981749 2624939 2171586 ...

The final clean data contains 4 columns i.e. Zipcode, City, SizeRank and the current price of property in the particular zipcode. There are 25 rows where each row describes a unqiue zipcode. Lets look at top five rows from the clean zillow data

head(zillow_filtered,5) # retrieve the first 5 rows of zillow data
##   zipcode     City SizeRank currentPrice
## 1   10025 New York        1      1459633
## 2   10023 New York        3      2147667
## 3   10128 New York       14      1981749
## 4   10011 New York       15      2624939
## 5   10003 New York       21      2171586

Filter airbnb_listings data

In airbnb_listings data, we are only concerned about 2 bedroom listings. Also, airbnb_listings data has 95 columns and 40753 rows, most of which we will not using during this analysis exercise. We only choose these columns: “id”,“zipcode”,“bedrooms”,“square_feet”,“price”,“weekly_price”,“monthly_price”,“cleaning_fee”,“number_of_reviews”,“review_scores_rating”

listings_filter function

listings_filter <- function(tempdf,numberOfBedroom){
  # Select relevant columns
  relevantcol <- c("id","zipcode","bedrooms","square_feet","price","weekly_price","monthly_price","cleaning_fee","neighbourhood_group_cleansed","number_of_reviews","review_scores_rating")
  tempdf <- tempdf[,relevantcol]
  # filter data containing 2 bedrooms
  tempdf <- filter(tempdf,bedrooms==numberOfBedroom)
  return(tempdf)
}

We call the above function by passing the available listings data and also the number of bedrooms as 2

Calling the above function

noBedroom=2 # set required number of bedrooms
listings_filtered <- listings_filter(airbnb_listings,noBedroom) # call the function
str(listings_filtered) # observe the structure of filtered listings data
## 'data.frame':    4894 obs. of  11 variables:
##  $ id                          : int  9513511 5046189 4357134 16027061 11301089 14855080 16231738 2836845 1841252 1581579 ...
##  $ zipcode                     : Factor w/ 205 levels "","05340","07310",..: 84 91 104 104 108 108 108 108 108 108 ...
##  $ bedrooms                    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ square_feet                 : int  NA NA NA NA NA NA NA NA NA 1000 ...
##  $ price                       : Factor w/ 583 levels "$1,000.00","$1,021.00",..: 65 87 151 205 516 178 33 98 33 207 ...
##  $ weekly_price                : Factor w/ 786 levels "","$1,000.00",..: 1 1 1 1 1 1 1 29 1 215 ...
##  $ monthly_price               : Factor w/ 839 levels "","$1,000.00",..: 1 1 1 1 1 1 1 492 1 724 ...
##  $ cleaning_fee                : Factor w/ 172 levels "","$0.00","$10.00",..: 1 148 1 1 108 168 144 148 64 159 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 1 1 4 4 4 4 4 4 4 4 ...
##  $ number_of_reviews           : int  4 31 0 0 1 15 2 107 3 135 ...
##  $ review_scores_rating        : int  85 95 NA NA 60 87 80 96 90 97 ...
head(listings_filtered)  # view first few row of data 
##         id zipcode bedrooms square_feet   price weekly_price monthly_price
## 1  9513511   10462        2          NA $130.00                           
## 2  5046189   10469        2          NA $150.00                           
## 3  4357134   11102        2          NA $200.00                           
## 4 16027061   11102        2          NA $250.00                           
## 5 11301089   11105        2          NA  $79.00                           
## 6 14855080   11105        2          NA $225.00                           
##   cleaning_fee neighbourhood_group_cleansed number_of_reviews
## 1                                     Bronx                 4
## 2       $75.00                        Bronx                31
## 3                                    Queens                 0
## 4                                    Queens                 0
## 5      $400.00                       Queens                 1
## 6       $95.00                       Queens                15
##   review_scores_rating
## 1                   85
## 2                   95
## 3                   NA
## 4                   NA
## 5                   60
## 6                   87

The listings_filtered (airbnb) data has 11 columns and 4894 rows where each row describes a unique property

Join the filtered Airbnb and Zillow Data

The two filtered datastes - listings_filtered (airbnb) and zillow_filtered (zillow) are joined on the key zipcode. In below code we combine the data sets and have a look at the structure of the combined data.

finalData <- merge(listings_filtered,zillow_filtered,by = c("zipcode")) # merge data sets on zipcode
str(finalData) # observe structure of combined data
## 'data.frame':    1238 obs. of  14 variables:
##  $ zipcode                     : Factor w/ 205 levels "","05340","07310",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ id                          : int  13561752 4942107 711635 4510857 3799598 568743 8335547 15094880 7664343 8884228 ...
##  $ bedrooms                    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ square_feet                 : int  NA NA 800 NA NA NA NA NA NA NA ...
##  $ price                       : Factor w/ 583 levels "$1,000.00","$1,021.00",..: 378 579 194 53 194 96 205 182 120 263 ...
##  $ weekly_price                : Factor w/ 786 levels "","$1,000.00",..: 1 1 110 716 1 147 1 1 2 164 ...
##  $ monthly_price               : Factor w/ 839 levels "","$1,000.00",..: 1 1 646 452 1 675 1 1 1 659 ...
##  $ cleaning_fee                : Factor w/ 172 levels "","$0.00","$10.00",..: 37 1 121 133 148 148 159 121 4 4 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ number_of_reviews           : int  14 37 63 2 144 137 1 26 2 72 ...
##  $ review_scores_rating        : int  94 100 95 80 88 83 100 95 90 95 ...
##  $ City                        : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
##  $ SizeRank                    : int  21 21 21 21 21 21 21 21 21 21 ...
##  $ currentPrice                : num  2171586 2171586 2171586 2171586 2171586 ...
summary(finalData)
##     zipcode          id              bedrooms  square_feet    
##  11215  :141   Min.   :   20853   Min.   :2   Min.   :   0.0  
##  10003  :133   1st Qu.: 4218751   1st Qu.:2   1st Qu.: 800.0  
##  10025  :112   Median : 9410246   Median :2   Median : 975.0  
##  10036  :108   Mean   : 9218383   Mean   :2   Mean   : 921.6  
##  10011  :102   3rd Qu.:14476782   3rd Qu.:2   3rd Qu.:1100.0  
##  10014  : 95   Max.   :18508770   Max.   :2   Max.   :1600.0  
##  (Other):547                                  NA's   :1210    
##      price        weekly_price   monthly_price   cleaning_fee
##  $250.00: 93            :981            :1058          :236  
##  $200.00: 72   $1,200.00: 14   $4,000.00:  12   $100.00:208  
##  $300.00: 59   $1,500.00: 13   $6,000.00:   8   $50.00 : 87  
##  $150.00: 56   $1,100.00: 10   $3,000.00:   6   $80.00 : 81  
##  $350.00: 42   $1,000.00:  9   $5,000.00:   6   $150.00: 80  
##  $400.00: 32   $1,400.00:  8   $2,500.00:   5   $75.00 : 73  
##  (Other):884   (Other)  :203   (Other)  : 143   (Other):473  
##  neighbourhood_group_cleansed number_of_reviews review_scores_rating
##  Bronx        :  0            Min.   :  0.00    Min.   : 20.00      
##  Brooklyn     :374            1st Qu.:  1.00    1st Qu.: 90.00      
##  Manhattan    :840            Median :  5.00    Median : 95.00      
##  Queens       :  9            Mean   : 17.43    Mean   : 93.12      
##  Staten Island: 15            3rd Qu.: 20.00    3rd Qu.:100.00      
##                               Max.   :306.00    Max.   :100.00      
##                                                 NA's   :268         
##        City         SizeRank       currentPrice    
##  New York:1238   Min.   :   1.0   Min.   : 369831  
##  Aberdeen:   0   1st Qu.:  15.0   1st Qu.:1329214  
##  Abilene :   0   Median :  71.0   Median :1885687  
##  Abingdon:   0   Mean   : 485.9   Mean   :1835538  
##  Abington:   0   3rd Qu.: 580.0   3rd Qu.:2247494  
##  Acton   :   0   Max.   :4149.0   Max.   :3332599  
##  (Other) :   0

The combined data consists of 14 columns describing information for 1238 unique listings.

The data contains lot of issues such as NAs, incorrect data symbols such as $ and wrong data types. ### Data cleaning

The merged data obtained in the previous step has a lot of data quality issues such as missing values, incorrect data types and inconsistent data values. We perform a few steps to cleanse the data.

Changing the data types and names of variables

We set standard names for the columns in the final merged data (finalData). From the summary of the finalData (previous step), we see that city column had many other levels. So, we correct the number of levels in the data to include only New York city.

# Set the new column names 
colnames(finalData) <-  c("zipcode","id","bedrooms","square_feet","per_night_price","weekly_price","monthly_price","cleaning_fee","neighbourhood","number_of_reviews","review_scores_rating","city","size_rank","current_price")

# Correct the factor levels in city column
finalData$city <- factor(finalData$city, levels=c("New York"))

Cleaning the rent price data

Variables per_night_price(daily rent), weekly_price, monthly_price and cleaning_fee contain symbols such as $ attached which would prevent these columns from being used for numerical analysis. So, we clean them in following snippet.

Dealing with the dollar sign

cols <- c("per_night_price", "weekly_price", "monthly_price","cleaning_fee") # selecting columns to be cleaned

# Specify a function to replace characters with whitespace

replace_dollar <- function(x){
  price <- as.numeric(gsub("[$,]","",x)) # this function removes $ from data
  return(price)
}

# Apply function to replace characters with whitespace

finalData[cols] <- lapply(finalData[cols], replace_dollar) # running the above defined function on cols

# Again check the structure of the data

str(finalData)
## 'data.frame':    1238 obs. of  14 variables:
##  $ zipcode             : Factor w/ 205 levels "","05340","07310",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ id                  : int  13561752 4942107 711635 4510857 3799598 568743 8335547 15094880 7664343 8884228 ...
##  $ bedrooms            : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ square_feet         : int  NA NA 800 NA NA NA NA NA NA NA ...
##  $ per_night_price     : num  450 989 240 119 240 159 250 229 180 305 ...
##  $ weekly_price        : num  NA NA 1365 850 NA ...
##  $ monthly_price       : num  NA NA 5460 3200 NA ...
##  $ cleaning_fee        : num  150 NA 50 60 75 75 85 50 100 100 ...
##  $ neighbourhood       : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ number_of_reviews   : int  14 37 63 2 144 137 1 26 2 72 ...
##  $ review_scores_rating: int  94 100 95 80 88 83 100 95 90 95 ...
##  $ city                : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
##  $ size_rank           : int  21 21 21 21 21 21 21 21 21 21 ...
##  $ current_price       : num  2171586 2171586 2171586 2171586 2171586 ...

The above result shows that all the variables have proper data types and consistent values except for missing data which we will impute soon.

Create a function to normalize the data points

The algorithm which we later use to obtain KPIs to measure and compare zip codes will require some columns to be normalized on 0-1 scale.

normalize Function

normalize <- function(x){
  return((x-min(x))/(max(x)-min(x))) # function to scale variables between 0 and 1
}

# Scale the number of reviews column

finalData["number_of_reviews"] <- lapply(finalData["number_of_reviews"], normalize) 

# Check the summary of column to ensure scaling has been applied

summary(finalData$number_of_reviews)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.003268 0.016340 0.056952 0.065360 1.000000

We see that scaling works fine and that all the values lie between 0 and 1.

Identifying and imputing NA values

Finding the percentage of missing values in columns

# Check number of NA in all the columns

missingValues <- as.data.frame(colSums(sapply(finalData,is.na)))  

# Convert rownames to columns

missingValues <- as.data.frame(setDT(missingValues, keep.rownames = TRUE))

# Rename the column names

colnames(missingValues) <- c("columnName","totalNA_values")

# Transform totalNA to percent, add it as column and arrange in descending order on the basis of it

missingValues <- missingValues %>% 
  mutate_at(vars(totalNA_values),funs(percentNA_values=.*100/nrow(finalData))) %>% 
  arrange(desc(percentNA_values)) 

# Check the top columns having maximum NA values 

head(missingValues,n=nrow(missingValues))
##              columnName totalNA_values percentNA_values
## 1           square_feet           1210         97.73829
## 2         monthly_price           1058         85.46042
## 3          weekly_price            981         79.24071
## 4  review_scores_rating            268         21.64782
## 5          cleaning_fee            236         19.06300
## 6               zipcode              0          0.00000
## 7                    id              0          0.00000
## 8              bedrooms              0          0.00000
## 9       per_night_price              0          0.00000
## 10        neighbourhood              0          0.00000
## 11    number_of_reviews              0          0.00000
## 12                 city              0          0.00000
## 13            size_rank              0          0.00000
## 14        current_price              0          0.00000

Columns square_feet, monthly_price, weekly_price, review_scores_rating and cleaning_fee contain 97.7, 85.46, 79.24, 21.64 and 19.06 percent NA values respectively.

Although, variables which contain very large percentage of missing data should not be imputed but rather removed, here I still impute them. The most important variable (per_night_price) which is vital in computing revenue doesn’t have any missing values. So, our analysis and the resulting conclusions wouldn’t be impaired by this imputation.

The NAs are removed below by imputing them through the use of mice package and CART algorithm as follows

Imputation

## Missing Data Imputation

# Subset the data to remove the constants

dataSet <- subset(finalData,select = -c(id,city)) # id and city are constant and will not help in imputation

# Impute the Missing Data using Mice Package

cat("Started...Imputing Missing Data using Calssification and Regression Tree 'CART'")
## Started...Imputing Missing Data using Calssification and Regression Tree 'CART'
impdataSet <- mice(dataSet, m=5, method='cart', printFlag=FALSE)

cat("....Finished...Imputing Missing Data")
## ....Finished...Imputing Missing Data
#
# Use the imputed values in the original dataframe

complete_dataSet <- complete(impdataSet)

finaldf_subset <- subset(finalData,select = c(id,city))

finaldf_complete <- cbind(complete_dataSet,finaldf_subset) # combining the imputed dataset to add id and city


# Check if there are any more NA values

sum(sapply(finaldf_complete, function(x) { sum(is.na(x)) }))
## [1] 0
# Check the summary of the dataframe again

summary(finaldf_complete)
##     zipcode       bedrooms  square_feet   per_night_price   weekly_price 
##  11215  :141   Min.   :2   Min.   :   0   Min.   :  28.0   Min.   : 310  
##  10003  :133   1st Qu.:2   1st Qu.: 900   1st Qu.: 165.0   1st Qu.:1100  
##  10025  :112   Median :2   Median :1000   Median : 240.0   Median :1442  
##  10036  :108   Mean   :2   Mean   :1003   Mean   : 278.7   Mean   :1735  
##  10011  :102   3rd Qu.:2   3rd Qu.:1150   3rd Qu.: 325.0   3rd Qu.:2200  
##  10014  : 95   Max.   :2   Max.   :1600   Max.   :4700.0   Max.   :5950  
##  (Other):547                                                             
##  monthly_price    cleaning_fee          neighbourhood number_of_reviews 
##  Min.   : 1250   Min.   :  0.00   Bronx        :  0   Min.   :0.000000  
##  1st Qu.: 3500   1st Qu.: 60.00   Brooklyn     :374   1st Qu.:0.003268  
##  Median : 5000   Median : 90.00   Manhattan    :840   Median :0.016340  
##  Mean   : 5792   Mean   : 94.82   Queens       :  9   Mean   :0.056952  
##  3rd Qu.: 7425   3rd Qu.:118.75   Staten Island: 15   3rd Qu.:0.065360  
##  Max.   :17100   Max.   :350.00                       Max.   :1.000000  
##                                                                         
##  review_scores_rating   size_rank      current_price    
##  Min.   : 20.00       Min.   :   1.0   Min.   : 369831  
##  1st Qu.: 90.00       1st Qu.:  15.0   1st Qu.:1329214  
##  Median : 96.00       Median :  71.0   Median :1885687  
##  Mean   : 93.31       Mean   : 485.9   Mean   :1835538  
##  3rd Qu.:100.00       3rd Qu.: 580.0   3rd Qu.:2247494  
##  Max.   :100.00       Max.   :4149.0   Max.   :3332599  
##                                                         
##        id                 city     
##  Min.   :   20853   New York:1238  
##  1st Qu.: 4218751                  
##  Median : 9410246                  
##  Mean   : 9218383                  
##  3rd Qu.:14476782                  
##  Max.   :18508770                  
## 

There are no more missing data values in the final data.

Aggregating Data for Analysis

To find out which zipcodes are profitable we need to aggregate the current data at zipcode level; where each zipcode is associated with its average daily/weekly/monthly airbnb rent price, number of 2 bedroom airbnb listings in the zipcode, average number of reviews for those listings, and the estimated cost of property of those 2 bedroom listings (current_price).

We create a new variable called unique_id which holds nothing but the number of unique IDs (unique 2 bedroom listings) from airbnb listings.

unique_id: The number of properties listed in Airbnb in each zip code

This is done using the below code.

# Find average of the required columns

avg_df <- finaldf_complete %>% 
  group_by(zipcode) %>% 
  summarise_at(vars(square_feet:current_price),mean) # mean of current price and other price attributes

# Find unique id (no. of properties) for each zipcode

unique_id_df <- finaldf_complete %>% select(zipcode,id) %>% 
    group_by(zipcode) %>%
    mutate(unique_id = n_distinct(id)) %>% 
    select(zipcode,unique_id) %>% distinct() # count number of properties in each zipcode as unique_id

neighbourhood_df <- finaldf_complete %>% select(zipcode,neighbourhood) %>% 
    group_by(zipcode) %>%
    select(zipcode,neighbourhood) %>% distinct()

# Combine both the dataframes to get final summary of the data
finaldf_complete<-inner_join(unique_id_df,finaldf_complete,by="zipcode")
summary_df <- inner_join(avg_df, unique_id_df, by = "zipcode") # combine data agregations using zipcode
summary_df=inner_join(summary_df,neighbourhood_df,by="zipcode")
make.names(colnames(summary_df))
##  [1] "zipcode"              "square_feet"          "per_night_price"     
##  [4] "weekly_price"         "monthly_price"        "cleaning_fee"        
##  [7] "neighbourhood.x"      "number_of_reviews"    "review_scores_rating"
## [10] "size_rank"            "current_price"        "unique_id"           
## [13] "neighbourhood.y"
summary_df$neighbourhood.x<-NULL
colnames(summary_df)[colnames(summary_df)=="neighbourhood.y"] <- "neighbourhood"
summary(summary_df)
##     zipcode    square_feet     per_night_price  weekly_price 
##  10003  : 1   Min.   : 400.0   Min.   : 65.0   Min.   : 350  
##  10011  : 1   1st Qu.: 782.8   1st Qu.:128.7   1st Qu.:1009  
##  10013  : 1   Median :1000.0   Median :220.2   Median :1499  
##  10014  : 1   Mean   : 917.7   Mean   :228.5   Mean   :1439  
##  10021  : 1   3rd Qu.:1103.8   3rd Qu.:307.1   3rd Qu.:1871  
##  10022  : 1   Max.   :1153.4   Max.   :367.3   Max.   :2430  
##  (Other):16                                                  
##  monthly_price   cleaning_fee    number_of_reviews review_scores_rating
##  Min.   :1250   Min.   : 22.50   Min.   :0.00000   Min.   : 80.00      
##  1st Qu.:3297   1st Qu.: 72.39   1st Qu.:0.04422   1st Qu.: 91.73      
##  Median :5001   Median : 88.38   Median :0.05049   Median : 93.60      
##  Mean   :4878   Mean   : 81.27   Mean   :0.05772   Mean   : 92.81      
##  3rd Qu.:6477   3rd Qu.:101.46   3rd Qu.:0.07016   3rd Qu.: 94.74      
##  Max.   :7544   Max.   :114.25   Max.   :0.23366   Max.   :100.00      
##                                                                        
##    size_rank      current_price       unique_id            neighbourhood
##  Min.   :   1.0   Min.   : 369831   Min.   :  1.00   Bronx        : 0   
##  1st Qu.:  37.0   1st Qu.: 462082   1st Qu.:  9.00   Brooklyn     : 5   
##  Median : 479.5   Median :1471799   Median : 59.50   Manhattan    :11   
##  Mean   : 805.7   Mean   :1436329   Mean   : 56.27   Queens       : 1   
##  3rd Qu.:1389.8   3rd Qu.:2106187   3rd Qu.: 92.75   Staten Island: 5   
##  Max.   :4149.0   Max.   :3332599   Max.   :141.00                      
## 

Some information gleaned from the above summary:

There are 22 Zipcodes in New York city having 2 bedroom properties listed on Airbnb

At least 1 zipcode has only 1 rental property listed on Airbnb and atleast one zipcode has 141 rental properties listed on Airbnb. Most of the zipcodes have more than 50 properties listed on Airbnb

The minimum average ratings associated with a zipcode is 80 while maximum is 96.69 on scale (0-100)

The cheapest zipcode’s average daily rent is $65.0 and the costliest zipcode’s average daily rent is $367.3

If the company goes for zipcode with cheapest property then each property there will cost on average $363483 and zipcode with costliest property will have properties listed on average at $3290515

Move to Analysis tab

Visualizations and Conclusions

Defining New Variables

We need to analyze the cost of purchasing 2 bedroom properties in different zipcodes and the revenues that can be expected from renting those properties through Airbnb.

We compute (or assume) the following variables to help us with the analysis:

  • Occupancy_Rate = 0.75 (probability that the property is occupied aka booked on any given day)

  • p_daily = 0.75 (probability that booking is done for a day)

  • p_weekly = 0.18 (probability that booking is done for a week)

  • p_month = 0.07 (probability that booking is done for a month)

  • Est_Annual_Revenue(per_property) = Occupancy_Rate x 365 x ((p_daily x Rent_per_Day )+(p_weekly x Rent_per_Week/7)+(p_month x Rent_per_Month/30))

  • Revenue_by_Cost_Ratio = Est_Annual_Revenue/Est_Purchase_Cost_of_the_Property

  • Breakeven_Point = (Est_Purchase_Cost_of_the_Property/(Estimated_Annual_Revenue-(Occupancy Rate x 52 x Cleaning_Fee)))

  • Year_N_est_net. = ((Est_Annual_Revenue x N)-(N x Occupancy_Rate x 52 x Cleaning_Fee))-Est_Purchase_Cost_of_the_Property

  • Percent_units: This column defines the percentage of number of 2 bedroom properties which are present in the given zipcode. A higher value means more number of properties available in the respective zipcode

  • Review_effect: This column denotes the scaled value of the Review_score_ratings column on a scale of 0 to 1

p_daily=.75 # probability that booking is done for a day

p_weekly=.18 # probability that booking is done for a week

p_month=.7 # probability that booking is done for a month

occupancy_rate <- .75

Year_days <- 365 # number of days in a year

# derive the percentage of properties listed in a given zipcode
summary_df$Percent_units <- summary_df$unique_id*100/sum(summary_df$unique_id)


summary_df$Review_effect <- normalize(summary_df$review_scores_rating) # scale the review_scores_rating

summary_df$Review_effect <-ifelse(summary_df$Review_effect>0,summary_df$Review_effect,mean(summary_df$Review_effect)) # impute those properties which had no ratings 

# Generate the revenue for a year per property
summary_df$Revenue_year_per_property<-occupancy_rate*Year_days*((p_daily*summary_df$per_night_price)+(p_weekly*summary_df$weekly_price/7)+(p_month*summary_df$monthly_price/30))

# Obtain the Revenue/Cost ratio
summary_df$Revenue_by_Cost_Ratio <-((summary_df$Revenue_year_per_property)/(summary_df$current_price))

# Breakeven period
summary_df$breakeven_point <-(summary_df$current_price/(summary_df$Revenue_year_per_property-(occupancy_rate*52*summary_df$cleaning_fee)))
summary_df$year_5 <- ((summary_df$Revenue_year_per_property*5)-(5*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_10 <- ((summary_df$Revenue_year_per_property*10)-(10*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_15 <- ((summary_df$Revenue_year_per_property*15)-(15*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_20 <- ((summary_df$Revenue_year_per_property*20)-(20*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price
summary_df$year_25 <- ((summary_df$Revenue_year_per_property*25)-(25*occupancy_rate*52*summary_df$cleaning_fee))-summary_df$current_price

summary_df<-summary_df %>% mutate_at(vars(per_night_price,square_feet,number_of_reviews,Review_effect,Revenue_year_per_property, weekly_price,monthly_price,cleaning_fee,current_price,review_scores_rating,Percent_units, breakeven_point, year_5,year_10,year_15,year_20,year_25), funs(round(., 2)))

summary_df<-summary_df %>% mutate_at(vars(Revenue_by_Cost_Ratio), funs(round(., 3)))

The aggregated data from the previous steps is as follows:

datatable(summary_df[,c("zipcode","per_night_price","weekly_price","monthly_price","cleaning_fee","review_scores_rating","size_rank","current_price","unique_id","neighbourhood","Percent_units","Revenue_year_per_property","Revenue_by_Cost_Ratio","breakeven_point","year_5","year_10","year_15","year_20","year_25")],rownames=FALSE,options=list(pageLength=22,dom='t',searching=FALSE,paging=FALSE,scroller=TRUE,scrollX=TRUE,scrollY="500px",fixedHeader=TRUE, autoWidth=TRUE),colnames=c("Zipcode","Daily Rent","Weekly Rent","Monthly Rent","Cleaning Fee","Scaled Avg. Review Score","Size Rank","Property Cost","No. of Property","Neighbourhood","Percentage of Units","Rev/Property","Revenue/Cost Ratio","Breakeven Point","Year 5 est. net","year 10 est. net","year 15 est. net","Year 20 est. net","Year 25 est. net"))

Please observe that a few zipcodes among the 22 have very few 2 bedroom airbnb listings. Also, noticing from the previous section, the (unique_id) variable’s minimum value is 1, meaning that at least 1 zipcode has just 1 listed 2 bedroom property in airbnb; and the first quartile equals 9, meaning at least 25% of zipcodes have 9 or less number of 2 bedroom properties listed in airbnb. Important values like revenue might get skewed when a zipcode has very few listings. One possible solution would be to write a function that checks if the percentage differences in avg. airbnb rent between 2 bedroom and other types of listings (1/3/4 bedroom) within each such zipcode are comparable to similar percentage difference measures computed for other New York zipcodes (zipcodes with more number of 2 bedroom airbnb listings). For now, I ignore zipcodes with less than 10 two bedroom airbnb listings.

No. of 2 Bedroom Listings - Plot

Before we get into more meaningful analysis charts, we need to make sure that we only consider those zipcodes that have enough listings in the Airbnb data. Else the analysis will be flawed.

Zipcodes with fewer than 10 two bedroom Airbnb listings in the data.

colnames(summary_df)[colnames(summary_df)=="unique_id"]<-"no_of_2bed_listings"
# min_no_of_listings - only if the number of 2 bedroom listings is higher than this value in a zipcode, is the zipcode used in revenue analysis
min_no_of_listings=10 
zipcodes_with_15orless_2bedroom_listings<-summary_df[(summary_df$no_of_2bed_listings<min_no_of_listings),]
zipcodes_with_16ormore_2bedroom_listings<-summary_df[!(summary_df$no_of_2bed_listings<min_no_of_listings),]

# Displaying zipcodes with less than 10 two bedroom listings
datatable(zipcodes_with_15orless_2bedroom_listings[,c('zipcode','no_of_2bed_listings')],rownames=FALSE,options=list(don='t',searching=FALSE,paging=FALSE),colnames=c("zipcode","No. of 2 bedroom listings"))

Let’s glance at the number of 2 bedroom listings in different zipcodes. Only zipcodes with 10 or more 2 bedroom Airbnb listings are visualized

swthm <- hc_theme_merge(
  hc_theme_darkunica(),
  hc_theme(
    credits = list(
      style = list(
        color = "#4bd5ee"
      )
    ),
    title = list(
      style = list(
        color = "#4bd5ee"
        )
      ),
    chart = list(
      backgroundColor = "transparent",
      divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
      style = list(fontFamily = "Comic Sans MS")
    )
  )
)
zipcodes_with_16ormore_2bedroom_listings <- zipcodes_with_16ormore_2bedroom_listings[order(zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings,decreasing = TRUE),]
highchart() %>% 
  hc_add_theme(swthm) %>% 
  hc_title(text = "Number of Airbnb 2 bedroom listings in a few NYC Zipcodes ") %>%
  hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode),
           title = list(text = "Zipcodes")) %>% 
  hc_yAxis(title = list(text = "Number of 2 bedroom properties listed")) %>% 
    hc_add_series_labels_values(zipcodes_with_16ormore_2bedroom_listings$zipcode,zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings,name = " ",colorByPoint=FALSE,type="column",dataLabels = list(enabled = TRUE),showInLegend=FALSE)%>%
  hc_tooltip(pointFormat = "Number of 2 bedroom listings: {point.y}")
#hc_add_series(data = zipcodes_with_16ormore_2bedroom_listings$no_of_2bed_listings, name = "Number of 2 bedroom properties listed in Airbnb in the zipcode", color = "#e5b13a") %>%

Purchase Cost - Plot

Needless to say, we need to have a good idea about the average 2 bedroom prices in different regions. Neighbourhood region is encoded in color.

categories_grouped <- summary_df %>%
  group_by(neighbourhood,zipcode)%>%
  do(neighbourhood = .$zipcode) %>% 
  list_parse()

swthm <- hc_theme_merge(
  hc_theme_538(),
  hc_theme(
    credits = list(
      style = list(
        color = "#2F4F4F"
      )
    ),
    title = list(
      style = list(
        color = "#2F4F4F"
        )
      ),
    chart = list(
      backgroundColor = "transparent",
      divBackgroundImage = "https://www.publicdomainpictures.net/download-picture.php?id=199969&check=2cf9ef5a06ace28f60059bb9a1fa5201",
      style = list(fontFamily = "Georgia")
    )
  )
)

highchart() %>% 
    hc_add_theme(swthm) %>% 
    hc_title(text = "Cost of 2 bedroom properties in NYC zipcodes. Hover over the bars to identify neighbourhoods " ) %>%
hc_xAxis(neighbourhood = categories_grouped,categories = as.character(summary_df$zipcode),title = list(text = "Zipcodes")) %>% 
  hc_yAxis(title = list(text = "Avg. cost of 2 bedroom property")) %>%
  hc_add_series(data = summary_df, type = "column", hcaes(x=neighbourhood,y = current_price, color = neighbourhood),showInLegend = FALSE)

Correlation between Cost and Revenue - Plot

Let’s confirm that buying properties in costlier zipcodes would definitely mean that the company can charge a higher rent. Each dot represents the positioning of a zipcode we are interested in analyzing.

We see the positive correlation. Fitting a generic linear model along the path of the scatterplot offers an interesting pattern. I haven’t followed this route in this analysis. Sidenote: Notice the three points that are distinctly seperated above the center of the line. The zipcodes associated with them feature in my final conclusion.

swthm <- hc_theme_merge(
  hc_theme_darkunica(),
  hc_theme(
    credits = list(
      style = list(
        color = "#4bd5ee"
      )
    ),
    title = list(
      style = list(
        color = "#4bd5ee"
        )
      ),
    chart = list(
      backgroundColor = "transparent",
      divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
      style = list(fontFamily = "Comic Sans MS")
    )
  )
)

lm.model <- augment(lm(per_night_price ~ current_price, data = summary_df))

highchart() %>%
  hc_add_theme(swthm) %>%
  hc_title(text="Price of property Vs Rent per day in different zipcodes") %>%
  hc_xAxis(title=list(text='Avg. Current Price'))%>%
  hc_yAxis(title=list(text='Avg. Rent per Day'))%>%
  hc_add_series(lm.model, "scatter",zoomType='xy',size=20, hcaes(x = current_price,y = per_night_price),showInLegend=FALSE) %>% 
  hc_add_series(lm.model, "line", hcaes(x = current_price, y = .fitted),showInLegend=FALSE)%>%
  hc_tooltip(crosshairs=TRUE,headerFormat = "Price of property(x) vs Rent per day (y):")

Revenue - Plot

Here’s a colorful visualization of the estimated revenue per year that can be expected from renting out a single 2 bedroom property in different zipcodes.

swthm <- hc_theme_merge(
  hc_theme_darkunica(),
  hc_theme(
    credits = list(
      style = list(
        color = "#4bd5ee"
      )
    ),
    title = list(
      style = list(
        color = "#4bd5ee"
        )
      ),
    chart = list(
      backgroundColor = "transparent",
      divBackgroundImage = "https://www.howtogeek.com/wp-content/uploads/2012/12/Plain-Black-Wallpaper.png",
      style = list(fontFamily = "Comic Sans MS")
    )
  )
)
zipcodes_with_16ormore_2bedroom_listings <- zipcodes_with_16ormore_2bedroom_listings[order(zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,decreasing = TRUE),]
zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property<-as.integer(zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property)
highchart() %>% 
  hc_add_theme(swthm) %>% 
  hc_title(text = "Comparing estimated Revenue per year per property in Zipcodes ") %>%
  hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode),
           title = list(text = "Zipcodes")) %>% 
  hc_yAxis(title = list(text = "Revenue per year per property"), min=65000, max=135000) %>% 
    hc_add_series_labels_values(zipcodes_with_16ormore_2bedroom_listings$zipcode,zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,name = " ",colorByPoint=TRUE,type="column",dataLabels = list(enabled = FALSE),showInLegend=FALSE) %>%
  hc_add_series_labels_values(labels = zipcodes_with_16ormore_2bedroom_listings$zipcode,values = zipcodes_with_16ormore_2bedroom_listings$Revenue_year_per_property,colorBypoint=TRUE,center = c('85%', '30%'),size = 200,type="pie",dataLabels = list(enabled = FALSE))%>%
  hc_tooltip(pointFormat = "Revenue/year/property: {point.y}")

As mentioned earlier, I ignore 7 zipcodes for revenue analysis because of insufficient number of 2 bedroom Airbnb listings in those zipcodes. One possible solution to this issue, as mentioned earlier is as follows:

Important values like revenue might get skewed when a zipcode has very few listings. One possible solution would be to write a function that checks if the percentage differences in avg. airbnb rent between 2 bedroom and other types of listings (1/3/4 bedroom) within each such zipcode are comparable to similar percentage difference measures computed for other New York zipcodes (zipcodes with more number of 2 bedroom airbnb listings). For now, I ignore zipcodes with less than 15 two bedroom airbnb listings.

The ignored zipcodes, the number of listings, and the estimated revenue are below:

datatable(zipcodes_with_15orless_2bedroom_listings[,c("zipcode","neighbourhood","no_of_2bed_listings","Revenue_year_per_property","current_price","Revenue_by_Cost_Ratio")],options=list(dom='t',searching=FALSE, paging=FALSE),colnames = c("Zipcode","Neighbourhood","No. of 2 Bed Listings", "Revenue/Year/Property","Cost of Property","Revenue/Cost Ratio"))

Revenue & Cost Plot

Relating revenue with cost of the property would be the most meaningful way to decide which zipcode to purchase property in.

In this plot, expenses that will be incurred after the purchase has been made have been neglected. So a plain ratio, Revenue/Forecasted-purchase-cost (Jan 2019), has been used.

A high Revenue/Cost ratio associated with buying and renting 2 bedroom properties in a zipcode means that barring outliers, it is safe to assume that investing in 2 bedroom properties in the zipcode would be safer because the investment would break even sooner than similar 2 bedroom investments in zipcodes which are associated with lower ratios.

The left y-axis holds revenue per year per property and the right y-axis holds the revenue/cost ratio. Revenue/cost value would help the company understand when they’d break even on a purchase.

swthm <- hc_theme_merge(
  hc_theme_monokai(),
  hc_theme(
    credits = list(
      style = list(
        color = "#4bd5ee"
      )
    ),
    title = list(
      style = list(
        color = "#4bd5ee"
        )
      ),
    chart = list(
      backgroundColor = "transparent",
      divBackgroundImage = "https://www.publicdomainpictures.net/pictures/130000/nahled/charcoal-background.jpg",
      style = list(fontFamily = "Comic Sans MS")
    )
  )
)
highchart()%>%
  hc_add_theme(swthm) %>% 
  hc_title(text = "Revenue per year per property and Revenue/Cost ratio" ) %>%
  hc_subtitle(text="(bars-left yaxis) and (line-right yaxis); to view seperately, select the legends on the bottom right") %>%
  hc_xAxis(categories = as.character(zipcodes_with_16ormore_2bedroom_listings$zipcode), title = list(text = "Zipcodes")) %>%
  hc_yAxis_multiples(list(title = list(text = "Revenue/year per property"),labels = list(format = "{value}"),min=65000,max=150000),list(title = list(text = "Revenue/Cost ratio"),labels=list(format ='{value}'),dataLabels=list(enabled=FALSE),opposite=TRUE,min=0.04,max=0.080)) %>%
  hc_plotOptions(column = list(stacking = "normal")) %>%
  hc_add_series(zipcodes_with_16ormore_2bedroom_listings,type="column",hcaes(x=zipcode,y=Revenue_year_per_property),showInLegend=TRUE) %>%
  hc_add_series(zipcodes_with_16ormore_2bedroom_listings,type="line",hcaes(x=zipcode,y=Revenue_by_Cost_Ratio),yAxis=1,showInLegend=TRUE) %>%
    hc_tooltip(pointFormat = "Stat: {point.y}")

Breakeven Analysis and Conclusions

df1<-zipcodes_with_16ormore_2bedroom_listings
df<-zipcodes_with_16ormore_2bedroom_listings[,c("zipcode","current_price","Revenue_year_per_property","year_5","year_10","year_15","year_20","year_25")]


rr=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[2]-range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[1]
r1=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[2]
r0=range(abs(df[,c("year_5","year_10","year_15","year_20","year_25")]))[1]

myDT2<-datatable(
  df,rownames = FALSE,
  options = list(pageLength = 15,dom='Blfrtip',searching=FALSE,paging=FALSE,
        rowCallback=JS(paste0("function(row, data) {",
                              paste(lapply(3:ncol(df),function(i){
                                paste0("var value=Math.abs(data[",i,"]);
                                        var value2=data[",i,"];
                                       if (value!==null){
                                       if(value2<0){
                                       $(this.api().cell(row,",i,").node()).css({'background':isNaN(parseFloat(value)) || value <=",r0," ? '' : 'linear-gradient(90deg, transparent ' + (",r1," - value)/",rr," * 100 + '%, #d30c3e  ' + (",r1," - value)/",rr," * 100 + '%)','background-size':'98% 80%','background-repeat':'no-repeat','background-position':'center'});
                                       }else{
                                       $(this.api().cell(row,",i,").node()).css({'background':isNaN(parseFloat(value)) || value <=",r0," ? '' : 'linear-gradient(90deg, transparent ' + (",r1," - value)/",rr," * 100 + '%, #018c3a ' + (",r1," - value)/",rr," * 100 + '%)','background-size':'98% 80%','background-repeat':'no-repeat','background-position':'center'},10000);
                                       }
                                       } 
if(data[0]!==null ){
        $(row).css('color', '#fff');
        $(row).css('background', '#212121');
    }                                       
if(data[0]== 10036 || data[0]== 10025 || data[0]== 11215 || data[0]== 10022 || data[0]== 11231){
        $(row).css('color', '#FFff00');
    }") 
}),collapse="\n"),
"}"))
,initComplete = JS(
                                                       "function(settings, json) {",
                                                       "$(this.api().table().header()).css({'background-color': '#121212', 'color': '#fff'});",
                                                       "var css = document.createElement('style');
                                                        css.type = 'text/css';
                                                        css.innerHTML = '.table.dataTable.hover tbody tr:hover, table.dataTable.display tbody tr:hover { background-color: #939393 !important }';
                                                        document.body.appendChild(css);",
                                                       "}")

                        ),
container = tags$table(
        class="cell-border stripe",
        tags$thead(tags$tr(lapply(c("Zipcode", "Avg Price","Rev/Property", "Year 5 est. net","Year 10 est. net", "Year 15 est. net","Year 20 est. net", "Year 25 est. net"), tags$th)))
      )
)

myDT2 <- formatStyle(myDT2,
                        columns = c(1),
                        backgroundColor = '#535353',
                        borderBottomColor = "rgb(255, 255, 255)",
                        borderBottomStyle = "solid",
                        borderBottomWidth = "1px",
                        borderCollapse = "collapse",
                        borderRightColor = "rgb(255, 255, 255)",
                        borderRightStyle = "solid",
                        borderRightWidth = "1px",
                        color = "#fff",
                        fontWeight = "bold",
                        lineHeight = "normal"
                        )

myDT2

Conclusions drawn from analysis

Top 5 location choices :
  1. Zipcode 10036 in Manhattan is the best zipcode to purchase 2 bedroom properties. The Average revenue ($134,335/year per property) is one of the highest. The avg 2 bedroom price is cheaper than all but one zipcode in Manhattan - $1,729,317 (see plot 2). It has 108 two bedroom listings in Airbnb (not the highest and far from the lowest); and it has the highest Revenue/cost ratio, 0.078, meaning the investment has the highest probability of breaking-even the soonest.

  2. Zipcode 10025 in Manhattan is not far behind. The Average revenue ($107,973) ranks among the middle of the 15 zipcodes. But the most important catch is that it has the least avg. purchase price for 2 bedroom properties in Manhattan at $1,459,632 (see plot 2). It has more 2 bedroom properties listed on Airbnb than 10036 at 112 listings and has a Revenue/Cost ratio of 0.074.

  3. Zipcode 10022 in Manhattan is another great option to purchase 2 bedroom properties. Revenue/year/property - $133,291, Avg. cost of a 2 bedroom property - $1,901,116, Revenue/Cost ratio - 0.07. One thing to note is that the number of 2 bedroom Airbnb listings associated with this zipcode is 38, which is the lowest among these top 5 choices.

  4. Zipcode 11215 in Brooklyn is also a very good location to purchase 2 bedroom properties. It is one of the cheapest options (avg. cost of $987,001). Eventhough the Avg. revenue obtained by renting through Airbnb is $70,272/year/property which is not as high as revenues associated with a few other zipcodes, the cheap purchase cost offsets the disadvantage of reaping lesser revenue and makes zipcode 11215 a great location for investing in 2 bedroom properties. Also, it has the highest number of 2 bedrooms listed on Airbnb at 141 listings. It too has a revenue/cost ratio of 0.071.

  5. Zipcode 11231 in Brooklyn is a lot similar to zipcode 11215 in that the avg. cost of a 2 bedroom property and the avg. revenue per year obtained from a property are pretty close.

Non-optimal choices :

The following zipcodes shouldn’t be first preferences to purchase 2 bedroom properties in. The reasons are:

The avg. 2 bedroom properties in the zipcodes that follow are not the most ideal for investment because they don’t bring in commensurate revenue like zipcodes 10036 or 10022 to justify the investment.

  • Two bedroom properties in zipcodes 11217 and 11201 are priced at $1.3 mil and $1.4mil respectively. 2 bedroom Airbnb listings in these two zipcodes have an avg. annual revenue of NA and NA respectively. Their revenue/cost ratios are 0.062 and 0.056 respectively. These 2 zipcodes have 86 and 68 two bedroom listings on Airbnb.

  • 10011 and 10014 have pricey 2 bedroom properties on average ($2.62mil and $2.25mil repectively). They command high revenues (140,211 and 122,007 respectively) through Airbnb rentals. Their revenue/cost ratios are similar (0.053 and 0.054). These 2 zipcodes have 102 and 95 two bedroom properties listed on Airbnb. They can be ‘okay’ investments if the investment horizon is close to 25 years.

  • Zipcodes 10023 and 10003 have avg. 2 bedroom properties priced at $2.14 mil and $2.17 mil respectively. Their revenue/cost ratios are 0.052 and 0.055 Revenues associated are $112,405 and $120,051 respectively.

  • Zipcode 10021 has two bedroom properties priced on avg. at $1.88 mil. Associated revenue/cost ratio is 0.055; the issue is it has only 19 two bedroom listings on Airbnb. Est. Revenue - 104,272.

Bad Options

  • Zipcodes 10028,10128,10013 have poor revenue/cost ratios- (0.048, 0.046, 0.04) respectively.

New York city zipcodes that offer the best returns are:

  1. 10036

  2. 10025

  3. 10022

  4. 10215

  5. 11231

Data Challenge - END

Some Business Concerns:

  • Economic Risks - Tourism generally booms if the economy is healthy. This needs to be factored in before making large investments in properties since the breakeven period is almost always several years.

  • Regulatory Issues - Airbnb has a lot of regulartory and legal hurdles. Last fall, the New York State Senate passed a bill that makes online apartment listings for stays shorter than 30 days illegal, which, not surprisingly, thwarts Airbnb in their goal to expand its market.

  • Customer Segmentation - It is crucial before entering a new market segment to understand the choice patters of consumer behaviour. Hence segmenting consumers on the basis of type(leisure, business,..), financial spending power, age etc would be helpful.

Data Quality Issues:

  • Due to missing zip codes in Zillow data, match rate between Zillow data file and Airbnb file is around 15%.

  • There are missing values for important columns in both datasets.

  • There are about 53 zipcodes for New York in airbnb file which don’t have information about 2 bedrooms properties.

Incorporate data from additional sources

  • Crime Data - Many cities have made their crime data available online.

  • Weather Data - A city’s weather affects mobility.

  • Social Media Data

  • Traffic Data