data_challenge.utf8

1. Synopsis

You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two-bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.

The real estate company has engaged your firm to build out a data product and provide your conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.

2. Packages Required

The following packages has been used for the analysis:

library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)

tidyr : For changing the layout of the data sets, to convert data into the tidy format.

DT : For HTML display of data.

ggplot2 : For customizable graphical representation.

dplyr : For data manipulation.

tidyverse : Collection of R packages designed for data science that works harmoniously with other packages.

3. Assumptions

The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
Occupancy rate of 75% throughout the year for Airbnb properties.
The company will put properties on rent every day throughout the year.
The company is interested in making profit via renting only not via selling.

4. Data Prepration

a. Data Source

The publicly available data used for the analysis come from Zillow and Airbnb.

Zillow dataset: This dataset provides the cost for 2 bedroom properties in each zipcode for various cities. The cost information is available from April 1996 to June 2017.

Airbnb dataset: This dataset provides revenue data generated from properties. Information about property listings in New York including location, number of bedrooms, reviews, price, availability, property description, etc.

b. Data Cleaning Process

Airbnb Dataset - Analysis

air_bnb <- read.csv("listings.csv")

After reading the file, we check the dimension and column names of the file.

dim(air_bnb)

## [1] 48895   106

colnames(air_bnb)

##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

We see that air_bnb dataset has 48895 rows and 106 columns and the colnames() gives us the names of all the columns. Many of these columns do not seem important for our analysis hence we will filter them out to keep the important columns only. Also, we will filter out the data where the number of bedroom is 2 as the business question specify 2 bedroom property only where the investor wants to invest. We also use distinct() to remove any duplicated observations.

  relevant_cols <- c("zipcode","bedrooms","price","weekly_price","monthly_price","cleaning_fee","neighbourhood_group_cleansed","number_of_reviews","review_scores_rating")
  air_bnb_filter <- air_bnb[,relevant_cols] %>% filter(bedrooms==2) %>% distinct()
 # Select relevant columns and filter bedroom equal to 2 and keep only unique observations

Once we check the summary statistics of the remaining columns of air_bnb_filter, we can make a better decision to determine whether to remove more columns or not from further analysis. We also check the percentage of missing values in each column.

str(air_bnb_filter)

## 'data.frame':    6408 obs. of  9 variables:
##  $ zipcode                     : Factor w/ 200 levels "","07093","07302",..: 31 126 111 4 50 120 46 116 33 5 ...
##  $ bedrooms                    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ price                       : Factor w/ 674 levels "$0.00","$1,000.00",..: 156 75 219 381 244 216 189 107 70 282 ...
##  $ weekly_price                : Factor w/ 599 levels "","$1,000.00",..: 1 544 1 208 181 1 1 550 567 1 ...
##  $ monthly_price               : Factor w/ 681 levels "","$1,000.00",..: 1 1 1 586 540 1 1 343 318 482 ...
##  $ cleaning_fee                : Factor w/ 207 levels "","$0.00","$1,000.00",..: 1 193 28 22 71 1 107 186 5 5 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 3 2 3 2 3 3 ...
##  $ number_of_reviews           : int  0 11 82 5 66 4 142 61 54 70 ...
##  $ review_scores_rating        : int  NA 94 94 100 93 100 88 88 94 87 ...

summary(air_bnb_filter)

##     zipcode        bedrooms     price         weekly_price 
##  11211  : 345   Min.   :2   $150.00: 366            :5619  
##  11221  : 259   1st Qu.:2   $200.00: 361   $1,000.00:  50  
##  11215  : 188   Median :2   $250.00: 319   $1,200.00:  37  
##  11222  : 187   Mean   :2   $100.00: 192   $1,500.00:  33  
##  11216  : 186   3rd Qu.:2   $300.00: 180   $800.00  :  30  
##  11238  : 184   Max.   :2   $120.00: 163   $900.00  :  27  
##  (Other):5059               (Other):4827   (Other)  : 612  
##    monthly_price   cleaning_fee  neighbourhood_group_cleansed
##           :5712   $100.00: 963   Bronx        :  94          
##  $3,500.00:  39          : 899   Brooklyn     :2884          
##  $3,000.00:  35   $50.00 : 471   Manhattan    :2814          
##  $4,500.00:  33   $150.00: 455   Queens       : 560          
##  $4,000.00:  31   $75.00 : 342   Staten Island:  56          
##  $2,500.00:  27   $80.00 : 288                               
##  (Other)  : 531   (Other):2990                               
##  number_of_reviews review_scores_rating
##  Min.   :  0.00    Min.   : 20.00      
##  1st Qu.:  1.00    1st Qu.: 92.00      
##  Median :  6.00    Median : 96.00      
##  Mean   : 24.97    Mean   : 94.06      
##  3rd Qu.: 28.00    3rd Qu.:100.00      
##  Max.   :403.00    Max.   :100.00      
##                    NA's   :1312

colSums(is.na(air_bnb_filter))/nrow(air_bnb_filter)*100

##                      zipcode                     bedrooms 
##                      0.00000                      0.00000 
##                        price                 weekly_price 
##                      0.00000                      0.00000 
##                monthly_price                 cleaning_fee 
##                      0.00000                      0.00000 
## neighbourhood_group_cleansed            number_of_reviews 
##                      0.00000                      0.00000 
##         review_scores_rating 
##                     20.47441

From the above results, we observe the following:

We see that zipcode and neighbourhood_group_cleansed are categorical variables, which are desired
Number of bedroom is 2, hence we can remove this column as it doesn’t provide any further information
We see that price, weekly_price, monthly_price, cleaning_fee are categorical variables having a $ sign in front of them, which are not desired.
review_scores_rating has a high percentage of missing values, hence we can remove this column.

cols <- c("price", "weekly_price", "monthly_price","cleaning_fee") 
# selecting columns to remove $ sign

replace_dollar <- function(x){
  price <- as.numeric(gsub("[$,]","",x)) 
  return(price)
}

air_bnb_filter[cols] <- lapply(air_bnb_filter[cols], replace_dollar) 
# Running the above defined function on columns to replace $ with whitespace

# Again check the structure of the data
str(air_bnb_filter)

## 'data.frame':    6408 obs. of  9 variables:
##  $ zipcode                     : Factor w/ 200 levels "","07093","07302",..: 31 126 111 4 50 120 46 116 33 5 ...
##  $ bedrooms                    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ price                       : num  190 115 228 375 250 225 200 145 110 285 ...
##  $ weekly_price                : num  NA 850 NA 2750 2000 NA NA 860 900 NA ...
##  $ monthly_price               : num  NA NA NA 7500 6000 ...
##  $ cleaning_fee                : num  NA 85 128 120 200 NA 30 79 100 100 ...
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 3 2 3 2 3 3 ...
##  $ number_of_reviews           : int  0 11 82 5 66 4 142 61 54 70 ...
##  $ review_scores_rating        : int  NA 94 94 100 93 100 88 88 94 87 ...

dim(air_bnb_filter)

## [1] 6408    9

summary(air_bnb_filter)

##     zipcode        bedrooms     price         weekly_price  
##  11211  : 345   Min.   :2   Min.   :   0.0   Min.   :  184  
##  11221  : 259   1st Qu.:2   1st Qu.: 129.0   1st Qu.:  800  
##  11215  : 188   Median :2   Median : 185.0   Median : 1050  
##  11222  : 187   Mean   :2   Mean   : 231.7   Mean   : 1329  
##  11216  : 186   3rd Qu.:2   3rd Qu.: 260.0   3rd Qu.: 1500  
##  11238  : 184   Max.   :2   Max.   :9999.0   Max.   :14000  
##  (Other):5059                                NA's   :5619   
##  monthly_price    cleaning_fee    neighbourhood_group_cleansed
##  Min.   :  651   Min.   :  0.00   Bronx        :  94          
##  1st Qu.: 2800   1st Qu.: 59.00   Brooklyn     :2884          
##  Median : 3700   Median : 95.00   Manhattan    :2814          
##  Mean   : 4404   Mean   : 94.43   Queens       : 560          
##  3rd Qu.: 5000   3rd Qu.:120.00   Staten Island:  56          
##  Max.   :40000   Max.   :621.00                               
##  NA's   :5712    NA's   :899                                  
##  number_of_reviews review_scores_rating
##  Min.   :  0.00    Min.   : 20.00      
##  1st Qu.:  1.00    1st Qu.: 92.00      
##  Median :  6.00    Median : 96.00      
##  Mean   : 24.97    Mean   : 94.06      
##  3rd Qu.: 28.00    3rd Qu.:100.00      
##  Max.   :403.00    Max.   :100.00      
##                    NA's   :1312

colSums(is.na(air_bnb_filter))/nrow(air_bnb_filter)*100

##                      zipcode                     bedrooms 
##                      0.00000                      0.00000 
##                        price                 weekly_price 
##                      0.00000                     87.68727 
##                monthly_price                 cleaning_fee 
##                     89.13858                     14.02934 
## neighbourhood_group_cleansed            number_of_reviews 
##                      0.00000                      0.00000 
##         review_scores_rating 
##                     20.47441

From the above, we see that the weekly_price and monthly_price have a high percentage of missing values (around 88% and 89% respectively) hence we can remove them from further analysis. cleaning_fee has a small percentage of missing value that can be imputed with the existing values. We will take care of that after we merge the dataset

We will change the column names of price and neighbourhood_group_cleansed to daily_price and neighbourhood respectively for easy understanding.

air_bnb_filter <- air_bnb_filter %>% select(-c(bedrooms, weekly_price, review_scores_rating,monthly_price))%>% rename(daily_price= price, neighbourhood=neighbourhood_group_cleansed)
head(air_bnb_filter)

##   zipcode daily_price cleaning_fee neighbourhood number_of_reviews
## 1   10029         190           NA     Manhattan                 0
## 2   11221         115           85      Brooklyn                11
## 3   11206         228          128      Brooklyn                82
## 4   10001         375          120     Manhattan                 5
## 5   10162         250          200     Manhattan                66
## 6   11215         225           NA      Brooklyn                 4

Zillow Dataset - Analysis

zillow <- read.csv("Zip_Zhvi_2bedroom.csv")

After reading the file, we check the dimension and column names of the file.

dim(zillow)

## [1] 8946  262

colnames(zillow)

##   [1] "RegionID"   "RegionName" "City"       "State"      "Metro"     
##   [6] "CountyName" "SizeRank"   "X1996.04"   "X1996.05"   "X1996.06"  
##  [11] "X1996.07"   "X1996.08"   "X1996.09"   "X1996.10"   "X1996.11"  
##  [16] "X1996.12"   "X1997.01"   "X1997.02"   "X1997.03"   "X1997.04"  
##  [21] "X1997.05"   "X1997.06"   "X1997.07"   "X1997.08"   "X1997.09"  
##  [26] "X1997.10"   "X1997.11"   "X1997.12"   "X1998.01"   "X1998.02"  
##  [31] "X1998.03"   "X1998.04"   "X1998.05"   "X1998.06"   "X1998.07"  
##  [36] "X1998.08"   "X1998.09"   "X1998.10"   "X1998.11"   "X1998.12"  
##  [41] "X1999.01"   "X1999.02"   "X1999.03"   "X1999.04"   "X1999.05"  
##  [46] "X1999.06"   "X1999.07"   "X1999.08"   "X1999.09"   "X1999.10"  
##  [51] "X1999.11"   "X1999.12"   "X2000.01"   "X2000.02"   "X2000.03"  
##  [56] "X2000.04"   "X2000.05"   "X2000.06"   "X2000.07"   "X2000.08"  
##  [61] "X2000.09"   "X2000.10"   "X2000.11"   "X2000.12"   "X2001.01"  
##  [66] "X2001.02"   "X2001.03"   "X2001.04"   "X2001.05"   "X2001.06"  
##  [71] "X2001.07"   "X2001.08"   "X2001.09"   "X2001.10"   "X2001.11"  
##  [76] "X2001.12"   "X2002.01"   "X2002.02"   "X2002.03"   "X2002.04"  
##  [81] "X2002.05"   "X2002.06"   "X2002.07"   "X2002.08"   "X2002.09"  
##  [86] "X2002.10"   "X2002.11"   "X2002.12"   "X2003.01"   "X2003.02"  
##  [91] "X2003.03"   "X2003.04"   "X2003.05"   "X2003.06"   "X2003.07"  
##  [96] "X2003.08"   "X2003.09"   "X2003.10"   "X2003.11"   "X2003.12"  
## [101] "X2004.01"   "X2004.02"   "X2004.03"   "X2004.04"   "X2004.05"  
## [106] "X2004.06"   "X2004.07"   "X2004.08"   "X2004.09"   "X2004.10"  
## [111] "X2004.11"   "X2004.12"   "X2005.01"   "X2005.02"   "X2005.03"  
## [116] "X2005.04"   "X2005.05"   "X2005.06"   "X2005.07"   "X2005.08"  
## [121] "X2005.09"   "X2005.10"   "X2005.11"   "X2005.12"   "X2006.01"  
## [126] "X2006.02"   "X2006.03"   "X2006.04"   "X2006.05"   "X2006.06"  
## [131] "X2006.07"   "X2006.08"   "X2006.09"   "X2006.10"   "X2006.11"  
## [136] "X2006.12"   "X2007.01"   "X2007.02"   "X2007.03"   "X2007.04"  
## [141] "X2007.05"   "X2007.06"   "X2007.07"   "X2007.08"   "X2007.09"  
## [146] "X2007.10"   "X2007.11"   "X2007.12"   "X2008.01"   "X2008.02"  
## [151] "X2008.03"   "X2008.04"   "X2008.05"   "X2008.06"   "X2008.07"  
## [156] "X2008.08"   "X2008.09"   "X2008.10"   "X2008.11"   "X2008.12"  
## [161] "X2009.01"   "X2009.02"   "X2009.03"   "X2009.04"   "X2009.05"  
## [166] "X2009.06"   "X2009.07"   "X2009.08"   "X2009.09"   "X2009.10"  
## [171] "X2009.11"   "X2009.12"   "X2010.01"   "X2010.02"   "X2010.03"  
## [176] "X2010.04"   "X2010.05"   "X2010.06"   "X2010.07"   "X2010.08"  
## [181] "X2010.09"   "X2010.10"   "X2010.11"   "X2010.12"   "X2011.01"  
## [186] "X2011.02"   "X2011.03"   "X2011.04"   "X2011.05"   "X2011.06"  
## [191] "X2011.07"   "X2011.08"   "X2011.09"   "X2011.10"   "X2011.11"  
## [196] "X2011.12"   "X2012.01"   "X2012.02"   "X2012.03"   "X2012.04"  
## [201] "X2012.05"   "X2012.06"   "X2012.07"   "X2012.08"   "X2012.09"  
## [206] "X2012.10"   "X2012.11"   "X2012.12"   "X2013.01"   "X2013.02"  
## [211] "X2013.03"   "X2013.04"   "X2013.05"   "X2013.06"   "X2013.07"  
## [216] "X2013.08"   "X2013.09"   "X2013.10"   "X2013.11"   "X2013.12"  
## [221] "X2014.01"   "X2014.02"   "X2014.03"   "X2014.04"   "X2014.05"  
## [226] "X2014.06"   "X2014.07"   "X2014.08"   "X2014.09"   "X2014.10"  
## [231] "X2014.11"   "X2014.12"   "X2015.01"   "X2015.02"   "X2015.03"  
## [236] "X2015.04"   "X2015.05"   "X2015.06"   "X2015.07"   "X2015.08"  
## [241] "X2015.09"   "X2015.10"   "X2015.11"   "X2015.12"   "X2016.01"  
## [246] "X2016.02"   "X2016.03"   "X2016.04"   "X2016.05"   "X2016.06"  
## [251] "X2016.07"   "X2016.08"   "X2016.09"   "X2016.10"   "X2016.11"  
## [256] "X2016.12"   "X2017.01"   "X2017.02"   "X2017.03"   "X2017.04"  
## [261] "X2017.05"   "X2017.06"

We see that the Zillow file contains 8946 rows and 262 columns. Apart from the first 7 columns, the rest of the columns consists of the price of the property per month from April 1996 to June 2017. We will select relevant columns such as Region Name (zipcode), the city, and the cost of the property only from the last 5 years. Since we need the information about New York City, we will filter the relevant data.

Assuming that there is seasonality in the price and also that values depend not only on previous values (Auto Regressive AR) but also on differences between previous values (Moving Average MA), we apply ARIMA model to predict the cost of the properties in Zipcodes from July 2017 to Jan 2020 with the help of 5 years of cost data.

Also we will replace the column name RegionName to zipcode so that we can merge with the Zillow dataset with the air_bnb dataset.

  # Select zillow cost information from last 5 years and modify zillow data to only hold relevant columns, filter new york city and change the region name to zipcode
zillow_filter <- zillow[,c(2,3,(ncol(zillow)-59):ncol(zillow))] %>% filter(str_detect(City, "New York") == TRUE) %>% rename(zipcode = RegionName)

zillow_filter$current_property_value <- NA 
# Create a new column to store the latest price in January 2020
  
# we define a for loop to iterate over each zipcode to obtain latest cost of property
dim(zillow_filter)

## [1] 26 63

 for(i in 1:nrow(zillow_filter)){

    temp = ts(as.vector(t(zillow_filter[,c(3:62)])[,i]),start = c(2012,6),frequency = 12) 
    # Convert the monthly cost data into time series data 
  
    ARIMA_fit = arima(temp, order=c(1,1,1), seasonal=list(order=c(1,0,1),period=NA), method="ML")
    # Define ARIMA model to be used for prediction
    
    prediction = predict(ARIMA_fit, n.ahead = 30)
    # use the ARIMA model to predict the price from July 2017 to Jan 2020
    
    pred_value <- prediction$pred 
    # Store the predicted values in a variable 
    
    zillow_filter$current_property_value [i] <- pred_value[length(pred_value)] 
    # set the value of current price for the specific zipcode as price in Jan 2020
  }
  dim(zillow_filter)

## [1] 26 63

  zillow_filter <- zillow_filter[,c(1,2,63)] # return the filtered data containing only relevant columns
  head(zillow_filter)

##   zipcode     City current_property_value
## 1   10025 New York                1441110
## 2   10023 New York                2007906
## 3   10128 New York                1988430
## 4   10011 New York                2635197
## 5   10003 New York                2190146
## 6   11201 New York                1486526

  dim(zillow_filter)

## [1] 26  3

Merge Dataset - Analysis

Now both the datasets are ready to be merged.

housing_data <- merge(air_bnb_filter,zillow_filter,by = c("zipcode")) # merge data sets on zipcode
dim(housing_data)

## [1] 1539    7

head(housing_data)

##   zipcode daily_price cleaning_fee neighbourhood number_of_reviews
## 1   10003         140           NA     Manhattan                 0
## 2   10003         200           85     Manhattan                27
## 3   10003         250           35     Manhattan                 2
## 4   10003         850          100     Manhattan                 0
## 5   10003         245          200     Manhattan                18
## 6   10003         180          100     Manhattan                 2
##       City current_property_value
## 1 New York                2190146
## 2 New York                2190146
## 3 New York                2190146
## 4 New York                2190146
## 5 New York                2190146
## 6 New York                2190146

str(housing_data)

## 'data.frame':    1539 obs. of  7 variables:
##  $ zipcode               : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ daily_price           : num  140 200 250 850 245 180 300 299 90 62 ...
##  $ cleaning_fee          : num  NA 85 35 100 200 100 175 89 50 40 ...
##  $ neighbourhood         : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ number_of_reviews     : int  0 27 2 0 18 2 0 11 1 4 ...
##  $ City                  : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
##  $ current_property_value: num  2190146 2190146 2190146 2190146 2190146 ...

summary(housing_data)

##     zipcode     daily_price      cleaning_fee         neighbourhood 
##  11215  :188   Min.   :  50.0   Min.   :  0.0   Bronx        :   0  
##  10036  :141   1st Qu.: 165.0   1st Qu.: 75.0   Brooklyn     : 496  
##  10003  :135   Median : 225.0   Median :100.0   Manhattan    :1001  
##  11217  :123   Mean   : 283.7   Mean   :109.3   Queens       :  16  
##  10025  :122   3rd Qu.: 320.0   3rd Qu.:150.0   Staten Island:  26  
##  10011  :105   Max.   :4000.0   Max.   :482.0                       
##  (Other):725                    NA's   :204                         
##  number_of_reviews       City      current_property_value
##  Min.   :  0.00    New York:1539   Min.   : 344553       
##  1st Qu.:  1.00    Aberdeen:   0   1st Qu.:1335294       
##  Median :  4.00    Abilene :   0   Median :1729041       
##  Mean   : 20.11    Abingdon:   0   Mean   :1782448       
##  3rd Qu.: 17.00    Abington:   0   3rd Qu.:2190146       
##  Max.   :403.00    Acton   :   0   Max.   :3323768       
##                    (Other) :   0

Now we notice some maximum price in the price column as 4000 which seems like an outlier as the cleaning fee for that is just 40. Hence we will remove that. Also, we will remove observations(if any) where cleaning fees greater than price indicating some wrong entry of data. Since cleaning_fee contains NA value which is around 13% of the data, we cannot delete that but will impute with the median.

house_data <- housing_data %>% filter(daily_price!=4000)  %>% filter(cleaning_fee< daily_price) 
house_data$City <- factor(house_data$City, levels=c("New York"))
house_data$cleaning_fee[is.na(house_data$cleaning_fee)] <- median(house_data$cleaning_fee, na.rm=TRUE)

str(house_data)

## 'data.frame':    1254 obs. of  7 variables:
##  $ zipcode               : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ daily_price           : num  200 250 850 245 180 300 299 90 62 151 ...
##  $ cleaning_fee          : num  85 35 100 200 100 175 89 50 40 95 ...
##  $ neighbourhood         : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ number_of_reviews     : int  27 2 0 18 2 0 11 1 4 214 ...
##  $ City                  : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
##  $ current_property_value: num  2190146 2190146 2190146 2190146 2190146 ...

summary(house_data)

##     zipcode     daily_price      cleaning_fee         neighbourhood
##  11215  :158   Min.   :  50.0   Min.   :  0.0   Bronx        :  0  
##  10036  :120   1st Qu.: 178.0   1st Qu.: 70.0   Brooklyn     :415  
##  10003  :117   Median : 245.0   Median :100.0   Manhattan    :804  
##  11217  :100   Mean   : 279.8   Mean   :104.9   Queens       : 13  
##  10013  : 91   3rd Qu.: 334.5   3rd Qu.:135.0   Staten Island: 22  
##  10011  : 88   Max.   :2500.0   Max.   :400.0                      
##  (Other):580                                                       
##  number_of_reviews       City      current_property_value
##  Min.   :  0.00    New York:1254   Min.   : 344553       
##  1st Qu.:  1.00                    1st Qu.:1335294       
##  Median :  5.00                    Median :1729041       
##  Mean   : 22.51                    Mean   :1793501       
##  3rd Qu.: 20.75                    3rd Qu.:2190146       
##  Max.   :403.00                    Max.   :3323768       
##

length(unique(house_data$zipcode))

## [1] 24

From the above, we can see 24 unique zipcodes. Now the data is ready for analysis without having any NA values

c. Cleaned Data

The cleaned data can be found below:

datatable(house_data, filter = 'top')

5. Exploratory Data Analysis

The analysis have been perfomed based on different criteria. They are as follow:

a. Based on Neighbourhood

house_data %>% group_by(neighbourhood) %>% 
  count(neighbourhood) %>% 
  ggplot(aes(x=neighbourhood, y=n)) + 
  geom_col(fill= 'maroon') + geom_text(aes(label = n), size = 3, hjust=0.3, vjust=-0.5) +
  ggtitle("Number of properties across neighbourhoods")+
  xlab("Neighbourhood")+ 
  ylab("Count of Properties")

From the above graph, we can see that Manhattan has the maximum number of properties listed in the area, followed by Brooklyn. This gives us the idea that these two places can be considered to buy properties based on the count.

b. Based on Count of Properties in each Zipcode

house_data %>% 
  group_by(zipcode, neighbourhood) %>% 
  count(zipcode) %>% 
  arrange(-n) %>% 
  head(10)%>%
  ggplot(aes( x = reorder(zipcode,-n), y =n, fill= neighbourhood)) + 
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF"))+
  geom_text(aes(label = n), size = 3, hjust=0.3, vjust=-0.5) +
  ggtitle("Number of Properties in Each Zipcode")+ 
  xlab("Zipcodes") + 
  ylab("Count of Property") + 
  guides(fill=guide_legend(" "))

From the above plot, we can see that zipcode 11215 seems to be a popular choice based on the count of properties as that zipcode contains 158 properties followed by zipcode 10036, 10003, 11217 and 10013. Out of the top 5 zipcodes based on this criteria, 2 are in Brooklyn while 3 are in Manhattan.

c. Based on Number of Reviews

house_data %>% group_by(zipcode, neighbourhood) %>% 
  summarize(average_review= round(mean(number_of_reviews),0)) %>% 
  arrange(-average_review) %>%
  head(10) %>%
  ggplot(aes( x = reorder(zipcode,- average_review), y =average_review, fill = neighbourhood)) + 
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF",  "#4B878BFF", "#DAA03DFF"))+
  geom_text(aes(label = average_review), size = 3, hjust=0.3, vjust=-0.5) +
  ggtitle("Average Number of Review of Property in Each Zipcode")+
  xlab("Zipcodes") + 
  ylab("Total Number of Reviews")+
  guides(fill=guide_legend(" "))

From the above plot, we can see that zipcode 10304 seems to be a popular choice based on the average number of reviews of properties as that zipcode contains maximum reviews followed by zipcode 11434, 10308, 11215 and 11231. Out of the top 5 zipcodes based on this criteria, 2 are in Brooklyn, 2 are in Staten Island while 1 is in Queens.

d. Based on Cost of Properties

house_data %>% group_by(zipcode, neighbourhood) %>% 
  summarize(average_cost= round(mean(current_property_value),0)) %>% 
  arrange(average_cost) %>% 
  head(10) %>%
  ggplot(aes( x = reorder(zipcode,average_cost), y =average_cost, fill= neighbourhood)) +
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#FC766AFF",  "#4B878BFF","#DAA03DFF"))+
  geom_text(aes(label = average_cost), size = 3, hjust=0.5, vjust=-1) +
  ggtitle("Average cost of Property in Each Zipcode")+
  xlab("Zipcodes") + ylab("Average Cost")+
  guides(fill=guide_legend(" "))

From the above plot, we can see that zipcode 10303 seems to be a popular choice based on the low average price of properties in that zipcode followed by zipcode 10306, 10314, 10304 and 11434. Out of the top 5 zipcodes based on this criteria, 4 are in Staten Island while 1 is in Queens. This indicates properties in Manhattan and Brooklyn are very costly. This may not be the best criteria to judge as low cost properties will generate less revenue per year.

e. Based on Revenue Generated

house_data %>% group_by(zipcode, neighbourhood) %>% 
  summarize(average_yearly_revenue= round(mean(daily_price*.75*365+ cleaning_fee*.75*365),0)) %>% arrange(-average_yearly_revenue) %>% 
  head(10) %>%
  ggplot(aes( x = reorder(zipcode, -average_yearly_revenue), y =average_yearly_revenue, 
                         fill= neighbourhood)) + 
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#5B84B1FF"))+
  geom_text(aes(label = average_yearly_revenue), size = 3, hjust=0.5, vjust=-1) +
 ggtitle("Average Revenue Generated from Property in Each Zipcode")+
  xlab("Zipcodes") +
  ylab("Average Revenue")+
  guides(fill=guide_legend(" "))

From the above plot, we can see that zipcode 10013 seems to be a popular choice based on the high average revenue generated per year from properties in that zipcode followed by zipcode 10022, 10011, 10014 and 10023. It is interesting to note that all top zipcodes generated using the above criteria fall in Manhattan. But it is not necessary that these zipcodes will generate maximum profit as the cost of properties are also high.

f. Based on Year to achieve Break Even

house_data %>% group_by(zipcode, neighbourhood) %>% 
  summarize(average_yearly_revenue=mean(daily_price*.75*365+ cleaning_fee*.75*365),current_value= mean(current_property_value),year =current_value/average_yearly_revenue) %>% 
  arrange(year) %>%
  head(10) %>%
  ggplot(aes( x = reorder(zipcode, year), y =year, fill= neighbourhood)) + 
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#FC766AFF",  "#4B878BFF","#DAA03DFF"))+ 
  geom_text(aes(label = round(year,2)), size = 3, hjust=0.5, vjust=-1) +
  ggtitle("Years Required for Break Even in Each Zipcode")+
  xlab("Zipcodes") + ylab("Year")+
  guides(fill=guide_legend(" "))

From the above plot, we can see that zipcode 10306 seems to be a popular choice based on the lowest break-even time period followed by zipcode 11234, 10303, 11434 and 10304. Out of the top 5 zipcodes based on this criteria, 3 are in Staten Island while 1 is in each Brooklyn Queens.

g. Based on Long Term Profit

house_data %>% group_by(zipcode, neighbourhood) %>% 
  summarize(average_yearly_revenue=mean(daily_price*.75*365+ cleaning_fee*.75*365),
  current_value= mean(current_property_value), 
  profit_in_10_years= 10*average_yearly_revenue-current_value,
  profit_in_20_years= 20*average_yearly_revenue-current_value, 
  profit_in_30_years = 30*average_yearly_revenue-current_value) %>% 
  arrange(-profit_in_30_years) %>%
  head(10) %>%  
  gather(key=profit_year, value = profit, profit_in_10_years: profit_in_30_years) %>%
  ggplot(aes(x=zipcode, y= profit, fill = neighbourhood)) + 
  geom_bar(stat ="identity") + 
  scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF", "#DAA03DFF"))+
  facet_wrap(~profit_year) + coord_flip() + 
  ggtitle("Long Term Profit from Each Zipcode") + xlab("Zipcode") +ylab("Profit Made in Various Years")+
  guides(fill=guide_legend(" "))+ 
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

From the above graph, we consider long term profit generated from a property. Surprisingly the trend seems to be very different from the rest of the analysis indicating that after the break-even year, profit is highly affected by revenue generated. Hence properties located in Manhattan and Brooklyn seem to be a better choice for buying property.

6. Conclusion

Zipcodes

Based on the combined effect of all analyses, here are my suggestions for five zipcodes from each of the three neighbourhood in New York City. :

11215 This zipcode is located in the Brooklyn neighbourhood. This is one of the zipcode which stands out in almost every analysis. The maximum number of properties are listed in this zipcode. Properties in this zipcode have the 4th largest number of reviews compared to other zipcodes. It should also be noted properties in this zipcode are not that expensive(10th cheapest) and also among the top 15 to achieve break-even. Also to conclude, this zipcode is in the top 15 zipcodes based on long term profit after 30 years.
11234 This zipcode is also located in the Brooklyn neighbourhood. Properties in this zipcode have the 5th largest number of reviews compared to other zipcodes. It should also be noted properties in this zipcode are also not expensive(8th cheapest). Properties in this zipcode are second to achieve break-even. Also to conclude, this zipcode is also in the top 10 zipcodes based on long term profit after 30 years.
10036 This zipcode is located in the Manhattan neighbourhood. This is one of the zipcode which should be considered for long term gain purpose. The second-largest number of properties are listed in this zipcode. Properties in this zipcode lie in the top 6 to generate high yearly revenue based on other zipcodes. It should also be noted that this zipcode is also in the top 2 zipcodes based on long term profit after 30 years.
10022 This zipcode is also located in the Manhattan neighbourhood. This is also one of the zipcode which should be considered for long term gain purpose as properties in this zipcode lie in top 2 to generate high yearly revenue based on other zipcodes. It should also be noted that this zipcode is also in the top zipcodes based on long term profit after 30 years.
10306 This zipcode is located in the Staten Island neighbourhood. This is one of the zipcode which should be considered for short term gain purpose as well as long term gain. Cheapest Properties are listed in this zipcode. Properties in this zipcode achieve break-even fastest based on other zipcodes. It should also be noted that this zipcode is also lie in the top 10 zipcodes based on long term profit after 30 years.

Future Plans

Business Decisions One of the business challenges will be to decide whether we want to make money from a property only via renting or selling the property eventually after some years. If yes, what time period are we looking at? This is an important decision that can help us to choose the zipcode where the investors can buy property now.

Incorporating data from additional sources Next steps in this analysis would be to make a better judgement with the help of additional data about:

Weather and seasonality is an important factors for travelers to choose to place in a city during brutal climates. Hence these data will help us to make a better decision.
Security is one of the significant elements for travelers and they, for the most part, lean toward wrongdoing free neighborhoods for leasing properties. By consolidating the crime data, we can have better judgment.
If a zipcode is intensely clogged with traffic or in the event that it doesn’t have good connectivity via public transport nearby, at that point, there are high possibilities that tourists will stay away from such zipcodes. So examination of this information should be performed.

Two Bedroom Properties in New York Analysis

Varun Varma