You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two-bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
The real estate company has engaged your firm to build out a data product and provide your conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.
The following packages have been used for the analysis:
library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
tidyr: For reshaping data sets into tidy format.
DT: For interactive HTML display of data tables.
ggplot2: For customizable graphical representation.
dplyr: For data manipulation.
tidyverse: A collection of R packages designed for data science; loading it also attaches tidyr, dplyr, and ggplot2, among others.
The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
Occupancy rate of 75% throughout the year for Airbnb properties.
The company will list its properties for rent every day of the year.
The company is interested in making a profit from renting only, not from selling.
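As a quick sanity check on the occupancy assumption (a trivial calculation, not part of the original analysis), 75% occupancy over a full year corresponds to:
0.75 * 365   # expected booked nights per listing per year
## [1] 273.75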
The publicly available data used for the analysis come from Zillow and Airbnb.
Zillow dataset: This dataset provides the cost of two-bedroom properties in each zipcode across various cities, with monthly values from April 1996 to June 2017.
Airbnb dataset: This dataset provides information about property listings in New York, including location, number of bedrooms, reviews, price, availability, and property description, which we use to estimate rental revenue.
air_bnb <- read.csv("listings.csv")
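Note that on R 4.0 and later, read.csv() no longer converts strings to factors by default; the str() output shown below assumes the older behaviour. If reproducing this analysis on a recent R version, one option (an assumption about the environment, not part of the original code) is:
air_bnb <- read.csv("listings.csv", stringsAsFactors = TRUE)  # reproduce the factor columns shown below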
After reading the file, we check the dimension and column names of the file.
dim(air_bnb)
## [1] 48895 106
colnames(air_bnb)
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "summary"
## [7] "space"
## [8] "description"
## [9] "experiences_offered"
## [10] "neighborhood_overview"
## [11] "notes"
## [12] "transit"
## [13] "access"
## [14] "interaction"
## [15] "house_rules"
## [16] "thumbnail_url"
## [17] "medium_url"
## [18] "picture_url"
## [19] "xl_picture_url"
## [20] "host_id"
## [21] "host_url"
## [22] "host_name"
## [23] "host_since"
## [24] "host_location"
## [25] "host_about"
## [26] "host_response_time"
## [27] "host_response_rate"
## [28] "host_acceptance_rate"
## [29] "host_is_superhost"
## [30] "host_thumbnail_url"
## [31] "host_picture_url"
## [32] "host_neighbourhood"
## [33] "host_listings_count"
## [34] "host_total_listings_count"
## [35] "host_verifications"
## [36] "host_has_profile_pic"
## [37] "host_identity_verified"
## [38] "street"
## [39] "neighbourhood"
## [40] "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed"
## [42] "city"
## [43] "state"
## [44] "zipcode"
## [45] "market"
## [46] "smart_location"
## [47] "country_code"
## [48] "country"
## [49] "latitude"
## [50] "longitude"
## [51] "is_location_exact"
## [52] "property_type"
## [53] "room_type"
## [54] "accommodates"
## [55] "bathrooms"
## [56] "bedrooms"
## [57] "beds"
## [58] "bed_type"
## [59] "amenities"
## [60] "square_feet"
## [61] "price"
## [62] "weekly_price"
## [63] "monthly_price"
## [64] "security_deposit"
## [65] "cleaning_fee"
## [66] "guests_included"
## [67] "extra_people"
## [68] "minimum_nights"
## [69] "maximum_nights"
## [70] "minimum_minimum_nights"
## [71] "maximum_minimum_nights"
## [72] "minimum_maximum_nights"
## [73] "maximum_maximum_nights"
## [74] "minimum_nights_avg_ntm"
## [75] "maximum_nights_avg_ntm"
## [76] "calendar_updated"
## [77] "has_availability"
## [78] "availability_30"
## [79] "availability_60"
## [80] "availability_90"
## [81] "availability_365"
## [82] "calendar_last_scraped"
## [83] "number_of_reviews"
## [84] "number_of_reviews_ltm"
## [85] "first_review"
## [86] "last_review"
## [87] "review_scores_rating"
## [88] "review_scores_accuracy"
## [89] "review_scores_cleanliness"
## [90] "review_scores_checkin"
## [91] "review_scores_communication"
## [92] "review_scores_location"
## [93] "review_scores_value"
## [94] "requires_license"
## [95] "license"
## [96] "jurisdiction_names"
## [97] "instant_bookable"
## [98] "is_business_travel_ready"
## [99] "cancellation_policy"
## [100] "require_guest_profile_picture"
## [101] "require_guest_phone_verification"
## [102] "calculated_host_listings_count"
## [103] "calculated_host_listings_count_entire_homes"
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"
## [106] "reviews_per_month"
We see that the air_bnb dataset has 48895 rows and 106 columns, and colnames() lists all of the column names. Many of these columns are not relevant to our analysis, so we keep only the important ones. We also keep only listings with exactly 2 bedrooms, since the business question concerns two-bedroom properties, and use distinct() to remove any duplicated observations.
relevant_cols <- c("zipcode","bedrooms","price","weekly_price","monthly_price","cleaning_fee","neighbourhood_group_cleansed","number_of_reviews","review_scores_rating")
air_bnb_filter <- air_bnb[,relevant_cols] %>% filter(bedrooms==2) %>% distinct()
# Select relevant columns and filter bedroom equal to 2 and keep only unique observations
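As an optional sanity check (not part of the original output), one could confirm how many listings report two bedrooms and how many rows survive the filtering and de-duplication:
sum(air_bnb$bedrooms == 2, na.rm = TRUE)  # number of 2-bedroom listings before de-duplication
nrow(air_bnb_filter)                      # 6408 rows remain, matching the str() output below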
Checking the summary statistics of the remaining columns of air_bnb_filter helps us decide whether any further columns should be dropped. We also check the percentage of missing values in each column.
str(air_bnb_filter)
## 'data.frame': 6408 obs. of 9 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 31 126 111 4 50 120 46 116 33 5 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ price : Factor w/ 674 levels "$0.00","$1,000.00",..: 156 75 219 381 244 216 189 107 70 282 ...
## $ weekly_price : Factor w/ 599 levels "","$1,000.00",..: 1 544 1 208 181 1 1 550 567 1 ...
## $ monthly_price : Factor w/ 681 levels "","$1,000.00",..: 1 1 1 586 540 1 1 343 318 482 ...
## $ cleaning_fee : Factor w/ 207 levels "","$0.00","$1,000.00",..: 1 193 28 22 71 1 107 186 5 5 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 3 2 3 2 3 3 ...
## $ number_of_reviews : int 0 11 82 5 66 4 142 61 54 70 ...
## $ review_scores_rating : int NA 94 94 100 93 100 88 88 94 87 ...
summary(air_bnb_filter)
## zipcode bedrooms price weekly_price
## 11211 : 345 Min. :2 $150.00: 366 :5619
## 11221 : 259 1st Qu.:2 $200.00: 361 $1,000.00: 50
## 11215 : 188 Median :2 $250.00: 319 $1,200.00: 37
## 11222 : 187 Mean :2 $100.00: 192 $1,500.00: 33
## 11216 : 186 3rd Qu.:2 $300.00: 180 $800.00 : 30
## 11238 : 184 Max. :2 $120.00: 163 $900.00 : 27
## (Other):5059 (Other):4827 (Other) : 612
## monthly_price cleaning_fee neighbourhood_group_cleansed
## :5712 $100.00: 963 Bronx : 94
## $3,500.00: 39 : 899 Brooklyn :2884
## $3,000.00: 35 $50.00 : 471 Manhattan :2814
## $4,500.00: 33 $150.00: 455 Queens : 560
## $4,000.00: 31 $75.00 : 342 Staten Island: 56
## $2,500.00: 27 $80.00 : 288
## (Other) : 531 (Other):2990
## number_of_reviews review_scores_rating
## Min. : 0.00 Min. : 20.00
## 1st Qu.: 1.00 1st Qu.: 92.00
## Median : 6.00 Median : 96.00
## Mean : 24.97 Mean : 94.06
## 3rd Qu.: 28.00 3rd Qu.:100.00
## Max. :403.00 Max. :100.00
## NA's :1312
colSums(is.na(air_bnb_filter))/nrow(air_bnb_filter)*100
## zipcode bedrooms
## 0.00000 0.00000
## price weekly_price
## 0.00000 0.00000
## monthly_price cleaning_fee
## 0.00000 0.00000
## neighbourhood_group_cleansed number_of_reviews
## 0.00000 0.00000
## review_scores_rating
## 20.47441
From the above results, we observe that the price-related columns (price, weekly_price, monthly_price, cleaning_fee) are stored as factors containing "$" and "," characters, so they must be converted to numeric; weekly_price and monthly_price are largely blank; and review_scores_rating is missing for about 20% of the listings.
cols <- c("price", "weekly_price", "monthly_price","cleaning_fee")
# columns whose "$" and "," formatting needs to be stripped
replace_dollar <- function(x){
price <- as.numeric(gsub("[$,]","",x))
return(price)
}
air_bnb_filter[cols] <- lapply(air_bnb_filter[cols], replace_dollar)
# Apply the function above to strip "$" and "," and convert the price columns to numeric
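An equivalent way to strip the currency formatting, shown here only as an optional alternative, is readr::parse_number() (readr is attached with the tidyverse); the factor columns are converted to character first:
air_bnb_filter[cols] <- lapply(air_bnb_filter[cols],
                               function(x) readr::parse_number(as.character(x)))
# parse_number() drops the "$" and "," and returns a numeric vector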
# Again check the structure of the data
str(air_bnb_filter)
## 'data.frame': 6408 obs. of 9 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 31 126 111 4 50 120 46 116 33 5 ...
## $ bedrooms : int 2 2 2 2 2 2 2 2 2 2 ...
## $ price : num 190 115 228 375 250 225 200 145 110 285 ...
## $ weekly_price : num NA 850 NA 2750 2000 NA NA 860 900 NA ...
## $ monthly_price : num NA NA NA 7500 6000 ...
## $ cleaning_fee : num NA 85 128 120 200 NA 30 79 100 100 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 2 2 3 3 2 3 2 3 3 ...
## $ number_of_reviews : int 0 11 82 5 66 4 142 61 54 70 ...
## $ review_scores_rating : int NA 94 94 100 93 100 88 88 94 87 ...
dim(air_bnb_filter)
## [1] 6408 9
summary(air_bnb_filter)
## zipcode bedrooms price weekly_price
## 11211 : 345 Min. :2 Min. : 0.0 Min. : 184
## 11221 : 259 1st Qu.:2 1st Qu.: 129.0 1st Qu.: 800
## 11215 : 188 Median :2 Median : 185.0 Median : 1050
## 11222 : 187 Mean :2 Mean : 231.7 Mean : 1329
## 11216 : 186 3rd Qu.:2 3rd Qu.: 260.0 3rd Qu.: 1500
## 11238 : 184 Max. :2 Max. :9999.0 Max. :14000
## (Other):5059 NA's :5619
## monthly_price cleaning_fee neighbourhood_group_cleansed
## Min. : 651 Min. : 0.00 Bronx : 94
## 1st Qu.: 2800 1st Qu.: 59.00 Brooklyn :2884
## Median : 3700 Median : 95.00 Manhattan :2814
## Mean : 4404 Mean : 94.43 Queens : 560
## 3rd Qu.: 5000 3rd Qu.:120.00 Staten Island: 56
## Max. :40000 Max. :621.00
## NA's :5712 NA's :899
## number_of_reviews review_scores_rating
## Min. : 0.00 Min. : 20.00
## 1st Qu.: 1.00 1st Qu.: 92.00
## Median : 6.00 Median : 96.00
## Mean : 24.97 Mean : 94.06
## 3rd Qu.: 28.00 3rd Qu.:100.00
## Max. :403.00 Max. :100.00
## NA's :1312
colSums(is.na(air_bnb_filter))/nrow(air_bnb_filter)*100
## zipcode bedrooms
## 0.00000 0.00000
## price weekly_price
## 0.00000 87.68727
## monthly_price cleaning_fee
## 89.13858 14.02934
## neighbourhood_group_cleansed number_of_reviews
## 0.00000 0.00000
## review_scores_rating
## 20.47441
From the above, we see that weekly_price and monthly_price have a high percentage of missing values (around 88% and 89% respectively), so we remove them from further analysis. cleaning_fee has a smaller share of missing values (about 14%) that can be imputed from the existing values; we will take care of that after merging the datasets.
We will also drop bedrooms (constant after filtering) and review_scores_rating (about 20% missing), and rename price and neighbourhood_group_cleansed to daily_price and neighbourhood respectively for easier understanding.
air_bnb_filter <- air_bnb_filter %>% select(-c(bedrooms, weekly_price, review_scores_rating,monthly_price))%>% rename(daily_price= price, neighbourhood=neighbourhood_group_cleansed)
head(air_bnb_filter)
## zipcode daily_price cleaning_fee neighbourhood number_of_reviews
## 1 10029 190 NA Manhattan 0
## 2 11221 115 85 Brooklyn 11
## 3 11206 228 128 Brooklyn 82
## 4 10001 375 120 Manhattan 5
## 5 10162 250 200 Manhattan 66
## 6 11215 225 NA Brooklyn 4
zillow <- read.csv("Zip_Zhvi_2bedroom.csv")
After reading the file, we check the dimension and column names of the file.
dim(zillow)
## [1] 8946 262
colnames(zillow)
## [1] "RegionID" "RegionName" "City" "State" "Metro"
## [6] "CountyName" "SizeRank" "X1996.04" "X1996.05" "X1996.06"
## [11] "X1996.07" "X1996.08" "X1996.09" "X1996.10" "X1996.11"
## [16] "X1996.12" "X1997.01" "X1997.02" "X1997.03" "X1997.04"
## [21] "X1997.05" "X1997.06" "X1997.07" "X1997.08" "X1997.09"
## [26] "X1997.10" "X1997.11" "X1997.12" "X1998.01" "X1998.02"
## [31] "X1998.03" "X1998.04" "X1998.05" "X1998.06" "X1998.07"
## [36] "X1998.08" "X1998.09" "X1998.10" "X1998.11" "X1998.12"
## [41] "X1999.01" "X1999.02" "X1999.03" "X1999.04" "X1999.05"
## [46] "X1999.06" "X1999.07" "X1999.08" "X1999.09" "X1999.10"
## [51] "X1999.11" "X1999.12" "X2000.01" "X2000.02" "X2000.03"
## [56] "X2000.04" "X2000.05" "X2000.06" "X2000.07" "X2000.08"
## [61] "X2000.09" "X2000.10" "X2000.11" "X2000.12" "X2001.01"
## [66] "X2001.02" "X2001.03" "X2001.04" "X2001.05" "X2001.06"
## [71] "X2001.07" "X2001.08" "X2001.09" "X2001.10" "X2001.11"
## [76] "X2001.12" "X2002.01" "X2002.02" "X2002.03" "X2002.04"
## [81] "X2002.05" "X2002.06" "X2002.07" "X2002.08" "X2002.09"
## [86] "X2002.10" "X2002.11" "X2002.12" "X2003.01" "X2003.02"
## [91] "X2003.03" "X2003.04" "X2003.05" "X2003.06" "X2003.07"
## [96] "X2003.08" "X2003.09" "X2003.10" "X2003.11" "X2003.12"
## [101] "X2004.01" "X2004.02" "X2004.03" "X2004.04" "X2004.05"
## [106] "X2004.06" "X2004.07" "X2004.08" "X2004.09" "X2004.10"
## [111] "X2004.11" "X2004.12" "X2005.01" "X2005.02" "X2005.03"
## [116] "X2005.04" "X2005.05" "X2005.06" "X2005.07" "X2005.08"
## [121] "X2005.09" "X2005.10" "X2005.11" "X2005.12" "X2006.01"
## [126] "X2006.02" "X2006.03" "X2006.04" "X2006.05" "X2006.06"
## [131] "X2006.07" "X2006.08" "X2006.09" "X2006.10" "X2006.11"
## [136] "X2006.12" "X2007.01" "X2007.02" "X2007.03" "X2007.04"
## [141] "X2007.05" "X2007.06" "X2007.07" "X2007.08" "X2007.09"
## [146] "X2007.10" "X2007.11" "X2007.12" "X2008.01" "X2008.02"
## [151] "X2008.03" "X2008.04" "X2008.05" "X2008.06" "X2008.07"
## [156] "X2008.08" "X2008.09" "X2008.10" "X2008.11" "X2008.12"
## [161] "X2009.01" "X2009.02" "X2009.03" "X2009.04" "X2009.05"
## [166] "X2009.06" "X2009.07" "X2009.08" "X2009.09" "X2009.10"
## [171] "X2009.11" "X2009.12" "X2010.01" "X2010.02" "X2010.03"
## [176] "X2010.04" "X2010.05" "X2010.06" "X2010.07" "X2010.08"
## [181] "X2010.09" "X2010.10" "X2010.11" "X2010.12" "X2011.01"
## [186] "X2011.02" "X2011.03" "X2011.04" "X2011.05" "X2011.06"
## [191] "X2011.07" "X2011.08" "X2011.09" "X2011.10" "X2011.11"
## [196] "X2011.12" "X2012.01" "X2012.02" "X2012.03" "X2012.04"
## [201] "X2012.05" "X2012.06" "X2012.07" "X2012.08" "X2012.09"
## [206] "X2012.10" "X2012.11" "X2012.12" "X2013.01" "X2013.02"
## [211] "X2013.03" "X2013.04" "X2013.05" "X2013.06" "X2013.07"
## [216] "X2013.08" "X2013.09" "X2013.10" "X2013.11" "X2013.12"
## [221] "X2014.01" "X2014.02" "X2014.03" "X2014.04" "X2014.05"
## [226] "X2014.06" "X2014.07" "X2014.08" "X2014.09" "X2014.10"
## [231] "X2014.11" "X2014.12" "X2015.01" "X2015.02" "X2015.03"
## [236] "X2015.04" "X2015.05" "X2015.06" "X2015.07" "X2015.08"
## [241] "X2015.09" "X2015.10" "X2015.11" "X2015.12" "X2016.01"
## [246] "X2016.02" "X2016.03" "X2016.04" "X2016.05" "X2016.06"
## [251] "X2016.07" "X2016.08" "X2016.09" "X2016.10" "X2016.11"
## [256] "X2016.12" "X2017.01" "X2017.02" "X2017.03" "X2017.04"
## [261] "X2017.05" "X2017.06"
We see that the Zillow file contains 8946 rows and 262 columns. Apart from the first 7 columns, the remaining columns contain the monthly property cost from April 1996 to June 2017. We will keep only the relevant columns: RegionName (the zipcode), City, and the cost columns from the last 5 years, and filter the data down to New York City.
Assuming that there is seasonality in the prices and that each value depends both on previous values (the autoregressive, AR, component) and on past forecast errors (the moving-average, MA, component), we fit an ARIMA model to the last 5 years of monthly costs for each zipcode and use it to forecast property values 30 months beyond June 2017, i.e. through December 2019.
We also rename the column RegionName to zipcode so that the Zillow dataset can be merged with the air_bnb dataset.
# Select zillow cost information from last 5 years and modify zillow data to only hold relevant columns, filter new york city and change the region name to zipcode
zillow_filter <- zillow[,c(2,3,(ncol(zillow)-59):ncol(zillow))] %>% filter(str_detect(City, "New York") == TRUE) %>% rename(zipcode = RegionName)
zillow_filter$current_property_value <- NA
# Create a new column to store the forecasted property value at the end of the horizon (December 2019)
# We define a for loop that iterates over each zipcode to obtain its forecasted property cost
dim(zillow_filter)
## [1] 26 63
for(i in 1:nrow(zillow_filter)){
temp = ts(as.vector(t(zillow_filter[,c(3:62)])[,i]), start = c(2012,7), frequency = 12)
# Convert the monthly cost data (July 2012 to June 2017) into a time series
ARIMA_fit = arima(temp, order=c(1,1,1), seasonal=list(order=c(1,0,1),period=NA), method="ML")
# Define ARIMA model to be used for prediction
prediction = predict(ARIMA_fit, n.ahead = 30)
# Use the ARIMA model to forecast 30 months ahead (July 2017 to December 2019)
pred_value <- prediction$pred
# Store the predicted values in a variable
zillow_filter$current_property_value [i] <- pred_value[length(pred_value)]
# Take the last forecasted value (December 2019) as the current property value for this zipcode
}
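As an optional robustness check on the fixed (1,1,1)(1,0,1) specification above (not part of the original pipeline), the forecast package, assumed to be installed, can select the ARIMA order automatically; a minimal sketch for a single zipcode:
library(forecast)
temp_ts <- ts(unlist(zillow_filter[1, 3:62]), start = c(2012, 7), frequency = 12)  # first zipcode's monthly costs
auto_fit <- auto.arima(temp_ts)   # automatically chosen (p,d,q)(P,D,Q) order
forecast(auto_fit, h = 30)        # 30 months ahead, i.e. July 2017 to December 2019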
dim(zillow_filter)
## [1] 26 63
zillow_filter <- zillow_filter[,c(1,2,63)] # keep only zipcode, City and the forecasted current_property_value
head(zillow_filter)
## zipcode City current_property_value
## 1 10025 New York 1441110
## 2 10023 New York 2007906
## 3 10128 New York 1988430
## 4 10011 New York 2635197
## 5 10003 New York 2190146
## 6 11201 New York 1486526
dim(zillow_filter)
## [1] 26 3
Now both the datasets are ready to be merged.
housing_data <- merge(air_bnb_filter,zillow_filter,by = c("zipcode")) # merge data sets on zipcode
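For readers who prefer to stay in the dplyr pipeline, an equivalent join is sketched below (the name housing_data_dplyr is used purely for illustration); the zipcode keys are converted to character first because they are a factor in the Airbnb data and an integer in the Zillow data:
housing_data_dplyr <- inner_join(
  air_bnb_filter %>% mutate(zipcode = as.character(zipcode)),
  zillow_filter  %>% mutate(zipcode = as.character(zipcode)),
  by = "zipcode")  # same rows as merge(); only the class of the zipcode column differs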
dim(housing_data)
## [1] 1539 7
head(housing_data)
## zipcode daily_price cleaning_fee neighbourhood number_of_reviews
## 1 10003 140 NA Manhattan 0
## 2 10003 200 85 Manhattan 27
## 3 10003 250 35 Manhattan 2
## 4 10003 850 100 Manhattan 0
## 5 10003 245 200 Manhattan 18
## 6 10003 180 100 Manhattan 2
## City current_property_value
## 1 New York 2190146
## 2 New York 2190146
## 3 New York 2190146
## 4 New York 2190146
## 5 New York 2190146
## 6 New York 2190146
str(housing_data)
## 'data.frame': 1539 obs. of 7 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ daily_price : num 140 200 250 850 245 180 300 299 90 62 ...
## $ cleaning_fee : num NA 85 35 100 200 100 175 89 50 40 ...
## $ neighbourhood : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 0 27 2 0 18 2 0 11 1 4 ...
## $ City : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 2702 2702 2702 2702 2702 2702 2702 2702 2702 ...
## $ current_property_value: num 2190146 2190146 2190146 2190146 2190146 ...
summary(housing_data)
## zipcode daily_price cleaning_fee neighbourhood
## 11215 :188 Min. : 50.0 Min. : 0.0 Bronx : 0
## 10036 :141 1st Qu.: 165.0 1st Qu.: 75.0 Brooklyn : 496
## 10003 :135 Median : 225.0 Median :100.0 Manhattan :1001
## 11217 :123 Mean : 283.7 Mean :109.3 Queens : 16
## 10025 :122 3rd Qu.: 320.0 3rd Qu.:150.0 Staten Island: 26
## 10011 :105 Max. :4000.0 Max. :482.0
## (Other):725 NA's :204
## number_of_reviews City current_property_value
## Min. : 0.00 New York:1539 Min. : 344553
## 1st Qu.: 1.00 Aberdeen: 0 1st Qu.:1335294
## Median : 4.00 Abilene : 0 Median :1729041
## Mean : 20.11 Abingdon: 0 Mean :1782448
## 3rd Qu.: 17.00 Abington: 0 3rd Qu.:2190146
## Max. :403.00 Acton : 0 Max. :3323768
## (Other) : 0
We notice a maximum daily price of 4000, which looks like an outlier given that its cleaning fee is only 40, so we remove it. We also remove observations (if any) where the cleaning fee is greater than the daily price, which suggests a data-entry error; note that this comparison also drops rows with a missing cleaning fee. Any cleaning fees still missing after these filters (around 13% of the merged data were missing originally) are imputed with the median rather than deleted.
house_data <- housing_data %>% filter(daily_price!=4000) %>% filter(cleaning_fee< daily_price)
house_data$City <- factor(house_data$City, levels=c("New York"))
house_data$cleaning_fee[is.na(house_data$cleaning_fee)] <- median(house_data$cleaning_fee, na.rm=TRUE)
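As a quick optional check (not in the original output) that the cleaning steps above leave no missing values:
colSums(is.na(house_data))   # should be all zeros after the filtering and imputation above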
str(house_data)
## 'data.frame': 1254 obs. of 7 variables:
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ daily_price : num 200 250 850 245 180 300 299 90 62 151 ...
## $ cleaning_fee : num 85 35 100 200 100 175 89 50 40 95 ...
## $ neighbourhood : Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ number_of_reviews : int 27 2 0 18 2 0 11 1 4 214 ...
## $ City : Factor w/ 1 level "New York": 1 1 1 1 1 1 1 1 1 1 ...
## $ current_property_value: num 2190146 2190146 2190146 2190146 2190146 ...
summary(house_data)
## zipcode daily_price cleaning_fee neighbourhood
## 11215 :158 Min. : 50.0 Min. : 0.0 Bronx : 0
## 10036 :120 1st Qu.: 178.0 1st Qu.: 70.0 Brooklyn :415
## 10003 :117 Median : 245.0 Median :100.0 Manhattan :804
## 11217 :100 Mean : 279.8 Mean :104.9 Queens : 13
## 10013 : 91 3rd Qu.: 334.5 3rd Qu.:135.0 Staten Island: 22
## 10011 : 88 Max. :2500.0 Max. :400.0
## (Other):580
## number_of_reviews City current_property_value
## Min. : 0.00 New York:1254 Min. : 344553
## 1st Qu.: 1.00 1st Qu.:1335294
## Median : 5.00 Median :1729041
## Mean : 22.51 Mean :1793501
## 3rd Qu.: 20.75 3rd Qu.:2190146
## Max. :403.00 Max. :3323768
##
length(unique(house_data$zipcode))
## [1] 24
From the above, we can see 24 unique zipcodes. The data is now ready for analysis, with no NA values remaining.
The cleaned data can be found below:
datatable(house_data, filter = 'top')
The analysis has been performed based on several criteria, as follows:
house_data %>% group_by(neighbourhood) %>%
count(neighbourhood) %>%
ggplot(aes(x=neighbourhood, y=n)) +
geom_col(fill= 'maroon') + geom_text(aes(label = n), size = 3, hjust=0.3, vjust=-0.5) +
ggtitle("Number of properties across neighbourhoods")+
xlab("Neighbourhood")+
ylab("Count of Properties")
From the above graph, we can see that Manhattan has the largest number of listed properties, followed by Brooklyn. Based on listing counts alone, these two boroughs look like candidates for buying properties.
house_data %>%
group_by(zipcode, neighbourhood) %>%
count(zipcode) %>%
arrange(-n) %>%
head(10)%>%
ggplot(aes( x = reorder(zipcode,-n), y =n, fill= neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF"))+
geom_text(aes(label = n), size = 3, hjust=0.3, vjust=-0.5) +
ggtitle("Number of Properties in Each Zipcode")+
xlab("Zipcodes") +
ylab("Count of Property") +
guides(fill=guide_legend(" "))
From the above plot, we can see that zipcode 11215 is the most popular by property count, with 158 listings, followed by zipcodes 10036, 10003, 11217 and 10013. Of the top 5 zipcodes by this criterion, 2 are in Brooklyn and 3 are in Manhattan.
house_data %>% group_by(zipcode, neighbourhood) %>%
summarize(average_review= round(mean(number_of_reviews),0)) %>%
arrange(-average_review) %>%
head(10) %>%
ggplot(aes( x = reorder(zipcode,- average_review), y =average_review, fill = neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF", "#4B878BFF", "#DAA03DFF"))+
geom_text(aes(label = average_review), size = 3, hjust=0.3, vjust=-0.5) +
ggtitle("Average Number of Review of Property in Each Zipcode")+
xlab("Zipcodes") +
ylab("Total Number of Reviews")+
guides(fill=guide_legend(" "))
From the above plot, we can see that zipcode 10304 stands out based on the average number of reviews per property, followed by zipcodes 11434, 10308, 11215 and 11231. Of the top 5 zipcodes by this criterion, 2 are in Brooklyn, 2 are in Staten Island and 1 is in Queens.
house_data %>% group_by(zipcode, neighbourhood) %>%
summarize(average_cost= round(mean(current_property_value),0)) %>%
arrange(average_cost) %>%
head(10) %>%
ggplot(aes( x = reorder(zipcode,average_cost), y =average_cost, fill= neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#FC766AFF", "#4B878BFF","#DAA03DFF"))+
geom_text(aes(label = average_cost), size = 3, hjust=0.5, vjust=-1) +
ggtitle("Average cost of Property in Each Zipcode")+
xlab("Zipcodes") + ylab("Average Cost")+
guides(fill=guide_legend(" "))
From the above plot, we can see that zipcode 10303 stands out for its low average property cost, followed by zipcodes 10306, 10314, 10304 and 11434. Of the top 5 zipcodes by this criterion, 4 are in Staten Island and 1 is in Queens, which indicates that properties in Manhattan and Brooklyn are considerably more expensive. Cost alone may not be the best criterion, however, since low-cost properties also tend to generate less revenue per year.
house_data %>% group_by(zipcode, neighbourhood) %>%
summarize(average_yearly_revenue= round(mean(daily_price*.75*365+ cleaning_fee*.75*365),0)) %>% arrange(-average_yearly_revenue) %>%
head(10) %>%
ggplot(aes( x = reorder(zipcode, -average_yearly_revenue), y =average_yearly_revenue,
fill= neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#5B84B1FF"))+
geom_text(aes(label = average_yearly_revenue), size = 3, hjust=0.5, vjust=-1) +
ggtitle("Average Revenue Generated from Property in Each Zipcode")+
xlab("Zipcodes") +
ylab("Average Revenue")+
guides(fill=guide_legend(" "))
From the above plot, we can see that zipcode 10013 stands out based on the high average yearly revenue generated by its properties, followed by zipcodes 10022, 10011, 10014 and 10023. Interestingly, all of the top zipcodes by this criterion are in Manhattan. They will not necessarily generate the most profit, however, because property costs there are also high.
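As a worked example of the yearly-revenue formula used above, with purely hypothetical figures (a $250 nightly rate and a $100 cleaning fee at 75% occupancy):
(250 + 100) * 0.75 * 365   # expected annual revenue for this hypothetical listing
## [1] 95812.5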
house_data %>% group_by(zipcode, neighbourhood) %>%
summarize(average_yearly_revenue=mean(daily_price*.75*365+ cleaning_fee*.75*365),current_value= mean(current_property_value),year =current_value/average_yearly_revenue) %>%
arrange(year) %>%
head(10) %>%
ggplot(aes( x = reorder(zipcode, year), y =year, fill= neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#FC766AFF", "#4B878BFF","#DAA03DFF"))+
geom_text(aes(label = round(year,2)), size = 3, hjust=0.5, vjust=-1) +
ggtitle("Years Required for Break Even in Each Zipcode")+
xlab("Zipcodes") + ylab("Year")+
guides(fill=guide_legend(" "))
From the above plot, we can see that zipcode 10306 has the shortest break-even period, followed by zipcodes 11234, 10303, 11434 and 10304. Of the top 5 zipcodes by this criterion, 3 are in Staten Island, while Brooklyn and Queens have 1 each.
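Continuing the hypothetical example from the revenue section, the break-even time is simply the property cost divided by the expected annual revenue:
annual_revenue <- (250 + 100) * 0.75 * 365   # 95812.5, as before (hypothetical figures)
1500000 / annual_revenue                     # a hypothetical $1.5M property
## [1] 15.65558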
house_data %>% group_by(zipcode, neighbourhood) %>%
summarize(average_yearly_revenue=mean(daily_price*.75*365+ cleaning_fee*.75*365),
current_value= mean(current_property_value),
profit_in_10_years= 10*average_yearly_revenue-current_value,
profit_in_20_years= 20*average_yearly_revenue-current_value,
profit_in_30_years = 30*average_yearly_revenue-current_value) %>%
arrange(-profit_in_30_years) %>%
head(10) %>%
gather(key=profit_year, value = profit, profit_in_10_years: profit_in_30_years) %>%
ggplot(aes(x=zipcode, y= profit, fill = neighbourhood)) +
geom_bar(stat ="identity") +
scale_fill_manual(values=c("#FC766AFF", "#5B84B1FF", "#DAA03DFF"))+
facet_wrap(~profit_year) + coord_flip() +
ggtitle("Long Term Profit from Each Zipcode") + xlab("Zipcode") +ylab("Profit Made in Various Years")+
guides(fill=guide_legend(" "))+
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
The above graph considers the long-term profit generated from a property. Surprisingly, the ranking differs markedly from the rest of the analysis, indicating that once a property has broken even, profit is driven mainly by the revenue it generates. Hence properties located in Manhattan and Brooklyn look like the better choices for long-term profit.
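The same hypothetical figures illustrate the 10-, 20- and 30-year profit calculation used in the chart above:
annual_revenue <- (250 + 100) * 0.75 * 365       # 95812.5 (hypothetical listing)
property_value <- 1500000                        # hypothetical purchase price
c(10, 20, 30) * annual_revenue - property_value  # profit after 10, 20 and 30 years
## [1] -541875  416250 1374375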
Based on the combined effect of all the analyses, here are my suggestions: five zipcodes drawn from three neighbourhoods in New York City:
11215 This zipcode is located in the Brooklyn neighbourhood and stands out in almost every analysis. It has the largest number of listed properties and the 4th highest average number of reviews. Its properties are relatively inexpensive (10th cheapest), it is among the top 15 zipcodes to reach break-even, and it is also in the top 15 for long-term profit after 30 years.
11234 This zipcode is also located in the Brooklyn neighbourhood. Its properties have the 5th highest average number of reviews and are also inexpensive (8th cheapest). It is the 2nd fastest zipcode to reach break-even and is in the top 10 for long-term profit after 30 years.
10036 This zipcode is located in the Manhattan neighbourhood and should be considered for long-term gains. It has the second-largest number of listed properties, is among the top 6 zipcodes for yearly revenue, and is in the top 2 for long-term profit after 30 years.
10022 This zipcode is also located in the Manhattan neighbourhood and is another long-term option: it is in the top 2 zipcodes for yearly revenue and among the top zipcodes for long-term profit after 30 years.
10306 This zipcode is located in the Staten Island neighbourhood and should be considered for short-term as well as long-term gains. Its properties are among the cheapest (second-lowest average cost), it reaches break-even fastest of all zipcodes, and it also lies in the top 10 for long-term profit after 30 years.
Business Decisions
One of the business challenges will be deciding whether we want to make money from a property only through renting, or also by eventually selling the property after some years. If the latter, what time horizon are we looking at? This is an important decision that will help determine which zipcodes the investors should buy in now.
Incorporating data from additional sources
The next steps in this analysis would be to refine the judgement with the help of additional data about: