Executive Summary
Introduction
In the following case study, I am consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
The following case study helps them understand which zip codes would generate the most profit on short term rentals within New York City.
Datasets
1. AirBnb(Revenue Data) AirBnB is the medium through which the investor plans to lease out their investment property
2. Zillow(Cost Data) Zillow provides us an estimate of value for two-bedroom properties
Assumptions
- The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
- The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
- All properties and all square feet within each locale can be assumed to be homogeneous.
- The occupancy rate is assumed to be 75%.
- The average number of days for which the property will be rented is taken as 3. (In a recent survey done by AirBnb, it was found that majority of their stays last about 3 days)
Procedure
- Data Cleaning and Merging:
- Both Revenue and Cost Data have a very large number of columns - 106 and 262 respectively. They were filtered based on predefined assumptions, data quality metrics(low percent of missing values) and keeping the end goal in mind.
- The datasets were then merged on zipcodes. Any possible outliers were identified and removed and it was ensured all the datatypes and names of the fields were consistent with the documentation.
- All the steps were carried out in a scalable way so that we don’t have to redo everything in our code if the dataset changes or the company approaches a new market.
- Exploratory Data Analysis: The final data was used to create insightful visualizations showing the zipcodes and boroughs where our investor could possibly profit from. The zipcodes were analysed and ranked according to various metrics like Payback period, ROI%(Return on Investment), Number of properties etc. Similar analysis was done for boroughs comparing their prices and property costs.
Key Insights
The zipcodes were analysed using a number of parameters: Payback period, ROI%, Appreciation Rate of Property Price(for the past 5 years) and Number of properties in that zipcode. The following were the key insights:
The zipcodes with least amount of payback period are:
- 10306(Staten Island) with Payback period of 9.5 years
- 10303(Staten Island) with Payback period of 11.9 years
- 11234(Brooklyn) with Payback period of 12.07 years
- 11434(Queens) with Payback period of 12.77 years
- 10305(Staten Island) with Payback period of 13.36 years
Coming to Return on Investement(ROI%), these zipcodes performed really well: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 11434(Queens)
The average growth rate in property price for the 5 year period between 2012-17 has been highest for following: 11217(Brooklyn), 11231(Brooklyn), 11201(Brooklyn), 11215(Brooklyn), 10128(Manhattan)
Out of all the above zipcodes, 11217 and 11215(Brooklyn) have a high number of listings indicating the popularity of the area
Other key insights include :
- Both prices and property costs are comparatively high in Manhattan.
- There is no correlation between price(per night) and the population of the area(SizeRank).
- Queens and Staten Island are cheaper in terms of both property and rental prices as compared to the other two.
Recommendations
I would recommend investing in Staten Island, particularly in zipcodes 10306,10303 and 10305 because each of them has very less payback period and at the same time they guarantee ROI% higher than the average. 11434(Queens) is also a great choice in terms of the above metrics. These zipcodes will involve lower risks as the investor will be able to break even quickly. In case the investor wants to hold the property for certain period of time(suppose for the next 5 years) and have high returns, Brooklyn(Zipcodes : 11217,11215) would be a great pick as these two zipcodes have high returns and they have seen highest increase rate in their property price from 2012-17. I would not personally recommend investing in Manhattan as both property cost and price rents are too high.
To Summarise my Recommendations:
- Staten Island(Zipcodes 10303,10305 and 10306): Lower Payback Period, Decent ROI%
- Brooklyn(Zipcodes 11217,11215): High ROI%, High Appreciation Rate in Property Prices(2012-17), High number of properties
- Queens(Zipcode 11434): Lower Payback Period, Good ROI%
- Manhattan: Not recommended
Out of all the above, Zipcode 10306 would be my first choice since it has the lowest payback period.
What’s Next
- The data for other expenses like maintenance, taxes and repairs is very important since it could help us to predict metrics like Payback period,ROI% etc. with greater accuracy.
- The analysis could be further improved if we can have more data for weekly price and monthly price since many people tend to stay for a week or a month.
- We can use Time series and Forecasting to better predict the future prices and bringing both the data on the same time scale.
- Further variables and factors like Crime rate, distance from tourist places/economic centers and other socio-economic factors could help us further in our analysis.
Installing and Loading required packages
## prerequisites (Please install these packages before loading)
## The commented code below install the list of packages
#pkgs <- c("ggplot2","dplyr","tidyverse","plotly","knitr","kableExtra")
## Install required (CRAN) packages
#for (pkg in pkgs) {
# if (!(pkg %in% installed.packages()[, "Package"])) {
# install.packages(pkg)
# }
# }
# Load the above packages
library(ggplot2)
library(dplyr)
library(tidyverse)
library(plotly)
library(knitr)
library(kableExtra)Data Preparation and Analysis
Reading the data
Analysis of Revenue Data (Airbnb Dataset)
## [1] 48895 106
There are a total number of 48895 observations and 106 columns in the bnb dataset.
After analyzing the column names, we have to select the columns which seem relevant to our study. After selecting these columns, we have to check whether of any these columns have a large percent of missing values(greater than 50). Such a large number of missing values can be a problem for our analysis and lead to wrong results. So in this analysis, these columns have been removed accordingly.
The variables which were considered important for analysis are:
“id”,“neighbourhood_cleansed”,“neighbourhood_group_cleansed”,“city”,“zipcode”, “state” “latitude”,“longitude”,“room_type”,“price”,“bedrooms”,“cleaning_fee”,“square_fee t”,“weekly_price”,“monthly_price”,“review_scores_location”,“review_scores_rating”
bnbColNames <- c("id","neighbourhood_cleansed","neighbourhood_group_cleansed","city","state","zipcode","latitude",
"longitude","room_type","bedrooms","price","cleaning_fee","square_feet",
"weekly_price","monthly_price","review_scores_location","review_scores_rating")Common functionality is developed for checking missing values of any dataframe so that it can be used later for other datasets in our analysis.
Checking for the percent of missing values for columns
checkMiss <- function(columnNames,df) {
missData <- df %>%
select(columnNames) %>%
sapply(function(x) sum(is.na(x))*100/length(x))
missDf <- as.data.frame(round(missData[missData > 0],2))
colnames(missDf) <- "Missing_Values"
return(missDf)
}Calling the above function
bnbMissDf <- checkMiss(bnbColNames,bnbData)
kable(bnbMissDf %>%
rownames_to_column(var = "Variables") %>%
arrange(desc(Missing_Values)))| Variables | Missing_Values |
|---|---|
| square_feet | 99.17 |
| monthly_price | 89.27 |
| weekly_price | 87.72 |
| review_scores_location | 22.66 |
| review_scores_rating | 22.54 |
| cleaning_fee | 21.77 |
| zipcode | 1.06 |
| city | 0.13 |
| bedrooms | 0.04 |
| state | 0.01 |
Looking at the above values, three columns namely “square_feet”,“weekly_price” and “monthly_price” have a high percent of missing values (each greater than 70). So we have to remove them in our analysis.
bnbMissDf1 <- rownames_to_column(bnbMissDf,var = "Variables")
bnbColMiss <- bnbMissDf1[bnbMissDf1$Missing_Values > 70,"Variables"] # Cols with > 70% missing values are stored Removing Redundant Columns
Since we are looking only for 2 bedroom properties in NYC, so columns like “city”,“state” and “bedrooms” will no longer be required after filtering.
Similar analysis has to be done for “Reviews” related columns to see if they are adding any value to our dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 20.00 92.00 96.00 93.93 100.00 100.00 11022
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 9.000 10.000 9.565 10.000 10.000 11082
Apart from 1 or 2 outliers in the lower range both overall ratings and location reviews seem to be positive with both median and mean lying close to the max values in each case. Since all these reviews are positive, they are not adding any value in comparing two listings. Therefore, they are also removed.
bnbColsRedundant <- c("city","state","review_scores_location","review_scores_rating","bedrooms")
bnbColsRemoved <- c(bnbColsRedundant,bnbColMiss) # Final Cols which are getting removedFiltering for final Revenue Dataset
Creating a function so that it becomes easy for the client in case they want to have data for different bedroom properties.
filterRevData <- function(bed,cols) {
data <- bnbData %>%
filter(bedrooms == bed) %>%
select(cols)
return(data)
}Calling the above function
noBedroom <- 2
cols <- bnbColNames[-match(bnbColsRemoved,bnbColNames)] #Removing Columns containing high missing values% and those are no longer required
filteredBnbData <- filterRevData(noBedroom,cols)Since the data is going to be merged on zipcodes, we have to check the number of missing values for zipcode column. There are still 50 missing zipcode values in our dataset.
Furthermore it should be checked whether all the zipcodes have consistent length.
##
## 5
## 6447
All the zipcodes (except those with missing values) have 5 digits.
Similar to the checking missing values functionality, a common process has been followed for replacing missing values of important fields like zipcodes. These values are imputed by using the median values in the same neighbourhood and the borough as that of missing value.
replaceMissing <- function(df,missCol) {
sum_df <- data.frame(df)
data <- sum_df %>%
select("id","neighbourhood_group_cleansed","neighbourhood_cleansed",missCol) %>%
group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>%
mutate(replaceMissing = median(as.numeric(get(missCol)),na.rm = TRUE))
for(i in 1:nrow(df))
{
if(is.na(df[i,missCol]))
{
df[i,missCol] = ifelse(is.numeric(df[i,missCol]),data[i,"replaceMissing"],as.character(data[i,"replaceMissing"]))
}
}
return(df)
}Calling the above function
filteredBnbData$zipcode <- as.character(filteredBnbData$zipcode)
zipReplacedData <- replaceMissing(filteredBnbData,'zipcode')Final Revenue Dataset
finalRevData <- data.frame(zipReplacedData)
kable(head(finalRevData), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")| id | neighbourhood_cleansed | neighbourhood_group_cleansed | zipcode | latitude | longitude | room_type | price | cleaning_fee |
|---|---|---|---|---|---|---|---|---|
| 7750 | East Harlem | Manhattan | 10029 | 40.79685 | -73.94872 | Entire home/apt | $190.00 | NA |
| 13050 | Bedford-Stuyvesant | Brooklyn | 11221 | 40.68554 | -73.94090 | Entire home/apt | $115.00 | $85.00 |
| 14290 | Williamsburg | Brooklyn | 11206 | 40.70420 | -73.93560 | Entire home/apt | $228.00 | $128.00 |
| 15396 | Chelsea | Manhattan | 10001 | 40.74623 | -73.99530 | Entire home/apt | $375.00 | $120.00 |
| 15711 | Upper East Side | Manhattan | 10162 | 40.77065 | -73.95269 | Entire home/apt | $250.00 | $200.00 |
| 16458 | Park Slope | Brooklyn | 11215 | 40.67343 | -73.98338 | Entire home/apt | $225.00 | NA |
Data Quality Insights for Revenue Data: High percent of missing values were present in large number of columns.Data type for price fields is not numeric since they contain the $ sign.State and Zipcodes had many inconsistent/wrong entries but most of them were automatically removed after filtering for 2 bedroom properties.
Analysis of Cost Data (Zillow Dataset)
## [1] 8946 262
The columns contained in the above dataset are:
- RegionID
- RegionName
- Location Details (City,State,Metro,CountyName)
- SizeRank
- Property price from 1996-04 to 2017-06
Before merging the datasets we need to bring both the prices on the same time scale so as to make even comparison.
Average Price Trends across the years
## The following code was used to produce time series line chart in Tableau
zTime <- zillowData %>%
filter(City == 'New York',State == 'NY') %>%
gather(key = "time",value ="price",-c("RegionID","RegionName","City","State","Metro","CountyName","SizeRank"))
zTime$time = gsub("[X]","",zTime$time)
zTime$time =gsub("[.]","-",zTime$time)
zTime$time = as.Date(paste(zTime$time,"01",sep = "-"))
# Write 'zTime' dataframe in csv and use it in Tableau
# write.csv(zTime,file = "timeData.csv")Data Quality Insights for Cost Data: As seen in the above graph, a lot of data is missing from 1996 to 2004. So it will not be a good idea to incorporate them for calculating future prices. A better picture can be portrayed by using the data from the last 5 years. (2012-2017)
Cleaning Zillow Dataset
cleanZillow <- function(cityname,statename){
data <- zillowData %>%
filter(City == cityname,State == statename)
return(data)
}Calling the above function for suitable parameters : for New York City
Since there are a lot of fluctuations in the monthly data, we have to average out the prices for past each year. Using these prices, then we compute the average appreciation rate for the past 5 years. The function has been created in such a way that we can get growth rate for any past ‘n’ number of years no matter the latest data. So it would be useful in case the zillow data gets available even for the next two/three years after 2017.
growthRate <- function(df,y) { # df = dataset,y = no. of years
costdf <- df %>%
select((ncol(df) - y*12):(ncol(df)-1)) # Selecting price columns for previous 'y' years
for(i in 1:y){
costdf[,paste('cost',i,sep = '_')] = rowMeans(costdf[,(12*(i-1)+1):(12*i)],na.rm = TRUE) # calculating price means
}
costdf <- costdf[,-c(1:(y*12))] # Retaining only price means for 'y' years
for(j in 1:(y-1)){
costdf[,paste('growth',j,sep = '_')] <- (costdf[,paste('cost',j+1,sep = '_')])/(costdf[,paste('cost',j,sep = '_')])-1 #Calculating appreciation rate from j to j+1 year
}
avg_growth <- rowMeans(costdf[grepl("(growth)", colnames(costdf))]) # taking means of appreciation rate calculated above
return(avg_growth)
}Creating Final Cost Data
colnames(filteredZillow)[colnames(filteredZillow)=="RegionName"] <- "zipcode" # Rename to zipcode
colnames(filteredZillow)[ncol(filteredZillow)] <- "propertyCost" # Renaming the most recent cost as the propertyCost
zillowCols <- c('zipcode','SizeRank','propertyCost') # Columns considered important for our analysis
finalCostData <- filteredZillow %>%
select(zillowCols)
n <- 5 # Number of past years data we are considering for analysis
finalCostData$avg_growth <- growthRate(filteredZillow,n) # Combining the avg growth rate computed for the past 'n' yearsA look at the Final Cost data
| zipcode | SizeRank | propertyCost | avg_growth |
|---|---|---|---|
| 10025 | 1 | 1431000 | 0.0965309 |
| 10023 | 3 | 2142300 | 0.0892191 |
| 10128 | 14 | 1787100 | 0.0975620 |
| 10011 | 15 | 2480400 | 0.0920633 |
| 10003 | 21 | 2147000 | 0.0850568 |
| 11201 | 32 | 1420700 | 0.1261907 |
Merging the two datasets
## 'data.frame': 1569 obs. of 12 variables:
## $ zipcode : chr "10003" "10003" "10003" "10003" ...
## $ id : int 19288937 13866112 31835260 5207973 17843998 4126452 35415489 2150328 31408487 3984168 ...
## $ neighbourhood_cleansed : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ latitude : num 40.7 40.7 40.7 40.7 40.7 ...
## $ longitude : num -74 -74 -74 -74 -74 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ price : Factor w/ 674 levels "$0.00","$1,000.00",..: 244 129 208 222 244 189 244 635 640 134 ...
## $ cleaning_fee : Factor w/ 206 levels "$0.00","$1,000.00",..: 12 192 21 162 181 25 25 4 4 43 ...
## $ SizeRank : int 21 21 21 21 21 21 21 21 21 21 ...
## $ propertyCost : int 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
## $ avg_growth : num 0.0851 0.0851 0.0851 0.0851 0.0851 ...
Data Cleaning for the final dataset
finalData$price = as.numeric(gsub("[$,]","",as.character(finalData$price)))
finalData$cleaning_fee = as.numeric(gsub("[$,]","",as.character(finalData$cleaning_fee)))
finalData$zipcode = as.factor(finalData$zipcode)
finalData$id = as.character(finalData$id)
finalData$SizeRank = as.factor(finalData$SizeRank)Checking Missing Values in the final data
## Missing_Values
## cleaning_fee 13.7
Cleaning Fee has about 13.7% missing values. Since it is an important parameter in determining the returns for a listing so we have to impute it.
Impute Missing values of the Cleaning fee
finalData_V1 <- replaceMissing(finalData,'cleaning_fee') #Using the function we created earlier for replacing zipcodes
checkMiss('cleaning_fee',finalData_V1)## Missing_Values
## cleaning_fee 0.13
finalData_V1[which(finalData_V1$cleaning_fee %in% NA),'cleaning_fee'] <- 0 # Since there are only 1.3% missing values we
# can impute them as 0Checking outliers in rent
p1 <- ggplot(finalData_V1,aes(x = neighbourhood_group_cleansed,y = price,fill = neighbourhood_group_cleansed)) +
geom_boxplot() + theme(legend.position = "none")
ggplotly(p1)From the boxplot, it is visible that there are 3 listings with quite a high price(greater than 2000$/day): 2 in Manhattan and 1 in Brooklyn. Such high prices could affect our analysis as they could be a result of some data entry error and there are not many listings with such astronomically high charges so it would be better to remove those observations.
Removing outliers in rent
Checking whether zipcodes are unique for a Neighbourhood group
finalData_V2 %>%
select("zipcode","neighbourhood_group_cleansed") %>% group_by(zipcode) %>%
summarise(total = n_distinct(neighbourhood_group_cleansed)) %>% filter(total > 1)## # A tibble: 1 x 2
## zipcode total
## <fct> <int>
## 1 10013 2
10013 zipcode is present in two neighbourhoods - Brooklyn and Manhattan. However it is an area belonging to Lower Manhattan. So updating the neighbourhood of that observation from Brooklyn to Manhattan.
index <- which(finalData_V2$zipcode %in% 10013 & finalData_V2$neighbourhood_group_cleansed %in% 'Brooklyn')
finalData_V2[index,'neighbourhood_group_cleansed'] <- 'Manhattan'A glimpse at the final dataset
| zipcode | id | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | room_type | price | cleaning_fee | SizeRank | propertyCost | avg_growth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10003 | 19288937 | East Village | Manhattan | 40.72784 | -73.98883 | Entire home/apt | 250 | 110 | 21 | 2147000 | 0.0850568 |
| 10003 | 13866112 | East Village | Manhattan | 40.72394 | -73.98872 | Entire home/apt | 165 | 85 | 21 | 2147000 | 0.0850568 |
| 10003 | 31835260 | Gramercy | Manhattan | 40.73782 | -73.98547 | Entire home/apt | 218 | 120 | 21 | 2147000 | 0.0850568 |
| 10003 | 5207973 | East Village | Manhattan | 40.72950 | -73.98623 | Entire home/apt | 230 | 60 | 21 | 2147000 | 0.0850568 |
| 10003 | 17843998 | East Village | Manhattan | 40.72571 | -73.98789 | Entire home/apt | 250 | 75 | 21 | 2147000 | 0.0850568 |
| 10003 | 4126452 | Gramercy | Manhattan | 40.73327 | -73.98503 | Entire home/apt | 200 | 125 | 21 | 2147000 | 0.0850568 |
Structure of the final data
## 'data.frame': 1566 obs. of 12 variables:
## $ zipcode : Factor w/ 24 levels "10003","10011",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ id : chr "19288937" "13866112" "31835260" "5207973" ...
## $ neighbourhood_cleansed : Factor w/ 221 levels "Allerton","Arden Heights",..: 65 65 87 65 65 87 87 65 65 65 ...
## $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ latitude : num 40.7 40.7 40.7 40.7 40.7 ...
## $ longitude : num -74 -74 -74 -74 -74 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 250 165 218 230 250 200 250 850 88 170 ...
## $ cleaning_fee : num 110 85 120 60 75 125 125 100 100 150 ...
## $ SizeRank : Factor w/ 24 levels "1","3","14","15",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ propertyCost : int 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 2147000 ...
## $ avg_growth : num 0.0851 0.0851 0.0851 0.0851 0.0851 ...
Visualization & Insights
Zipcode Analysis
Payback period
The zipcodes with less payback period will be better for investment as the investor will be able to quickly recover cash from their initial investment.
occupancy <- 0.75
nodays <- 3 # Number of days the apartment will be rented
finalData_v3 <- finalData_V2 %>%
mutate(presentCost = propertyCost*(1 + avg_growth)**2) %>%
mutate(paybackPeriod = presentCost/(occupancy*price + ((occupancy*cleaning_fee)/nodays)))
kable(finalData_v3 %>%
group_by(zipcode) %>%
summarise(years = mean(paybackPeriod)/365) %>%
arrange(years) %>%
head(n = 5),caption = "Zipcodes with least payback period") %>% kable_styling('striped')| zipcode | years |
|---|---|
| 10306 | 9.478526 |
| 10303 | 11.966507 |
| 11234 | 12.077044 |
| 11434 | 12.778073 |
| 10305 | 13.361443 |
finalData_v3 %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(years = mean(paybackPeriod)/365) %>%
arrange(years) %>%
ggplot(aes(x = reorder(zipcode,-years),y = years,fill=neighbourhood_group_cleansed)) + geom_col() + labs(y = "Payback Period",x = "zipcodes")+ coord_flip() + geom_hline(aes(yintercept = mean(years)), color="blue",linetype="dashed") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())Zipcodes : 10306,10303,11234,11434 and 10305 have the least amount of payback period.(Dashed blue line shows the average)
Return on Investment(ROI%)
ROI = (Gain from Investment - Cost of Investment)/Cost of Investment
We can use simple ROI% calculation here since the investor will be paying for the property in cash. Suppose they are going to hold the property for 5 years and want to get high ROI%.
noYears <- 5 #The period for which we are calculating ROI%
finalData_v4 <- finalData_v3 %>%
mutate(ROI = (((presentCost*(1+avg_growth)**noYears) + (occupancy*price*365*noYears) +((cleaning_fee/nodays)*occupancy*365*noYears) - presentCost)/presentCost)*100)
finalData_v4 %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(avg_ROI = mean(ROI)) %>%
arrange(desc(avg_ROI)) %>%
ggplot(aes(x = reorder(zipcode,avg_ROI),y = avg_ROI,fill = neighbourhood_group_cleansed)) + geom_col() + coord_flip() + labs(y = "ROI%",x = "zipcodes") + geom_hline(aes(yintercept = mean(avg_ROI)), color="blue",linetype="dashed") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank())Zipcodes : 11217,11231,11201,11215 and 11434 have the highest ROI% among all the others.(Dashed blue line shows the average)
Number of properties by zipcodes
finalData_V2 %>%
ggplot(aes(x = zipcode,fill = neighbourhood_group_cleansed)) +
geom_bar() + geom_text(stat='count', aes(label=..count..), vjust=-0.3) +
theme_bw() + labs(y = "Number of Properties") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),panel.grid.major = element_blank(),panel.grid.minor = element_blank(),axis.ticks.y = element_blank(),axis.text.y = element_blank(),panel.border = element_blank()
,axis.ticks.x = element_blank())Zipcodes : 11215,10036,10003,10025 and 11217 have the highest number of property listings indicating the popularity of the area.
kable(finalData_V2 %>%
group_by(zipcode,neighbourhood_group_cleansed) %>%
summarise(growthPercent = round(mean(avg_growth),4)*100,number_of_properties = n()) %>%
arrange(desc(growthPercent)) %>%
head(n = 5),caption = "Zipcodes with Highest Property Appreciation Rate(from 2012-17)")| zipcode | neighbourhood_group_cleansed | growthPercent | number_of_properties |
|---|---|---|---|
| 11217 | Brooklyn | 14.18 | 123 |
| 11231 | Brooklyn | 13.78 | 93 |
| 11201 | Brooklyn | 12.62 | 86 |
| 11215 | Brooklyn | 11.82 | 190 |
| 10128 | Manhattan | 9.76 | 65 |
Out of the above, both 11217 and 11215 had high property appreciation rates and also large number of properties.
Borough Analysis
Price(per night) distribution across neighbourhood
p2 <- ggplot(finalData_v3,aes(price,fill = neighbourhood_group_cleansed)) + geom_density(alpha = 0.2) +
scale_x_log10() + ggtitle("Price(per night) distribution")
ggplotly(p2)Property Cost distribution across neighbourhood
finalData_v3 %>%
group_by(neighbourhood_group_cleansed) %>%
summarise(avg_cost = mean(presentCost)) %>%
ggplot(aes(neighbourhood_group_cleansed,avg_cost,fill = neighbourhood_group_cleansed)) + geom_col(position = "dodge") + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank()) + labs(y = "Avg Property Cost")- Manhattan has high property prices along with high rental prices as compared to the other three.
- Queens and Staten Island have comparatively low prices in terms of both property and rent as compared to Brooklyn and Manhattan.
Further Insights
Does population of an area(SizeRank) affect Price(per night)?
p3 <- ggplot(finalData_V2,aes(x=SizeRank,y = price,col = neighbourhood_group_cleansed)) +
geom_point() +
geom_jitter() + theme(legend.title = element_blank(),axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p3)Most of the prices lie in the region below 1000$ mark indicating there is not any noticeable relation between price per night and the population of that area.A more visible relation was seen in the earlier graph when the price distribution of all the boroughs was compared showing that Manhattan and Brooklyn had higher prices as compared to the other two.
Do Queens and Staten Island have lower prices because of higher number of private rooms?
p4 <- finalData_V2 %>%
ggplot(aes(neighbourhood_group_cleansed,price,fill = room_type)) + geom_col(position = "dodge",width = 0.8) + theme_minimal() + theme(panel.grid.major = element_blank(),panel.grid.minor = element_blank()) + labs(y = "Price(per night)")
ggplotly(p4)Since private rooms are much cheaper as compared to renting an entire home, we wanted to know whether low prices in Staten Island and Queens are because they contain higher proportion of private rooms. However, from the graph it is easily visible that prices of both room types tend to follow same pattern across the neighbourhoods. Hence, we can conclude that rental prices are low in Staten Island and Queens irrespective of whether it’s a private room or an entire home.