We are helping a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. They want us to build a data product and provide conclusions to help them understand which zip codes would generate the most profit on short term rentals within New York City.
Code for Loading Tables
datasets<-c("listings.csv.gz","Zip_Zhvi_2bedroom.csv")
for (i in seq_along(datasets))
{
file_path=paste0(datasets[i])
if(file.exists(file_path))
{
df<-read_csv(file_path)
assign(datasets[i],df)
} else
print("No such file exits")
}
## Parsed with column specification:
## cols(
## .default = col_character(),
## id = col_double(),
## scrape_id = col_double(),
## last_scraped = col_date(format = ""),
## host_id = col_double(),
## host_since = col_date(format = ""),
## host_is_superhost = col_logical(),
## host_listings_count = col_double(),
## host_total_listings_count = col_double(),
## host_has_profile_pic = col_logical(),
## host_identity_verified = col_logical(),
## latitude = col_double(),
## longitude = col_double(),
## is_location_exact = col_logical(),
## accommodates = col_double(),
## bathrooms = col_double(),
## bedrooms = col_double(),
## beds = col_double(),
## square_feet = col_double(),
## guests_included = col_double(),
## minimum_nights = col_double()
## # ... with 24 more columns
## )
## See spec(...) for full column specifications.
## Parsed with column specification:
## cols(
## .default = col_double(),
## RegionName = col_character(),
## City = col_character(),
## State = col_character(),
## Metro = col_character(),
## CountyName = col_character()
## )
## See spec(...) for full column specifications.
Missing Values -
for (i in seq_along(datasets))
{
missing_values<-any(is.na(get(datasets[i])))
if (missing_values)
{
print(paste(datasets[i],"has Missing values"))
} else
print(paste(datasets[i],"has no Missing values"))
}
## [1] "listings.csv.gz has Missing values"
## [1] "Zip_Zhvi_2bedroom.csv has Missing values"
Column Names
Percentage of missing values Zip,bedrooms,availability_30,number_of_reviews,price
## [1] 0.01499276
## [1] 0.001693127
## [1] 0
## [1] 0
## [1] 0
## [1] 0
Cleaning Data
## [1] 40753 97
## [1] 8940 102
Inputs
Following inputs were used which can be changed to see results for different values-
cutoff_review-Only properties with reviews greater than cutoff_review were considered in the analysis
price_min_cutoff/price_max_cutoff-These values were used in deciding which properties should be included based on price
cutoff_review<-4
price_min_cutoff<-0.05
price_max_cutoff<-0.95
Initially, we just checked the count of properties across buroughs and zips
Analysis of Bedroom count
Table 1 : Table showing cost factor that will be used in the analysis
## # A tibble: 9 x 3
## bedrooms mean_sq_feet mean_factor_2bedroom
## <dbl> <dbl> <dbl>
## 1 0 408. 0.442
## 2 1 585. 0.634
## 3 2 924. 1.00
## 4 3 1333. 1.44
## 5 4 3350 3.63
## 6 5 4000 4.33
## 7 6 2300 2.49
## 8 7 3700 4.00
## 9 NA 400 0.433
Analysis of Reviews
We are only considering 50% of the properties with 5 or more than 5 review. This value is dynamic and can be changed in the anaysis (Input at the top Input 1)
## [1] 0.5063758
Using review score location to define Occupancy rate
This graph shows that higher review scores lcation result in higher occupancy rate
Table 2 : This table shows the occupancy rate across review_scores_location values
## # A tibble: 6 x 3
## review_scores_location count_reviews mean_occupancy
## <dbl> <int> <dbl>
## 1 6 1 0
## 2 7 5 0.107
## 3 8 50 0.589
## 4 9 809 0.745
## 5 10 3646 0.800
## 6 NA 16 0.744
This table will be used to calculate occupancy rate for properties based on their location review score
Analysing Price to look for outliers
After outlier removal the box plot looks like this
Defining Revenue, property cost and years to breakeven using the dataset created above
Annual Revenue is defined as the price per day X 365 X Occupancy (Occupancy was based on the review score location as discussed earlier )
Property cost is based on Latest cost of 2 bedroom apartment X Factor calculated earlier based on bedroom count
years to breakeven = Property cost/Annual Revenue
Bar graph showing avg price and property cost across Boroughs(Neighbourhood Group) Insights from the charts creted at neighbourhood level
Price per day - Average price per day is highest for manhattan(~$190), followed by brooklyn(~$170), Staten Island(~$80) and Queens(~$60)
Property cost per day - Property cost also follows the same order-Manhattan, brooklyn,Staten Island and Queens
Years to breakeven - Breakeven Years is minimum for staten Island followed by Queens, Brooklyn and manhatten. Deep dive at zip level is needed to understand which zips are most profittable
Count of properties across Zips
In case same zip was associated with multiple neighbourhood, zip was mapped to neighbourhood with maximum mappings
Insights from the charts creted at neighbourhood level
Price per day - Average price per day is highest for manhattan zips(10014,10011,10023), followed by brooklyn zips(11217)
Property cost per day - Property cost also follows the same order-Manhattan, brooklyn,Staten Island and Queens
Graph showing Breakeven time period across Zips in order from lowest to Highest
Zips present in Staten Island gives the best ROI
Although, most of the properties are in Manhattan as it receives the highest number of guests . Our Analysis suggests that the breakeven is also longest for Manhattan due to high property cost
Manhattan presents tough competition as there are many competitors and it has high property cost. Overall, it has long breakeven point and doesn’t present a good opportunity in terms of ROI
Staten Island has considerable lower property cost but it also gets fewer guests. Also, travellers and tourist who are visiting NY for business and leisure, it is much better for them to stay in manhattan which is closer to offices and other tourist attractions
But going just by numbers it seems it is much safer to invest in a property at Staten Island as it offers shorter Breakeven point and is less risky compared to other locations
In case the company wants to diversify, they should pick the top zips from different neighbourhoods to minimize risk
Our recommendation would be to diversify and buy properties in top performing zips of different neighbourhoods with prime focus on staten Island
Property cost - Property cost used for this analysis was from Jun’17. Latest property cost should be used if it is available. We can also use forecasting techniques like ARIMA to predict latest property cost from past trend. We can also take the actual cost of properties which would give more accurate results
Factors impacting Occupancy- In the present analysis, although we have used occupancy rate based on review score location, there are other factors which impact occupancy rate. We should also normalize our results for other factors like cleanliness, staff behavior etc which definitely impacts Occupancy rate using regression model.We can also use Availability_30 at property level to get more accurate results
Discount/Interest Rate- In this case, we have taken 0% discount rate as our assumption but that assumption is not practical. Some reasonable percentage rate can be taken to calculate NPV value and make a more accurate prediction