Predicting the most profitable Zipcodes in NYC for Real Estate

We have taken publicly available data from Airbnb and Zillow to investigate purchasing properties in which Zipcodes would yield most profits for our client. Our investors would outsource their properties through AirBnB and are interested to make purchases only in New York City. Our client has already mentioned that 2 Bedroom properties would be most profitable according to their analysis. Based on few assumptions and the raw data available to us, the following analyses has been made.

Please click on each of the tabs bellow to navigate through the report.

Introduction

Introduction

To accomplish the results of this project, I worked with two data sets, as mentioned below. Metadata for these datasets is already given and information for newly created columns would be highlighted as needed.

  1. Zillow Dataset - provdies information of property costs from April 1996 to June 2017

  2. AirBnB Dataset - Raw Data taken from Airbnb that has all the information related to Airbnb bookings, locations, prices, availability etc.,

Assumptions

  1. I considered that 25% of the cleaning fee would go to revenues, this could also be adjusted and tested for different rates with the code.
  2. Assumed occupancy rate of 75% for initial model, seems to be aligning with the mean availability rate as well. Coded to change the occupancy rate and review results, if necessary.
  3. The investor will pay for the property in cash. If the party is going to take loan, additional costs would be incurred in the form of interest rates.
  4. The discount rate is 0% for Time value Money - A dollar worth would remain same 100 years from now. Will be required to calculate profits more accurately in the future
  5. All properties and all square feet within each locale can be assumed to be homogeneous - Stated, but never used, as I did not consider square feet to build this model.
  6. Ignored security deposit, as it would not acccount for revenues. Also disregarded extra guests charges as we could not predict the number of additional guests for a listing with given information.

Packages Used

Below are the R packages used for the purpose of this project.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(reshape2)
library(Hmisc)
library(gridExtra)
library(kableExtra)

Have used below theme for all the plots visualised in the report. Also, for better presentation and control over data, I have used scrollbars, which are also defined in the below snippet.

#them for plotting
theme <-  theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line = element_line(colour = "black"))  

#function for scrollbars
scrollbar <- function(dataf, rows_disp = 4){
kable(head(dataf, n = rows_disp)) %>%
  kable_styling("striped", full_width = F) %>%
  scroll_box(width = "100%", height = "230px")
}

Data Processing - Zillow

Rearranging the data

From the datafile provided initially, data is loaded into R, and as a first step of cleaning the data, I have filtered out the observations that are specific to New York City. For better understanding of data,I have also combined almost 200+ columns into several rows with new variables indication month and year.

Zillow <- read_csv("data/Zip_Zhvi_2bedroom.csv")
#Loading data
Zillow_NY <- Zillow %>%  
  gather(Time, Price, colnames(Zillow[-c(1:7)])) %>%
  filter(City == "New York") %>%
  rename(zipcode = RegionName) %>%
  select(-c(1,3:5)) %>%
  separate(Time, c("Year", "Month")) %>%
  mutate_at(-c(3,6) , as.factor) %>%
  mutate(Month = as.factor(month.name[Month]))

Quick glance at the modified data

This dataset is pretty clean and the summary of variables can be seen below. From the below dimensions it is quite clear that the dataset is pretty clean and organised. There are not many issues with the data in regards to improrper formatting and there arent any observed outliers.

zipcode CountyName SizeRank Year Month Price
10025 New York 1 1996 April NA
10023 New York 3 1996 April NA
10128 New York 14 1996 April NA
10011 New York 15 1996 April NA
##     zipcode        CountyName      SizeRank         Year           Month     
##  10003  : 255   Kings   :1275   Min.   :   1   1997   : 300   April   : 550  
##  10011  : 255   New York:2805   1st Qu.:  52   1998   : 300   June    : 550  
##  10013  : 255   Queens  : 255   Median : 580   1999   : 300   May     : 550  
##  10014  : 255   Richmond:2040   Mean   :1045   2000   : 300   August  : 525  
##  10021  : 255                   3rd Qu.:1744   2001   : 300   December: 525  
##  10022  : 255                   Max.   :4647   2002   : 300   February: 525  
##  (Other):4845                                  (Other):4575   (Other) :3150  
##      Price        
##  Min.   :  99200  
##  1st Qu.: 277700  
##  Median : 425100  
##  Mean   : 779289  
##  3rd Qu.:1306000  
##  Max.   :3347100  
##  NA's   :1696

The new data is now condensed to 6375 observations of just 6 variables from 8946 observations of 252 variables.

Missing values

Eventhough, the dataset looks pretty clean, there are about 1600 missing values for price. I have broken down the number of missing price values for each zipcode and the same can be visualised below.

## `summarise()` ungrouping output (override with `.groups` argument)

The zipcodes in Kings and New York counties, which are probably new localities have lot of missing price data. This makes sense becuase, the area would not have been commercialised as early as 1994 and hence no data was available.

Data Processing - AirBnb

Loading the data

Due to the bulk size of the Airbnb Dataset, I directly downloaded it from the link provided. As shown below, the gzfile function is used to decompress the zip file and read the comma separated file format.

temp <- tempfile()
download.file("http://data.insideairbnb.com/united-states/ny/new-york-city/2017-05-02/data/listings.csv.gz", temp)
Airbnb <- read.csv(gzfile(temp), as.is = TRUE, na.strings = c("","NA"))
unlink(temp)

Airbnb Dataset is directly downloaded from their webiste and like any real world data, it needs to be cleaned and checked thoroughly before processing the same. The dataset has around 95 variables with a lot of unwanted and unrelated data. Below is a quick glance at the original Airbnb dataset.

## [1] 40753    95
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules thumbnail_url medium_url picture_url xl_picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city state zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet price weekly_price monthly_price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value requires_license license jurisdiction_names instant_bookable cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count reviews_per_month
7949480 https://www.airbnb.com/rooms/7949480 2.01705e+13 2017-05-03 City Island Sanctuary relaxing BR & Bath w Parking Come relax on City Island in our quiet guest room w/adjacent private bath. Enjoy a Continental Breakfast served by Didier at your table downstairs. Walk to restaurants, shops, the park or the beach. Park one car on our private driveway. If you are looking to explore Manhattan you can take the 29 bus to the nearest subway number 6. Or you can take the BXM8 express bus at the Pelham Bay Subway station which goes directly into Manhattan. There are also green taxis at the subway. On parle français et anglais, (lire Français ci-dessous). I am a native New Yorker and Didier grew up in Biarritz, France. Our three friendly cats do not go into the room. The bathroom is yours during your entire stay (we have our own en suite). We will provide a Continental breakfast with fresh bread butter & jams, fresh squeezed orange juice or fresh fruit, tea or coffee served by my French husband who is a talented cook! You might even get a special treat of canelés de Bordeaux, banana bread or home made scones! City Island Sanctuary relaxing BR & Bath w Parking Pièce calme, ensoleillée et à proximité d’une salle de bain privée. Marchez jusqu, au parc ou à la plage. On peut aller à pied à quelques restaurants, une petite supermarché, petits magasins et le City Island Nautical Museum. Prenez le metro ou le bus express à Manhattan (une heure et demi à centre-ville. Les chats sont en résidence, mais ne vont pas dans la chambre. Cable TV avec TV5 Monde, lecteur DVD, Wi-fi . Petit Come relax on City Island in our quiet guest room w/adjacent private bath. Enjoy a Continental Breakfast served by Didier at your table downstairs. Walk to restaurants, shops, the park or the beach. Park one car on our private driveway. If you are looking to explore Manhattan you can take the 29 bus to the nearest subway number 6. Or you can take the BXM8 express bus at the Pelham Bay Subway station which goes directly into Manhattan. There are also green taxis at the subway. On parle français et anglais, (lire Français ci-dessous). I am a native New Yorker and Didier grew up in Biarritz, France. Our three friendly cats do not go into the room. The bathroom is yours during your entire stay (we have our own en suite). We will provide a Continental breakfast with fresh bread butter & jams, fresh squeezed orange juice or fresh fruit, tea or coffee served by my French husband who is a talented cook! You might even get a special treat of canelés de Bordeaux, banana bread or home mad none City Island is a unique sanctuary in New York City’s the Bronx. It boasts many wonderful seafood restaurants, two great pubs and the only true French bistro in the borough. We are walking distance to the largest park in the City of New York, Pelham Bay Park and also to lovely Orchard Beach, the Riviera of the Bronx! We are walkable to Pelham Park which is the largest park in NYC. It has picnic areas, tennis courts, horseback riding, Orchard Beach and a lovely rocky coast. Check out their website for all there is to do! On the island, you can walk to everything. A local bus will take you to shopping at the Mall at Bay Plaza near Coop City. If you want to go to Manhattan you take the local 29 bus to the subway No. 6 to Manhattan. The local bus does not run after midnight but there are green taxis at the subway station. There is an express bus early in the morning which goes directly from the island to Manhattan. It will take at least one hour to get to midtown by subway and a bit longer on the express bus depending upon traffic. Feel free to enjoy our large backyard in season, our downstairs large recreation area with TV, WIFI. Walk to the local beaches for the view or a dip. Please, no food or drinks except water in the guest bedroom. We speak fluent French and English. We will be in residence during your stay to help you with recommendations for a real New York experience or a unique City Island French experience. But, we do respect your privacy! No extra guests or visitors who are not in the reservation. Maximum 2 registered guests only. No late check ins or check outs. Please, no food or drinks in the guest bedroom except water. No smoking on the property. It’s a residential neighborhood so we keep it quiet after 10. Please only inquire about booking for yourself, we will not accept 3rd party inquiries. https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=small https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=medium https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=large https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=x_large 119445 https://www.airbnb.com/users/show/119445 Linda & Didier 2010-05-06 New York, New York, United States I am a designer, former instructor at a design school and my husband is a talented French cook (he is from France). He is also an artist, writer and a Zen practitioner. We are sociable but respectful of your privacy.
I am a native New Yorker having relocated back to NYC from So Cal because I Love New York! We are the only French/American Airbnb in the Bronx!
within an hour 100% N/A t https://a0.muscache.com/im/pictures/2de30fe4-bebe-4145-a86a-b8a8a970078d.jpg?aki_policy=profile_small https://a0.muscache.com/im/pictures/2de30fe4-bebe-4145-a86a-b8a8a970078d.jpg?aki_policy=profile_x_medium City Island 1 1 [‘email’, ‘phone’, ‘reviews’, ‘jumio’] t t City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.85205 -73.78868 t House Private room 2 1 1 1 Real Bed {“Cable TV”,“Wireless Internet”,“Air conditioning”,“Free parking on premises”,Breakfast,“Pets live on this property”,Cat(s),“Indoor fireplace”,Heating,“Smoke detector”,“Carbon monoxide detector”,“Fire extinguisher”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron,“translation missing: en.hosting_amenity_49”,“translation missing: en.hosting_amenity_50”} NA $99.00 NA NA $100.00 NA 1 $20.00 1 7 yesterday NA 24 54 80 170 2017-05-03 25 2016-01-18 2017-04-23 100 10 10 10 10 10 10 f NA NA f moderate t t 1 1.59
16042478 https://www.airbnb.com/rooms/16042478 2.01705e+13 2017-05-04 WATERFRONT STUDIO APARTMENT My place is close to Sea Shore. You’ll love my place because of the coziness, the location, the views, and the people. My place is good for couples, solo adventurers, business travelers, and families (with kids). (URL HIDDEN) My place is close to Sea Shore. You’ll love my place because of the coziness, the location, the views, and the people. My place is good for couples, solo adventurers, business travelers, and families (with kids). (URL HIDDEN) public transport from Grand Central, NYC or from JFK, Laguadia is about 30 minutes away. (URL HIDDEN) Fine sea food restaurant, bars and night clubs all within walking distance, Sailing, fishing, canoeing, marinas and just 30 minutes away from central NYC or major airports like JFK or LaGuardia airports. Enjoy NYC none (URL HIDDEN) Fine sea food restaurant, bars and night clubs all within walking distance, Sailing, fishing, canoeing, marinas and just 30 minutes away from central NYC or major airports like JFK or LaGuardia airports. Enjoy NYC NA public transport from Grand Central, NYC or from JFK, Laguadia is about 30 minutes away. NA NA https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=small https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=medium https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=large https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=x_large 9117975 https://www.airbnb.com/users/show/9117975 Collins 2013-09-29 New York, New York, United States I am married with 3 children and 2 grandkid. Family and community is the most important thing besides God. I am a Tax consultant and a I have travelled to a few countries including Germany, France, Holland, Netherlands, England in Europe. I have also travel extensively within the Caribbean and Mexico so I appreciate the importance of comfortable and affordable lodging when one is away. a few days or more 0% N/A f https://a0.muscache.com/im/users/9117975/profile_pic/1423794446/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/9117975/profile_pic/1423794446/original.jpg?aki_policy=profile_x_medium City Island 1 1 [‘phone’, ‘facebook’] t f City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.85349 -73.78861 t Apartment Private room 4 1 1 1 Real Bed {TV,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,Heating,“Family/kid friendly”,“Smoke detector”,“Carbon monoxide detector”,“First aid kit”,“Fire extinguisher”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron} NA $200.00 NA NA NA NA 1 $0.00 7 29 6 months ago NA 30 60 90 180 2017-05-04 0 NA NA NA NA NA NA NA NA NA f NA NA t flexible f f 1 NA
1886820 https://www.airbnb.com/rooms/1886820 2.01705e+13 2017-05-04 Quaint City Island Community. Quiet island boating town on Long Island Sound. 10 miles from Manhattan. Public trans. Walk to shops, restaurants, bars,parks. Near Botanical gardens, Bronx Zoo and Fordham University. Water view. off street parking. 3BR, 3 Full Baths. Washer/dryer. full kitchen.Fully furnished comfortable home with deck/backyard. local caretaker. Master bed with queen bed, full bath and office. Mid bedroom has queen bed. Small bed has single. TV room/den queen sleeper sofa. Galley kitchen off dining room. Piano. Winter water view. Large back deck with seating. Small front porch. Great for family of 4. Will fit 7. Additional$25.00 per person per night over 4 people. Quiet island boating town on Long Island Sound. 10 miles from Manhattan. Public trans. Walk to shops, restaurants, bars,parks. Near Botanical gardens, Bronx Zoo and Fordham University. Water view. off street parking. 3BR, 3 Full Baths. Washer/dryer. full kitchen.Fully furnished comfortable home with deck/backyard. local caretaker. Master bed with queen bed, full bath and office. Mid bedroom has queen bed. Small bed has single. TV room/den queen sleeper sofa. Galley kitchen off dining room. Piano. Winter water view. Large back deck with seating. Small front porch. Great for family of 4. Will fit 7. Additional$25.00 per person per night over 4 people. 2 story 1800 sq ft house all to yourselves. Local caretaker available during stay. We will be available by phone Small New England type town in the middle of the big city. Small Island in the Long Island Sound. Local marinas, seafood restaurants, yacht clubs, supermarket, library, shops. Parking available in driveway. Bus stop none Small New England type town in the middle of the big city. Small Island in the Long Island Sound. Local marinas, seafood restaurants, yacht clubs, supermarket, library, shops. NA Parking available in driveway. Bus stop around the corner. Near Orchard Beach, Bronx Zoo, World Famous Botanical Gardens, Fordham University and of course, Manhattan 2 story 1800 sq ft house all to yourselves. Local caretaker available during stay. We will be available by phone No smoking. No pets. This is our private home. We expect that you would treat it with care and enjoy it as we do. NA NA https://a0.muscache.com/im/pictures/26266928/14c688d9_original.jpg?aki_policy=large NA 9815788 https://www.airbnb.com/users/show/9815788 Steve 2013-11-04 US NA N/A N/A N/A f https://a0.muscache.com/im/users/9815788/profile_pic/1383845917/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/9815788/profile_pic/1383845917/original.jpg?aki_policy=profile_x_medium City Island 1 1 [‘email’, ‘phone’] t f City Island, Bronx, NY 10464, United States City Island City Island Bronx Bronx NY 10464 New York Bronx, NY US United States 40.84114 -73.78305 t House Entire home/apt 4 3 3 3 Real Bed {TV,“Cable TV”,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,“Free parking on premises”,Heating,Washer,Dryer} NA $300.00 NA NA $800.00 $100.00 4 $25.00 7 90 11 months ago NA 30 60 90 365 2017-05-04 0 NA NA NA NA NA NA NA NA NA f NA NA f strict f f 1 NA
6627449 https://www.airbnb.com/rooms/6627449 2.01705e+13 2017-05-05 Large 1 BDRM in Great location This ground floor apartment is light and airy with a fully equipped kitchen. Located in the heart of City Island w/ shops, restaurants and transportation nearby. If needed there is extra sleeping space on the modular sofa. We are close to fishing, boating, biking, horseback riding or hiking. We are near Orchard Beach and Pelham Bay Park. This is a very convenient area. This ground floor apartment is light and airy with a fully equipped kitchen. Located in the heart of City Island w/ shops, restaurants and transportation nearby. If needed there is extra sleeping space on the modular sofa. We are close to fishing, boating, biking, horseback riding or hiking. We are near Orchard Beach and Pelham Bay Park. This is a very convenient area. The entire apartment is available as I am only there part of the time. There is a patio in back off the living room. You are welcome to help yourself to coffee or tea. My daughter and her family live on the Island and will be there to greet you and help you out with any information you might like. City Island is a unique and a hidden gem of New York City. Many New Yorkers do not even know it exists. It is known for its seafood and sailing, with many seafood restaurants in various price ranges. We are close to grocery stores, library, City Island Diner, antique stores and art galleries. It is a very safe and walk none City Island is a unique and a hidden gem of New York City. Many New Yorkers do not even know it exists. It is known for its seafood and sailing, with many seafood restaurants in various price ranges. We are close to grocery stores, library, City Island Diner, antique stores and art galleries. It is a very safe and walkable neighborhood. City Island is only 11/2 miles long and (3) blocks wide at its widest, with 2 seafood restaurants at the end with outdoor do it yourself seating There is off street parking in the apartment complex. I have a reserved space which you would be able to use. We are (2) blocks from the BX29 bus which you can take to the Pelham Bay subway station. ( a 10 minute ride) From there you can transfer to the #6 train to Manhattan; about 45 minutes. The entire apartment is available as I am only there part of the time. There is a patio in back off the living room. You are welcome to help yourself to coffee or tea. My daughter and her family live on the Island and will be there to greet you and help you out with any information you might like. There is no smoking and no pets. Otherwise use the place as you would your own. https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=small https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=medium https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=large https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=x_large 13886510 https://www.airbnb.com/users/show/13886510 Arlene 2014-04-04 Los Angeles, California, United States NA within a few hours 100% N/A f https://a0.muscache.com/im/users/13886510/profile_pic/1433129533/original.jpg?aki_policy=profile_small https://a0.muscache.com/im/users/13886510/profile_pic/1433129533/original.jpg?aki_policy=profile_x_medium City Island 1 1 [‘email’, ‘phone’, ‘reviews’, ‘kba’] t t City Island, City Island, NY 10464, United States City Island City Island Bronx City Island NY 10464 New York City Island, NY US United States 40.84977 -73.78661 t Apartment Entire home/apt 3 1 1 1 Real Bed {TV,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,“Free parking on premises”,“Buzzer/wireless intercom”,Heating,“Family/kid friendly”,Washer,Dryer,“Smoke detector”,“Carbon monoxide detector”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron} NA $125.00 $775.00 NA NA $75.00 1 $0.00 3 21 2 weeks ago NA 8 30 60 335 2017-05-05 12 2015-07-04 2016-10-24 93 10 10 10 10 10 10 f NA NA f strict f f 1 0.54

As per the requirements, I have filtered and cleaned the AIrbnb data as coded below. First, I have filtered out the 2 BHK properties as our investor is only interested in those at the moment. I have defined a function to convert the currency formats to numeric, which can be reused for any project in similar context.

#Function to convert price from $-chr format to integer
priceformat <- function(x) {
  as.numeric(gsub('\\$|,', '', x))
}

#Retaining only required columns
Airbnb_NY_raw <- Airbnb %>%
  filter(bedrooms == 2) %>%
  select(c(1,41,44,60:63,65:69,72:75)) %>%
  mutate_at(c(2,3) , as.factor) %>%
  mutate_at(c(5:8,10), priceformat) 

Below is a glimpse of the filtered data from Airbnb dataset.

## [1] 4894   16
id neighbourhood_group_cleansed zipcode square_feet price weekly_price monthly_price cleaning_fee guests_included extra_people minimum_nights maximum_nights availability_30 availability_60 availability_90 availability_365
9513511 Bronx 10462 NA 130 NA NA NA 3 0 7 1125 0 0 0 0
5046189 Bronx 10469 NA 150 NA NA 75 2 25 2 28 22 50 75 346
4357134 Queens 11102 NA 200 NA NA NA 1 0 1 1125 0 0 0 0
16027061 Queens 11102 NA 250 NA NA NA 1 0 1 7 30 60 90 180

I have taken the neighbourhood, zipcode, squarefeet of the property - which could play an imporant role in the pricing. Along with these, the weekly and monthly prices are chosen as they are discounted in general for long stays. The availability columns indicate the status of bookings before respective number of days.

Dealing with Missing Values

Before diving any deeepr, we need to format the data and identify any potential problems with it. Starting with identifying the missing values for each variable.

As it can be seen from the above bar plot, there are many number of missing values for square feet, monthly price and weekly price. Since the number of missing values is higher than 90%, it is impratical to recode and use these variables. In addition to dropping these variables, I have also deleted 62 observations which do not have any zipcodes. As it is only 62 observations that do not have a zipcodes, my judgement is to drop them as they only constitue to 0.01% of the entire data. Also, to ensure that the zipcodes length is not any longer than 5 characters, I have instituted a trim function to omit white spaces and additional hyphenated values that are attached to zipcode.

#Dropping variables with lots of missing data
Airbnb_NY <- Airbnb_NY_raw[-c(4,6,7)] %>%
  filter(zipcode != is.na(zipcode)) %>%
  mutate_at('zipcode', strtrim, 5) %>%
  mutate_at('zipcode', factor)

Imputing Missing Values

Now that most of the variables are taken care of, lets take a closer look at cleaning fee. Cleaning fee has around 25% of the missing values, and is almost mandatory for all the bookings. Considering that it could play a part in calculating revenues, we retain this variable, by replacing the missing values with mean of the entire column, making sure that the distribution of whole dataset is minimally disturbed.

#Imputing Missing values with mean
Airbnb_NY$cleaning_fee <- as.numeric(impute(Airbnb_NY$cleaning_fee,mean))

Checking for Outliers

Building a boxplot with whiskers is the best way to visualize for outliers. Based on summary and preliminary analysis, I found that below are the variables with huge number of outliers. The ones that catch our eye are a high price of 10000$ and a cleaning fee of 600$ as seen in the maximum values in the summary, far from respective thrid quantile values.

##      price         cleaning_fee    guests_included   extra_people   
##  Min.   :  10.0   Min.   :  0.00   Min.   : 1.000   Min.   :  0.00  
##  1st Qu.: 135.0   1st Qu.: 60.00   1st Qu.: 1.000   1st Qu.:  0.00  
##  Median : 190.0   Median : 85.76   Median : 1.000   Median : 10.00  
##  Mean   : 232.2   Mean   : 85.76   Mean   : 2.148   Mean   : 17.32  
##  3rd Qu.: 275.0   3rd Qu.:100.00   3rd Qu.: 3.250   3rd Qu.: 25.00  
##  Max.   :9999.0   Max.   :600.00   Max.   :15.000   Max.   :300.00  
##  minimum_nights    maximum_nights  
##  Min.   :  1.000   Min.   :   1.0  
##  1st Qu.:  2.000   1st Qu.:  30.0  
##  Median :  2.000   Median :1125.0  
##  Mean   :  4.181   Mean   : 679.1  
##  3rd Qu.:  4.000   3rd Qu.:1125.0  
##  Max.   :180.000   Max.   :9000.0

Imputing the Outliers

Similar to what we did for the missing values of cleaning fee, we impute the outliers with median value of the respective column, for all observations with abnormal values. This step ensures that the data is not skewed to one direction and yields us better results.

outlier <- function(x) {
  x[x < quantile(x,0.25) - 1.5 * IQR(x) | 
      x > quantile(x,0.75) + 1.5 * IQR(x)] <- median(x)
  x
}
Airbnb_NY[4:9] <- lapply(Airbnb_NY[4:9], outlier)

The final cleaned and processed data for Airbnb looks like this. Summary of the dataset also shows that there no outliers and missing values in the cleaned dataset.

##        id           neighbourhood_group_cleansed    zipcode         price      
##  Min.   :    5107   Bronx        :  53           11211  : 289   Min.   : 10.0  
##  1st Qu.: 4424452   Brooklyn     :2131           11238  : 187   1st Qu.:135.0  
##  Median : 9648892   Manhattan    :2284           10002  : 168   Median :190.0  
##  Mean   : 9531250   Queens       : 334           11221  : 165   Mean   :198.2  
##  3rd Qu.:14808537   Staten Island:  30           10009  : 156   3rd Qu.:250.0  
##  Max.   :18519989                                11216  : 147   Max.   :485.0  
##                                                  (Other):3720                  
##   cleaning_fee    guests_included  extra_people   minimum_nights 
##  Min.   :  0.00   Min.   :1.000   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 60.00   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:2.000  
##  Median : 85.76   Median :1.000   Median :10.00   Median :2.000  
##  Mean   : 81.62   Mean   :2.119   Mean   :13.83   Mean   :2.575  
##  3rd Qu.:100.00   3rd Qu.:3.000   3rd Qu.:25.00   3rd Qu.:3.000  
##  Max.   :160.00   Max.   :6.000   Max.   :60.00   Max.   :7.000  
##                                                                  
##  maximum_nights   availability_30  availability_60 availability_90
##  Min.   :   1.0   Min.   : 0.000   Min.   : 0.00   Min.   : 0.0   
##  1st Qu.:  30.0   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.0   
##  Median :1125.0   Median : 2.000   Median : 9.00   Median :20.0   
##  Mean   : 675.9   Mean   : 6.409   Mean   :16.67   Mean   :29.3   
##  3rd Qu.:1125.0   3rd Qu.:10.000   3rd Qu.:29.00   3rd Qu.:54.0   
##  Max.   :1365.0   Max.   :30.000   Max.   :60.00   Max.   :90.0   
##                                                                   
##  availability_365
##  Min.   :  0.0   
##  1st Qu.:  0.0   
##  Median :102.0   
##  Mean   :139.1   
##  3rd Qu.:273.0   
##  Max.   :365.0   
## 
id neighbourhood_group_cleansed zipcode price cleaning_fee guests_included extra_people minimum_nights maximum_nights availability_30 availability_60 availability_90 availability_365
9513511 Bronx 10462 130 85.75786 3 0 7 1125 0 0 0 0
5046189 Bronx 10469 150 75.00000 2 25 2 28 22 50 75 346
4357134 Queens 11102 200 85.75786 1 0 1 1125 0 0 0 0
16027061 Queens 11102 250 85.75786 1 0 1 7 30 60 90 180

In addition to the above addressed issues, there are more issues with the data like incorrect and improper State/City names, inconsistent values in various columns, abrupt text mixed with data in some of the descriptive columns etc., However, since we do not have to look at those columns for this case, I chose to ignore them and more forward with the below dataset.

Cost Data

Now that we have two cleaned datasets, we can start working on the polished data, to build models and to generate more data as needed. One of our main agendas with the Zillow dataset is to obtain cost of the properties as of today. As we already have data of the asset value of properties located in several regions until mid-2017, we can predict current value of these properties with the help of time series forecasting.

Preparing data for Time Series

To model the data using Timeseries, we need to create a Time series object that has data regarding each dependent variable over a period of time. For simplicity, I have built a model using univariate time series.

Zillow_model <- Zillow %>%  
   filter(City == "New York") %>%
  rename(zipcode = RegionName) %>%
  select(-c(1,3:148)) %>%
  gather(Time, Price, -zipcode) %>%
  arrange(zipcode)

The dataset can be viewed below.

zipcode Time Price
10003 2008-01 1481900
10003 2008-02 1478800
10003 2008-03 1509000
10003 2008-04 1532700

Separating time series data for each zipcode

After, I reduced the dataset into just 3 columns, I separated the data by zip. As prices at each zipcode are not dependent on prices at other places, and as we only have 25 zipcodes, creating a time series for each indidvidual zipcode seemed plausible.

Zipcodes <- unique(Zillow_model$zipcode)
Zillow_zip <- ARIMA_fit <- Forecast_Price <- Price_plot <- vector("list", 7)

#Loop to separate data for all 25 zip codes
for (i in seq_along(Zipcodes)) {
    Zillow_zip[[i]] <- Zillow_model %>%
    filter(zipcode == Zipcodes[i]) 
    Zillow_zip[[i]] <- ts(Zillow_zip[[i]]['Price'],
                      start = c(2008,1), frequency = 12) 
}
names(Zillow_zip) <- Zipcodes # For clearer representation

After the for loop is executed, we have 25 objects in the list Zillow_zip with time series objects for respective zip codes. A sample time series object for the first zip code can be viewed below

#sample time series object for zip - 10003
Zillow_zip[1]
## $`10003`
##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2008 1481900 1478800 1509000 1532700 1524300 1520600 1533200 1556600 1586000
## 2009 1627200 1648100 1614400 1553000 1486000 1417000 1375100 1361300 1333400
## 2010 1227800 1202600 1207100 1220100 1252400 1285700 1278300 1279100 1326700
## 2011 1381300 1380700 1368500 1372700 1378000 1361700 1357800 1364400 1358000
## 2012 1348500 1349500 1352200 1354100 1351900 1364200 1376600 1384200 1387900
## 2013 1435300 1460300 1466500 1458100 1465500 1502300 1563900 1592000 1596200
## 2014 1718500 1734300 1748600 1763700 1766700 1772200 1762700 1736700 1712400
## 2015 1716300 1720500 1721800 1741800 1775800 1796500 1821500 1870100 1901000
## 2016 1932200 1936700 1945200 1935600 1911200 1918700 1947600 1951300 1932800
## 2017 1915700 1916500 1965700 2045300 2109100 2147000                        
##          Oct     Nov     Dec
## 2008 1595200 1578900 1587000
## 2009 1299700 1296500 1273900
## 2010 1376500 1368900 1366000
## 2011 1329800 1317800 1333200
## 2012 1404200 1419200 1425700
## 2013 1625200 1672300 1699500
## 2014 1703700 1702500 1708800
## 2015 1904900 1914000 1926400
## 2016 1930400 1937500 1935100
## 2017

Predictive Modelling using ARIMA

Since I can use my univariate time series created above for forecasting, I am going to apply the the most popular time series forecasting method - ARIMA (Autoregressive integrated moving average). Built-in and customized R functions such as auto.arima, forecast and autoplot from the forecast package are readily available to model the data and perform the complicated processes.

Below for loop is completely resuable, even if there are multiple zipcodes added to the dataset or if we event to model the same for a different city. The function coded below is building an ARIMA model for each zipcode, forecasting the price for respective zip code for 30 months, which is until December 2019 and then plotting the results with 95% confidence interval.

#ARIMA model
for (i in seq_along(Zipcodes)) {
  ARIMA_fit[[i]] <- auto.arima(Zillow_zip[[i]], approximation = F,trace = F)
  Forecast_Price[[i]] <- forecast(ARIMA_fit[[i]],level = c(95), h = 30)
  Price_plot[[i]] <- autoplot( Forecast_Price[[i]]) + 
    theme(legend.position="none", axis.text.x = element_blank(),
          axis.ticks.x = element_blank(), axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),plot.title = element_text(size = 7)) + theme +
    labs(title = paste("Zipcode - ", Zipcodes[i]),x = "", y = "") 
}

Forecasted prices using ARIMA for all the zipcodes can be viewed below.

Extracting Cost Data

Now that we have cost data for all zip codes from July 2017 to December 2019, we need to extract the same from lists and make it accessible by storing it in the data frame format. Below code cleans up the list of time series objects and combines to a data frame.

#Cost Data
Zillow_Cost1 <- list()
Zillow_Cost2 <- data.frame()
for (i in seq_along(Zipcodes)) {
Zillow_Cost1[[i]] <- data.frame(cbind(
                    zipcode = Zipcodes[i],                  
                    New_price = round(Forecast_Price[[i]]$mean[7:30]), 
                     Year = rep(2018:2019, each = 12),
                     Month = rep(month.name[1:12], times = 2)
                    ))
Zillow_Cost2 <- rbind(Zillow_Cost2,Zillow_Cost1[[i]])
}

Zillow_Cost <- left_join(Zillow_Cost2,unique(Zillow_NY[1:3]), by = "zipcode") %>%
  mutate(New_price = as.numeric(paste(New_price)))

I have filtered the data to only show forecasted prices for 2018 and 2019. Below is a snapshot of the same.

zipcode New_price Year Month CountyName SizeRank
10003 2365663 2018 January New York 21
10003 2393296 2018 February New York 21
10003 2397258 2018 March New York 21
10003 2386068 2018 April New York 21

Since the purchasing and Real estate decisions could take even months to finalise the decisions, I have added below code to filter and draw forecasted price for any month until end of 2019. At any point during the year, client may use the simple filter below and view prices for respective month.

#to obtain values by Month required
Zillow_April_2019 <- Zillow_Cost %>%
    filter(Year == 2019 & Month ==  "April") %>%
    select(-c(Year,Month)) %>%
 unique()

Sample of April 2019 data.

zipcode New_price CountyName SizeRank
10003 2555948 New York 21
10011 2996088 New York 15
10013 3571885 New York 1744
10014 2567540 New York 379
10021 2146272 New York 190
10022 2491790 New York 894
10023 2538860 New York 3
10025 1707666 New York 1
10028 2668846 New York 109
10036 1824162 New York 580
10128 2523582 New York 14
10303 440313 Richmond 4647
10304 424480 Richmond 1958
10305 549036 Richmond 2087
10306 449000 Richmond 668
10308 566181 Richmond 4149
10309 488557 Richmond 3682
10312 407871 Richmond 764
10314 439918 Richmond 68
11201 1611478 Kings 32
11215 1224740 Kings 71
11217 1542713 Kings 1555
11231 1290337 Kings 1817
11234 492619 Kings 52
11434 452679 Queens 622

Revenue Data

Now that we have the cost data ready, we have to create new data to calculate revenues and then profits. By applying the below function to cleaned Airbnb dataset, we can generate data 3 more columns that show Rent Revenues, Cleaning Fee Revenues and Total Revenues.

Reuse of code for changing parameters

As we have made assumptions for the occupancy rate (75%) and revenue from cleaning fee (25%), I built below function to ensure that in the model would be least disrupted if there is a new update or if client wishes to see more variations in these rates for his analysis.

Simply inputting thr occupancy rate and clean rate( Revenue % from cleaning fee), new data can be generated.

clean_rate <- Occ_rate <- NULL
Revenues <- function(Occ_rate,clean_rate){
  rate <- Occ_rate*365
  Data <- Airbnb_NY %>%
    mutate( Clean_Rev = cleaning_fee*rate,
           Rent_Rev = price*rate,
           Tot_Rev = Clean_Rev + Rent_Rev)
}

Below is the revenue data I have generated by populating values with 75% occupancy rate and assuming that 25% returns would be gained from cleaning fee.

Rev_Data <- Revenues(Occ_rate = 0.75,clean_rate = 0.25) %>%
  group_by(zipcode,neighbourhood_group_cleansed ) %>%
  summarise( Total_Listings = n(),
            Mean_Tot_Rev = mean(Tot_Rev),
            Mean_Rent_Rev = mean(Rent_Rev),
            Mean_Clean_Rev = mean(Clean_Rev),
            Mean_avail_30 = mean(availability_30),
            Mean_avail_60 = mean(availability_60),
            Mean_avail_365 = mean(availability_365)) %>%
  arrange(-Mean_Tot_Rev)

I have created several new variables as it could be seen below. Metadata for these varaibles is as follows. All the variables are populated for each zipcode

  • Total Listings - Number of listings
  • Mean Revenues - Total, Cleaning and Rent revenues
  • Mean availabilities - Availability of listing within 30,60,365 days
zipcode neighbourhood_group_cleansed Total_Listings Mean_Tot_Rev Mean_Rent_Rev Mean_Clean_Rev Mean_avail_30 Mean_avail_60 Mean_avail_365
10111 Manhattan 1 147825.0 123187.50 24637.50 16.00000 37.00000 342.0000
10460 Bronx 1 139820.0 116343.75 23476.22 28.00000 58.00000 363.0000
10280 Manhattan 6 114463.5 88056.25 26407.26 14.16667 33.66667 206.1667
10069 Manhattan 4 113830.3 91090.31 22739.97 15.00000 30.00000 185.7500
11109 Queens 5 105367.2 82891.50 22475.74 10.60000 23.80000 57.6000
10017 Manhattan 25 104378.4 75949.20 28429.20 9.32000 18.08000 143.3200

Using the above data, I visualized the Revenue generated by each neighbourhood, scattered by zipcode. As shown in the below grpah, Manhattan is clearly generating highest revenues and is also the largest group in contrast to Staten Island, which has very few listings.

Overall, except for a few outliers, the revenues generated are around 1.5M$.

Also, plotting how availability would give us an understanding on which areas are most busy and which ones are less busy. From the below figure, clearly, Queens has the most availibility, implying that the bookings in the area are slow compared to other neighbourhoods. Manhattan tops this list as well, being least available and hence most occupied.

Drawing Inslights

In the previous sections, analysis was done separately on both Zillow and Airbnb Dataset and now we combine both of them to draw many more insights for our client.

#Combining both the Datasets
Master_data <- inner_join(Zillow_April_2019,Rev_Data, by = "zipcode") %>%
    rename(neighbourhood = neighbourhood_group_cleansed) %>%
  mutate(profit_year00 = Mean_Tot_Rev - New_price,
         profit_year10 = Mean_Tot_Rev*10 - New_price,
         profit_year25 = Mean_Tot_Rev*20 - New_price,
         profit_year30 = Mean_Tot_Rev*30 - New_price,
         profit_year40 = Mean_Tot_Rev*40 - New_price,
         profit_year50 = Mean_Tot_Rev*50 - New_price)

Our final dataset, after joining both the indidvidual datasets has 22 unique zipcodes and all the data as shown below. * The profit columns show the net profit before taxes, gained/lost by the investor, within next 50 years.

zipcode New_price CountyName SizeRank neighbourhood Total_Listings Mean_Tot_Rev Mean_Rent_Rev Mean_Clean_Rev Mean_avail_30 Mean_avail_60 Mean_avail_365 profit_year00 profit_year10 profit_year25 profit_year30 profit_year40 profit_year50
10003 2555948 New York 21 Manhattan 134 87927.94 64169.86 23758.08 6.343284 16.67164 130.2537 -2468020 -1676669 -797389.2 81890.16 961169.6 1840449
10011 2996088 New York 15 Manhattan 102 96519.59 71132.06 25387.53 6.588235 16.66667 124.2745 -2899568 -2030892 -1065696.3 -100500.41 864695.5 1829891
10013 3571885 New York 1744 Manhattan 82 96624.96 71395.34 25229.63 9.414634 23.21951 159.5366 -3475260 -2605635 -1639385.7 -673136.10 293113.5 1259363
10014 2567540 New York 379 Manhattan 95 99624.58 74364.91 25259.68 5.515790 13.84211 116.7158 -2467915 -1571294 -575048.3 421197.53 1417443.4 2413689

Plotting the profits for next 50 years, for all available zipcodes, we get the below line plot. Zipcode `10128’ seems to have the highest profits, 50 years from now. Also, notice how there is a jump in profits right around the 25th year, denoting the change to profits at large, once costs are covered.

Now only considering the profit after 50 years, below is a chart, orderd based on returns, 50 years from now. Investing altogether in Manhattan would be a great idea for higher yields.

As we also need to keep in mind the current trend of bookings in Aitbnb, based on locality, before making a purchase, I created a simple bar chart that would show the total number of listings per neighbourhood. Staten Island and Queens are closely competing suggesting that there is some scope for growth in those areas. In sync with earlier findings, Manhattan obviously has the highest numer of listings.

Before wrapping up, I have also considered mapping the neighbourhoods based on how popolated they are. These ranks are taken from the Zillow dataset and are unique for each zipcode. Manhattan and Brooklyn dominate other two regions by population. Staten Island seems to be the least populated area with least rankings.

Final Recommendations

As a consultant, I would strongly recommend investing in Manhattan, which has the highest number of listings, is most populated and the most profitable of all other areas. Specifically investing in zip codes 10036, 10011, 10025 and 10003 would be most promising, as all of them also have more than 100 listings.

Though zipcode 10312 is amongst the top 5 profitable areas, investing there is not recommended as the area only had one listing. It is way too risky to invest there based on just one account data.

Investing in Brooklyn at 11217 and 11215 would also be a good strategy, as the prices are low now. Eventhough the profits are not at the highest compared to other Manhattan zipcodes, I believe it is best to grab it before the demand goes too high in the future. Especially, since Brooklyn seems to be catching up with Manhattan, with its competetive number of listings.

Futurescope

Since we are not considering any inflation (discount as 0%) or loan rate, the calculation for profits and revenue is straight forward. As it is for any big investments, it takes a bunch of years to yield profits.

Scope within the available code

The project is coded many functions and is reusable, allowing the client to come back and test several scenarios as and when needed. A couple examples for the same would be

  • Ocupancy Rate - If the client or our team determines a different occupancy rate
  • Cleaning Revenue - If the revenue generated from cleaning is updates
  • Insights in the final section are drawn considering new price for April 2019. With slight change in the filter, we can easily obtain the same results for any month in 2018 or 2018.

Scope beyond the current data and model

We can also add a few more parameters to the project to make our predictions more accurate, though they would make it complex.

  • Additonal Guests - Though information regarding additional guests allowed by the host is provided, I could not find any data related to the number of additional guests that actually stayed per each listing. I think this factor in heavily to our revenue data
  • Customized occupancy Rate - As we already have information regarding the availability for each listing, we could add an additional variable to calculate the occupancy rate for each zipcode. Occupancy rate in Busy areas like Manhattan would definitely be higher than 75% wheras it could be towards the lower side for less popular areas.