We have taken publicly available data from Airbnb and Zillow to investigate purchasing properties in which Zipcodes would yield most profits for our client. Our investors would outsource their properties through AirBnB and are interested to make purchases only in New York City. Our client has already mentioned that 2 Bedroom properties would be most profitable according to their analysis. Based on few assumptions and the raw data available to us, the following analyses has been made.
Please click on each of the tabs bellow to navigate through the report.
To accomplish the results of this project, I worked with two data sets, as mentioned below. Metadata for these datasets is already given and information for newly created columns would be highlighted as needed.
Zillow Dataset - provdies information of property costs from April 1996 to June 2017
AirBnB Dataset - Raw Data taken from Airbnb that has all the information related to Airbnb bookings, locations, prices, availability etc.,
Below are the R packages used for the purpose of this project.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forecast)
library(reshape2)
library(Hmisc)
library(gridExtra)
library(kableExtra)
Have used below theme for all the plots visualised in the report. Also, for better presentation and control over data, I have used scrollbars, which are also defined in the below snippet.
#them for plotting
theme <- theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"))
#function for scrollbars
scrollbar <- function(dataf, rows_disp = 4){
kable(head(dataf, n = rows_disp)) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "100%", height = "230px")
}
From the datafile provided initially, data is loaded into R, and as a first step of cleaning the data, I have filtered out the observations that are specific to New York City. For better understanding of data,I have also combined almost 200+ columns into several rows with new variables indication month and year.
Zillow <- read_csv("data/Zip_Zhvi_2bedroom.csv")
#Loading data
Zillow_NY <- Zillow %>%
gather(Time, Price, colnames(Zillow[-c(1:7)])) %>%
filter(City == "New York") %>%
rename(zipcode = RegionName) %>%
select(-c(1,3:5)) %>%
separate(Time, c("Year", "Month")) %>%
mutate_at(-c(3,6) , as.factor) %>%
mutate(Month = as.factor(month.name[Month]))
This dataset is pretty clean and the summary of variables can be seen below. From the below dimensions it is quite clear that the dataset is pretty clean and organised. There are not many issues with the data in regards to improrper formatting and there arent any observed outliers.
| zipcode | CountyName | SizeRank | Year | Month | Price |
|---|---|---|---|---|---|
| 10025 | New York | 1 | 1996 | April | NA |
| 10023 | New York | 3 | 1996 | April | NA |
| 10128 | New York | 14 | 1996 | April | NA |
| 10011 | New York | 15 | 1996 | April | NA |
## zipcode CountyName SizeRank Year Month
## 10003 : 255 Kings :1275 Min. : 1 1997 : 300 April : 550
## 10011 : 255 New York:2805 1st Qu.: 52 1998 : 300 June : 550
## 10013 : 255 Queens : 255 Median : 580 1999 : 300 May : 550
## 10014 : 255 Richmond:2040 Mean :1045 2000 : 300 August : 525
## 10021 : 255 3rd Qu.:1744 2001 : 300 December: 525
## 10022 : 255 Max. :4647 2002 : 300 February: 525
## (Other):4845 (Other):4575 (Other) :3150
## Price
## Min. : 99200
## 1st Qu.: 277700
## Median : 425100
## Mean : 779289
## 3rd Qu.:1306000
## Max. :3347100
## NA's :1696
The new data is now condensed to 6375 observations of just 6 variables from 8946 observations of 252 variables.
Eventhough, the dataset looks pretty clean, there are about 1600 missing values for price. I have broken down the number of missing price values for each zipcode and the same can be visualised below.
## `summarise()` ungrouping output (override with `.groups` argument)
The zipcodes in Kings and New York counties, which are probably new localities have lot of missing price data. This makes sense becuase, the area would not have been commercialised as early as 1994 and hence no data was available.
To better get a look at the prices for several zipcodes, I have visulaized the prices of all zipcodes, averaging by year, from 2008. As it can be seen below, the trend is mostly increasing, which makes sense too as the prices usually increase along with time. Also, it is important to nore that some zipcodes like 10003 and 10011 stand out really far from the crowd, indicating that they are amongst the costliest neighbourhoods.
Due to the bulk size of the Airbnb Dataset, I directly downloaded it from the link provided. As shown below, the gzfile function is used to decompress the zip file and read the comma separated file format.
temp <- tempfile()
download.file("http://data.insideairbnb.com/united-states/ny/new-york-city/2017-05-02/data/listings.csv.gz", temp)
Airbnb <- read.csv(gzfile(temp), as.is = TRUE, na.strings = c("","NA"))
unlink(temp)
Airbnb Dataset is directly downloaded from their webiste and like any real world data, it needs to be cleaned and checked thoroughly before processing the same. The dataset has around 95 variables with a lot of unwanted and unrelated data. Below is a quick glance at the original Airbnb dataset.
## [1] 40753 95
| id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | notes | transit | access | interaction | house_rules | thumbnail_url | medium_url | picture_url | xl_picture_url | host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | street | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | city | state | zipcode | market | smart_location | country_code | country | latitude | longitude | is_location_exact | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | amenities | square_feet | price | weekly_price | monthly_price | security_deposit | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | requires_license | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7949480 | https://www.airbnb.com/rooms/7949480 | 2.01705e+13 | 2017-05-03 | City Island Sanctuary relaxing BR & Bath w Parking | Come relax on City Island in our quiet guest room w/adjacent private bath. Enjoy a Continental Breakfast served by Didier at your table downstairs. Walk to restaurants, shops, the park or the beach. Park one car on our private driveway. If you are looking to explore Manhattan you can take the 29 bus to the nearest subway number 6. Or you can take the BXM8 express bus at the Pelham Bay Subway station which goes directly into Manhattan. There are also green taxis at the subway. | On parle français et anglais, (lire Français ci-dessous). I am a native New Yorker and Didier grew up in Biarritz, France. Our three friendly cats do not go into the room. The bathroom is yours during your entire stay (we have our own en suite). We will provide a Continental breakfast with fresh bread butter & jams, fresh squeezed orange juice or fresh fruit, tea or coffee served by my French husband who is a talented cook! You might even get a special treat of canelés de Bordeaux, banana bread or home made scones! City Island Sanctuary relaxing BR & Bath w Parking Pièce calme, ensoleillée et à proximité d’une salle de bain privée. Marchez jusqu, au parc ou à la plage. On peut aller à pied à quelques restaurants, une petite supermarché, petits magasins et le City Island Nautical Museum. Prenez le metro ou le bus express à Manhattan (une heure et demi à centre-ville. Les chats sont en résidence, mais ne vont pas dans la chambre. Cable TV avec TV5 Monde, lecteur DVD, Wi-fi . Petit | Come relax on City Island in our quiet guest room w/adjacent private bath. Enjoy a Continental Breakfast served by Didier at your table downstairs. Walk to restaurants, shops, the park or the beach. Park one car on our private driveway. If you are looking to explore Manhattan you can take the 29 bus to the nearest subway number 6. Or you can take the BXM8 express bus at the Pelham Bay Subway station which goes directly into Manhattan. There are also green taxis at the subway. On parle français et anglais, (lire Français ci-dessous). I am a native New Yorker and Didier grew up in Biarritz, France. Our three friendly cats do not go into the room. The bathroom is yours during your entire stay (we have our own en suite). We will provide a Continental breakfast with fresh bread butter & jams, fresh squeezed orange juice or fresh fruit, tea or coffee served by my French husband who is a talented cook! You might even get a special treat of canelés de Bordeaux, banana bread or home mad | none | City Island is a unique sanctuary in New York City’s the Bronx. It boasts many wonderful seafood restaurants, two great pubs and the only true French bistro in the borough. We are walking distance to the largest park in the City of New York, Pelham Bay Park and also to lovely Orchard Beach, the Riviera of the Bronx! | We are walkable to Pelham Park which is the largest park in NYC. It has picnic areas, tennis courts, horseback riding, Orchard Beach and a lovely rocky coast. Check out their website for all there is to do! | On the island, you can walk to everything. A local bus will take you to shopping at the Mall at Bay Plaza near Coop City. If you want to go to Manhattan you take the local 29 bus to the subway No. 6 to Manhattan. The local bus does not run after midnight but there are green taxis at the subway station. There is an express bus early in the morning which goes directly from the island to Manhattan. It will take at least one hour to get to midtown by subway and a bit longer on the express bus depending upon traffic. | Feel free to enjoy our large backyard in season, our downstairs large recreation area with TV, WIFI. Walk to the local beaches for the view or a dip. Please, no food or drinks except water in the guest bedroom. | We speak fluent French and English. We will be in residence during your stay to help you with recommendations for a real New York experience or a unique City Island French experience. But, we do respect your privacy! | No extra guests or visitors who are not in the reservation. Maximum 2 registered guests only. No late check ins or check outs. Please, no food or drinks in the guest bedroom except water. No smoking on the property. It’s a residential neighborhood so we keep it quiet after 10. Please only inquire about booking for yourself, we will not accept 3rd party inquiries. | https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=small | https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=medium | https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=large | https://a0.muscache.com/im/pictures/001d1cb0-6ef2-4074-bd39-e486ee58d00f.jpg?aki_policy=x_large | 119445 | https://www.airbnb.com/users/show/119445 | Linda & Didier | 2010-05-06 | New York, New York, United States |
I am a designer, former instructor at a design school and my husband is a talented French cook (he is from France). He is also an artist, writer and a Zen practitioner. We are sociable but respectful of your privacy. I am a native New Yorker having relocated back to NYC from So Cal because I Love New York! We are the only French/American Airbnb in the Bronx! |
within an hour | 100% | N/A | t | https://a0.muscache.com/im/pictures/2de30fe4-bebe-4145-a86a-b8a8a970078d.jpg?aki_policy=profile_small | https://a0.muscache.com/im/pictures/2de30fe4-bebe-4145-a86a-b8a8a970078d.jpg?aki_policy=profile_x_medium | City Island | 1 | 1 | [‘email’, ‘phone’, ‘reviews’, ‘jumio’] | t | t | City Island, Bronx, NY 10464, United States | City Island | City Island | Bronx | Bronx | NY | 10464 | New York | Bronx, NY | US | United States | 40.85205 | -73.78868 | t | House | Private room | 2 | 1 | 1 | 1 | Real Bed | {“Cable TV”,“Wireless Internet”,“Air conditioning”,“Free parking on premises”,Breakfast,“Pets live on this property”,Cat(s),“Indoor fireplace”,Heating,“Smoke detector”,“Carbon monoxide detector”,“Fire extinguisher”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron,“translation missing: en.hosting_amenity_49”,“translation missing: en.hosting_amenity_50”} | NA | $99.00 | NA | NA | $100.00 | NA | 1 | $20.00 | 1 | 7 | yesterday | NA | 24 | 54 | 80 | 170 | 2017-05-03 | 25 | 2016-01-18 | 2017-04-23 | 100 | 10 | 10 | 10 | 10 | 10 | 10 | f | NA | NA | f | moderate | t | t | 1 | 1.59 |
| 16042478 | https://www.airbnb.com/rooms/16042478 | 2.01705e+13 | 2017-05-04 | WATERFRONT STUDIO APARTMENT | My place is close to Sea Shore. You’ll love my place because of the coziness, the location, the views, and the people. My place is good for couples, solo adventurers, business travelers, and families (with kids). | (URL HIDDEN) | My place is close to Sea Shore. You’ll love my place because of the coziness, the location, the views, and the people. My place is good for couples, solo adventurers, business travelers, and families (with kids). (URL HIDDEN) public transport from Grand Central, NYC or from JFK, Laguadia is about 30 minutes away. (URL HIDDEN) Fine sea food restaurant, bars and night clubs all within walking distance, Sailing, fishing, canoeing, marinas and just 30 minutes away from central NYC or major airports like JFK or LaGuardia airports. Enjoy NYC | none | (URL HIDDEN) | Fine sea food restaurant, bars and night clubs all within walking distance, Sailing, fishing, canoeing, marinas and just 30 minutes away from central NYC or major airports like JFK or LaGuardia airports. Enjoy NYC | NA | public transport from Grand Central, NYC or from JFK, Laguadia is about 30 minutes away. | NA | NA | https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=small | https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=medium | https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=large | https://a0.muscache.com/im/pictures/bb5bc3c4-3eb2-4b90-a2c8-28401c6eabcc.jpg?aki_policy=x_large | 9117975 | https://www.airbnb.com/users/show/9117975 | Collins | 2013-09-29 | New York, New York, United States | I am married with 3 children and 2 grandkid. Family and community is the most important thing besides God. I am a Tax consultant and a I have travelled to a few countries including Germany, France, Holland, Netherlands, England in Europe. I have also travel extensively within the Caribbean and Mexico so I appreciate the importance of comfortable and affordable lodging when one is away. | a few days or more | 0% | N/A | f | https://a0.muscache.com/im/users/9117975/profile_pic/1423794446/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/9117975/profile_pic/1423794446/original.jpg?aki_policy=profile_x_medium | City Island | 1 | 1 | [‘phone’, ‘facebook’] | t | f | City Island, Bronx, NY 10464, United States | City Island | City Island | Bronx | Bronx | NY | 10464 | New York | Bronx, NY | US | United States | 40.85349 | -73.78861 | t | Apartment | Private room | 4 | 1 | 1 | 1 | Real Bed | {TV,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,Heating,“Family/kid friendly”,“Smoke detector”,“Carbon monoxide detector”,“First aid kit”,“Fire extinguisher”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron} | NA | $200.00 | NA | NA | NA | NA | 1 | $0.00 | 7 | 29 | 6 months ago | NA | 30 | 60 | 90 | 180 | 2017-05-04 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | f | NA | NA | t | flexible | f | f | 1 | NA |
| 1886820 | https://www.airbnb.com/rooms/1886820 | 2.01705e+13 | 2017-05-04 | Quaint City Island Community. | Quiet island boating town on Long Island Sound. 10 miles from Manhattan. Public trans. Walk to shops, restaurants, bars,parks. Near Botanical gardens, Bronx Zoo and Fordham University. Water view. off street parking. 3BR, 3 Full Baths. Washer/dryer. full kitchen.Fully furnished comfortable home with deck/backyard. local caretaker. | Master bed with queen bed, full bath and office. Mid bedroom has queen bed. Small bed has single. TV room/den queen sleeper sofa. Galley kitchen off dining room. Piano. Winter water view. Large back deck with seating. Small front porch. Great for family of 4. Will fit 7. Additional$25.00 per person per night over 4 people. | Quiet island boating town on Long Island Sound. 10 miles from Manhattan. Public trans. Walk to shops, restaurants, bars,parks. Near Botanical gardens, Bronx Zoo and Fordham University. Water view. off street parking. 3BR, 3 Full Baths. Washer/dryer. full kitchen.Fully furnished comfortable home with deck/backyard. local caretaker. Master bed with queen bed, full bath and office. Mid bedroom has queen bed. Small bed has single. TV room/den queen sleeper sofa. Galley kitchen off dining room. Piano. Winter water view. Large back deck with seating. Small front porch. Great for family of 4. Will fit 7. Additional$25.00 per person per night over 4 people. 2 story 1800 sq ft house all to yourselves. Local caretaker available during stay. We will be available by phone Small New England type town in the middle of the big city. Small Island in the Long Island Sound. Local marinas, seafood restaurants, yacht clubs, supermarket, library, shops. Parking available in driveway. Bus stop | none | Small New England type town in the middle of the big city. Small Island in the Long Island Sound. Local marinas, seafood restaurants, yacht clubs, supermarket, library, shops. | NA | Parking available in driveway. Bus stop around the corner. Near Orchard Beach, Bronx Zoo, World Famous Botanical Gardens, Fordham University and of course, Manhattan | 2 story 1800 sq ft house all to yourselves. | Local caretaker available during stay. We will be available by phone | No smoking. No pets. This is our private home. We expect that you would treat it with care and enjoy it as we do. | NA | NA | https://a0.muscache.com/im/pictures/26266928/14c688d9_original.jpg?aki_policy=large | NA | 9815788 | https://www.airbnb.com/users/show/9815788 | Steve | 2013-11-04 | US | NA | N/A | N/A | N/A | f | https://a0.muscache.com/im/users/9815788/profile_pic/1383845917/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/9815788/profile_pic/1383845917/original.jpg?aki_policy=profile_x_medium | City Island | 1 | 1 | [‘email’, ‘phone’] | t | f | City Island, Bronx, NY 10464, United States | City Island | City Island | Bronx | Bronx | NY | 10464 | New York | Bronx, NY | US | United States | 40.84114 | -73.78305 | t | House | Entire home/apt | 4 | 3 | 3 | 3 | Real Bed | {TV,“Cable TV”,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,“Free parking on premises”,Heating,Washer,Dryer} | NA | $300.00 | NA | NA | $800.00 | $100.00 | 4 | $25.00 | 7 | 90 | 11 months ago | NA | 30 | 60 | 90 | 365 | 2017-05-04 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | f | NA | NA | f | strict | f | f | 1 | NA |
| 6627449 | https://www.airbnb.com/rooms/6627449 | 2.01705e+13 | 2017-05-05 | Large 1 BDRM in Great location | This ground floor apartment is light and airy with a fully equipped kitchen. Located in the heart of City Island w/ shops, restaurants and transportation nearby. If needed there is extra sleeping space on the modular sofa. | We are close to fishing, boating, biking, horseback riding or hiking. We are near Orchard Beach and Pelham Bay Park. This is a very convenient area. | This ground floor apartment is light and airy with a fully equipped kitchen. Located in the heart of City Island w/ shops, restaurants and transportation nearby. If needed there is extra sleeping space on the modular sofa. We are close to fishing, boating, biking, horseback riding or hiking. We are near Orchard Beach and Pelham Bay Park. This is a very convenient area. The entire apartment is available as I am only there part of the time. There is a patio in back off the living room. You are welcome to help yourself to coffee or tea. My daughter and her family live on the Island and will be there to greet you and help you out with any information you might like. City Island is a unique and a hidden gem of New York City. Many New Yorkers do not even know it exists. It is known for its seafood and sailing, with many seafood restaurants in various price ranges. We are close to grocery stores, library, City Island Diner, antique stores and art galleries. It is a very safe and walk | none | City Island is a unique and a hidden gem of New York City. Many New Yorkers do not even know it exists. It is known for its seafood and sailing, with many seafood restaurants in various price ranges. We are close to grocery stores, library, City Island Diner, antique stores and art galleries. It is a very safe and walkable neighborhood. City Island is only 11/2 miles long and (3) blocks wide at its widest, with 2 seafood restaurants at the end with outdoor do it yourself seating | There is off street parking in the apartment complex. I have a reserved space which you would be able to use. | We are (2) blocks from the BX29 bus which you can take to the Pelham Bay subway station. ( a 10 minute ride) From there you can transfer to the #6 train to Manhattan; about 45 minutes. | The entire apartment is available as I am only there part of the time. There is a patio in back off the living room. You are welcome to help yourself to coffee or tea. | My daughter and her family live on the Island and will be there to greet you and help you out with any information you might like. | There is no smoking and no pets. Otherwise use the place as you would your own. | https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=small | https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=medium | https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=large | https://a0.muscache.com/im/pictures/83539171/13fe7e85_original.jpg?aki_policy=x_large | 13886510 | https://www.airbnb.com/users/show/13886510 | Arlene | 2014-04-04 | Los Angeles, California, United States | NA | within a few hours | 100% | N/A | f | https://a0.muscache.com/im/users/13886510/profile_pic/1433129533/original.jpg?aki_policy=profile_small | https://a0.muscache.com/im/users/13886510/profile_pic/1433129533/original.jpg?aki_policy=profile_x_medium | City Island | 1 | 1 | [‘email’, ‘phone’, ‘reviews’, ‘kba’] | t | t | City Island, City Island, NY 10464, United States | City Island | City Island | Bronx | City Island | NY | 10464 | New York | City Island, NY | US | United States | 40.84977 | -73.78661 | t | Apartment | Entire home/apt | 3 | 1 | 1 | 1 | Real Bed | {TV,Internet,“Wireless Internet”,“Air conditioning”,Kitchen,“Free parking on premises”,“Buzzer/wireless intercom”,Heating,“Family/kid friendly”,Washer,Dryer,“Smoke detector”,“Carbon monoxide detector”,Essentials,Shampoo,Hangers,“Hair dryer”,Iron} | NA | $125.00 | $775.00 | NA | NA | $75.00 | 1 | $0.00 | 3 | 21 | 2 weeks ago | NA | 8 | 30 | 60 | 335 | 2017-05-05 | 12 | 2015-07-04 | 2016-10-24 | 93 | 10 | 10 | 10 | 10 | 10 | 10 | f | NA | NA | f | strict | f | f | 1 | 0.54 |
As per the requirements, I have filtered and cleaned the AIrbnb data as coded below. First, I have filtered out the 2 BHK properties as our investor is only interested in those at the moment. I have defined a function to convert the currency formats to numeric, which can be reused for any project in similar context.
#Function to convert price from $-chr format to integer
priceformat <- function(x) {
as.numeric(gsub('\\$|,', '', x))
}
#Retaining only required columns
Airbnb_NY_raw <- Airbnb %>%
filter(bedrooms == 2) %>%
select(c(1,41,44,60:63,65:69,72:75)) %>%
mutate_at(c(2,3) , as.factor) %>%
mutate_at(c(5:8,10), priceformat)
Below is a glimpse of the filtered data from Airbnb dataset.
## [1] 4894 16
| id | neighbourhood_group_cleansed | zipcode | square_feet | price | weekly_price | monthly_price | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | availability_30 | availability_60 | availability_90 | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9513511 | Bronx | 10462 | NA | 130 | NA | NA | NA | 3 | 0 | 7 | 1125 | 0 | 0 | 0 | 0 |
| 5046189 | Bronx | 10469 | NA | 150 | NA | NA | 75 | 2 | 25 | 2 | 28 | 22 | 50 | 75 | 346 |
| 4357134 | Queens | 11102 | NA | 200 | NA | NA | NA | 1 | 0 | 1 | 1125 | 0 | 0 | 0 | 0 |
| 16027061 | Queens | 11102 | NA | 250 | NA | NA | NA | 1 | 0 | 1 | 7 | 30 | 60 | 90 | 180 |
I have taken the neighbourhood, zipcode, squarefeet of the property - which could play an imporant role in the pricing. Along with these, the weekly and monthly prices are chosen as they are discounted in general for long stays. The availability columns indicate the status of bookings before respective number of days.
Before diving any deeepr, we need to format the data and identify any potential problems with it. Starting with identifying the missing values for each variable.
As it can be seen from the above bar plot, there are many number of missing values for square feet, monthly price and weekly price. Since the number of missing values is higher than 90%, it is impratical to recode and use these variables. In addition to dropping these variables, I have also deleted 62 observations which do not have any zipcodes. As it is only 62 observations that do not have a zipcodes, my judgement is to drop them as they only constitue to 0.01% of the entire data. Also, to ensure that the zipcodes length is not any longer than 5 characters, I have instituted a trim function to omit white spaces and additional hyphenated values that are attached to zipcode.
#Dropping variables with lots of missing data
Airbnb_NY <- Airbnb_NY_raw[-c(4,6,7)] %>%
filter(zipcode != is.na(zipcode)) %>%
mutate_at('zipcode', strtrim, 5) %>%
mutate_at('zipcode', factor)
Imputing Missing Values
Now that most of the variables are taken care of, lets take a closer look at cleaning fee. Cleaning fee has around 25% of the missing values, and is almost mandatory for all the bookings. Considering that it could play a part in calculating revenues, we retain this variable, by replacing the missing values with mean of the entire column, making sure that the distribution of whole dataset is minimally disturbed.
#Imputing Missing values with mean
Airbnb_NY$cleaning_fee <- as.numeric(impute(Airbnb_NY$cleaning_fee,mean))
Building a boxplot with whiskers is the best way to visualize for outliers. Based on summary and preliminary analysis, I found that below are the variables with huge number of outliers. The ones that catch our eye are a high price of 10000$ and a cleaning fee of 600$ as seen in the maximum values in the summary, far from respective thrid quantile values.
## price cleaning_fee guests_included extra_people
## Min. : 10.0 Min. : 0.00 Min. : 1.000 Min. : 0.00
## 1st Qu.: 135.0 1st Qu.: 60.00 1st Qu.: 1.000 1st Qu.: 0.00
## Median : 190.0 Median : 85.76 Median : 1.000 Median : 10.00
## Mean : 232.2 Mean : 85.76 Mean : 2.148 Mean : 17.32
## 3rd Qu.: 275.0 3rd Qu.:100.00 3rd Qu.: 3.250 3rd Qu.: 25.00
## Max. :9999.0 Max. :600.00 Max. :15.000 Max. :300.00
## minimum_nights maximum_nights
## Min. : 1.000 Min. : 1.0
## 1st Qu.: 2.000 1st Qu.: 30.0
## Median : 2.000 Median :1125.0
## Mean : 4.181 Mean : 679.1
## 3rd Qu.: 4.000 3rd Qu.:1125.0
## Max. :180.000 Max. :9000.0
Imputing the Outliers
Similar to what we did for the missing values of cleaning fee, we impute the outliers with median value of the respective column, for all observations with abnormal values. This step ensures that the data is not skewed to one direction and yields us better results.
outlier <- function(x) {
x[x < quantile(x,0.25) - 1.5 * IQR(x) |
x > quantile(x,0.75) + 1.5 * IQR(x)] <- median(x)
x
}
Airbnb_NY[4:9] <- lapply(Airbnb_NY[4:9], outlier)
The final cleaned and processed data for Airbnb looks like this. Summary of the dataset also shows that there no outliers and missing values in the cleaned dataset.
## id neighbourhood_group_cleansed zipcode price
## Min. : 5107 Bronx : 53 11211 : 289 Min. : 10.0
## 1st Qu.: 4424452 Brooklyn :2131 11238 : 187 1st Qu.:135.0
## Median : 9648892 Manhattan :2284 10002 : 168 Median :190.0
## Mean : 9531250 Queens : 334 11221 : 165 Mean :198.2
## 3rd Qu.:14808537 Staten Island: 30 10009 : 156 3rd Qu.:250.0
## Max. :18519989 11216 : 147 Max. :485.0
## (Other):3720
## cleaning_fee guests_included extra_people minimum_nights
## Min. : 0.00 Min. :1.000 Min. : 0.00 Min. :1.000
## 1st Qu.: 60.00 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:2.000
## Median : 85.76 Median :1.000 Median :10.00 Median :2.000
## Mean : 81.62 Mean :2.119 Mean :13.83 Mean :2.575
## 3rd Qu.:100.00 3rd Qu.:3.000 3rd Qu.:25.00 3rd Qu.:3.000
## Max. :160.00 Max. :6.000 Max. :60.00 Max. :7.000
##
## maximum_nights availability_30 availability_60 availability_90
## Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 30.0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0
## Median :1125.0 Median : 2.000 Median : 9.00 Median :20.0
## Mean : 675.9 Mean : 6.409 Mean :16.67 Mean :29.3
## 3rd Qu.:1125.0 3rd Qu.:10.000 3rd Qu.:29.00 3rd Qu.:54.0
## Max. :1365.0 Max. :30.000 Max. :60.00 Max. :90.0
##
## availability_365
## Min. : 0.0
## 1st Qu.: 0.0
## Median :102.0
## Mean :139.1
## 3rd Qu.:273.0
## Max. :365.0
##
| id | neighbourhood_group_cleansed | zipcode | price | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | availability_30 | availability_60 | availability_90 | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9513511 | Bronx | 10462 | 130 | 85.75786 | 3 | 0 | 7 | 1125 | 0 | 0 | 0 | 0 |
| 5046189 | Bronx | 10469 | 150 | 75.00000 | 2 | 25 | 2 | 28 | 22 | 50 | 75 | 346 |
| 4357134 | Queens | 11102 | 200 | 85.75786 | 1 | 0 | 1 | 1125 | 0 | 0 | 0 | 0 |
| 16027061 | Queens | 11102 | 250 | 85.75786 | 1 | 0 | 1 | 7 | 30 | 60 | 90 | 180 |
In addition to the above addressed issues, there are more issues with the data like incorrect and improper State/City names, inconsistent values in various columns, abrupt text mixed with data in some of the descriptive columns etc., However, since we do not have to look at those columns for this case, I chose to ignore them and more forward with the below dataset.
Now that we have two cleaned datasets, we can start working on the polished data, to build models and to generate more data as needed. One of our main agendas with the Zillow dataset is to obtain cost of the properties as of today. As we already have data of the asset value of properties located in several regions until mid-2017, we can predict current value of these properties with the help of time series forecasting.
To model the data using Timeseries, we need to create a Time series object that has data regarding each dependent variable over a period of time. For simplicity, I have built a model using univariate time series.
Zillow_model <- Zillow %>%
filter(City == "New York") %>%
rename(zipcode = RegionName) %>%
select(-c(1,3:148)) %>%
gather(Time, Price, -zipcode) %>%
arrange(zipcode)
The dataset can be viewed below.
| zipcode | Time | Price |
|---|---|---|
| 10003 | 2008-01 | 1481900 |
| 10003 | 2008-02 | 1478800 |
| 10003 | 2008-03 | 1509000 |
| 10003 | 2008-04 | 1532700 |
After, I reduced the dataset into just 3 columns, I separated the data by zip. As prices at each zipcode are not dependent on prices at other places, and as we only have 25 zipcodes, creating a time series for each indidvidual zipcode seemed plausible.
Zipcodes <- unique(Zillow_model$zipcode)
Zillow_zip <- ARIMA_fit <- Forecast_Price <- Price_plot <- vector("list", 7)
#Loop to separate data for all 25 zip codes
for (i in seq_along(Zipcodes)) {
Zillow_zip[[i]] <- Zillow_model %>%
filter(zipcode == Zipcodes[i])
Zillow_zip[[i]] <- ts(Zillow_zip[[i]]['Price'],
start = c(2008,1), frequency = 12)
}
names(Zillow_zip) <- Zipcodes # For clearer representation
After the for loop is executed, we have 25 objects in the list Zillow_zip with time series objects for respective zip codes. A sample time series object for the first zip code can be viewed below
#sample time series object for zip - 10003
Zillow_zip[1]
## $`10003`
## Jan Feb Mar Apr May Jun Jul Aug Sep
## 2008 1481900 1478800 1509000 1532700 1524300 1520600 1533200 1556600 1586000
## 2009 1627200 1648100 1614400 1553000 1486000 1417000 1375100 1361300 1333400
## 2010 1227800 1202600 1207100 1220100 1252400 1285700 1278300 1279100 1326700
## 2011 1381300 1380700 1368500 1372700 1378000 1361700 1357800 1364400 1358000
## 2012 1348500 1349500 1352200 1354100 1351900 1364200 1376600 1384200 1387900
## 2013 1435300 1460300 1466500 1458100 1465500 1502300 1563900 1592000 1596200
## 2014 1718500 1734300 1748600 1763700 1766700 1772200 1762700 1736700 1712400
## 2015 1716300 1720500 1721800 1741800 1775800 1796500 1821500 1870100 1901000
## 2016 1932200 1936700 1945200 1935600 1911200 1918700 1947600 1951300 1932800
## 2017 1915700 1916500 1965700 2045300 2109100 2147000
## Oct Nov Dec
## 2008 1595200 1578900 1587000
## 2009 1299700 1296500 1273900
## 2010 1376500 1368900 1366000
## 2011 1329800 1317800 1333200
## 2012 1404200 1419200 1425700
## 2013 1625200 1672300 1699500
## 2014 1703700 1702500 1708800
## 2015 1904900 1914000 1926400
## 2016 1930400 1937500 1935100
## 2017
Since I can use my univariate time series created above for forecasting, I am going to apply the the most popular time series forecasting method - ARIMA (Autoregressive integrated moving average). Built-in and customized R functions such as auto.arima, forecast and autoplot from the forecast package are readily available to model the data and perform the complicated processes.
Below for loop is completely resuable, even if there are multiple zipcodes added to the dataset or if we event to model the same for a different city. The function coded below is building an ARIMA model for each zipcode, forecasting the price for respective zip code for 30 months, which is until December 2019 and then plotting the results with 95% confidence interval.
#ARIMA model
for (i in seq_along(Zipcodes)) {
ARIMA_fit[[i]] <- auto.arima(Zillow_zip[[i]], approximation = F,trace = F)
Forecast_Price[[i]] <- forecast(ARIMA_fit[[i]],level = c(95), h = 30)
Price_plot[[i]] <- autoplot( Forecast_Price[[i]]) +
theme(legend.position="none", axis.text.x = element_blank(),
axis.ticks.x = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(),plot.title = element_text(size = 7)) + theme +
labs(title = paste("Zipcode - ", Zipcodes[i]),x = "", y = "")
}
Forecasted prices using ARIMA for all the zipcodes can be viewed below.
Now that we have cost data for all zip codes from July 2017 to December 2019, we need to extract the same from lists and make it accessible by storing it in the data frame format. Below code cleans up the list of time series objects and combines to a data frame.
#Cost Data
Zillow_Cost1 <- list()
Zillow_Cost2 <- data.frame()
for (i in seq_along(Zipcodes)) {
Zillow_Cost1[[i]] <- data.frame(cbind(
zipcode = Zipcodes[i],
New_price = round(Forecast_Price[[i]]$mean[7:30]),
Year = rep(2018:2019, each = 12),
Month = rep(month.name[1:12], times = 2)
))
Zillow_Cost2 <- rbind(Zillow_Cost2,Zillow_Cost1[[i]])
}
Zillow_Cost <- left_join(Zillow_Cost2,unique(Zillow_NY[1:3]), by = "zipcode") %>%
mutate(New_price = as.numeric(paste(New_price)))
I have filtered the data to only show forecasted prices for 2018 and 2019. Below is a snapshot of the same.
| zipcode | New_price | Year | Month | CountyName | SizeRank |
|---|---|---|---|---|---|
| 10003 | 2365663 | 2018 | January | New York | 21 |
| 10003 | 2393296 | 2018 | February | New York | 21 |
| 10003 | 2397258 | 2018 | March | New York | 21 |
| 10003 | 2386068 | 2018 | April | New York | 21 |
Since the purchasing and Real estate decisions could take even months to finalise the decisions, I have added below code to filter and draw forecasted price for any month until end of 2019. At any point during the year, client may use the simple filter below and view prices for respective month.
#to obtain values by Month required
Zillow_April_2019 <- Zillow_Cost %>%
filter(Year == 2019 & Month == "April") %>%
select(-c(Year,Month)) %>%
unique()
Sample of April 2019 data.
| zipcode | New_price | CountyName | SizeRank |
|---|---|---|---|
| 10003 | 2555948 | New York | 21 |
| 10011 | 2996088 | New York | 15 |
| 10013 | 3571885 | New York | 1744 |
| 10014 | 2567540 | New York | 379 |
| 10021 | 2146272 | New York | 190 |
| 10022 | 2491790 | New York | 894 |
| 10023 | 2538860 | New York | 3 |
| 10025 | 1707666 | New York | 1 |
| 10028 | 2668846 | New York | 109 |
| 10036 | 1824162 | New York | 580 |
| 10128 | 2523582 | New York | 14 |
| 10303 | 440313 | Richmond | 4647 |
| 10304 | 424480 | Richmond | 1958 |
| 10305 | 549036 | Richmond | 2087 |
| 10306 | 449000 | Richmond | 668 |
| 10308 | 566181 | Richmond | 4149 |
| 10309 | 488557 | Richmond | 3682 |
| 10312 | 407871 | Richmond | 764 |
| 10314 | 439918 | Richmond | 68 |
| 11201 | 1611478 | Kings | 32 |
| 11215 | 1224740 | Kings | 71 |
| 11217 | 1542713 | Kings | 1555 |
| 11231 | 1290337 | Kings | 1817 |
| 11234 | 492619 | Kings | 52 |
| 11434 | 452679 | Queens | 622 |
Now that we have the cost data ready, we have to create new data to calculate revenues and then profits. By applying the below function to cleaned Airbnb dataset, we can generate data 3 more columns that show Rent Revenues, Cleaning Fee Revenues and Total Revenues.
As we have made assumptions for the occupancy rate (75%) and revenue from cleaning fee (25%), I built below function to ensure that in the model would be least disrupted if there is a new update or if client wishes to see more variations in these rates for his analysis.
Simply inputting thr occupancy rate and clean rate( Revenue % from cleaning fee), new data can be generated.
clean_rate <- Occ_rate <- NULL
Revenues <- function(Occ_rate,clean_rate){
rate <- Occ_rate*365
Data <- Airbnb_NY %>%
mutate( Clean_Rev = cleaning_fee*rate,
Rent_Rev = price*rate,
Tot_Rev = Clean_Rev + Rent_Rev)
}
Below is the revenue data I have generated by populating values with 75% occupancy rate and assuming that 25% returns would be gained from cleaning fee.
Rev_Data <- Revenues(Occ_rate = 0.75,clean_rate = 0.25) %>%
group_by(zipcode,neighbourhood_group_cleansed ) %>%
summarise( Total_Listings = n(),
Mean_Tot_Rev = mean(Tot_Rev),
Mean_Rent_Rev = mean(Rent_Rev),
Mean_Clean_Rev = mean(Clean_Rev),
Mean_avail_30 = mean(availability_30),
Mean_avail_60 = mean(availability_60),
Mean_avail_365 = mean(availability_365)) %>%
arrange(-Mean_Tot_Rev)
I have created several new variables as it could be seen below. Metadata for these varaibles is as follows. All the variables are populated for each zipcode
| zipcode | neighbourhood_group_cleansed | Total_Listings | Mean_Tot_Rev | Mean_Rent_Rev | Mean_Clean_Rev | Mean_avail_30 | Mean_avail_60 | Mean_avail_365 |
|---|---|---|---|---|---|---|---|---|
| 10111 | Manhattan | 1 | 147825.0 | 123187.50 | 24637.50 | 16.00000 | 37.00000 | 342.0000 |
| 10460 | Bronx | 1 | 139820.0 | 116343.75 | 23476.22 | 28.00000 | 58.00000 | 363.0000 |
| 10280 | Manhattan | 6 | 114463.5 | 88056.25 | 26407.26 | 14.16667 | 33.66667 | 206.1667 |
| 10069 | Manhattan | 4 | 113830.3 | 91090.31 | 22739.97 | 15.00000 | 30.00000 | 185.7500 |
| 11109 | Queens | 5 | 105367.2 | 82891.50 | 22475.74 | 10.60000 | 23.80000 | 57.6000 |
| 10017 | Manhattan | 25 | 104378.4 | 75949.20 | 28429.20 | 9.32000 | 18.08000 | 143.3200 |
Using the above data, I visualized the Revenue generated by each neighbourhood, scattered by zipcode. As shown in the below grpah, Manhattan is clearly generating highest revenues and is also the largest group in contrast to Staten Island, which has very few listings.
Overall, except for a few outliers, the revenues generated are around 1.5M$.
Also, plotting how availability would give us an understanding on which areas are most busy and which ones are less busy. From the below figure, clearly, Queens has the most availibility, implying that the bookings in the area are slow compared to other neighbourhoods. Manhattan tops this list as well, being least available and hence most occupied.
In the previous sections, analysis was done separately on both Zillow and Airbnb Dataset and now we combine both of them to draw many more insights for our client.
#Combining both the Datasets
Master_data <- inner_join(Zillow_April_2019,Rev_Data, by = "zipcode") %>%
rename(neighbourhood = neighbourhood_group_cleansed) %>%
mutate(profit_year00 = Mean_Tot_Rev - New_price,
profit_year10 = Mean_Tot_Rev*10 - New_price,
profit_year25 = Mean_Tot_Rev*20 - New_price,
profit_year30 = Mean_Tot_Rev*30 - New_price,
profit_year40 = Mean_Tot_Rev*40 - New_price,
profit_year50 = Mean_Tot_Rev*50 - New_price)
Our final dataset, after joining both the indidvidual datasets has 22 unique zipcodes and all the data as shown below. * The profit columns show the net profit before taxes, gained/lost by the investor, within next 50 years.
| zipcode | New_price | CountyName | SizeRank | neighbourhood | Total_Listings | Mean_Tot_Rev | Mean_Rent_Rev | Mean_Clean_Rev | Mean_avail_30 | Mean_avail_60 | Mean_avail_365 | profit_year00 | profit_year10 | profit_year25 | profit_year30 | profit_year40 | profit_year50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10003 | 2555948 | New York | 21 | Manhattan | 134 | 87927.94 | 64169.86 | 23758.08 | 6.343284 | 16.67164 | 130.2537 | -2468020 | -1676669 | -797389.2 | 81890.16 | 961169.6 | 1840449 |
| 10011 | 2996088 | New York | 15 | Manhattan | 102 | 96519.59 | 71132.06 | 25387.53 | 6.588235 | 16.66667 | 124.2745 | -2899568 | -2030892 | -1065696.3 | -100500.41 | 864695.5 | 1829891 |
| 10013 | 3571885 | New York | 1744 | Manhattan | 82 | 96624.96 | 71395.34 | 25229.63 | 9.414634 | 23.21951 | 159.5366 | -3475260 | -2605635 | -1639385.7 | -673136.10 | 293113.5 | 1259363 |
| 10014 | 2567540 | New York | 379 | Manhattan | 95 | 99624.58 | 74364.91 | 25259.68 | 5.515790 | 13.84211 | 116.7158 | -2467915 | -1571294 | -575048.3 | 421197.53 | 1417443.4 | 2413689 |
Plotting the profits for next 50 years, for all available zipcodes, we get the below line plot. Zipcode `10128’ seems to have the highest profits, 50 years from now. Also, notice how there is a jump in profits right around the 25th year, denoting the change to profits at large, once costs are covered.
Now only considering the profit after 50 years, below is a chart, orderd based on returns, 50 years from now. Investing altogether in Manhattan would be a great idea for higher yields.
As we also need to keep in mind the current trend of bookings in Aitbnb, based on locality, before making a purchase, I created a simple bar chart that would show the total number of listings per neighbourhood. Staten Island and Queens are closely competing suggesting that there is some scope for growth in those areas. In sync with earlier findings, Manhattan obviously has the highest numer of listings.
Before wrapping up, I have also considered mapping the neighbourhoods based on how popolated they are. These ranks are taken from the Zillow dataset and are unique for each zipcode. Manhattan and Brooklyn dominate other two regions by population. Staten Island seems to be the least populated area with least rankings.
As a consultant, I would strongly recommend investing in Manhattan, which has the highest number of listings, is most populated and the most profitable of all other areas. Specifically investing in zip codes 10036, 10011, 10025 and 10003 would be most promising, as all of them also have more than 100 listings.
Though zipcode 10312 is amongst the top 5 profitable areas, investing there is not recommended as the area only had one listing. It is way too risky to invest there based on just one account data.
Investing in Brooklyn at 11217 and 11215 would also be a good strategy, as the prices are low now. Eventhough the profits are not at the highest compared to other Manhattan zipcodes, I believe it is best to grab it before the demand goes too high in the future. Especially, since Brooklyn seems to be catching up with Manhattan, with its competetive number of listings.
Since we are not considering any inflation (discount as 0%) or loan rate, the calculation for profits and revenue is straight forward. As it is for any big investments, it takes a bunch of years to yield profits.
The project is coded many functions and is reusable, allowing the client to come back and test several scenarios as and when needed. A couple examples for the same would be
We can also add a few more parameters to the project to make our predictions more accurate, though they would make it complex.