New York Property Analysis

Introduction

A real estate company plans to purchase properties in New York City and use Airbnb as the medium to lease out their investment property. The main objective of the real estate company is to find out the zip codes in New York City that would generate the most profit on short term rentals. We are looking to find out the best zip codes to invest within New York City.

Two data sets, Airbnb and Zillow are available. Zillow provides data on estimated value of two-bedroom properties. Airbnb dataset provides data on how much properties in certain neighborhoods rent out for in New York City.

Assumptions

  1. An occupancy rate of 75%
  1. Investor pays for the property upfront in cash
  1. Time value of money discount rate is 0%
  1. Rent of a ‘m’ bedroom property is m/n times of that of a ‘n’ bedroom property
  1. Indirect discounts due to bulk bookings and special promotional discounts were not taken into consideration
  2. Cleaning fee covers the cleaning costs and doesn’t have any contribution to the revenue

Data Preparation

Prior to the analysis , it is essential to properly load, examine and clean the data

============> =============> =============> =============> ==============> =============> ============>

Packages Required

To reproduce the results of this project you will need to load the following packages

library(tidyverse) # Collection of R packages designed for data science.
library(knitr) #A general-purpose tool for dynamic report generation in R
library(ggplot2) # Aesthetic depiction library
library(kableExtra) # Aesthetic depiction of tables in Rmarkdown
library(scales) # To work with axis in charts

Loading the data

The data for this project originated from the following source.

listings <- read.csv("listings.csv") #Airbnb listings
zillowdata <- read.csv("Zip_Zhvi_2bedroom.csv") # Median Price of twobedroom properties
NY_zip  <- read.csv("New York zipcodes.csv") #List of all the zipcodes in New York city

Listings Data Description

Information on the listings including - location, number of bedrooms, room types (entire home/private home/shared home)

Field Description
id Identifier used by AirBnB for the listing
listing_url URL used by AirBnB for their listing
scrape_id Identifier used internally by AirBnB to identify the date of the data
last_scraped Identifier used internally by AirBnB to identify when the data was last pulled.
name Header created by host to entice potential occupants to stay at their property
summary Description provided by the host of their property
space Description provided by host describing the space (bedrooms, bathrooms, sqft, etc) of their property
description Detailed information about the property, neighborhood, amenities, location, and anything else the host feels would be beneficial to share
experiences_offered Information about what experiences are offered; eg. Bikes, surfing, canoes, etc
neighborhood_overview Additional detail about the neighborhood including sites to see, restaurants, noise, landmarks around, etc.
notes Any other additional notes the host would like to disclose, which was not provided in the other description information.
transit Host can provide the kind of transportation available nearby including bus, trains, airport, etc..
access Description on how to enter the home, what area of the home is available for use, and what you are able items you are able to use during your stay.
interaction The availability of the host; could be in person, phone, email, text, etc. The host provides guidelines of what the occupant can expect for communication and if they expect to meet.
house_rules The host provides rules of the home that they expect the occupant to adhere to. Examples include clean up, laundry, dishes, etc.
thumbnail_url URL specific to AirBnB where the posted pictures are stored.
medium_url URL specific to AirBnB where the posted pictures are stored.
picture_url URL specific to AirBnB where the posted pictures are stored.
xl_picture_url URL specific to AirBnB where the posted pictures are stored.
host_id Identifier assigned to a particular host. A host can have multiple properties listed and would have the same unique identifier.
host_url URL assigned to the host for their listing.
host_name Name of the host.
host_since Date of when the host first initiated their account with AirBnB.
host_location City, State, and Country of where the host is located.
host_about Information provided by host to potential occupant.
host_response_time Time a potential occupant can expect to receive a response after inquiring about the property.
host_response_rate Percentage of historical responses
host_acceptance_rate Percentage of historical acceptances from potential occupants. This information is currently unavailable.
host_is_superhost Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.
host_thumbnail_url URL assigned to host where the pictures are posted
host_picture_url URL assigned to host where the pictures are posted
host_neighbourhood Name of the neighborhood the host is listing the property at
host_listings_count Number of properties host has listed on AirBnB currently
host_total_listings_count Number of properties host has ever listed on AirBnB
host_verifications Avenues a potential occupant can check on the host to verify reputation
host_has_profile_pic Indicator for profile picture provided (t = yes) (f = no)
host_identity_verified Further checks have been conducted by host to AirBnB to verify existence and reputation. (t = yes) (f = no)
street Street address where the property is located
neighbourhood Neighborhood where the property is located
neighbourhood_cleansed Verified neighborhood name where the property is located.
neighbourhood_group_cleansed Name of the area where the property is located.
city City where the property is located.
state State where the property is located.
zipcode Zip code where the property is located.
market State where the property is located
smart_location City and state together where the property is located.
country_code Country abbreviation where the property is located.
country Full country name where the property is located.
latitude The angular distance of a place north or south of the earth’s equator, expressed in degrees and minutes.
longitude The angular distance of a place east or west of the meridian at Greenwich, England, expressed in degrees and minutes.
is_location_exact Indicator to validate if the latitude and longitude is exact or not. (t==yes) (f = no)
property_type Indicates the type of house is this is considered. Examples are house, apartment, townhomes, etc…
room_type Indicates specific rooms available for rent and/or the entire home.
accommodates Indicates the amount of people the host is allowing to occupy the property.
bathrooms Indicates the number of bathrooms within the property.
bedrooms Indicates the number of bedrooms within the property.
beds Indicates the number of beds within the property for use.
bed_type Indicates the type of bed the property has. ie. futon, real bed, airbed, etc.
amenities Provides the potential occupant the type of convenience items included as part of rent. Items include cable television, wifi, toiletries, food, etc..
square_feet Square footage of the property or space for rent.
price Price the host is charging to stay per night.
weekly_price Weekly price the host is charging; this could be discounted for longer term stays.
monthly_price Monthly price the host is charging; this could be discounted for longer term stays.
security_deposit Amount the host would like upfront for any incidentals (should you damage anything during your stay).
cleaning_fee Price the host is charging to clean up after your stay.
guests_included Indicates the number of guests the host is allowing you to bring along.
extra_people Additional charge per additional guests you bring.
minimum_nights Minimum amount of nights the host is willing to rent out the property.
maximum_nights Maximum amount of nights the host is willing to rent out the property.
calendar_updated Indicates when the host has updated the availability of when the property is available to rent.
has_availability Indicator to show if the host has availability currently; not currently available at the moment.
availability_30 Indicates the number of days the property is available for rent within 30 days.
availability_60 Indicates the number of days the property is available for rent within 60 days.
availability_90 Indicates the number of days the property is available for rent within 90 days.
availability_365 Indicates the number of days the property is available for rent within 365 days.
calendar_last_scraped Date when the availability calendar was last pulled by AirBnB.
number_of_reviews Number of reviews received for the property for its entire existence within AirBnB.
first_review Date of when the first review was provided.
last_review Date of when the last review was provided.
review_scores_rating Overall score given based on accuracy, cleanliness, check-in, communication, location, and value.
review_scores_accuracy Score based on the accuracy of the description and pictures provided in the listing of the host.
review_scores_cleanliness Score based on the cleanliness of the property when occupant arrived property.
review_scores_checkin Score based on the ease of checkin when the occupant arrived to the property.
review_scores_communication Score based on the communication between host and occupant before and during their stay.
review_scores_location Score based on the location of the property.
review_scores_value Score based on the price charged for the property based on everything that is offered.
requires license Indicator to show if the host requires a license to stay at the property. (t= yes) (f= no)
license Host should require a license; this field would have the number.
jurisdiction_names Name of the jurisdiction of where the property is at. This filed is currently unavailable.
instant_bookable Indicates if the property can be booked immediately or is currently unavailable to book.(t= yes) (f= no)
cancellation_policy Indicates how flexible the host is for cancelling the reservation. Values range from flexible to superstrict.
require_guest_profile_picture Requires the occupant to provide a picture in order to reserve the property.
require_guest_phone_verification Requires the occupant to provide a valid phone number in order to reserve the property.
calculated_host_listings_count Total number of listings the host has on AirBnB.
reviews_per_month Number of reviews the host has received on average per month.

Zillow Data Description

Cost data to determine the average property price for 2 bedrooms

Field Description
RegionID Zillow assigned number only, assigned consecutively when the regions are defined.
RegionName Zip code of where the property is located.
City City of where the property is located.
State State of where the property is located.
Metro General name of the surrounding area where the property is located.
CountyName Political and administrative division of a state, referred to as a particular part of the state.
SizeRank Population of the area; the lower the number the greater the population.
1996-04 – XXXX-XX ((XXXX-XX indicates the latest date available at the time the data is pulled) Indicates the historical median price within that area.

NY_zip Description

Official list of all the zipcodes available in New York city Reference:New York Gov. Health Department

Structure of Listings Data

It is a dataframe with 48895 Observations and 106 Variables

## 'data.frame':    48895 obs. of  106 variables:
##  $ id                                          : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ listing_url                                 : Factor w/ 48895 levels "https://www.airbnb.com/rooms/10000070",..: 21404 21961 39206 39474 41416 41514 41543 41592 41625 41660 ...
##  $ scrape_id                                   : num  2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
##  $ last_scraped                                : Factor w/ 2 levels "2019-07-08","2019-07-09": 2 2 1 2 1 1 2 1 1 1 ...
##  $ name                                        : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
##  $ summary                                     : Factor w/ 43807 levels "","----This a listing for our photo studio.  The space does not offer overnight stays---- With an abundance of Sou"| __truncated__,..: 29689 13422 1 40376 1 22465 1 27616 26184 19463 ...
##  $ space                                       : Factor w/ 32171 levels "","---------------------------------------------------------------------------------------------------------------"| __truncated__,..: 18395 201 30500 8115 12328 9504 8648 31798 18480 16185 ...
##  $ description                                 : Factor w/ 46250 levels "","----------------------- THE APARTMENT ----------------------- **SUMMER ALERT:  Apartment is FULLY AIRCONDITIONE"| __truncated__,..: 31270 14116 44645 42618 21635 23689 16093 29099 27589 20535 ...
##  $ experiences_offered                         : Factor w/ 1 level "none": 1 1 1 1 1 1 1 1 1 1 ...
##  $ neighborhood_overview                       : Factor w/ 26905 levels "","''Stuyvesant hight'' is a trendy and safe neighborhood, their is plenty of coffee shops, restaurants and stores"| __truncated__,..: 4516 3890 1 10157 1 12131 1 21046 13722 8179 ...
##  $ notes                                       : Factor w/ 17018 levels "","'Minibar' is the name of the best app for beer, wine, and liquor delivery. 'Seamless' is the main food delivery"| __truncated__,..: 5563 1 1 1 1 10205 1 10272 2164 14126 ...
##  $ transit                                     : Factor w/ 28041 levels "","'very very close to L express to manhattan! JMZ on broadway one block away>  sometimes you hear the train runni"| __truncated__,..: 24517 4488 1 4789 1 8116 1 1 20931 7719 ...
##  $ access                                      : Factor w/ 23435 levels "","---------------------------------------------------------------------------------------------------------------"| __truncated__,..: 1 8448 1 22621 1 10294 1 2632 8780 8179 ...
##  $ interaction                                 : Factor w/ 24970 levels "","---------------------------------------------------------------------------------------------- ----------------"| __truncated__,..: 1 6204 1 19711 1 10508 1 1 16482 19266 ...
##  $ house_rules                                 : Factor w/ 25255 levels "","'No Shoes' in the apartment.",..: 6531 12547 23842 21886 17807 7595 1 15619 16771 15412 ...
##  $ thumbnail_url                               : logi  NA NA NA NA NA NA ...
##  $ medium_url                                  : logi  NA NA NA NA NA NA ...
##  $ picture_url                                 : Factor w/ 48272 levels "https://a0.muscache.com/im/pictures/00027b83-c89a-4a76-ac36-de9ab57e878e.jpg?aki_policy=large",..: 11889 45757 26944 43799 48081 1209 6720 3977 2921 17476 ...
##  $ xl_picture_url                              : logi  NA NA NA NA NA NA ...
##  $ host_id                                     : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_url                                    : Factor w/ 37457 levels "https://www.airbnb.com/users/show/1000014",..: 19513 19756 26531 27259 32826 33085 33159 35905 33430 33539 ...
##  $ host_name                                   : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
##  $ host_since                                  : Factor w/ 3587 levels "","2008-08-22",..: 4 5 14 15 19 20 21 31 22 24 ...
##  $ host_location                               : Factor w/ 1571 levels "","  Brooklyn, NY  ",..: 982 982 982 982 886 982 982 982 982 982 ...
##  $ host_about                                  : Factor w/ 22144 levels "","'Eric is a director-actor-writer-producer-teacher of theatre; he also writes songs, makes bread, and advocates "| __truncated__,..: 3047 751 17091 16419 13912 9517 107 15173 21518 14390 ...
##  $ host_response_time                          : Factor w/ 6 levels "","a few days or more",..: 6 5 4 5 3 6 3 5 3 5 ...
##  $ host_response_rate                          : Factor w/ 87 levels "","0%","10%",..: 4 74 4 80 87 4 87 77 87 4 ...
##  $ host_acceptance_rate                        : Factor w/ 2 levels "","N/A": 2 2 2 2 2 2 2 2 2 2 ...
##  $ host_is_superhost                           : Factor w/ 3 levels "","f","t": 2 2 2 2 2 2 2 2 2 3 ...
##  $ host_thumbnail_url                          : Factor w/ 37360 levels "","https://a0.muscache.com/defaults/user_pic-50x50.png?v=3",..: 6697 31587 34728 34835 36082 14713 5717 36888 36252 21825 ...
##  $ host_picture_url                            : Factor w/ 37360 levels "","https://a0.muscache.com/defaults/user_pic-225x225.png?v=3",..: 6697 31587 34728 34835 36082 14713 5717 36888 36252 21825 ...
##  $ host_neighbourhood                          : Factor w/ 416 levels "","Ã\230sterbro",..: 156 241 167 75 113 129 27 169 375 218 ...
##  $ host_listings_count                         : int  6 5 1 1 1 1 1 1 1 4 ...
##  $ host_total_listings_count                   : int  6 5 1 1 1 1 1 1 1 4 ...
##  $ host_verifications                          : Factor w/ 536 levels "['email', 'facebook', 'google', 'reviews', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']",..: 351 350 182 351 118 317 138 155 33 301 ...
##  $ host_has_profile_pic                        : Factor w/ 3 levels "","f","t": 3 3 3 3 3 3 3 3 3 3 ...
##  $ host_identity_verified                      : Factor w/ 3 levels "","f","t": 3 3 3 3 3 2 2 2 3 3 ...
##  $ street                                      : Factor w/ 315 levels " Astoria, NY, United States",..: 46 203 203 64 203 203 64 203 203 203 ...
##  $ neighbourhood                               : Factor w/ 195 levels "","Allerton",..: 20 106 83 20 53 117 20 106 178 105 ...
##  $ neighbourhood_cleansed                      : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ neighbourhood_group_cleansed                : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ city                                        : Factor w/ 306 levels ""," Astoria",..: 49 187 187 46 187 187 46 187 187 187 ...
##  $ state                                       : Factor w/ 8 levels "","CA","MP","New York ",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ zipcode                                     : Factor w/ 200 levels "","07093","07302",..: 123 20 29 142 31 18 121 21 27 5 ...
##  $ market                                      : Factor w/ 18 levels "","Adirondacks",..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ smart_location                              : Factor w/ 317 levels " Astoria, NY",..: 46 205 205 62 205 205 62 205 205 205 ...
##  $ country_code                                : Factor w/ 1 level "US": 1 1 1 1 1 1 1 1 1 1 ...
##  $ country                                     : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
##  $ latitude                                    : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                                   : num  -74 -74 -73.9 -74 -73.9 ...
##  $ is_location_exact                           : Factor w/ 2 levels "f","t": 1 1 2 2 2 1 1 2 2 2 ...
##  $ property_type                               : Factor w/ 36 levels "Aparthotel","Apartment",..: 2 2 2 19 2 2 2 2 2 2 ...
##  $ room_type                                   : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ accommodates                                : int  2 2 2 3 1 2 2 2 1 3 ...
##  $ bathrooms                                   : num  1 1 1 1 1 1 NA 1 1 1 ...
##  $ bedrooms                                    : int  1 0 1 1 NA 1 1 1 1 1 ...
##  $ beds                                        : int  1 1 1 4 1 1 1 1 1 2 ...
##  $ bed_type                                    : Factor w/ 5 levels "Airbed","Couch",..: 5 5 4 5 5 5 3 5 5 5 ...
##  $ amenities                                   : Factor w/ 44292 levels "{\"Air conditioning\",\"Carbon monoxide detector\",\"First aid kit\",\"Lock on bedroom door\",\"translation mis"| __truncated__,..: 9863 23810 339 7790 3243 9905 38040 20063 920 11136 ...
##  $ square_feet                                 : int  NA NA NA 500 NA NA NA NA NA NA ...
##  $ price                                       : Factor w/ 674 levels "$0.00","$1,000.00",..: 111 216 113 642 619 189 551 612 612 113 ...
##  $ weekly_price                                : Factor w/ 599 levels "","$1,000.00",..: 260 162 1 426 440 1 1 366 1 1 ...
##  $ monthly_price                               : Factor w/ 681 levels "","$1,000.00",..: 681 1 1 233 131 1 1 1 1 1 ...
##  $ security_deposit                            : Factor w/ 225 levels "","$0.00","$1,000.00",..: 32 148 95 178 32 137 169 1 1 2 ...
##  $ cleaning_fee                                : Factor w/ 207 levels "","$0.00","$1,000.00",..: 91 5 182 1 188 26 2 43 1 128 ...
##  $ guests_included                             : int  1 2 2 1 1 2 1 1 1 2 ...
##  $ extra_people                                : Factor w/ 104 levels "$0.00","$10.00",..: 51 1 29 1 29 3 45 8 1 2 ...
##  $ minimum_nights                              : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ maximum_nights                              : int  730 1125 7 730 120 21 730 14 14 1125 ...
##  $ minimum_minimum_nights                      : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ maximum_minimum_nights                      : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ minimum_maximum_nights                      : int  730 1125 7 730 120 21 730 14 14 1125 ...
##  $ maximum_maximum_nights                      : int  730 1125 7 730 120 21 730 14 14 1125 ...
##  $ minimum_nights_avg_ntm                      : num  1 1 3 1 10 3 45 2 2 1 ...
##  $ maximum_nights_avg_ntm                      : num  730 1125 7 730 120 ...
##  $ calendar_updated                            : Factor w/ 92 levels "1 week ago","10 months ago",..: 27 38 32 91 26 27 10 39 19 91 ...
##  $ has_availability                            : Factor w/ 1 level "t": 1 1 1 1 1 1 1 1 1 1 ...
##  $ availability_30                             : int  30 25 30 0 0 23 0 3 0 0 ...
##  $ availability_60                             : int  60 55 60 0 0 48 0 10 0 0 ...
##  $ availability_90                             : int  90 80 90 3 0 48 0 12 0 0 ...
##  $ availability_365                            : int  365 355 365 194 0 129 0 220 0 188 ...
##  $ calendar_last_scraped                       : Factor w/ 2 levels "2019-07-08","2019-07-09": 2 2 1 2 1 1 2 1 1 1 ...
##  $ number_of_reviews                           : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ number_of_reviews_ltm                       : int  2 11 0 69 4 9 0 43 0 12 ...
##  $ first_review                                : Factor w/ 2898 levels "","2009-03-12",..: 1587 23 1 1162 357 3 9 5 15 14 ...
##  $ last_review                                 : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
##  $ review_scores_rating                        : int  98 95 NA 90 93 89 90 84 98 94 ...
##  $ review_scores_accuracy                      : int  10 10 NA 10 10 10 8 9 10 10 ...
##  $ review_scores_cleanliness                   : int  10 9 NA 9 9 9 8 7 10 9 ...
##  $ review_scores_checkin                       : int  10 10 NA 10 10 10 10 10 10 10 ...
##  $ review_scores_communication                 : int  10 10 NA 10 10 10 10 9 10 10 ...
##  $ review_scores_location                      : int  10 10 NA 10 9 9 9 10 10 9 ...
##  $ review_scores_value                         : int  10 9 NA 9 10 9 9 9 10 9 ...
##  $ requires_license                            : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1 1 ...
##  $ license                                     : Factor w/ 10 levels "","11-3461723",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ jurisdiction_names                          : Factor w/ 6 levels "","{\"Connecticut State\"}",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ instant_bookable                            : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_business_travel_ready                    : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cancellation_policy                         : Factor w/ 6 levels "flexible","moderate",..: 2 4 4 2 4 4 4 4 1 4 ...
##   [list output truncated]

Structure of Zillow Data

It is a dataframe with 8946 obs. of 262 variables

## 'data.frame':    8946 obs. of  262 variables:
##  $ RegionID  : int  61639 84654 61637 84616 93144 84640 97564 91982 71831 84646 ...
##  $ RegionName: int  10025 60657 10023 60614 79936 60640 94109 77494 32162 60647 ...
##  $ City      : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 649 2702 649 1119 649 3479 672 3912 649 ...
##  $ State     : Factor w/ 48 levels "AK","AL","AR",..: 34 15 34 15 42 15 5 42 10 15 ...
##  $ Metro     : Factor w/ 467 levels "","Aberdeen",..: 292 76 292 76 125 76 368 188 415 76 ...
##  $ CountyName: Factor w/ 722 levels "Ada","Adams",..: 460 158 460 158 212 158 574 231 634 158 ...
##  $ SizeRank  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X1996.04  : int  NA 167700 NA 195800 59100 123300 336800 117300 72600 129600 ...
##  $ X1996.05  : int  NA 166400 NA 193500 60500 122600 337400 117900 74800 129800 ...
##  $ X1996.06  : int  NA 166700 NA 192600 60900 122000 337700 118800 77300 129900 ...
##  $ X1996.07  : int  NA 167200 NA 192300 60800 121500 338200 121100 80000 130600 ...
##  $ X1996.08  : int  NA 166900 NA 192600 60300 120900 339700 123200 81700 131100 ...
##  $ X1996.09  : int  NA 166900 NA 193600 60400 120600 340400 124600 82600 131200 ...
##  $ X1996.10  : int  NA 168000 NA 195500 61200 120900 339700 125800 83500 131000 ...
##  $ X1996.11  : int  NA 170100 NA 197600 61700 121300 338900 126400 84700 131400 ...
##  $ X1996.12  : int  NA 171700 NA 199400 61000 121600 338300 125600 85100 132000 ...
##  $ X1997.01  : int  NA 173000 NA 201300 60100 122100 340000 125100 84200 132100 ...
##  $ X1997.02  : int  NA 174600 NA 203600 59300 122900 345800 127000 82200 132200 ...
##  $ X1997.03  : int  NA 177600 NA 206500 59000 124200 350600 129400 79900 133100 ...
##  $ X1997.04  : int  NA 180100 NA 209200 58700 125300 351300 130400 78400 133500 ...
##  $ X1997.05  : int  NA 182300 NA 211100 58400 126100 352600 130400 77500 133300 ...
##  $ X1997.06  : int  NA 184400 NA 212600 58000 126700 357300 131200 76800 134100 ...
##  $ X1997.07  : int  NA 186300 NA 214400 57800 127900 363900 130600 77400 136100 ...
##  $ X1997.08  : int  NA 187600 NA 215600 57900 129300 370600 129100 78600 138700 ...
##  $ X1997.09  : int  NA 189400 NA 216500 57800 130400 374300 127600 79800 141300 ...
##  $ X1997.10  : int  NA 190300 NA 217900 57800 131300 375600 125800 80700 143100 ...
##  $ X1997.11  : int  NA 189700 NA 220100 58100 131700 381000 124500 81300 144100 ...
##  $ X1997.12  : int  NA 189800 NA 222200 58400 132300 389500 124600 81500 144800 ...
##  $ X1998.01  : int  NA 191900 NA 223900 58700 133500 395400 125100 81900 145300 ...
##  $ X1998.02  : int  NA 194500 NA 225400 59200 134500 400800 125500 82300 145800 ...
##  $ X1998.03  : int  NA 195500 NA 227700 59400 134800 409000 126000 82200 146100 ...
##  $ X1998.04  : int  NA 196000 NA 230100 58700 135200 414600 127200 81700 146500 ...
##  $ X1998.05  : int  NA 196900 NA 231700 58100 135500 418500 128700 81300 147500 ...
##  $ X1998.06  : int  NA 198900 NA 232700 57900 136300 424200 128300 81000 148500 ...
##  $ X1998.07  : int  NA 201400 NA 233700 57900 137700 430500 126300 80500 149200 ...
##  $ X1998.08  : int  NA 204600 NA 234700 57700 139600 437500 126000 80300 150000 ...
##  $ X1998.09  : int  NA 207900 NA 235600 57600 141600 448900 126500 80100 150500 ...
##  $ X1998.10  : int  NA 211800 NA 236800 57500 143400 460300 126300 79900 151100 ...
##  $ X1998.11  : int  NA 214600 NA 238800 57800 144500 461900 127400 80600 152000 ...
##  $ X1998.12  : int  NA 216000 NA 240800 58000 145600 461900 130600 82300 153700 ...
##  $ X1999.01  : int  NA 217500 NA 242400 58000 147400 466200 131600 83800 156300 ...
##  $ X1999.02  : int  NA 220200 NA 243800 57900 149200 470500 130400 84700 159200 ...
##  $ X1999.03  : int  NA 222800 NA 246400 57800 149900 470500 129900 85600 161000 ...
##  $ X1999.04  : int  NA 226200 NA 250200 57800 150500 473100 131900 86900 162300 ...
##  $ X1999.05  : int  NA 229600 NA 254300 58000 151700 480000 134600 88300 163500 ...
##  $ X1999.06  : int  NA 232400 NA 257600 58500 152800 487800 137400 88900 164700 ...
##  $ X1999.07  : int  NA 234400 NA 261100 58700 153900 496000 140600 89300 165700 ...
##  $ X1999.08  : int  NA 236300 NA 264800 59000 156400 506900 141800 91000 166800 ...
##  $ X1999.09  : int  NA 238300 NA 267900 59200 159400 516300 141200 93400 169600 ...
##  $ X1999.10  : int  NA 241800 NA 270700 59300 161800 522900 141800 95100 173700 ...
##  $ X1999.11  : int  NA 246100 NA 272800 59500 163800 533800 142800 95600 177200 ...
##  $ X1999.12  : int  NA 249500 NA 274400 59900 165700 549400 141700 95000 179100 ...
##  $ X2000.01  : int  NA 251300 NA 276200 60300 167100 564600 141200 95500 180900 ...
##  $ X2000.02  : int  NA 253200 NA 278600 60400 168400 577300 143000 98700 183600 ...
##  $ X2000.03  : int  NA 255700 NA 280100 60300 169600 592200 145600 101800 187200 ...
##  $ X2000.04  : int  NA 259200 NA 283100 60300 171000 610400 145500 102500 190700 ...
##  $ X2000.05  : int  NA 263100 NA 287700 60000 172400 626700 145400 102300 193500 ...
##  $ X2000.06  : int  NA 266600 NA 293600 59500 174800 640600 145000 103300 195700 ...
##  $ X2000.07  : int  NA 269500 NA 298500 59400 177900 654400 143000 103300 196500 ...
##  $ X2000.08  : int  NA 272800 NA 302700 59800 180400 663500 140800 101900 197800 ...
##  $ X2000.09  : int  NA 275500 NA 305000 59900 182300 664700 140000 101800 200200 ...
##  $ X2000.10  : int  NA 278800 NA 306800 59700 184600 670100 142200 102700 202700 ...
##  $ X2000.11  : int  NA 283400 NA 309400 59500 187700 679000 145400 102500 204800 ...
##  $ X2000.12  : int  NA 288600 NA 313100 59400 190700 676500 146400 103100 208000 ...
##  $ X2001.01  : int  NA 291300 NA 314900 59500 193100 663100 145900 105000 211800 ...
##  $ X2001.02  : int  NA 292400 NA 316200 59700 196100 656800 145800 105300 214400 ...
##  $ X2001.03  : int  NA 294600 NA 318200 59700 200000 658700 142400 104500 215700 ...
##  $ X2001.04  : int  NA 297100 NA 320600 59200 202900 661300 138200 104300 218100 ...
##  $ X2001.05  : int  NA 298200 NA 322900 58700 205500 661100 135200 103900 221100 ...
##  $ X2001.06  : int  NA 299800 NA 325500 58400 207600 658700 133600 103000 222100 ...
##  $ X2001.07  : int  NA 302000 NA 328400 57800 208600 653900 133500 103000 222100 ...
##  $ X2001.08  : int  NA 304200 NA 330700 57000 209200 646800 135400 103500 222900 ...
##  $ X2001.09  : int  NA 307900 NA 332800 56700 210000 641400 136800 103500 224500 ...
##  $ X2001.10  : int  NA 311000 NA 334400 56800 210200 636300 136500 103700 226700 ...
##  $ X2001.11  : int  NA 311400 NA 335900 56600 211300 630300 135300 104000 229700 ...
##  $ X2001.12  : int  NA 311000 NA 337400 56500 213000 628500 134600 104000 231500 ...
##  $ X2002.01  : int  NA 311700 NA 339700 56500 214100 633500 135400 104700 232200 ...
##  $ X2002.02  : int  NA 312300 NA 342300 56600 215200 637300 137700 106800 233100 ...
##  $ X2002.03  : int  NA 312000 NA 343800 56700 217600 640700 139500 109700 234300 ...
##  $ X2002.04  : int  NA 311800 NA 343400 56600 220200 647900 139900 112100 235100 ...
##  $ X2002.05  : int  NA 312600 NA 342300 56700 222400 655500 140600 114300 235500 ...
##  $ X2002.06  : int  NA 313000 NA 341800 57000 224600 662500 141600 116600 236400 ...
##  $ X2002.07  : int  NA 314400 NA 341700 57300 227000 670800 141200 118000 238100 ...
##  $ X2002.08  : int  NA 317300 NA 342400 57300 228600 677500 140600 117800 239300 ...
##  $ X2002.09  : int  NA 319700 NA 344300 57300 230100 687900 141900 117800 240400 ...
##  $ X2002.10  : int  NA 320500 NA 346900 57100 232400 698400 142700 118600 242100 ...
##  $ X2002.11  : int  NA 321000 NA 348900 56900 234300 701000 141700 119900 243700 ...
##  $ X2002.12  : int  NA 321600 NA 350200 56900 235300 698900 141700 121300 244900 ...
##  $ X2003.01  : int  NA 323800 NA 351700 57000 236200 697600 143800 121900 247000 ...
##  $ X2003.02  : int  NA 326100 NA 353500 56700 237000 691800 144100 121100 249500 ...
##  $ X2003.03  : int  NA 329000 NA 355700 56700 237900 684600 143000 120400 252100 ...
##  $ X2003.04  : int  NA 332200 NA 358000 57000 239200 681100 142200 120600 254300 ...
##  $ X2003.05  : int  NA 334700 NA 361600 57400 241300 685100 141400 121200 257200 ...
##  $ X2003.06  : int  NA 336000 NA 364000 57700 243600 692200 141000 121600 261300 ...
##  $ X2003.07  : int  NA 337300 NA 365500 58000 244200 697500 141400 122400 265400 ...
##  $ X2003.08  : int  NA 337500 NA 366400 58300 243800 704200 140800 123900 268400 ...
##  $ X2003.09  : int  NA 337100 NA 367000 58600 244500 710500 139400 125300 269600 ...
##  $ X2003.10  : int  NA 334900 NA 365200 58800 245500 717200 139400 126200 268700 ...
##  $ X2003.11  : int  NA 333100 NA 363100 59100 245700 732300 140200 127600 268100 ...
##   [list output truncated]

Filter New York Entries

New York Listings data

Since our analysis confines to the properties in the New York city, we can filter the corresponding entries in both “listings” and “zillowdata”

sum(duplicated(listings)) # Checking for duplicated entries
## [1] 0
sum(is.na(listings$zipcode)) # Checking for missing values in the zipcode column
## [1] 0
sum(is.na(listings$square_feet))/nrow(listings) # checking missing values in sq.ft column
## [1] 0.9916556

But looking at the unique values in the listings data would reveal that some missing values are disguised as blank spaces

unique(listings$zipcode) # Blank Spaces Spotted
##   [1] 11218        10018        10027        11238        10029       
##   [6] 10016        11216        10019        10025        10002       
##  [11] 10036        11215        10014        11211        11205       
##  [16] 10011        11217        10034        10003        10031       
##  [21] 11222                     11237        11226        10009       
##  [26] 10024        11225        11101        11221        11206       
##  [31] 10010        11249        10001        10162        10035       
##  [36] 10032        10075        11377        11201        11233       
##  [41] 11231        10023        10128        10040        11234       
##  [46] 11367        10028        11104        10065        10301       
##  [51] 10452        10006        10039        11385        10021       
##  [56] 11434        11379        10026        11105        10044       
##  [61] 10022        10013        11103        10473        10475       
##  [66] 11213        10468        11102        10038        10033       
##  [71] 11427        11693        10012        11375        10470       
##  [76] 10030        10453        11223        10469        11208       
##  [81] 11207        10451        11235        10314        11209       
##  [86] 11204        10305        11232        10017        11373       
##  [91] 11224        11372        11412        10004        11203       
##  [96] 11374        10466        11106        10304        11220       
## [101] 11230        11210        10303        11229        10463       
## [106] 11435        11416        10080        10005        10454       
## [111] 10455        11236        10280        10037        11355       
## [116] 11692        11411        10282        10307        11219       
## [121] 10467        11358        11361        10069        11415       
## [126] 11356        11214        11432        10464        11370       
## [131] 10302        11418        11433        11429        11422       
## [136] 11378        11421        10312        10460        10471       
## [141] 10462        10459        11417        10007        11360       
## [146] 11249\n11249 11109        11357        10458        11369       
## [151] 11212        11691        11354        11368        10306       
## [156] 10456        11413        11365        11436        11414       
## [161] 11694        11366        10457        10461        10310       
## [166] 11420        10308        11419        11103-3233   10474       
## [171] 10002-2289   10465        10472        11001        11428       
## [176] 11426        11423        11228        10129        11364       
## [181] 11385-2308   11363        10281        91766        11362       
## [186] 10174        11413-3220   07093        10550        11692-2304  
## [191] 11559        10270        10309        07302        10704       
## [196] 11003        10020        11021        10705        94085       
## 200 Levels:  07093 07302 10001 10002 10002-2289 10003 10004 10005 ... 94085
sum(listings$zipcode == "")/nrow(listings)*100 # To caculate percent of entries with missing zipcodes
## [1] 1.057368

Very small percentage of missing values(1%) . We can ignore the entries with missing zipcodes ony after making sure that the percentage of entries with missing zipcodes is minimal in each and every neighborhood , since our final analysis includes comparision of local averages.

missing_zipcode_entrylist = as.data.frame(table(listings[listings$zipcode == "", "neighbourhood"]))
missing_zipcode_entrylist <- missing_zipcode_entrylist[missing_zipcode_entrylist$Freq != 0, ]
names(missing_zipcode_entrylist) <- c("neighbourhood","Frequency")



missing_zipcode_entrylist <- missing_zipcode_entrylist[order(-missing_zipcode_entrylist$Frequency), ] #List of no. of entries with missing zipcodes w.r.t neighbourhood

#Checking the local missing value percentage in top 3 in the list


df1 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Brooklyn" , ]$Frequency # no. of entries with missing zipcodes in Brooklyn
Brooklyn_percent = df1/nrow(listings[listings$neighbourhood == "Brooklyn" , ])*100 # percent of entries with missing zipcodes in Brooklyn

df2 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Manhattan" , ]$Frequency # no. of entries with missing zipcodes in Manhattan
Manhattan_percent = df2/nrow(listings[listings$neighbourhood == "Manhattan" , ])*100 # percent of entries with missing zipcodes in Manhattan

df3 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Queens" , ]$Frequency #no. of entries with missing zipcodes in Queens
Queens_percent = df3/nrow(listings[listings$neighbourhood == "Queens" , ])*100 # percent of entries with missing zipcodes in Queens 

Brooklyn_percent
## [1] 1.046878
Manhattan_percent
## [1] 1.038884
Queens_percent
## [1] 1.029464

We can choose to ignore the entries with missing zipcodes since they are negligible and evenly distributed

Since there are no duplicate entries and negligible amount of missing values in the city column , we can filter the listings pertaining to New York City by matching it with the official list of zipcodes

NY_zip_char <- NY_zip %>% mutate(X10001  = as.character(X10001)) # Matching the class of variables containing zipcode in both "listings" and "NY_zip"

listings <- listings %>% mutate(zipcode = as.character(zipcode))
listingsNY <- listings[listings$zipcode %in% NY_zip_char[ , 1], ] # Filtering out data corresponding to zipcodes in New YOrk city
unique(listingsNY$zipcode)
##   [1] "11218" "10018" "10027" "11238" "10029" "10016" "11216" "10019"
##   [9] "10025" "10002" "10036" "11215" "10014" "11211" "11205" "10011"
##  [17] "11217" "10034" "10003" "10031" "11222" "11237" "11226" "10009"
##  [25] "10024" "11225" "11101" "11221" "11206" "10010" "11249" "10001"
##  [33] "10162" "10035" "10032" "11377" "11201" "11233" "11231" "10023"
##  [41] "10128" "10040" "11234" "11367" "10028" "11104" "10301" "10452"
##  [49] "10006" "10039" "11385" "10021" "11434" "11379" "10026" "11105"
##  [57] "10044" "10022" "10013" "11103" "10473" "10475" "11213" "10468"
##  [65] "11102" "10038" "10033" "11427" "11693" "10012" "11375" "10470"
##  [73] "10030" "10453" "11223" "10469" "11208" "11207" "10451" "11235"
##  [81] "10314" "11209" "11204" "10305" "11232" "10017" "11373" "11224"
##  [89] "11372" "11412" "10004" "11203" "11374" "10466" "11106" "10304"
##  [97] "11220" "11230" "11210" "10303" "11229" "10463" "11435" "11416"
## [105] "10005" "10454" "10455" "11236" "10280" "10037" "11355" "11692"
## [113] "11411" "10282" "10307" "11219" "10467" "11358" "11361" "10069"
## [121] "11415" "11356" "11214" "11432" "10464" "11370" "10302" "11418"
## [129] "11433" "11429" "11422" "11378" "11421" "10312" "10460" "10471"
## [137] "10462" "10459" "11417" "10007" "11360" "11109" "11357" "10458"
## [145] "11369" "11212" "11691" "11354" "11368" "10306" "10456" "11413"
## [153] "11365" "11436" "11414" "11694" "11366" "10457" "10461" "10310"
## [161] "11420" "10308" "11419" "10474" "10465" "10472" "11428" "11426"
## [169] "11423" "11228" "11364" "11363" "10281" "11362" "10174" "10270"
## [177] "10309" "10020"

listingsNY contains listings pertaining to New YOrk city

New York Median Price data

sum(duplicated(zillowdata)) # Checking for duplicate entries
## [1] 0
sum(is.na(zillowdata$RegionName)) # Checking for missing values in Region Name column which contains zipcodes
## [1] 0
twobedNY <- zillowdata[zillowdata$RegionName %in% NY_zip[ , 1] ,  ] #Filtering out available median price data for zipcodes in New York City

twobedNY contains history of Median Price data of properties in New YOrk city

Critical parameter identification for data quality check

Preliminary Analysis

Filtering critical parameters by Analyzing factors affecting Revenue and Investment
Revenue_factors Assumptions Remarks
Rent “zipcode”, “Price”, “monthly_price”,“weekly_price” to be checked
Bedrooms Rent of a ‘m’ bedroom property is m/n times of that of a ‘n’ bedroom property “bedrooms”
Occupancy Occupancy assumed to be 75% So we can filter out measures relating to reviews,host qualities, bookability, web urls and neighborhood , latitude and longitude
Cleaning fee Cleaning fee assumed to cover cleaning costs Impact on revenue is zero
Indirect discounts due to bulk bookings Assumed to be zero
Special Promotional discounts Assumed to be zero
Investment_factors Assumptions Remarks
Median Price of the property Latest available Median price data taken as price at the time of investment “RegionName” and “x2017.06” are to be checked
Investment_factors Assumptions Remarks
Median Price of the property Latest available Median price data taken as price at the time of investment “RegionName” and “x2017.06” are to be checked

Data Quality Insights of critical parameters

Data Quality check for Airbnb listings data

After the above preliminary analysis

  • Critical parameters in ListingsNY data - “zipcode”, “Price”, “monthly_price”,“weekly_price” , “bedrooms”
#checking the class of all the critical variables
class(listingsNY$zipcode)
## [1] "character"
class(listingsNY$price)
## [1] "factor"
class(listingsNY$monthly_price)
## [1] "factor"
class(listingsNY$weekly_price)
## [1] "factor"
class(listingsNY$bedrooms)
## [1] "integer"
  • Changing the variable types from factor to numeric as well as removing special characters
# changing the variables with rent information  from factor to numeric to make it condusive for arithmatic operations on the variables

listingsNY$zipcode <- as.character(listingsNY$zipcode)

listingsNY$monthly_price <-  as.character(listingsNY$monthly_price)
listingsNY$monthly_price <-  str_replace_all(listingsNY$monthly_price ,"\\$" , "")
listingsNY$monthly_price <-  str_replace_all(listingsNY$monthly_price ,"\\," , "")
listingsNY$monthly_price <- as.numeric(listingsNY$monthly_price)



listingsNY$price <-  as.character(listingsNY$price)
listingsNY$price <-  str_replace_all(listingsNY$price ,"\\$" , "")
listingsNY$price <-  str_replace_all(listingsNY$price ,"\\," , "")
listingsNY$price <- as.numeric(listingsNY$price)


listingsNY$weekly_price <-  as.character(listingsNY$weekly_price)
listingsNY$weekly_price <-  str_replace_all(listingsNY$weekly_price ,"\\$" , "")
listingsNY$weekly_price <-  str_replace_all(listingsNY$weekly_price ,"\\," , "")
listingsNY$weekly_price <- as.numeric(listingsNY$weekly_price)
  • Checking percentage of missing values in price related variables
sum(is.na(listingsNY$price))/nrow(listingsNY)*100
## [1] 0
sum(is.na(listingsNY$monthly_price))/nrow(listingsNY)*100
## [1] 89.29476
sum(is.na(listingsNY$weekly_price))/nrow(listingsNY)*100
## [1] 87.63782
sum(is.na(listingsNY$bedrooms))/nrow(listingsNY)*100
## [1] 0.04184188

Since they have high percentage (~89%) of missing values , “monthly_price” and “weekly_price” columns cannot be used for the analysis only “price” is selected as critical parameter

Remove the rows with no values for bedrooms(.04%, so negligible), since equivalent twobed price is calculated by the assumption (Assumption no.4) of direct proportionality of rent with no.of bedrooms

listingsNY <- listingsNY[complete.cases(listingsNY$bedrooms), ] # Selecting rows with info on no.of bedrooms
class(listingsNY$zipcode)
## [1] "character"
unique(listingsNY$zipcode) #Checking for abnormal values
##   [1] "11218" "10018" "10027" "11238" "10016" "11216" "10019" "10025"
##   [9] "10002" "10036" "11215" "10014" "11205" "10011" "10029" "11217"
##  [17] "10034" "10003" "10031" "11222" "11237" "11226" "10009" "10024"
##  [25] "11225" "11101" "11221" "11206" "10010" "11211" "11249" "10001"
##  [33] "10162" "10035" "10032" "11377" "11201" "11233" "11231" "10023"
##  [41] "10128" "10040" "11234" "11367" "10028" "11104" "10301" "10452"
##  [49] "10006" "10039" "11385" "10021" "11434" "11379" "10026" "11105"
##  [57] "10044" "10022" "10013" "11103" "10473" "10475" "11213" "10468"
##  [65] "11102" "10038" "10033" "11427" "11693" "10012" "11375" "10470"
##  [73] "10030" "10453" "11223" "10469" "11208" "11207" "10451" "11235"
##  [81] "10314" "11209" "11204" "10305" "11232" "10017" "11373" "11224"
##  [89] "11372" "11412" "10004" "11203" "11374" "10466" "11106" "10304"
##  [97] "11220" "11230" "11210" "10303" "11229" "10463" "11435" "11416"
## [105] "10005" "10454" "10455" "11236" "10280" "10037" "11355" "11692"
## [113] "11411" "10282" "10307" "11219" "10467" "11358" "11361" "10069"
## [121] "11415" "11356" "11214" "11432" "10464" "11370" "10302" "11418"
## [129] "11433" "11429" "11422" "11378" "11421" "10312" "10460" "10471"
## [137] "10462" "10459" "11417" "10007" "11360" "11109" "11357" "10458"
## [145] "11369" "11212" "11691" "11354" "11368" "10306" "10456" "11413"
## [153] "11365" "11436" "11414" "11694" "11366" "10457" "10461" "10310"
## [161] "11420" "10308" "11419" "10474" "10465" "10472" "11428" "11426"
## [169] "11423" "11228" "11364" "11363" "10281" "11362" "10174" "10270"
## [177] "10309" "10020"
  • Missing values found
# Counting the no.of abnormal values
 
sum(listingsNY$zipcode == "") #Missing values
## [1] 0

Entries without zipcode information are not useful for our analysis, so I chose to delete the entries with missing zipcodes

listingsNY <-  listingsNY[listingsNY$zipcode != "", ] # delete the entries with missing zipcodes

Since equivalent twobed price is calculated by the assumption of direct proportionality of rent with no.of bedrooms , We cannot include listings with 0 bedrooms

unique(listingsNY$bedrooms)
##  [1]  1  0  2  4  3  5  6  8  9  7 14 10 11
sum(listingsNY$bedrooms == 0)
## [1] 4402
#deleting entries w which have 0 bedrooms
listingsNY <- filter(listingsNY, bedrooms > 0)

Parameters useful for further analysis are RegionName , x2017.06 (Latest Median Price data)

Transformed data

After filtering critical parameters from both the data , data is merged and new variables useful for the final analysis i.e., “revenue” (Estimated Annual Revenue), “break_even” (No. of years to break even the intial investment) and “ten_year_PI” (Profitability Index for 10 years) are created

  • Filtering critical parameters in listingsNY
criticallistingsNY <- listingsNY[ , c("zipcode","price","bedrooms")] # Critical parameters from New York Listings data
  • Filtering critical parameters from Median Price data (twobedNY)
criticaltwobedNY <- twobedNY[ , c("RegionName","X2017.06")] # Critical Parameters from Median Price Data

names(criticaltwobedNY) <- c("zipcode" , "Latest_Median_Price") #Changing the column names to depict its summary
  • Joining the Listings data and the Median Price data
# Changing class of zipcode to match to that in criticallistings

criticaltwobedNY <-  criticaltwobedNY %>% mutate(zipcode  = as.character(zipcode))

criticallistingsNY <-  criticallistingsNY %>% mutate(zipcode  = as.character(zipcode))

# Joining the data through left_join
final <- criticallistingsNY %>% left_join(criticaltwobedNY, by = c("zipcode" = "zipcode"))

# Deleting entries which does not have a matching median price value
final <- final[is.na(final$Latest_Median_Price) == FALSE, ]
  • Addition of variable representing Revenue

    • We assume that annual occupancy is 75%. So, to get the revenue generated by a 2-bed room property, we multiply ‘price’ by 0.75x365x(2/n) to get annual revenue generated by that property, where n is the number of bedrooms present in that property. This value is stored in the column ‘revenue’.

Creation of “revenue” variable representing Annual Revenue

zipcode_count = as.data.frame(table(final$zipcode)) #Table with no. of entries with each zipcode
names(zipcode_count) = c("zipcode" , "count")


final$revenue = 0.75 * 365*final$price*(2/final$bedrooms) #Defining revenue variable


test <- final
test <- test %>% group_by(zipcode) %>% transmute(zip_rev = mean(revenue)) # Table with avg revenue from each zipcode
test <- as.data.frame(test)
test = unique(test) # Ensuring single entry for each zipcode

final1 <- final %>% left_join(test , by = "zipcode") #Adding column consisting average revenue by zipcode
final1 <- final1 %>% left_join(zipcode_count , by = "zipcode") #adding column consisting sample size in each zipcode
final1 <- final1[final1$count > 30,] #Eliminating zipcodes with less than 30 entries (Assuming sample size of 30 is significant)
zipcode price bedrooms Latest_Median_Price revenue
7 10025 79 1 1431000 43252.5
9 10025 135 1 1431000 73912.5
10 10036 85 1 1712900 46537.5
11 11215 89 1 1070800 48727.5
12 10025 85 1 1431000 46537.5
13 10014 120 1 2491600 65700.0

Creation of variables representing “TIme to Break_even” and “Profitability Index of 10 years”

rev_by_zipcode <-  final1[ , c("zipcode","Latest_Median_Price","zip_rev")]

rev_by_zipcode <- rev_by_zipcode %>% as.data.frame() %>%  group_by(zipcode) %>% summarise_at(c("Latest_Median_Price" , "zip_rev"), mean, na.rm = TRUE)

rev_by_zipcode <- rev_by_zipcode %>%   mutate(break_even = Latest_Median_Price/zip_rev) %>% mutate(ten_year_PI = (10*zip_rev + Latest_Median_Price)/Latest_Median_Price)
knitr::kable(rev_by_zipcode) %>% kable_styling() # Creation of variables "break_even" and "ten_year_PI"
zipcode Latest_Median_Price zip_rev break_even ten_year_PI
10003 2147000 94566.65 22.703564 1.440460
10011 2480400 109691.11 22.612590 1.442232
10013 3316500 111305.96 29.796249 1.335613
10014 2491600 117884.76 21.135896 1.473129
10021 1815600 84202.87 21.562210 1.463774
10022 2031600 107639.02 18.874197 1.529824
10023 2142300 120019.10 17.849659 1.560235
10025 1431000 70440.79 20.314933 1.492249
10028 2083900 80406.15 25.917170 1.385845
10036 1712900 102563.29 16.700908 1.598770
10128 1787100 76363.62 23.402506 1.427305
10304 328300 38503.20 8.526565 2.172805
10305 425100 35829.10 11.864657 1.842839
11201 1420700 79367.63 17.900244 1.558652
11215 1070800 57284.86 18.692547 1.534973
11217 1302300 67674.22 19.243664 1.519652
11231 1202900 62353.54 19.291608 1.518360
11234 476900 38997.21 12.229081 1.817723
11434 382300 37569.19 10.175892 1.982715
rev_by_zipcode1 <-  final1[ , c("zipcode","Latest_Median_Price","zip_rev")]

rev_by_zipcode1 <- rev_by_zipcode1 %>% as.data.frame() %>%  group_by(zipcode) %>% summarise_at(c("Latest_Median_Price" , "zip_rev"), mean, na.rm = TRUE)

rev_by_zipcode1 <- rev_by_zipcode1 %>%   mutate(break_even = Latest_Median_Price/zip_rev) %>% mutate(ten_year_PI = (10*zip_rev + Latest_Median_Price)/Latest_Median_Price) %>% mutate(ten_yrs = 10*zip_rev - Latest_Median_Price) %>% mutate(fifteen_yrs = 15*zip_rev - Latest_Median_Price) %>% mutate(twenty_yrs = 20*zip_rev - Latest_Median_Price) %>%  mutate(twentyfive_yrs = 25*zip_rev - Latest_Median_Price) %>% mutate(thirty_yrs = 30*zip_rev - Latest_Median_Price)
knitr::kable(rev_by_zipcode) %>% kable_styling() # Creation of variables for profit after 10, 15, 20 ,25 yrs
zipcode Latest_Median_Price zip_rev break_even ten_year_PI
10003 2147000 94566.65 22.703564 1.440460
10011 2480400 109691.11 22.612590 1.442232
10013 3316500 111305.96 29.796249 1.335613
10014 2491600 117884.76 21.135896 1.473129
10021 1815600 84202.87 21.562210 1.463774
10022 2031600 107639.02 18.874197 1.529824
10023 2142300 120019.10 17.849659 1.560235
10025 1431000 70440.79 20.314933 1.492249
10028 2083900 80406.15 25.917170 1.385845
10036 1712900 102563.29 16.700908 1.598770
10128 1787100 76363.62 23.402506 1.427305
10304 328300 38503.20 8.526565 2.172805
10305 425100 35829.10 11.864657 1.842839
11201 1420700 79367.63 17.900244 1.558652
11215 1070800 57284.86 18.692547 1.534973
11217 1302300 67674.22 19.243664 1.519652
11231 1202900 62353.54 19.291608 1.518360
11234 476900 38997.21 12.229081 1.817723
11434 382300 37569.19 10.175892 1.982715
write.csv(rev_by_zipcode1,"rev_by_zipcode1.csv")

Zipcode 10304 is profitable after 10, 15 and 20 yrs but zipcode 10023 is profitable after 25 yrs

Visualization

Latest Median Price Chart

Revenue through rent in different zipcodes

Time to break even

## # A tibble: 19 x 5
##    zipcode Latest_Median_Price zip_rev break_even ten_year_PI
##    <chr>                 <dbl>   <dbl>      <dbl>       <dbl>
##  1 10304                328300  38503.       8.53        2.17
##  2 11434                382300  37569.      10.2         1.98
##  3 10305                425100  35829.      11.9         1.84
##  4 11234                476900  38997.      12.2         1.82
##  5 10036               1712900 102563.      16.7         1.60
##  6 10023               2142300 120019.      17.8         1.56
##  7 11201               1420700  79368.      17.9         1.56
##  8 11215               1070800  57285.      18.7         1.53
##  9 10022               2031600 107639.      18.9         1.53
## 10 11217               1302300  67674.      19.2         1.52
## 11 11231               1202900  62354.      19.3         1.52
## 12 10025               1431000  70441.      20.3         1.49
## 13 10014               2491600 117885.      21.1         1.47
## 14 10021               1815600  84203.      21.6         1.46
## 15 10011               2480400 109691.      22.6         1.44
## 16 10003               2147000  94567.      22.7         1.44
## 17 10128               1787100  76364.      23.4         1.43
## 18 10028               2083900  80406.      25.9         1.39
## 19 10013               3316500 111306.      29.8         1.34

Quickest break_even happens for zipcodes 10304 , 10312, 10305

Profitability Index for 10 years

Zipcodes 10304 , 11434 , 10305 have the highest PI after 10 years

Detailed Summary

Working with Zillow dataset

Zillow dataset has 8946 observations of 262 variables. It has median monthly prices of all properties on Zillow grouped by zip code from April 1996 till June 2017. Lots of median price records have no values corresponding to them. But, because of our assumption that time value of money discount rate is 0%, we can take the latest available median price data and assign this value as median price corresponding to that zip code. These values are stored in column ‘Latest_Median_Price’. The next step of our analysis is to extract records corresponding to the zip codes present in New York City. To perform this task, Zillow data is filtered on the field ‘City’ whose value is ‘New York’. We observe that there are only 25 records in the filtered data. There are about 200 zip codes in NYC. Compared to that, this value is low. This implies that the Zillow data is incomplete, due to which, our analysis is restricted to only these 25 zip codes. Only, the columns ‘RegionName’ and ‘price’ are retained for further analysis. This dataset is named ‘twobedNY’.

Latest Median Price Chart

Working with Airbnb listings dataset

Airbnb listings dataset has 48895 obs. of 106 variables Most of the variables present in this dataset are of no interest to us. After preliminary analysis , the variables of interest are ‘zipcode’, ‘price’ and ‘bedrooms’. Extract only those records whose zip codes match with the zip codes that we obtain from Zillow dataset. At the same time, filter out the records with zero number of bedrooms. Now, convert the character value in ‘price’ column into numeric column by removing ‘$’ and ‘,’ characters.

Merging both the datasets

Now, merge both the datasets based on zip codes.After the merge , zipcodes with less than 30 as sample size have been removed from the analysis. Next step is to convert this daily price into annual revenue. We assume that annual occupancy is 75%. So, to get the revenue generated by a 2-bed room property, we multiply ‘price’ by 0.75365(2/n) to get annual revenue generated by that property, where n is the number of bedrooms present in that property. This will give us an estimate of the median value of price of a 2-bedroom property in a zip code and median value of the revenue that is generated from all the properties listed on Airbnb in that zip code.

1. Estimation of profitable zipcodes through “Time to Break even”

\(Time to break even = Investment/Annual Revenue\)

Now, in each zip code, we calculate the number of years it takes for the property to break-even on the amount spent on purchasing it. This can simply be calculated by dividing cost of the property by annual revenue generated by the property. The column having these values is named ‘break_even’. This dataset is named ‘rev_by_zipcode’. The zip codes that requires the minimum amount of time to break even are the ones that are most profitable.

2. Estimation of profitable zipcodes through Profitability Index

\(Profitability Index(PI) = (Present value of future cash flows)/(Intial Investment)\)

Profitability Index represents the relationship between the costs and benefits of a proposed project. It is the ratio of present value of future cash flows and intial investment.Here , since time value of money discount rate is 0 . Assuming the properties sell at the same price after 10 years of being in the rental business.

  • Present value of future cash flows of 10 years = 10 x Annual revenue + Latest Median Price of the property = 10 x zip_rev + Latest_Median_Price

  • Intial Investment = Latest_Median_Price

Revenue through rent in different zipcodes

Time to break even

Quickest break_even point happens for zipcodes 10309 (Tottenville) , 10312(Eltingville) , 10304(Stapleton Heights) in that order

Profitability Index for 10 years

Zipcodes 10309 (Tottenville) , 10312(Eltingville) , 10304(Stapleton Heights) have the highest PI after 10 years in that order

Additional scope

Since it is a 4 hour data challenge , some ideas which when implemented would help inform a better decision to the company. I list some below.

  1. Since last few year price data is available to us, we can do time series analysis to better predict the current median price and use this for profitability analysis

  2. R shiny can be used to build interactive dashboards , drop downs and slider bars for data filtering

  3. Since median price data is available only for 25 zipcodes of New York city, more data can be procured for zipcodes where data is unavailable

  4. An areal heat map can be generated to represent our results.

  5. For listings without zipcodes, zipcodes can be estimated by applying clustering for longitude and latitude data , and they too can be included in our analysis.