A real estate company plans to purchase properties in New York City and use Airbnb as the medium to lease out their investment property. The main objective of the real estate company is to find out the zip codes in New York City that would generate the most profit on short term rentals. We are looking to find out the best zip codes to invest within New York City.
Two data sets, Airbnb and Zillow are available. Zillow provides data on estimated value of two-bedroom properties. Airbnb dataset provides data on how much properties in certain neighborhoods rent out for in New York City.
- An occupancy rate of 75%
- Investor pays for the property upfront in cash
- Time value of money discount rate is 0%
- Rent of a ‘m’ bedroom property is m/n times of that of a ‘n’ bedroom property
- Indirect discounts due to bulk bookings and special promotional discounts were not taken into consideration
- Cleaning fee covers the cleaning costs and doesn’t have any contribution to the revenue
Prior to the analysis , it is essential to properly load, examine and clean the data
============> =============> =============> =============> ==============> =============> ============>
To reproduce the results of this project you will need to load the following packages
library(tidyverse) # Collection of R packages designed for data science.
library(knitr) #A general-purpose tool for dynamic report generation in R
library(ggplot2) # Aesthetic depiction library
library(kableExtra) # Aesthetic depiction of tables in Rmarkdown
library(scales) # To work with axis in charts
The data for this project originated from the following source.
listings <- read.csv("listings.csv") #Airbnb listings
zillowdata <- read.csv("Zip_Zhvi_2bedroom.csv") # Median Price of twobedroom properties
NY_zip <- read.csv("New York zipcodes.csv") #List of all the zipcodes in New York city
Information on the listings including - location, number of bedrooms, room types (entire home/private home/shared home)
| Field | Description |
|---|---|
| id | Identifier used by AirBnB for the listing |
| listing_url | URL used by AirBnB for their listing |
| scrape_id | Identifier used internally by AirBnB to identify the date of the data |
| last_scraped | Identifier used internally by AirBnB to identify when the data was last pulled. |
| name | Header created by host to entice potential occupants to stay at their property |
| summary | Description provided by the host of their property |
| space | Description provided by host describing the space (bedrooms, bathrooms, sqft, etc) of their property |
| description | Detailed information about the property, neighborhood, amenities, location, and anything else the host feels would be beneficial to share |
| experiences_offered | Information about what experiences are offered; eg. Bikes, surfing, canoes, etc |
| neighborhood_overview | Additional detail about the neighborhood including sites to see, restaurants, noise, landmarks around, etc. |
| notes | Any other additional notes the host would like to disclose, which was not provided in the other description information. |
| transit | Host can provide the kind of transportation available nearby including bus, trains, airport, etc.. |
| access | Description on how to enter the home, what area of the home is available for use, and what you are able items you are able to use during your stay. |
| interaction | The availability of the host; could be in person, phone, email, text, etc. The host provides guidelines of what the occupant can expect for communication and if they expect to meet. |
| house_rules | The host provides rules of the home that they expect the occupant to adhere to. Examples include clean up, laundry, dishes, etc. |
| thumbnail_url | URL specific to AirBnB where the posted pictures are stored. |
| medium_url | URL specific to AirBnB where the posted pictures are stored. |
| picture_url | URL specific to AirBnB where the posted pictures are stored. |
| xl_picture_url | URL specific to AirBnB where the posted pictures are stored. |
| host_id | Identifier assigned to a particular host. A host can have multiple properties listed and would have the same unique identifier. |
| host_url | URL assigned to the host for their listing. |
| host_name | Name of the host. |
| host_since | Date of when the host first initiated their account with AirBnB. |
| host_location | City, State, and Country of where the host is located. |
| host_about | Information provided by host to potential occupant. |
| host_response_time | Time a potential occupant can expect to receive a response after inquiring about the property. |
| host_response_rate | Percentage of historical responses |
| host_acceptance_rate | Percentage of historical acceptances from potential occupants. This information is currently unavailable. |
| host_is_superhost | Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests. |
| host_thumbnail_url | URL assigned to host where the pictures are posted |
| host_picture_url | URL assigned to host where the pictures are posted |
| host_neighbourhood | Name of the neighborhood the host is listing the property at |
| host_listings_count | Number of properties host has listed on AirBnB currently |
| host_total_listings_count | Number of properties host has ever listed on AirBnB |
| host_verifications | Avenues a potential occupant can check on the host to verify reputation |
| host_has_profile_pic | Indicator for profile picture provided (t = yes) (f = no) |
| host_identity_verified | Further checks have been conducted by host to AirBnB to verify existence and reputation. (t = yes) (f = no) |
| street | Street address where the property is located |
| neighbourhood | Neighborhood where the property is located |
| neighbourhood_cleansed | Verified neighborhood name where the property is located. |
| neighbourhood_group_cleansed | Name of the area where the property is located. |
| city | City where the property is located. |
| state | State where the property is located. |
| zipcode | Zip code where the property is located. |
| market | State where the property is located |
| smart_location | City and state together where the property is located. |
| country_code | Country abbreviation where the property is located. |
| country | Full country name where the property is located. |
| latitude | The angular distance of a place north or south of the earth’s equator, expressed in degrees and minutes. |
| longitude | The angular distance of a place east or west of the meridian at Greenwich, England, expressed in degrees and minutes. |
| is_location_exact | Indicator to validate if the latitude and longitude is exact or not. (t==yes) (f = no) |
| property_type | Indicates the type of house is this is considered. Examples are house, apartment, townhomes, etc… |
| room_type | Indicates specific rooms available for rent and/or the entire home. |
| accommodates | Indicates the amount of people the host is allowing to occupy the property. |
| bathrooms | Indicates the number of bathrooms within the property. |
| bedrooms | Indicates the number of bedrooms within the property. |
| beds | Indicates the number of beds within the property for use. |
| bed_type | Indicates the type of bed the property has. ie. futon, real bed, airbed, etc. |
| amenities | Provides the potential occupant the type of convenience items included as part of rent. Items include cable television, wifi, toiletries, food, etc.. |
| square_feet | Square footage of the property or space for rent. |
| price | Price the host is charging to stay per night. |
| weekly_price | Weekly price the host is charging; this could be discounted for longer term stays. |
| monthly_price | Monthly price the host is charging; this could be discounted for longer term stays. |
| security_deposit | Amount the host would like upfront for any incidentals (should you damage anything during your stay). |
| cleaning_fee | Price the host is charging to clean up after your stay. |
| guests_included | Indicates the number of guests the host is allowing you to bring along. |
| extra_people | Additional charge per additional guests you bring. |
| minimum_nights | Minimum amount of nights the host is willing to rent out the property. |
| maximum_nights | Maximum amount of nights the host is willing to rent out the property. |
| calendar_updated | Indicates when the host has updated the availability of when the property is available to rent. |
| has_availability | Indicator to show if the host has availability currently; not currently available at the moment. |
| availability_30 | Indicates the number of days the property is available for rent within 30 days. |
| availability_60 | Indicates the number of days the property is available for rent within 60 days. |
| availability_90 | Indicates the number of days the property is available for rent within 90 days. |
| availability_365 | Indicates the number of days the property is available for rent within 365 days. |
| calendar_last_scraped | Date when the availability calendar was last pulled by AirBnB. |
| number_of_reviews | Number of reviews received for the property for its entire existence within AirBnB. |
| first_review | Date of when the first review was provided. |
| last_review | Date of when the last review was provided. |
| review_scores_rating | Overall score given based on accuracy, cleanliness, check-in, communication, location, and value. |
| review_scores_accuracy | Score based on the accuracy of the description and pictures provided in the listing of the host. |
| review_scores_cleanliness | Score based on the cleanliness of the property when occupant arrived property. |
| review_scores_checkin | Score based on the ease of checkin when the occupant arrived to the property. |
| review_scores_communication | Score based on the communication between host and occupant before and during their stay. |
| review_scores_location | Score based on the location of the property. |
| review_scores_value | Score based on the price charged for the property based on everything that is offered. |
| requires license | Indicator to show if the host requires a license to stay at the property. (t= yes) (f= no) |
| license | Host should require a license; this field would have the number. |
| jurisdiction_names | Name of the jurisdiction of where the property is at. This filed is currently unavailable. |
| instant_bookable | Indicates if the property can be booked immediately or is currently unavailable to book.(t= yes) (f= no) |
| cancellation_policy | Indicates how flexible the host is for cancelling the reservation. Values range from flexible to superstrict. |
| require_guest_profile_picture | Requires the occupant to provide a picture in order to reserve the property. |
| require_guest_phone_verification | Requires the occupant to provide a valid phone number in order to reserve the property. |
| calculated_host_listings_count | Total number of listings the host has on AirBnB. |
| reviews_per_month | Number of reviews the host has received on average per month. |
Cost data to determine the average property price for 2 bedrooms
| Field | Description |
|---|---|
| RegionID | Zillow assigned number only, assigned consecutively when the regions are defined. |
| RegionName | Zip code of where the property is located. |
| City | City of where the property is located. |
| State | State of where the property is located. |
| Metro | General name of the surrounding area where the property is located. |
| CountyName | Political and administrative division of a state, referred to as a particular part of the state. |
| SizeRank | Population of the area; the lower the number the greater the population. |
| 1996-04 – XXXX-XX ((XXXX-XX indicates the latest date available at the time the data is pulled) | Indicates the historical median price within that area. |
Official list of all the zipcodes available in New York city Reference:New York Gov. Health Department
It is a dataframe with 48895 Observations and 106 Variables
## 'data.frame': 48895 obs. of 106 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ listing_url : Factor w/ 48895 levels "https://www.airbnb.com/rooms/10000070",..: 21404 21961 39206 39474 41416 41514 41543 41592 41625 41660 ...
## $ scrape_id : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
## $ last_scraped : Factor w/ 2 levels "2019-07-08","2019-07-09": 2 2 1 2 1 1 2 1 1 1 ...
## $ name : Factor w/ 47906 levels "","'Fan'tastic",..: 12661 38172 45171 15702 19366 25001 8337 25048 15597 17682 ...
## $ summary : Factor w/ 43807 levels "","----This a listing for our photo studio. The space does not offer overnight stays---- With an abundance of Sou"| __truncated__,..: 29689 13422 1 40376 1 22465 1 27616 26184 19463 ...
## $ space : Factor w/ 32171 levels "","---------------------------------------------------------------------------------------------------------------"| __truncated__,..: 18395 201 30500 8115 12328 9504 8648 31798 18480 16185 ...
## $ description : Factor w/ 46250 levels "","----------------------- THE APARTMENT ----------------------- **SUMMER ALERT: Apartment is FULLY AIRCONDITIONE"| __truncated__,..: 31270 14116 44645 42618 21635 23689 16093 29099 27589 20535 ...
## $ experiences_offered : Factor w/ 1 level "none": 1 1 1 1 1 1 1 1 1 1 ...
## $ neighborhood_overview : Factor w/ 26905 levels "","''Stuyvesant hight'' is a trendy and safe neighborhood, their is plenty of coffee shops, restaurants and stores"| __truncated__,..: 4516 3890 1 10157 1 12131 1 21046 13722 8179 ...
## $ notes : Factor w/ 17018 levels "","'Minibar' is the name of the best app for beer, wine, and liquor delivery. 'Seamless' is the main food delivery"| __truncated__,..: 5563 1 1 1 1 10205 1 10272 2164 14126 ...
## $ transit : Factor w/ 28041 levels "","'very very close to L express to manhattan! JMZ on broadway one block away> sometimes you hear the train runni"| __truncated__,..: 24517 4488 1 4789 1 8116 1 1 20931 7719 ...
## $ access : Factor w/ 23435 levels "","---------------------------------------------------------------------------------------------------------------"| __truncated__,..: 1 8448 1 22621 1 10294 1 2632 8780 8179 ...
## $ interaction : Factor w/ 24970 levels "","---------------------------------------------------------------------------------------------- ----------------"| __truncated__,..: 1 6204 1 19711 1 10508 1 1 16482 19266 ...
## $ house_rules : Factor w/ 25255 levels "","'No Shoes' in the apartment.",..: 6531 12547 23842 21886 17807 7595 1 15619 16771 15412 ...
## $ thumbnail_url : logi NA NA NA NA NA NA ...
## $ medium_url : logi NA NA NA NA NA NA ...
## $ picture_url : Factor w/ 48272 levels "https://a0.muscache.com/im/pictures/00027b83-c89a-4a76-ac36-de9ab57e878e.jpg?aki_policy=large",..: 11889 45757 26944 43799 48081 1209 6720 3977 2921 17476 ...
## $ xl_picture_url : logi NA NA NA NA NA NA ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_url : Factor w/ 37457 levels "https://www.airbnb.com/users/show/1000014",..: 19513 19756 26531 27259 32826 33085 33159 35905 33430 33539 ...
## $ host_name : Factor w/ 11453 levels "","'Cil","-TheQueensCornerLot",..: 5051 4846 2962 6264 5982 1970 3601 9699 6935 1264 ...
## $ host_since : Factor w/ 3587 levels "","2008-08-22",..: 4 5 14 15 19 20 21 31 22 24 ...
## $ host_location : Factor w/ 1571 levels ""," Brooklyn, NY ",..: 982 982 982 982 886 982 982 982 982 982 ...
## $ host_about : Factor w/ 22144 levels "","'Eric is a director-actor-writer-producer-teacher of theatre; he also writes songs, makes bread, and advocates "| __truncated__,..: 3047 751 17091 16419 13912 9517 107 15173 21518 14390 ...
## $ host_response_time : Factor w/ 6 levels "","a few days or more",..: 6 5 4 5 3 6 3 5 3 5 ...
## $ host_response_rate : Factor w/ 87 levels "","0%","10%",..: 4 74 4 80 87 4 87 77 87 4 ...
## $ host_acceptance_rate : Factor w/ 2 levels "","N/A": 2 2 2 2 2 2 2 2 2 2 ...
## $ host_is_superhost : Factor w/ 3 levels "","f","t": 2 2 2 2 2 2 2 2 2 3 ...
## $ host_thumbnail_url : Factor w/ 37360 levels "","https://a0.muscache.com/defaults/user_pic-50x50.png?v=3",..: 6697 31587 34728 34835 36082 14713 5717 36888 36252 21825 ...
## $ host_picture_url : Factor w/ 37360 levels "","https://a0.muscache.com/defaults/user_pic-225x225.png?v=3",..: 6697 31587 34728 34835 36082 14713 5717 36888 36252 21825 ...
## $ host_neighbourhood : Factor w/ 416 levels "","Ã\230sterbro",..: 156 241 167 75 113 129 27 169 375 218 ...
## $ host_listings_count : int 6 5 1 1 1 1 1 1 1 4 ...
## $ host_total_listings_count : int 6 5 1 1 1 1 1 1 1 4 ...
## $ host_verifications : Factor w/ 536 levels "['email', 'facebook', 'google', 'reviews', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']",..: 351 350 182 351 118 317 138 155 33 301 ...
## $ host_has_profile_pic : Factor w/ 3 levels "","f","t": 3 3 3 3 3 3 3 3 3 3 ...
## $ host_identity_verified : Factor w/ 3 levels "","f","t": 3 3 3 3 3 2 2 2 3 3 ...
## $ street : Factor w/ 315 levels " Astoria, NY, United States",..: 46 203 203 64 203 203 64 203 203 203 ...
## $ neighbourhood : Factor w/ 195 levels "","Allerton",..: 20 106 83 20 53 117 20 106 178 105 ...
## $ neighbourhood_cleansed : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
## $ neighbourhood_group_cleansed : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
## $ city : Factor w/ 306 levels ""," Astoria",..: 49 187 187 46 187 187 46 187 187 187 ...
## $ state : Factor w/ 8 levels "","CA","MP","New York ",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ zipcode : Factor w/ 200 levels "","07093","07302",..: 123 20 29 142 31 18 121 21 27 5 ...
## $ market : Factor w/ 18 levels "","Adirondacks",..: 15 15 15 15 15 15 15 15 15 15 ...
## $ smart_location : Factor w/ 317 levels " Astoria, NY",..: 46 205 205 62 205 205 62 205 205 205 ...
## $ country_code : Factor w/ 1 level "US": 1 1 1 1 1 1 1 1 1 1 ...
## $ country : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ is_location_exact : Factor w/ 2 levels "f","t": 1 1 2 2 2 1 1 2 2 2 ...
## $ property_type : Factor w/ 36 levels "Aparthotel","Apartment",..: 2 2 2 19 2 2 2 2 2 2 ...
## $ room_type : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
## $ accommodates : int 2 2 2 3 1 2 2 2 1 3 ...
## $ bathrooms : num 1 1 1 1 1 1 NA 1 1 1 ...
## $ bedrooms : int 1 0 1 1 NA 1 1 1 1 1 ...
## $ beds : int 1 1 1 4 1 1 1 1 1 2 ...
## $ bed_type : Factor w/ 5 levels "Airbed","Couch",..: 5 5 4 5 5 5 3 5 5 5 ...
## $ amenities : Factor w/ 44292 levels "{\"Air conditioning\",\"Carbon monoxide detector\",\"First aid kit\",\"Lock on bedroom door\",\"translation mis"| __truncated__,..: 9863 23810 339 7790 3243 9905 38040 20063 920 11136 ...
## $ square_feet : int NA NA NA 500 NA NA NA NA NA NA ...
## $ price : Factor w/ 674 levels "$0.00","$1,000.00",..: 111 216 113 642 619 189 551 612 612 113 ...
## $ weekly_price : Factor w/ 599 levels "","$1,000.00",..: 260 162 1 426 440 1 1 366 1 1 ...
## $ monthly_price : Factor w/ 681 levels "","$1,000.00",..: 681 1 1 233 131 1 1 1 1 1 ...
## $ security_deposit : Factor w/ 225 levels "","$0.00","$1,000.00",..: 32 148 95 178 32 137 169 1 1 2 ...
## $ cleaning_fee : Factor w/ 207 levels "","$0.00","$1,000.00",..: 91 5 182 1 188 26 2 43 1 128 ...
## $ guests_included : int 1 2 2 1 1 2 1 1 1 2 ...
## $ extra_people : Factor w/ 104 levels "$0.00","$10.00",..: 51 1 29 1 29 3 45 8 1 2 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ maximum_nights : int 730 1125 7 730 120 21 730 14 14 1125 ...
## $ minimum_minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ maximum_minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ minimum_maximum_nights : int 730 1125 7 730 120 21 730 14 14 1125 ...
## $ maximum_maximum_nights : int 730 1125 7 730 120 21 730 14 14 1125 ...
## $ minimum_nights_avg_ntm : num 1 1 3 1 10 3 45 2 2 1 ...
## $ maximum_nights_avg_ntm : num 730 1125 7 730 120 ...
## $ calendar_updated : Factor w/ 92 levels "1 week ago","10 months ago",..: 27 38 32 91 26 27 10 39 19 91 ...
## $ has_availability : Factor w/ 1 level "t": 1 1 1 1 1 1 1 1 1 1 ...
## $ availability_30 : int 30 25 30 0 0 23 0 3 0 0 ...
## $ availability_60 : int 60 55 60 0 0 48 0 10 0 0 ...
## $ availability_90 : int 90 80 90 3 0 48 0 12 0 0 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
## $ calendar_last_scraped : Factor w/ 2 levels "2019-07-08","2019-07-09": 2 2 1 2 1 1 2 1 1 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ number_of_reviews_ltm : int 2 11 0 69 4 9 0 43 0 12 ...
## $ first_review : Factor w/ 2898 levels "","2009-03-12",..: 1587 23 1 1162 357 3 9 5 15 14 ...
## $ last_review : Factor w/ 1765 levels "","2011-03-28",..: 1503 1717 1 1762 1534 1749 1124 1751 1048 1736 ...
## $ review_scores_rating : int 98 95 NA 90 93 89 90 84 98 94 ...
## $ review_scores_accuracy : int 10 10 NA 10 10 10 8 9 10 10 ...
## $ review_scores_cleanliness : int 10 9 NA 9 9 9 8 7 10 9 ...
## $ review_scores_checkin : int 10 10 NA 10 10 10 10 10 10 10 ...
## $ review_scores_communication : int 10 10 NA 10 10 10 10 9 10 10 ...
## $ review_scores_location : int 10 10 NA 10 9 9 9 10 10 9 ...
## $ review_scores_value : int 10 9 NA 9 10 9 9 9 10 9 ...
## $ requires_license : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1 1 ...
## $ license : Factor w/ 10 levels "","11-3461723",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ jurisdiction_names : Factor w/ 6 levels "","{\"Connecticut State\"}",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ instant_bookable : Factor w/ 2 levels "f","t": 1 1 1 1 1 1 1 1 1 1 ...
## $ is_business_travel_ready : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1 1 ...
## $ cancellation_policy : Factor w/ 6 levels "flexible","moderate",..: 2 4 4 2 4 4 4 4 1 4 ...
## [list output truncated]
It is a dataframe with 8946 obs. of 262 variables
## 'data.frame': 8946 obs. of 262 variables:
## $ RegionID : int 61639 84654 61637 84616 93144 84640 97564 91982 71831 84646 ...
## $ RegionName: int 10025 60657 10023 60614 79936 60640 94109 77494 32162 60647 ...
## $ City : Factor w/ 4684 levels "Aberdeen","Abilene",..: 2702 649 2702 649 1119 649 3479 672 3912 649 ...
## $ State : Factor w/ 48 levels "AK","AL","AR",..: 34 15 34 15 42 15 5 42 10 15 ...
## $ Metro : Factor w/ 467 levels "","Aberdeen",..: 292 76 292 76 125 76 368 188 415 76 ...
## $ CountyName: Factor w/ 722 levels "Ada","Adams",..: 460 158 460 158 212 158 574 231 634 158 ...
## $ SizeRank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X1996.04 : int NA 167700 NA 195800 59100 123300 336800 117300 72600 129600 ...
## $ X1996.05 : int NA 166400 NA 193500 60500 122600 337400 117900 74800 129800 ...
## $ X1996.06 : int NA 166700 NA 192600 60900 122000 337700 118800 77300 129900 ...
## $ X1996.07 : int NA 167200 NA 192300 60800 121500 338200 121100 80000 130600 ...
## $ X1996.08 : int NA 166900 NA 192600 60300 120900 339700 123200 81700 131100 ...
## $ X1996.09 : int NA 166900 NA 193600 60400 120600 340400 124600 82600 131200 ...
## $ X1996.10 : int NA 168000 NA 195500 61200 120900 339700 125800 83500 131000 ...
## $ X1996.11 : int NA 170100 NA 197600 61700 121300 338900 126400 84700 131400 ...
## $ X1996.12 : int NA 171700 NA 199400 61000 121600 338300 125600 85100 132000 ...
## $ X1997.01 : int NA 173000 NA 201300 60100 122100 340000 125100 84200 132100 ...
## $ X1997.02 : int NA 174600 NA 203600 59300 122900 345800 127000 82200 132200 ...
## $ X1997.03 : int NA 177600 NA 206500 59000 124200 350600 129400 79900 133100 ...
## $ X1997.04 : int NA 180100 NA 209200 58700 125300 351300 130400 78400 133500 ...
## $ X1997.05 : int NA 182300 NA 211100 58400 126100 352600 130400 77500 133300 ...
## $ X1997.06 : int NA 184400 NA 212600 58000 126700 357300 131200 76800 134100 ...
## $ X1997.07 : int NA 186300 NA 214400 57800 127900 363900 130600 77400 136100 ...
## $ X1997.08 : int NA 187600 NA 215600 57900 129300 370600 129100 78600 138700 ...
## $ X1997.09 : int NA 189400 NA 216500 57800 130400 374300 127600 79800 141300 ...
## $ X1997.10 : int NA 190300 NA 217900 57800 131300 375600 125800 80700 143100 ...
## $ X1997.11 : int NA 189700 NA 220100 58100 131700 381000 124500 81300 144100 ...
## $ X1997.12 : int NA 189800 NA 222200 58400 132300 389500 124600 81500 144800 ...
## $ X1998.01 : int NA 191900 NA 223900 58700 133500 395400 125100 81900 145300 ...
## $ X1998.02 : int NA 194500 NA 225400 59200 134500 400800 125500 82300 145800 ...
## $ X1998.03 : int NA 195500 NA 227700 59400 134800 409000 126000 82200 146100 ...
## $ X1998.04 : int NA 196000 NA 230100 58700 135200 414600 127200 81700 146500 ...
## $ X1998.05 : int NA 196900 NA 231700 58100 135500 418500 128700 81300 147500 ...
## $ X1998.06 : int NA 198900 NA 232700 57900 136300 424200 128300 81000 148500 ...
## $ X1998.07 : int NA 201400 NA 233700 57900 137700 430500 126300 80500 149200 ...
## $ X1998.08 : int NA 204600 NA 234700 57700 139600 437500 126000 80300 150000 ...
## $ X1998.09 : int NA 207900 NA 235600 57600 141600 448900 126500 80100 150500 ...
## $ X1998.10 : int NA 211800 NA 236800 57500 143400 460300 126300 79900 151100 ...
## $ X1998.11 : int NA 214600 NA 238800 57800 144500 461900 127400 80600 152000 ...
## $ X1998.12 : int NA 216000 NA 240800 58000 145600 461900 130600 82300 153700 ...
## $ X1999.01 : int NA 217500 NA 242400 58000 147400 466200 131600 83800 156300 ...
## $ X1999.02 : int NA 220200 NA 243800 57900 149200 470500 130400 84700 159200 ...
## $ X1999.03 : int NA 222800 NA 246400 57800 149900 470500 129900 85600 161000 ...
## $ X1999.04 : int NA 226200 NA 250200 57800 150500 473100 131900 86900 162300 ...
## $ X1999.05 : int NA 229600 NA 254300 58000 151700 480000 134600 88300 163500 ...
## $ X1999.06 : int NA 232400 NA 257600 58500 152800 487800 137400 88900 164700 ...
## $ X1999.07 : int NA 234400 NA 261100 58700 153900 496000 140600 89300 165700 ...
## $ X1999.08 : int NA 236300 NA 264800 59000 156400 506900 141800 91000 166800 ...
## $ X1999.09 : int NA 238300 NA 267900 59200 159400 516300 141200 93400 169600 ...
## $ X1999.10 : int NA 241800 NA 270700 59300 161800 522900 141800 95100 173700 ...
## $ X1999.11 : int NA 246100 NA 272800 59500 163800 533800 142800 95600 177200 ...
## $ X1999.12 : int NA 249500 NA 274400 59900 165700 549400 141700 95000 179100 ...
## $ X2000.01 : int NA 251300 NA 276200 60300 167100 564600 141200 95500 180900 ...
## $ X2000.02 : int NA 253200 NA 278600 60400 168400 577300 143000 98700 183600 ...
## $ X2000.03 : int NA 255700 NA 280100 60300 169600 592200 145600 101800 187200 ...
## $ X2000.04 : int NA 259200 NA 283100 60300 171000 610400 145500 102500 190700 ...
## $ X2000.05 : int NA 263100 NA 287700 60000 172400 626700 145400 102300 193500 ...
## $ X2000.06 : int NA 266600 NA 293600 59500 174800 640600 145000 103300 195700 ...
## $ X2000.07 : int NA 269500 NA 298500 59400 177900 654400 143000 103300 196500 ...
## $ X2000.08 : int NA 272800 NA 302700 59800 180400 663500 140800 101900 197800 ...
## $ X2000.09 : int NA 275500 NA 305000 59900 182300 664700 140000 101800 200200 ...
## $ X2000.10 : int NA 278800 NA 306800 59700 184600 670100 142200 102700 202700 ...
## $ X2000.11 : int NA 283400 NA 309400 59500 187700 679000 145400 102500 204800 ...
## $ X2000.12 : int NA 288600 NA 313100 59400 190700 676500 146400 103100 208000 ...
## $ X2001.01 : int NA 291300 NA 314900 59500 193100 663100 145900 105000 211800 ...
## $ X2001.02 : int NA 292400 NA 316200 59700 196100 656800 145800 105300 214400 ...
## $ X2001.03 : int NA 294600 NA 318200 59700 200000 658700 142400 104500 215700 ...
## $ X2001.04 : int NA 297100 NA 320600 59200 202900 661300 138200 104300 218100 ...
## $ X2001.05 : int NA 298200 NA 322900 58700 205500 661100 135200 103900 221100 ...
## $ X2001.06 : int NA 299800 NA 325500 58400 207600 658700 133600 103000 222100 ...
## $ X2001.07 : int NA 302000 NA 328400 57800 208600 653900 133500 103000 222100 ...
## $ X2001.08 : int NA 304200 NA 330700 57000 209200 646800 135400 103500 222900 ...
## $ X2001.09 : int NA 307900 NA 332800 56700 210000 641400 136800 103500 224500 ...
## $ X2001.10 : int NA 311000 NA 334400 56800 210200 636300 136500 103700 226700 ...
## $ X2001.11 : int NA 311400 NA 335900 56600 211300 630300 135300 104000 229700 ...
## $ X2001.12 : int NA 311000 NA 337400 56500 213000 628500 134600 104000 231500 ...
## $ X2002.01 : int NA 311700 NA 339700 56500 214100 633500 135400 104700 232200 ...
## $ X2002.02 : int NA 312300 NA 342300 56600 215200 637300 137700 106800 233100 ...
## $ X2002.03 : int NA 312000 NA 343800 56700 217600 640700 139500 109700 234300 ...
## $ X2002.04 : int NA 311800 NA 343400 56600 220200 647900 139900 112100 235100 ...
## $ X2002.05 : int NA 312600 NA 342300 56700 222400 655500 140600 114300 235500 ...
## $ X2002.06 : int NA 313000 NA 341800 57000 224600 662500 141600 116600 236400 ...
## $ X2002.07 : int NA 314400 NA 341700 57300 227000 670800 141200 118000 238100 ...
## $ X2002.08 : int NA 317300 NA 342400 57300 228600 677500 140600 117800 239300 ...
## $ X2002.09 : int NA 319700 NA 344300 57300 230100 687900 141900 117800 240400 ...
## $ X2002.10 : int NA 320500 NA 346900 57100 232400 698400 142700 118600 242100 ...
## $ X2002.11 : int NA 321000 NA 348900 56900 234300 701000 141700 119900 243700 ...
## $ X2002.12 : int NA 321600 NA 350200 56900 235300 698900 141700 121300 244900 ...
## $ X2003.01 : int NA 323800 NA 351700 57000 236200 697600 143800 121900 247000 ...
## $ X2003.02 : int NA 326100 NA 353500 56700 237000 691800 144100 121100 249500 ...
## $ X2003.03 : int NA 329000 NA 355700 56700 237900 684600 143000 120400 252100 ...
## $ X2003.04 : int NA 332200 NA 358000 57000 239200 681100 142200 120600 254300 ...
## $ X2003.05 : int NA 334700 NA 361600 57400 241300 685100 141400 121200 257200 ...
## $ X2003.06 : int NA 336000 NA 364000 57700 243600 692200 141000 121600 261300 ...
## $ X2003.07 : int NA 337300 NA 365500 58000 244200 697500 141400 122400 265400 ...
## $ X2003.08 : int NA 337500 NA 366400 58300 243800 704200 140800 123900 268400 ...
## $ X2003.09 : int NA 337100 NA 367000 58600 244500 710500 139400 125300 269600 ...
## $ X2003.10 : int NA 334900 NA 365200 58800 245500 717200 139400 126200 268700 ...
## $ X2003.11 : int NA 333100 NA 363100 59100 245700 732300 140200 127600 268100 ...
## [list output truncated]
Since our analysis confines to the properties in the New York city, we can filter the corresponding entries in both “listings” and “zillowdata”
sum(duplicated(listings)) # Checking for duplicated entries
## [1] 0
sum(is.na(listings$zipcode)) # Checking for missing values in the zipcode column
## [1] 0
sum(is.na(listings$square_feet))/nrow(listings) # checking missing values in sq.ft column
## [1] 0.9916556
But looking at the unique values in the listings data would reveal that some missing values are disguised as blank spaces
unique(listings$zipcode) # Blank Spaces Spotted
## [1] 11218 10018 10027 11238 10029
## [6] 10016 11216 10019 10025 10002
## [11] 10036 11215 10014 11211 11205
## [16] 10011 11217 10034 10003 10031
## [21] 11222 11237 11226 10009
## [26] 10024 11225 11101 11221 11206
## [31] 10010 11249 10001 10162 10035
## [36] 10032 10075 11377 11201 11233
## [41] 11231 10023 10128 10040 11234
## [46] 11367 10028 11104 10065 10301
## [51] 10452 10006 10039 11385 10021
## [56] 11434 11379 10026 11105 10044
## [61] 10022 10013 11103 10473 10475
## [66] 11213 10468 11102 10038 10033
## [71] 11427 11693 10012 11375 10470
## [76] 10030 10453 11223 10469 11208
## [81] 11207 10451 11235 10314 11209
## [86] 11204 10305 11232 10017 11373
## [91] 11224 11372 11412 10004 11203
## [96] 11374 10466 11106 10304 11220
## [101] 11230 11210 10303 11229 10463
## [106] 11435 11416 10080 10005 10454
## [111] 10455 11236 10280 10037 11355
## [116] 11692 11411 10282 10307 11219
## [121] 10467 11358 11361 10069 11415
## [126] 11356 11214 11432 10464 11370
## [131] 10302 11418 11433 11429 11422
## [136] 11378 11421 10312 10460 10471
## [141] 10462 10459 11417 10007 11360
## [146] 11249\n11249 11109 11357 10458 11369
## [151] 11212 11691 11354 11368 10306
## [156] 10456 11413 11365 11436 11414
## [161] 11694 11366 10457 10461 10310
## [166] 11420 10308 11419 11103-3233 10474
## [171] 10002-2289 10465 10472 11001 11428
## [176] 11426 11423 11228 10129 11364
## [181] 11385-2308 11363 10281 91766 11362
## [186] 10174 11413-3220 07093 10550 11692-2304
## [191] 11559 10270 10309 07302 10704
## [196] 11003 10020 11021 10705 94085
## 200 Levels: 07093 07302 10001 10002 10002-2289 10003 10004 10005 ... 94085
sum(listings$zipcode == "")/nrow(listings)*100 # To caculate percent of entries with missing zipcodes
## [1] 1.057368
Very small percentage of missing values(1%) . We can ignore the entries with missing zipcodes ony after making sure that the percentage of entries with missing zipcodes is minimal in each and every neighborhood , since our final analysis includes comparision of local averages.
missing_zipcode_entrylist = as.data.frame(table(listings[listings$zipcode == "", "neighbourhood"]))
missing_zipcode_entrylist <- missing_zipcode_entrylist[missing_zipcode_entrylist$Freq != 0, ]
names(missing_zipcode_entrylist) <- c("neighbourhood","Frequency")
missing_zipcode_entrylist <- missing_zipcode_entrylist[order(-missing_zipcode_entrylist$Frequency), ] #List of no. of entries with missing zipcodes w.r.t neighbourhood
#Checking the local missing value percentage in top 3 in the list
df1 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Brooklyn" , ]$Frequency # no. of entries with missing zipcodes in Brooklyn
Brooklyn_percent = df1/nrow(listings[listings$neighbourhood == "Brooklyn" , ])*100 # percent of entries with missing zipcodes in Brooklyn
df2 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Manhattan" , ]$Frequency # no. of entries with missing zipcodes in Manhattan
Manhattan_percent = df2/nrow(listings[listings$neighbourhood == "Manhattan" , ])*100 # percent of entries with missing zipcodes in Manhattan
df3 <- missing_zipcode_entrylist[missing_zipcode_entrylist$neighbourhood == "Queens" , ]$Frequency #no. of entries with missing zipcodes in Queens
Queens_percent = df3/nrow(listings[listings$neighbourhood == "Queens" , ])*100 # percent of entries with missing zipcodes in Queens
Brooklyn_percent
## [1] 1.046878
Manhattan_percent
## [1] 1.038884
Queens_percent
## [1] 1.029464
We can choose to ignore the entries with missing zipcodes since they are negligible and evenly distributed
Since there are no duplicate entries and negligible amount of missing values in the city column , we can filter the listings pertaining to New York City by matching it with the official list of zipcodes
NY_zip_char <- NY_zip %>% mutate(X10001 = as.character(X10001)) # Matching the class of variables containing zipcode in both "listings" and "NY_zip"
listings <- listings %>% mutate(zipcode = as.character(zipcode))
listingsNY <- listings[listings$zipcode %in% NY_zip_char[ , 1], ] # Filtering out data corresponding to zipcodes in New YOrk city
unique(listingsNY$zipcode)
## [1] "11218" "10018" "10027" "11238" "10029" "10016" "11216" "10019"
## [9] "10025" "10002" "10036" "11215" "10014" "11211" "11205" "10011"
## [17] "11217" "10034" "10003" "10031" "11222" "11237" "11226" "10009"
## [25] "10024" "11225" "11101" "11221" "11206" "10010" "11249" "10001"
## [33] "10162" "10035" "10032" "11377" "11201" "11233" "11231" "10023"
## [41] "10128" "10040" "11234" "11367" "10028" "11104" "10301" "10452"
## [49] "10006" "10039" "11385" "10021" "11434" "11379" "10026" "11105"
## [57] "10044" "10022" "10013" "11103" "10473" "10475" "11213" "10468"
## [65] "11102" "10038" "10033" "11427" "11693" "10012" "11375" "10470"
## [73] "10030" "10453" "11223" "10469" "11208" "11207" "10451" "11235"
## [81] "10314" "11209" "11204" "10305" "11232" "10017" "11373" "11224"
## [89] "11372" "11412" "10004" "11203" "11374" "10466" "11106" "10304"
## [97] "11220" "11230" "11210" "10303" "11229" "10463" "11435" "11416"
## [105] "10005" "10454" "10455" "11236" "10280" "10037" "11355" "11692"
## [113] "11411" "10282" "10307" "11219" "10467" "11358" "11361" "10069"
## [121] "11415" "11356" "11214" "11432" "10464" "11370" "10302" "11418"
## [129] "11433" "11429" "11422" "11378" "11421" "10312" "10460" "10471"
## [137] "10462" "10459" "11417" "10007" "11360" "11109" "11357" "10458"
## [145] "11369" "11212" "11691" "11354" "11368" "10306" "10456" "11413"
## [153] "11365" "11436" "11414" "11694" "11366" "10457" "10461" "10310"
## [161] "11420" "10308" "11419" "10474" "10465" "10472" "11428" "11426"
## [169] "11423" "11228" "11364" "11363" "10281" "11362" "10174" "10270"
## [177] "10309" "10020"
listingsNY contains listings pertaining to New YOrk city
sum(duplicated(zillowdata)) # Checking for duplicate entries
## [1] 0
sum(is.na(zillowdata$RegionName)) # Checking for missing values in Region Name column which contains zipcodes
## [1] 0
twobedNY <- zillowdata[zillowdata$RegionName %in% NY_zip[ , 1] , ] #Filtering out available median price data for zipcodes in New York City
twobedNY contains history of Median Price data of properties in New YOrk city
| Revenue_factors | Assumptions | Remarks |
|---|---|---|
| Rent | “zipcode”, “Price”, “monthly_price”,“weekly_price” to be checked | |
| Bedrooms | Rent of a ‘m’ bedroom property is m/n times of that of a ‘n’ bedroom property | “bedrooms” |
| Occupancy | Occupancy assumed to be 75% | So we can filter out measures relating to reviews,host qualities, bookability, web urls and neighborhood , latitude and longitude |
| Cleaning fee | Cleaning fee assumed to cover cleaning costs | Impact on revenue is zero |
| Indirect discounts due to bulk bookings | Assumed to be zero | |
| Special Promotional discounts | Assumed to be zero |
| Investment_factors | Assumptions | Remarks |
|---|---|---|
| Median Price of the property | Latest available Median price data taken as price at the time of investment | “RegionName” and “x2017.06” are to be checked |
| Investment_factors | Assumptions | Remarks |
|---|---|---|
| Median Price of the property | Latest available Median price data taken as price at the time of investment | “RegionName” and “x2017.06” are to be checked |
After the above preliminary analysis
#checking the class of all the critical variables
class(listingsNY$zipcode)
## [1] "character"
class(listingsNY$price)
## [1] "factor"
class(listingsNY$monthly_price)
## [1] "factor"
class(listingsNY$weekly_price)
## [1] "factor"
class(listingsNY$bedrooms)
## [1] "integer"
# changing the variables with rent information from factor to numeric to make it condusive for arithmatic operations on the variables
listingsNY$zipcode <- as.character(listingsNY$zipcode)
listingsNY$monthly_price <- as.character(listingsNY$monthly_price)
listingsNY$monthly_price <- str_replace_all(listingsNY$monthly_price ,"\\$" , "")
listingsNY$monthly_price <- str_replace_all(listingsNY$monthly_price ,"\\," , "")
listingsNY$monthly_price <- as.numeric(listingsNY$monthly_price)
listingsNY$price <- as.character(listingsNY$price)
listingsNY$price <- str_replace_all(listingsNY$price ,"\\$" , "")
listingsNY$price <- str_replace_all(listingsNY$price ,"\\," , "")
listingsNY$price <- as.numeric(listingsNY$price)
listingsNY$weekly_price <- as.character(listingsNY$weekly_price)
listingsNY$weekly_price <- str_replace_all(listingsNY$weekly_price ,"\\$" , "")
listingsNY$weekly_price <- str_replace_all(listingsNY$weekly_price ,"\\," , "")
listingsNY$weekly_price <- as.numeric(listingsNY$weekly_price)
sum(is.na(listingsNY$price))/nrow(listingsNY)*100
## [1] 0
sum(is.na(listingsNY$monthly_price))/nrow(listingsNY)*100
## [1] 89.29476
sum(is.na(listingsNY$weekly_price))/nrow(listingsNY)*100
## [1] 87.63782
sum(is.na(listingsNY$bedrooms))/nrow(listingsNY)*100
## [1] 0.04184188
Since they have high percentage (~89%) of missing values , “monthly_price” and “weekly_price” columns cannot be used for the analysis only “price” is selected as critical parameter
Remove the rows with no values for bedrooms(.04%, so negligible), since equivalent twobed price is calculated by the assumption (Assumption no.4) of direct proportionality of rent with no.of bedrooms
listingsNY <- listingsNY[complete.cases(listingsNY$bedrooms), ] # Selecting rows with info on no.of bedrooms
class(listingsNY$zipcode)
## [1] "character"
unique(listingsNY$zipcode) #Checking for abnormal values
## [1] "11218" "10018" "10027" "11238" "10016" "11216" "10019" "10025"
## [9] "10002" "10036" "11215" "10014" "11205" "10011" "10029" "11217"
## [17] "10034" "10003" "10031" "11222" "11237" "11226" "10009" "10024"
## [25] "11225" "11101" "11221" "11206" "10010" "11211" "11249" "10001"
## [33] "10162" "10035" "10032" "11377" "11201" "11233" "11231" "10023"
## [41] "10128" "10040" "11234" "11367" "10028" "11104" "10301" "10452"
## [49] "10006" "10039" "11385" "10021" "11434" "11379" "10026" "11105"
## [57] "10044" "10022" "10013" "11103" "10473" "10475" "11213" "10468"
## [65] "11102" "10038" "10033" "11427" "11693" "10012" "11375" "10470"
## [73] "10030" "10453" "11223" "10469" "11208" "11207" "10451" "11235"
## [81] "10314" "11209" "11204" "10305" "11232" "10017" "11373" "11224"
## [89] "11372" "11412" "10004" "11203" "11374" "10466" "11106" "10304"
## [97] "11220" "11230" "11210" "10303" "11229" "10463" "11435" "11416"
## [105] "10005" "10454" "10455" "11236" "10280" "10037" "11355" "11692"
## [113] "11411" "10282" "10307" "11219" "10467" "11358" "11361" "10069"
## [121] "11415" "11356" "11214" "11432" "10464" "11370" "10302" "11418"
## [129] "11433" "11429" "11422" "11378" "11421" "10312" "10460" "10471"
## [137] "10462" "10459" "11417" "10007" "11360" "11109" "11357" "10458"
## [145] "11369" "11212" "11691" "11354" "11368" "10306" "10456" "11413"
## [153] "11365" "11436" "11414" "11694" "11366" "10457" "10461" "10310"
## [161] "11420" "10308" "11419" "10474" "10465" "10472" "11428" "11426"
## [169] "11423" "11228" "11364" "11363" "10281" "11362" "10174" "10270"
## [177] "10309" "10020"
# Counting the no.of abnormal values
sum(listingsNY$zipcode == "") #Missing values
## [1] 0
Entries without zipcode information are not useful for our analysis, so I chose to delete the entries with missing zipcodes
listingsNY <- listingsNY[listingsNY$zipcode != "", ] # delete the entries with missing zipcodes
Since equivalent twobed price is calculated by the assumption of direct proportionality of rent with no.of bedrooms , We cannot include listings with 0 bedrooms
unique(listingsNY$bedrooms)
## [1] 1 0 2 4 3 5 6 8 9 7 14 10 11
sum(listingsNY$bedrooms == 0)
## [1] 4402
#deleting entries w which have 0 bedrooms
listingsNY <- filter(listingsNY, bedrooms > 0)
Parameters useful for further analysis are RegionName , x2017.06 (Latest Median Price data)
After filtering critical parameters from both the data , data is merged and new variables useful for the final analysis i.e., “revenue” (Estimated Annual Revenue), “break_even” (No. of years to break even the intial investment) and “ten_year_PI” (Profitability Index for 10 years) are created
criticallistingsNY <- listingsNY[ , c("zipcode","price","bedrooms")] # Critical parameters from New York Listings data
criticaltwobedNY <- twobedNY[ , c("RegionName","X2017.06")] # Critical Parameters from Median Price Data
names(criticaltwobedNY) <- c("zipcode" , "Latest_Median_Price") #Changing the column names to depict its summary
# Changing class of zipcode to match to that in criticallistings
criticaltwobedNY <- criticaltwobedNY %>% mutate(zipcode = as.character(zipcode))
criticallistingsNY <- criticallistingsNY %>% mutate(zipcode = as.character(zipcode))
# Joining the data through left_join
final <- criticallistingsNY %>% left_join(criticaltwobedNY, by = c("zipcode" = "zipcode"))
# Deleting entries which does not have a matching median price value
final <- final[is.na(final$Latest_Median_Price) == FALSE, ]
Addition of variable representing Revenue
zipcode_count = as.data.frame(table(final$zipcode)) #Table with no. of entries with each zipcode
names(zipcode_count) = c("zipcode" , "count")
final$revenue = 0.75 * 365*final$price*(2/final$bedrooms) #Defining revenue variable
test <- final
test <- test %>% group_by(zipcode) %>% transmute(zip_rev = mean(revenue)) # Table with avg revenue from each zipcode
test <- as.data.frame(test)
test = unique(test) # Ensuring single entry for each zipcode
final1 <- final %>% left_join(test , by = "zipcode") #Adding column consisting average revenue by zipcode
final1 <- final1 %>% left_join(zipcode_count , by = "zipcode") #adding column consisting sample size in each zipcode
final1 <- final1[final1$count > 30,] #Eliminating zipcodes with less than 30 entries (Assuming sample size of 30 is significant)
| zipcode | price | bedrooms | Latest_Median_Price | revenue | |
|---|---|---|---|---|---|
| 7 | 10025 | 79 | 1 | 1431000 | 43252.5 |
| 9 | 10025 | 135 | 1 | 1431000 | 73912.5 |
| 10 | 10036 | 85 | 1 | 1712900 | 46537.5 |
| 11 | 11215 | 89 | 1 | 1070800 | 48727.5 |
| 12 | 10025 | 85 | 1 | 1431000 | 46537.5 |
| 13 | 10014 | 120 | 1 | 2491600 | 65700.0 |
rev_by_zipcode <- final1[ , c("zipcode","Latest_Median_Price","zip_rev")]
rev_by_zipcode <- rev_by_zipcode %>% as.data.frame() %>% group_by(zipcode) %>% summarise_at(c("Latest_Median_Price" , "zip_rev"), mean, na.rm = TRUE)
rev_by_zipcode <- rev_by_zipcode %>% mutate(break_even = Latest_Median_Price/zip_rev) %>% mutate(ten_year_PI = (10*zip_rev + Latest_Median_Price)/Latest_Median_Price)
knitr::kable(rev_by_zipcode) %>% kable_styling() # Creation of variables "break_even" and "ten_year_PI"
| zipcode | Latest_Median_Price | zip_rev | break_even | ten_year_PI |
|---|---|---|---|---|
| 10003 | 2147000 | 94566.65 | 22.703564 | 1.440460 |
| 10011 | 2480400 | 109691.11 | 22.612590 | 1.442232 |
| 10013 | 3316500 | 111305.96 | 29.796249 | 1.335613 |
| 10014 | 2491600 | 117884.76 | 21.135896 | 1.473129 |
| 10021 | 1815600 | 84202.87 | 21.562210 | 1.463774 |
| 10022 | 2031600 | 107639.02 | 18.874197 | 1.529824 |
| 10023 | 2142300 | 120019.10 | 17.849659 | 1.560235 |
| 10025 | 1431000 | 70440.79 | 20.314933 | 1.492249 |
| 10028 | 2083900 | 80406.15 | 25.917170 | 1.385845 |
| 10036 | 1712900 | 102563.29 | 16.700908 | 1.598770 |
| 10128 | 1787100 | 76363.62 | 23.402506 | 1.427305 |
| 10304 | 328300 | 38503.20 | 8.526565 | 2.172805 |
| 10305 | 425100 | 35829.10 | 11.864657 | 1.842839 |
| 11201 | 1420700 | 79367.63 | 17.900244 | 1.558652 |
| 11215 | 1070800 | 57284.86 | 18.692547 | 1.534973 |
| 11217 | 1302300 | 67674.22 | 19.243664 | 1.519652 |
| 11231 | 1202900 | 62353.54 | 19.291608 | 1.518360 |
| 11234 | 476900 | 38997.21 | 12.229081 | 1.817723 |
| 11434 | 382300 | 37569.19 | 10.175892 | 1.982715 |
rev_by_zipcode1 <- final1[ , c("zipcode","Latest_Median_Price","zip_rev")]
rev_by_zipcode1 <- rev_by_zipcode1 %>% as.data.frame() %>% group_by(zipcode) %>% summarise_at(c("Latest_Median_Price" , "zip_rev"), mean, na.rm = TRUE)
rev_by_zipcode1 <- rev_by_zipcode1 %>% mutate(break_even = Latest_Median_Price/zip_rev) %>% mutate(ten_year_PI = (10*zip_rev + Latest_Median_Price)/Latest_Median_Price) %>% mutate(ten_yrs = 10*zip_rev - Latest_Median_Price) %>% mutate(fifteen_yrs = 15*zip_rev - Latest_Median_Price) %>% mutate(twenty_yrs = 20*zip_rev - Latest_Median_Price) %>% mutate(twentyfive_yrs = 25*zip_rev - Latest_Median_Price) %>% mutate(thirty_yrs = 30*zip_rev - Latest_Median_Price)
knitr::kable(rev_by_zipcode) %>% kable_styling() # Creation of variables for profit after 10, 15, 20 ,25 yrs
| zipcode | Latest_Median_Price | zip_rev | break_even | ten_year_PI |
|---|---|---|---|---|
| 10003 | 2147000 | 94566.65 | 22.703564 | 1.440460 |
| 10011 | 2480400 | 109691.11 | 22.612590 | 1.442232 |
| 10013 | 3316500 | 111305.96 | 29.796249 | 1.335613 |
| 10014 | 2491600 | 117884.76 | 21.135896 | 1.473129 |
| 10021 | 1815600 | 84202.87 | 21.562210 | 1.463774 |
| 10022 | 2031600 | 107639.02 | 18.874197 | 1.529824 |
| 10023 | 2142300 | 120019.10 | 17.849659 | 1.560235 |
| 10025 | 1431000 | 70440.79 | 20.314933 | 1.492249 |
| 10028 | 2083900 | 80406.15 | 25.917170 | 1.385845 |
| 10036 | 1712900 | 102563.29 | 16.700908 | 1.598770 |
| 10128 | 1787100 | 76363.62 | 23.402506 | 1.427305 |
| 10304 | 328300 | 38503.20 | 8.526565 | 2.172805 |
| 10305 | 425100 | 35829.10 | 11.864657 | 1.842839 |
| 11201 | 1420700 | 79367.63 | 17.900244 | 1.558652 |
| 11215 | 1070800 | 57284.86 | 18.692547 | 1.534973 |
| 11217 | 1302300 | 67674.22 | 19.243664 | 1.519652 |
| 11231 | 1202900 | 62353.54 | 19.291608 | 1.518360 |
| 11234 | 476900 | 38997.21 | 12.229081 | 1.817723 |
| 11434 | 382300 | 37569.19 | 10.175892 | 1.982715 |
write.csv(rev_by_zipcode1,"rev_by_zipcode1.csv")
Zipcode 10304 is profitable after 10, 15 and 20 yrs but zipcode 10023 is profitable after 25 yrs
## # A tibble: 19 x 5
## zipcode Latest_Median_Price zip_rev break_even ten_year_PI
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10304 328300 38503. 8.53 2.17
## 2 11434 382300 37569. 10.2 1.98
## 3 10305 425100 35829. 11.9 1.84
## 4 11234 476900 38997. 12.2 1.82
## 5 10036 1712900 102563. 16.7 1.60
## 6 10023 2142300 120019. 17.8 1.56
## 7 11201 1420700 79368. 17.9 1.56
## 8 11215 1070800 57285. 18.7 1.53
## 9 10022 2031600 107639. 18.9 1.53
## 10 11217 1302300 67674. 19.2 1.52
## 11 11231 1202900 62354. 19.3 1.52
## 12 10025 1431000 70441. 20.3 1.49
## 13 10014 2491600 117885. 21.1 1.47
## 14 10021 1815600 84203. 21.6 1.46
## 15 10011 2480400 109691. 22.6 1.44
## 16 10003 2147000 94567. 22.7 1.44
## 17 10128 1787100 76364. 23.4 1.43
## 18 10028 2083900 80406. 25.9 1.39
## 19 10013 3316500 111306. 29.8 1.34
Quickest break_even happens for zipcodes 10304 , 10312, 10305
Zipcodes 10304 , 11434 , 10305 have the highest PI after 10 years
Zillow dataset has 8946 observations of 262 variables. It has median monthly prices of all properties on Zillow grouped by zip code from April 1996 till June 2017. Lots of median price records have no values corresponding to them. But, because of our assumption that time value of money discount rate is 0%, we can take the latest available median price data and assign this value as median price corresponding to that zip code. These values are stored in column ‘Latest_Median_Price’. The next step of our analysis is to extract records corresponding to the zip codes present in New York City. To perform this task, Zillow data is filtered on the field ‘City’ whose value is ‘New York’. We observe that there are only 25 records in the filtered data. There are about 200 zip codes in NYC. Compared to that, this value is low. This implies that the Zillow data is incomplete, due to which, our analysis is restricted to only these 25 zip codes. Only, the columns ‘RegionName’ and ‘price’ are retained for further analysis. This dataset is named ‘twobedNY’.
Airbnb listings dataset has 48895 obs. of 106 variables Most of the variables present in this dataset are of no interest to us. After preliminary analysis , the variables of interest are ‘zipcode’, ‘price’ and ‘bedrooms’. Extract only those records whose zip codes match with the zip codes that we obtain from Zillow dataset. At the same time, filter out the records with zero number of bedrooms. Now, convert the character value in ‘price’ column into numeric column by removing ‘$’ and ‘,’ characters.
Now, merge both the datasets based on zip codes.After the merge , zipcodes with less than 30 as sample size have been removed from the analysis. Next step is to convert this daily price into annual revenue. We assume that annual occupancy is 75%. So, to get the revenue generated by a 2-bed room property, we multiply ‘price’ by 0.75365(2/n) to get annual revenue generated by that property, where n is the number of bedrooms present in that property. This will give us an estimate of the median value of price of a 2-bedroom property in a zip code and median value of the revenue that is generated from all the properties listed on Airbnb in that zip code.
\(Time to break even = Investment/Annual Revenue\)
Now, in each zip code, we calculate the number of years it takes for the property to break-even on the amount spent on purchasing it. This can simply be calculated by dividing cost of the property by annual revenue generated by the property. The column having these values is named ‘break_even’. This dataset is named ‘rev_by_zipcode’. The zip codes that requires the minimum amount of time to break even are the ones that are most profitable.
\(Profitability Index(PI) = (Present value of future cash flows)/(Intial Investment)\)
Profitability Index represents the relationship between the costs and benefits of a proposed project. It is the ratio of present value of future cash flows and intial investment.Here , since time value of money discount rate is 0 . Assuming the properties sell at the same price after 10 years of being in the rental business.
Present value of future cash flows of 10 years = 10 x Annual revenue + Latest Median Price of the property = 10 x zip_rev + Latest_Median_Price
Intial Investment = Latest_Median_Price
Quickest break_even point happens for zipcodes 10309 (Tottenville) , 10312(Eltingville) , 10304(Stapleton Heights) in that order
Zipcodes 10309 (Tottenville) , 10312(Eltingville) , 10304(Stapleton Heights) have the highest PI after 10 years in that order
Since it is a 4 hour data challenge , some ideas which when implemented would help inform a better decision to the company. I list some below.
Since last few year price data is available to us, we can do time series analysis to better predict the current median price and use this for profitability analysis
R shiny can be used to build interactive dashboards , drop downs and slider bars for data filtering
Since median price data is available only for 25 zipcodes of New York city, more data can be procured for zipcodes where data is unavailable
An areal heat map can be generated to represent our results.
For listings without zipcodes, zipcodes can be estimated by applying clustering for longitude and latitude data , and they too can be included in our analysis.