You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.
Find properties based on zipcodes in the New York city which would yield the maximum return on investment (ROI)
| RegionID | RegionName | City | State | Metro | CountyName | SizeRank | X1996.04 | X2017.06 |
|---|---|---|---|---|---|---|---|---|
| 61639 | 10025 | New York | NY | New York | New York | 1 | NA | 1431000 |
| 84654 | 60657 | Chicago | IL | Chicago | Cook | 2 | 167700 | 385100 |
| 61637 | 10023 | New York | NY | New York | New York | 3 | NA | 2142300 |
| 84616 | 60614 | Chicago | IL | Chicago | Cook | 4 | 195800 | 412200 |
| 93144 | 79936 | El Paso | TX | El Paso | El Paso | 5 | 59100 | 83700 |
| 84640 | 60640 | Chicago | IL | Chicago | Cook | 6 | 123300 | 255100 |
Zillow has 262 dimensions with 8946 observations.
| Dimensions Zillow |
|---|
| 8946 |
| 262 |
There are just 3 missing values in the relevant columns. We can remove these rows.
Following graph shows the trend in the price.
The forecasting method is inspired from facebook’s open source Prophet package. Prediction has been solely done to see the price trends.
The plots show an increasing trend in cost prices.
| Dimensions Airbnb |
|---|
| 40753 |
| 95 |
AirBNB has 95 dimensions with 40753 observations
| id | neighbourhood_group_cleansed | city | state | zipcode | property_type | bedrooms | price | review_scores_rating |
|---|---|---|---|---|---|---|---|---|
| 7949480 | Bronx | Bronx | NY | 10464 | House | 1 | $99.00 | 100 |
| 16042478 | Bronx | Bronx | NY | 10464 | Apartment | 1 | $200.00 | NA |
| 1886820 | Bronx | Bronx | NY | 10464 | House | 3 | $300.00 | NA |
| 6627449 | Bronx | City Island | NY | 10464 | Apartment | 1 | $125.00 | 93 |
| 5557381 | Bronx | Bronx | NY | 10464 | House | 1 | $69.00 | 97 |
| 9147025 | Bronx | Bronx | NY | 10464 | House | 0 | $125.00 | 97 |
Price and other columns associated with money are listed in dollars, For example, $50. In order to do analysis, the dollar sign is stripped off of the entire column
Since the data was loaded with the option string as Factors =FALSE, data type manipulation is done
URL columns would not be useful, therefore removing these columns .
| missing.values | |
|---|---|
| has_availability | 40753 |
| license | 40753 |
| monthly_price | 40342 |
| square_feet | 40186 |
| weekly_price | 34336 |
| security_deposit | 24136 |
| cleaning_fee | 12692 |
| review_scores_value | 10039 |
| review_scores_location | 10037 |
| review_scores_checkin | 10036 |
| review_scores_accuracy | 10017 |
| review_scores_communication | 9986 |
| review_scores_cleanliness | 9985 |
| review_scores_rating | 9924 |
| reviews_per_month | 9299 |
| host_listings_count | 251 |
| host_total_listings_count | 251 |
| price | 198 |
| bathrooms | 159 |
| beds | 79 |
| bedrooms | 69 |
| host_about | 1 |
Columns ‘license’ and ‘has availability’ have no data. So, these columns can be removed.
Distribution of bedroom’s variable
Data integrity of Zipcodes
Zipcode should be of length 5
| zipcode |
|---|
| 40135 |
| zipcode |
|---|
| 618 |
There are columns which are disseminating no information. Analyzing columns for unique values.
| Distinct Values | |
|---|---|
| experiences_offered | 1 |
| requires_license | 1 |
From distinct value analysis, columns ‘requires_license’ and ‘experiences_offered’ have no information at all
Since the consultancy company has already identified that 2 bedroom properties are the most profitable the Airbnb dataset is filtered for 2 bedrooms
Code can be reused based on consulting company’s requirements .If they want to analyze 1 bedroom properties in ‘Chicago’ the variables can be modified as
no_of_bedrooms=1
city=‘Chicago’
Property Type
Majority of properties are of Apartment type which was expected in a city like New York. We would have factored in property_type had there been an equitable distribution.
After doing exploratory data analysis and data munging, we can comment on the data quality and data integrity
I concur with the statement “Bad data is worse than no data”.
There are few zipcodes which have over 70 properties. The red dotted line indicates the arbitrary y-axis intercept at 70
The boxplot shows
Average price is decreasing from 320 to 70
Median price is decreasing from 300 to 70
Averages are bit higher than median price suggesting averages are being pulled over because of outliers.
The distribution of airbnb price is right skewed. The red dotted line indicates the arbitrary x-axis intercept at 750 which is relatively higher compared to other Airbnb prices and corroborated by the outlier plot.
It is skewed towards the left which means that majority of the properties have received a score greater than 50
\[Breakeven-Period=\frac{CostPrice}{ ShortTerm Rental Price* Occupancy Rate * Time Period }\]
Time Period is taken as year
| occupancy_score | occupancy_rate |
|---|---|
| 75-100 | 75% |
| 50-75 | 65% |
| 25-50 | 55% |
| 0-25 | 45% |
| zipcode | breakeven_return_in_years | avg_airbnb_price | avg_cost_price | count |
|---|---|---|---|---|
| 10304 | 13.21511 | 97.00000 | 328300 | 2 |
| 10306 | 14.96976 | 93.00000 | 352900 | 2 |
| 11234 | 17.67554 | 107.33333 | 476900 | 6 |
| 11434 | 17.98308 | 99.44444 | 382300 | 9 |
| 10305 | 22.80642 | 111.22222 | 425100 | 9 |
| 10308 | 23.33333 | 65.00000 | 409500 | 1 |
| 11231 | 24.77342 | 202.83562 | 1202900 | 73 |
| 11217 | 25.52693 | 214.89535 | 1302300 | 86 |
| 11215 | 25.66153 | 181.22695 | 1070800 | 141 |
| 10036 | 26.40323 | 317.11321 | 1712900 | 108 |
| 10025 | 27.86024 | 257.17431 | 1431000 | 112 |
| 10022 | 29.42387 | 298.80556 | 2031600 | 38 |
| 11201 | 30.84704 | 209.47059 | 1420700 | 68 |
| 10023 | 31.92834 | 284.69841 | 2142300 | 63 |
| 10011 | 32.08772 | 343.69000 | 2480400 | 102 |
| 10014 | 35.15971 | 304.32258 | 2491600 | 95 |
| 10003 | 36.63234 | 288.14394 | 2147000 | 133 |
| 10128 | 36.76921 | 225.37500 | 1787100 | 56 |
| 10021 | 38.22762 | 200.16667 | 1815600 | 19 |
| 10028 | 39.90868 | 270.46875 | 2083900 | 32 |
| 10013 | 46.97057 | 331.65823 | 3316500 | 82 |
| 10312 | NaN | 215.00000 | 355000 | 1 |
| zipcode | breakeven_return_in_years | median_airbnb_price | median_cost_price | count |
|---|---|---|---|---|
| 10304 | 13.21511 | 97.0 | 328300 | 2 |
| 11434 | 14.15926 | 95.0 | 382300 | 9 |
| 10306 | 14.96976 | 93.0 | 352900 | 2 |
| 10305 | 15.90348 | 120.0 | 425100 | 9 |
| 11234 | 16.05724 | 110.0 | 476900 | 6 |
| 10036 | 21.21764 | 300.0 | 1712900 | 108 |
| 10308 | 23.33333 | 65.0 | 409500 | 1 |
| 10025 | 24.09091 | 225.0 | 1431000 | 112 |
| 11231 | 24.75103 | 190.0 | 1202900 | 73 |
| 11215 | 24.78704 | 162.0 | 1070800 | 141 |
| 10022 | 25.16537 | 299.5 | 2031600 | 38 |
| 11217 | 25.38596 | 192.5 | 1302300 | 86 |
| 11201 | 27.84049 | 189.0 | 1420700 | 68 |
| 10023 | 28.25346 | 275.0 | 2142300 | 63 |
| 10011 | 29.16402 | 304.5 | 2480400 | 102 |
| 10021 | 29.88642 | 187.5 | 1815600 | 19 |
| 10128 | 30.08586 | 220.0 | 1787100 | 56 |
| 10003 | 31.93515 | 249.5 | 2147000 | 133 |
| 10014 | 32.17883 | 299.0 | 2491600 | 95 |
| 10028 | 38.59074 | 239.5 | 2083900 | 32 |
| 10013 | 41.08138 | 300.0 | 3316500 | 82 |
| 10312 | NA | 215.0 | 355000 | 1 |
For the purpose of decision making, sufficient data points are needed to calculate ROI, hence zipcodes with less than 20 properties are filtered out.
* NA in the table is due to missing review_score_rating. Since the count of such properties was 1, it was not imputed but rather left as it is.
Visualing the ROI against the zipcodes
Top 5 zipcodes to invest in are the same for the two metrics which is Median and Average.
The ROI is less for some zipcodes but there are not many properties. We atleast need a few properties to be able to make a firm decision.
Therefore the zipcodes to invest in are