logo

Problem Statement

You are consulting for a real estate company that has a niche in purchasing properties to rent out short-term as part of their business model specifically within New York City. The real estate company has already concluded that two bedroom properties are the most profitable; however, they do not know which zip codes are the best to invest in.

Objective

Find properties based on zipcodes in the New York city which would yield the maximum return on investment (ROI)

Assumptions

  1. Occupancy rate has been assumed to be constant throughout the year irrespective of the fact that it can change due to holidays
  2. Review score rating has been used to adjudicate the occupancy rate as it is the overall score based on
    • accuracy
    • cleanliness
    • check-in
    • communication
    • location
    • value
  3. Cost of property has been predicted using time series forecasting from the Zillow Data set. Extraneous variables like natural calamities, slowdown in economy which can affect the cost of the property have been ignored.
  4. Mean and median prices have been taken to calculate the breakeven period.
  5. The data was last scraped in the third quarter of the FY 2017 and the analysis is done in 2019. Factors like holiday effects, superhost, location, amenities etc. could effect the price variation for short term rentals and hence prediction of rental price is not done.

Metadata Created

  1. predicted_price -This column contains the predicted cost price from the zillow data set. In the data set only the price as of 1 August 2019 is taken and is in dollar amount
  2. occupancy_score -Same as review_scores_rating
  3. occupancy_rate -Percentage occupancy of the airbnb listing. It is represented as intervals.
  4. breakeven_years -Time it takes for the property to return it’s cost price. This is also known as breakeven period and it is taken in the form of years

Exploratory data analysis

1. Analyzing Zillow data set

Zillow Data
RegionID RegionName City State Metro CountyName SizeRank X1996.04 X2017.06
61639 10025 New York NY New York New York 1 NA 1431000
84654 60657 Chicago IL Chicago Cook 2 167700 385100
61637 10023 New York NY New York New York 3 NA 2142300
84616 60614 Chicago IL Chicago Cook 4 195800 412200
93144 79936 El Paso TX El Paso El Paso 5 59100 83700
84640 60640 Chicago IL Chicago Cook 6 123300 255100

Zillow has 262 dimensions with 8946 observations.

Dimensions Zillow
8946
262

There are just 3 missing values in the relevant columns. We can remove these rows.

Following graph shows the trend in the price.
The forecasting method is inspired from facebook’s open source Prophet package. Prediction has been solely done to see the price trends.

  1. Cost Price trend of zipcode 11231
  2. Cost Price trend of zipcode 11217

The plots show an increasing trend in cost prices.

2. Exploring AIRBNB data

Dimensions Airbnb
40753
95

AirBNB has 95 dimensions with 40753 observations

AIRBNB data
id neighbourhood_group_cleansed city state zipcode property_type bedrooms price review_scores_rating
7949480 Bronx Bronx NY 10464 House 1 $99.00 100
16042478 Bronx Bronx NY 10464 Apartment 1 $200.00 NA
1886820 Bronx Bronx NY 10464 House 3 $300.00 NA
6627449 Bronx City Island NY 10464 Apartment 1 $125.00 93
5557381 Bronx Bronx NY 10464 House 1 $69.00 97
9147025 Bronx Bronx NY 10464 House 0 $125.00 97

Price and other columns associated with money are listed in dollars, For example, $50. In order to do analysis, the dollar sign is stripped off of the entire column

Since the data was loaded with the option string as Factors =FALSE, data type manipulation is done
URL columns would not be useful, therefore removing these columns .

Missing value analysis

Missing value analysis
missing.values
has_availability 40753
license 40753
monthly_price 40342
square_feet 40186
weekly_price 34336
security_deposit 24136
cleaning_fee 12692
review_scores_value 10039
review_scores_location 10037
review_scores_checkin 10036
review_scores_accuracy 10017
review_scores_communication 9986
review_scores_cleanliness 9985
review_scores_rating 9924
reviews_per_month 9299
host_listings_count 251
host_total_listings_count 251
price 198
bathrooms 159
beds 79
bedrooms 69
host_about 1

Columns ‘license’ and ‘has availability’ have no data. So, these columns can be removed.

Distribution of bedroom’s variable

  1. 69 missing values or 0.17%
  2. frequency distribution shows that there are 3525 properties which are having 0 bedrooms.

Data integrity of Zipcodes

Zipcode should be of length 5

Number of zipcodes having correct length
zipcode
40135
Number of zipcodes having incorrect length
zipcode
618

There are columns which are disseminating no information. Analyzing columns for unique values.

Distinct Values
experiences_offered 1
requires_license 1

From distinct value analysis, columns ‘requires_license’ and ‘experiences_offered’ have no information at all

Since the consultancy company has already identified that 2 bedroom properties are the most profitable the Airbnb dataset is filtered for 2 bedrooms

Code can be reused based on consulting company’s requirements .If they want to analyze 1 bedroom properties in ‘Chicago’ the variables can be modified as
no_of_bedrooms=1
city=‘Chicago’

Property Type

Majority of properties are of Apartment type which was expected in a city like New York. We would have factored in property_type had there been an equitable distribution.

Data Integrity and Data Quality

After doing exploratory data analysis and data munging, we can comment on the data quality and data integrity

I concur with the statement “Bad data is worse than no data”.

  1. Missing values : There were a few columns in which missing values were present. The variables important for our analysis are
    • review_score_rating : occupancy_rate is directly dependent on the score
    • price : There are 41 properties which do not have their price listed.
  2. Data Quality : Some zipcodes had length other than 5 which questions the data credibility
  3. Quality Check of data was performed by
    • removing inconsistencies : zipcodes of abnormal length, missing values in price columns, rating columns
    • conversion into suitable data type : There are certain columns which were reformatted for calculation,data type was changed from character to factor, or from factor to numeric based on the charting requirements as well as limitation of the tool at hand.
  4. Uniqueness/Variance : Some of the columns had a low count of unique values. Although low count of unique values are a distinguished feature of binary variables(0 or 1, TRUE or FALSE, this or that) but there should not be a single value i.e no change in variable throughout the dataset. Such columns have been removed or not considered for analysis
  5. Timeliness: The data was last scraped in the third quarter of the FY 2017 and the analysis is done in 2019. Prediction could have been done (is done but not used) for Zillow properties but we don’t have 2019 Airbnb data our analysis is restricted to FY 2017

Visualizations of the properties versus price and other parameters

There are few zipcodes which have over 70 properties. The red dotted line indicates the arbitrary y-axis intercept at 70

The boxplot shows

  1. Three phases can be seen in the cost price

Average price is decreasing from 320 to 70

Median price is decreasing from 300 to 70

Averages are bit higher than median price suggesting averages are being pulled over because of outliers.

The distribution of airbnb price is right skewed. The red dotted line indicates the arbitrary x-axis intercept at 750 which is relatively higher compared to other Airbnb prices and corroborated by the outlier plot.

It is skewed towards the left which means that majority of the properties have received a score greater than 50

Which properties to invest in ?

Formula used to calculate ROI

\[Breakeven-Period=\frac{CostPrice}{ ShortTerm Rental Price* Occupancy Rate * Time Period }\]

Time Period is taken as year

Occupancy rate based on review score and number of reviews
occupancy_score occupancy_rate
75-100 75%
50-75 65%
25-50 55%
0-25 45%
ROI using the mean price in years
zipcode breakeven_return_in_years avg_airbnb_price avg_cost_price count
10304 13.21511 97.00000 328300 2
10306 14.96976 93.00000 352900 2
11234 17.67554 107.33333 476900 6
11434 17.98308 99.44444 382300 9
10305 22.80642 111.22222 425100 9
10308 23.33333 65.00000 409500 1
11231 24.77342 202.83562 1202900 73
11217 25.52693 214.89535 1302300 86
11215 25.66153 181.22695 1070800 141
10036 26.40323 317.11321 1712900 108
10025 27.86024 257.17431 1431000 112
10022 29.42387 298.80556 2031600 38
11201 30.84704 209.47059 1420700 68
10023 31.92834 284.69841 2142300 63
10011 32.08772 343.69000 2480400 102
10014 35.15971 304.32258 2491600 95
10003 36.63234 288.14394 2147000 133
10128 36.76921 225.37500 1787100 56
10021 38.22762 200.16667 1815600 19
10028 39.90868 270.46875 2083900 32
10013 46.97057 331.65823 3316500 82
10312 NaN 215.00000 355000 1
ROI using the median price in years
zipcode breakeven_return_in_years median_airbnb_price median_cost_price count
10304 13.21511 97.0 328300 2
11434 14.15926 95.0 382300 9
10306 14.96976 93.0 352900 2
10305 15.90348 120.0 425100 9
11234 16.05724 110.0 476900 6
10036 21.21764 300.0 1712900 108
10308 23.33333 65.0 409500 1
10025 24.09091 225.0 1431000 112
11231 24.75103 190.0 1202900 73
11215 24.78704 162.0 1070800 141
10022 25.16537 299.5 2031600 38
11217 25.38596 192.5 1302300 86
11201 27.84049 189.0 1420700 68
10023 28.25346 275.0 2142300 63
10011 29.16402 304.5 2480400 102
10021 29.88642 187.5 1815600 19
10128 30.08586 220.0 1787100 56
10003 31.93515 249.5 2147000 133
10014 32.17883 299.0 2491600 95
10028 38.59074 239.5 2083900 32
10013 41.08138 300.0 3316500 82
10312 NA 215.0 355000 1

For the purpose of decision making, sufficient data points are needed to calculate ROI, hence zipcodes with less than 20 properties are filtered out.

* NA in the table is due to missing review_score_rating. Since the count of such properties was 1, it was not imputed but rather left as it is.

Visualing the ROI against the zipcodes

Top 5 zipcodes to invest in are the same for the two metrics which is Median and Average.

Conclusion

The ROI is less for some zipcodes but there are not many properties. We atleast need a few properties to be able to make a firm decision.

Therefore the zipcodes to invest in are

Future Steps

  1. There are over 35,000 observations besides 2 bedrooms hence further analysis can be done.
  2. Factors such as transportation, proximity to work location can be used to map the area for long-term rentals
  3. NY being a financial capital has a scope for long term rentals. Hence, the properties can be rented out on a long term basis as well.
  4. There are 618 rows which have improper length zipcodes. These can be imputed using the latitude and longitudes
  5. Crime score is an important factor in determining the price. Crime score is freely available online and can be used for deeper analysis.
  6. Majority of the tourists book short term rentals in NY city hence holidays (variable pricing can be done ) and closeness to tourist locations should be accounted for.