GitHub Repo

Introduction

Airbnb has been one of the most successful companies since its founding in 2008. According to the Airbnb Newsroom, there are currently more than 7 million listings in more than 100,000 cities across more than 191 countries and regions. As one of the most popular cities in the world, New York City has been one of the hottest markets for Airbnb. With roughly 50,000 listings in the city, Airbnb has become interwoven with the rental landscape within about 10 years of its inception. Analyses of such a dataset would not only provide intuition about rental metrics but also shed some light on the socio-economic setting of the city.

The aim of the project is to analyse the New York City Airbnb dataset and uncover insights into the sharing economy in one of the biggest cities in the world. The tasks involve developing business intelligence both for hosts, who are listing their apartments, and for guests, who are using them to meet their accommodation requirements.

The project tries to answer the following questions, which are split into three broad sections:

  • Insights into Airbnb
    • How has Airbnb presence grown over the years?
    • How costly are the Airbnb rates in the neighbourhoods across the five boroughs?
    • How badly did the Covid-19 crisis affect Airbnb?
  • Insights for Hosts
    • What should be the rental value if you want to list your property with Airbnb?
    • What are the pain points that a guest finds in Airbnb?
  • Insights for Customers
    • What are the top 10 listing recommendations based on customer constraints?

Data Description

The dataset is second-hand: it is taken from Inside Airbnb, which provides a non-commercial set of tools and data for exploring how Airbnb is really being used in cities around the world. The New York Airbnb dataset was compiled on 6 May 2020. We used three files for our analysis:

  • listings.csv – contains 106 variables for 50,246 listings. It includes details such as price, apartment details, ratings, number of rooms, neighbourhood and host information.
  • calendar.csv – includes the daily rates of the listings for up to one year ahead. This file was used to project prices during the holiday season.
  • reviews.csv – includes the reviews of each listing posted by guests. This file was primarily used for text analytics.

Analysis

Airbnb: How has Airbnb presence grown over the years?

As the most populous city in the U.S., New York City has over 50,000 Airbnb listings as of May 2020. The bar plot shows that new listings in NYC increased steadily from 2008 to 2015. After 2015, new listings started to decline and averaged around 4,000 per year up until last year.

The geo plot shows the landscape of Airbnb listings over the years. A quick glance reveals that Manhattan and North Brooklyn, around the East River, are the areas most densely populated with Airbnb listings.

Airbnb: How costly are the Airbnb rates in the neighbourhoods across the five boroughs?

Since Airbnb rates are not necessarily quoted per person, it makes sense to standardize them to a rate per guest. Furthermore, quantities involving price or income tend to be right-skewed (outliers on the higher end), so the median is considered the best measure of central tendency. To reduce the skew, the logarithm of the listing price per guest is taken. The box plots for the five boroughs in NYC illustrate this. Manhattan's Airbnb rates are in line with what its high cost of living would lead one to expect.
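The standardization described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the sample rows, the tuple layout, and the function names are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical sample rows: (borough, nightly price in USD, guests included).
listings = [
    ("Manhattan", 300.0, 2),
    ("Manhattan", 150.0, 1),
    ("Manhattan", 90.0, 1),
    ("Brooklyn", 120.0, 2),
    ("Brooklyn", 80.0, 1),
    ("Queens", 70.0, 1),
]

def median_by_borough(rows):
    """Return {borough: (median rate per guest, median log rate per guest)}."""
    rates = defaultdict(list)
    for borough, price, guests in rows:
        # Standardize the nightly price to a single guest.
        rates[borough].append(price / guests)
    return {
        borough: (median(values), median([math.log(v) for v in values]))
        for borough, values in rates.items()
    }

summary = median_by_borough(listings)
for borough, (rate, log_rate) in summary.items():
    print(f"{borough}: median rate {rate:.1f}, median log rate {log_rate:.3f}")
```

The median is taken per borough on both scales, so the same summary can feed the raw-rate and log-rate box plots.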

[Figure: box plots by borough. Panels: Log Rate; Rate]

To analyse neighbourhood pricing within each borough in more depth, a heat map of prices is plotted for neighbourhoods with at least 5 listings. This provides crucial insights into the median price range of the neighbourhoods. The grey areas in the heat map are neighbourhoods with fewer than 5 listings. Most neighbourhoods in Staten Island have fewer than 5 listings, probably because of its suburban nature. The region around the East River, including North Brooklyn and all of Manhattan, is the costliest place to rent an Airbnb and also has the greatest number of listings.

Costly neighbourhoods in Manhattan with median rate

Airbnb:

| Borough | Neighbourhood | Count | Median Price per Guest (USD) |
|---|---|---|---|
| Manhattan | NoHo | 84 | 179.5 |
| Manhattan | Tribeca | 195 | 179.0 |
| Manhattan | Midtown | 1699 | 175.0 |
| Manhattan | West Village | 761 | 162.5 |
| Manhattan | Murray Hill | 496 | 157.0 |

Zumper:

Costly neighbourhoods in Brooklyn with median rate

Airbnb:

| Borough | Neighbourhood | Count | Median Price per Guest (USD) |
|---|---|---|---|
| Brooklyn | Brooklyn Heights | 146 | 130 |
| Brooklyn | Navy Yard | 13 | 130 |
| Brooklyn | DUMBO | 35 | 125 |
| Brooklyn | Sea Gate | 13 | 125 |
| Brooklyn | Vinegar Hill | 29 | 120 |

Zumper:

Zumper has mapped NYC neighbourhood rents for winter 2019; the maps show median 1-bedroom rents in Brooklyn and Manhattan. Places like DUMBO, Vinegar Hill, Brooklyn Heights, Downtown Brooklyn, and Fort Greene are the costlier neighbourhoods in Brooklyn. Similarly, places like Tribeca, Battery Park, SoHo, West Village, and Chelsea are costlier in Manhattan. This represents the real estate setting of the New York boroughs. These places form the New York skyline and are a hub for intercultural and financial activities. Both Zumper (real estate setting) and Airbnb (rental landscape) paint the same picture.

Airbnb: How badly did the Covid-19 crisis affect Airbnb?

[Figure: daily rate graphs. Panels: May 15, 2020 - Jun 15, 2020; Dec 1, 2019 - Jan 1, 2020]

The graphs show two contrasting scenarios. As of September 12, 2019, an average person had to pay an extra 13-17% for accommodation during New Year's Week, booking almost three months in advance. Fast forward to May 06, 2020, roughly five months after New Year's Week, and the situation has changed dramatically. What was expected to be the peak summer season for Airbnb rentals now looks far worse. Covid-19 has halted most economic and recreational activity, and isolation has become the new norm. The travel and hospitality industries are among the worst affected. As of May 06, 2020, hosts have cut rents by more than 20% relative to what was charged during New Year's Week, and that with immediate availability.

Hosts: What should be the rental value if you want to list your property with Airbnb?

A new host would like to know at what price their property can be listed with Airbnb. This analysis gives new hosts crucial information for estimating their listing price based on certain attributes.

The parameters chosen fall into two groups:

  • Geographical importance:
    • Borough
    • Neighbourhood
  • Listing attributes:
    • Property Type
    • Room Type
    • Number of Bedrooms
    • Number of Bathrooms
    • Number of Guests included

Since many of the parameters are categorical variables, such as borough name, neighbourhood name, property type, and room type, we proceed with a multilevel linear regression model to predict the price.
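One way to write such a model down, as a sketch only: the notation below is chosen here for illustration and is not taken from the original analysis. It assumes the response is the log of the nightly rate per guest and that each categorical variable enters through level-specific effects, with the effect of the base level fixed at zero.

```latex
\log\!\left(\frac{\text{rate}_i}{\text{guests}_i}\right)
  = \beta_0
  + \alpha_{\text{borough}(i)}
  + \gamma_{\text{property}(i)}
  + \delta_{\text{room}(i)}
  + \beta_1\,\text{bedrooms}_i
  + \beta_2\,\text{bathrooms}_i
  + \beta_3\,\text{guests}_i
  + \varepsilon_i
```

Here $i$ indexes listings and $\varepsilon_i$ is the error term. In the variant with neighbourhood effects, the borough term $\alpha_{\text{borough}(i)}$ would be replaced by a neighbourhood-level term.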


EDA and Data Cleaning

Before fitting a linear model, a careful examination of the dependent and explanatory variables is necessary to check whether they meet linear model assumptions such as normality. The histograms show that a log transformation reduced the skewness to a great extent, but removing outliers was also necessary to meet the normality assumption. A couple of filters are also applied as part of cleaning, so that there are enough observations for each combination of categorical variables. We therefore apply the following filters to the raw dataset.

  • Neighbourhood: >= 5 listings
  • Property Type: >= 100 listings
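The two filters above can be sketched as follows. This is a minimal illustration: the dictionary keys and the toy rows are hypothetical, and taking both counts on the unfiltered data is an assumption, since the order of filtering is not stated.

```python
from collections import Counter

def apply_min_count_filters(rows, min_neighbourhood=5, min_property_type=100):
    """Keep rows whose neighbourhood and property type are both frequent enough.

    Counts are computed once on the unfiltered rows (an assumption).
    """
    neighbourhood_counts = Counter(r["neighbourhood"] for r in rows)
    property_counts = Counter(r["property_type"] for r in rows)
    return [
        r for r in rows
        if neighbourhood_counts[r["neighbourhood"]] >= min_neighbourhood
        and property_counts[r["property_type"]] >= min_property_type
    ]

# Toy data with lowered thresholds so the effect is visible.
toy_rows = (
    [{"neighbourhood": "Harlem", "property_type": "Apartment"}] * 3
    + [{"neighbourhood": "Sea Gate", "property_type": "Apartment"}]
    + [{"neighbourhood": "Harlem", "property_type": "Boat"}]
)
kept = apply_min_count_filters(toy_rows, min_neighbourhood=2, min_property_type=3)
print(len(kept), "of", len(toy_rows), "rows kept")
```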

[Figure: histograms. Panels: Log Rate without Outliers; Log Rate; Rate]

As seen earlier, the median rates differ across boroughs, so their individual effects need to be considered. Similarly, neighbourhoods within each borough differ in median rate per guest per night, as shown. Although the rates in the various neighbourhoods vary in a similar way around their borough averages, it is important to check whether the neighbourhood effects are stronger than the borough effects.

[Figure: median rate per guest per night by neighbourhood. Panels: Manhattan; Brooklyn; Bronx; Queens; Staten Island]

Around 80% of all the properties listed with Airbnb are apartments, in keeping with the company's core idea of lodging and homestays. Hotels and serviced apartments tend to be costlier, in line with the general notion. Airbnb also offers private and shared rooms as cheaper accommodation options, with shared rooms accounting for only 2% of all registered listings.

[Figure panels: Property Type; Room Type]

As a result of cleaning and filtering, the number of records in the dataset is reduced by around 1,500. The base levels are:

  • Borough: Manhattan
  • Neighbourhood: Harlem
  • Property Type: Apartment
  • Room Type: Entire home/Apartment


Regression Results

One important consideration in choosing regressors is keeping the model simple to explain. Running a multilevel linear regression on the dataset with borough effects results in an adjusted R-squared of 56.55%.

| R-Squared | Adjusted R-Squared | AIC | BIC |
|---|---|---|---|
| 0.5657243 | 0.565495 | 37898.79 | 38067.54 |

The signs of the coefficients on boroughs, room types, number of bedrooms, number of bathrooms and number of guests are as expected, and the coefficients are significant. Except for guest suite and house, the coefficients of the other property types are positive and significant.

Although the previous model explains the variation fairly well, we also need to check whether model fit improves significantly when neighbourhood effects are considered instead of borough effects. With neighbourhood effects included, the adjusted R-squared increases to 62.84%. The AIC and BIC values also decrease. The signs and magnitudes of the coefficients on boroughs, room types, number of bedrooms and number of bathrooms remain almost the same. The coefficients for guest suite and townhouse are similar to that of apartment.

| R-Squared | Adjusted R-Squared | AIC | BIC |
|---|---|---|---|
| 0.6306414 | 0.6284191 | 32747.82 | 34485.92 |

Both forward and backward selection chose the model with neighbourhood effects as the best; hence, this full model is used for prediction. Even with a relatively low R-squared, statistically significant p-values continue to identify relationships, and the coefficients have the same interpretation.

Performance Metrics for Training: 70% split

| RMSE | MAE | R2 |
|---|---|---|
| 0.3886954 | 0.2969978 | 0.6306414 |

Performance Metrics for Test: 30% split

| RMSE | MAE | R2 |
|---|---|---|
| 0.3872014 | 0.2941435 | 0.6320344 |
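For reference, the three metrics reported above can be computed as in the following sketch. The function name and the sample values are hypothetical; since the model predicts the logarithm of price, the reported RMSE and MAE are presumably in log units.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R-squared) for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical log-scale values, for illustration only.
y_true = [4.0, 4.5, 5.0, 5.5]
y_pred = [4.2, 4.4, 5.1, 5.2]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```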

Prediction:

Below is the user interface in which hosts enter their parameters to get a suggested price, together with a 95% prediction interval, at which they can register their listing.
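Because the model is fitted on the log scale, a suggested price and its interval have to be transformed back with the exponential. The sketch below shows only that step, under simplifying assumptions: it uses a normal approximation with a single residual standard deviation and ignores the uncertainty in the estimated coefficients, so it is not the exact interval a regression package would report. The function name and the input values are hypothetical.

```python
import math

def back_transform_interval(pred_log, residual_sd, z=1.96):
    """Approximate 95% interval on the original scale from a log-scale prediction.

    Assumes normally distributed errors and ignores coefficient uncertainty.
    """
    half_width = z * residual_sd
    return (
        math.exp(pred_log - half_width),
        math.exp(pred_log),
        math.exp(pred_log + half_width),
    )

# Hypothetical inputs: a log-scale prediction of 4.5 and a residual SD of 0.39,
# a rough stand-in close to the RMSE reported above.
low, point, high = back_transform_interval(pred_log=4.5, residual_sd=0.39)
print(f"suggested: {point:.0f}, interval: {low:.0f} to {high:.0f}")
```

The interval is symmetric on the log scale, so after back-transformation it is multiplicative: the upper bound lies further from the point estimate than the lower bound does.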




Best Explainable Model

The regressors chosen for the linear regression are also used in machine learning models, namely 1) tree-based models (Decision Tree, Random Forest, AdaBoosted Decision Tree, and XGBoost, a gradient boosting framework) and 2) a neural network model, to predict the logarithm of the price variable. Grid search is used to choose the hyper-parameters that give the lowest RMSE (equivalently, the greatest negative RMSE) under 5-fold cross-validation. The 70% training set is used to fit each model and the 30% test set is used to evaluate its performance. The following tables list the best hyper-parameters of each model and its performance metrics on the test set: RMSE, MAE and R-squared.

Tree-based Models

| Model | Hyper-Parameters | Negative CV RMSE | Test RMSE | Test MAE | Test R-Squared |
|---|---|---|---|---|---|
| Decision Tree | max_depth=15, min_samples_leaf=8 | -0.3947 | 0.3905 | 0.2981 | 0.6268 |
| Random Forest | n_estimators=100 | -0.3959 | 0.3866 | 0.2914 | 0.6343 |
| AdaBoosted Decision Tree | n_estimators=50, learning_rate=0.1 | -0.4412 | 0.4400 | 0.3424 | 0.5262 |
| XGBoost (GBM Framework) | objective=reg:squarederror, learning_rate=0.1, n_estimators=100, max_depth=5, min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8, scale_pos_weight=1 | - | 0.3806 | 0.2896 | 0.6455 |

Refer to the Jupyter Notebook.
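The selection procedure (a grid of candidate hyper-parameters, each scored by negative RMSE averaged over 5 folds) can be illustrated without any library. The sketch below is not the project's code: it swaps in a one-variable ridge regression as a stand-in model, uses a simple strided fold assignment, and runs on made-up data. In practice a library routine such as scikit-learn's `GridSearchCV` with `scoring="neg_root_mean_squared_error"` would likely be used instead.

```python
import math

def fit_ridge_line(xs, ys, lam):
    """Fit y = intercept + slope * x with an L2 penalty `lam` on the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return mean_y - slope * mean_x, slope

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cv_neg_rmse(xs, ys, lam, k=5):
    """Mean negative RMSE over k folds (fold f holds out indices f, f+k, ...)."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held_out]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in held_out]
        intercept, slope = fit_ridge_line([x for x, _ in train], [y for _, y in train], lam)
        preds = [intercept + slope * x for x, _ in test]
        scores.append(-rmse([y for _, y in test], preds))
    return sum(scores) / k

def grid_search(xs, ys, grid, k=5):
    """Return the grid value with the greatest mean negative RMSE, plus all scores."""
    scores = {lam: cv_neg_rmse(xs, ys, lam, k) for lam in grid}
    return max(scores, key=scores.get), scores

# Made-up, exactly linear data: the unpenalized fit should win.
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x for x in xs]
best_lam, cv_scores = grid_search(xs, ys, [0.0, 10.0, 1000.0])
print("best penalty:", best_lam)
```

Maximizing negative RMSE is the same as minimizing RMSE; the sign flip only lets a generic "higher is better" search be reused.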

Neural Network Model

Hyper-Parameters:

  • batch_size=100, epochs=10
  • Hidden Layer 1: neurons=10, kernel_initializer=normal, activation=ReLU
  • Hidden Layer 2: neurons=5, kernel_initializer=normal, activation=ReLU
  • Output Layer: neurons=1, kernel_initializer=normal, activation=ReLU
  • Compiler: optimizer=Adam(learning_rate=0.1), loss=MeanSquaredError, metrics=MeanSquaredError

| Test RMSE | Test MAE | Test R-Squared |
|---|---|---|
| 0.3890 | 0.2969 | 0.6327 |

Refer to the Jupyter Notebook.

All the models performed similarly to linear regression on the test set. XGBoost with a tree-based booster has the best test metrics. However, the XGBoost model did not perform significantly better than the multilevel linear regression model. In terms of explainability and interpretability, linear regression edges out the tree-based and neural network models. The linear model's test R-squared is similar to its training R-squared, which suggests that the model generalizes. Therefore, the linear regression model is chosen as the final model to predict listing rates.

Hosts: What are the pain points that a guest finds in Airbnb?

It is imperative for hosts to understand customer expectations. Since most Airbnb hosts are part of the informal sector of the hospitality industry, it is important for them to provide service on par with that of the formal sector. Reviews give hosts feedback on how the stay was and what can be improved, if necessary. Text analytics on the reviews of listings with poor ratings (i.e., ratings below 50%) would provide crucial insights into bad customer experiences.

The bar plot shows that customers tend to give high ratings, since people generally like to say good things. A bad rating therefore suggests that there are major issues with the Airbnb rental. Around 300 listings have net ratings below 50%.

For this task, reviews are tokenized, lemmatized, and stripped of stop words as part of data cleaning. A TF-IDF matrix is constructed from the processed reviews. The word cloud reveals that the word 'host' appears, possibly hinting at a disconnect between the customer and the host.
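To make the TF-IDF step concrete, here is a minimal sketch on already-tokenized reviews. The tokens are invented for illustration, and the weighting is an assumption: term frequency is the within-review share of a term, and the inverse document frequency uses the smoothed form ln((1 + N) / (1 + df)) + 1, which keeps a word that occurs in every review (such as 'host') from being zeroed out. The original analysis may have used a different variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists, assumed already lemmatized and stop-word free.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights

# Invented tokens, for illustration only.
reviews = [
    ["host", "cancel", "reservation"],
    ["host", "dirty", "room"],
    ["host", "rude", "cancel"],
]
weights = tf_idf(reviews)
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that are rare across reviews get larger weights than terms that appear everywhere, which is what lets a word cloud built from these weights surface distinctive complaints.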

A better-designed word cloud with a sentiment factor could give superior insights and help create guidelines for onboarding new hosts that advise them of potential do's and don'ts.

Customers: What are the top 10 listing recommendations based on customer constraints?

As a customer, one would like to get recommendations for a given budget and other constraints, such as the number of bedrooms and the number of guests included.

For this analysis, the 100 listings closest to the centers of the neighbourhoods the user has selected are chosen first. They are then ranked by the Euclidean distance to the user's constraints, calculated on three scaled parameters: number of bedrooms, number of guests included, and rate of the listing per day. Standardization ensures that the data is internally consistent, i.e., that each variable has an equal influence on the recommendation. The rate variable needs some care: distances are calculated on the log-transformed rate per single guest, i.e., the entire rate is divided by the number of guests included in the listing record and then log-transformed. The top 10 records are then recommended to the customer.
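The ranking step can be sketched as below. This is an illustration under assumptions, not the project's implementation: the candidates are taken as already restricted to the nearby listings, the dictionary keys and sample values are hypothetical, and the user's constraints are scaled with the mean and standard deviation of the candidate set.

```python
import math
from statistics import mean, pstdev

def features(listing):
    """Bedrooms, guests included, and log of the daily rate per guest."""
    return (
        listing["bedrooms"],
        listing["guests"],
        math.log(listing["rate"] / listing["guests"]),
    )

def recommend(candidates, query, top_n=10):
    """Rank candidates by Euclidean distance to the query on standardized features."""
    rows = [features(c) for c in candidates]
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has no spread, to avoid dividing by zero.
    spreads = [pstdev(col) or 1.0 for col in columns]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, spreads)]

    target = scale(features(query))
    ranked = sorted(
        zip(rows, candidates),
        key=lambda pair: math.dist(scale(pair[0]), target),
    )
    return [candidate for _, candidate in ranked[:top_n]]

# Hypothetical candidates and customer constraints.
candidates = [
    {"id": "A", "bedrooms": 1, "guests": 2, "rate": 150.0},
    {"id": "B", "bedrooms": 3, "guests": 6, "rate": 600.0},
    {"id": "C", "bedrooms": 1, "guests": 2, "rate": 160.0},
    {"id": "D", "bedrooms": 2, "guests": 4, "rate": 200.0},
]
query = {"bedrooms": 1, "guests": 2, "rate": 150.0}
top = recommend(candidates, query, top_n=2)
print([c["id"] for c in top])
```

Standardizing before measuring distance keeps the feature with the largest raw range from dominating the ranking.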

Below is the user interface in which the customer chooses parameters and sees the suggestions on a map. The larger the bubble, the better the match. The populated table contains detailed information about the recommended listings, in decreasing order of priority.


Conclusion

The massive dataset has a lot of insights to offer, and what has been presented in this report is only the tip of the iceberg. The dataset provided key insights into how Airbnb grew in New York City, especially in the boroughs of Manhattan and Brooklyn. The rental landscape painted the same picture as the real estate setting of New York. Insights into various listing attributes led to the development of a multilevel linear regression model that helps hosts list their new properties within a suitable price range. Text analytics on the reviews of low-rated listings suggested that customers hate it when hosts do not honor their commitments and cancel reservations. For customers, the top 10 Airbnb rental recommendations were suggested based on their constraints. However, during these testing times the hospitality sector has been badly hit. For hosts who occasionally rent out their spare room in the style of a real bed & breakfast, the Airbnb income lost due to Covid-19 is a frustration.