Deep Dive into Airbnb Listing Data

Hang Yuan, Xinghui Song, Tuo Wang and Hanbo Sun


1 Questions

Even though it was founded no more than 10 years ago, Airbnb has risen as a major competitor to traditional hotels, with the potential to even revolutionize the rental market as well as the way people travel. One of the main reasons people choose Airbnb over traditional hotels is that Airbnb hosts can offer accommodation with great amenities at affordable prices. In this project, we will focus on analyzing a wide range of variables in the dataset and how they affect the listing price.

Amenities are a major selling point for housing. In our dataset, the amenities of each listing are thoroughly listed. We take advantage of this by expanding this variable into a sparse matrix of 0’s and 1’s that indicate the presence of each amenity. With the transformed dataset, we aim to build models that use the available features to predict the housing price. The models built in this process can serve as a benchmark to estimate the standard price of a unit given its features. Moreover, we can rank variable importance during this process and see which amenities play a bigger role in the pricing of a unit.

This report also includes exploratory data analyses that offer insight into how listing price varies geographically, and uses natural language processing techniques to uncover the relationship between listing price and listing names.

2 Non-Technical Executive Summary

2.1 Data

First, we merged the listing and demographic datasets to compare the key features of each metropolitan area. Surprisingly, even though LA has the largest average household income, its average price is not the highest. We rescaled the values of the four key features into the range [1, 2] for the sake of visualization and comparison.

We also included some additional columns from the Inside Airbnb website. These data follow the same schema, so they can be joined with ease. The additional data give us more information, such as review counts.

Metropolitan  Population  Average Price  Households  Mean Household Income
Asheville       22267315         125.36     9425819               61160.80
Austin         301151218         293.63   127771136               85036.04
LA            1151835013         179.55   466832778               96268.10
Nashville       87927985         196.64    35098801               69812.71
New Orleans    122339020         200.03    55164985               64908.06
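
The rescaling of key features into [1, 2] can be illustrated with a short sketch. The report does not state which transformation was used, so linear min-max scaling is an assumption here, and the function name `rescale_to_range` is hypothetical. The prices are the Average Price column of the table above.

```python
def rescale_to_range(values, low=1.0, high=2.0):
    """Linearly map values so that the minimum lands on `low` and the maximum on `high`.

    Min-max scaling is an assumption; the report only says the features
    were shrunk into the range [1, 2].
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: place every point at the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]


# Average price per metropolitan area, copied from the table above.
average_price = {
    "Asheville": 125.36,
    "Austin": 293.63,
    "LA": 179.55,
    "Nashville": 196.64,
    "New Orleans": 200.03,
}

scaled = dict(zip(average_price, rescale_to_range(list(average_price.values()))))
```

Applying the same function to each of the four features puts them on a common axis, so that a single bar plot can compare quantities whose raw magnitudes differ by several orders.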

Four key features bar plot

Next, we focused on the Asheville and Austin metropolitan areas to get a clear picture of the distribution of Airbnb listings, popular businesses and hotels. The point size represents the price. We plot popular businesses such as restaurants and gas stations as purple squares. Clearly, the popular businesses are located along the roads, especially the highways. Listings are denser in the business center than anywhere else. We also plot the hotel locations in Asheville, because hotels are the main competitors of Airbnb listings. Notice that some areas with a concentration of hotels have very few Airbnb listings, and that hotels are usually close to business concentrations while Airbnb housing may not be.

Rental activity in Asheville

In Austin, however, the distribution pattern is not the same as in Asheville. Nearly all the Airbnb houses are located in the center of the metropolitan area, while most of the hotels are rather far from the center. Popular businesses are, as usual, located along the roads, especially the highways. Overall, the price of an Airbnb house is higher the closer it is to the center of the metropolitan area. Unlike Asheville, which has two main concentrations of Airbnb houses, Austin has only one.

Rental activity in Austin

2.2 Analyzing keywords in listing names

Unlike the usual business naming conventions, the names of Airbnb listings are usually rather verbose, as hosts tend to pack as many of the amenities and attractions of their housing into the names as possible. Names like “Venice Beach and Canals Art House” and “Lovely Private Home - Far East Side” follow a fairly common format on Airbnb. The names alone can therefore often provide us with ample information.

Here we look at the most frequent words that appear in listing names. This gives us an idea of the hosts’ favorite keywords when they pitch their houses to Airbnb users. Moreover, since pricing is one of the major focuses of this project, we want to investigate how those keywords are associated with pricing.

Figure 1 shows the top 12 words used in Airbnb listing names within our dataset. Note that uninformative words like “the” are removed from this ranking. From this bar chart, we can notice a few things. Words like “in” and “to” may seem uninformative at first, but they are usually used in contexts like “in downtown area” and “5 min to nearby beach”. Both suggest that hosts very often put the location’s appeal in the listing name. We can also see this from words like “downtown”, “hollywood” and “beach” in this ranking. Such names can attract visitors with ease since they save users the trouble of looking up the location on the map. “Private” and “cozy” are the most used adjectives in listing names. Other popular adjectives include “beautiful”, “lovely” and “spacious”. The popularity of “private” seems a bit counter-intuitive, as many Airbnb users would simply select private rooms in the filter options at the start of their search. Certain Airbnb policies might instruct hosts to make sure the private/shared information is clear to users. If not, whether putting “private” in the name affects users’ choices would be a meaningful question for an A/B test.
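
The word ranking described above can be sketched as follows. This is a minimal illustration, not the pipeline used for Figure 1: the stop-word list and the tokenization rule are assumptions, since the report only says that uninformative words like “the” were removed, and the last two listing names are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the report does not say which words were removed.
STOP_WORDS = {"the", "a", "an", "and", "of", "with"}


def top_words(names, n=12, stop_words=STOP_WORDS):
    """Return the n most frequent lower-cased words across all listing names."""
    counts = Counter()
    for name in names:
        # Assumed tokenization: runs of letters and digits, case-insensitive.
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


listing_names = [
    "Venice Beach and Canals Art House",    # from the report
    "Lovely Private Home - Far East Side",  # from the report
    "Private room in downtown",             # made-up example
    "Cozy studio 5 min to the beach",       # made-up example
]

top = top_words(listing_names, n=3)
```

Note that “in” and “to” are deliberately left out of the stop-word list here, which matches the observation that these words carry location information in listing names.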

Word Frequency in higher and lower price listing

To discover the relationship between listing names and price, we separate listings into “Top” and “Bottom” categories based on whether their price is above or below the median in their respective metropolitan area. For each word, we compute the base-2 log ratio of the number of times it appears in the “Top” category over the number of times it appears in the “Bottom” category. A positive log ratio indicates that the word is more often associated with higher-priced listings, while a negative log ratio indicates the opposite. According to Figure 2, generic adjectives like “cute”, “quiet”, “cozy”, “clean” and “comfy” are some of the most commonly used words in lower-priced listings. “Private”, one of the most common words in listing names, also belongs to this category. Compared with popular words in the “Top” category like “ocean”, “luxury”, “pool” and “modern”, it seems that lower-priced listings usually don’t have many premium features to boast about, so their hosts more often use words that attract visitors looking for a comfortable budget stay.
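
The log-ratio computation can be sketched as below. Several details are assumptions not stated in the report: listings priced exactly at the median are placed in “Bottom”, and an additive smoothing term (`smoothing`, hypothetical) avoids division by zero for words that occur in only one category. The example listings and prices are made up.

```python
import math
import re
from collections import Counter
from statistics import median


def tokenize(name):
    """Assumed tokenization: lower-cased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", name.lower())


def log_ratios(listings, smoothing=1.0):
    """Compute log2(top_count / bottom_count) for every word.

    `listings` is a list of (metro, name, price) tuples. Each listing is
    compared with the median price of its own metropolitan area.
    """
    prices_by_metro = {}
    for metro, _, price in listings:
        prices_by_metro.setdefault(metro, []).append(price)
    medians = {m: median(p) for m, p in prices_by_metro.items()}

    top, bottom = Counter(), Counter()
    for metro, name, price in listings:
        # Assumption: a price equal to the median counts as "Bottom".
        target = top if price > medians[metro] else bottom
        target.update(tokenize(name))

    words = set(top) | set(bottom)
    # Additive smoothing is an assumption to keep the ratio finite.
    return {
        w: math.log2((top[w] + smoothing) / (bottom[w] + smoothing))
        for w in words
    }


# Made-up listings for illustration only.
listings = [
    ("LA", "Luxury ocean view villa", 400.0),
    ("LA", "Cozy private room", 60.0),
    ("LA", "Modern loft with pool", 250.0),
    ("LA", "Cute cozy studio", 80.0),
]
ratios = log_ratios(listings)
```

In this toy data “luxury” appears only above the median and gets a positive score, while “cozy” appears only below it and gets a negative one, mirroring the sign convention described above.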

Words list in high and low price

We also generated word clouds to show word frequencies in popular listings (those with a review count higher than 6) and regular listings. In the word cloud of the popular listings, “modern” and “cozy” are two of the most frequent words, whereas the common words in other listings are more spread out and no word stands out in particular.

Words list in high and low price

Words list in high and low price

3 Technical Executive Summary

3.1 Classification Hotness

First, let’s define evaluation metrics:

  • True Positive (TP): Number of observations that are correctly classified as the “Fall” group.
  • True Negative (TN): Number of observations that are correctly classified as the “Non-Fall” group.
  • False Positive (FP): Number of observations that are incorrectly classified as the “Fall” group.
  • False Negative (FN): Number of observations that are incorrectly classified as the “Non-Fall” group.
  • Sensitivity (SENS) & Specificity (SPEC): Sensitivity measures the proportion of “Fall” observations that are correctly classified, while specificity measures the proportion of “Non-Fall” observations that are correctly classified.
  • Positive Predictive Value (PPV) & Negative Predictive Value (NPV): Positive Predictive Value measures the proportion of true “Fall” observations among predicted “Fall” observations. Similarly, Negative Predictive Value measures the proportion of true “Non-Fall” observations among predicted “Non-Fall” observations.

Methods              Accuracy  Sensitivity  Specificity    PPV    NPV    LOR
Logistic Regression     0.664        0.506        0.825  0.747  0.621  1.577
Random Forest           0.687        0.669        0.705  0.698  0.677  1.578
SVM                     0.688        0.651        0.725  0.707  0.671  1.592
AdaBoost                0.681        0.688        0.674  0.683  0.679  1.517
XGBoost                 0.685        0.675        0.694  0.692  0.677  1.553
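
The metrics defined above can be computed from the four confusion-matrix counts, as in the sketch below. The LOR column is not defined in the report; this sketch assumes it is the natural log of the diagnostic odds ratio, ln((TP × TN) / (FP × FN)), which is close to the reported values (for Logistic Regression, ln(0.506/0.494 × 0.825/0.175) ≈ 1.57). The counts in the example are made up.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    Zero counts (for example fp * fn == 0) are not handled in this sketch.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of true positives recovered
        "specificity": tn / (tn + fp),   # share of true negatives recovered
        "ppv": tp / (tp + fp),           # precision of positive predictions
        "npv": tn / (tn + fn),           # precision of negative predictions
        # Assumption: LOR is the natural-log diagnostic odds ratio.
        "lor": math.log((tp * tn) / (fp * fn)),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=50, tn=80, fp=20, fn=50)
```

Under this reading, the log odds ratio summarizes sensitivity and specificity in one number, which may explain why the five methods score so similarly on LOR even though they trade sensitivity against specificity differently.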

Performance plot for classification

3.2 Prediction Rental Price

We would like to show both Airbnb and housing providers which underlying factors matter most to rental pricing. This knowledge gives housing providers concrete suggestions for increasing the value of their rentals, and it helps Airbnb predict its revenue.

In the given dataset, the pricing of Airbnb rentals is affected along two dimensions: one is the housing’s intrinsic features, the other is external factors.

The intrinsic features cover a large variety of the housing’s internal characteristics. Property type is one example: a boutique hotel obviously commands a higher price than a normal apartment. A set of variables representing the rentals’ intrinsic value can be found in the Listings dataset. The amenities variable deserves specific detail. It contains a list of the amenities available in the property. We aggregated all types of amenities and built a sparse matrix in which a value of one indicates that the house has the amenity and a value of zero indicates that it does not. We ended up with 100 amenity variables, each representing an intrinsic feature of the rental, such as whether the housing is equipped with cooking basics or whether there is an elevator in the building.
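
The expansion of the amenities variable into 0/1 indicator columns can be sketched as follows. The raw string format `{TV,"Cooking basics",Elevator}` is an assumption about how the amenities column is stored, and the function names are hypothetical. For readability the sketch builds a dense list of lists; with 100 columns and many listings, a sparse container such as `scipy.sparse.csr_matrix` would be a natural choice in practice.

```python
import csv


def parse_amenities(raw):
    """Parse a string such as '{TV,"Cooking basics",Elevator}' into a set of names.

    The brace-delimited, comma-separated format is an assumption.
    """
    inner = raw.strip().strip("{}")
    if not inner:
        return set()
    # csv.reader handles quoted names that contain spaces or commas.
    row = next(csv.reader([inner]))
    return {item.strip() for item in row if item.strip()}


def amenity_indicator_matrix(raw_values):
    """Return (column_names, matrix) with 1 where a listing has the amenity."""
    parsed = [parse_amenities(r) for r in raw_values]
    columns = sorted(set().union(*parsed))
    matrix = [[1 if c in amenities else 0 for c in columns] for amenities in parsed]
    return columns, matrix


# Made-up listings for illustration only.
raw_amenities = [
    '{TV,"Cooking basics",Elevator}',
    '{"Indoor fireplace",TV}',
    "{}",
]
columns, matrix = amenity_indicator_matrix(raw_amenities)
```

Each row of `matrix` corresponds to one listing and each column to one amenity, so the result can be appended to the other listing features as predictors.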

To understand how intrinsic features affect a rental’s pricing, we trained a prediction model of the rental price on the amenity features we extracted. We chose an Elastic Net model and conducted five-fold cross-validation to find the optimal parameter. The final model gave a Root Mean Squared Error (RMSE) of 191.45 and a Mean Absolute Error (MAE) of 99.98.
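
For reference, the two error measures can be computed as in the sketch below. The prices and predictions are made up and do not come from the report.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices and predictions for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

example_rmse = rmse(actual, predicted)
example_mae = mae(actual, predicted)
```

RMSE is never smaller than MAE, and it grows faster when a few residuals are very large. An RMSE roughly twice the MAE, as reported above, would be consistent with a small number of listings whose prices are badly mispredicted, although the report does not examine the residuals.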

By checking each amenity feature’s correlation with price, we obtained an importance ranking of the amenity features. Figure 9 shows the top five, among which whether a house has an indoor fireplace is the most important intrinsic feature.
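
A correlation-based ranking of this kind can be sketched as below. The report does not say which correlation measure was used or how features were ordered; this sketch assumes the Pearson correlation and ranks by absolute value. The feature columns and prices are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Constant column: the correlation is undefined, treated as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(features, price):
    """Return [(name, r)] sorted by |r| in descending order.

    `features` maps an amenity name to its 0/1 indicator column.
    """
    scores = {name: pearson(col, price) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up data for illustration only.
price = [300, 120, 280, 90, 150]
features = {
    "Indoor fireplace": [1, 0, 1, 0, 0],
    "Elevator": [0, 1, 0, 1, 0],
    "TV": [1, 1, 1, 1, 1],
}
ranking = rank_by_correlation(features, price)
```

A ranking like this measures each amenity in isolation, so an amenity can rank highly simply because it tends to appear in larger or better-located properties.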

On the other hand, rental pricing is also greatly affected by external factors, especially location and customer reviews. For any business, location is always one of the key factors. An apartment in Manhattan can cost more to rent than a house in the countryside. From the Listings dataset, we can include location-based variables such as zipcode, longitude, latitude and metropolitan to give our model more predictive power in the location dimension. Customer reviews are a critical component of the Internet sharing economy. Nowadays people tend to choose where to lodge based on others’ reviews. Rentals and hosts that are popular and frequently reviewed in the Airbnb community would naturally support higher prices. For this reason, we take the review score variables from the Listings dataset into consideration in order to improve the prediction results.

In our final model, we included the intrinsic features and external factors mentioned above as predictors, with the rental price as the response variable. We chose Linear Regression with forward and backward stepwise feature selection, Ridge Regression, Elastic Net, Neural Network and XGBoost models, and conducted five-fold cross-validation to find the optimal parameters for these models. We compared the final results as follows:

As the comparison of the models shows, the XGBoost model gave both the lowest Root Mean Squared Error (RMSE) and the lowest Mean Absolute Error (MAE). The results show that we arrive at a predictive model that can help housing providers by suggesting appropriate ways to increase the value of their rentals, and help Airbnb predict its revenue.
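
The five-fold cross-validation used to tune the models can be sketched in plain Python. To keep the example self-contained, a one-feature ridge regression without intercept stands in for the actual models; it is not the Elastic Net or XGBoost setup of the report, whose parameter grids are not given. All function names, the candidate grid and the data are hypothetical.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for y ~ w * x (stand-in model, no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def cv_rmse(x, y, alpha, k=5):
    """RMSE pooled over the held-out folds of a k-fold cross-validation."""
    squared_errors = []
    for held_out in k_fold_indices(len(x), k):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], alpha)
        squared_errors.extend((y[i] - w * x[i]) ** 2 for i in held_out)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def select_alpha(x, y, grid, k=5):
    """Pick the penalty with the lowest cross-validated RMSE."""
    return min(grid, key=lambda a: cv_rmse(x, y, a, k))


# Made-up, noise-free data: price is exactly 50 times the feature value.
x = list(range(1, 11))
y = [50 * v for v in x]
best_alpha = select_alpha(x, y, grid=[0.0, 1.0, 10.0])
```

Because the toy data are noise-free, the unpenalized fit is exact and the search selects a penalty of zero; with real listing data the selected value would depend on the noise and on the candidate grid.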

Performance plot for classification

Performance plot for classification