Executive Summary

Airbnb is a popular online marketplace company focused on offering lodging, accommodations, and tourism experiences for travelers around the world. In this final project, I aimed to predict Airbnb listing prices in the city of Chicago using a variety of machine learning methodologies, including random forest regression, XGBoost, and a neural network. Overall, random forest regression was the best performing model, with an average test error (mean squared error) of 0.2393. With this in mind, I showed that a more complex deep learning model is not necessary for this regression prediction problem.

Through random forest regression and XGBoost, it was determined that the top five most important features were the following:

  1. The number of people the property accommodates
  2. Room type (e.g. whole house or private room)
  3. Cleaning fees
  4. Walk score
  5. The price per additional guest

Overall, for practical applications, these models can give a host an optimal price they should charge for their new listing. On the consumer side, this will help travelers determine whether or not the listing price they see is “worth it”.


Model Fitting

For this final project, I explored a variety of machine learning methodologies, including bagging, random forest, XGBoost, and neural networks. For each model, I used k-fold cross-validation (5 folds) as my primary resampling method. Because this is a regression prediction problem, the model performance evaluation metric is mean squared error (MSE).
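As a rough illustration of the resampling scheme described above, the sketch below splits row indices into 5 folds and averages the held-out MSE. It is a minimal standard-library sketch, not the code used for this project; the helper names (`k_fold_indices`, `cross_val_mse`, `fit_mean`) and the mean-only baseline are assumptions made for the example.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n, k=5, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_mse(xs, ys, fit, k=5, seed=0):
    """Average held-out MSE over k folds.

    `fit(train_x, train_y)` must return a function mapping x to a prediction.
    """
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(xs[i]) for i in held_out]
        scores.append(mse([ys[i] for i in held_out], preds))
    return sum(scores) / k


def fit_mean(train_x, train_y):
    """Hypothetical baseline model: always predict the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean
```

Any model can be plugged in through `fit`, so the same loop yields a comparable average test error for each candidate.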

The data for my final project was sourced from Insideairbnb.com, a site that scrapes Airbnb listing data from multiple cities around the world. From this website, I downloaded and combined detailed Airbnb listing information for Chicago, Illinois as of April 2020. Originally, the plan was to combine data across 10 major US cities. However, due to computational constraints, I decided to scale down the project to only one city. Nevertheless, this Chicago dataset can still be used as a proof of concept.

Overall, my final cleaned dataset had 39 predictors with 8,183 observations. The codebook is available in the appendix for predictor explanations.

Bagging and Random Forests

First, I built bagging and random forest models, tuning the mtry parameter. Below is a plot of training, test, and OOB error versus mtry. At first glance, it seems like an mtry between 7 and 10 works pretty well, with any mtry above 10 showing diminishing returns in test error.

Here are the results of my bagging and random forest models ordered by test error. It looks like the best performing model was a random forest model with an mtry of 8. This model has an overall average test_error across folds of 0.2393. Not bad for a starting point.
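For readers unfamiliar with mtry: it is the number of predictors drawn at random as candidates for each tree split. Setting it to the full number of predictors gives bagging, and anything smaller gives a random forest. The sketch below only illustrates that sampling step; the function name and the placeholder feature names are assumptions for the example, not part of the fitted models.

```python
import random


def candidate_features(feature_names, mtry, rng):
    """Return the features a single tree split is allowed to consider.

    mtry == len(feature_names) reproduces bagging (every feature is a
    candidate); a smaller mtry gives a random forest.
    """
    if not 1 <= mtry <= len(feature_names):
        raise ValueError("mtry must be between 1 and the number of features")
    return rng.sample(feature_names, mtry)


# Placeholder names standing in for the 39 predictors in the cleaned data.
features = [f"x{i}" for i in range(39)]
rng = random.Random(42)

forest_split = candidate_features(features, 8, rng)    # random forest, mtry = 8
bagging_split = candidate_features(features, 39, rng)  # bagging, mtry = p
```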

Below are the top 10 most important features selected by my best performing random forest model.

The most important features the random forest model selected in predicting Airbnb listing prices are quite intuitive. Given the main options one considers when browsing the Airbnb site for a booking, it is not surprising that the number of people the listing accommodates, location, and reviews are in the top ten.


XGBoost

Next, I built an XGBoost machine learning model, tuning the learn_rate (learning rate) parameter. XGBoost is quite a popular method used by many people (especially in Kaggle competitions) given its superior model performance across different applications. Below is a plot of training and test error versus learning_rate. Again, at first glance, it seems like a learning_rate below the 0.05 threshold works pretty well, with any learning_rate above 0.10 showing diminishing returns in test error.

Here are the results of my tuned XGBoost models ordered by test error. It looks like the best performing model was one with a learning_rate of 0.0217. This model has an overall average test_error across folds of 0.2575.
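To make the role of the learning rate concrete, here is a toy gradient boosting loop with squared error and one-split stumps. Each round fits a stump to the current residuals and adds only a `learning_rate` fraction of it to the predictions. This is plain gradient boosting on made-up one-dimensional data, not XGBoost itself (which adds regularization and other refinements), and it tracks training error only.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, learning_rate, n_rounds):
    """Return the training MSE after each boosting round."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Shrinkage: take only a fraction of each new stump's correction.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2, 3.9, 4.1]

slow = boost(xs, ys, learning_rate=0.0217, n_rounds=50)
fast = boost(xs, ys, learning_rate=0.3, n_rounds=50)
```

A small rate fits the training data more slowly per round, so it typically needs more trees; whether that helps on held-out data is what the tuning above measures.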

Below are the top 10 most important features selected by my best performing XGBoost model.

The XGBoost model picked many of the same features my random forest model selected. However, the XGBoost model places heavy weight on room type (e.g. entire home or private room). Interestingly, the XGBoost model determined that on-site parking is a more important amenity.


Neural Network

Finally, I ran neural networks to see if I could improve upon my Random Forest and XGBoost models. After experimenting with different parameters, my best performing neural network had four hidden layers, using ReLU activation functions and L1 regularization. Below is a summary of my model.

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 128)                     4992        
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 256)                     33024       
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 256)                     65792       
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 512)                     131584      
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 1)                       513         
## ================================================================================
## Total params: 235,905
## Trainable params: 235,905
## Non-trainable params: 0
## ________________________________________________________________________________
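The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short check below reproduces each count. The input width of 38 is not stated in the text; it is inferred from the first layer's 4,992 parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units


# Output widths copied from the summary; the input width of 38 is inferred
# from the first layer's 4,992 parameters ((38 + 1) * 128) and is not stated
# in the text.
input_width = 38
layer_widths = [128, 256, 256, 512, 1]

counts = []
n_inputs = input_width
for n_units in layer_widths:
    counts.append(dense_params(n_inputs, n_units))
    n_inputs = n_units

total = sum(counts)
```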

However, despite testing and running several different iterations of my neural networks, this model architecture did not perform as well as my Random Forest and XGBoost models. My best neural network only resulted in a test error of 0.2858.


Conclusion

Overall, the random forest model performed the best with a test mean squared error of 0.2393. In this case, the simplest machine learning model performed the best. Given the regression prediction problem at hand, more advanced deep learning models are simply not necessary.

Future Work

Clearly there is still room for improvement in my models. For example, natural language processing (NLP) would be an interesting future development. Using NLP would allow me to conduct sentiment analysis or keyword optimization on reviews or listing descriptions to further improve my model performance.

As mentioned in the description of the data source, my methodologies can be expanded to other cities. A more expansive dataset covering all US cities could provide a more general pricing model.


Appendix

Exploratory Data Analysis

Essential Findings

Response Variable

The primary response variable I will be using will be price. Below is a distribution of Airbnb listing prices:

As seen from the plot above, the distribution of listing prices is heavily right-skewed. Most prices fall between $75 and $200, with the cheapest at $10 per night and the most expensive at $25,000 per night! The outliers will have to be removed because they can heavily influence some models (e.g. linear regression). Given the skewness of the data, I will also log transform price to bring its distribution closer to normal.
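To show what the log transform does to a range this wide, the sketch below applies a natural log to the price figures quoted above and back-transforms a value to dollars. The helper names are hypothetical, and the choice of natural log (rather than another base) is an assumption.

```python
import math


def log_price(price):
    """Natural log of a strictly positive nightly price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return math.log(price)


def unlog_price(log_value):
    """Back-transform a log-scale prediction to dollars."""
    return math.exp(log_value)


# The two extremes quoted in the text and the bounds of the typical range.
prices = [10, 75, 200, 25_000]
logged = [log_price(p) for p in prices]

raw_span = max(prices) / min(prices)   # 2500x on the dollar scale
log_span = max(logged) - min(logged)   # about 7.8 on the log scale
```

A model trained on the logged response predicts on the log scale, so its outputs need `unlog_price` before they are read as dollar amounts.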


Predictor Variables

Overall, there are 72 different predictor variables. These variables range from the number of people a listing can accommodate to fees to review ratings to amenities. With this said, the machine learning models we have learned in class should help with variable importance and selection.

Below, I have an initial investigation into some predictor variables that I believe to be important.


Geolocation Predictor Variables

To explore the effects of geolocation on listing prices, I made price heat maps for cities (Chicago and Seattle) using longitude and latitude pairs for each Airbnb listing observation.

Looking at the heat maps, we can see that there is an effect of neighborhood on listing price. The heat maps indicate that listings closer to the city center and the shoreline tend to have higher prices.


Histograms

Here, we can see histograms of some predictor variables I initially believed to be important. These variables are accommodates, cleaning_fee, security_deposit, and extra_people.

Similar to prices, I will have to normalize the values of these predictor variables by log transforming them. Again, there are very large outliers for each of these variables (I filtered them out for the sake of the plot), so I will have to remove these outliers when modeling.


Correlation Matrix

Looking at the corrplot of some sample predictors and our response variable below, we can see that there is some multicollinearity in our data, especially among the predictor variables accommodates, bedrooms, beds, and bathrooms. This makes intuitive sense because the number of people a property can accommodate is usually limited by the number of beds.

Moreover, we can see that the number of bedrooms, beds, and bathrooms show positive correlation with cleaning_fee. This intuitively makes sense, because as the number of these rooms increases, the more time and effort it takes to clean them after a booking!

Based on my past experience with Airbnb and intuition, it seems like the number of people the listing can accommodate and the property fees affect the price (i.e. the more people, the more expensive, and vice versa). However, looking at the corrplot, we can see that the variables I have selected show only very weak correlation with price.
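For reference, each cell of a correlation matrix like the one discussed here is a Pearson correlation between two columns. The sketch below computes it from scratch on made-up rows; the numbers are purely illustrative and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up rows for illustration only; these are not values from the dataset.
toy = {
    "accommodates": [2, 4, 6, 8],
    "beds": [1, 2, 3, 4],
    "cleaning_fee": [20, 45, 50, 90],
}
matrix = correlation_matrix(toy)
```

In the toy rows, accommodates is an exact multiple of beds, so their correlation is 1, the extreme case of the multicollinearity described above.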


Secondary Findings

Price by Property Type

Finally, I briefly looked to see if there were any differences in listing prices by property type (house or apartment). Below I have two boxplots: the left is a raw, unfiltered plot, while the right is filtered for prices under $400 per night.

Looking at the unfiltered plot, we can see that the spread of house prices is drastically wider compared to apartments. However, looking at the filtered plot on the right, we can see that apartments actually have the higher average price. Given that the number of observations for each property type is roughly the same, apartments may have higher prices because these properties are more likely to be located in a city center. With this said, the difference in mean price between property types is not statistically significant. Thus, property type probably won’t be that useful in predicting prices.


Codebook - Feature Identification

  • price - price of listing
  • accommodates - total number of people property can accommodate
  • security_deposit - amount required for security deposit
  • cleaning_fee - amount required for cleaning fee
  • extra_people - the price per additional guest
  • minimum_nights - minimum length of stay
  • maximum_nights - maximum length of stay
  • number_of_reviews - total number of reviews on listing
  • review_scores_rating_under80 - review score rating under 80
  • review_scores_rating_80_94 - review score rating between 80 and 94
  • review_scores_rating_95_100 - review score rating between 95 and 100
  • review_scores_rating_unknown - review score unknown
  • property_type_house - property type dummy (e.g. house or apartment)
  • room_type_private_room - room type dummy (e.g. whole house or private room)
  • host_is_superhost_true - host superhost identification
  • instant_bookable_true - whether or not the property can be instant booked
  • cancellation_policy_flexible - flexible cancellation policy
  • cancellation_policy_moderate - moderate cancellation policy
  • cancellation_policy_strict - strict cancellation policy
  • walkscore - international measure of walkability from listing to surrounding locations
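The four review_scores_rating columns above behave like a one-hot encoding of a binned score. A minimal sketch of that binning follows. The bucket boundaries are read off the column names, and representing a missing rating as `None` is an assumption made for the example.

```python
def review_score_dummies(score):
    """One-hot encode a 0-100 review score; None means no rating yet."""
    return {
        "review_scores_rating_under80": int(score is not None and score < 80),
        "review_scores_rating_80_94": int(score is not None and 80 <= score < 95),
        "review_scores_rating_95_100": int(score is not None and score >= 95),
        "review_scores_rating_unknown": int(score is None),
    }
```

Exactly one of the four columns is 1 for any listing, which lets listings without reviews stay in the data instead of being dropped.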

Amenities Dummy Variables: 1 if listing has, 0 if listing does not have

  • tv
  • air_conditioning
  • balcony_patio
  • bed_linen
  • outdoor_space
  • breakfast
  • coffee_machine
  • cooking_equip
  • white_goods
  • child_friendly
  • parking
  • greet_by_host
  • internet
  • long_term_stay
  • pets
  • private_entrance
  • security_system
  • self_checkin
  • gym
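One way dummy variables like these can be derived is by keyword matching against each listing's raw amenity strings. The sketch below shows the idea. The keyword groupings and sample strings are hypothetical, since the exact matching rules used for the dataset are not listed here.

```python
def amenity_dummies(listing_amenities, keyword_map):
    """Return {dummy_name: 0 or 1} for one listing.

    listing_amenities: iterable of raw amenity strings for the listing.
    keyword_map: dummy name -> list of lowercase substrings that count as a match.
    """
    lowered = [a.lower() for a in listing_amenities]
    return {
        name: int(any(k in a for a in lowered for k in keywords))
        for name, keywords in keyword_map.items()
    }


# Hypothetical keyword groupings; the report does not list the exact rules.
keyword_map = {
    "tv": ["tv"],
    "parking": ["parking"],
    "gym": ["gym"],
}
row = amenity_dummies(["Cable TV", "Free street parking", "Wifi"], keyword_map)
```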