Introduction

“He was a no one, a zero, zero. Now he’s a honcho, he’s a hero, hero!”

Isn’t this familiar? Absolutely it is. There’s no way you could forget Disney’s cool animation, Hercules.


Fig 1: Charming Hercules (Source)

“Zero to Hero” is a phrase often used to describe a transformation from a novice or beginner into an expert or accomplished individual. It can also refer to a person who starts with very little and, through hard work and determination, becomes successful. Throughout this journey, I want to take you through a step-by-step guide on building a regression machine learning pipeline.

This mini-project is about building a regression pipeline to predict the number of reviews per month as a proxy for the popularity of a listing, trained on a dataset collected in September 2022 in London.

Dataset

Airbnb is an online marketplace that connects people who need a place to stay with people who have a spare room or an entire home to share. The platform allows property owners to rent out their properties to travelers, who can book them through the Airbnb website or mobile app. The dataset used to build this pipeline was taken from Detailed Airbnb Listing Data (London, Sep 2022) on Kaggle. The original data was prepared by the Inside Airbnb project, whose mission is to empower residential communities with data and information that enables them to understand, make decisions about, and have control over the effects of Airbnb’s presence in their neighborhoods.

The Inside Airbnb dataset typically includes property information (like the property’s location, price, number of bedrooms and bathrooms, and amenities) and review information (like the date of the review, the rating given by the guest, and the text of the review). The dataset can be used for various purposes like market analysis, demand forecasting, price optimization, and much more.

After retrieving the dataset with the pull_data.py script by running Step 1 in the usage section, it is time to clean the data.

Data Cleaning

In this step, two scripts, clean_data.py and preprocessing.py, handle data cleaning and data preprocessing when you run Steps 2 and 3 in the usage section. The major cleaning tasks are:

- Removing irrelevant or ID-based columns like listing_url, neighborhood_overview, host_id, host_url, host_name, etc.
- Dropping the null values
- Transforming the last_review and host_since date columns into datetime objects
- Removing the $ and , from the price column and changing its type to float
- Removing the [] and " characters from the amenities column to prepare it for CountVectorizer
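
The snippet below is a minimal sketch of these cleaning steps with pandas; the input file name and the exact list of dropped columns are assumptions, since the full logic lives in clean_data.py.

```python
import pandas as pd

# Minimal sketch of the cleaning steps (the real logic lives in clean_data.py);
# the file name and drop list below are illustrative.
df = pd.read_csv("listings.csv.gz")

# Drop irrelevant or ID-based columns
df = df.drop(columns=["listing_url", "neighborhood_overview", "host_id",
                      "host_url", "host_name"], errors="ignore")

# Parse the date columns into datetime objects
df["last_review"] = pd.to_datetime(df["last_review"])
df["host_since"] = pd.to_datetime(df["host_since"])

# Clean the price column: "$1,250.00" -> 1250.0
df["price"] = (df["price"]
               .str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
               .astype(float))

# Strip the JSON-style brackets and quotes from amenities for CountVectorizer
df["amenities"] = df["amenities"].str.replace(r'[\[\]"]', "", regex=True)

# Drop rows with missing values
df = df.dropna()
```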

Data Dictionary

The raw dataset includes 69351 entries and 52 columns. After the data cleaning and data wrangling steps, we end up with 51726 observations and 23 columns. Some of the most important columns are:

- host_since: The date when the host joined Airbnb
- room_type: The type of the rental property, Entire home or Private room
- neighbourhood_cleansed: The name of the neighbourhood
- minimum_nights: The minimum required nights for booking
- maximum_nights: The maximum allowed nights for booking
- minimum_nights_avg_ntm: Average minimum nights booked
- maximum_nights_avg_ntm: Average maximum nights booked
- host_listing_count: The total number of active listings for the host
- number_of_reviews: Total number of current reviews of the property
- last_review: The date of the last review the property received

reviews_per_month (the target) and last_review show some null values, related to properties with zero reviews. The null values of both of these features will be dropped, since we know that properties with no reviews have null values in both of them.

Target Column

Based on an initial EDA of the dataset, the target column reviews_per_month shows a skewed distribution, so a target transformation might be necessary for this problem.

Fig 6: How the target values (reviews_per_month) are distributed.

So normalizing the target values would be a good idea here. Normalizing the target values in machine learning can improve the performance of a model for a few reasons like:

- Helping to ensure that the optimization algorithms converge more quickly and effectively
- Helping to prevent numerical stability issues, as well as reduce the impact of outliers
- Preventing the features of the model from dominating the optimization process

After normalizing the target column values with NumPy’s np.log10 function, their histogram indicates a bell-shaped distribution.

Fig 7: Target values (reviews_per_month) distribution after normalization with np.log10 function.
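
As a minimal sketch of this transformation (assuming the zero-review rows were already dropped during cleaning, so reviews_per_month is strictly positive):

```python
import numpy as np

# Log-transform the target; predictions on the log scale can be mapped back
# with the inverse transform: y_pred = 10 ** y_pred_log
y = df["reviews_per_month"]
y_log = np.log10(y)
```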

Feature engineering

Feature engineering can be one of the most important stages of building a machine learning pipeline. It can help to improve the predictive power of a model by creating new features that are more informative or relevant to the problem. Feature engineering can also help to reduce the dimensionality of the data by combining or removing features that are redundant or not informative.

In the current dataset, there are some columns from which we can extract relevant features, including last_review, host_since, and amenities. You can reproduce the feature-engineered dataset of this pipeline by running Step 5 in the usage section.

Date Columns: last_review and host_since

Based on the initial evaluation of the last_review column, this feature is a date column reporting the date of the last review. A null value represents no review for a rental property. On the other hand, there is another useful date column, host_since, which reports the date the host joined Airbnb. Experienced hosts who have been using Airbnb for a longer time probably have more reviews and higher ratings.

To take these features into account and make the date columns more interpretable for the model, we create a new feature called time_diff that captures the duration between last_review and host_since in days, as sketched below. We believe a higher time_diff results in a higher reviews_per_month.
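
A minimal sketch of this feature, assuming the two columns were already parsed to datetime in the cleaning step:

```python
# Duration (in days) between the last review and the host's join date
df["time_diff"] = (df["last_review"] - df["host_since"]).dt.days
```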

Name Column

The name of a rental property is influential when users are looking for places. According to rental tips, a well-crafted Airbnb title can attract up to 5X more bookings, and we know that more bookings result in more reviews and a higher reviews_per_month.

To extract a probably useful feature out of the name column, we’ll use SentimentIntensityAnalyzer from nltk.sentiment.vader to analyze the sentiment of the listing name. Based on the initial evaluation, about 50% of the listings have a zero sentiment score, so the focus here is simply to identify whether a listing name carries positive or negative sentiment. Positive sentiment results in a higher booking rate and a higher review rate. The output of this step is therefore one of three categories: Positive, Negative, and Neutral.
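
Here is a minimal sketch of that labeling; the thresholds on VADER's compound score are an assumption, and the actual script may use different cut-offs.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def name_sentiment(name: str) -> str:
    """Label a listing name as Positive, Negative, or Neutral (assumed thresholds)."""
    compound = sia.polarity_scores(str(name))["compound"]
    if compound > 0:
        return "Positive"
    if compound < 0:
        return "Negative"
    return "Neutral"

df["name_sentiment"] = df["name"].apply(name_sentiment)
```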

Baseline and Linear Models

Based on the nature of the problem (regression) and the nature of the target, the metrics chosen to assess the models are:

- \(R^2\): The coefficient of determination, which measures the proportion of the target variance explained by the model’s predictions. I will use this score for model selection and hyperparameter optimization.
- NRMSE: The normalized root mean squared error, used for reporting purposes. The RMSE is the square root of the mean squared error, which makes it more interpretable than the MSE.
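
A minimal sketch of computing these two metrics; normalizing the RMSE by the target range is one common convention and is an assumption here, as the pipeline might normalize by the mean instead.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def nrmse(y_true, y_pred):
    """RMSE normalized by the target range (assumed convention)."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / (np.max(y_true) - np.min(y_true))

# r2_score(y_true, y_pred) is the score optimized during model selection
```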

You can reproduce the linear model results by running Step 6 in the usage section.

As the baseline model, I picked LinearRegression, which is a linear model without any regularization. It is a simple model used as a starting point for comparison when developing more complex models, serving as a benchmark or reference point against which the performance of other models can be measured. This model reports a training score of about 63.8% and a test score of about 58.2%, which doesn’t seem very appealing for a starting point from which to begin model selection, feature selection, and hyperparameter tuning.
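
A minimal sketch of how such a baseline could be wired up; the exact column groups and preprocessing live in the project’s scripts, so the feature lists below are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column groups (the real lists live in the project scripts)
numeric_feats = ["minimum_nights", "number_of_reviews", "price", "time_diff"]
categorical_feats = ["room_type", "neighbourhood_cleansed", "name_sentiment"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_feats),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_feats),
    ("txt", CountVectorizer(), "amenities"),  # a single text column is passed by name
])

X = df.drop(columns=["reviews_per_month"])
y = np.log10(df["reviews_per_month"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

baseline = make_pipeline(preprocessor, LinearRegression())
scores = cross_validate(baseline, X_train, y_train,
                        scoring="r2", return_train_score=True)
```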

Table 1: Performance of baseline and Linear models

At the next level, using regularization on the linear regression model might help improve the scores. Here, the Ridge model (linear regression with L2 regularization) is used.

L2 regularization, also known as weight decay, is a technique used to prevent overfitting in machine learning models. It works by adding a penalty term to the cost function that the model is trying to optimize. The penalty term is the sum of the squares of the model’s weights (coefficients). The idea behind L2 regularization is that it will encourage the model to have smaller weights, which can help to prevent overfitting by reducing the complexity of the model.
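
As a quick sketch of the idea, in scikit-learn’s \(\alpha\) notation the objective Ridge minimizes is the ordinary least-squares loss plus the squared L2 norm of the coefficients:

\[
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\]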

Based on the initial analysis, both linear regression and Ridge (the linear model with L2 regularization) report close results, but the difference between the train and test scores has decreased, which signals that the L2 regularization has already smoothed the model coefficients. But how did L2 regularization improve the model’s interpretability? By taking a deeper look at the coefficients of the features in the baseline model, we realize that they don’t make any sense!

Fig 8: Highest coefficients for the Linear Regression model and features

As indicated above, the top features for the baseline model are some random words generated by the OneHotEncoder on the neighbourhood and the CountVectorizer on the amenities. As expected, the coefficients of the features in the LinearRegression model without any regularization are sharp, on a scale of -2 to 2. But what about the most important features after regularization with the Ridge model?

Fig 9: Highest coefficients for the Ridge model and features

As we expected, the magnitude of the coefficients has dropped dramatically, to a range of -0.3 to 0.2, which means the model responds more smoothly and the results are more interpretable. Interestingly, we now see number_of_reviews and review_scores_ratings among the top features.

The bar plots below present the range of the coefficients for each model, and we see a lower range of coefficients when choosing a higher value of the hyperparameter \(\alpha\) (in this case \(\alpha = 100\)) for the Ridge model (the Best_Ridge model). Increasing this hyperparameter intensifies the penalty term in the Ridge cost function and results in a lower range of coefficients.

Non-Linear Models

In this step we will try a couple of non-linear, or let’s say ensemble, models: the LGBM Regressor and the XGB Regressor. You can reproduce the ensemble model results by running Step 7 in the usage section.

LightGBM Regressor

LightGBM Regressor is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be efficient and scalable, and it is particularly well-suited for large datasets and high-dimensional datasets.

Gradient boosting is a method of ensemble learning that combines multiple weak learners (such as decision trees) to create a strong learner (a model that can make accurate predictions). LightGBM is an efficient implementation of gradient boosting that uses a tree-based learning algorithm. It builds the trees leaf-wise, which is different from the traditional level-wise approach. This allows LightGBM to focus on the more important features and build more accurate models.
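
A minimal sketch of dropping LightGBM into the same pipeline, reusing the preprocessor and training split from the baseline sketch above (default hyperparameters; tuning comes later):

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Reuses preprocessor, X_train, and y_train from the baseline sketch above
lgbm_pipe = make_pipeline(preprocessor, LGBMRegressor(random_state=123))
lgbm_scores = cross_validate(lgbm_pipe, X_train, y_train,
                             scoring="r2", return_train_score=True)
```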

XGB Regressor

XGBoost Regressor is an implementation of the gradient boosting algorithm for regression problems. It is a powerful and popular machine learning algorithm that is designed for both efficiency and performance. It is an optimized version of the gradient boosting algorithm and is known for its speed and accuracy.

Like other gradient boosting algorithms, XGBoost Regressor creates an ensemble of decision trees to make predictions. Each tree is trained to correct the mistakes of the previous tree, and the final ensemble is a combination of all the trees. This allows XGBoost to capture non-linear relationships in the data, and it can handle large datasets and high-dimensional datasets.
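
And the XGBoost counterpart, again a sketch under the same assumptions as the baseline setup:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Reuses preprocessor, X_train, and y_train from the baseline sketch above
xgb_pipe = make_pipeline(preprocessor, XGBRegressor(random_state=123))
xgb_scores = cross_validate(xgb_pipe, X_train, y_train,
                            scoring="r2", return_train_score=True)
```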

Table 2: Performance of baseline, Linear, and Ensemble models

Regarding their results on this problem, LGBM and XGB showed better performance on the cross-validation scores. The test score increased from the 60% range to the 80% range for both the LGBM Regressor and the XGB Regressor. On the other hand, both of these models tend towards overfitting, with training scores of 84% and 88% and a test score of 81%. These models generally perform better than linear models. In terms of speed, the XGB Regressor takes much longer to run compared to the LGBM model.

When it comes to the interpretability of the results, we need to consider how the model takes the features into account. The table below presents the most important features (top 20) for both LGBM Regressor and XGB Regressor.

Table 3: Top 20 features for XGB and LGBM Regressors

We also observe more interpretable results here for the LGBM model, leading to better generalization. Moving forward to hyperparameter optimization and model interpretation, the LGBM Regressor will be chosen as the final model.

Hyperparameter optimization

Like any other essential stage of building a pipeline, we need to go through this stage to seek the best possible combination of hyperparameters. By finding the best combination of hyperparameters, the model can perform better on the test data. Hyperparameter optimization can help to find the best trade-off between bias and variance and can help the model to generalize better. In this stage, the ranges below are chosen for hyperparameter optimization (a search sketch follows the list):

- n_estimators: [10, 50, 100, 500]
- learning_rate: [0.0001, 0.001, 0.01, 0.1, 1.0]
- subsample: [0.5, 0.7, 1.0]
- max_depth: [3, 4, 5]
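
A minimal sketch of that search over the LGBM pipeline from above; whether the project uses a grid or a randomized search, and the CV fold count, are assumptions.

```python
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "lgbmregressor__n_estimators": [10, 50, 100, 500],
    "lgbmregressor__learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "lgbmregressor__subsample": [0.5, 0.7, 1.0],
    "lgbmregressor__max_depth": [3, 4, 5],
}

search = RandomizedSearchCV(
    lgbm_pipe, param_grid, n_iter=30, cv=5,
    scoring="r2", random_state=123, return_train_score=True,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```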

After running the hyperparameter optimization, we reach the best combination, which is highlighted above. This combination leads to an increase in the test score, but makes the model more prone to overfitting, since the difference between the train and test scores is higher.

Interpretation and Feature Importance

In this section, I have used the permutation feature importance and SHAP (SHapley Additive exPlanations) to explain the importance of the features.

Permutation Feature Importance

Permutation feature importance is a method for determining the importance of individual features in a machine learning model. The method works by randomly shuffling the values of a single feature and measuring the impact on the model’s performance. The idea is that if a feature is important, then shuffling its values should result in a significant decrease in performance. This can be done for each feature in the dataset, allowing for a ranking of feature importance. You can reproduce the permutation feature importance results by running Step 8 in the usage section.
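
A minimal sketch using the tuned pipeline from the search above; the number of shuffles (n_repeats) is an assumption.

```python
import pandas as pd
from sklearn.inspection import permutation_importance

best_model = search.best_estimator_
result = permutation_importance(
    best_model, X_test, y_test, n_repeats=10, random_state=123, scoring="r2"
)

# Rank raw input columns by the mean drop in R^2 when each one is shuffled
importances = pd.Series(result.importances_mean,
                        index=X_test.columns).sort_values(ascending=False)
```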

Fig 10: Feature importances with permutation method

What we observe in the permutation feature importance results is closely aligned with what we observed with the Ridge model. number_of_reviews, amenities, and time_diff are the most important features for these two ensemble models.

SHAP

On the other hand, SHAP (SHapley Additive exPlanations) is a method for explaining the output of any machine learning model. It is based on the concept of Shapley values from cooperative game theory. SHAP values provide a way to assign importance to each feature (or input) in a model’s prediction for a specific instance. The method gives an explanation of the prediction in terms of the contribution of each feature, and it guarantees that the sum of the feature importance values for any given prediction is equal to the difference between the prediction and the expected value of the model’s predictions. Additionally, SHAP is model-agnostic, which means it can be used to explain the predictions of any kind of machine learning model. You can reproduce the SHAP analysis results by running Step 9 in the usage section.
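
A minimal sketch of the SHAP analysis on the tuned pipeline; the named_steps keys below follow the make_pipeline naming from the earlier sketches, so they are assumptions about the real pipeline.

```python
import scipy.sparse as sp
import shap

# Split the fitted pipeline into its preprocessing and model steps
fitted_preprocessor = best_model.named_steps["columntransformer"]
fitted_lgbm = best_model.named_steps["lgbmregressor"]

X_test_enc = fitted_preprocessor.transform(X_test)
if sp.issparse(X_test_enc):
    X_test_enc = X_test_enc.toarray()

explainer = shap.TreeExplainer(fitted_lgbm)
shap_values = explainer.shap_values(X_test_enc)

shap.summary_plot(shap_values, X_test_enc,
                  feature_names=fitted_preprocessor.get_feature_names_out())
```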

Fig 11: Feature importances with SHAP method

Based on the feature weights we observe in the SHAP method, features like price, number_of_reviews, and some amenities like heating affect the SHAP value more, and they are more important for the model. Contrary to our expectations, the ensemble models both perform better and explain the results better than the linear models, which reveals the non-linear nature of the dataset. The final trained model object has also been saved for deployment purposes, as sketched below.
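
A minimal sketch of persisting and reloading the tuned pipeline with joblib; the file name is illustrative.

```python
import joblib

# Save the tuned pipeline for deployment
joblib.dump(best_model, "final_lgbm_pipeline.joblib")

# Later, at deployment time:
# model = joblib.load("final_lgbm_pipeline.joblib")
# preds = 10 ** model.predict(new_listings)  # undo the log10 target transform
```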

Deployment Performance

The test score is about 81% for the LGBM model we decided to move forward with, and about 82% for the LGBM Regressor with hyperparameter optimization, which shows that this model generalizes fairly well.

Using the LGBM Regressor model also seems more intuitive and easier to understand, since we saw that features like number_of_reviews and time_diff contribute the most to the predictions. In reality, having more reviews signals that the rental is active, with more people renting it and leaving their comments, which naturally affects the target, the number of reviews per month. A lower minimum_nights again represents more activity for the property and a higher number of reviews per month. Likewise, receiving more recent reviews signals the property’s recent activity and hints at more reviews per month.

Fig 12: Various models score on deployment data (test data)

The summary below lists the most important result from each stage:

- EDA and Data Wrangling: In this mini-project, I tried a variety of different linear and non-linear models on a regression problem. We were specifically interested in predicting reviews_per_month for Airbnb rentals in London. Predicting reviews_per_month and presenting it to the hosts plays a critical role in their effort to collect reviews and boost their listings. From the initial EDA, it can be inferred that number_of_reviews shows a good correlation with reviews_per_month, and it’s quite interpretable too: the higher the number_of_reviews, the higher the property’s activity, and the higher the reviews_per_month. The target values are skewed, and transforming them with the log function increased the model’s accuracy by about 40%.
- Baseline and Linear Models: Running linear regression without and with L2 regularization shows that the performance of the linear models is about 61%, while the regularized model offers interpretable results that make sense in the real world. L2 regularization offers the most promising results in comparison to the other linear models.
- Non-Linear Models: Using non-linear models like the LGBM and XGB regressors opens another door to better performance, while sacrificing the simplicity and interpretability of the results for the XGB model. On average, these models increase performance by about 20%.
- Hyperparameter Optimization: Carrying out hyperparameter optimization results in very close train and test scores, which reassures us that we are not leaning toward optimization bias.

Throughout this mini-project, there were several times when the interpretability-accuracy trade-off was obvious, and the main lesson learned from this project was seeing how different models respond to the real problem we are trying to solve. For this particular problem, using non-linear models would be a better choice due to the nature of the problem (building a recommendation engine to motivate the hosts to collect more reviews). To improve the results, using model combinations like stacking or voting systems would be helpful.

I hope you have enjoyed this journey and I would love to hear your feedback. Feel free to email me at hello@nabi.me