“He was no one, Zero, Zero Now he’s a honcho, He’s a hero, hero!”
Isn’t this familiar? Absolutely it is. There’s no way you could forget Disney’s cool animation, Hercules.
Fig 1: Charming Hercules (Source)
“Zero to Hero” is a phrase that is often used to describe a transformation from a novice or beginner to an expert or accomplished individual. It can also refer to a person who starts with very little and, through hard work and determination, becomes successful. Throughout this journey, I want to take you through a step-by-step guide to building a regression Machine Learning pipeline.
This mini-project is about building a regression pipeline to predict the number of reviews per month as a proxy for the popularity of a listing, trained on a dataset of London listings collected in September 2022.
Airbnb is an online marketplace that connects people who need a place to stay with people who have a spare room or an entire home to share. The platform allows property owners to rent out their properties to travelers, who can book the properties through the Airbnb website or mobile app. The dataset used to build this pipeline was captured from Detailed Airbnb Listing Data (London, Sep 2022) on Kaggle. The original data was prepared by the Inside Airbnb project. The mission of Inside Airbnb is to empower residential communities with data and information that enables them to understand, make decisions, and have control over the effects of Airbnb’s presence in their neighborhoods.
The Inside Airbnb dataset typically includes property information (like the property’s location, price, number of bedrooms and bathrooms, and amenities) and review information (like the date of the review, the rating given by the guest, and the text of the review). It can be used for various purposes like market analysis, demand forecasting, price optimization, and much more.
After retrieving the dataset with the pull_data.py script by running Step 1 in the usage section, it is time to clean the data. In this step, two scripts, clean_data.py and preprocessing.py, handle data cleaning and data preprocessing; they are run in Steps 2 and 3 of the usage section. The major cleaning tasks (sketched in the snippet below) are:
- Removing irrelevant or ID-based columns like listing_url, neighborhood_overview, host_id, host_url, and host_name
- Dropping the null values
- Converting the last_review and host_since date columns into datetime objects
- Removing the $ and , characters from the price column and changing its type to float
- Removing the [] and " characters from the amenities column to prepare it for CountVectorizer
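A rough sketch of these steps with pandas is shown below; this is illustrative rather than the exact clean_data.py implementation, and the input file name is a placeholder:

```python
import pandas as pd

# Illustrative cleaning sketch; column names follow the Inside Airbnb schema,
# but the file path is a placeholder.
df = pd.read_csv("listings.csv")

# Remove irrelevant or ID-based columns
drop_cols = ["listing_url", "neighborhood_overview", "host_id", "host_url", "host_name"]
df = df.drop(columns=drop_cols, errors="ignore")

# Convert date columns to datetime objects
df["last_review"] = pd.to_datetime(df["last_review"])
df["host_since"] = pd.to_datetime(df["host_since"])

# Strip "$" and "," from price and cast it to float
df["price"] = df["price"].str.replace("[$,]", "", regex=True).astype(float)

# Remove brackets and quotes from amenities so CountVectorizer can tokenize it
df["amenities"] = df["amenities"].str.replace(r'[\[\]"]', "", regex=True)

# Drop the remaining rows with null values
df = df.dropna()
```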
The raw dataset includes 69351 entries and 52 columns. After the data cleaning and data wrangling step we end up with 51726 observations and 23 columns. Some of the most important columns are:
- host_since: the date the host joined Airbnb
- room_type: the type of the rental property, Entire home or Private room
- neighbourhood_cleansed: the name of the neighbourhood
- minimum_nights: the minimum number of nights required for a booking
- maximum_nights: the maximum number of nights allowed for a booking
- minimum_nights_avg_ntm: the average minimum nights booked
- maximum_nights_avg_ntm: the average maximum nights booked
- host_listing_count: the total number of active listings for the host
- number_of_reviews: the total number of reviews the property currently has
- last_review: the date of the last review the property received

reviews_per_month (the target) and last_review show some null values, all belonging to properties with zero reviews. Since a null in these features simply corresponds to a reviews_per_month of zero, the rows with null values in both of them are dropped.
reviews_per_month shows a skewed distribution, so transforming the target column might be necessary for this problem.
Fig 6: How the target values (reviews_per_month) are
distributed.
So normalizing the target values would be a good idea here. Normalizing the target values in machine learning can improve the performance of a model for a few reasons:
- It helps ensure that the optimization algorithms converge more quickly and effectively
- It helps prevent numerical stability issues and reduces the impact of outliers
- It prevents individual features from dominating the optimization process
After normalizing the target column values with NumPy’s np.log10 function, their histogram indicates a bell-shaped distribution.
Fig 7: Target values (reviews_per_month) distribution
after normalization with np.log10 function.
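A minimal sketch of this transformation, assuming the cleaned data lives in a DataFrame df:

```python
import numpy as np

# Log-transform the skewed target; rows with zero reviews were dropped earlier,
# so reviews_per_month is strictly positive at this point.
df["reviews_per_month"] = np.log10(df["reviews_per_month"])
```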
Feature engineering can be one of the most important stages of building a machine learning pipeline. It can help to improve the predictive power of a model by creating new features that are more informative or relevant to the problem. Feature engineering can also help to reduce the dimensionality of the data by combining or removing features that are redundant or not informative.
In the current dataset there are some columns from which we can extract relevant features, including last_review, host_since, and amenities. You can reproduce the feature-engineered dataset of this pipeline by running Step 5 in the usage section.
Based on the initial evaluation of the last_review column, we observe that this feature is a date column reporting the last review date, and a null value represents no review for a rental property. On the other hand, there is another useful date column, host_since, which reports the first day of the host on Airbnb. Experienced hosts who have been using Airbnb for a longer time probably have more reviews and higher ratings.
To take these features into account and make the date columns more interpretable for the model, we create a new feature called time_diff that captures the duration, in days, between last_review and host_since. We expect a higher time_diff to result in a higher reviews_per_month.
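A sketch of how this feature can be derived, assuming both columns have already been converted to datetime:

```python
# Number of days between the most recent review and the host's join date
df["time_diff"] = (df["last_review"] - df["host_since"]).dt.days
```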
The name of a rental property is influential when users are looking for places. According to the rental tips, a well-crafted Airbnb title can attract up to 5X more bookings, and we know that more bookings result in more reviews and a higher reviews_per_month.
To extract a potentially useful feature out of the name column, we’ll use SentimentIntensityAnalyzer from nltk.sentiment.vader to analyze the sentiment of the listing title. Based on the initial evaluation, about 50% of the titles receive a sentiment score of zero, so the focus here is simply on identifying whether a title carries positive or negative sentiment rather than on the raw score; positive sentiment is expected to result in a higher booking rate and a higher review rate. The output of this step is therefore three categories: Positive, Negative, and Neutral.
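A sketch of this step with NLTK’s VADER analyzer; the zero threshold used to split the categories is an illustrative choice:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def title_sentiment(title: str) -> str:
    """Label a listing title as Positive, Negative, or Neutral
    based on VADER's compound score."""
    score = sia.polarity_scores(str(title))["compound"]
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

df["sentiment"] = df["name"].apply(title_sentiment)
```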
Based on the nature of the problem (regression) and the nature of the target, the metrics below are chosen to assess the models (a scoring sketch follows the list):
- \(R^2\): the coefficient of determination, which measures the ratio between the variance explained by the model’s predictions and the total variance. I will use this score for model selection and hyperparameter optimization.
- NRMSE: the normalized root mean squared error, which will also be used for reporting purposes. RMSE is the square root of the mean squared error, which makes it more interpretable than the raw MSE, and normalizing it makes the score comparable across target scales.
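For reference, a small sketch of how these scores could be computed with scikit-learn; here NRMSE is taken as RMSE divided by the range of the observed targets, which is one common convention and not necessarily the one used in the project scripts:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def score_predictions(y_true, y_pred):
    """Return R^2 and range-normalized RMSE for a set of predictions."""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    nrmse = rmse / (np.max(y_true) - np.min(y_true))
    return {"r2": r2, "nrmse": nrmse}
```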
You can reproduce the linear model results by running Step 6 in the usage section.
As the baseline model, I picked LinearRegression, a linear model without any regularization. It is a simple model that we use as a starting point for comparison when developing more complex models; it serves as a benchmark or reference point against which the performance of other models can be measured. This model reports a training score of about 63.8% and a test score of about 58.2%, which doesn’t look very appealing, but it gives us a starting point for model selection, feature selection, and hyperparameter tuning.
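A simplified sketch of the baseline setup: the column groups below are illustrative rather than the full cleaned schema, and X_train, y_train, X_test, and y_test are assumed to come from an earlier train-test split:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column groups; the real pipeline uses all 23 cleaned columns
numeric_cols = ["minimum_nights", "number_of_reviews", "price", "time_diff"]
categorical_cols = ["room_type", "neighbourhood_cleansed", "sentiment"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("text", CountVectorizer(), "amenities"),  # text column passed as a string, not a list
    ]
)

baseline = make_pipeline(preprocessor, LinearRegression())
baseline.fit(X_train, y_train)
print(baseline.score(X_train, y_train), baseline.score(X_test, y_test))
```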
Table 1: Performance of baseline and Linear models
At the next level, using regularization on the linear regression model might help with improving the scores. Here the Ridge model (linear regression with L2 regularization) is used.
L2 regularization, also known as weight decay, is a technique used to prevent overfitting in machine learning models. It works by adding a penalty term to the cost function that the model is trying to optimize. The penalty term is the sum of the squares of the model’s weights (coefficients). The idea behind L2 regularization is that it will encourage the model to have smaller weights, which can help to prevent overfitting by reducing the complexity of the model.
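Concretely, the objective becomes \(\sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j w_j^2\), where \(\alpha\) controls the strength of the penalty. Swapping the estimator in the pipeline sketch above is a one-line change:

```python
from sklearn.linear_model import Ridge

# Same preprocessing as the baseline, but with an L2-penalized linear model;
# alpha = 100 is the value chosen for the Best_Ridge model discussed below.
ridge_model = make_pipeline(preprocessor, Ridge(alpha=100))
ridge_model.fit(X_train, y_train)
```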
Based on the initial analysis, it seems that linear regression and Ridge (the linear model with L2 regularization) report close results, but the difference between the train and test scores has decreased, which signals that the L2 regularization has already smoothed the model coefficients. But how has L2 regularization improved the model’s interpretability? By taking a deeper look at the coefficients of the features in the baseline model, we realize that they don’t make any sense!
Fig 8: Highest coefficients for the Linear Regression model and features
As indicated above, the top features for the baseline model are some random words generated by the OneHotEncoder of the neighbourhood and the CountVectorizer of the amenities. As expected, the coefficients of the features in the LinearRegression model without any regularization are sharp and on the scale of -2 to 2. But what about the most important features after regularization with the Ridge model?
Fig 9: Highest coefficients for the Ridge model and features
As expected, the magnitude of the coefficients has dropped dramatically, to between -0.3 and 0.2, which means the model responds more smoothly and the results are more interpretable. Interestingly, number_of_reviews and review_scores_ratings now appear among the top features.
The bar plots below present the range of the coefficients for each model, and we see a smaller range of coefficients when choosing a higher value of the hyperparameter \(\alpha\) (in this case \(\alpha = 100\)) for the Ridge model (the Best_Ridge model). Increasing this hyperparameter intensifies the penalty term in the Ridge cost function and shrinks the coefficients into a narrower range.
In this step we will try a couple of non-linear, or let’s say ensemble, models: the LGBM Regressor and the XGB Regressor. You can reproduce the ensemble model results by running Step 7 in the usage section.
LightGBM Regressor is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be efficient and scalable, and it is particularly well-suited for large datasets and high-dimensional datasets.
Gradient boosting is a method of ensemble learning that combines multiple weak learners (such as decision trees) to create a strong learner (a model that can make accurate predictions). LightGBM is an efficient implementation of gradient boosting that uses a tree-based learning algorithm. It builds the trees leaf-wise, which is different from the traditional level-wise approach. This allows LightGBM to focus on the more important features and build more accurate models.
XGBoost Regressor is an implementation of the gradient boosting algorithm for regression problems. It is a powerful and popular machine learning algorithm that is designed for both efficiency and performance. It is an optimized version of the gradient boosting algorithm and is known for its speed and accuracy.
Like other gradient boosting algorithms, XGBoost Regressor creates an ensemble of decision trees to make predictions. Each tree is trained to correct the mistakes of the previous tree, and the final ensemble is a combination of all the trees. This allows XGBoost to capture non-linear relationships in the data, and it can handle large datasets and high-dimensional datasets.
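A sketch of how these two models can be slotted into the same pipeline; the hyperparameters are left near their defaults here, and preprocessor, X_train, and y_train are reused from the baseline sketch:

```python
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

models = {
    "lgbm": make_pipeline(preprocessor, LGBMRegressor(random_state=123)),
    "xgb": make_pipeline(preprocessor, XGBRegressor(random_state=123)),
}

for name, model in models.items():
    # 5-fold cross-validated R^2 on the training set
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print(name, scores.mean())
```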
Table 2: Performance of baseline, Linear, and Ensemble models
Regarding their results on this problem, LGBM and XGB showed better performance on the cross-validation scores. The test score increased from the 60% range to the 80% range for both the LGBM Regressor and the XGB Regressor. On the other hand, both of these models tend towards overfitting, with training scores of 84% and 88% against a test score of 81%. These models generally perform better than the linear models. In terms of speed, the XGB Regressor takes much longer to run compared to the LGBM model.
When it comes to the interpretability of the results, we need to
consider how the model takes the features into account. The table below
presents the most important features (top 20) for both
LGBM Regressor and XGB Regressor.
Table 3: Top 20 features for XGB and LGBM Regressors
We also observe more interpretable results here for the LGBM model, leading to better generalization. Moving forward to hyperparameter optimization and model interpretability, the LGBM Regressor will be chosen as the final model.
Like any other essential stage of building a pipeline, this stage is about seeking the best possible combination of hyperparameters. By finding the best combination of hyperparameters, the model can perform better on the test data. Hyperparameter optimization can help to find the best trade-off between bias and variance and can help the model to generalize better. In this stage the ranges below are chosen for hyperparameter optimization (a search sketch follows the list):
- n_estimators: [10, 50, 100, 500]
- learning_rate: [0.0001, 0.001, 0.01, 0.1, 1.0]
- subsample: [0.5, 0.7, 1.0]
- max_depth: [3, 4, 5]
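One way to search this grid, assuming the LGBM pipeline defined earlier; the project’s own scripts may use a different search strategy, so treat this as illustrative:

```python
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "lgbmregressor__n_estimators": [10, 50, 100, 500],
    "lgbmregressor__learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "lgbmregressor__subsample": [0.5, 0.7, 1.0],
    "lgbmregressor__max_depth": [3, 4, 5],
}

search = RandomizedSearchCV(
    models["lgbm"],           # LGBM pipeline from the previous section
    param_distributions,
    n_iter=30,
    scoring="r2",
    cv=5,
    random_state=123,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```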
After running the hyperparameter optimization we reach the best combination, which is highlighted above. This combination leads to an increase in the test score but makes the model more prone to overfitting, since the difference between the train and test scores is higher.
In this section, I have used permutation feature importance and SHAP (SHapley Additive exPlanations) to explain the importance of the features.
Permutation feature importance is a method for determining the importance of individual features in a machine learning model. The method works by randomly shuffling the values of a single feature and measuring the impact on the model’s performance. The idea is that if a feature is important, then shuffling its values should result in a significant decrease in performance. This can be done for each feature in the dataset, allowing for a ranking of feature importance. You can reproduce the permutation feature importance results by running Step 8 in the usage section.
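A sketch using scikit-learn’s permutation_importance on the tuned LGBM pipeline; variable names follow the earlier illustrative snippets:

```python
from sklearn.inspection import permutation_importance

best_lgbm = search.best_estimator_
result = permutation_importance(
    best_lgbm, X_test, y_test, n_repeats=10, random_state=123, scoring="r2"
)

# Rank the original (pre-transformation) columns by mean importance
for idx in result.importances_mean.argsort()[::-1]:
    print(X_test.columns[idx], round(result.importances_mean[idx], 4))
```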
Fig 10: Feature importances with permutation method
What we observe in the permutation feature importance results is closely aligned with what we observed with the Ridge model: number_of_reviews, amenities, and time_diff are the most important features for these two ensemble models.
On the other hand, SHAP (SHapley Additive exPlanations) is a method for explaining the output of any machine learning model. It is based on the concept of Shapley values from cooperative game theory. SHAP values provide a way to assign importance to each feature (or input) in a model’s prediction for a specific instance. The method gives an explanation of the prediction in terms of the contribution of each feature, and it guarantees that the sum of the feature importance values for any given prediction is equal to the difference between the prediction and the expected value of the model’s predictions. Additionally, SHAP is model-agnostic, which means it can be used to explain the predictions of any kind of machine learning model. You can reproduce the SHAP analysis results by running Step 9 in the usage section.
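A sketch of the SHAP analysis on the tuned LGBM model; the pipeline slicing and variable names follow the earlier illustrative snippets:

```python
import shap
from scipy import sparse

fitted_pipeline = search.best_estimator_
lgbm = fitted_pipeline.named_steps["lgbmregressor"]

# Transform the test set with every step except the final estimator
X_test_enc = fitted_pipeline[:-1].transform(X_test)
if sparse.issparse(X_test_enc):
    X_test_enc = X_test_enc.toarray()
feature_names = fitted_pipeline[:-1].get_feature_names_out()

explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_test_enc)

# Beeswarm summary of how each feature pushes predictions up or down
shap.summary_plot(shap_values, X_test_enc, feature_names=feature_names)
```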
Fig 11: Feature importances with SHAP method
Based on the feature weights we observe in the SHAP method, features like price, number_of_reviews, and some amenities like heating affect the SHAP value more and are more important for the model. Contrary to our expectations, the ensemble models perform and explain the results better than the linear models, which reveals the non-linear nature of the dataset. The final trained model object has also been saved for deployment purposes.
The test score is about 81% for the LGBM model we decided to move forward with, and about 82% for the LGBM Regressor with hyperparameter optimization, which shows that this model generalizes fairly well.
Using the LGBM Regressor also seems more intuitive and easier to understand, since features like number_of_reviews and time_diff contribute the most to the predictions. In reality, having more reviews signals that the rental is active, with more people renting it and leaving their comments, which naturally affects the target, the number of reviews per month. A lower minimum_nights again represents more activity for the property and a higher number of reviews per month. Likewise, receiving more recent reviews signals the property’s recent activity and hints at more reviews per month.
Fig 12: Various models score on deployment data (test data)
| Stage | Important Result |
|---|---|
| EDA and Data Wrangling | In this mini-project, I tried a variety of linear and non-linear models on a regression problem. We were specifically interested in predicting reviews_per_month for Airbnb rentals in London. Predicting reviews_per_month and presenting it to the hosts plays a critical role in their effort to collect reviews and boost their listings. From the initial EDA, it can be inferred that number_of_reviews shows a good correlation with reviews_per_month, and it's quite interpretable too: the higher the number_of_reviews, the higher the property's activity, and the higher the reviews_per_month. The target values are skewed, and transforming them with the log function increased the model's accuracy by about 40%. |
| Baseline and Linear Models | Running linear regression without and with L2 regularization shows that the performance of the linear models is about 61%, while the regularized model offers interpretable results that make sense in the real world. L2 regularization offers the most promising results in comparison to the other linear models. |
| Non-Linear Models | Using non-linear models like the LGBM and XGB regressors opens another door to better performance, while sacrificing the simplicity and interpretability of the results for the XGB model. On average, these models increase the accuracy by about 20%. |
| Hyperparameter Optimization | Carrying out hyperparameter optimization results in very close train and test scores, which reassures us that we are not leaning toward optimization bias. |
Throughout this mini-project, there were several times when the interpretability-accuracy trade-off was obvious, and the main lesson learned from this project was seeing how different models respond to the real problem we are trying to solve. For this particular problem, using non-linear models would be a better choice due to the nature of the problem (building a recommendation engine to motivate the hosts to collect more reviews). To improve the results, using model combinations like stacking or voting systems would be helpful.
I hope you have enjoyed this journey and I would love to hear your feedback. Feel free to email me at hello@nabi.me