Introduction

Natural Language Processing (NLP) analysis, modeling, and prediction can be approached through many well-documented pipelines. Each pipeline takes a different approach to the NLP task at hand, whether at the processing stage (n-gram length, vectorization, stop-word list, etc.) or in the choice of model, to name just a few of the decisions involved.

The use of linear-regression-based modeling is well documented in the realm of NLP analysis. However, there is not much in the way of decision-tree-based models, and in particular boosted-tree algorithms. This prompted me to compare two NLP modeling approaches, using ‘xgboost’ (boosted trees) and ‘glmnet’ (penalized regression).

From the ‘xgboost’ documentation (found here: https://xgboost.readthedocs.io/en/latest/index.html):

“XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.”

From the ‘glmnet’ documentation (found here: https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf):

“Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix x. It fits linear, logistic and multinomial, poisson, and Cox regression models. A variety of predictions can be made from the fitted models. It can also fit multi-response linear regression.”

The Goal

Ultimately, the question we want to ask here is: can we predict the department name that an article of clothing is assigned to, based only on the text review of that article of clothing submitted by a customer?

In doing so, I explore the performance of a decision-tree based model versus that of a generalized linear model. In particular, I base the comparison on each model’s predictive power, focusing on the predicted vs actual output.

The Data

The data I am using comes from this link: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews.

After formatting and extracting only the variables I need, the data looks like this:

id dept review
0 4 Absolutely wonderful - silky and sexy and comfortable
1 3 Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8“. i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.
2 3 I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
3 2 I love, love, love this jumpsuit. it’s fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
4 6 This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!
5 3 I love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of the package but its a lot of dress. the skirt is long and very full so it overwhelmed my small frame. not a stranger to alterations, shortening and narrowing the skirt would take away from the embellishment of the garment. i love the color and the idea of the style but it just did not work on me. i returned this dress.
6 6 I aded this in my basket at hte last mintue to see what it would look like in person. (store pick up). i went with teh darkler color only because i am so pale :-) hte color is really gorgeous, and turns out it mathced everythiing i was trying on with it prefectly. it is a little baggy on me and hte xs is hte msallet size (bummer, no petite). i decided to jkeep it though, because as i said, it matvehd everything. my ejans, pants, and the 3 skirts i waas trying on (of which i ]kept all ) oops.
7 6 I ordered this in carbon for store pick up, and had a ton of stuff (as always) to try on and used this top to pair (skirts and pants). everything went with it. the color is really nice charcoal with shimmer, and went well with pencil skirts, flare pants, etc. my only compaint is it is a bit big, sleeves are long and it doesn’t go in petite. also a bit loose for me, but no xxs… so i kept it and wil ldecide later since the light color is already sold out in hte smallest size…
8 3 I love this dress. i usually get an xs but it runs a little snug in bust so i ordered up a size. very flattering and feminine with the usual retailer flair for style.
9 3 I’m 5“5’ and 125 lbs. i ordered the s petite to make sure the length wasn’t too long. i typically wear an xs regular in retailer dresses. if you’re less busty (34b cup or smaller), a s petite will fit you perfectly (snug, but not tight). i love that i could dress it up for a party, or down for work. i love that the tulle is longer then the fabric underneath.

To expand on the table above: in order to model the multinomial response variable (Department Name; ‘dept’), I converted the ‘dept’ variable from type factor to type numeric.
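For reference, this conversion is a one-liner in R. A minimal sketch, assuming the cleaned data lives in a data frame named `reviews` (a hypothetical name):

```r
# Map the factor levels of the department name to numeric codes;
# `reviews` is a hypothetical name for the cleaned data frame
reviews$dept <- as.numeric(as.factor(reviews$dept))
```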

Taking a look at the distribution of the response variable further clarifies this conversion:

The data was then split 90/10 into training and test sets, with 20,000 training observations and 2,641 test observations.
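A minimal sketch of such a split, reusing the hypothetical `reviews` data frame from above (the seed is arbitrary):

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Sample 20,000 row indices for training; the remainder is the test set
train_idx <- sample(nrow(reviews), 20000)
train <- reviews[train_idx, ]
test  <- reviews[-train_idx, ]
```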

The train and test sets were then converted into Document-Term Matrix (DTM) objects using the same vocabulary vectorizer for both, so that the two matrices share a single column space. (The full code used to do this is not included here for the sake of brevity, but I am more than happy to provide it to anyone who is interested in seeing it.)
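That said, here is a minimal sketch of the step, assuming the text2vec package (the phrase “vocabulary vectorizer” matches its API; the object names are mine):

```r
library(text2vec)

# Tokenize the training and test reviews
it_train <- itoken(train$review, preprocessor = tolower,
                   tokenizer = word_tokenizer, ids = train$id)
it_test  <- itoken(test$review, preprocessor = tolower,
                   tokenizer = word_tokenizer, ids = test$id)

# Build the vocabulary on the training set only, then reuse the same
# vectorizer for both DTMs so they share one column space
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)

dtm_train <- create_dtm(it_train, vectorizer)
dtm_test  <- create_dtm(it_test, vectorizer)
```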

The ‘xgboost’ Model

The xgboost model was trained with the parameters nthread = 4, nrounds = 1000, max_depth = 30, and eta = 0.1, which were tuned to this dataset. The RMSE of the model decreases over each of the 1,000 rounds as follows:
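A sketch of that fit, using the parameters quoted above. Since RMSE is the metric being tracked, a regression objective on the numeric ‘dept’ codes is assumed here; predictions would then be rounded back to department codes:

```r
library(xgboost)

# Boosted-tree fit on the training DTM; the regression objective is an
# assumption inferred from the RMSE curve described in the post
bst <- xgboost(data = dtm_train, label = train$dept,
               nthread = 4, nrounds = 1000,
               max_depth = 30, eta = 0.1,
               objective = "reg:squarederror",
               verbose = 0)
```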

Extracting the feature importance from the model highlights the following top ten features:
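In xgboost this comes straight from `xgb.importance()` (continuing the sketch above):

```r
# Rank features by their contribution to the fitted model
# and keep the top ten
imp <- xgb.importance(model = bst)
head(imp, 10)
```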

The ‘glmnet’ Model

The ‘glmnet’ model was fit with five-fold cross-validation and an L1 (lasso) penalty. The maximum AUC of the model can be seen plotted here:

## [1] "max AUC = 2.7412"
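For reference, a minimal sketch of the fit described above: alpha = 1 gives the L1 (lasso) penalty, and nfolds = 5 the five-fold cross-validation. The multinomial family and misclassification loss are my assumptions; the AUC reported above would be computed separately:

```r
library(glmnet)

# Five-fold cross-validated lasso fit; family and loss are assumptions
cv_fit <- cv.glmnet(x = dtm_train, y = factor(train$dept),
                    family = "multinomial", alpha = 1,
                    nfolds = 5, type.measure = "class")
plot(cv_fit)
```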

Comparing Prediction Outcomes

The 2,641-observation test set was scored with both of the models described above, and the actuals were compared to the predicted outcomes.

First, looking at overall accuracy (the predicted response matches the actual one), we can see that the glmnet model does a better job, correctly predicting 80.9% of the observations versus 64.8% for the xgboost model:
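A sketch of how those overall accuracies can be computed, building on the hypothetical model objects from the sketches above:

```r
# Predicted department codes for the test set
glmnet_pred  <- as.numeric(predict(cv_fit, newx = dtm_test,
                                   type = "class", s = "lambda.min"))
xgboost_pred <- round(predict(bst, dtm_test))

# Share of test observations predicted correctly
mean(glmnet_pred == test$dept)   # 80.9% reported above
mean(xgboost_pred == test$dept)  # 64.8% reported above
```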

Breaking down the actual vs predicted on a per-model basis, the accuracy can be explored even further:

I call the above visuals “point coverage plots”; they are extremely useful for seeing how well the predicted values (orange) overlap the actual values (grey) for each of the multinomial responses.

As can be seen, the glmnet model has visibly more predicted values covering the actuals than the xgboost model does. Accordingly, fewer of its predicted values fall outside the actual regions than is the case for xgboost.

This by-department accuracy becomes even clearer when comparing the number of correctly predicted test observations between the glmnet and xgboost models:
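The per-department counts behind that comparison can be tallied with a simple table (again using the hypothetical prediction vectors from above):

```r
# Correctly predicted test observations, broken down by department
table(test$dept[glmnet_pred == test$dept])
table(test$dept[xgboost_pred == test$dept])
```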

In fact, glmnet’s accuracy advantage over xgboost appears to scale with the number of observations available per department: as the observation count increases, so does glmnet’s accuracy relative to xgboost’s.

Discussion

A possible explanation for the prediction-accuracy differences observed here lies in the algorithms used by each model; more specifically, in how the variance of the predicted values compares to the variance of the actuals for each model.

Because glmnet is at its core a linear regression model, variance comparison is built into the way it is fit. No such mechanism is applied in a decision-tree algorithm.

Accordingly, when we look at the variance of the actuals compared to the predicted variance of glmnet and xgboost, we can see that glmnet’s predicted variance is nearly identical to the actual variance:
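Continuing with the hypothetical prediction vectors from the sketches above, that comparison amounts to:

```r
var(test$dept)      # variance of the actuals
var(glmnet_pred)    # nearly identical to the actual variance
var(xgboost_pred)   # departs further from the actual variance
```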

Summary

We have seen here that not only can we predict the department name that an article of clothing is assigned to, based only on the text review submitted by a customer, but we can also compare the prediction accuracy of two models that utilize different model-building algorithms.

Going further, I would like to look at optimizing the glmnet parameters on a case-by-case basis for NLP datasets, to see whether I can improve its prediction accuracy.

Thanks for reading!