Introduction

The field of Natural Language Processing (NLP) has evolved considerably in recent years, and its applications in medicine, science, and business have grown along with it.

In this post I use Edmunds Car Reviews data to predict a customer’s rating of an automobile (on a continuous 1-5 scale), using the text-based review that customer provides.

In this Optimization Part 1, I first optimize two aspects of the modeling procedure: the ‘max_depth’ and ‘eta’ parameters of the ‘xgboost’ algorithm, and the stop word list used in the vectorization step of the NLP portion of the analysis.

In later posts I will continue the NLP optimization.

The Data

The data I am using comes from this source: https://www.kaggle.com/ankkur13/edmundsconsumer-car-ratings-and-reviews. After handling missing values and filtering for ‘Honda’ reviews only, my data consists of three variables: an observation id (‘id’), the text-based review (‘review’), and the numeric rating (‘rating’).
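Below is a minimal sketch of that preparation step. The file name and raw column names here are assumptions rather than the exact Kaggle layout:

```r
library(dplyr)

# Hypothetical file and column names -- adjust to the actual Kaggle CSV layout.
reviews_raw <- read.csv("edmunds_reviews.csv", stringsAsFactors = FALSE)

honda <- reviews_raw %>%
  filter(grepl("honda", Vehicle_Title, ignore.case = TRUE)) %>%  # keep only Honda reviews
  filter(!is.na(Review), !is.na(Rating)) %>%                     # drop missing values
  transmute(id     = row_number() - 1,                           # 0-based id, as in the preview below
            review = Review,
            rating = as.numeric(Rating))
```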

So ultimately, the data looks like this:

id review rating
0 4 years with our element. Honestly there isn’t much to complain about it, but I just never grew to love it. The transmission annoys me, it’s constantly hunting thru the gear, yes, I know it’s getting me the best mileage it can. The shifts aren’t hard or harsh. In fact they aren’t unpleasant at all, I just find it annoying listening to the engine cycle up and down as the trans constantly shifts at every rise or dip in the road. The highway at 70 is truly busy for the transmission. Road noise is a bit on the loud side as well. Crosswinds are fairly disruptive, pushing the car around on the road. The rear doors are a love/hate item for me as well. There is a different routine to using them in parking lots and requires some forethought. Getting in/out with kids can be an exercise in frustration. I’ve been trapped between the doors many times, doing some sort of dance/juggling to get into position to be able to get out from between the doors so they can be closed. The doors open about 90 degrees which makes getting in/out and loading easier but brings the doors in range of neighboring cars in parking lots. When a 5 yr old leans on the door to get out it is simple for the door to swing open enough to bump the car beside you. Aside from the low tire pressure monitor coming on and staying on about a year after we got the car it has been almost faultless. Fuel mileage is not very good. we get about 22-23 most of the time, a mile or 2 less in the winter and it will get 24+ on the highway at speed. the best mileage I’ve seen was 26. On the plus side, it’s roomy, for its size. The seats fold flat or swing up clearing most of the floor. They can be removed but I’d rather take a punch from mike Tyson. Our 5 year old loves to “decorate” the car by drawing on the matte black plastic panels with sidewalk chalk. It’s a nice conversation piece. The people at my wife’s work love it. It is an excellent choice for someone with small kids and/or dogs. We fold the back flat on one seat (rear) and our great Dane lays there next to the kid in his car seat. 3-6-18. still have this thing. we will probably be trading it in the next few months when we decide on a replacement. It has held up pretty well. The seats have stains and the passenger side dash has a weird discoloration on the accent panel. The tach needle fell off about 6 months ago. there is a rust spot on the driverside rear door. all in al we got our moneys worth out of it. going on 160,000 miles. 4
1 I have owned my Element since 2007 when I purchased it NEW. It has been a great little SUV. Have almost 88,000 miles on it. It will plow through snow with the best of other suvs. I have done a lot of winter driving here in Interior Alaska, the heaters have kept me warm. The stopping ability on icy roads is wonderful. If they still made them I would buy a NEW one, but they don’t make them anymore. I have the all wheel drive version. Am still very happy with my Element. Am thinking about replacing it with a new rig this year, as I would like to have some of the new safety features on a vehicle I drive daily. I think Honda made a real mistake not continuing this utilitarian vehicle. It has been a great little car. The newer Honda’s do not have the ground clearance I would like to have, we have had 80 inches of snow since December and getting around in a car with less clearance would create problems. 5
2 Lowest maintenance of any vehicle I’ve ever owned, but it’s time to move on. I don’t need the large cargo space anymore to transport my wares. Fun to drive, easy to take care of, I love the element. 1
3 The only thing I’d change is the drink holders. They are hard to get to with the arm down. I love the cargo room with the seats folded up to the sides. I have brought a 50 gallon water heater home with my car and a new dishwasher. I wish they hadn’t discontinued the Element I would buy a new one. My husband simply doesn’t fit in other cars and I got this to ease his getting in and out. It provides him with the necessary head room. Hey Honda bring it back just like it is. SINCE I DID THE FIRST REVIEW I HAD TO GET RID OF IT. THE TRANS AXELS AND OTHER MAJOR EXPENSES HIT THIS CAR. I’D LOVE TO HAVE ANOTHER BUT I’M BURIED IN A CHEVROLET SILVERADO NOW. 5
4 I bought this vehicle new in 2007. got 97K miles now. It has served us reliably well as you hope from a honda with 2 exceptions. right after the warranty expired we had to shell out 500 bucks to replace a rear strut that was gone. and the original battery left us stranded multiple times when it was only 3y old. . i read on blogs that many people had that problem. since it got replaced, out of my pocket, that issue is gone. the door design is weird if you have people actually using the rear seats. especially in tight parking spots. you get trapped in the space between 2 doors. it gets worse if you have to get little children out of their child seats in the rear. 4

The continuous (1-5) ‘rating’ response variable in the Train data is distributed as follows:

Optimizing ‘xgboost’ Parameters

From the ‘xgboost’ documentation (found here: https://xgboost.readthedocs.io/en/latest/index.html):

“XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.”

Optimization of the xgboost model will be performed on its max_depth and eta parameters:

max_depth: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit. RANGE [0, Inf]

eta: Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. RANGE [0, 1]

Parameter Combinations:

A vector of values for each of the ‘max_depth’ and ‘eta’ parameters was defined, and all possible combinations of those values were considered during the optimization phase. The result was a table of 100 ‘max_depth’/‘eta’ combinations, as sketched below.
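The parameter vectors below are reconstructed directly from the results table that follows:

```r
# All max_depth / eta combinations considered (values taken from the results table below)
param_grid <- expand.grid(
  max_depth = c(1, 2, 4, 6, 8, 10, 15, 20, 25, 30),
  eta       = c(0.10, 0.15, 0.20, 0.25, 0.40, 0.50, 0.70, 0.80, 0.90, 1.00)
)
nrow(param_grid)  # 100 combinations
```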

Optimization Outcome:

An xgboost model was created for each of the combinations above. Each model was fit on a 10,000-observation Training Set, with the response variable (‘rating’) modeled against each consumer’s text-based ‘review’.

Each model was then applied to a 2,384-observation Test Set, and a prediction was calculated. Each model’s training RMSE, as well as its MAPE (computed on the Actual vs Predicted Test Set values), was extracted.
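A sketch of that grid search loop is below. It assumes the document-term matrices (‘dtm_train’, ‘dtm_test’) and rating vectors (‘y_train’, ‘y_test’) already exist from the vectorization step described later in this post:

```r
library(xgboost)

dtrain <- xgb.DMatrix(dtm_train, label = y_train)

results <- param_grid
results$RMSE <- NA_real_
results$MAPE <- NA_real_

for (i in seq_len(nrow(param_grid))) {
  bst <- xgb.train(
    params  = list(objective = "reg:squarederror",   # "reg:linear" on older xgboost versions
                   max_depth = param_grid$max_depth[i],
                   eta       = param_grid$eta[i]),
    data    = dtrain,
    nrounds = 100,
    verbose = 0
  )
  pred_train <- predict(bst, dtrain)
  pred_test  <- predict(bst, xgb.DMatrix(dtm_test))

  results$RMSE[i] <- sqrt(mean((y_train - pred_train)^2))      # training RMSE
  results$MAPE[i] <- mean(abs((y_test - pred_test) / y_test))  # test-set MAPE
}
```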

max_depth eta RMSE MAPE
1 0.10 0.788965 0.1935159
2 0.10 0.724256 0.1977676
4 0.10 0.636540 0.1955271
6 0.10 0.568803 0.1754697
8 0.10 0.498404 0.1744142
10 0.10 0.448443 0.1736319
15 0.10 0.340740 0.1775590
20 0.10 0.271850 0.1580275
25 0.10 0.226059 0.1807434
30 0.10 0.188744 0.1998198
1 0.15 0.760539 0.1981793
2 0.15 0.704093 0.1745988
4 0.15 0.610820 0.1748142
6 0.15 0.530867 0.1758540
8 0.15 0.470371 0.1767123
10 0.15 0.416429 0.1648756
15 0.15 0.306847 0.1843746
20 0.15 0.250357 0.1704905
25 0.15 0.201385 0.1819193
30 0.15 0.166862 0.1780852
1 0.20 0.743713 0.1828115
2 0.20 0.684608 0.1756396
4 0.20 0.591605 0.1647981
6 0.20 0.512704 0.1681213
8 0.20 0.444844 0.1743867
10 0.20 0.393756 0.1705610
15 0.20 0.288361 0.1740279
20 0.20 0.222606 0.1869992
25 0.20 0.187255 0.1710371
30 0.20 0.155240 0.1800256
1 0.25 0.724745 0.1929787
2 0.25 0.668022 0.1820722
4 0.25 0.571302 0.1779503
6 0.25 0.494402 0.1715981
8 0.25 0.425747 0.1726132
10 0.25 0.374366 0.1714479
15 0.25 0.269719 0.1604910
20 0.25 0.209664 0.1813129
25 0.25 0.170666 0.1703674
30 0.25 0.148999 0.1685008
1 0.40 0.701573 0.1920759
2 0.40 0.645785 0.1652770
4 0.40 0.531996 0.1701907
6 0.40 0.457079 0.1650576
8 0.40 0.380443 0.1752418
10 0.40 0.322239 0.1697779
15 0.40 0.234546 0.1705282
20 0.40 0.177101 0.1857418
25 0.40 0.141118 0.1735121
30 0.40 0.110015 0.1792835
1 0.50 0.696033 0.1777989
2 0.50 0.623631 0.1830551
4 0.50 0.509343 0.1882420
6 0.50 0.430061 0.1668777
8 0.50 0.350285 0.1801284
10 0.50 0.305835 0.1844584
15 0.50 0.211520 0.1784627
20 0.50 0.157297 0.1795269
25 0.50 0.131534 0.1767330
30 0.50 0.094105 0.1757878
1 0.70 0.690716 0.1681835
2 0.70 0.611939 0.1705008
4 0.70 0.493520 0.1783518
6 0.70 0.393230 0.1831915
8 0.70 0.330542 0.1864077
10 0.70 0.265065 0.2070414
15 0.70 0.185369 0.1802523
20 0.70 0.137032 0.1885839
25 0.70 0.104002 0.1797460
30 0.70 0.075753 0.1906450
1 0.80 0.682869 0.1803541
2 0.80 0.605625 0.1808100
4 0.80 0.478841 0.1767008
6 0.80 0.379843 0.1909951
8 0.80 0.315922 0.1870360
10 0.80 0.264258 0.1955364
15 0.80 0.173081 0.1959033
20 0.80 0.121600 0.1958819
25 0.80 0.092731 0.2166762
30 0.80 0.071302 0.2035385
1 0.90 0.687703 0.1817008
2 0.90 0.603145 0.1736850
4 0.90 0.477186 0.1785787
6 0.90 0.374151 0.1837062
8 0.90 0.308375 0.1935396
10 0.90 0.245963 0.1931310
15 0.90 0.156018 0.1921596
20 0.90 0.113611 0.2039000
25 0.90 0.090713 0.2127998
30 0.90 0.069224 0.2139673
1 1.00 0.682988 0.1838755
2 1.00 0.603243 0.1816620
4 1.00 0.473221 0.1814307
6 1.00 0.358273 0.1925958
8 1.00 0.287042 0.1921891
10 1.00 0.234851 0.2114338
15 1.00 0.148368 0.2160942
20 1.00 0.110505 0.2023217
25 1.00 0.074359 0.2155461
30 1.00 0.059046 0.2299742

In analyzing the outcome, we can clearly see that there is little correlation between a model’s training RMSE and its ability to predict test data (MAPE): the RMSE keeps falling as the trees get deeper, while the test MAPE does not. Consider the plots below, where I compare the average RMSE to the average MAPE, aggregated separately across all ‘max_depth’ values and across all ‘eta’ values:
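Those averages can be computed directly from the results table, for example:

```r
# Average RMSE and MAPE across all runs, grouped by max_depth and by eta
avg_by_depth <- aggregate(cbind(RMSE, MAPE) ~ max_depth, data = results, FUN = mean)
avg_by_eta   <- aggregate(cbind(RMSE, MAPE) ~ eta,       data = results, FUN = mean)
```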

Choosing for MAPE:

Because I am ultimately interested in my model’s ability to predict test data, I chose to optimize for parameters that gave me the lowest MAPE.

Overall, this was accomplished at ‘max_depth = 20’ and ‘eta = 0.1’.
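Picking that row out of the results table directly:

```r
# Parameter combination with the lowest test-set MAPE
results[which.min(results$MAPE), ]
#   max_depth  eta     RMSE      MAPE
#          20  0.1  0.271850  0.1580275
```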

Optimizing the Stop Word List

A crucial step of NLP analysis is creating the Document-Term Matrix (DTM) from the text variable in the data. This happens during vectorization, which begins by ignoring certain ‘stop words’: words so common or generic that they would only add noise or suggest false connections to the sentiment response variable. Common stop words include ‘the’, ‘and’, etc.
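The post does not hinge on a particular vectorization library; as an illustration, here is one way to build the DTM using the ‘text2vec’ package, assuming the 571-word English list is the SMART list shipped with the ‘stopwords’ package and that the Honda data has already been split into Train/Test:

```r
library(text2vec)
library(stopwords)

sw_english <- stopwords::stopwords(source = "smart")   # 571-word English stop word list

# Tokenize the review text and build the vocabulary, dropping stop words
it_train <- itoken(honda_train$review,
                   preprocessor = tolower,
                   tokenizer    = word_tokenizer)

vocab      <- create_vocabulary(it_train, stopwords = sw_english)
vectorizer <- vocab_vectorizer(vocab)
dtm_train  <- create_dtm(it_train, vectorizer)
```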

Here, I compare three stop word lists (built as sketched just after this list):

1) English Stop Words: A list of length 571, which includes typical non-descriptive words (‘the’, ‘and’, etc).

2) English Stop Words + ‘Honda’: The same list as above, but with the addition of the word ‘honda’.

3) Google Stop Words: A list of 333,333 typical stop words extracted via the Google API.
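The three lists might be assembled like this (the Google list file name is hypothetical; the point is only to illustrate the structure):

```r
sw_english       <- stopwords::stopwords(source = "smart")   # 571 words
sw_english_honda <- c(sw_english, "honda")                    # same list plus 'honda'
sw_google        <- readLines("google_stopwords.txt")        # hypothetical file holding the Google list
```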

I created a model with each of the three stop word lists and looked at each model’s MAPE as well as the Top 10 Features extracted from the models. This is a crucial step in NLP analysis as it shows which words the model chose to be the most important in describing the sentiment variable (in NLP, each word is transformed into a model feature).
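A sketch of that comparison, reusing the tokenized iterators (‘it_train’, ‘it_test’), the rating vectors, and the best parameters found above (all assumed to exist from the earlier steps):

```r
library(text2vec)
library(xgboost)

stop_lists <- list(english       = sw_english,
                   english_honda = sw_english_honda,
                   google        = sw_google)

comparison <- do.call(rbind, lapply(names(stop_lists), function(nm) {
  # Rebuild the DTMs with the given stop word list
  vocab      <- create_vocabulary(it_train, stopwords = stop_lists[[nm]])
  vectorizer <- vocab_vectorizer(vocab)
  dtm_tr     <- create_dtm(it_train, vectorizer)
  dtm_te     <- create_dtm(it_test,  vectorizer)

  # Refit with the best parameters found above (max_depth = 20, eta = 0.1)
  bst  <- xgboost(data = dtm_tr, label = y_train,
                  max_depth = 20, eta = 0.1, nrounds = 100,
                  objective = "reg:squarederror", verbose = 0)
  pred <- predict(bst, dtm_te)

  data.frame(stop_list = nm,
             MAPE      = mean(abs((y_test - pred) / y_test)))
}))
comparison
```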

Along with the visual representation of the Top 10 Features for each model, I also included the MAPE for comparison:

English Stop Words:

English Stop Words + ‘Honda’:

Google Stop Words:

When considering these plots, a few interesting results can be extracted.

First off, comparing the MAPE of the three models, we can clearly see that the model using the Google stop word list has a higher MAPE than the other two. Additionally (not shown here), its RMSE is ~0.65 compared with ~0.25 for the other two models. A likely explanation is that using 333,333 stop words truncates the usable text too much for accurate modeling and prediction.

Of greater interest, consider the Top 10 Features of the first two models (without and with the word ‘Honda’ in the stop list). When ‘Honda’ is allowed to be used as a feature, the model considers it the most important one.

Since this entire data set contains only Honda reviews, the word carries no real signal: it is likely present in reviews of highly varying sentiment. This is where we have to step in and manually tell the model to ignore it.

The Optimized Model & Prediction Accuracy

The Best Model:

As can be seen above, the best (lowest MAPE) model came from using max_depth = 20 and eta = 0.1, together with a stop word list that includes the word ‘Honda’. Therefore, the model used from here on out is the one built with those settings.

Boost Rounds:

An important aspect of a boosted decision-tree algorithm is the number of boosting rounds (‘nrounds’); I used 100 rounds for this particular exercise.

With each round, the train RMSE decreases, i.e., the fit to the training data improves with every round. However, that improvement is definitely not linear with respect to the iteration number:
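The per-round train RMSE can be tracked via the evaluation log; a sketch, assuming ‘dtrain’ from the grid search section:

```r
library(xgboost)

bst <- xgb.train(
  params    = list(objective = "reg:squarederror", max_depth = 20, eta = 0.1),
  data      = dtrain,
  nrounds   = 100,
  watchlist = list(train = dtrain),  # record train RMSE at every round
  verbose   = 0
)
head(bst$evaluation_log)  # columns: iter, train_rmse
```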

Feature Importance:

We saw above how valuable the ability to extract feature importance from ‘xgboost’ is. In NLP analysis our features are unique words; therefore, by extracting importance from our model, we are actually able to see which words are most “descriptive” of our sentiment response. (Of note, this is analogous to the feature importance output that can be retrieved from the Random Forest algorithm, which is itself a ‘bagging’ tree model.)

It is easy to see why being able to extract the most important words from an NLP model is so valuable. After all, these are the words that most influence the model’s outcome.
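In R, this is a call to xgb.importance on the fitted booster; a sketch, assuming ‘bst’ is the final model trained above:

```r
library(xgboost)

importance <- xgb.importance(model = bst)    # data.table: Feature, Gain, Cover, Frequency
head(importance, 10)                         # the Top 10 words by importance
xgb.plot.importance(importance, top_n = 10)  # bar chart of the Top 10 by Gain
```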

Predicting the ‘rating’ sentiment

Rating Prediction Accuracy:

To evaluate accuracy, I looked at “prediction off actual”; in other words, the literal difference of actual minus predicted for the response variable ‘rating’:

off = actual - predicted.

As an example, if a customer gave a car a rating of 4.50 and my model predicted a rating of 4.75, it would have an “off” value of -0.25.

NOTE: I am using the term ‘off’ rather than the more common ‘residual’ because I personally reserve ‘residual’ for fitted vs actual y-values during the modeling phase.
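A sketch of the computation, assuming the final model ‘bst’ and the Test Set objects from earlier:

```r
pred_test <- predict(bst, xgb.DMatrix(dtm_test))
off <- y_test - pred_test   # e.g. actual 4.50, predicted 4.75 -> off = -0.25

hist(off, breaks = 50, main = "Distribution of 'off' (actual - predicted)")
```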

A look at the distribution of the ‘off’ values is quite promising, taking the form of a roughly normal distribution centered around 0:

Finally, a look at the Actual vs Predicted per Observation plot shows just how well this model’s predicted values are able to follow the trend of actual data points.

Summary:

At the end of the day, we would like to know whether an algorithm can predict if a consumer is satisfied with a product based solely on a text-based review. To this end, NLP analysis must be done with an eye toward optimizing the parameters that go into each step of the process. In later posts I will continue the optimization of other NLP parameters, but for now there are a few key takeaways to consider.

Extracting Feature Importance, I think, is the real power of using tree-based methods in NLP. Having an easily-interpretable outcome from a modeling stage is uniquely powerful when considering the words that the model deems most important for its analysis.

In terms of choosing the correct stop word list, the possibilities are essentially endless. When performing NLP analysis, you must first ask yourself what the objective is. In doing so, you are able to fine-tune the list of words you tell your model to ignore.

Like I said, I will continue optimizing parameters in later posts.

As always, thanks for reading!