Considering N-Gram length on Multinomial Response Models with Amazon reviews

Intro

Natural Language Processing (NLP) and its subsequent modeling is rooted in the development of Document Term Matrices (DTMs), which are themselves constructed as a function of n-gram models.

In this post, I construct three lasso-regularized models, using n-gram max lengths of 1, 2, and 3. I then compare their respective accuracy of predicted vs actual sentiment.

Amazon Reviews

The data set I will be using is Amazon unlocked phone reviews. The data itself can be found here. The Amazon review structure follows the familiar 1-5 star rating, each matched to a free-response text review:

Rating	Reviews
5	I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn’t want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!
4	nice phone, nice up grade from my pantach revue. Very clean set up and easy set up. never had an android phone but they are fantastic to say the least. perfect size for surfing and social media. great phone samsung
5	Very pleased
4	It works good but it goes slow sometimes but its a very good phone I love it
4	Great phone to replace my lost phone. The only thing is the volume up button does not work, but I can still go into settings to adjust. Other than that, it does the job until I am eligible to upgrade my phone again.Thaanks!
1	I already had a phone with problems… I know it stated it was used, but dang, it did not state that it did not charge. I wish I would have read these comments then I would have not purchased this item…. and its cracked on the side.. damaged goods is what it is…. If trying to charge it another way does not work I am requesting for my money back… AND I WILL GET MY MONEY BACK…SIGNED AN UNHAPPY CUSTOMER….

Multinomial Response

I particulalrly enjoy working with Amazon data, especially due to the disjointed nature of the reviews. There is no defined structure that users have to adhere to when submitting their reviews, which results in a deliciously maddening set of text data to play with.

Additionally, this analysis was particularly interesting to perform due to the ratings being a multinomial response - a qualitative 1-5 star rating.

I filtered my data to contain only Samsung phone models, which left me with a data set of 65,747 observations. The following chart shows the sentiment (n-star rating) breakdown of all observations:

The Model

Using R’s glmnet package, I constructed multiple Lasso Elastic-Net GLMs, applying a multinomial method because my response vector was a qualitative 1-5 star rating. I used a 80/20 Train/Test split for model validation.

For prediction validation, a binary vector of TRUE or FALSE values was created, based on wether or not the actual rating completely matched the predicted rating (EX: predicted = 5, actual = 5, marked as TRUE).

To address “partial match” accuracy, I also created a separate binary vector of TRUE or FALSE values based on if the actual rating partially matched the predicted rating with a leniency of one (EX: predicted = 4, actual = 5, marked as TRUE).

The following tables provide examples of these two methods:

Complete Match ex:

actual	predicted	correct
2	2	TRUE
2	3	FALSE
1	3	FALSE
5	5	TRUE
4	4	TRUE

Partial Match ex:

actual	predicted	correct
2	2	TRUE
2	3	TRUE
1	3	FALSE
5	5	TRUE
4	4	TRUE

The vectors of TRUE and FALSE values for each of the two study designs (Complete & Partial) were carried forward for plotting and comparison.

The Complete Match Outcome

Considering Complete Match Accuracy, the overall Percent Error for each of the n-gram lengths used is as follows:

n-gram max lengh = 1: 34.6%
n-gram max lengh = 2: 20.9%
n-gram max lengh = 2: 21.7%

When we go from a max n-gram of 1 (ng_1) to a max n-gram of 2 (ng_2), we see a ~30% decrease in FALSE predictions (continous with a ~30% increase in TRUE predictions). However, when we go from a max n-gram of 2 (ng_2) to a max n-gram of 3 (ng_3), we see a slight increase in FALSE predictions.

The Partial Match Outcome

Considering Partial Match Accuracy, the overall Percent Error for each of the n-gram lengths used is as follows:

n-gram max lengh = 1: 17.7%
n-gram max lengh = 2: 10.6%
n-gram max lengh = 2: 10.7%

Obviously, the Percent Error for the Partial Match will be lower than that of the Complete Match because we allow for the “one-off leniency” of predicted to actual responses.

When we go from a max n-gram of 1 (ng_1) to a max n-gram of 2 (ng_2), we see the same ~30% decrease in FALSE predictions (continous with a ~30% increase in TRUE predictions). And as we saw in the Complete Match outcome, when we go from a max n-gram of 2 (ng_2) to a max n-gram of 3 (ng_3), we see a slight increase in FALSE predictions.

Take-Away Summary

Given Amazon user review text data, it is quite possible to predict those user’s 1-5 star rating on the products that they reviewed. And as it turns out, we can drop our percent error for those predictions to ~10% (if we are ok with a partial match lenience). In other words, it is quite possible to perform NLP multinomial-response modeling and robust prediction on text data.

What is interesting is that while increasing the n-gram length from 1 to 2 shows significan improvement in prediction outcome, the increase from 2 to 3 shows no improvement. At least for this specific data set.

Of course, n-gram length is not (by any means) the only paramater one would maximize when performing NLP modeling/prediction analysis. Aside from the canonical parameters inherent to any model, NLP involves its own caveats, such as the stop-word list used during vocabulary development as one example.

My intention here was to isolate the n-gram length parameter and asses its role in prediction accuracy. It is one of the many steps I take when perfoming NLP modeling/prediction. In a future post I plan on performing a similar analysis comparison using different stop-word lists.