Introduction

Natural Language Processing (NLP) has many real-world applications in characterizing, modeling, and even predicting outcomes from text. Here I will attempt to model and predict the taste scores that sommeliers give to wines, based on their written descriptions. The data set used can be found here: https://www.kaggle.com/zynicide/wine-reviews.

The xgboost package (via its R interface) will be used for this process. Documentation can be found here: https://xgboost.readthedocs.io/en/latest/R-package/index.html.

In terms of setup, this is a fairly straightforward NLP modeling exercise, so I will be brief on the details of preparing the data (creating DTM objects, etc.).

Most of the focus will be on using xgboost for NLP prediction, especially in the case of a multinomial (as opposed to binomial) response.

The Data

As stated above, the data come from the Wine Reviews dataset. I randomly select 10,000 observations and split them into 80/20 train/test sets.
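
For reference, the sampling and splitting step boils down to something like the sketch below (the seed and the raw 'wine_data' name are just placeholders; 'dat_train' and 'dat_test' are the training/test objects used later on):

library(dplyr)

# Sketch only: 'wine_data' stands in for the full Kaggle reviews data frame
set.seed(1234)
wine_data_subset <- wine_data %>% sample_n(10000)

# 80/20 train/test split
train_idx <- sample(seq_len(nrow(wine_data_subset)), size = 0.8 * nrow(wine_data_subset))
dat_train <- wine_data_subset[train_idx, ]
dat_test  <- wine_data_subset[-train_idx, ]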

Additionally, I only need to focus on three variables for the task at hand: id, sentiment (the points score), and description.

So, ultimately my data looks like this:

wine_data_subset %>% 
  head() %>% 
  kable()
id sentiment description
127952 127951 84 Extracted in jammy blackberry-cherry fruit, with a heavy plaster of sweetly caramelized oak, this Cabernet is a bit heavy-handed. Ready now.
95951 95950 90 Scents of ripe black cherry and raspberry are at the center of the bouquet, with attractive accents of fudgy brownie, caramel, coffee and baking spice. The palate boasts a velvety texture, with medium tannins and a lush, ripe black-fruit core. Hints of cured meat and toasty mocha linger long on the close.
119588 119587 85 Pale in color, this light-bodied Pinot Noir is appropriately supple in texture and offers modest berry and chocolate notes. The finish is surprisingly long, but a bit lemony. Drink now.
76664 76663 85 Showing good varietal character, it’s dry and tart in acidity, with a delicate mouthfeel. The flavors are of cherry pie, cranberries, mushrooms, sandalwood and white pepper. Drink up.
36876 36875 87 This mature, yeasty sparkler from Mendoza smells of brioche and baked apple. Flavors of sourdough bread, cider and white mushroom finish bready and soft, so drink now.
38629 38628 84 This Pinot is simple and on the soft, heavy side, with jammy blackberry, cherry, cola and spice flavors. It’s basically dry, and would benefit from higher acidity.

The multinomial ‘sentiment’ response variable is distributed as follows:

wine_data_subset %>% 
  ggplot(aes(x = sentiment)) +
  geom_histogram(bins = 20, color = "white") +
  tom_plot_theme()

Creating DTM Train & Test objects: The training and test sets have to be converted to Document-Term Matrix (DTM) format. I do this using a couple of functions I wrote that are powered by the text2vec R package.
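
Those wrapper functions aren't shown here, but a minimal text2vec sketch of the same idea, assuming plain lowercasing and word tokenization, looks like this:

library(text2vec)

# Sketch: build the training DTM (preprocessing choices are illustrative)
it_train <- itoken(dat_train$description,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer,
                   ids = dat_train$id)
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train  <- create_dtm(it_train, vectorizer)

# The test set is vectorized against the *training* vocabulary
it_test  <- itoken(dat_test$description,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer,
                   ids = dat_test$id)
dtm_test <- create_dtm(it_test, vectorizer)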

Classification Using ‘xgboost’

The Model: The data used is the DTM Train object created above, and the label (response) is the 'sentiment' (i.e. 'points' score) variable.

bst <- xgboost(data = dtm_train_ret$"dtm_train", 
               label = dat_train$"sentiment", verbose = 0,
               max_depth = 4, eta = 1, nthread = 2, nrounds = 5000)

Boost Rounds:

An important takeaway here is the number of boosting rounds ('nrounds') I am using: 5000. With each round the train RMSE decreases, i.e. the model keeps improving on the training data, but that improvement is definitely not linear with respect to the iteration number:

bst$evaluation_log %>% 
  ggplot(aes(x = iter, y = train_rmse, group = 1)) +
  geom_line() +
  tom_plot_theme() +
  labs(title = "Train RMSE Decreases Iteratively",
       subtitle = "(to an extent)")

Feature Importance:

Also of interest is the ability to extract feature importance from 'xgboost' (analogous to the importance output that can be retrieved from the Random Forest algorithm). In NLP analysis our features are unique words, so by extracting importance from our model we can see which words are most "descriptive" for our sentiment response:

imp_ret <- xgb.importance(feature_names = colnames(dtm_train_ret$"dtm_train"), model = bst)
imp_ret %>% 
  head() %>% 
  kable()
Feature Gain Cover Frequency
rich 0.0444722 0.0011326 0.0032608
simple 0.0427779 0.0005274 0.0011811
black 0.0355853 0.0008250 0.0045188
years 0.0247634 0.0003409 0.0011554
complex 0.0197782 0.0002517 0.0005392
vineyard 0.0193186 0.0002630 0.0009500
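
The same importance output can also be plotted directly via the package's 'xgb.plot.importance' function (the 'top_n' value here is just an illustrative choice):

# Bar chart of the most important terms
xgb.plot.importance(imp_ret, top_n = 20)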

Predicting ‘sentiment’

Sentiment prediction was done using the boosted model from above with the 2000-observation test data set.

Along with the raw predictions, I also created two leniency variables, a +/-2 and a +/-5 version of the predicted value, and compared all three to the actual sentiment scores.
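
A sketch of that step, assuming the test descriptions were converted to a 'dtm_test' matrix with the training vocabulary (the object names and the rounding of predictions to whole points are simplifications):

# Sketch: predict on the test DTM and compute exact / +/-2 / +/-5 accuracy
preds <- predict(bst, dtm_test)

accuracy_tbl <- data.frame(
  `full accuracy` = mean(round(preds) == dat_test$sentiment),
  `2 off`         = mean(abs(preds - dat_test$sentiment) <= 2),
  `5 off`         = mean(abs(preds - dat_test$sentiment) <= 5),
  check.names = FALSE
)

accuracy_tbl %>% kable()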

Multinomial Response Prediction Accuracy:

As can be seen below, the model’s accuracy is greatly improved when allowing for a +/-2 point leniency, and even more so when allowing for a +/-5 point leniency.

In fact, nearly all of the actual sentiment scores are captured when allowing for the latter.

full accuracy   2 off    5 off
0.1855          0.7025   0.966