The Natural Language Processing (NLP) universe is expanding at a fantastic rate, both in application and in adoption across fields including science and business. The concepts behind analyzing and interpreting text documents, however, are nothing new; their origins span decades, and some of those classic NLP techniques are still widely used.
One such classic NLP technique is term frequency–inverse document frequency (TF-IDF). Envisioned in part by Hans Peter Luhn in 1957 and expanded upon by Karen Spärck Jones in 1972, TF-IDF attempts to quantify a word’s importance to a document. In basic terms, it weights the number of times a word appears in a document, offset by the number of documents in the corpus that contain the word. A Wikipedia article on TF-IDF can be found here.
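In its most common form (several variants exist), the score for a term t in a document d, out of N total documents, is

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times t appears in d and df(t) is the number of documents containing t. As a quick worked example: a word appearing 5 times in one review, within a corpus of 1,000 reviews of which only 10 contain it, scores 5 × log(1000/10) ≈ 23 (natural log), while a filler word like “the”, appearing in every review, scores 5 × log(1000/1000) = 0.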
Boosted trees are a machine learning technique that builds an ensemble of decision trees sequentially, with each new tree trained to correct the errors of the trees before it. One popular implementation is XGBoost.
From the ‘xgboost’ documentation:
“XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.”
Among the many advantages of using boosted trees for modeling is the ability to extract Feature Importance from the fitted model. In NLP analysis our features are unique words, so by extracting importance from our model we can see which words are most descriptive of our sentiment response.
For example, a document describing ice-cream flavors might contain descriptive (and therefore important) words such as “vanilla” or “chocolate”. Viewed this way, the parallel between TF-IDF and feature importance is easy to see.
In this article I will compare TF-IDF against boosted-tree Feature Importance for extracting sentiment-driven text relevance, using movie reviews from Rotten Tomatoes. In other words, given each user-written review paired with a “freshness” sentiment, I will extract the sentiment-driven words reviewers used when deeming a movie “fresh” or “not fresh”.
The Rotten Tomatoes reviews data comes from here. After cleaning and formatting for missing values (a sketch of this step follows the table below), my data consists of these variables:
| review | fresh_sent_bin |
|---|---|
| A distinctly gallows take on contemporary financial mores, as one absurdly rich man’s limo ride across town for a haircut functions as a state-of-the-nation discourse. | 1 |
| Cronenberg is not a director to be daunted by a scenario in which the antihero spends most of his time in a stretch limo. Turning it into a film that interests anyone … is another matter | 0 |
| Robert Pattinson works mighty hard to make Cosmopolis more than just an erudite slap at modern capitalism. The Twilight heartthrob ultimately fails to rescue a meandering story hitting stale versions of the same talking points. | 0 |
| For those who like their Cronenberg thick and chewy | 1 |
| For better or worse - often both - Cosmopolis is a quintessential David Cronenberg film. Cosmopolis is simultaneously fascinating and impenetrable, profound and absurd, labyrinthine yet intimate. | 1 |
There were a total of 28,749 observations (reviews), of which 11,131 fell into the Fresh category and 17,618 into the Not-Fresh category.
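For illustration, here is a minimal cleaning sketch in R. The raw column names (`review`, `freshness`) are assumptions for the sketch, not the dataset’s actual schema:

```r
library(dplyr)

# Hypothetical raw columns: `review` (free text) and `freshness` ("fresh"/"rotten")
reviews_df <- raw_reviews %>%
  filter(!is.na(review), review != "", !is.na(freshness)) %>%  # drop missing values
  mutate(fresh_sent_bin = as.integer(freshness == "fresh"),    # binary sentiment label
         review_id = row_number())                             # id used downstream
```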
I followed the canonical steps to create two TF-IDF outputs: one for “fresh” reviews, and another for “not fresh” reviews. This division is important, as we are interested in categorizing important words relevant to each sentiment rating. (A good tutorial on tf-idf using the ‘tidytext’ package in R can be found here.)
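For reference, a minimal sketch of those canonical steps with ‘tidytext’, treating each review as a document and then keeping the top-scoring words within each sentiment batch. This is one plausible arrangement (reusing the assumed `reviews_df` from the cleaning sketch above), not necessarily the exact recipe used here:

```r
library(dplyr)
library(tidytext)

top_tfidf <- reviews_df %>%
  unnest_tokens(word, review) %>%             # one row per word occurrence
  anti_join(stop_words, by = "word") %>%      # drop common stop words
  count(fresh_sent_bin, review_id, word) %>%  # word counts per review
  bind_tf_idf(word, review_id, n) %>%         # tf-idf with reviews as documents
  group_by(fresh_sent_bin) %>%
  slice_max(tf_idf, n = 10)                   # top 10 words per sentiment batch
```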
Filtering for the top 10 words in each case (by tf-idf) gave the following results:
Fresh Reviews:
Not-Fresh Reviews:
It makes sense that the word “enjoy” is found in the Fresh batch. However, the Fresh batch also contains words like “minimal”, “gross”, and “farce”. It may well be that certain reviews used those words to describe the plot of a movie rather than the reviewer’s sentiment toward it, but since we are more interested in the latter, these results do not seem intuitive.
The same non-intuitive results appear in the words belonging to the Not-Fresh batch.
I have previously written an article on my method of modeling with the xgboost library, so rather than reiterate the process I will simply link it here.
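The gist, as a hedged sketch rather than the exact pipeline from that article: cast the word counts into a sparse document-term matrix and fit a binary classifier (again assuming the `reviews_df` columns introduced above; the hyperparameters are placeholders):

```r
library(dplyr)
library(tidytext)
library(xgboost)

# Sparse document-term matrix: one row per review, one column per unique word
dtm <- reviews_df %>%
  unnest_tokens(word, review) %>%
  count(review_id, word) %>%
  cast_sparse(review_id, word, n)

# Align the sentiment labels to the matrix rows
labels <- reviews_df$fresh_sent_bin[match(rownames(dtm), reviews_df$review_id)]

# Placeholder hyperparameters, not the values tuned for the article
bst <- xgboost(data = dtm, label = labels, nrounds = 200,
               objective = "binary:logistic", verbose = 0)

# Returns Feature, Gain, Cover, Frequency: the columns in the table below
head(xgb.importance(model = bst), 10)
```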
When I modeled the freshness sentiment score (using reviews of both sentiments, ‘fresh’ and ‘not fresh’) and extracted the Feature Importance, these were my top 10 most important words:
| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| bad | 0.009278351 | 0.0012181384 | 0.0022633249 |
| dull | 0.008090868 | 0.0013968608 | 0.0018180806 |
| film | 0.006948207 | 0.0008002499 | 0.0074763928 |
| fun | 0.006074220 | 0.0015231279 | 0.0032280207 |
| mess | 0.005972885 | 0.0013889072 | 0.0012058698 |
| movie | 0.005901447 | 0.0009456396 | 0.0054356900 |
| performance | 0.005579203 | 0.0009349112 | 0.0041556129 |
| performances | 0.005572259 | 0.0008940787 | 0.0040443018 |
| fails | 0.005048974 | 0.0014564582 | 0.0007606256 |
| worst | 0.004647403 | 0.0012701182 | 0.0007606256 |
As we can see, the words deemed most important by xgboost (ranked here by Gain, each feature’s relative contribution to the model from the splits it is used in) are indeed intuitive to the sentiment of the review.
In fact, we can use these features to predict sentiment with fairly high accuracy from the review text alone.
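As a sketch of that check, assuming a held-out split of the same document-term matrix (the split itself, producing the hypothetical `dtm_test` and `labels_test`, is omitted):

```r
# Hypothetical hold-out evaluation on a test split of the matrix built above
pred_prob  <- predict(bst, dtm_test)            # predicted probability of "fresh"
pred_class <- as.integer(pred_prob > 0.5)       # threshold at 0.5
mean(pred_class == labels_test)                 # proportion correct, i.e. accuracy
```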
When approaching an NLP task, we are ultimately interested in quantifying free-text human input, and in applying that quantification to modeling for the downstream uses of analysis and prediction. To this end, one quantity of utmost importance is the set of important words that characterize a sentiment of interest.
Extracting Feature Importance is a real strength of tree-based methods for NLP. An easily interpretable outcome from the modeling stage is uniquely valuable here: the most important features are precisely the words the model relies on to learn the data, a vastly important step in machine learning.
In terms of the comparison made here, while TF-IDF does indeed have its merits in the NLP universe, I believe it underperforms at word-importance extraction when compared to tree-based methods like xgboost.
As always, thanks for reading!