Ch14. Sentiment Analysis

Learning Objectives

  1. Understand the tasks of subjectivity and sentiment analysis

  2. Learn about resources for subjectivity and sentiment analysis, specifically addressing lexicon-based sentiment analysis

  3. Learn about the tidy text approach to lexicon-based sentiment analysis

What is Sentiment Analysis?

Sentiment analysis is the computational study of people’s opinions, emotions, and attitudes, which are all part of sentiment.

Sentiment analysis is increasingly important in business and society. It offers numerous research challenges but promises insight useful for opinion analysis and social media analysis. So, sentiment analysis can ask these questions:

For sentiment analysis, we usually use NLP, lexicons, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit, which in our case is a tweet. For example, we might ask how people respond to the COVID-19 issue based on a sample of tweets.

Sentiment Analysis with Tidy Data

We have explored in depth what we mean by the tidy text data format and showed how this format can be used to approach questions about word frequency. We counted word frequencies and visualized a word cloud from the tidy text data. By doing so, we analyzed which words were used most frequently in tweets about COVID-19.

The tidy text data format is also useful for lexicon-based sentiment analysis. When we read a text or a tweet, we use our understanding of the emotional intent of words to infer whether a section of text or a tweet is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tidy tools of text mining to approach the emotional content of text programmatically, as shown in the following figure:

A flowchart of a typical text analysis that uses tidytext for sentiment analysis by Julia Silge

What is Lexicon-based Sentiment Analysis?

Starting this week, we will do lexicon-based sentiment analysis. This approach assumes that the sentiment orientation of a text is the sum of the sentiment orientations of its words or phrases. In other words, to analyze the sentiment of a text, we treat the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of those words.

Specifically, we find the total sentiment of a piece of text by adding up the sentiment scores of each word in the text that matches a word in a sentiment lexicon. For example, if a tweet includes 5 positive words and 15 negative words, its net count is 5 - 15 = -10, so we classify the sentiment toward COVID-19 in this tweet as negative. We can then measure the overall sentiment expressed on Twitter by counting how many tweets are classified as positive and how many as negative.

This is not the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
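The counting idea above can be sketched in a few lines. The following is a minimal illustration in Python rather than a tidytext workflow, and it rests on stated assumptions: the lexicon TOY_LEXICON, the example tweets, and the helper names tokenize, net_sentiment, and classify are all hypothetical and are not taken from any published lexicon or package.

```python
from collections import Counter
import re

# Hypothetical toy lexicon for illustration only; not a published lexicon.
TOY_LEXICON = {
    "good": "positive",
    "hopeful": "positive",
    "safe": "positive",
    "bad": "negative",
    "scared": "negative",
    "sick": "negative",
}

def tokenize(text):
    """Lowercase the text and split it into unigrams (single words)."""
    return re.findall(r"[a-z']+", text.lower())

def net_sentiment(text, lexicon=TOY_LEXICON):
    """Return (positive count) - (negative count) over the matched words."""
    counts = Counter(lexicon[w] for w in tokenize(text) if w in lexicon)
    return counts["positive"] - counts["negative"]

def classify(text, lexicon=TOY_LEXICON):
    """Label a text by the sign of its net sentiment."""
    score = net_sentiment(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up example tweets.
tweets = [
    "Feeling hopeful and safe after the vaccine",
    "So scared, everyone around me is sick",
    "The numbers came out today",
]

# Count how many tweets fall into each class.
labels = Counter(classify(t) for t in tweets)
print(labels)
```

A text with no matched words, or with equal numbers of positive and negative words, gets a net count of zero; labeling it "neutral" is a design choice of this sketch, not a rule of the method.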

Sentiment Lexicons

Lexicon-based sentiment analysis begins with annotating the words in a text with a sentiment type or an intensity score.

Words in sentiment lexicons are associated with a sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.

Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.

Sentiment associations are commonly captured in sentiment lexicons, which are lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). Using the sentiment lexicons, we can measure the sentiment content for words in the text.

Sentiment Lexicons from the textdata package

A number of sentiment lexicons provide lists of positive and negative words that can be used to evaluate the opinion or emotion in text. The textdata package provides four main sentiment lexicons: 1) AFINN from Finn Årup Nielsen, 2) Bing from Bing Liu and collaborators, 3) Loughran from Loughran and McDonald, and 4) NRC from Saif Mohammad and Peter Turney.

The package also provides two NRC companion lexicons: 1) the NRC Emotion Intensity Lexicon (NRC-EIL) and 2) the NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon.

All of these lexicons are based on unigrams, i.e., single words. They contain many English words, and the words are assigned scores for positive/negative sentiment and, in some lexicons, for emotions like joy, anger, sadness, and so forth.

Four main sentiment lexicons

  1. lexicon_afinn() returns the AFINN lexicon, which contains 2,477 English words rated for valence. Each word is labeled with an integer score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

  2. lexicon_bing() returns the Bing lexicon, one of the most popular general-purpose English sentiment lexicons, which categorizes 6,787 words in a binary fashion as positive or negative.

  3. lexicon_loughran() returns the Loughran-McDonald sentiment lexicon, which was created for use with financial documents. This lexicon labels 4,150 words with 6 possible sentiments important in financial contexts: “positive”, “negative”, “constraining”, “litigious”, “superfluous”, and “uncertainty”.

  4. lexicon_nrc() returns the NRC lexicon, which is also a general-purpose English sentiment lexicon. This lexicon labels 13,901 words with 10 possible categories of sentiments or emotions: “positive”, “negative”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”.

Two additional lexicons from NRC

4-1. lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (NRC-EIL), which is a list of 5,814 English words and their associations with four basic emotions (anger, fear, sadness, and joy). For a given word and emotion X, the assigned score ranges from 0 to 1. A score of 1 means that the word conveys the highest amount of emotion X, and a score of 0 means that it conveys the lowest amount.

4-2. lexicon_nrc_vad() returns the NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon, which includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and a dimension of valence, arousal, or dominance, the assigned score ranges from 0 (lowest degree of V/A/D) to 1 (highest V/A/D).

All of this information is tabulated in each dataset. Each dataset can be downloaded through the textdata package to get the list of words and their annotated sentiments or values.

To sum up, the textdata datasets include the following features:

  • word, an English word (unigram)

  • sentiment/AffectDimension, one of positive, negative, or a specific emotion or dimension:

    • the Bing lexicon has only positive and negative;
    • the Loughran lexicon has positive, negative, constraining, litigious, superfluous, and uncertainty;
    • the NRC lexicon has positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust;
    • the NRC-EIL has anger, fear, sadness, and joy; and
    • the NRC-VAD Lexicon has valence (positiveness-negativeness/pleasure-displeasure), arousal (active-passive), and dominance (dominant-submissive)
  • value/score, a numerical score for the sentiment, running between -5 and 5 for the AFINN lexicon and between 0 and 1 for the NRC-EIL and NRC-VAD lexicons

  • Note that the sentiment lexicons are stored as tidy data frames with one word per row. Not every English word is in the lexicons, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”.
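The points in this note can be seen in a small sketch. It is written in Python for illustration, not with the textdata lexicons themselves. Everything in it is an assumption made for the example: the lexicon TOY_SCORES (its words and AFINN-style values are made up), the example texts, and the helpers unnest_tokens, inner_join, and score_by_doc, which only imitate the one-word-per-row and join steps of the tidy approach.

```python
import re

# Hypothetical toy lexicon with AFINN-style integer scores; the words and
# values are made up for illustration and are not copied from AFINN.
TOY_SCORES = {"good": 3, "great": 4, "bad": -3, "terrible": -4}

def unnest_tokens(docs):
    """Turn {doc_id: text} into tidy rows of (doc_id, word), one word per row."""
    return [
        (doc_id, word)
        for doc_id, text in docs.items()
        for word in re.findall(r"[a-z']+", text.lower())
    ]

def inner_join(rows, lexicon):
    """Keep only rows whose word is in the lexicon and attach its score."""
    return [(doc_id, word, lexicon[word]) for doc_id, word in rows if word in lexicon]

def score_by_doc(joined):
    """Sum the matched word scores for each document."""
    totals = {}
    for doc_id, _word, value in joined:
        totals[doc_id] = totals.get(doc_id, 0) + value
    return totals

docs = {
    "t1": "The response was great",
    "t2": "This is not good",     # negated, but "good" still matches as positive
    "t3": "Cases were reported",  # no lexicon words, so it drops out of the join
}

joined = inner_join(unnest_tokens(docs), TOY_SCORES)
totals = score_by_doc(joined)
print(totals)
```

In this sketch, "t3" has no total at all because none of its words are in the lexicon, and "t2" receives a positive total even though "not good" is negative in meaning. A unigram lexicon has no way to see the qualifier in front of the matched word.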