Ch14. Sentiment Analysis

Learning Objectives

  1. Understand the tasks of subjectivity and sentiment analysis

  2. Learn about resources for subjectivity and sentiment analysis, specifically addressing lexicon-based sentiment analysis

  3. Learn about the tidy text approach to lexicon-based sentiment analysis

What is Sentiment Analysis?

Sentiment analysis is the computational study of people’s opinions, emotions, and attitudes, which are all part of sentiment.

Sentiment analysis is increasingly important in business and society. It poses numerous research challenges but promises insights useful for opinion analysis and social media analysis. So, what kinds of questions can sentiment analysis help us answer?

To perform sentiment analysis, we usually use NLP, lexicons, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit (in our case, a tweet). For example, we might ask how people respond to COVID-19 based on a sample of tweets.

Sentiment Analysis with Tidy Data

We explored in depth what we mean by the tidy text data format and showed how this format can be used to approach questions about word frequency. We counted word frequencies and visualized a word cloud from the tidy text data. By doing so, we analyzed which words were used most frequently in tweets about COVID-19.

The tidy text data format is also useful for lexicon-based sentiment analysis. When we read a text or a tweet, we use our understanding of the emotional intent of words to infer whether a section of text or a tweet is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tidy tools of text mining to approach the emotional content of text programmatically, as shown in the following figure:

A flowchart of a typical text analysis that uses tidytext for sentiment analysis, by Julia Silge

What is Lexicon-based Sentiment Analysis?

This week, we are going to do some lexicon-based sentiment analysis. This approach assumes that the overall sentiment orientation of a text is the sum of the sentiment orientation of each word or phrase. So, to analyze the sentiment of a text, we treat the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.

Specifically, we find the total sentiment of a piece of text by adding up the individual sentiment scores of each word in the text that matches a word in a sentiment lexicon, as sketched below. For example, if a tweet includes 5 positive words and 15 negative words, we can conclude that the sentiment toward COVID-19 in this tweet is negative. In this way, we can measure the overall degree of sentiment expressed on Twitter by counting the tweets classified as positive or negative.
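As a minimal sketch of this summing idea, the whole pipeline can be written in a few lines. This is a hedged example using the AFINN lexicon from the textdata package (both packages are introduced later in this chapter), and the example sentence is invented purely for illustration:

library(tidyverse)
library(tidytext)
library(textdata)

toy <- tibble(text = "the recovery was a success but the delays were a disaster")

toy %>% 
  unnest_tokens(word, text) %>%                # one word per row
  inner_join(lexicon_afinn(), by = "word") %>% # keep only words found in AFINN
  summarise(sentiment_score = sum(value))      # total sentiment = sum of word scores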

This is not the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

Sentiment Lexicons

Lexicon-based sentiment analysis begins with annotating words in text with a type of sentiment or its intensity score.

Words in sentiment lexicons are associated with sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.

Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.

Sentiment associations are commonly captured in sentiment lexicons, which are lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). Using the sentiment lexicons, we can measure the sentiment content for words in the text.

Sentiment Lexicons in the sentiments Dataset from the textdata package

Of course, a number of sentiment lexicons exist that provide lists of positive and negative words for evaluating the opinion or emotion in text. The textdata package contains four sentiment lexicons in the sentiments dataset: 1) AFINN from Finn Årup Nielsen, 2) bing from Bing Liu and collaborators, 3) loughran from Loughran and McDonald, and 4) nrc from Saif Mohammad and Peter Turney.

These four lexicons are based on unigrams, i.e., single words. They contain many English words, and each word is assigned a score for positive/negative sentiment and possibly also for emotions like joy, anger, and sadness.

Two additional lexicons from nrc

4-1. lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon), a list of English words and their associations with four basic emotions (anger, fear, sadness, and joy). For a given word and emotion X, the assigned score ranges from 0 (the word conveys the lowest amount of emotion X) to 1 (the word conveys the highest amount of emotion X).

4-2. lexicon_nrc_vad() returns the NRC Valence, Arousal, and Dominance Lexicon, which includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and a dimension of valence, arousal, or dominance, the assigned score ranges from 0 (lowest degree of V/A/D) to 1 (highest degree of V/A/D).

All of this information is tabulated in each lexicon's dataset, which can be downloaded with the textdata package; each lexicon includes only the columns that it actually uses.

To sum up, the textdata datasets include the following features:

  • word, an English word (unigram)

  • sentiment/AffectDimension, which is positive, negative, or a specific emotion:

    • the Bing lexicon has only positive/negative,
    • the Loughran lexicon has positive, negative, constraining, litigious, superfluous, and uncertainty, and
    • the NRC lexicon has positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust
    • the NRC Emotion Intensity Lexicon (EIL) has anger, fear, sadness, and joy
    • the NRC Valence, Arousal, and Dominance (VAD) Lexicon has valence (positiveness-negativeness/pleasure-displeasure), arousal (active-passive), and dominance (dominant-submissive)
  • value/score, a numerical score for the sentiment, running between -5 and 5 for the AFINN lexicon and between 0 and 1 for the NRC Emotion Intensity and Valence/Arousal/Dominance Lexicons

  • Note that the sentiment lexicons are tidy data frames with one word per row. However, not every English word is in the lexicons, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”.

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.0     √ purrr   0.3.4
## √ tibble  3.0.0     √ dplyr   0.8.5
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(textdata) 

# The AFINN lexicon uses the `value` feature
## Number of words by score
lexicon_afinn() %>% 
  count(value)
## # A tibble: 11 x 2
##    value     n
##    <dbl> <int>
##  1    -5    16
##  2    -4    43
##  3    -3   264
##  4    -2   966
##  5    -1   309
##  6     0     1
##  7     1   208
##  8     2   448
##  9     3   172
## 10     4    45
## 11     5     5
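# A hedged aside: look up a few individual words to compare their intensity;
# some of these example words may not be in AFINN, in which case they simply
# will not appear in the result
lexicon_afinn() %>% 
  filter(word %in% c("succeed", "improve", "failure", "decline"))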
# bing lexicon uses the 'sentiment' feature
## Number of words by sentiment
lexicon_bing() %>% 
  group_by(sentiment) %>% 
  summarise(n = n())
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4782
## 2 positive   2005
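# count(sentiment) is a shortcut for group_by(sentiment) %>% summarise(n = n())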
lexicon_bing() %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4782
## 2 positive   2005
# The loughran and nrc lexicons specify more specific categories of sentiment
## Number of words by sentiment
lexicon_loughran() %>% 
  group_by(sentiment) %>% 
  summarise(n = n())
## # A tibble: 6 x 2
##   sentiment        n
##   <chr>        <int>
## 1 constraining   184
## 2 litigious      904
## 3 negative      2355
## 4 positive       354
## 5 superfluous     56
## 6 uncertainty    297
lexicon_nrc() %>% 
  group_by(sentiment) %>% 
  summarise(n = n())
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1247
##  2 anticipation   839
##  3 disgust       1058
##  4 fear          1476
##  5 joy            689
##  6 negative      3324
##  7 positive      2312
##  8 sadness       1191
##  9 surprise       534
## 10 trust         1231
lexicon_nrc_eil() %>% 
  count(AffectDimension)
## # A tibble: 4 x 2
##   AffectDimension     n
##   <chr>           <int>
## 1 anger            1483
## 2 fear             1765
## 3 joy              1268
## 4 sadness          1298
library(ggplot2)
lexicon_nrc_eil() %>% 
  ggplot(aes(x=score)) +
  geom_histogram(color="black", fill="white") +
  facet_wrap(~ AffectDimension, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

lexicon_nrc_vad() %>% 
  ggplot(aes(x=Valence)) +
  geom_histogram(color="black", fill="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
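# A hedged peek at the VAD lexicon: the ten words with the highest valence scores
lexicon_nrc_vad() %>% 
  arrange(desc(Valence)) %>% 
  head(10)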

Basic Lexicon-based Sentiment Analysis with inner_join()

Lexicon-based sentiment analysis can be performed on our tweet data in a tidy format, that is, a format in which each row contains a single word from a tweet.

library(tidytext)
library(stringr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stopwords)
load("covid19_tweets_df.RData")

covid19_tweets_tidy <- covid19_tweets_df %>% 
  select(created_at, text) %>% 
  filter(!duplicated(text)) %>% 
  mutate(date = floor_date(created_at, unit="day")) %>% 
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>% 
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% 
  unnest_tweets(word, text) %>% 
  filter(!word %in% stopwords()) %>% 
  filter(str_detect(word, "[a-z]"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy %>% count(word, sort=T)
## # A tibble: 1,393,892 x 2
##    word              n
##    <chr>         <int>
##  1 covid19      395194
##  2 #covid19     325356
##  3 #coronavirus 208448
##  4 people        90593
##  5 s             84213
##  6 can           81292
##  7 us            80525
##  8 cases         78857
##  9 now           75707
## 10 #covid2019    67658
## # ... with 1,393,882 more rows
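Before joining with a lexicon, it is worth checking how many of our tokens appear in that lexicon at all. The following is a hedged sketch using the bing lexicon; most tokens are neutral words, hashtags, or handles and therefore will not match:

covid19_tweets_tidy %>% 
  summarise(share_in_bing = mean(word %in% lexicon_bing()$word))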

Understanding inner_join()

With data in the tidy format, sentiment analysis can be done as an inner join. When a tidy data frame b is joined to a tidy data frame a using a %>% inner_join(b), the result contains all rows from a that have matching values in b, together with all columns from a and b.

text <- tibble(word = c("holiday","makes","me","happy","but","this","song","is","sad"))
lexicon <- tibble(word = c("happy","sad","holiday","funeral"), 
                  sentiment = c("positive","negative","positive","negative"))
inner_join(text, lexicon)
## Joining, by = "word"
## # A tibble: 3 x 2
##   word    sentiment
##   <chr>   <chr>    
## 1 holiday positive 
## 2 happy   positive 
## 3 sad     negative
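Note that this matching is strictly word by word, so qualifiers such as “not” are lost, which is the limitation mentioned earlier. A quick sketch with the same toy lexicon:

negated <- tibble(word = c("not", "happy"))
inner_join(negated, lexicon) # "happy" still matches as positive; "not" is simply ignored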

Let’s look at the words with a negative sentiment from the bing lexicon. What are the most common negative words in tweets on COVID-19? We can use count() from the dplyr package.

# Using the bing lexicon, select only the words associated with a 'negative' sentiment
bing_negative <- lexicon_bing() %>% 
  filter(sentiment == "negative")
bing_negative
## # A tibble: 4,782 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 4,772 more rows
# We can count the usage frequency of 'negative' words in tweets on COVID-19
covid19_tweets_tidy %>% 
  inner_join(bing_negative) %>% 
  count(word, sort=T)
## Joining, by = "word"
## # A tibble: 3,887 x 2
##    word          n
##    <chr>     <int>
##  1 virus     39870
##  2 crisis    29285
##  3 outbreak  18410
##  4 death     18399
##  5 symptoms  12908
##  6 emergency 12519
##  7 risk      11825
##  8 died      10866
##  9 die       10840
## 10 infected   9879
## # ... with 3,877 more rows
# We can count the numbers of positive and negative words

covid19_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(sentiment)
## Joining, by = "word"
## # A tibble: 2 x 2
##   sentiment      n
##   <chr>      <int>
## 1 negative  807340
## 2 positive  767181
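# A hedged extension: compute a net sentiment (positive minus negative) per day,
# using the `date` column created earlier; pivot_wider() comes from tidyr
covid19_tweets_tidy %>% 
  inner_join(lexicon_bing(), by = "word") %>% 
  count(date, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>% 
  mutate(net_sentiment = positive - negative)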
# Or we can count the frequency of 'fear' words in tweets on COVID-19
covid19_tweets_tidy %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(sentiment == "fear") %>% 
  count(word, sort=T)
## Joining, by = "word"
## # A tibble: 1,427 x 2
##    word           n
##    <chr>      <int>
##  1 pandemic   54287
##  2 fight      22013
##  3 government 21564
##  4 death      18399
##  5 medical    17978
##  6 hospital   16391
##  7 case       12735
##  8 emergency  12519
##  9 risk       11825
## 10 watch      11405
## # ... with 1,417 more rows
# We can also summarise different emotions 
covid19_tweets_tidy %>% 
  inner_join(get_sentiments("nrc")) %>% 
  group_by(sentiment) %>% 
  summarise(freq = n()) %>% 
  arrange(desc(freq))
## Joining, by = "word"
## # A tibble: 10 x 2
##    sentiment       freq
##    <chr>          <int>
##  1 positive     1214572
##  2 negative      899418
##  3 trust         799147
##  4 fear          619911
##  5 anticipation  569580
##  6 sadness       448993
##  7 joy           383489
##  8 anger         333921
##  9 surprise      261166
## 10 disgust       229475
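Because the ten NRC categories contain very different numbers of words, raw frequencies are hard to compare directly. One simple adjustment, sketched here, is to express each emotion as a share of all matched words:

covid19_tweets_tidy %>% 
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(sentiment) %>% 
  mutate(share = n / sum(n)) %>% 
  arrange(desc(share))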

Visualizing the result of sentiment analysis

library(ggplot2)

# Bar chart
covid19_tweets_tidy %>% 
  inner_join(lexicon_nrc()) %>% 
  count(sentiment, sort=TRUE) %>%
  mutate(sentiment = reorder(sentiment, n)) %>% 
  ggplot(aes(x=sentiment, y=n)) +
  labs(x="Emotion", y="Frequency", title="Bar Chart of Sentiment toward COVID-19") +
  geom_bar(stat="identity", width=.5, fill="tomato3")  
## Joining, by = "word"

# Pie chart
covid19_tweets_tidy %>% 
  inner_join(get_sentiments("nrc")) %>% 
  count(sentiment, sort=TRUE) %>%
  mutate(sentiment = reorder(sentiment, n)) %>% 
  ggplot(aes(x="", y=n, fill=factor(sentiment))) +
  geom_bar(width=1, stat="identity") +
  labs(fill="sentiment", x=NULL, y=NULL, title="Pie Chart of Sentiment toward COVID-19") +
  coord_polar(theta="y", start=0) +
  theme_void()
## Joining, by = "word"

We can also visualize the top 20 words for each sentiment in the bing or nrc lexicons:

covid19_tweets_tidy %>% 
   inner_join(get_sentiments("nrc")) %>% 
   group_by(sentiment) %>% 
   count(word, sort=T) %>% 
   top_n(20) %>% 
   ggplot(aes(reorder(word, n), n, fill=sentiment)) +
   geom_bar(stat="identity", show.legend = FALSE) +
   facet_wrap(~sentiment, scales="free_y", ncol=5) +
   labs(y = "Contribution to sentiment", x = NULL) +
   coord_flip()
## Joining, by = "word"
## Selecting by n

covid19_tweets_tidy %>% 
   inner_join(get_sentiments("bing")) %>% 
   group_by(sentiment) %>% 
   count(word, sort=T) %>% 
   top_n(20) %>% 
   ggplot(aes(reorder(word, n), n, fill=sentiment)) +
   geom_bar(stat="identity", show.legend = FALSE) +
   facet_wrap(~sentiment, scales="free_y") +
   labs(y = "Contribution to sentiment", x = NULL) +
   coord_flip()
## Joining, by = "word"
## Selecting by n

Visualization of sentiment word clouds

library(wordcloud)
## Loading required package: RColorBrewer
# Positive words
covid19_tweets_tidy %>% 
  inner_join(get_sentiments("bing")) %>% # Joining with the Bing dataset
  filter(!word %in% c("trump", "like","positive","virus")) %>% # Removing irrelevant words to sentiment in this context
  group_by(sentiment) %>% 
  count(word, sort=T) %>% 
  filter(sentiment=="positive") %>% 
  with(wordcloud(words = word, # The with() function applies an expression to a dataset
                 freq = n, 
                 max.words = 100, # Maximum number of words plotted
                 random.order = FALSE, # Most frequent words placed in the middle
                 rot.per = 0.2, # Proportion of words drawn rotated
                 scale = c(3, 0.3), # Range of word sizes
                 colors = brewer.pal(8, "Dark2"))) # Use 8 colors from the "Dark2" palette
## Joining, by = "word"

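# Negative words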
covid19_tweets_tidy %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(!word %in% c("trump", "like","positive","virus")) %>% 
  group_by(sentiment) %>% 
  count(word, sort=T) %>% 
  filter(sentiment=="negative") %>% 
  with(wordcloud(words = word, # The with() function applies an expression to a dataset
                 freq = n, 
                 max.words = 100, # Maximum number of words plotted
                 random.order = FALSE, # Most frequent words placed in the middle
                 rot.per = 0.2, # Proportion of words drawn rotated
                 scale = c(3, 0.3), # Range of word sizes
                 colors = brewer.pal(8, "Dark2"))) # Use 8 colors from the "Dark2" palette
## Joining, by = "word"