Understand the tasks of subjectivity and sentiment analysis
Learn about resources for subjectivity and sentiment analysis, specifically addressing lexicon-based sentiment analysis
Learn about the tidy text approach to lexicon-based sentiment analysis
Sentiment analysis is the computational study of people’s opinions, emotions, and attitudes, which are all part of sentiment.
Sentiment analysis is increasingly important in business and society. It offers numerous research challenges but promises insights useful for opinion analysis and social media analysis. So, what kinds of questions can sentiment analysis answer?
For sentiment analysis, we usually use NLP, lexicons, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit (in our case, a tweet). Using sentiment analysis, we might ask how people respond to the COVID-19 pandemic based on a sample of tweets.
We explored in depth what we mean by the tidy text data format and showed how this format can be used to approach questions about word frequency. We counted the frequency of words and visualized a word cloud from the tidy text data. By doing so, we analyzed which words were used most frequently in tweets about COVID-19.
The tidy text data format is also useful for lexicon-based sentiment analysis. When we read a text or a tweet, we use our understanding of the emotional intent of words to infer whether a section of text or a tweet is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tidy tools of text mining to approach the emotional content of text programmatically, as shown in the following figure:
A flowchart of a typical text analysis that uses tidytext for sentiment analysis by Julia Silge
This week, we are going to do some lexicon-based sentiment analysis. This approach assumes that the overall sentiment orientation of a text is the sum of the sentiment orientation of each word or phrase. So, to analyze the sentiment of a text, we consider the text as a combination of its individual words, and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
Specifically, we find the total sentiment of a piece of text by adding up the individual sentiment scores of each word that matches an entry in a sentiment lexicon. For example, if a tweet includes 5 positive words and 15 negative words, we can infer that the sentiment toward COVID-19 in this tweet is negative. In this way, we can measure the overall degree of sentiment expressed on Twitter by counting how many tweets are classified as positive and how many as negative.
This is not the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
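To make this concrete, here is a minimal sketch with a made-up word list and a made-up lexicon (both invented for illustration only): the sentiment of the text is simply the sum of the scores of the words that are found in the lexicon.
library(dplyr)
library(tibble)
# Toy tokenized text and toy lexicon (illustrative scores only)
words   <- tibble(word = c("the", "outbreak", "is", "a", "disaster",
                           "but", "recovery", "looks", "promising"))
lexicon <- tibble(word  = c("disaster", "promising", "recovery"),
                  score = c(-4, 2, 1))
words %>%
  inner_join(lexicon, by = "word") %>%  # keep only the words found in the lexicon
  summarise(sentiment = sum(score))     # -4 + 2 + 1 = -1, so slightly negative overall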
Lexicon-based sentiment analysis begins with annotating words in text with a type of sentiment or its intensity score.
Words in sentiment lexicons are associated with sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.
Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.
Sentiment associations are commonly captured in sentiment lexicons, which are lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). Using the sentiment lexicons, we can measure the sentiment content for words in the text.
The textdata package
Of course, there exist a number of sentiment lexicons that provide lists of positive and negative words that can be used for evaluating the opinion or emotion in text. The textdata package provides four main sentiment lexicons, which are 1) AFINN from Finn Arup Nielsen, 2) Bing from Bing Liu and collaborators, 3) Loughran from Loughran and McDonald, and 4) NRC from Saif Mohammad and Peter Turney.
The package also provides two NRC companion lexicons: 1) the NRC Emotion Intensity Lexicon (NRC-EIL) and 2) the NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon.
All of these lexicons are based on unigrams, i.e., single words. They contain many English words, and the words are assigned scores for positive/negative sentiment, and possibly also for emotions like joy, anger, sadness, and so forth.
1. lexicon_afinn() returns the AFINN lexicon, which contains 2,477 English words rated for valence: each word is labeled with an integer score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
2. lexicon_bing() returns the Bing lexicon, one of the most popular general-purpose English sentiment lexicons, which categorizes 6,787 words in a binary fashion into positive and negative categories.
3. lexicon_loughran() returns the Loughran-McDonald sentiment lexicon, which was created for use with financial documents. It labels 4,150 words with 6 possible sentiments important in financial contexts: “positive”, “negative”, “constraining”, “litigious”, “superfluous”, and “uncertainty”.
4. lexicon_nrc() returns the NRC lexicon, which is also a general-purpose English sentiment lexicon. It labels 13,901 words with 10 possible categories of sentiment or emotion: “positive”, “negative”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”.
4-1. lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (NRC-EIL), a list of 5,814 English words and their associations with four basic emotions (anger, fear, sadness, and joy). For a given word and emotion X, the assigned score ranges from 0 to 1: a score of 1 means that the word conveys the highest amount of emotion X, and a score of 0 means that it conveys the lowest amount.
4-2. lexicon_nrc_vad() returns the NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon, which includes 20,007 English words and their valence, arousal, and dominance scores. For a given word and a dimension (valence, arousal, or dominance), the assigned score ranges from 0 (lowest degree of V/A/D) to 1 (highest degree of V/A/D).
All of this information is tabulated in each dataset, and each dataset can be downloaded with the textdata package to get the list of words and their annotated sentiments or values.
To sum up, the textdata datasets include the following features:
- word, an English word (unigram)
- sentiment / AffectDimension, one of positive, negative, or a specific emotion
- value / score, a numerical score for the sentiment, running between -5 and 5 for the AFINN lexicon and between 0 and 1 for the NRC-EIL and NRC-VAD lexicons
Note that the sentiment lexicons are tidy data frames with one word per row. Not every English word is in the lexicons, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”, as the short sketch below illustrates.
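This limitation is easy to see with a tiny made-up example (a sketch for illustration, not part of the tweet data): when the two-word text “not good” is joined to the Bing lexicon, only “good” is matched, and it is labeled positive, even though the phrase as a whole is negative.
library(dplyr)
library(tibble)
library(textdata)
tibble(word = c("not", "good")) %>%
  inner_join(lexicon_bing(), by = "word")
# "good" matches and is labeled "positive"; the qualifier "not" is ignored,
# so the unigram approach misses the negation in "not good".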
library(dplyr)
library(textdata)
# AFINN lexicon uses the `value` feature
## Number of words by value
lexicon_afinn() %>%
count(value)
## # A tibble: 11 x 2
## value n
## <dbl> <int>
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 966
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
lexicon_afinn() %>% filter(value==-5)
## # A tibble: 16 x 2
## word value
## <chr> <dbl>
## 1 bastard -5
## 2 bastards -5
## 3 bitch -5
## 4 bitches -5
## 5 cock -5
## 6 cocksucker -5
## 7 cocksuckers -5
## 8 cunt -5
## 9 motherfucker -5
## 10 motherfucking -5
## 11 niggas -5
## 12 nigger -5
## 13 prick -5
## 14 slut -5
## 15 son-of-a-bitch -5
## 16 twat -5
# bing lexicon uses the 'sentiment' feature
## Number of words by sentiment
lexicon_bing() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2005
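# Or, equivalently, with count(), which is shorthand for group_by() + summarise(n = n())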
lexicon_bing() %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2005
# The loughran and nrc lexicons specify several different categories of sentiment or emotion
## Number of words by sentiment
lexicon_loughran() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 6 x 2
## sentiment n
## <chr> <int>
## 1 constraining 184
## 2 litigious 904
## 3 negative 2355
## 4 positive 354
## 5 superfluous 56
## 6 uncertainty 297
lexicon_nrc() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
lexicon_nrc() %>%
filter(word == "hate")
## # A tibble: 5 x 2
## word sentiment
## <chr> <chr>
## 1 hate anger
## 2 hate disgust
## 3 hate fear
## 4 hate negative
## 5 hate sadness
lexicon_nrc_eil() %>%
count(AffectDimension)
## # A tibble: 4 x 2
## AffectDimension n
## <chr> <int>
## 1 anger 1483
## 2 fear 1765
## 3 joy 1268
## 4 sadness 1298
Let’s create histograms to show the distribution of word intensity scores within each emotion group.
library(ggplot2)
lexicon_nrc_eil() %>%
ggplot(aes(x=score)) +
geom_histogram(color="black", fill="white") +
facet_wrap(as.factor(AffectDimension) ~ ., ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
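# The same kind of histogram for the dominance scores in the NRC-VAD lexicon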
lexicon_nrc_vad() %>%
ggplot(aes(x=Dominance)) +
geom_histogram(color="black", fill="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
inner_join()
Lexicon-based sentiment analysis can be performed on our tweet data in a tidy format. That is, our tweet data are in a tidy format in which each row holds a single word from a tweet.
library(tidytext)
library(stringr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(stopwords)
load("covid19_tweets_df.RData")
covid19_tweets_tidy <- covid19_tweets_df %>%
  select(created_at, text) %>%                                   # keep the timestamp and the tweet text
  filter(!duplicated(text)) %>%                                  # drop duplicated tweets
  mutate(date = floor_date(created_at, unit="day")) %>%          # add a day-level date for later aggregation
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>%        # remove non-ASCII characters (and any # or @ attached to them)
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% # remove HTML entities and the retweet marker
  unnest_tweets(word, text) %>%                                  # tokenize with the Twitter-aware tokenizer
  filter(!word %in% stopwords()) %>%                             # remove stopwords
  filter(str_detect(word, "[a-z]"))                              # keep only tokens that contain a letter
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy
## # A tibble: 16,135,045 x 3
## created_at date word
## <dttm> <dttm> <chr>
## 1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
## 2 2020-03-27 04:28:33 2020-03-27 00:00:00 news
## 3 2020-03-27 04:28:33 2020-03-27 00:00:00 england
## 4 2020-03-27 04:28:33 2020-03-27 00:00:00 uk
## 5 2020-03-27 04:28:33 2020-03-27 00:00:00 firms
## 6 2020-03-27 04:28:33 2020-03-27 00:00:00 academics
## 7 2020-03-27 04:28:33 2020-03-27 00:00:00 also
## 8 2020-03-27 04:28:33 2020-03-27 00:00:00 developed
## 9 2020-03-27 04:28:33 2020-03-27 00:00:00 selftest
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 kits
## # … with 16,135,035 more rows
covid19_tweets_tidy %>% count(word, sort=T)
## # A tibble: 1,393,892 x 2
## word n
## <chr> <int>
## 1 covid19 395194
## 2 #covid19 325356
## 3 #coronavirus 208448
## 4 people 90593
## 5 s 84213
## 6 can 81292
## 7 us 80525
## 8 cases 78857
## 9 now 75707
## 10 #covid2019 67658
## # … with 1,393,882 more rows
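# Single-character tokens such as "s" (mostly left over from contractions) carry little meaning, so drop them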
covid19_tweets_tidy <- covid19_tweets_tidy %>%
filter(str_length(word) > 1)
inner_join()
With data in the tidy format, sentiment analysis can be done as an inner join. When a tidy data frame b is joined to a tidy data frame a using a %>% inner_join(b), the result contains all rows from a that have matching values in b, together with all columns from a and b.
library(tibble)
text <- tibble(word = c("holiday","makes","me","happy","but","this","song","is","sad"))
text
## # A tibble: 9 x 1
## word
## <chr>
## 1 holiday
## 2 makes
## 3 me
## 4 happy
## 5 but
## 6 this
## 7 song
## 8 is
## 9 sad
lexicon <- tibble(word = c("happy","sad","holiday","funeral"),
sentiment = c("positive","negative","positive","negative"))
lexicon
## # A tibble: 4 x 2
## word sentiment
## <chr> <chr>
## 1 happy positive
## 2 sad negative
## 3 holiday positive
## 4 funeral negative
inner_join(text, lexicon)
## Joining, by = "word"
## # A tibble: 3 x 2
## word sentiment
## <chr> <chr>
## 1 holiday positive
## 2 happy positive
## 3 sad negative
Let’s look at the words with positive and negative sentiment from the Bing lexicon. What are the most common negative words in tweets on COVID-19? We can use count() from the dplyr package.
# Using the Bing lexicon, we can select only the words in covid19_tweets_tidy that are annotated with a sentiment
covid19_tweets_tidy
## # A tibble: 15,941,162 x 3
## created_at date word
## <dttm> <dttm> <chr>
## 1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating
## 2 2020-03-27 04:28:33 2020-03-27 00:00:00 news
## 3 2020-03-27 04:28:33 2020-03-27 00:00:00 england
## 4 2020-03-27 04:28:33 2020-03-27 00:00:00 uk
## 5 2020-03-27 04:28:33 2020-03-27 00:00:00 firms
## 6 2020-03-27 04:28:33 2020-03-27 00:00:00 academics
## 7 2020-03-27 04:28:33 2020-03-27 00:00:00 also
## 8 2020-03-27 04:28:33 2020-03-27 00:00:00 developed
## 9 2020-03-27 04:28:33 2020-03-27 00:00:00 selftest
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 kits
## # … with 15,941,152 more rows
covid19_tweets_tidy %>%
inner_join(lexicon_bing())
## Joining, by = "word"
## # A tibble: 1,574,521 x 4
## created_at date word sentiment
## <dttm> <dttm> <chr> <chr>
## 1 2020-03-27 04:28:33 2020-03-27 00:00:00 fascinating positive
## 2 2020-03-27 04:28:33 2020-03-27 00:00:00 available positive
## 3 2020-03-27 04:28:33 2020-03-27 00:00:00 virus negative
## 4 2020-03-27 04:28:33 2020-03-27 00:00:00 hard negative
## 5 2020-03-27 04:28:33 2020-03-27 00:00:00 fucking negative
## 6 2020-03-27 04:28:33 2020-03-27 00:00:00 like positive
## 7 2020-03-27 04:28:33 2020-03-27 00:00:00 shit negative
## 8 2020-03-27 04:27:01 2020-03-27 00:00:00 support positive
## 9 2020-03-27 04:27:01 2020-03-27 00:00:00 like positive
## 10 2020-03-27 04:28:33 2020-03-27 00:00:00 myth negative
## # … with 1,574,511 more rows
# We can count the usage frequency of positive and negative words in tweets on COVID-19
covid19_tweets_tidy %>%
inner_join(lexicon_bing()) %>%
count(sentiment, sort=T)
## Joining, by = "word"
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 807340
## 2 positive 767181
# Or we can count the frequency of 'fear' words in tweets on COVID-19
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
filter(sentiment == "fear") %>%
count(word, sort=T)
## Joining, by = "word"
## # A tibble: 1,427 x 2
## word n
## <chr> <int>
## 1 pandemic 54287
## 2 fight 22013
## 3 government 21564
## 4 death 18399
## 5 medical 17978
## 6 hospital 16391
## 7 case 12735
## 8 emergency 12519
## 9 risk 11825
## 10 watch 11405
## # … with 1,417 more rows
# We can also summarise different emotions
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
group_by(sentiment) %>%
summarise(freq = n()) %>%
arrange(desc(freq))
## Joining, by = "word"
## # A tibble: 10 x 2
## sentiment freq
## <chr> <int>
## 1 positive 1214572
## 2 negative 899418
## 3 trust 799147
## 4 fear 619911
## 5 anticipation 569580
## 6 sadness 448993
## 7 joy 383489
## 8 anger 333921
## 9 surprise 261166
## 10 disgust 229475
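Because we kept a day-level date column when tidying the tweets, we can also look at how the balance of positive and negative words changes over time. The following is a hedged sketch of that idea (it additionally assumes the tidyr package is installed):
covid19_tweets_tidy %>%
  inner_join(lexicon_bing(), by = "word") %>%
  count(date, sentiment) %>%                        # positive/negative word counts per day
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n,
                     values_fill = 0) %>%           # one row per day, one column per sentiment
  mutate(net = positive - negative)                 # net sentiment: > 0 means more positive words that day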
library(ggplot2)
# Bar chart
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
count(sentiment, sort=TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(x=sentiment, y=n)) +
labs(x="Emotion", y="Frequency", title="Bar Chart of Sentiment toward COVID-19") +
geom_bar(stat="identity", width=.5, fill="tomato3")
## Joining, by = "word"
# Pie chart
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
count(sentiment, sort=TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(x="", y=n, fill=factor(sentiment))) +
geom_bar(width=1, stat="identity") +
labs(fill="sentiment", x=NULL, y=NULL, title="Pie Chart of Sentiment toward COVID-19") +
coord_polar(theta="y", start=0) +
theme_void()
## Joining, by = "word"
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
top_n(20) %>%
ggplot(aes(reorder(word, n), n, fill=sentiment)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~sentiment, scales="free_y", ncol=5) +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
## Joining, by = "word"
## Selecting by n
covid19_tweets_tidy %>%
inner_join(lexicon_bing()) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
top_n(20) %>%
ggplot(aes(reorder(word, n), n, fill=sentiment)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
## Joining, by = "word"
## Selecting by n
library(wordcloud)
## Loading required package: RColorBrewer
# Positive words
covid19_tweets_tidy %>%
inner_join(lexicon_bing()) %>% # Joining with the Bing dataset
filter(!word %in% c("trump", "like","positive","virus")) %>% # Remove words whose lexicon sentiment is misleading in this context
group_by(sentiment) %>%
count(word, sort=T) %>%
filter(sentiment=="positive") %>%
with(wordcloud(words = word, # The with() function applies an expression to a dataset.
freq = n,
max.words = 100, # Maximum number of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.2, # Proportion of words rotated in the plot
scale = c(3, 0.3), # Range of word sizes
colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the "Dark2" palette
## Joining, by = "word"
covid19_tweets_tidy %>%
inner_join(lexicon_bing()) %>%
filter(!word %in% c("trump", "like","positive","virus")) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
filter(sentiment=="negative") %>%
with(wordcloud(words = word, # The with() function applies an expression to a dataset.
freq = n,
max.words = 100, # Maximum number of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.2, # Proportion of words rotated in the plot
scale = c(3, 0.3), # Range of word sizes
colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the "Dark2" palette
## Joining, by = "word"
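As an optional variation (a sketch that assumes the reshape2 package is installed, following the comparison-cloud approach from the tidytext book), positive and negative words can also be shown together in a single comparison cloud:
library(reshape2)
covid19_tweets_tidy %>%
  inner_join(lexicon_bing(), by = "word") %>%
  filter(!word %in% c("trump", "like","positive","virus")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  # word-by-sentiment count matrix
  comparison.cloud(colors = c("tomato3", "darkgreen"),    # negative vs. positive (columns are alphabetical)
                   max.words = 100)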