Understand the tasks of subjectivity and sentiment analysis
Learn about resources for subjectivity and sentiment analysis, specifically addressing lexicon-based sentiment analysis
Learn about the tidy text approach to lexicon-based sentiment analysis
Sentiment analysis is the computational study of people’s opinions, emotions, and attitudes, which are all part of sentiment.
Sentiment analysis is increasingly important in business and society. It poses numerous research challenges but promises insights useful for opinion analysis and social media analysis.
For sentiment analysis, we usually use NLP, lexicons, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit, or a tweet in our case. Using sentiment analysis, we might ask, for example, how people respond to the COVID-19 pandemic based on a sample of tweets.
We explored in depth what we mean by the tidy text data format and showed how this format can be used to approach questions about word frequency. We counted word frequencies and visualized a word cloud from the tidy text data. By doing so, we analyzed which words were used most frequently in tweets about COVID-19.
The tidy text data format is also useful for lexicon-based sentiment analysis. When we read a text or a tweet, we use our understanding of the emotional intent of words to infer whether a section of text or a tweet is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tidy tools of text mining to approach the emotional content of text programmatically, as shown in the following figure:
A flowchart of a typical text analysis that uses tidytext for sentiment analysis by Julia Silge
This week, we are going to do some lexicon-based sentiment analysis. This approach assumes that the sentiment orientation of a text is the sum of the sentiment orientations of its individual words or phrases. So, to analyze the sentiment of a text, we consider the text as a combination of its individual words, and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
Specifically, we find the total sentiment of a piece of text by adding up the individual sentiment scores of each word that matches an entry in a sentiment lexicon. For example, if a tweet includes 5 positive words and 15 negative words, we can infer that the sentiment toward COVID-19 in this tweet is negative. In this way, we can measure the overall degree of sentiment expressed on Twitter by counting how many tweets are classified as positive and how many as negative.
This is not the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
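To make the idea concrete, here is a minimal sketch of the word-sum approach. The toy_lexicon tibble and the example sentence are made up purely for illustration; the real sentiment lexicons are introduced below.
library(dplyr)
library(tidytext)
# A tiny, made-up lexicon: each word gets a sentiment value
toy_lexicon <- tibble(word  = c("happy", "sad", "love", "hate"),
                      value = c(2, -2, 3, -3))
# Tokenize a sentence into one word per row, keep only the words found in the
# lexicon, and add up their values to get the sentiment of the whole text
tibble(text = "i love this song but the ending makes me sad") %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%
  summarise(sentiment = sum(value))   # 3 + (-2) = 1, mildly positive overall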
Lexicon-based sentiment analysis begins with annotating words in text with a type of sentiment or its intensity score.
Words in sentiment lexicons are associated with sentiment. For example, honest and competent are associated with positive sentiment, whereas dishonest and dull are associated with negative sentiment.
Furthermore, the degree of positivity (or negativity), also referred to as sentiment intensity, can vary. For example, most people will agree that succeed is more positive (or less negative) than improve, and failure is more negative (or less positive) than decline.
Sentiment associations are commonly captured in sentiment lexicons, which are lists of associated word-sentiment pairs (optionally with a score indicating the degree of association). Using the sentiment lexicons, we can measure the sentiment content for words in the text.
sentiments Dataset from the textdata package
Of course, there exist a number of sentiment lexicons that provide lists of positive and negative words for evaluating the opinion or emotion in text. The textdata package contains four sentiment lexicons in the sentiments dataset: 1) AFINN from Finn Årup Nielsen, 2) bing from Bing Liu and collaborators, 3) loughran from Loughran and McDonald, and 4) nrc from Saif Mohammad and Peter Turney.
These four lexicons are based on unigrams, i.e., single words. The lexicons contain many English words, and the words are assigned scores for positive/negative sentiment and, in some cases, for emotions like joy, anger, sadness, and so forth.
1. lexicon_afinn() returns the AFINN lexicon, which contains English words rated for valence. Each word is labeled with an integer score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
2. lexicon_bing() returns the Bing lexicon, one of the most popular general-purpose English sentiment lexicons, which categorizes words in a binary fashion into positive and negative categories.
3. lexicon_loughran() returns the Loughran-McDonald sentiment lexicon, which was created for use with financial documents. This lexicon labels words with 6 possible sentiments important in financial contexts: “positive”, “negative”, “constraining”, “litigious”, “superfluous”, and “uncertainty”.
4. lexicon_nrc() returns the NRC lexicon, which is also a general-purpose English sentiment lexicon. This lexicon labels words with 10 possible categories of sentiments or emotions: “positive”, “negative”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”.
4-1. lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon), a list of English words and their associations with four basic emotions (anger, fear, sadness, and joy). For a given word and emotion X, the assigned score ranges from 0 to 1: a score of 1 means the word conveys the highest amount of emotion X, and a score of 0 means it conveys the lowest amount.
4-2. lexicon_nrc_vad() returns the NRC Valence, Arousal, and Dominance Lexicon, which includes more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and dimension (valence, arousal, or dominance), the assigned score ranges from 0 (lowest degree of V/A/D) to 1 (highest degree of V/A/D).
All of this information is tabulated in the corresponding datasets, and each lexicon can be downloaded from the textdata package without the columns that are not used in that lexicon.
To sum up, the textdata datasets include the following features:
word, an English word (unigram)
sentiment / AffectDimension, one of either positive, negative, or a specific emotion
value / score, a numerical score for the sentiment, running between -5 and 5 for the AFINN lexicon and between 0 and 1 for the NRC Emotion Intensity Lexicon and the Valence/Arousal/Dominance Lexicon
Note that the sentiment lexicons are tidy data frames with one word per row. However, not every English word appears in the lexicons, because many English words are fairly neutral. Also, words with non-ASCII characters were removed from the lexicons. Finally, the lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”.
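For example, here is a quick sketch (the example phrase is made up) of how a unigram lexicon handles a negated phrase: only the word "good" has an entry in the Bing lexicon, so the phrase ends up scored as positive even though its intended meaning is negative.
library(dplyr)
library(tidytext)
library(textdata)
tibble(text = "this is not good") %>%
  unnest_tokens(word, text) %>%                # one word per row
  inner_join(lexicon_bing(), by = "word")      # only "good" matches, labeled positive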
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.0 √ purrr 0.3.4
## √ tibble 3.0.0 √ dplyr 0.8.5
## √ tidyr 1.0.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(textdata)
# AFINN lexicon uses the `value` feature
## Number of words by score
lexicon_afinn() %>%
count(value)
## # A tibble: 11 x 2
## value n
## <dbl> <int>
## 1 -5 16
## 2 -4 43
## 3 -3 264
## 4 -2 966
## 5 -1 309
## 6 0 1
## 7 1 208
## 8 2 448
## 9 3 172
## 10 4 45
## 11 5 5
# bing lexicon uses the 'sentiment' feature
## Number of words by sentiment
lexicon_bing() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2005
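# count(sentiment) is a shortcut for group_by(sentiment) %>% summarise(n = n())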
lexicon_bing() %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2005
# The loughran and nrc lexicons specify different categories of sentiment or emotion
## Number of words by sentiment
lexicon_loughran() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 6 x 2
## sentiment n
## <chr> <int>
## 1 constraining 184
## 2 litigious 904
## 3 negative 2355
## 4 positive 354
## 5 superfluous 56
## 6 uncertainty 297
lexicon_nrc() %>%
group_by(sentiment) %>%
summarise(n = n())
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
lexicon_nrc_eil() %>%
count(AffectDimension)
## # A tibble: 4 x 2
## AffectDimension n
## <chr> <int>
## 1 anger 1483
## 2 fear 1765
## 3 joy 1268
## 4 sadness 1298
library(ggplot2)
lexicon_nrc_eil() %>%
ggplot(aes(x=score)) +
geom_histogram(color="black", fill="white") +
facet_wrap(as.factor(AffectDimension) ~ ., ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
lexicon_nrc_vad() %>%
ggplot(aes(x=Valence)) +
geom_histogram(color="black", fill="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Sentiment analysis with inner_join()
Lexicon-based sentiment analysis can be performed on our tweet data in a tidy format, that is, a format in which each row contains a single word from a tweet.
library(tidytext)
library(stringr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stopwords)
load("covid19_tweets_df.RData")
covid19_tweets_tidy <- covid19_tweets_df %>%
select(created_at, text) %>%
filter(!duplicated(text)) %>%
mutate(date = floor_date(created_at, unit="day")) %>%
mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>%
mutate(text = str_replace_all(text, "&|<|>|"|RT", " ")) %>%
unnest_tweets(word, text) %>%
filter(!word %in% stopwords()) %>%
filter(str_detect(word, "[a-z]"))
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid19_tweets_tidy %>% count(word, sort=T)
## # A tibble: 1,393,892 x 2
## word n
## <chr> <int>
## 1 covid19 395194
## 2 #covid19 325356
## 3 #coronavirus 208448
## 4 people 90593
## 5 s 84213
## 6 can 81292
## 7 us 80525
## 8 cases 78857
## 9 now 75707
## 10 #covid2019 67658
## # ... with 1,393,882 more rows
With data in the tidy format, sentiment analysis can be done as an inner join with inner_join(). When a tidy data frame b is joined to a tidy data frame a using a %>% inner_join(b), the result contains all rows from a that have matching values in b, and all columns from both a and b.
text <- data_frame(word = c("holiday","makes","me","happy","but","this","song","is","sad"))
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
lexicon <- data_frame(word = c("happy","sad","holiday","funeral"),
sentiment = c("positive","negative","positive","negative"))
inner_join(text, lexicon)
## Joining, by = "word"
## # A tibble: 3 x 2
## word sentiment
## <chr> <chr>
## 1 holiday positive
## 2 happy positive
## 3 sad negative
Let’s look at the words with a negative sentiment from the Bing lexicon. What are the most common negative words in tweets on COVID-19? We can use count() from the dplyr package.
# Using the Bing lexicon, select only the words that are associated with a 'negative' sentiment
bing_negative <- lexicon_bing() %>%
filter(sentiment == "negative")
bing_negative
## # A tibble: 4,782 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 4,772 more rows
# We can count the usage frequency of 'negative' words in tweets on COVID-19
covid19_tweets_tidy %>%
inner_join(bing_negative) %>%
count(word, sort=T)
## Joining, by = "word"
## # A tibble: 3,887 x 2
## word n
## <chr> <int>
## 1 virus 39870
## 2 crisis 29285
## 3 outbreak 18410
## 4 death 18399
## 5 symptoms 12908
## 6 emergency 12519
## 7 risk 11825
## 8 died 10866
## 9 die 10840
## 10 infected 9879
## # ... with 3,877 more rows
# We can count the numbers of positive and negative words
covid19_tweets_tidy %>%
inner_join(lexicon_bing()) %>%
count(sentiment)
## Joining, by = "word"
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 807340
## 2 positive 767181
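As a possible extension beyond this walkthrough, we could also track sentiment over time. The sketch below is a suggestion (not part of the original code): it computes a daily net score, positive minus negative word counts, using the date column created earlier and pivot_wider() from tidyr (loaded with the tidyverse).
# Net sentiment (positive minus negative word counts) per day, using the Bing lexicon
covid19_tweets_tidy %>%
  inner_join(lexicon_bing(), by = "word") %>%
  count(date, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(net = positive - negative) %>%
  arrange(date)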
# Or we can count the frequency of 'fear' words in tweets on COVID-19
covid19_tweets_tidy %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment == "fear") %>%
count(word, sort=T)
## Joining, by = "word"
## # A tibble: 1,427 x 2
## word n
## <chr> <int>
## 1 pandemic 54287
## 2 fight 22013
## 3 government 21564
## 4 death 18399
## 5 medical 17978
## 6 hospital 16391
## 7 case 12735
## 8 emergency 12519
## 9 risk 11825
## 10 watch 11405
## # ... with 1,417 more rows
# We can also summarise different emotions
covid19_tweets_tidy %>%
inner_join(get_sentiments("nrc")) %>%
group_by(sentiment) %>%
summarise(freq = n()) %>%
arrange(desc(freq))
## Joining, by = "word"
## # A tibble: 10 x 2
## sentiment freq
## <chr> <int>
## 1 positive 1214572
## 2 negative 899418
## 3 trust 799147
## 4 fear 619911
## 5 anticipation 569580
## 6 sadness 448993
## 7 joy 383489
## 8 anger 333921
## 9 surprise 261166
## 10 disgust 229475
library(ggplot2)
# Bar chart
covid19_tweets_tidy %>%
inner_join(lexicon_nrc()) %>%
count(sentiment, sort=TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(x=sentiment, y=n)) +
labs(x="Emotion", y="Frequency", title="Bar Chart of Sentiment toward COVID-19") +
geom_bar(stat="identity", width=.5, fill="tomato3")
## Joining, by = "word"
# Pie chart
covid19_tweets_tidy %>%
inner_join(get_sentiments("nrc")) %>%
count(sentiment, sort=TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(x="", y=n, fill=factor(sentiment))) +
geom_bar(width=1, stat="identity") +
labs(fill="sentiment", x=NULL, y=NULL, title="Pie Chart of Sentiment toward COVID-19") +
coord_polar(theta="y", start=0) +
theme_void()
## Joining, by = "word"
covid19_tweets_tidy %>%
inner_join(get_sentiments("nrc")) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
top_n(20) %>%
ggplot(aes(reorder(word, n), n, fill=sentiment)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~sentiment, scales="free_y", ncol=5) +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
## Joining, by = "word"
## Selecting by n
covid19_tweets_tidy %>%
inner_join(get_sentiments("bing")) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
top_n(20) %>%
ggplot(aes(reorder(word, n), n, fill=sentiment)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~sentiment, scales="free_y") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
## Joining, by = "word"
## Selecting by n
library(wordcloud)
## Loading required package: RColorBrewer
# Positive words
covid19_tweets_tidy %>%
inner_join(get_sentiments("bing")) %>% # Joining with the Bing dataset
filter(!word %in% c("trump", "like","positive","virus")) %>% # Removing words irrelevant to sentiment in this context
group_by(sentiment) %>%
count(word, sort=T) %>%
filter(sentiment=="positive") %>%
with(wordcloud(words = word, # The with() function applies an expression to a dataset.
freq = n,
max.words = 100, # Maximum numbers of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.2, # Rate of words rotated in plot
scale = c(3, 0.3), # Range of words in size
colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the list of "Dark2"
## Joining, by = "word"
covid19_tweets_tidy %>%
inner_join(get_sentiments("bing")) %>%
filter(!word %in% c("trump", "like","positive","virus")) %>%
group_by(sentiment) %>%
count(word, sort=T) %>%
filter(sentiment=="negative") %>%
with(wordcloud(words = word, # The with() function applies an expression to a dataset.
freq = n,
max.words = 100, # Maximum numbers of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.2, # Rate of words rotated in plot
scale = c(3, 0.3), # Range of words in size
colors = brewer.pal(8, "Dark2"))) # Retrieve 8 colors from the list of "Dark2"
## Joining, by = "word"