Today’s goal is to analyze several sentiment lexicons across a few examples. We hope to learn more about the nuances of each one and to see whether the results differ across corpora.
The sentiment analyses are divided into the example code from chapter 2 of the textbook “Text Mining with R” and a second analysis that mirrors and extends that process with another text and an additional sentiment lexicon. Toggle between the sections using the respective tabs.
The base code from the textbook begins by showing what each tidytext lexicon provides. Generally, each contains a word paired with either a sentiment label or a numeric sentiment value.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.3
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.3
get_sentiments('afinn')
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments('bing')
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments('nrc')
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
Get the Jane Austen book texts and tidy them into a corpus ready for sentiment analysis. This first analysis looks at the book “Emma”.
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.4.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
Good, friend, and hope are the most common joy words found.
Compare the flow of sentiment through each novel with the Bing lexicon, counting positive and negative words in chunks of 80 lines.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
These sentiments can now be plotted.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
The plots trace the sentiment trajectory across each novel’s story.
Take the words from only “Pride & Prejudice” and obtain the net sentiments based on afinn, bing, and nrc.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Plot the sentiments according to each sentiment lexicon.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The sentiment trajectories are broadly similar across the lexicons, but each shows its own nuances. We can study the lexicons individually to understand why the differences arise.
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Bing skews more negative (4,781 negative vs. 2,005 positive entries, compared with NRC’s 3,316 vs. 2,308), which usually leads to lower net sentiment scores.
Build a few more bar plots to see which words contribute the most to each sentiment.
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ℹ 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
Miss is by far the most common negative sentiment word. The positive words are a bit more diverse, led by well, good, and great. Realistically, miss here likely refers to young, unmarried women rather than anything negative, so we can use custom stop words to prevent it from skewing our results.
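Since custom_stop_words exists for exactly this purpose, here is a minimal sketch of applying it before recounting; note that the full list also removes frequent positive words such as well, which is worth remembering when comparing against the unfiltered counts.
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%  # drops "miss" along with the standard stop words
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)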
Build word clouds to see the most common words in another way.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.3
## Loading required package: RColorBrewer
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Beyond single words, the text can also be tokenized into larger units such as sentences or chapters.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
From the word clouds, a number of words look common, such as lady, time, emma, and dear. Among the sentiment words, miss is notably frequent, but we already addressed how its coding is likely wrong; good and well appear to be the next most common, while poor stands out on the negative side. The final table identifies the most negative chapter of each novel by the ratio of negative words to total words.
Silge, Julia, and David Robinson. “2 Sentiment Analysis with Tidy Data.” Text Mining with R, www.tidytextmining.com/sentiment.html.
The corpus chosen is the text of “A Tale of Two Cities” by Charles Dickens. Using the gutenbergr package, download the text with gutenberg_id = 98. Modify the corpus in a similar way to the Jane Austen example; since only one book is used here, there is no need for a book column this time.
remotes::install_github('quanteda/quanteda.sentiment')
## Using GitHub PAT from the git credential store.
## Skipping install of 'quanteda.sentiment' from a github remote, the SHA1 (934c1e1f) has not changed since last install.
## Use `force = TRUE` to force installation
library(quanteda.sentiment)
## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.4.3
## Package version: 4.2.0
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda.sentiment'
## The following object is masked from 'package:quanteda':
##
## data_dictionary_LSD2015
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.4.3
# The gutenberg_id value 98 was found on the Project Gutenberg site.
dickens_books <- gutenberg_download(98)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_dickens_text <- dickens_books %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Once again, we apply a technique from the earlier example: an inner join against the NRC joy words to see which of them appear in the text.
tidy_dickens_text %>% inner_join(nrc_joy) %>% count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 333 × 2
## word n
## <chr> <int>
## 1 good 217
## 2 child 89
## 3 hope 84
## 4 friend 76
## 5 daughter 62
## 6 found 61
## 7 love 56
## 8 saint 56
## 9 mother 43
## 10 true 42
## # ℹ 323 more rows
The most frequent NRC joy word is good with 217 occurrences, followed in order by child, hope, friend, and daughter. Interestingly, good, friend, and hope are also among the most common in “Emma”.
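To make that overlap concrete, a hedged sketch joining the two joy-word counts (the column names n_emma and n_dickens are my own):
# Count NRC joy words in each text, then join on the shared words.
emma_joy <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, name = "n_emma")
dickens_joy <- tidy_dickens_text %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, name = "n_dickens")
inner_join(emma_joy, dickens_joy, by = "word") %>%
  arrange(desc(n_emma + n_dickens))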
We will use the bing sentiments to compare the positive and negative sentiments throughout the text.
dickens_sentiment <- tidy_dickens_text %>%
inner_join(get_sentiments("bing")) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
Repeat the process with the HuLiu sentiment lexicon found in the quanteda.sentiment package.
# Label each dictionary entry with its sentiment. This assumes
# data_dictionary_HuLiu lists its positive entries before its negative
# ones, matching the order that unlist() produces below.
quant_sent_hu <- rep('positive', each = length(data_dictionary_HuLiu$positive))
quant_sent_hu <- append(quant_sent_hu, rep('negative', each = length(data_dictionary_HuLiu$negative)))
# Flatten the dictionary words and pair them with their labels.
quant_df <- data.frame(word = unlist(data_dictionary_HuLiu, use.names = FALSE), sentiment = quant_sent_hu)
dickens_quanteda_sentiment <- tidy_dickens_text %>%
inner_join(quant_df) %>%
count(index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
The HuLiu sentiment lexicon needed some tidying before it could be used with the same functions: the words were collected into one column and paired with their sentiment labels in a data frame.
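An equivalent, arguably tidier construction is sketched below, assuming as.list() flattens the quanteda dictionary into a named list of character vectors:
quant_df_alt <- as.list(data_dictionary_HuLiu) %>%
  tibble::enframe(name = "sentiment", value = "word") %>%  # one row per sentiment, words in a list-column
  tidyr::unnest(word)                                      # expand to one row per word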
Next, check the sentiment plots to see how they compare.
ggplot(dickens_sentiment, aes(index, sentiment)) +
geom_col(show.legend = FALSE)
ggplot(dickens_quanteda_sentiment, aes(index, sentiment)) +
geom_col(show.legend = FALSE)
Oddly, the sentiment plots for Bing and HuLiu are identical. According to this article, https://medium.com/@laurenflynn1211/comparing-sentiment-analysis-dictionaries-in-r-c695fca64326, this surprising result is to be expected: the two are nearly identical sentiment lexicons.
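Rather than eyeballing the plots, a quick programmatic check; since both data frames were built over the same 80-line index, the per-chunk scores should line up directly:
all.equal(dickens_sentiment$sentiment,
          dickens_quanteda_sentiment$sentiment)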
Compare all four lexicons.
afinn_dickens <- tidy_dickens_text %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc_dickens <- bind_rows(
tidy_dickens_text %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_dickens_text %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 11 of `x` matches multiple rows in `y`.
## ℹ Row 2003 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
dickens_quanteda_sentiment <- dickens_quanteda_sentiment %>%
mutate(method = "HuLiu")
bind_rows(afinn_dickens,
bing_and_nrc_dickens,
dickens_quanteda_sentiment) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
NRC again skews much more positive than the other lexicons, just as in the earlier story. We established the reason for this in the textbook example: the balance of positive and negative entries differs, with Bing and HuLiu both skewing more negative.
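A small sketch quantifying that skew by computing the share of negative entries in each lexicon:
bind_rows(
  get_sentiments("bing") %>% count(sentiment) %>% mutate(lexicon = "Bing"),
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(sentiment) %>% mutate(lexicon = "NRC"),
  quant_df %>% count(sentiment) %>% mutate(lexicon = "HuLiu")
) %>%
  group_by(lexicon) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()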
bing_word_counts <- tidy_dickens_text %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
bing_word_counts
## # A tibble: 1,870 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 233
## 2 good positive 217
## 3 like positive 214
## 4 well positive 179
## 5 great positive 161
## 6 prisoner negative 115
## 7 better positive 90
## 8 dark negative 89
## 9 work positive 88
## 10 poor negative 87
## # ℹ 1,860 more rows
huliu_word_counts <- tidy_dickens_text %>%
inner_join(quant_df) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
huliu_word_counts
## # A tibble: 1,870 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 233
## 2 good positive 217
## 3 like positive 214
## 4 well positive 179
## 5 great positive 161
## 6 prisoner negative 115
## 7 better positive 90
## 8 dark negative 89
## 9 work positive 88
## 10 poor negative 87
## # ℹ 1,860 more rows
Once again, miss is very common and likely not meant negatively in Dickens’s work. Additionally, the HuLiu results are again identical to Bing’s.
Plot the word contributions for each lexicon.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
huliu_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Since there is still no difference between the two lexicons, the more interesting observations concern the words themselves: miss is again the most common negative word, though not as dominant as in Jane Austen’s works; prisoner appears fairly often; and good, rather than well, is the most common positive word.
Let’s check the word cloud as before.
tidy_dickens_text %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
## Warning in wordcloud(word, n, max.words = 100): defarge could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): cruncher could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): time could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): doctor could not be fit on
## page. It will not be plotted.
There is a different assortment of words than in the example. Miss is still very common, for the same reasons. Otherwise, madame, day, night, doctor, and time stand out. Words unique to this text, such as carton and darnay, are understandably frequent, as they name main characters of “A Tale of Two Cities”.
Let’s redo the example word cloud while tagging positive and negative words.
set.seed(123)
tidy_dickens_text %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
set.seed(123)
tidy_dickens_text %>%
inner_join(quant_df) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
The two word clouds were generated with the same seed precisely to demonstrate that they give identical results. Word-cloud layout has a degree of randomness, which can make identical data look different across runs.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_dickens_text %>%
group_by(chapter) %>%
summarize(words = n())
tidy_dickens_text %>%
semi_join(bingnegative) %>%
group_by(chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## # A tibble: 1 × 4
## chapter negativewords words ratio
## <int> <int> <int> <dbl>
## 1 44 231 4665 0.0495
huliunegative <- quant_df %>%
filter(sentiment == "negative")
wordcounts <- tidy_dickens_text %>%
group_by(chapter) %>%
summarize(words = n())
tidy_dickens_text %>%
semi_join(huliunegative) %>%
group_by(chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## # A tibble: 1 × 4
## chapter negativewords words ratio
## <int> <int> <int> <dbl>
## 1 44 231 4665 0.0495
Chapter 44 appears to be the saddest in “A Tale of Two Cities” according to both lexicons.
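The two pipelines above differ only in the lexicon used, so they could be collapsed into a helper; a sketch (saddest_chapter is my own name, not from the textbook):
# Find the chapter with the highest ratio of negative words to total words.
saddest_chapter <- function(words_df, negative_lexicon) {
  words_df %>%
    semi_join(negative_lexicon, by = "word") %>%
    count(chapter, name = "negativewords") %>%
    left_join(count(words_df, chapter, name = "words"), by = "chapter") %>%
    mutate(ratio = negativewords / words) %>%
    filter(chapter != 0) %>%
    slice_max(ratio, n = 1)
}
saddest_chapter(tidy_dickens_text, bingnegative)
saddest_chapter(tidy_dickens_text, huliunegative)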
Let’s run one final check to see how similar the two lexicons really are.
length(quant_df$sentiment)
## [1] 6789
length(get_sentiments('bing')$word)
## [1] 6786
HuLiu contains 6,789 entries and Bing 6,786. There apparently are differences, but very minute ones.
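To see exactly which entries differ, setdiff() pinpoints the words unique to each list:
setdiff(quant_df$word, get_sentiments("bing")$word)   # in HuLiu but not Bing
setdiff(get_sentiments("bing")$word, quant_df$word)   # in Bing but not HuLiu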
When picking a new corpus and an extra sentiment lexicon, I expected to see more differences from the textbook example than I found. Selecting another prominent author from a little later than Jane Austen did not change the prominence of the word miss, and I did not realize how similar HuLiu was to Bing until performing my analysis.
Sentiment analysis is a powerful tool for attaching feelings to words in order to analyze texts, but its results are heavily shaped by the word lists chosen to measure each text. I believe that properly matching sentiment lexicons to the right era and type of work would contribute to better analyses.