We’ve reproduced the example code for sentiment analysis from chapter two of Text Mining With R. Using the sotu package (which contains State of the Union addresses through 2020), we break up the text and perform a similar sentiment analysis, with appropriate adjustments for shorter texts. At the end, we compare different word cloud packages.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Note for Running this Code Successfully
Getting this document to render correctly requires first running tidytext::get_sentiments("nrc") and tidytext::get_sentiments("afinn") in the console. You must accept the license/citation agreement for each lexicon before rendering (press 1 to agree). Otherwise, execution will halt.
# Run these lines of code in your console before rendering the rest of the document:
tidytext::get_sentiments("nrc")
# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
# A tibble: 122,204 × 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Pride & Prejudice 1 0 pride
2 Pride & Prejudice 1 0 and
3 Pride & Prejudice 1 0 prejudice
4 Pride & Prejudice 3 0 by
5 Pride & Prejudice 3 0 jane
6 Pride & Prejudice 3 0 austen
7 Pride & Prejudice 7 1 chapter
8 Pride & Prejudice 7 1 1
9 Pride & Prejudice 10 1 it
10 Pride & Prejudice 10 1 is
# ℹ 122,194 more rows
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
tidy_books %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
book chapter negativewords words ratio
<fct> <int> <int> <int> <dbl>
1 Sense & Sensibility 43 161 3405 0.0473
2 Pride & Prejudice 34 111 2104 0.0528
3 Mansfield Park 46 173 3685 0.0469
4 Emma 15 151 3340 0.0452
5 Northanger Abbey 21 149 2982 0.0500
6 Persuasion 4 62 1807 0.0343
Sentiment Analysis With the sotu Package
The sotu package contains state of the union addresses.
library(sotu)
data(sotu_text)

# make a tidy data frame
sotu_df <- tibble(
  president = sotu_meta$president,
  year = sotu_meta$year,
  text = sotu_text
)

# unnest
tidy_sotu <- sotu_df %>%
  unnest_tokens(word, text)
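As a rough illustration of what the unnest step does, here is a minimal sketch on a made-up two-row data frame. The names `toy_df` and `toy_tidy` and the sentences themselves are hypothetical stand-ins, not data from the sotu package. It assumes the tidytext package is installed.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for sotu_df: one row per speech
toy_df <- tibble(
  year = c(2009, 2010),
  text = c("The state of our Union is strong.", "We do not quit!")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_tidy <- toy_df %>%
  unnest_tokens(word, text)

toy_tidy
```

Each word keeps the `year` of the speech it came from, which is what makes the later per-year grouping possible.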
Choosing three sets of presidential speeches
There are hundreds of state of the union addresses. For this exercise, we’ll look at three presidents: Abraham Lincoln, Franklin D. Roosevelt, and Barack Obama.
Instead of grouping by chapters, we’ll group by year and plot each year’s net sentiment side by side, so we can see how it changed over a presidency.
obama_sentiment <- tidy_obama %>%
  inner_join(get_sentiments("bing"), by = join_by(word)) %>%
  count(year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(obama_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(obama_sentiment$year), max(obama_sentiment$year), by = 1)) +
  labs(title = "Obama's net sentiment over time")
The 2009 and 2010 speeches were not very positive, possibly due to the global economic crisis. After the 2010 midterms (in which the Democrats lost heavily), he became more positive, and then was more negative in the latter half of his second term.
For FDR:
FDR served more terms, and therefore delivered more State of the Union addresses, than any other president.
fdr_sentiment <- tidy_roosevelt %>%
  inner_join(get_sentiments("bing"), by = join_by(word)) %>%
  count(year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(fdr_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(fdr_sentiment$year), max(fdr_sentiment$year), by = 1)) +
  labs(title = "FDR's net sentiment over time")
There are net negatives in 1938 and 1942, and a big positivity spike in 1945. Unfortunately, the sentiment scores alone can’t tell us what was happening in those years.
Lincoln:
lincoln_sentiment <- tidy_lincoln %>%
  inner_join(get_sentiments("bing"), by = join_by(word)) %>%
  count(year, index = linenumber, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(lincoln_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(lincoln_sentiment$year), max(lincoln_sentiment$year), by = 1)) +
  labs(title = "Lincoln's net sentiment over time")
Lincoln’s net sentiment shows as generally very positive, with a small dip in 1863.
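The same net-sentiment calculation is repeated for each president above, so it could be wrapped in a small helper. The sketch below is one possible way to do that, run on toy data. `net_sentiment_by_year()`, `toy_lexicon`, and `toy_speeches` are hypothetical names, and the toy lexicon is not the real Bing list. For simplicity the helper counts by year only and drops the `index = linenumber` grouping used above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for the Bing lexicon (hypothetical, not the real word list)
toy_lexicon <- tribble(
  ~word,    ~sentiment,
  "great",  "positive",
  "peace",  "positive",
  "crisis", "negative",
  "debt",   "negative"
)

# Toy stand-in for a tidy one-word-per-row speech table
toy_speeches <- tribble(
  ~year, ~word,
  2009,  "crisis",
  2009,  "debt",
  2009,  "great",
  2010,  "peace",
  2010,  "great",
  2010,  "the"
)

# Net sentiment per year: positive word count minus negative word count
net_sentiment_by_year <- function(tidy_words, lexicon) {
  tidy_words %>%
    inner_join(lexicon, by = join_by(word)) %>%
    count(year, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(sentiment = positive - negative)
}

toy_net <- net_sentiment_by_year(toy_speeches, toy_lexicon)
toy_net
```

In the toy data, 2009 has two negative words and one positive word (net -1), while 2010 has two positive words and none negative (net 2). Words missing from the lexicon, such as "the", are dropped by the inner join.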
Joy Words
What “joy” words did Obama use most?
tidy_obama %>%
  inner_join(nrc_joy, by = join_by(word)) %>%
  count(word, sort = TRUE)
# A tibble: 170 × 2
word n
<chr> <int>
1 good 54
2 laughter 48
3 pay 41
4 clean 37
5 money 37
6 create 35
7 vote 34
8 finally 32
9 progress 29
10 save 29
# ℹ 160 more rows
Apparently “pay,” “clean,” and “vote” are joy words. The other words seem like they belong, but 48 instances of the word “laughter” is notable.
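The construction of `nrc_joy` is not shown here. Presumably it follows the book and keeps only the NRC rows tagged "joy". The sketch below shows that idea on a toy lexicon. `toy_nrc`, `toy_joy`, and `toy_words` are hypothetical, and the word-emotion pairs are made up for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the NRC lexicon: one row per word-emotion pair,
# so a single word can appear under several emotions
toy_nrc <- tribble(
  ~word,     ~sentiment,
  "good",    "joy",
  "good",    "positive",
  "pay",     "joy",
  "pay",     "anticipation",
  "abandon", "fear"
)

# Keep only the rows tagged "joy"
toy_joy <- toy_nrc %>%
  filter(sentiment == "joy")

# Toy stand-in for a tidy one-word-per-row speech
toy_words <- tibble(word = c("good", "pay", "good", "abandon", "the"))

# Count how often each joy word occurs
joy_counts <- toy_words %>%
  inner_join(toy_joy, by = join_by(word)) %>%
  count(word, sort = TRUE)

joy_counts
```

Because a word is counted whenever the lexicon tags it as "joy", regardless of how it is used in the speech, context-dependent words can end up high on the list.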
Creating a Comparison for FDR’s Speeches
# FDR and AFINN
afinn_fdr <- tidy_roosevelt %>%
  inner_join(get_sentiments("afinn"), by = join_by(word)) %>%
  group_by(year) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

bing_and_nrc_fdr <- bind_rows(
  tidy_roosevelt %>%
    inner_join(get_sentiments("bing"), by = join_by(word), relationship = "many-to-many") %>%
    mutate(method = "Bing et al."),
  tidy_roosevelt %>%
    # filter the NRC lexicon to its positive/negative rows before joining
    inner_join(
      get_sentiments("nrc") %>%
        filter(sentiment %in% c("positive", "negative")),
      by = join_by(word),
      relationship = "many-to-many"
    ) %>%
    mutate(method = "NRC")
) %>%
  count(method, year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

fdr_compiled <- bind_rows(afinn_fdr, bing_and_nrc_fdr)

fdr_compiled |>
  ggplot(aes(year, sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_x_continuous(breaks = seq(min(fdr_compiled$year), max(fdr_compiled$year), by = 1)) +
  labs(title = "Comparing Afinn, Bing, and NRC using FDR's SOTUs")
Interestingly, between Bing and AFINN, the net positive years look somewhat similar. However, the negative years look a lot more negative with AFINN. NRC registers every speech as net positive.
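One reason the scales can differ is that AFINN sums a numeric score per word, while Bing simply counts positive words minus negative words. The sketch below illustrates only that mechanism. The scores, labels, and words in `toy_afinn`, `toy_bing`, and `toy_words` are hypothetical and are not taken from the real lexicons or from FDR's speeches.

```r
library(dplyr)
library(tibble)

# Hypothetical AFINN-style lexicon: each word carries a numeric weight
toy_afinn <- tribble(
  ~word,      ~value,
  "good",      1,
  "war",      -3,
  "disaster", -4
)

# Hypothetical Bing-style lexicon: each word is only positive or negative
toy_bing <- tribble(
  ~word,      ~sentiment,
  "good",     "positive",
  "war",      "negative",
  "disaster", "negative"
)

toy_words <- tibble(year = 1942, word = c("good", "good", "war", "disaster"))

# AFINN-style score: sum of the word weights
afinn_score <- toy_words %>%
  inner_join(toy_afinn, by = join_by(word)) %>%
  summarise(net = sum(value), .by = year)

# Bing-style score: positive count minus negative count
bing_score <- toy_words %>%
  inner_join(toy_bing, by = join_by(word)) %>%
  summarise(net = sum(sentiment == "positive") - sum(sentiment == "negative"), .by = year)

afinn_score
bing_score
```

With these toy values the Bing-style score is 0 (two positive, two negative), while the AFINN-style score is -5, because the negative words carry heavier weights. Whether this is what drives the difference in the FDR plots would need to be checked against the actual word counts.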
Most Common Positive and Negative Words
Bing (we’ll do it for all three sets of speeches here):
# A tibble: 617 × 3
word sentiment n
<chr> <chr> <int>
1 great positive 61
2 well positive 29
3 free positive 27
4 debt negative 23
5 good positive 19
6 proper positive 18
7 slave negative 17
8 important positive 16
9 peace positive 15
10 best positive 14
# ℹ 607 more rows
Graphing Positive and Negative Word Counts
# FDR
bing_word_counts_fdr %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (FDR)", y = NULL)

# Obama
bing_word_counts_obama %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (Obama)", y = NULL)

# Lincoln
bing_word_counts_lincoln %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (Lincoln)", y = NULL)
“Great” shows up a lot in FDR’s positive words. Was he talking about the Great Depression or the Great War?
A quick search revealed that people did call it the “Great Depression” while it was happening, but FDR did not use the term in his speeches. In 1939, he used the term “great unemployment of capital” to refer to the national debt, which was apparently not bad.
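A question like "what was 'great' describing?" can also be checked in code by pulling out the word that follows each occurrence. The sketch below uses base R regular expressions on made-up sentences. `toy_text` is hypothetical and does not quote the speeches.

```r
# Hypothetical lowercase sentences, not quotations from the speeches
toy_text <- c(
  "we face a great task and a great opportunity",
  "the great majority of our people agree"
)

# Find every occurrence of "great" together with the word that follows it
matches <- regmatches(
  toy_text,
  gregexpr("\\bgreat\\s+\\w+", toy_text, perl = TRUE)
)

# Strip the leading "great " to keep only the following word
following <- sub("^great\\s+", "", unlist(matches), perl = TRUE)

# Tally the words that follow "great"
sort(table(following), decreasing = TRUE)
```

Run against the raw speech text, a tally like this would show at a glance whether "great" is mostly followed by "depression", "war", or something else entirely.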
Presidential Word Clouds
# specifying the wordcloud package for these
tidy_obama %>%
  anti_join(stop_words, by = join_by(word)) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))

tidy_roosevelt %>%
  anti_join(stop_words, by = join_by(word)) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))

tidy_lincoln %>%
  anti_join(stop_words, by = join_by(word)) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))
[1] "vice president, members of congress, the first lady of the united states--she's around here somewhere: i have come here tonight not only to address the distinguished men and women in this great chamber, but to speak frankly and directly to the men and women who sent us here."
This is a sentence fragment: the text was split into sentences at periods, so the first “sentence” ends at the abbreviation “mr.”
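The effect of an abbreviation on period-based sentence splitting can be reproduced with a deliberately naive splitter. This is a sketch of the general problem, not of the tokenizer used above, which may apply different rules. `toy_speech` is a hypothetical string.

```r
# Hypothetical lowercase text containing the abbreviation "mr."
toy_speech <- "thank you, mr. speaker. we meet tonight."

# Naive rule: a sentence ends at a period followed by whitespace
naive_sentences <- strsplit(toy_speech, "(?<=\\.)\\s+", perl = TRUE)[[1]]
naive_sentences

# One possible workaround: drop the period from "mr." before splitting
protected <- gsub("\\bmr\\.", "mr", toy_speech, perl = TRUE)
fixed_sentences <- strsplit(protected, "(?<=\\.)\\s+", perl = TRUE)[[1]]
fixed_sentences
```

The naive split yields three pieces, the first ending at "mr.", while the protected version yields the two intended sentences. A fuller fix would need to handle other abbreviations as well.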
# the most negative year
tidy_obama %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  summarize(negativewords = n()) %>%
  left_join(obama_speech_length, by = "year") %>%
  mutate(ratio = negativewords / word_count) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
# A tibble: 1 × 4
year negativewords word_count ratio
<int> <int> <int> <dbl>
1 2010 222 7263 0.0306
For FDR:
tidy_roosevelt %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  summarize(negativewords = n()) %>%
  left_join(fdr_speech_length, by = "year") %>%
  mutate(ratio = negativewords / word_count) %>%
  ungroup()

# the most negative year
tidy_roosevelt %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  summarize(negativewords = n()) %>%
  left_join(fdr_speech_length, by = "year") %>%
  mutate(ratio = negativewords / word_count) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
# A tibble: 1 × 4
year negativewords word_count ratio
<int> <int> <int> <dbl>
1 1938 159 4716 0.0337
tidy_lincoln %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  summarize(negativewords = n()) %>%
  left_join(lincoln_speech_length, by = "year") %>%
  mutate(ratio = negativewords / word_count) %>%
  ungroup()

# the most negative year
tidy_lincoln %>%
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  summarize(negativewords = n()) %>%
  left_join(lincoln_speech_length, by = "year") %>%
  mutate(ratio = negativewords / word_count) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
# A tibble: 1 × 4
year negativewords word_count ratio
<int> <int> <int> <dbl>
1 1861 169 6998 0.0241
Of the three, Lincoln was the least negative president. (Read: Lincoln’s most negative speech had the lowest ratio of negative words, per the Bing lexicon.)
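The ratio used in this section is the number of negative words in a year divided by that year's total word count. Here is a minimal sketch of the calculation on toy data. `toy_negative`, `toy_tidy`, and `toy_length` are hypothetical stand-ins for `bingnegative`, the tidy speech tables, and the `*_speech_length` tables, whose construction is not shown above.

```r
library(dplyr)
library(tibble)

# Hypothetical negative-word list (stand-in for bingnegative)
toy_negative <- tibble(word = c("crisis", "debt", "fear"))

# Toy tidy speech table: one word per row
toy_tidy <- tribble(
  ~year, ~word,
  1861,  "crisis",
  1861,  "union",
  1861,  "peace",
  1861,  "fear",
  1862,  "debt",
  1862,  "union",
  1862,  "great",
  1862,  "people",
  1862,  "nation"
)

# Total words per year (assumed to be what the *_speech_length tables hold)
toy_length <- toy_tidy %>%
  count(year, name = "word_count")

# Negative words per year, divided by that year's total word count
toy_ratio <- toy_tidy %>%
  semi_join(toy_negative, by = join_by(word)) %>%
  count(year, name = "negativewords") %>%
  left_join(toy_length, by = "year") %>%
  mutate(ratio = negativewords / word_count)

# The year with the highest share of negative words
most_negative <- toy_ratio %>%
  slice_max(ratio, n = 1)

most_negative
```

In the toy data, 1861 has 2 negative words out of 4 (0.5) and 1862 has 1 out of 5 (0.2), so 1861 is returned. Dividing by speech length matters because a longer speech would otherwise look more negative simply by containing more words.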
Finding a Better Word Cloud
Let’s try wordcloud2:
library(wordcloud2)
Warning: package 'wordcloud2' was built under R version 4.5.3
The ggwordcloud package adds a word cloud geom to ggplot2. You can add color and tilt the words. If you don’t limit the number of words displayed, they’ll just stack on top of each other. It also looks kind of sparse.
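Limiting the number of words before plotting is one way to avoid the stacking problem. The sketch below assumes the ggwordcloud package is installed and uses its `geom_text_wordcloud()` geom. `toy_counts` and its counts are made up, and the cutoff of three words is arbitrary.

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(ggwordcloud)

# Hypothetical word counts (stand-in for a counted tidy speech table)
toy_counts <- tribble(
  ~word,     ~n,
  "america", 40,
  "jobs",    30,
  "people",  25,
  "energy",  10,
  "tax",      5
)

# Keep only the most frequent words so the cloud does not overcrowd
top_words <- toy_counts %>%
  slice_max(n, n = 3)

# Word size and color both scale with frequency
p <- ggplot(top_words, aes(label = word, size = n, color = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 12) +
  theme_minimal()

p
```

For real speeches, a larger cutoff (for example, the top 100 words after removing stop words) would be more informative than three.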
Overall, wordcloud2 is the most aesthetically pleasing and T-shirt ready, but it can be overwhelming if you’re just trying to understand the tone. PubMed seems like a more organized and straightforward way to get an idea of the quantitative data you’re working with. ggwordcloud looked a lot better in the cited demo, but it had problems rendering here, as did modelwordcloud.
It’s also interesting to see that the tone and language of speeches from another time represented the zeitgeist.
Sources
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.
NRC Emotion Lexicon. http://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip
Geissman, Martin. (2022). “Comparison of Word Cloud R Packages.” https://rpubs.com/mgei/1259234