We’ve reproduced the sentiment analysis example code from Chapter 2 of Text Mining with R (Silge & Robinson, 2017). Using the sotu package, which contains State of the Union speeches through 2020, we then tokenize the text and perform a similar sentiment analysis.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janeaustenr)
library(tidytext)
library(textdata)
library(wordcloud)
Loading required package: RColorBrewer
library(tidyr)
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 122,204 × 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Pride & Prejudice 1 0 pride
2 Pride & Prejudice 1 0 and
3 Pride & Prejudice 1 0 prejudice
4 Pride & Prejudice 3 0 by
5 Pride & Prejudice 3 0 jane
6 Pride & Prejudice 3 0 austen
7 Pride & Prejudice 7 1 chapter
8 Pride & Prejudice 7 1 1
9 Pride & Prejudice 10 1 it
10 Pride & Prejudice 10 1 is
# ℹ 122,194 more rows
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
bing_word_counts
# A tibble: 2,585 × 3
word sentiment n
<chr> <chr> <int>
1 miss negative 1855
2 well positive 1523
3 good positive 1380
4 great positive 981
5 like positive 725
6 better positive 639
7 enough positive 613
8 happy positive 534
9 love positive 495
10 pleasure positive 462
# ℹ 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`
I skipped the part where the authors add the word “miss” to the stop words.
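For reference, the skipped step could look roughly like the sketch below. It follows the pattern used in the book, but `toy_stop_words` and `toy_tokens` are small made-up stand-ins (assumptions for illustration) so that the example runs without tidytext or the Austen texts. With the real data you would bind the extra row onto tidytext's `stop_words` instead.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature stand-in for tidytext's stop_words,
# which has the same two columns: word and lexicon
toy_stop_words <- tibble(
  word = c("the", "and", "of"),
  lexicon = "toy"
)

# Add "miss" as a custom stop word by binding one extra row on top
custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),
  toy_stop_words
)

# Hypothetical tokens, one word per row
toy_tokens <- tibble(
  word = c("miss", "bennet", "and", "the", "pleasure", "miss")
)

# anti_join() drops every token that appears in the stop word list
toy_counts <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_counts
```

Because “miss” is scored as negative by Bing but is mostly a title in Austen's novels, removing it this way keeps it from dominating the negative word counts.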
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
Joining with `by = join_by(word)`
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
book chapter negativewords words ratio
<fct> <int> <int> <int> <dbl>
1 Sense & Sensibility 43 161 3405 0.0473
2 Pride & Prejudice 34 111 2104 0.0528
3 Mansfield Park 46 173 3685 0.0469
4 Emma 15 151 3340 0.0452
5 Northanger Abbey 21 149 2982 0.0500
6 Persuasion 4 62 1807 0.0343
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media.
Sentiment Analysis With the sotu Package
The sotu package contains the text of State of the Union addresses, along with metadata such as the president and year.
library(sotu)
data(sotu_text)

# make a tidy data frame
sotu_df <- tibble(
  president = sotu_meta$president,
  year = sotu_meta$year,
  text = sotu_text
)

# unnest
tidy_sotu <- sotu_df %>%
  unnest_tokens(word, text)
Choosing three sets of presidential speeches
There are hundreds of state of the union addresses. For this exercise, we’ll look at three presidents: Abraham Lincoln, Franklin D. Roosevelt, and Barack Obama.
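The code that builds the per-president sentiment tables is not shown in this post, so here is a minimal sketch of how a table like `obama_sentiment` could be built, following the book's net-sentiment pattern (positive count minus negative count). `toy_tokens` and `toy_lexicon` are made-up stand-ins for `tidy_sotu` and `get_sentiments("bing")`. They are assumptions for illustration, not the real data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for tidy_sotu: one row per word
toy_tokens <- tibble(
  president = c("Barack Obama", "Barack Obama", "Barack Obama",
                "Barack Obama", "Abraham Lincoln"),
  year = c(2009, 2009, 2009, 2010, 1861),
  word = c("crisis", "hope", "debt", "strong", "peace")
)

# Hypothetical mini-lexicon shaped like get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("crisis", "hope", "debt", "strong", "peace"),
  sentiment = c("negative", "positive", "negative", "positive", "positive")
)

toy_obama_sentiment <- toy_tokens %>%
  filter(president == "Barack Obama") %>%        # keep one president
  inner_join(toy_lexicon, by = "word") %>%       # keep only scored words
  count(year, sentiment) %>%                     # words per year and polarity
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%               # a year with no negatives gets 0
  mutate(sentiment = positive - negative)        # net sentiment per year

toy_obama_sentiment
```

The same pipeline with a different `filter()` value would give the tables for the other two presidents.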
ggplot(obama_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(obama_sentiment$year), max(obama_sentiment$year), by = 1)) +
  labs(title = "Obama's net sentiment over time")
The 2009 and 2010 speeches were not very positive, possibly due to the global economic crisis. After the midterms (which the Democrats lost badly), he became more positive, and then turned more negative in the latter half of his second term.
For FDR:
FDR served more terms, and therefore delivered more State of the Union addresses, than any other president.
ggplot(fdr_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(fdr_sentiment$year), max(fdr_sentiment$year), by = 1)) +
  labs(title = "FDR's net sentiment over time")
There are net negatives in 1938 and 1942, and a big positivity spike in 1945. Unfortunately, the sentiment scores alone don’t tell us what was happening in those years.
ggplot(lincoln_sentiment, aes(x = year, y = sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(min(lincoln_sentiment$year), max(lincoln_sentiment$year), by = 1)) +
  labs(title = "Lincoln's net sentiment over time")
Lincoln’s net sentiment is generally very positive, with a small dip in 1863.
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1063 of `x` matches multiple rows in `y`.
ℹ Row 4872 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
fdr_compiled <- bind_rows(afinn_fdr, bing_and_nrc_fdr)

fdr_compiled |>
  ggplot(aes(year, sentiment, fill = year)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_x_continuous(breaks = seq(min(fdr_compiled$year), max(fdr_compiled$year), by = 1)) +
  labs(title = "Comparing Afinn, Bing, and NRC using FDR's SOTUs")
Interestingly, between Bing and AFINN, the net positive years look somewhat similar. However, the negative years look a lot more negative with AFINN. NRC registers everything as positive. There’s a lot of variation in the way 1944 is scored.
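Part of the difference in scale comes from how the lexicons score words. AFINN assigns each word a numeric value that gets summed, while Bing and NRC only label words as positive or negative, so their net score is a count difference. The sketch below illustrates this with made-up data. The tokens, the lexicon entries, and the numeric values are all invented for illustration and are not taken from the real lexicons.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens, one word per row
toy_tokens <- tibble(
  year = c(1942, 1942, 1942, 1945, 1945),
  word = c("war", "enemy", "victory", "peace", "victory")
)

# AFINN-style lexicon: a numeric score per word (values invented here)
toy_afinn <- tibble(
  word = c("war", "enemy", "victory", "peace"),
  value = c(-3, -2, 3, 2)
)

# Bing-style lexicon: a binary label per word (labels invented here)
toy_bing <- tibble(
  word = c("war", "enemy", "victory", "peace"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# AFINN-style net sentiment: sum of the numeric values per year
afinn_style <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(year) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  mutate(method = "AFINN-style")

# Bing-style net sentiment: positive word count minus negative word count
bing_style <- toy_tokens %>%
  inner_join(toy_bing, by = "word") %>%
  group_by(year) %>%
  summarise(
    sentiment = sum(sentiment == "positive") - sum(sentiment == "negative"),
    .groups = "drop"
  ) %>%
  mutate(method = "Bing-style")

toy_compiled <- bind_rows(afinn_style, bing_style)
toy_compiled
```

In this toy example both methods agree on the sign for each year, but the AFINN-style totals are larger in magnitude because strongly weighted words count for more than one.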
Most Common Positive and Negative Words
Bing (we’ll do it for all three sets of speeches here):
# A tibble: 617 × 3
word sentiment n
<chr> <chr> <int>
1 great positive 61
2 well positive 29
3 free positive 27
4 debt negative 23
5 good positive 19
6 proper positive 18
7 slave negative 17
8 important positive 16
9 peace positive 15
10 best positive 14
# ℹ 607 more rows
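The code behind this table is not shown, but a word-count table of this shape can be produced along the lines of the sketch below, which mirrors how `bing_word_counts` is built in the book. `toy_speech_tokens` and `toy_lexicon` are made-up stand-ins (assumptions for illustration) for one president's tokens and for `get_sentiments("bing")`.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens from one president's speeches, one word per row
toy_speech_tokens <- tibble(
  word = c("great", "debt", "great", "peace", "slave", "great", "debt", "union")
)

# Hypothetical mini-lexicon shaped like get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("great", "debt", "peace", "slave"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Keep only scored words, then count each word/sentiment pair,
# most frequent first
toy_word_counts <- toy_speech_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)

toy_word_counts
```

Words that are not in the lexicon (here, “union”) drop out at the `inner_join()` step, which is why the table has fewer rows than there are distinct words in the speeches.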
Graphing Positive and Negative Word Counts
# FDR
bing_word_counts_fdr %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (FDR)", y = NULL)

# Obama
bing_word_counts_obama %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (Obama)", y = NULL)

# Lincoln
bing_word_counts_lincoln %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment (Lincoln)", y = NULL)
“Great” shows up a lot in FDR’s positive words. Was he talking about the Great Depression or the Great War?
A quick search revealed that people did call it the “Great Depression” while it was happening, but FDR did not use the term in his speeches. In 1939, he used the term “great unemployment of capital” to refer to the national debt, which was apparently not bad.
Presidential Word Clouds
# specifying the wordcloud package for these
tidy_obama %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`
tidy_roosevelt %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`
tidy_lincoln %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))
[1] "vice president, members of congress, the first lady of the united states--she's around here somewhere: i have come here tonight not only to address the distinguished men and women in this great chamber, but to speak frankly and directly to the men and women who sent us here."
This is a sentence fragment because the sentence tokenizer treated the period in “mr.” as the end of the first sentence.
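To see why an abbreviation like “mr.” causes this, here is a small base R sketch. It is not the tokenizer used above. It applies a deliberately naive rule (a sentence ends at a period, exclamation mark, or question mark followed by whitespace) to a shortened, made-up stand-in for the opening line, then shows one possible workaround that glues a piece back on when the previous piece ends in a known abbreviation. The `abbreviations` list is an assumption for illustration.

```r
# Shortened, lowercased stand-in for the opening of a speech (not the real text)
toy_text <- "madam speaker, mr. vice president, members of congress: i have come here tonight."

# Naive rule: split wherever ., !, or ? is followed by whitespace
naive_sentences <- strsplit(toy_text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# The second piece is the fragment that starts with "vice president"
naive_sentences

# Possible workaround: merge a piece into the previous one when the
# previous piece ends in a known abbreviation (hypothetical list)
abbreviations <- c("mr.", "mrs.", "dr.")
merged <- character(0)
for (piece in naive_sentences) {
  last <- length(merged)
  if (last > 0 && any(endsWith(merged[last], abbreviations))) {
    merged[last] <- paste(merged[last], piece)
  } else {
    merged <- c(merged, piece)
  }
}

merged
```

A fixed abbreviation list is brittle, so this is only meant to show the mechanism behind the fragment, not to replace a proper sentence tokenizer.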
The ggwordcloud package adds a word cloud geom to ggplot2. You can add color and tilt the words. If you don’t limit the number of words displayed, they’ll just stack on top of each other. It also looks kind of sparse.
Overall, wordcloud2 is the most aesthetically pleasing and T-shirt ready, but it can be overwhelming if you’re just trying to understand the tone. PubMed seems like a more organized and straightforward way to get an idea of the quantitative data you’re working with. ggwordcloud looked a lot better in the demo cited, but it’s having problems rendering, as is modelwordcloud.
It’s also interesting to see how the tone and language of speeches from another time reflect the zeitgeist of their era.