This document replicates and extends the sentiment analysis example from Chapter 2 of “Text Mining with R: A Tidy Approach” (Silge & Robinson, 2017). The original chapter demonstrates how to tokenize text, apply sentiment lexicons (Bing, AFINN, NRC), and visualize emotional arcs using Jane Austen’s novels.
The task here is twofold:
1. Reproduce the core Austen-based analysis exactly as presented in the book.
2. Extend the methodology by changing the corpus to a different genre, U.S. State of the Union (SOTU) addresses (1960–2020), and by incorporating an additional lexicon, the Loughran-McDonald dictionary.
Setup and Prepare Jane Austen Text
We begin by loading required libraries and preparing the complete works of Jane Austen as a tidy data structure.
library(tidyverse)   # dplyr, tidyr, stringr, ggplot2 used in the chunks below
library(tidytext)
library(janeaustenr)
# A tibble: 6 × 4
  book                linenumber chapter word
  <fct>                    <int>   <int> <chr>
1 Sense & Sensibility          1       0 sense
2 Sense & Sensibility          1       0 and
3 Sense & Sensibility          1       0 sensibility
4 Sense & Sensibility          3       0 by
5 Sense & Sensibility          3       0 jane
6 Sense & Sensibility          3       0 austen
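The chunk that produced the table above is not shown. A minimal sketch of the preparation step, following the approach in Chapter 2 of the book, might look like this. The regular expression for chapter headings is the book's; whether this document used exactly the same code is an assumption.

```r
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

# One row per word, tagged with its book, source line, and chapter.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    # A line starting with "Chapter <digit or roman numeral>" opens a new chapter.
    chapter = cumsum(str_detect(
      text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

head(tidy_books)
```

Line numbers and chapters are assigned before tokenizing, so every word keeps the position of the line it came from.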
Sentiment analysis with NRC Lexicon - Joy words
# Get NRC "joy" words
nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

# Filter Emma and join with joy words
tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy) |>
  count(word, sort = TRUE) |>
  slice_head(n = 15)
Joining with `by = join_by(word)`
# A tibble: 15 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
11 pretty       68
12 true         66
13 comfort      65
14 spirits      64
15 marry        63
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  labs(
    title = "Sentiment Through Jane Austen's Novels",
    subtitle = "Bing lexicon, net sentiment per 80-line section",
    x = "Narrative position (80-line index)",
    y = "Net sentiment (positive − negative)",
    caption = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
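The `jane_austen_sentiment` object plotted above is not defined in the visible code. A sketch of how it could be computed, following the book, is given here. The `relationship = "many-to-many"` argument is an addition not taken from the source: it declares the join shape that the warning printed before the plot complains about, and it assumes dplyr 1.1.0 or later.

```r
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)

# Rebuild the tidy token table so this sketch runs on its own.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Net sentiment (positive minus negative word count) per 80-line section.
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
```

Integer division (`%/%`) by 80 assigns each line to a section, so `index` counts 80-line blocks from the start of each novel.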
# Bing and NRC: binary positive/negative count
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(
      get_sentiments("nrc") %>%
        filter(sentiment %in% c("positive", "negative"))
    ) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1) +
  labs(
    title = "Sentiment in Pride & Prejudice: Three Lexicons Compared",
    subtitle = "AFINN (numeric sum), Bing and NRC (positive minus negative count)",
    x = "Narrative position (80-line index)",
    y = "Net sentiment",
    caption = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
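The subtitle above distinguishes two scoring rules: AFINN sums numeric word scores, while Bing and NRC subtract the negative word count from the positive word count. The toy sketch below illustrates the difference. The tokens, the scores in `toy_afinn`, and the labels in `toy_bing` are all made up for illustration and are not the real lexicon entries.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens covering two 80-line sections of an imaginary text.
tokens <- tibble(
  linenumber = c(5, 12, 40, 90, 100, 150),
  word       = c("happy", "love", "awful", "awful", "hate", "good")
)

# Hypothetical stand-ins for the lexicons (values invented for this example).
toy_afinn <- tibble(
  word  = c("happy", "love", "awful", "hate", "good"),
  value = c(1, 3, -3, -3, 1)
)
toy_bing <- tibble(
  word      = c("happy", "love", "awful", "hate", "good"),
  sentiment = c("positive", "positive", "negative", "negative", "positive")
)

# AFINN-style: sum the numeric scores within each section.
afinn_style <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN-style")

# Bing/NRC-style: positive word count minus negative word count.
count_style <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative, method = "count-style")
```

With these invented values the two rules agree on the sign of each section but not on the magnitude, which is one reason the three panels in the plot are on different scales.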
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    title = "Most Common Positive and Negative Words in Austen",
    subtitle = "Bing lexicon",
    x = "Frequency",
    y = NULL,
    caption = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
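The `bing_word_counts` table used in the plot above is not defined in the visible code. A sketch of one way to build it, following the book, is shown here. The explicit `by` and `relationship` arguments are additions and assume dplyr 1.1.0 or later.

```r
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

# Rebuild the tidy token table so this sketch runs on its own.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# How often each sentiment-bearing word occurs across all six novels.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE)

head(bing_word_counts)
```

Counting on both `word` and `sentiment` keeps the lexicon label next to each frequency, which is what lets the plot facet by sentiment.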
# A tibble: 1 × 5
  book              chapter negativewords words  ratio
  <fct>               <int>         <int> <int>  <dbl>
1 Pride & Prejudice      34           111  2104 0.0528
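The table above matches the book's result for Pride & Prejudice: Chapter 34 has the highest proportion of negative words (111 of 2,104, about 5.3%). The book reports the corresponding most negative chapter for each Austen novel.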
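The calculation behind this table is not shown. A sketch of the per-chapter ratio, following the book's approach, could look like the following. It returns one row per novel, and its Pride & Prejudice row should agree with the table above.

```r
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

# Rebuild the tidy token table so this sketch runs on its own.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text, regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

bing_negative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

# Total words per chapter (the denominator of the ratio).
word_counts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarise(words = n(), .groups = "drop")

# Share of negative words per chapter; keep the worst chapter of each book.
most_negative <- tidy_books %>%
  semi_join(bing_negative, by = "word") %>%
  group_by(book, chapter) %>%
  summarise(negativewords = n(), .groups = "drop_last") %>%
  left_join(word_counts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%   # chapter 0 is front matter before Chapter 1
  slice_max(ratio, n = 1) %>%
  ungroup()

most_negative
```

Using `.groups = "drop_last"` leaves the data grouped by `book`, so `slice_max()` picks the top chapter within each novel rather than across the whole corpus.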
Part 2: Extended Analysis
The extended analysis makes two changes:
1. Different corpus: State of the Union (SOTU) speeches from U.S. presidents, accessed via the sotu package. The package covers formal political discourse spanning over 200 years; this analysis uses the 1960–2020 addresses. The register is very different from that of 19th-century fiction.
2. Additional lexicon: the Loughran-McDonald lexicon, loaded through tidytext's get_sentiments("loughran").
ggplot(sotu_bing, aes(year, net_sentiment)) +
  geom_col(aes(fill = net_sentiment > 0), show.legend = FALSE) +
  geom_smooth(method = "loess", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_fill_manual(values = c("#d73027", "#4575b4")) +
  labs(
    title = "Net Sentiment in State of the Union Speeches (1960–2020)",
    x = "Year",
    y = "Net sentiment (positive − negative words)",
    caption = "Data: sotu package; lexicon: Bing et al."
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
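The `sotu_bing` table plotted above is not defined in the visible code. The sketch below shows the general shape of such a computation on a hypothetical two-row `speeches` data frame with `year` and `text` columns. The sentences are invented; in the actual analysis the text would come from the sotu package, whose loading step is not shown in this document.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for the speech data: one row per address.
speeches <- tibble(
  year = c(1960, 1961),
  text = c(
    "We face a terrible crisis, and fear of failure is real.",
    "Our progress is good, our economy is strong, and success is within reach."
  )
)

# Net sentiment per year: positive minus negative Bing word counts.
sotu_bing <- speeches %>%
  filter(between(year, 1960, 2020)) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

sotu_bing
```

Counting by `year` pools all addresses delivered in the same year, which matches a plot with one bar per year.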
The Loughran-McDonald lexicon has six categories: negative, positive, litigious, uncertainty, constraining, and superfluous. Unlike Bing or NRC, it was built from SEC financial filings, which may make it better suited to the formal register of institutional writing such as presidential speeches. Only the positive and negative categories enter the net-sentiment comparison below.
Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 596 of `x` matches multiple rows in `y`.
ℹ Row 2450 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
bind_rows(bing_trace, loughran_trace) %>%
  ggplot(aes(year, net_sentiment, colour = lexicon, fill = lexicon)) +
  geom_col(position = "dodge", alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1) +
  facet_wrap(~lexicon, ncol = 1, scales = "free_y") +
  labs(
    title = "Bing vs. Loughran-McDonald: Net Sentiment in SOTU (1960–2020)",
    x = "Year",
    y = "Net sentiment (positive − negative)",
    caption = "Data: sotu package"
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
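The many-to-many warnings printed before this plot are what dplyr reports when a word occurs several times in the text and also appears more than once in the lexicon, which can happen when a lexicon lists a word under more than one category. The sketch below reproduces that situation and builds a `loughran_trace`-style table. The `toy_loughran` lexicon and the tokens are invented stand-ins, not real Loughran-McDonald entries.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens: one row per word, tagged with the speech year.
tokens <- tibble(
  year = c(1960, 1960, 1960, 1961, 1961),
  word = c("loss", "loss", "gain", "gain", "may")
)

# Hypothetical multi-category lexicon; "loss" is listed under two categories.
toy_loughran <- tibble(
  word      = c("loss", "loss", "gain", "may", "shall"),
  sentiment = c("negative", "uncertainty", "positive", "uncertainty", "litigious")
)

# Joining the full lexicon is many-to-many; declaring it silences the warning.
all_matches <- tokens %>%
  inner_join(toy_loughran, by = "word", relationship = "many-to-many")

# For a net-sentiment trace, keep only the positive and negative categories.
loughran_trace <- tokens %>%
  inner_join(
    toy_loughran %>% filter(sentiment %in% c("positive", "negative")),
    by = "word"
  ) %>%
  count(year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(
    net_sentiment = positive - negative,
    lexicon = "Loughran-McDonald (toy)"
  )

loughran_trace
```

A table with `year`, `net_sentiment`, and `lexicon` columns can be stacked with a matching Bing table via `bind_rows()`, which is the shape the comparison plot expects.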
Conclusion
The Jane Austen corpus is dense with emotional language: words like miss, love, good, happy, poor, and dear dominate the sentiment counts, although "miss" is mostly a title in Austen rather than a negative word, as the book itself notes. The arc of each novel shows clear narrative tension and resolution, which makes the sentiment trajectories easy to read.
State of the Union speeches are structurally different: they are formal, policy-focused documents designed to inform and persuade rather than evoke emotion. Sentiment-bearing words are a smaller fraction of the total vocabulary, and the signal is noisier.
The Bing lexicon was built from product reviews and social media, so when applied to SOTU speeches it picks up everyday evaluative words. The Loughran-McDonald lexicon, by contrast, was developed from SEC filings for financial and legal documents, so it targets a more formal vocabulary. The two lexicons therefore measure different things in the same speeches. This underscores the key lesson from Chapter 2: lexicon choice matters, and the right choice depends on matching the lexicon's vocabulary to the domain of the text.