10A

Author

XiaoFei Mei

Introduction

This document replicates and extends the sentiment analysis example from Chapter 2 of “Text Mining with R: A Tidy Approach” (Silge & Robinson, 2017). The original chapter demonstrates how to tokenize text, apply sentiment lexicons (Bing, AFINN, NRC), and visualize emotional arcs using Jane Austen’s novels.

The task here is twofold:

  1. Reproduce the core Austen-based analysis from the book, with minor changes to plot labels and styling.

  2. Extend the methodology in two ways: switch the corpus to a different genre, U.S. State of the Union (SOTU) addresses (1960–2020), and add a further lexicon, the Loughran-McDonald dictionary.

Set Up and Prepare the Jane Austen Text

We begin by loading the required libraries and preparing Jane Austen's six completed novels (from the janeaustenr package) as a tidy data structure with one word per row.

library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(wordcloud)
library(reshape2)

# Source: Silge & Robinson (2017), Chapter 2
tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

head(tidy_books)
# A tibble: 6 × 4
  book                linenumber chapter word       
  <fct>                    <int>   <int> <chr>      
1 Sense & Sensibility          1       0 sense      
2 Sense & Sensibility          1       0 and        
3 Sense & Sensibility          1       0 sensibility
4 Sense & Sensibility          3       0 by         
5 Sense & Sensibility          3       0 jane       
6 Sense & Sensibility          3       0 austen     
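
The chapter column above comes from a running count of heading lines. A minimal sketch of that step on toy input (the lines below are illustrative placeholders, not text from the novels):

```r
library(stringr)

# Toy lines standing in for a novel's text (hypothetical, for illustration only)
toy_lines <- c("A TOY NOVEL", "", "CHAPTER 1", "first line of text",
               "Chapter II", "second line of text")

# TRUE wherever a line starts a chapter; the character class accepts a digit
# or a Roman-numeral letter after "chapter ", and matching ignores case
is_heading <- str_detect(toy_lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# The running total of headings gives each line its chapter number;
# lines before the first heading stay at 0
chapter <- cumsum(is_heading)
print(chapter)
```

This is why the title lines in the head() output carry chapter 0: no heading has been seen yet at that point.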

Sentiment analysis with NRC Lexicon - Joy words

# Get NRC "joy" words
nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

# Filter Emma and join with joy words
tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy) |>
  count(word, sort = TRUE) |>
  slice_head(n = 15)
Joining with `by = join_by(word)`
# A tibble: 15 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
11 pretty       68
12 true         66
13 comfort      65
14 spirits      64
15 marry        63
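
The join-and-count pattern above can be checked on a small scale. The sketch below uses a hand-made three-word "joy" list (hypothetical, not the NRC lexicon) and passes by = "word" explicitly, which also avoids the "Joining with" message:

```r
library(dplyr)

# Toy tokens and a toy joy list (both made up for illustration)
toy_words <- tibble(word = c("good", "friend", "good", "table", "hope", "good"))
toy_joy   <- tibble(word = c("good", "friend", "hope"), sentiment = "joy")

# inner_join keeps only tokens found in the list; count tallies them
joy_counts <- toy_words |>
  inner_join(toy_joy, by = "word") |>
  count(word, sort = TRUE)

print(joy_counts)
```

The token "table" drops out because it has no match in the list, which is the same filtering the NRC join performs on Emma.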

Sentiment over time - Bing Lexicon

# Source: Silge & Robinson (2017), Chapter 2
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  labs(
    title    = "Sentiment Through Jane Austen's Novels",
    subtitle = "Bing lexicon, net sentiment per 80-line section",
    x        = "Narrative position (80-line index)",
    y        = "Net sentiment (positive − negative)",
    caption  = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
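
The net-sentiment calculation relies on integer division (%/%) to bin lines into 80-line sections. A small sketch with a made-up lexicon and made-up line numbers shows how the bins and the positive-minus-negative score come out:

```r
library(dplyr)
library(tidyr)

# Toy tokens with line numbers, plus a toy lexicon (all values hypothetical)
toy_words <- tibble(
  linenumber = c(1, 5, 79, 80, 85, 160),
  word       = c("happy", "sad", "love", "fear", "good", "bad")
)
toy_lexicon <- tibble(
  word      = c("happy", "love", "good", "sad", "fear", "bad"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

toy_net <- toy_words |>
  inner_join(toy_lexicon, by = "word") |>
  # lines 0-79 fall in section 0, 80-159 in section 1, and so on
  count(index = linenumber %/% 80, sentiment) |>
  # values_fill = 0 covers sections that lack one of the two labels
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

print(toy_net)
```

Section 2 contains only a negative word, so without values_fill = 0 its positive count would be NA and the subtraction would fail to produce a number.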

Compare the three lexicons:

# Source: Silge & Robinson (2017), Chapter 2
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
Joining with `by = join_by(word)`
# Bing and NRC: binary positive/negative count
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1) +
  labs(
    title    = "Sentiment in Pride & Prejudice — Three Lexicons Compared",
    subtitle = "AFINN (numeric sum), Bing and NRC (positive minus negative count)",
    x        = "Narrative position (80-line index)",
    y        = "Net sentiment",
    caption  = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
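
AFINN sums numeric scores, while Bing and NRC count labels, so the two approaches can disagree on the same words. The sketch below illustrates this with two words and assumed scores (treat the values as illustrative rather than as actual AFINN entries):

```r
library(dplyr)

# Toy tokens and toy numeric scores (values assumed for illustration)
toy_words  <- tibble(word = c("superb", "poor", "poor"))
toy_scores <- tibble(word = c("superb", "poor"), value = c(5, -2))

scored <- toy_words |>
  inner_join(toy_scores, by = "word")

# AFINN-style: add up the signed scores
numeric_sum <- sum(scored$value)

# Bing/NRC-style: count positive words minus negative words
binary_net <- sum(scored$value > 0) - sum(scored$value < 0)

print(c(numeric_sum = numeric_sum, binary_net = binary_net))
```

One strongly positive word outweighs two mildly negative ones under the numeric sum, but not under the binary count. This is one reason the three panels can differ in absolute level while following similar trajectories.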

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    title    = "Most Common Positive and Negative Words in Austen",
    subtitle = "Bing lexicon",
    x        = "Frequency",
    y        = NULL,
    caption  = "Reproduced from Silge & Robinson (2017), Chapter 2"
  ) +
  theme_minimal(base_size = 12)
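
The many-to-many warning on the Bing joins above suggests that at least one word appears in the lexicon more than once while also occurring repeatedly in the text. A sketch of how such duplicates can be found and how they behave in a join, using a toy lexicon (the entries are hypothetical; the relationship argument assumes dplyr 1.1.0 or later):

```r
library(dplyr)

# Toy lexicon in which one word carries two labels (made up for illustration)
toy_lexicon <- tibble(
  word      = c("envious", "envious", "happy"),
  sentiment = c("positive", "negative", "positive")
)
toy_text <- tibble(word = c("envious", "happy", "envious"))

# Words listed more than once in the lexicon
dup_words <- toy_lexicon |>
  count(word) |>
  filter(n > 1)

# Declaring the relationship silences the warning; each occurrence of a
# doubly labelled word yields one row per label
joined <- toy_text |>
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

print(dup_words)
print(nrow(joined))
```

The same count(word) followed by filter(n > 1) can be run on get_sentiments("bing") to see which entries are affected. A word with both labels adds one positive and one negative count per occurrence, so it should cancel out in a net-sentiment score.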

# Source: Silge & Robinson (2017), Chapter 2
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

# Source: Silge & Robinson (2017), Chapter 2
p_and_p_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  mutate(chapter = cumsum(str_detect(text,
             regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unnest_tokens(word, text)

# Proportion of negative words per chapter
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- austen_chapters %>%
  group_by(book, chapter) %>%
  summarise(words = n(), .groups = "drop")

austen_chapters %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarise(negativewords = n(), .groups = "drop") %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
Joining with `by = join_by(word)`
# A tibble: 1 × 5
  book              chapter negativewords words  ratio
  <fct>               <int>         <int> <int>  <dbl>
1 Pride & Prejudice      34           111  2104 0.0528

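The table above shows the single most negative chapter by proportion across all six novels: chapter 34 of Pride & Prejudice, where about 5.3% of words are negative. The book reports one such chapter per novel; here, .groups = "drop" in summarise() removes the grouping by book, so slice_max() returns only the overall maximum.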
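
To get one row per novel, as in the book, the data needs to be grouped by book when slice_max() runs. A minimal sketch on toy ratios (book names and values are hypothetical):

```r
library(dplyr)

# Toy per-chapter ratios for two made-up books
toy_ratios <- tibble(
  book    = c("A", "A", "B", "B"),
  chapter = c(1, 2, 1, 2),
  ratio   = c(0.03, 0.05, 0.04, 0.02)
)

# Ungrouped: one row for the whole table
overall <- toy_ratios |>
  slice_max(ratio, n = 1)

# Grouped: one row per book
per_book <- toy_ratios |>
  group_by(book) |>
  slice_max(ratio, n = 1) |>
  ungroup()

print(overall)
print(per_book)
```

In the Austen pipeline, either adding group_by(book) before slice_max() or omitting .groups = "drop" should restore the per-novel result.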

Part 2

The extended analysis makes two changes:

  1. Different corpus: State of the Union (SOTU) addresses by U.S. presidents, accessed via the sotu package. The package covers more than 200 years of speeches; this analysis uses those from 1960 to 2020. This is formal political discourse, very different in register from 19th-century fiction.

  2. Additional lexicon: the Loughran-McDonald lexicon, loaded with tidytext's get_sentiments("loughran").

Load, filter, and tokenize the SOTU speeches:

library(sotu)

sotu_df <- bind_cols(sotu_meta, tibble(text = sotu_text))

# Keep speeches from 1960 through 2020
sotu_modern <- sotu_df %>%
  filter(year >= 1960, year <= 2020) %>%
  select(president, year, text)

tidy_sotu <- sotu_modern %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`
cat("Total words (after stopword removal):", nrow(tidy_sotu), "\n")
Total words (after stopword removal): 199189 
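
Because stop words are removed before the sentiment joins, any word that is both a stop word and a lexicon entry will never be counted. It may be worth checking how large that overlap is. A sketch, assuming only the stop_words data and the Bing lexicon that ship with tidytext:

```r
library(dplyr)
library(tidytext)

# Words that appear in the stop-word list and also carry a Bing label;
# distinct() is needed because stop_words combines several source lists
overlap <- stop_words |>
  distinct(word) |>
  inner_join(get_sentiments("bing"), by = "word")

# How many overlapping words there are, by label
overlap |>
  count(sentiment)
```

If the overlap includes common evaluative words, running the sentiment join on the unfiltered tokens would be one way to see how much the anti_join() changes the result.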

Check sentiment over time:

sotu_bing <- tidy_sotu %>%
  inner_join(get_sentiments("bing")) %>%
  count(year, president, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative,
         pct_positive  = positive / (positive + negative))
Joining with `by = join_by(word)`
ggplot(sotu_bing, aes(year, net_sentiment)) +
  geom_col(aes(fill = net_sentiment > 0), show.legend = FALSE) +
  geom_smooth(method = "loess", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_fill_manual(values = c("#d73027", "#4575b4")) +
  labs(
    title    = "Net Sentiment in State of the Union Speeches (1960–2020)",
    x        = "Year",
    y        = "Net sentiment (positive − negative words)",
    caption  = "Data: sotu package; lexicon: Bing et al."
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
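
Net sentiment is a raw count difference, so a longer speech can score higher simply because it contains more words. The pct_positive column computed above does not depend on length in that way. A toy comparison (counts are made up):

```r
library(dplyr)

# Two hypothetical speeches with the same tone but different lengths
toy_counts <- tibble(
  year     = c(2000, 2001),
  positive = c(300, 150),
  negative = c(100, 50)
)

toy_scores <- toy_counts |>
  mutate(
    net_sentiment = positive - negative,               # scales with length
    pct_positive  = positive / (positive + negative)   # length-independent
  )

print(toy_scores)
```

Both speeches are 75% positive, yet the longer one has twice the net sentiment. Plotting pct_positive alongside net_sentiment is one way to separate changes in tone from changes in speech length.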

The Loughran-McDonald lexicon has six categories: negative, positive, litigious, uncertainty, constraining, and superfluous. Unlike Bing or NRC, it was built from SEC financial filings, so it may be better suited to the formal register of institutional writing such as presidential speeches.

get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)
# A tibble: 6 × 2
  sentiment        n
  <chr>        <int>
1 negative      2355
2 litigious      904
3 positive       354
4 uncertainty    297
5 constraining   184
6 superfluous     56
sotu_loughran <- tidy_sotu %>%
  inner_join(get_sentiments("loughran")) %>%
  count(year, sentiment)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 596 of `x` matches multiple rows in `y`.
ℹ Row 2450 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
target_cats <- c("positive", "negative", "uncertainty", "litigious")

sotu_loughran %>%
  filter(sentiment %in% target_cats) %>%
  ggplot(aes(year, n, colour = sentiment)) +
  geom_line(linewidth = 0.9) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 0.4, linetype = "dashed") +
  facet_wrap(~sentiment, scales = "free_y", ncol = 2) +
  labs(
    title    = "SOTU Speeches — Loughran-McDonald Lexicon (1960–2020)",
    x        = "Year",
    y        = "Word count",
    colour   = "Category",
    caption  = "Data: sotu package; lexicon: Loughran & McDonald (2011)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
`geom_smooth()` using formula = 'y ~ x'
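
The panels above plot raw word counts per year, which also reflect how long each speech is. Dividing by the total number of tokens per year gives a share that is comparable across years. A sketch on toy data (the words, labels, and years are hypothetical):

```r
library(dplyr)

# Toy tokens and a toy category lexicon (made up for illustration)
toy_tokens <- tibble(
  year = c(2000, 2000, 2000, 2000, 2001, 2001),
  word = c("risk", "loss", "gain", "budget", "risk", "budget")
)
toy_lex <- tibble(
  word      = c("risk", "loss", "gain"),
  sentiment = c("uncertainty", "negative", "positive")
)

# Total tokens per year, used as the denominator
totals <- toy_tokens |>
  count(year, name = "total")

shares <- toy_tokens |>
  inner_join(toy_lex, by = "word") |>
  count(year, sentiment) |>
  left_join(totals, by = "year") |>
  mutate(share = n / total)

print(shares)
```

Applied to tidy_sotu, the same share would also show what fraction of each year's tokens carries any Loughran-McDonald label at all.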

Compare Bing vs. Loughran-McDonald on SOTU:

bing_trace <- sotu_bing %>%
  select(year, net_sentiment) %>%
  mutate(lexicon = "Bing")

loughran_trace <- tidy_sotu %>%
  inner_join(get_sentiments("loughran")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative,
         lexicon = "Loughran-McDonald") %>%
  select(year, net_sentiment, lexicon)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 596 of `x` matches multiple rows in `y`.
ℹ Row 2450 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bind_rows(bing_trace, loughran_trace) %>%
  ggplot(aes(year, net_sentiment, colour = lexicon, fill = lexicon)) +
  geom_col(position = "dodge", alpha = 0.7, show.legend = FALSE) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1) +
  facet_wrap(~lexicon, ncol = 1, scales = "free_y") +
  labs(
    title    = "Bing vs. Loughran-McDonald: Net Sentiment in SOTU (1960–2020)",
    x        = "Year",
    y        = "Net sentiment (positive − negative)",
    caption  = "Data: sotu package"
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
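
The two panels use free y-axes, so agreement between the lexicons is hard to judge by eye. One simple summary is the correlation between the two yearly series. A sketch on toy traces shaped like the bind_rows() result above (the numbers are invented):

```r
library(dplyr)
library(tidyr)

# Toy long-format traces with the same columns as the combined data above
toy_traces <- tibble(
  year          = rep(2000:2003, times = 2),
  lexicon       = rep(c("Bing", "Loughran-McDonald"), each = 4),
  net_sentiment = c(10, 20, 30, 40, -5, -3, -4, 0)
)

# One column per lexicon, one row per year
wide <- toy_traces |>
  pivot_wider(names_from = lexicon, values_from = net_sentiment)

# Pearson correlation between the two yearly series
r <- cor(wide$Bing, wide$`Loughran-McDonald`)
print(r)
```

A high correlation would indicate that the lexicons track the same year-to-year movement even when their levels differ, while a low one would point to genuinely different signals.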

Conclusion

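The Jane Austen corpus is dense with emotional language: words like miss, love, good, happy, poor, and dear dominate the sentiment counts. One caveat from the book applies here: Bing codes miss as negative, but in Austen it is mostly a title for young, unmarried women. Each novel's arc shows clear narrative tension and resolution, which makes the sentiment trajectories easy to read.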

State of the Union speeches are structurally different: they are formal, policy-focused documents designed to inform and persuade rather than to evoke emotion. Sentiment-bearing words are likely a smaller fraction of the total vocabulary, and the year-to-year signal appears noisier.

The Bing lexicon was derived from customer product reviews, so on SOTU speeches it picks up everyday evaluative words. The Loughran-McDonald lexicon was built from SEC filings for financial text, so it emphasizes the vocabulary of risk, litigation, and uncertainty. The two lexicons therefore measure different things in the same speeches, which underscores the key lesson from Chapter 2: lexicon choice matters, and the right choice depends on matching the lexicon's vocabulary to the domain of the text.