Sentiment Analysis with Text Mining in R

Author

Nana Kwasi Danquah

Published

April 19, 2026

Overview

This report has three parts. Part 1 reproduces the primary sentiment analysis example from Chapter 2 of Text Mining with R (Silge and Robinson 2017) using the Jane Austen corpus and three general-purpose lexicons (AFINN, Bing, NRC). Part 2 extends the analysis with a different corpus, four science-fiction novels by H.G. Wells downloaded from Project Gutenberg, and an additional sentiment lexicon: the Loughran-McDonald lexicon (Loughran and McDonald 2011), originally designed for financial documents. Part 3 compares the two corpora directly.

The central question driving the comparison is: how does the emotional texture of Regency domestic fiction (Austen) differ from late-Victorian science fiction (Wells), and what does each lexicon reveal or conceal?

Setup

Code

library(tidyverse)    # data wrangling + ggplot2
library(tidytext)     # tidy text mining
library(textdata)     # sentiment lexicons (AFINN, NRC, Loughran)
library(janeaustenr)  # base corpus
library(gutenbergr)   # extension corpus
library(wordcloud)    # word cloud visualisation
library(reshape2)     # acast() for comparison clouds
library(scales)       # percent_format()

Part 1 — Reproducing the Base Example

All code in this section is adapted directly from Chapter 2 of Text Mining with R: A Tidy Approach by Silge and Robinson (2017), available at https://www.tidytextmining.com/sentiment.

1.1 The Three Sentiment Lexicons

This part uses three English-language sentiment lexicons, all retrieved through tidytext's get_sentiments(). Each encodes sentiment differently:

AFINN (Nielsen 2011) — integer scores from −5 (most negative) to +5 (most positive)
Bing (Liu 2012) — binary classification: positive or negative
NRC (Mohammad and Turney 2013) — ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust

All three are based on unigrams (individual words) and do not account for negation or context.

Code

get_sentiments("afinn") |> slice_head(n = 8)

Code

get_sentiments("bing")  |> slice_head(n = 8)

Code

get_sentiments("nrc")   |> slice_head(n = 8)

1.2 Tidying the Jane Austen Corpus

We load all six completed Austen novels from janeaustenr and convert them to one-token-per-row format using unnest_tokens(). Line numbers and chapter numbers are added before tokenising so that they remain available for the later chunked analysis.

Code

tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter    = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_books
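The chapter counter above relies on a running total of heading matches. A minimal base-R sketch of that idea on a made-up six-line text (the lines are illustrative, not taken from janeaustenr; grepl() stands in for str_detect()):

```r
# Toy text: a title line (before any chapter), then two chapters.
text <- c(
  "A NOVEL",
  "CHAPTER 1",
  "It is a truth universally acknowledged",
  "that a single man",
  "Chapter II",
  "Mr. Bennet was among the earliest"
)

# TRUE wherever a line starts a new chapter (same pattern as above).
is_heading <- grepl("^chapter [\\divxlc]", text,
                    ignore.case = TRUE, perl = TRUE)

# The running total of headings gives each line its chapter number;
# lines before the first heading stay at chapter 0.
chapter <- cumsum(is_heading)

data.frame(linenumber = seq_along(text), chapter = chapter, text = text)
```

Front matter therefore ends up in "chapter 0", which is why a later step filters out chapter 0 before ranking chapters.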

1.3 Most Common Joy Words in Emma (NRC Lexicon)

We filter the NRC lexicon to the “joy” category and inner-join it with the tokenised text of Emma to find the most frequent joy-associated words.

Code

nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy, by = "word") |>
  count(word, sort = TRUE) |>
  slice_head(n = 15) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word)) +
  geom_col(fill = "#4e79a7") +
  labs(
    title    = "Most Common Joy Words in Emma",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 13)

Top 15 joy words in *Emma* — NRC lexicon

1.4 Sentiment Arc Across All Six Novels (Bing Lexicon)

Each novel is sliced into non-overlapping 80-line chunks; net sentiment (positive − negative word count) is computed per chunk and plotted as a bar chart, revealing the emotional trajectory of each narrative.

Code

jane_austen_sentiment <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Sentiment Trajectory — Jane Austen Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)

Sentiment arc across Jane Austen novels — Bing lexicon
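The chunk index in the code above comes from integer division of the line number. A small sketch on a hypothetical 200-line book shows how lines are assigned to chunks:

```r
# Hypothetical book with 200 lines, numbered from 1 as row_number() does.
linenumber <- 1:200

# Integer division maps each line to a chunk: lines 1-79 -> 0,
# lines 80-159 -> 1, lines 160-200 -> 2.
index <- linenumber %/% 80

# Number of lines that fall into each chunk.
table(index)
```

Because line numbers start at 1, the first chunk holds 79 lines rather than 80, and the last chunk holds whatever remains. The chunks do not overlap, so each line contributes to exactly one bar.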

Observation: Every novel shows a broadly positive arc with dips during crisis points — the Wickham scandal in Pride & Prejudice, Marianne’s illness in Sense & Sensibility — before resolving positively, consistent with social comedy conventions.

1.5 Comparing All Three Lexicons on Pride & Prejudice

To see whether lexicon choice materially changes the story, we apply all three to Pride & Prejudice and plot the net sentiment arcs together.

Code

pride_prejudice <- tidy_books |>
  filter(book == "Pride & Prejudice")

# AFINN: numeric scores summed per window
afinn_pp <- pride_prejudice |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(index = linenumber %/% 80) |>
  summarise(sentiment = sum(value)) |>
  mutate(method = "AFINN")

# Bing and NRC (positive/negative categories -> net count)
bing_nrc_pp <- bind_rows(
  pride_prejudice |>
    inner_join(get_sentiments("bing"), by = "word") |>
    mutate(method = "Bing"),
  pride_prejudice |>
    inner_join(
      get_sentiments("nrc") |>
        filter(sentiment %in% c("positive", "negative")),
      by = "word"
    ) |>
    mutate(method = "NRC")
) |>
  count(method, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

bind_rows(afinn_pp, bing_nrc_pp) |>
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7", "#59a14f")) +
  labs(
    title    = "Three Lexicons Compared — Pride & Prejudice",
    subtitle = "Each panel uses a different sentiment lexicon",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)

AFINN, Bing, and NRC compared on *Pride & Prejudice*

Observation: All three lexicons agree on narrative shape (early optimism, a prolonged negative centre, positive resolution), but AFINN produces the largest absolute swings because it sums graded integer scores from −5 to +5 rather than counting words. NRC scores higher overall, most likely because its ratio of negative to positive words is lower than Bing’s, so a given stretch of text picks up relatively more positive matches.
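The difference between summing graded scores and counting matched words can be seen on a toy chunk. The scores below are hypothetical and chosen for illustration; they are not real AFINN entries.

```r
# Hypothetical graded scores (NOT the real AFINN values).
toy_scores <- c(good = 1, pleasant = 1, wretched = -4)

# Words in one chunk; "the" and "house" have no lexicon entry.
chunk_words <- c("good", "pleasant", "wretched", "the", "house")

# Keep only the words that appear in the toy lexicon.
matched <- toy_scores[chunk_words[chunk_words %in% names(toy_scores)]]

# Count-based net sentiment (Bing/NRC style): positives minus negatives.
net_count <- sum(matched > 0) - sum(matched < 0)

# Score-based net sentiment (AFINN style): sum of graded values.
score_sum <- sum(matched)

c(net_count = net_count, score_sum = score_sum)
```

Here the count-based measure is mildly positive while the score-based measure is negative, because a single strongly scored word outweighs two weak ones. The same mechanism would let graded scores swing further than counts on real text.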

1.6 Most Common Positive and Negative Words (Bing)

Code

bing_word_counts <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  ungroup()

bing_word_counts |>
  group_by(sentiment) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Top Positive & Negative Words — Jane Austen",
    subtitle = "Bing lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)

Top positive and negative words across all Austen novels — Bing

Note: “Miss” ranks among negative words because Bing codes it as the verb “to miss,” whereas in Austen it is almost always an honorific. This illustrates a classic limitation of unigram lexicons: words are context-free.
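One common workaround is to treat such a word as a custom stop word and drop it before counting. A minimal base-R sketch, using a made-up table that stands in for the joined Bing output (the name custom_stop_words is illustrative):

```r
# Toy stand-in for tokens already matched against a sentiment lexicon.
matches <- data.frame(
  word      = c("miss", "happy", "miss", "poor", "miss"),
  sentiment = c("negative", "positive", "negative", "negative", "negative")
)

# Words to exclude because their lexicon sense does not fit this corpus.
custom_stop_words <- c("miss")

# Drop every row whose word is on the exclusion list.
filtered <- matches[!matches$word %in% custom_stop_words, ]

table(matches$sentiment)   # counts before filtering
table(filtered$sentiment)  # counts after filtering
```

The trade-off is that genuine uses of the verb are discarded too, so this suits corpora where the unwanted sense clearly dominates.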

1.7 Comparison Word Cloud

Code

tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |>
  comparison.cloud(
    colors     = c("#e15759", "#4e79a7"),
    max.words  = 120,
    title.size = 1.5
  )

Positive (blue) vs negative (red) word cloud — Bing lexicon

1.8 Most Negative Chapter Across All Novels

Which chapter of each novel has the highest proportion of negative words under the Bing lexicon?

Code

bing_negative <- get_sentiments("bing") |>
  filter(sentiment == "negative")

word_counts <- tidy_books |>
  group_by(book, chapter) |>
  summarise(words = n(), .groups = "drop")

tidy_books |>
  semi_join(bing_negative, by = "word") |>
  group_by(book, chapter) |>
  summarise(negative_words = n(), .groups = "drop") |>
  left_join(word_counts, by = c("book", "chapter")) |>
  mutate(ratio = negative_words / words) |>
  filter(chapter != 0) |>
  group_by(book) |>
  slice_max(ratio, n = 1) |>
  ungroup() |>
  arrange(desc(ratio)) |>
  select(book, chapter, negative_words, words, ratio)

Part 2 — Extension

2.1 Extension Corpus: H.G. Wells

Rationale: Austen’s domestic social comedies are polite, bounded, and emotionally moderate. As a deliberate contrast, we use four H.G. Wells science-fiction novels dealing with invasion, mutation, time travel, and existential threat. Both authors were English novelists, but Austen wrote in the Regency period and Wells at the end of the Victorian era, and they wrote in entirely different registers.

Code

wells_meta <- tibble(
  gutenberg_id = c(35, 36, 5230, 159),  # 159 = The Island of Doctor Moreau
  title = c(
    "The Time Machine",
    "The War of the Worlds",
    "The Invisible Man",
    "The Island of Doctor Moreau"
  )
)

# Download texts (note: meta_fields may not work with all mirrors)
wells_raw <- gutenberg_download(wells_meta$gutenberg_id)

# Join with our local metadata to get titles
tidy_wells <- wells_raw |>
  left_join(wells_meta, by = "gutenberg_id") |>
  group_by(title) |>
  mutate(
    linenumber = row_number(),
    chapter    = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlci]+", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_wells |> count(title, sort = TRUE)
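Since downloads depend on the mirror, it can be worth confirming that every requested ID actually came back. The sketch below uses a made-up data frame in place of the gutenberg_download() result and assumes that a failed download simply yields no rows for that ID; that behaviour is an assumption, not something confirmed here.

```r
# IDs we asked for (a subset, for illustration).
requested_ids <- c(35, 36, 5230)

# Stand-in for the download result, with one text missing.
downloaded <- data.frame(
  gutenberg_id = c(35, 35, 36),
  text         = c("line a", "line b", "line c")
)

# IDs that were requested but never appear in the result.
missing_ids <- setdiff(requested_ids, unique(downloaded$gutenberg_id))

if (length(missing_ids) > 0) {
  message("No text returned for Gutenberg ID(s): ",
          paste(missing_ids, collapse = ", "))
}
```

The per-title token count printed above serves a similar purpose: a missing title or an implausibly small count suggests the download should be retried.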

2.2 NRC Joy vs Fear — Austen and Wells Side by Side

Code

nrc_joy_fear <- get_sentiments("nrc") |>
  filter(sentiment %in% c("joy", "fear"))

austen_jf <- tidy_books |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_jf <- tidy_wells |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_jf, wells_jf) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Joy vs Fear — Austen vs Wells",
    subtitle = "NRC lexicon · proportion of joy/fear matched words",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Joy vs fear word proportions: Austen vs Wells (NRC)
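The proportions in this chart are shares of the joy-plus-fear matches within each corpus, not rates over all words. A toy base-R sketch with two hypothetical corpora, A and B, makes the denominator explicit:

```r
# Toy matches: each row is one token matched to joy or fear.
matches <- data.frame(
  corpus    = c("A", "A", "A", "B", "B", "B", "B"),
  sentiment = c("joy", "joy", "fear", "fear", "fear", "fear", "joy")
)

# Cross-tabulate, then divide each row by its own total.
counts      <- table(matches$corpus, matches$sentiment)
proportions <- prop.table(counts, margin = 1)

proportions
rowSums(proportions)  # each corpus sums to 1
```

Within a corpus the two bars always sum to 100%, so the chart shows the balance between joy and fear rather than how often either appears in the text overall.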

2.3 Sentiment Arc — Wells Novels (Bing)

Code

wells_sentiment_bing <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(title, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(wells_sentiment_bing, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Sentiment Trajectory — H.G. Wells Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)

Sentiment arc across H.G. Wells novels — Bing lexicon

2.4 Most Common Fear Words by Wells Novel (NRC)

Code

nrc_fear <- get_sentiments("nrc") |> filter(sentiment == "fear")

tidy_wells |>
  inner_join(nrc_fear, by = "word") |>
  count(title, word, sort = TRUE) |>
  group_by(title) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, title), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free_y") +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Most Common Fear Words — H.G. Wells",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 11)

Top fear words in each Wells novel — NRC lexicon

2.5 Additional Lexicon: Loughran-McDonald

Background

The Loughran-McDonald lexicon (Loughran and McDonald 2011) was constructed from SEC 10-K annual reports to identify words with consistent sentiment signals in financial prose. It provides six categories:

Category	Meaning in finance	Why it is interesting in fiction
positive	Favourable outlook	General optimism
negative	Unfavourable outlook	General pessimism
uncertainty	Hedging, speculation	Language of the unknown
litigious	Legal language	Conflict, authority
constraining	Obligation, restriction	Captivity, control
superfluous	Redundant filler	—

Applying a financial lexicon to nineteenth-century fiction is deliberately unconventional. The goal is not to claim Loughran is the right tool for fiction, but to use its unique categories, especially uncertainty, to surface linguistic patterns that Bing and NRC cannot detect.

Code

loughran <- get_sentiments("loughran")
loughran |> count(sentiment, sort = TRUE)

2.5.1 Loughran Category Profile — Wells Novels

Code

wells_loughran <- tidy_wells |>
  inner_join(loughran, by = "word") |>
  count(title, sentiment) |>
  group_by(title) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(wells_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — H.G. Wells",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Loughran-McDonald category proportions — H.G. Wells novels

2.5.2 Loughran Category Profile — Austen Novels

Code

austen_loughran <- tidy_books |>
  inner_join(loughran, by = "word") |>
  count(book, sentiment) |>
  group_by(book) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(austen_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — Jane Austen",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Loughran-McDonald category proportions — Jane Austen novels

2.5.3 Uncertainty Language — The Signature of Science Fiction

The uncertainty category is the most analytically interesting when applied to fiction. Financial uncertainty words (“possible,” “might,” “appears,” “uncertain,” “approximately”) map naturally onto the language of characters confronting the unknown.

Code

loughran_uncertainty <- loughran |> filter(sentiment == "uncertainty")

austen_unc <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "Jane Austen")

wells_unc <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "H.G. Wells")

bind_rows(austen_unc, wells_unc) |>
  group_by(corpus) |>
  slice_max(n, n = 12) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, corpus), fill = corpus)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~corpus, scales = "free") +
  scale_y_reordered() +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Top Uncertainty Words — Austen vs Wells",
    subtitle = "Loughran-McDonald uncertainty category",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)

Top uncertainty words: Austen vs Wells (Loughran)

2.5.4 Bing vs Loughran Arc — The War of the Worlds

Code

wotw <- tidy_wells |> filter(title == "The War of the Worlds")

wotw_bing <- wotw |>
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative, method = "Bing")

# pivot_wider only creates columns that exist in the data, so we
# explicitly add any missing polarity column before computing net
wotw_loughran <- wotw |>
  inner_join(
    loughran |> filter(sentiment %in% c("positive", "negative")),
    by = "word",
    relationship = "many-to-many"
  ) |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  (function(df) {
    if (!"positive" %in% names(df)) df$positive <- 0L
    if (!"negative" %in% names(df)) df$negative <- 0L
    df
  })() |>
  mutate(net = positive - negative, method = "Loughran")

bind_rows(
  wotw_bing     |> select(index, net, method),
  wotw_loughran |> select(index, net, method)
) |>
  ggplot(aes(index, net, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b")) +
  labs(
    title    = "Bing vs Loughran — The War of the Worlds",
    subtitle = "Net sentiment (positive − negative) per 80-line chunk",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)

Bing vs Loughran (positive − negative) — *The War of the Worlds*
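The guard for a missing polarity column can also be approached differently. As an alternative sketch (not the method used above), fixing the factor levels before tabulating guarantees that both columns exist even when one polarity never occurs. The data are made up.

```r
# Toy matches in which no positive word ever occurs.
hits <- data.frame(
  index     = c(0, 0, 1, 2, 2),
  sentiment = c("negative", "negative", "negative", "negative", "negative")
)

# Declaring both levels up front keeps an all-zero "positive" column.
hits$sentiment <- factor(hits$sentiment, levels = c("negative", "positive"))

counts <- table(index = hits$index, sentiment = hits$sentiment)

# Net sentiment per chunk; safe because both columns are present.
net <- counts[, "positive"] - counts[, "negative"]
net
```

Either approach avoids an error in the subtraction when a sparse lexicon finds no positive (or no negative) words at all.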

Part 3 — Cross-Corpus Comparison

3.1 Overall Bing Polarity: Austen vs Wells

Code

austen_pol <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_pol <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_pol, wells_pol) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Positive vs Negative Word Balance",
    subtitle = "Bing lexicon — Austen vs Wells",
    x = NULL, y = "Proportion of matched words", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Positive vs negative word balance: Austen vs Wells (Bing)

3.2 Full NRC Emotion Profile: Austen vs Wells

Code

nrc_all <- get_sentiments("nrc")

austen_nrc <- tidy_books |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_nrc <- tidy_wells |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_nrc, wells_nrc) |>
  ggplot(aes(reorder(sentiment, proportion), proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "NRC Emotion Profiles — Austen vs Wells",
    subtitle = "Proportion of matched words in each category",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Full NRC emotion profile: Austen vs Wells

3.3 Uncertainty Rate per 1,000 Words: Austen vs Wells

Code

austen_unc_rate <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow() / nrow(tidy_books) * 1000

wells_unc_rate <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow() / nrow(tidy_wells) * 1000

tibble(
  corpus = c("Jane Austen", "H.G. Wells"),
  rate   = c(austen_unc_rate, wells_unc_rate)
) |>
  ggplot(aes(corpus, rate, fill = corpus)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Uncertainty Word Rate per 1,000 Words",
    subtitle = "Loughran-McDonald uncertainty category",
    x = NULL,
    y = "Uncertainty words per 1,000 words"
  ) +
  theme_minimal(base_size = 14)

Loughran uncertainty word rate per 1,000 words
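The rate is simply matched tokens divided by total tokens, scaled to 1,000. A worked example with hypothetical counts, plus a check of how the pipe expression used above is parsed (requires R 4.1 or later for |>):

```r
# Hypothetical counts, for illustration only.
n_uncertainty <- 42      # tokens matched to the uncertainty category
n_tokens      <- 12000   # all tokens in the corpus

# Uncertainty words per 1,000 tokens.
rate <- n_uncertainty / n_tokens * 1000
rate

# The native pipe binds more tightly than / and *, so the pipeline is
# evaluated first and the arithmetic is applied to its result.
piped <- c(10, 20, 30) |> length() / 2
piped
```

Normalising by corpus size matters here because the six Austen novels and the four Wells novels differ considerably in total length.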

Discussion

Do the corpora differ in the expected direction?

Yes, but more subtly than expected. The NRC emotion profile (§3.2) shows the clearest contrast: Wells registers higher fear and anger, while Austen shows higher trust and anticipation — consistent with the difference between invasion narratives and courtship narratives. Surprise is roughly equal, perhaps because both genres rely on plot twists and revelation.

The Bing polarity comparison (§3.1) is more surprising: both corpora are net-positive, and the gap between them is smaller than expected. This reflects a known limitation of unigram methods — high-frequency, neutral-to-positive words (“good,” “great,” “well”) dominate raw counts and pull every corpus toward positive regardless of genre.

What does the Loughran lexicon add?

The uncertainty rate comparison (§3.3) is the most distinctive finding from the extension lexicon. Wells uses more uncertainty language than Austen — words like “perhaps,” “appeared,” “seemed,” “possible,” and “might” appear at a higher rate in his science fiction. This makes intuitive sense: Wells’s protagonists are constantly reasoning about phenomena at the edge of human understanding. Austen’s characters may be socially uncertain, but they rarely confront epistemological uncertainty about the nature of reality itself.

The Loughran arc on The War of the Worlds (§2.5.4) produces more compressed swings than the Bing arc on the same text. Loughran’s positive and negative vocabulary was calibrated on financial prose and matches fewer fiction words overall. Fewer matches mean less signal, but also fewer false positives like “miss.”

Lexicon choice matters

Lexicon	Strength	Limitation in this context
AFINN	Graded intensity	Small vocabulary
Bing	Large vocabulary	Binary only; false positives (“miss”)
NRC	Rich emotion categories	Overlapping categories; inflated counts
Loughran	Unique uncertainty/litigious axes	Calibrated on finance; under-matches fiction

No single lexicon is correct. The most informative analysis uses multiple lexicons and treats disagreements between them as data rather than problems.

Limitations

No negation handling: “not good” scores the same as “good.”
Historical vocabulary: Some 19th-century words are absent from modern lexicons, or have shifted meaning since they were written.
Loughran genre mismatch: Financial calibration means many emotionally charged fiction words have no Loughran entry, reducing coverage.
Raw counts vs normalisation: Where books differ in length, proportions (as used throughout Part 3) are more meaningful than raw counts.
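The negation limitation is easy to reproduce. The sketch below scores a four-word phrase with a hypothetical two-word lexicon, then applies a deliberately naive fix that flips the sign of any word directly preceded by a negator. The lexicon, the negator list, and the one-word look-back are all assumptions for illustration; none of this is used in the analysis above.

```r
# Hypothetical lexicon and phrase.
toy_lexicon <- c(good = 1, bad = -1)
tokens      <- c("this", "is", "not", "good")

# Plain unigram scoring: "good" counts as +1 whatever precedes it.
unigram_score <- sum(toy_lexicon[tokens], na.rm = TRUE)

# Naive negation handling: look one word back and flip the sign.
negators <- c("not", "no", "never")
previous <- c("", head(tokens, -1))

scores <- unname(toy_lexicon[tokens])
scores[is.na(scores)] <- 0   # words without an entry score zero

flipped <- ifelse(previous %in% negators, -scores, scores)
negation_aware_score <- sum(flipped)

c(unigram = unigram_score, negation_aware = negation_aware_score)
```

A one-word look-back misses constructions such as “not very good,” so this is a demonstration of the problem rather than a solution.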

References

Liu, Bing. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65.

Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65.

Nielsen, Finn Årup. 2011. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” arXiv Preprint arXiv:1103.2903. https://arxiv.org/abs/1103.2903.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/.

--- title: "Sentiment Analysis with Text Mining in R" author: "Nana Kwasi Danquah" date: today format: html: toc: true toc-depth: 3 toc-title: "Contents" theme: cosmo highlight-style: github code-fold: show code-tools: true fig-width: 9 fig-height: 6 df-print: paged embed-resources: true execute: warning: false message: false bibliography: references.bib --- ## Overview This report has two parts. **Part 1** reproduces the primary sentiment analysis example from Chapter 2 of *Text Mining with R* [@silge2017text] using the Jane Austen corpus and the three built-in lexicons (AFINN, Bing, NRC). **Part 2** extends the analysis with a different corpus — four science-fiction novels by H.G. Wells downloaded from Project Gutenberg — and an additional sentiment lexicon: the **Loughran-McDonald** lexicon [@loughran2011liability], originally designed for financial documents. The central question driving the comparison is: **how does the emotional texture of Victorian domestic fiction (Austen) differ from early science fiction (Wells), and what does each lexicon reveal or conceal?** --- ## Setup ```{r setup} library(tidyverse) # data wrangling + ggplot2 library(tidytext) # tidy text mining library(textdata) # sentiment lexicons (AFINN, NRC, Loughran) library(janeaustenr) # base corpus library(gutenbergr) # extension corpus library(wordcloud) # word cloud visualisation library(reshape2) # acast() for comparison clouds library(scales) # percent_format() ``` --- ## Part 1 — Reproducing the Base Example > All code in this section is adapted directly from Chapter 2 of > *Text Mining with R: A Tidy Approach* by @silge2017text, available at > <https://www.tidytextmining.com/sentiment>. ### 1.1 The Three Sentiment Lexicons `tidytext` provides three English-language sentiment lexicons via `get_sentiments()`. 
Each encodes sentiment differently: - **AFINN** [@nielsen2011new] — integer scores from −5 (most negative) to +5 (most positive) - **Bing** [@liu2012sentiment] — binary classification: *positive* or *negative* - **NRC** [@mohammad2013crowdsourcing] — ten categories: *positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust* All three are based on **unigrams** (individual words) and do not account for negation or context. ```{r lexicons} get_sentiments("afinn") |> slice_head(n = 8) get_sentiments("bing") |> slice_head(n = 8) get_sentiments("nrc") |> slice_head(n = 8) ``` ### 1.2 Tidying the Jane Austen Corpus We load all six completed Austen novels from `janeaustenr` and convert to one-token-per-row format using `unnest_tokens()`. Line numbers and chapter markers are preserved for later windowed analysis. ```{r tidy-austen} tidy_books <- austen_books() |> group_by(book) |> mutate( linenumber = row_number(), chapter = cumsum(str_detect( text, regex("^chapter [\\divxlc]", ignore_case = TRUE) )) ) |> ungroup() |> unnest_tokens(word, text) tidy_books ``` ### 1.3 Most Common Joy Words in *Emma* (NRC Lexicon) We filter the NRC lexicon to the "joy" category and inner-join it with the tokenised text of *Emma* to find the most frequent joy-associated words. 
```{r nrc-joy} #| fig-cap: "Top 15 joy words in *Emma* — NRC lexicon" nrc_joy <- get_sentiments("nrc") |> filter(sentiment == "joy") tidy_books |> filter(book == "Emma") |> inner_join(nrc_joy, by = "word") |> count(word, sort = TRUE) |> slice_head(n = 15) |> mutate(word = reorder(word, n)) |> ggplot(aes(n, word)) + geom_col(fill = "#4e79a7") + labs( title = "Most Common Joy Words in Emma", subtitle = "NRC Lexicon", x = "Count", y = NULL ) + theme_minimal(base_size = 13) ``` ### 1.4 Sentiment Arc Across All Six Novels (Bing Lexicon) Each novel is sliced into 80-line windows; net sentiment (positive − negative word count) is computed per window and plotted as a bar chart, revealing the emotional trajectory of each narrative. ```{r bing-arc} #| fig-cap: "Sentiment arc across Jane Austen novels — Bing lexicon" #| fig-height: 8 jane_austen_sentiment <- tidy_books |> inner_join(get_sentiments("bing"), by = "word") |> count(book, index = linenumber %/% 80, sentiment) |> pivot_wider( names_from = sentiment, values_from = n, values_fill = 0 ) |> mutate(sentiment = positive - negative) ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) + geom_col(show.legend = FALSE) + facet_wrap(~book, ncol = 2, scales = "free_x") + scale_fill_brewer(palette = "Set2") + labs( title = "Sentiment Trajectory — Jane Austen Novels", subtitle = "Bing lexicon · 80-line rolling windows", x = "Narrative progress (chunk index)", y = "Net sentiment (positive − negative)" ) + theme_minimal(base_size = 12) ``` **Observation:** Every novel shows a broadly positive arc with dips during crisis points — the Wickham scandal in *Pride & Prejudice*, Marianne's illness in *Sense & Sensibility* — before resolving positively, consistent with social comedy conventions. ### 1.5 Comparing All Three Lexicons on *Pride & Prejudice* To see whether lexicon choice materially changes the story, we apply all three to *Pride & Prejudice* and plot the net sentiment arcs together. 
```{r three-lexicons} #| fig-cap: "AFINN, Bing, and NRC compared on *Pride & Prejudice*" #| fig-height: 7 pride_prejudice <- tidy_books |> filter(book == "Pride & Prejudice") # AFINN: numeric scores summed per window afinn_pp <- pride_prejudice |> inner_join(get_sentiments("afinn"), by = "word") |> group_by(index = linenumber %/% 80) |> summarise(sentiment = sum(value)) |> mutate(method = "AFINN") # Bing and NRC (positive/negative categories -> net count) bing_nrc_pp <- bind_rows( pride_prejudice |> inner_join(get_sentiments("bing"), by = "word") |> mutate(method = "Bing"), pride_prejudice |> inner_join( get_sentiments("nrc") |> filter(sentiment %in% c("positive", "negative")), by = "word" ) |> mutate(method = "NRC") ) |> count(method, index = linenumber %/% 80, sentiment) |> pivot_wider( names_from = sentiment, values_from = n, values_fill = 0 ) |> mutate(sentiment = positive - negative) bind_rows(afinn_pp, bing_nrc_pp) |> ggplot(aes(index, sentiment, fill = method)) + geom_col(show.legend = FALSE) + facet_wrap(~method, ncol = 1, scales = "free_y") + scale_fill_manual(values = c("#e15759", "#4e79a7", "#59a14f")) + labs( title = "Three Lexicons Compared — Pride & Prejudice", subtitle = "Each panel uses a different sentiment lexicon", x = "Narrative progress (chunk index)", y = "Net sentiment" ) + theme_minimal(base_size = 12) ``` **Observation:** All three lexicons agree on narrative shape — early optimism, a prolonged negative centre, positive resolution — but AFINN produces the largest absolute swings because it uses a continuous scale. NRC scores higher overall because its "positive" category is broader than Bing's. 
### 1.6 Most Common Positive and Negative Words (Bing) ```{r bing-top-words} #| fig-cap: "Top positive and negative words across all Austen novels — Bing" bing_word_counts <- tidy_books |> inner_join(get_sentiments("bing"), by = "word") |> count(word, sentiment, sort = TRUE) |> ungroup() bing_word_counts |> group_by(sentiment) |> slice_max(n, n = 10) |> ungroup() |> mutate(word = reorder(word, n)) |> ggplot(aes(n, word, fill = sentiment)) + geom_col(show.legend = FALSE) + facet_wrap(~sentiment, scales = "free_y") + scale_fill_manual(values = c("#e15759", "#4e79a7")) + labs( title = "Top Positive & Negative Words — Jane Austen", subtitle = "Bing lexicon", x = "Count", y = NULL ) + theme_minimal(base_size = 12) ``` **Note:** "Miss" ranks among negative words because Bing codes it as the verb "to miss," whereas in Austen it is almost always a honorific. This illustrates a classic limitation of unigram lexicons: **words are context-free**. ### 1.7 Comparison Word Cloud ```{r wordcloud} #| fig-cap: "Positive (blue) vs negative (red) word cloud — Bing lexicon" #| fig-height: 6 tidy_books |> inner_join(get_sentiments("bing"), by = "word") |> count(word, sentiment, sort = TRUE) |> acast(word ~ sentiment, value.var = "n", fill = 0) |> comparison.cloud( colors = c("#e15759", "#4e79a7"), max.words = 120, title.size = 1.5 ) ``` ### 1.8 Most Negative Chapter Across All Novels Which chapter of each novel has the highest proportion of negative words under the Bing lexicon? 
```{r most-negative-chapter}
bing_negative <- get_sentiments("bing") |>
  filter(sentiment == "negative")

# Total words per chapter (the denominator of the ratio)
word_counts <- tidy_books |>
  group_by(book, chapter) |>
  summarise(words = n(), .groups = "drop")

tidy_books |>
  semi_join(bing_negative, by = "word") |>
  group_by(book, chapter) |>
  summarise(negative_words = n(), .groups = "drop") |>
  left_join(word_counts, by = c("book", "chapter")) |>
  mutate(ratio = negative_words / words) |>
  filter(chapter != 0) |>
  group_by(book) |>
  slice_max(ratio, n = 1) |>
  ungroup() |>
  arrange(desc(ratio)) |>
  select(book, chapter, negative_words, words, ratio)
```

---

## Part 2 — Extension

### 2.1 Extension Corpus: H.G. Wells

**Rationale:** Austen's domestic social comedies are polite, bounded, and emotionally moderate. As a deliberate contrast, we use four H.G. Wells science-fiction novels dealing with invasion, mutation, time travel, and existential threat. Both were English novelists, but Austen wrote in the Regency era and Wells at the end of the Victorian era, and in entirely different registers.

```{r download-wells}
# Project Gutenberg ebook IDs, in the same order as the titles
wells_meta <- tibble(
  gutenberg_id = c(35, 36, 5230, 159),
  title = c(
    "The Time Machine",
    "The War of the Worlds",
    "The Invisible Man",
    "The Island of Doctor Moreau"
  )
)

# Download texts without meta_fields (which may not work with all mirrors)
wells_raw <- gutenberg_download(wells_meta$gutenberg_id)

# Join with our local metadata to get titles, then tokenise
tidy_wells <- wells_raw |>
  left_join(wells_meta, by = "gutenberg_id") |>
  group_by(title) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]+", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_wells |> count(title, sort = TRUE)
```

### 2.2 NRC Joy vs Fear — Austen and Wells Side by Side

```{r joy-fear-comparison}
#| fig-cap: "Joy vs fear word proportions: Austen vs Wells (NRC)"

nrc_joy_fear <- get_sentiments("nrc") |>
  filter(sentiment %in% c("joy", "fear"))

austen_jf <- tidy_books |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_jf <- tidy_wells |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_jf, wells_jf) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Joy vs Fear — Austen vs Wells",
    subtitle = "NRC lexicon · proportion of joy/fear matched words",
    x = NULL,
    y = "Proportion",
    fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```

### 2.3 Sentiment Arc — Wells Novels (Bing)

```{r bing-arc-wells}
#| fig-cap: "Sentiment arc across H.G. Wells novels — Bing lexicon"
#| fig-height: 7

# linenumber %/% 80 assigns each line to a fixed, non-overlapping 80-line chunk
wells_sentiment_bing <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(title, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(wells_sentiment_bing, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Sentiment Trajectory — H.G. Wells Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)
```

### 2.4 Most Common Fear Words by Wells Novel (NRC)

```{r nrc-fear-wells}
#| fig-cap: "Top fear words in each Wells novel — NRC lexicon"
#| fig-height: 7

nrc_fear <- get_sentiments("nrc") |>
  filter(sentiment == "fear")

tidy_wells |>
  inner_join(nrc_fear, by = "word") |>
  count(title, word, sort = TRUE) |>
  group_by(title) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  # reorder_within() + scale_y_reordered() sort words separately in each facet
  ggplot(aes(n, reorder_within(word, n, title), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free_y") +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Most Common Fear Words — H.G. Wells",
    subtitle = "NRC Lexicon",
    x = "Count",
    y = NULL
  ) +
  theme_minimal(base_size = 11)
```

### 2.5 Additional Lexicon: Loughran-McDonald

#### Background

The **Loughran-McDonald** lexicon [@loughran2011liability] was constructed from SEC 10-K annual reports to identify words with consistent sentiment signals in **financial** prose. It provides six categories:

| Category | Meaning in finance | Why it is interesting in fiction |
|---|---|---|
| **positive** | Favourable outlook | General optimism |
| **negative** | Unfavourable outlook | General pessimism |
| **uncertainty** | Hedging, speculation | Language of the unknown |
| **litigious** | Legal language | Conflict, authority |
| **constraining** | Obligation, restriction | Captivity, control |
| **superfluous** | Redundant filler | — |

Applying a financial lexicon to nineteenth-century fiction is deliberately unconventional. The goal is not to claim that Loughran is the *right* tool for fiction, but to use its unique categories, especially *uncertainty*, to surface linguistic patterns that Bing and NRC cannot detect.
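One way to make the genre mismatch concrete is to measure *coverage*: the share of tokens in a text that a lexicon matches at all. The sketch below is a minimal illustration on invented toy data. The word lists are not the real Bing or Loughran-McDonald lexicons, and `coverage()` is a hypothetical helper that does not belong to tidytext.

```r
library(dplyr)
library(tibble)

# Toy data, invented for illustration only: these are NOT the real Bing or
# Loughran-McDonald word lists, and coverage() is a hypothetical helper.
toy_tokens <- tibble(
  word = c("perhaps", "terror", "happy", "might",
           "the", "terror", "dreadful", "liability")
)
toy_general <- tibble(
  word      = c("terror", "happy", "dreadful"),
  sentiment = c("negative", "positive", "negative")
)
toy_finance <- tibble(
  word      = c("perhaps", "might", "liability"),
  sentiment = c("uncertainty", "uncertainty", "litigious")
)

# Share of tokens that have at least one entry in the lexicon.
# semi_join() keeps each token at most once, even when a lexicon lists the
# same word under several categories.
coverage <- function(tokens, lexicon) {
  matched <- semi_join(tokens, distinct(lexicon, word), by = "word")
  nrow(matched) / nrow(tokens)
}

coverage_general <- coverage(toy_tokens, toy_general)
coverage_finance <- coverage(toy_tokens, toy_finance)
print(c(general = coverage_general, finance = coverage_finance))
```

Assuming the objects used elsewhere in this report, the same helper could be applied as `coverage(tidy_wells, get_sentiments("loughran"))` and compared with `coverage(tidy_wells, get_sentiments("bing"))`. Using `semi_join()` rather than `inner_join()` is a deliberate choice here, so that a word listed under two categories is not counted twice.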
```{r loughran-overview}
loughran <- get_sentiments("loughran")

loughran |> count(sentiment, sort = TRUE)
```

#### 2.5.1 Loughran Category Profile — Wells Novels

```{r loughran-wells-profile}
#| fig-cap: "Loughran-McDonald category proportions — H.G. Wells novels"
#| fig-height: 6

wells_loughran <- tidy_wells |>
  inner_join(loughran, by = "word") |>
  count(title, sentiment) |>
  group_by(title) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(wells_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — H.G. Wells",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```

#### 2.5.2 Loughran Category Profile — Austen Novels

```{r loughran-austen-profile}
#| fig-cap: "Loughran-McDonald category proportions — Jane Austen novels"
#| fig-height: 7

austen_loughran <- tidy_books |>
  inner_join(loughran, by = "word") |>
  count(book, sentiment) |>
  group_by(book) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(austen_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — Jane Austen",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```

#### 2.5.3 Uncertainty Language — The Signature of Science Fiction

The *uncertainty* category is the most analytically interesting when applied to fiction.
Financial uncertainty words ("possible," "might," "appears," "uncertain," "approximately") map naturally onto the language of characters confronting the unknown.

```{r uncertainty-comparison}
#| fig-cap: "Top uncertainty words: Austen vs Wells (Loughran)"

loughran_uncertainty <- loughran |>
  filter(sentiment == "uncertainty")

austen_unc <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "Jane Austen")

wells_unc <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "H.G. Wells")

bind_rows(austen_unc, wells_unc) |>
  group_by(corpus) |>
  slice_max(n, n = 12) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, corpus), fill = corpus)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~corpus, scales = "free") +
  scale_y_reordered() +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Top Uncertainty Words — Austen vs Wells",
    subtitle = "Loughran-McDonald uncertainty category",
    x = "Count",
    y = NULL
  ) +
  theme_minimal(base_size = 12)
```

#### 2.5.4 Bing vs Loughran Arc — *The War of the Worlds*

```{r bing-vs-loughran-arc}
#| fig-cap: "Bing vs Loughran (positive − negative) — *The War of the Worlds*"
#| fig-height: 6

wotw <- tidy_wells |>
  filter(title == "The War of the Worlds")

wotw_bing <- wotw |>
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative, method = "Bing")

# pivot_wider only creates columns that exist in the data, so we
# explicitly add any missing polarity column before computing net
wotw_loughran <- wotw |>
  inner_join(
    loughran |> filter(sentiment %in% c("positive", "negative")),
    by = "word",
    relationship = "many-to-many"
  ) |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  (function(df) {
    if (!"positive" %in% names(df)) df$positive <- 0L
    if (!"negative" %in% names(df)) df$negative <- 0L
    df
  })() |>
  mutate(net = positive - negative, method = "Loughran")

bind_rows(
  wotw_bing     |> select(index, net, method),
  wotw_loughran |> select(index, net, method)
) |>
  ggplot(aes(index, net, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b")) +
  labs(
    title    = "Bing vs Loughran — The War of the Worlds",
    subtitle = "Net sentiment (positive − negative) per 80-line chunk",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)
```

---

## Part 3 — Cross-Corpus Comparison

### 3.1 Overall Bing Polarity: Austen vs Wells

```{r polarity-comparison}
#| fig-cap: "Positive vs negative word balance: Austen vs Wells (Bing)"

austen_pol <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_pol <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_pol, wells_pol) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Positive vs Negative Word Balance",
    subtitle = "Bing lexicon — Austen vs Wells",
    x = NULL,
    y = "Proportion of matched words",
    fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```

### 3.2 Full NRC Emotion Profile: Austen vs Wells

```{r nrc-full-comparison}
#| fig-cap: "Full NRC emotion profile: Austen vs Wells"
#| fig-height: 6

nrc_all <- get_sentiments("nrc")

austen_nrc <- tidy_books |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_nrc <- tidy_wells |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_nrc, wells_nrc) |>
  ggplot(aes(reorder(sentiment, proportion), proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "NRC Emotion Profiles — Austen vs Wells",
    subtitle = "Proportion of matched words in each category",
    x = NULL,
    y = "Proportion",
    fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```

### 3.3 Uncertainty Rate per 1,000 Words: Austen vs Wells

```{r uncertainty-rate}
#| fig-cap: "Loughran uncertainty word rate per 1,000 words"

# Matched uncertainty tokens divided by all tokens, scaled to per 1,000 words
austen_unc_rate <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow() / nrow(tidy_books) * 1000

wells_unc_rate <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow() / nrow(tidy_wells) * 1000

tibble(
  corpus = c("Jane Austen", "H.G. Wells"),
  rate   = c(austen_unc_rate, wells_unc_rate)
) |>
  ggplot(aes(corpus, rate, fill = corpus)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Uncertainty Word Rate per 1,000 Words",
    subtitle = "Loughran-McDonald uncertainty category",
    x = NULL,
    y = "Uncertainty words per 1,000 words"
  ) +
  theme_minimal(base_size = 14)
```

---

## Discussion

### Do the corpora differ in the expected direction?

Yes, but more subtly than expected.

The **NRC emotion profile** (§3.2) shows the clearest contrast: Wells registers higher *fear* and *anger*, while Austen shows higher *trust* and *anticipation*, consistent with the difference between invasion narratives and courtship narratives. *Surprise* is roughly equal, perhaps because both genres rely on plot twists and revelation.

The **Bing polarity** comparison (§3.1) is more surprising: both corpora are net-positive, and the gap between them is smaller than expected. This reflects a known limitation of unigram methods: high-frequency, neutral-to-positive words ("good," "great," "well") dominate raw counts and pull every corpus toward positive regardless of genre.

### What does the Loughran lexicon add?

The **uncertainty rate** comparison (§3.3) is the most distinctive finding from the extension lexicon. Wells uses more uncertainty language than Austen: words like "perhaps," "appeared," "seemed," "possible," and "might" appear at a higher rate in his science fiction. This makes intuitive sense: Wells's protagonists are constantly reasoning about phenomena at the edge of human understanding. Austen's characters may be socially uncertain, but they rarely confront epistemological uncertainty about the nature of reality itself.

The **Loughran arc** on *The War of the Worlds* (§2.5.4) produces more compressed swings than the Bing arc on the same text.
Loughran's positive and negative vocabulary was calibrated on financial prose and matches fewer fiction words overall. Fewer matches mean less signal, but also fewer false positives like "miss."

### Lexicon choice matters

| Lexicon | Strength | Limitation in this context |
|---|---|---|
| AFINN | Graded intensity | Small vocabulary |
| Bing | Large vocabulary | Binary only; false positives ("miss") |
| NRC | Rich emotion categories | Overlapping categories; inflated counts |
| Loughran | Unique uncertainty/litigious axes | Calibrated on finance; under-matches fiction |

No single lexicon is correct. The most informative analysis uses multiple lexicons and treats disagreements between them as data rather than problems.

### Limitations

1. **No negation handling:** "not good" scores the same as "good."
2. **Historical vocabulary:** Some 19th-century words are absent from modern lexicons, or have shifted meaning since they were written.
3. **Loughran genre mismatch:** Financial calibration means many emotionally charged fiction words have no Loughran entry, reducing coverage.
4. **Raw counts vs normalisation:** Where books differ in length, proportions (as used throughout Part 3) are more meaningful than raw counts.

---

## References

::: {#refs}
:::