---
title: "Sentiment Analysis with Text Mining in R"
author: "Nana Kwasi Danquah"
date: today
format:
  html:
    toc: true
    toc-depth: 3
    toc-title: "Contents"
    theme: cosmo
    highlight-style: github
    code-fold: show
    code-tools: true
    fig-width: 9
    fig-height: 6
    df-print: paged
    embed-resources: true
execute:
  warning: false
  message: false
bibliography: references.bib
---
## Overview
This report has three parts. **Part 1** reproduces the primary sentiment analysis
example from Chapter 2 of *Text Mining with R* [@silge2017text] using the
Jane Austen corpus and three general-purpose lexicons (AFINN, Bing, NRC).
**Part 2** extends the analysis with a different corpus (four science-fiction
novels by H.G. Wells, downloaded from Project Gutenberg) and an additional
sentiment lexicon: the **Loughran-McDonald** lexicon [@loughran2011liability],
originally designed for financial documents. **Part 3** compares the two
corpora directly.

The central question driving the comparison is: **how does the emotional
texture of Regency domestic fiction (Austen) differ from early science
fiction (Wells), and what does each lexicon reveal or conceal?**

---
## Setup
```{r setup}
library(tidyverse) # data wrangling + ggplot2
library(tidytext) # tidy text mining
library(textdata) # sentiment lexicons (AFINN, NRC, Loughran)
library(janeaustenr) # base corpus
library(gutenbergr) # extension corpus
library(wordcloud) # word cloud visualisation
library(reshape2) # acast() for comparison clouds
library(scales) # percent_format()
```
---
## Part 1 — Reproducing the Base Example
> All code in this section is adapted directly from Chapter 2 of
> *Text Mining with R: A Tidy Approach* by @silge2017text, available at
> <https://www.tidytextmining.com/sentiment>.
### 1.1 The Three Sentiment Lexicons
`tidytext` gives access to several English-language sentiment lexicons via
`get_sentiments()`. Part 1 uses three general-purpose ones, each encoding
sentiment differently:
- **AFINN** [@nielsen2011new] — integer scores from −5 (most negative) to +5
(most positive)
- **Bing** [@liu2012sentiment] — binary classification: *positive* or
*negative*
- **NRC** [@mohammad2013crowdsourcing] — ten categories: *positive, negative,
anger, anticipation, disgust, fear, joy, sadness, surprise, trust*
All three are based on **unigrams** (individual words) and do not account for
negation or context.
```{r lexicons}
get_sentiments("afinn") |> slice_head(n = 8)
get_sentiments("bing") |> slice_head(n = 8)
get_sentiments("nrc") |> slice_head(n = 8)
```
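To make the unigram limitation concrete, here is a minimal base-R sketch. The three-word `toy_lexicon` and the helper `score_unigrams()` are hypothetical stand-ins for illustration, not part of `tidytext` or of any real lexicon.

```r
# Hypothetical AFINN-style lexicon: word -> integer score (invented for illustration)
toy_lexicon <- c(good = 3, bad = -3, happy = 3)

# Score a sentence by summing the scores of its matched unigrams
score_unigrams <- function(sentence, lexicon) {
  tokens <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  # Words without a lexicon entry index to NA and are dropped by na.rm
  sum(lexicon[tokens], na.rm = TRUE)
}

score_plain   <- score_unigrams("The dinner was good", toy_lexicon)
score_negated <- score_unigrams("The dinner was not good", toy_lexicon)

# Both sentences get the same score because "not" has no lexicon entry
c(plain = score_plain, negated = score_negated)
```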
### 1.2 Tidying the Jane Austen Corpus
We load all six completed Austen novels from `janeaustenr` and convert to
one-token-per-row format using `unnest_tokens()`. Line numbers and chapter
markers are preserved for later windowed analysis.
```{r tidy-austen}
tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_books
```
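The chapter counter relies on a running sum over a logical vector: each line that looks like a chapter heading increments the count, and every following line inherits it. A small base-R sketch of the same idea, using invented lines (`toy_lines`) in place of the real `text` column:

```r
# Hypothetical lines standing in for the `text` column of one book
toy_lines <- c(
  "A NOVEL",
  "",
  "Chapter 1",
  "First line of the first chapter",
  "Second line of the first chapter",
  "Chapter 2",
  "First line of the second chapter"
)

# Same pattern as in the chunk above, written for base R's grepl()
is_heading <- grepl("^chapter [\\divxlc]", toy_lines,
                    ignore.case = TRUE, perl = TRUE)

toy_chapters <- data.frame(
  linenumber = seq_along(toy_lines),
  chapter    = cumsum(is_heading),  # front matter stays at chapter 0
  text       = toy_lines
)
toy_chapters
```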
### 1.3 Most Common Joy Words in *Emma* (NRC Lexicon)
We filter the NRC lexicon to the "joy" category and inner-join it with the
tokenised text of *Emma* to find the most frequent joy-associated words.
```{r nrc-joy}
#| fig-cap: "Top 15 joy words in *Emma* — NRC lexicon"
nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy, by = "word") |>
  count(word, sort = TRUE) |>
  slice_head(n = 15) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word)) +
  geom_col(fill = "#4e79a7") +
  labs(
    title = "Most Common Joy Words in Emma",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 13)
```
### 1.4 Sentiment Arc Across All Six Novels (Bing Lexicon)
Each novel is sliced into 80-line windows; net sentiment (positive − negative
word count) is computed per window and plotted as a bar chart, revealing the
emotional trajectory of each narrative.
```{r bing-arc}
#| fig-cap: "Sentiment arc across Jane Austen novels — Bing lexicon"
#| fig-height: 8
jane_austen_sentiment <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Sentiment Trajectory — Jane Austen Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)
```
**Observation:** Every novel shows a broadly positive arc with dips during
crisis points — the Wickham scandal in *Pride & Prejudice*, Marianne's illness
in *Sense & Sensibility* — before resolving positively, consistent with social
comedy conventions.
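The 80-line chunking behind this arc is plain integer division, so the chunks are consecutive and non-overlapping rather than rolling. A short sketch with invented line numbers shows where the boundaries fall:

```r
# Hypothetical line numbers chosen to sit on either side of the chunk boundaries
toy_linenumber <- c(1, 79, 80, 159, 160, 161)

# Integer division assigns each line to exactly one chunk
toy_index <- toy_linenumber %/% 80

data.frame(linenumber = toy_linenumber, index = toy_index)
# Because row_number() starts at 1, chunk 0 holds lines 1-79 (79 lines),
# while chunk 1 holds lines 80-159 and chunk 2 holds lines 160-239.
```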
### 1.5 Comparing All Three Lexicons on *Pride & Prejudice*
To see whether lexicon choice materially changes the story, we apply all three
to *Pride & Prejudice* and plot the net sentiment arcs together.
```{r three-lexicons}
#| fig-cap: "AFINN, Bing, and NRC compared on *Pride & Prejudice*"
#| fig-height: 7
pride_prejudice <- tidy_books |>
  filter(book == "Pride & Prejudice")

# AFINN: numeric scores summed per window
afinn_pp <- pride_prejudice |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(index = linenumber %/% 80) |>
  summarise(sentiment = sum(value)) |>
  mutate(method = "AFINN")

# Bing and NRC (positive/negative categories -> net count)
bing_nrc_pp <- bind_rows(
  pride_prejudice |>
    inner_join(get_sentiments("bing"), by = "word") |>
    mutate(method = "Bing"),
  pride_prejudice |>
    inner_join(
      get_sentiments("nrc") |>
        filter(sentiment %in% c("positive", "negative")),
      by = "word"
    ) |>
    mutate(method = "NRC")
) |>
  count(method, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

bind_rows(afinn_pp, bing_nrc_pp) |>
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7", "#59a14f")) +
  labs(
    title = "Three Lexicons Compared — Pride & Prejudice",
    subtitle = "Each panel uses a different sentiment lexicon",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)
```
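**Observation:** All three lexicons agree on narrative shape (early optimism,
a prolonged negative centre, positive resolution), but AFINN produces the
largest absolute swings because it sums graded integer scores (−5 to +5)
rather than counting each matched word as ±1. NRC scores higher overall
because Bing's word list has a higher ratio of negative to positive entries
than NRC's, so Bing pulls the net score further down on the same text.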
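The difference in scale between a summed score and a net word count can be checked by hand. The sketch below uses four invented matched words with made-up AFINN-style values; it is a toy calculation, not output from the real lexicons.

```r
# Hypothetical matched words from one chunk, with invented AFINN-style scores
toy_chunk <- data.frame(
  word  = c("happy", "love", "superb", "bad"),
  value = c(3, 3, 5, -3)
)

# AFINN-style net: sum the graded scores
afinn_style_net <- sum(toy_chunk$value)

# Bing-style net: every word counts as +1 or -1, whatever its intensity
bing_style_net <- sum(sign(toy_chunk$value))

c(afinn_style = afinn_style_net, bing_style = bing_style_net)
```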
### 1.6 Most Common Positive and Negative Words (Bing)
```{r bing-top-words}
#| fig-cap: "Top positive and negative words across all Austen novels — Bing"
bing_word_counts <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  ungroup()

bing_word_counts |>
  group_by(sentiment) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "Top Positive & Negative Words — Jane Austen",
    subtitle = "Bing lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)
```
**Note:** "Miss" ranks among the negative words because Bing codes it as the
verb "to miss," whereas in Austen it is almost always an honorific. This
illustrates a classic limitation of unigram lexicons: **words are
context-free**.
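One possible mitigation is to treat such a word as a custom stop word and drop it before scoring. The sketch below is a base-R illustration under stated assumptions: `toy_tokens`, `toy_bing`, `custom_stop_words` and `count_sentiment()` are hypothetical names, and the three-row lexicon is an invented subset.

```r
# Hypothetical tokens and a three-word stand-in for the Bing lexicon
toy_tokens <- c("miss", "bennet", "was", "happy",
                "miss", "bingley", "felt", "miserable")
toy_bing <- data.frame(
  word      = c("miss", "happy", "miserable"),
  sentiment = c("negative", "positive", "negative")
)

# Count matched tokens per polarity; unmatched tokens become NA and are ignored
count_sentiment <- function(tokens, lexicon) {
  matched <- lexicon$sentiment[match(tokens, lexicon$word)]
  counts <- table(factor(matched, levels = c("negative", "positive")))
  setNames(as.vector(counts), names(counts))
}

with_miss <- count_sentiment(toy_tokens, toy_bing)

# Drop the honorific before matching
custom_stop_words <- c("miss")
without_miss <- count_sentiment(
  toy_tokens[!toy_tokens %in% custom_stop_words],
  toy_bing
)

rbind(with_miss, without_miss)
```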
### 1.7 Comparison Word Cloud
```{r wordcloud}
#| fig-cap: "Positive (blue) vs negative (red) word cloud — Bing lexicon"
#| fig-height: 6
tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |>
  comparison.cloud(
    colors = c("#e15759", "#4e79a7"),
    max.words = 120,
    title.size = 1.5
  )
```
### 1.8 Most Negative Chapter Across All Novels
Which chapter of each novel has the highest proportion of negative words
under the Bing lexicon?
```{r most-negative-chapter}
bing_negative <- get_sentiments("bing") |>
  filter(sentiment == "negative")

word_counts <- tidy_books |>
  group_by(book, chapter) |>
  summarise(words = n(), .groups = "drop")

tidy_books |>
  semi_join(bing_negative, by = "word") |>
  group_by(book, chapter) |>
  summarise(negative_words = n(), .groups = "drop") |>
  left_join(word_counts, by = c("book", "chapter")) |>
  mutate(ratio = negative_words / words) |>
  filter(chapter != 0) |>
  group_by(book) |>
  slice_max(ratio, n = 1) |>
  ungroup() |>
  arrange(desc(ratio)) |>
  select(book, chapter, negative_words, words, ratio)
```
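The ratio in the chunk above is simply negative matches divided by all tokens in the chapter. A base-R sketch on invented data (`toy_tokens` and `toy_negative` are hypothetical) reproduces the arithmetic; unlike the `semi_join()` version, it also keeps chapters with zero negative words, with a ratio of 0.

```r
# Hypothetical tokens tagged with their chapter
toy_tokens <- data.frame(
  chapter = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  word    = c("a", "sad", "quiet", "day",
              "the", "worst", "most", "dreadful", "news")
)

# Invented stand-in for the negative half of the Bing lexicon
toy_negative <- c("sad", "worst", "dreadful")

words_per_chapter    <- tapply(toy_tokens$word, toy_tokens$chapter, length)
negative_per_chapter <- tapply(toy_tokens$word %in% toy_negative,
                               toy_tokens$chapter, sum)

# Share of each chapter's tokens that match a negative entry
toy_ratio <- negative_per_chapter / words_per_chapter
toy_ratio
```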
---
## Part 2 — Extension
### 2.1 Extension Corpus: H.G. Wells
**Rationale:** Austen's domestic social comedies are polite, bounded, and
emotionally moderate. As a deliberate contrast, we use four H.G. Wells
science-fiction novels dealing with invasion, mutation, time travel, and
existential threat. Both authors were English novelists, but Austen wrote in
the Regency era and Wells at the end of the Victorian one, and they worked in
entirely different registers.
```{r download-wells}
# Project Gutenberg ebook IDs; 159 is The Island of Doctor Moreau
wells_meta <- tibble(
  gutenberg_id = c(35, 36, 5230, 159),
  title = c(
    "The Time Machine",
    "The War of the Worlds",
    "The Invisible Man",
    "The Island of Doctor Moreau"
  )
)

# Download texts (note: meta_fields may not work with all mirrors)
wells_raw <- gutenberg_download(wells_meta$gutenberg_id)

# Join with our local metadata to get titles
tidy_wells <- wells_raw |>
  left_join(wells_meta, by = "gutenberg_id") |>
  group_by(title) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]+", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_wells |> count(title, sort = TRUE)
```
### 2.2 NRC Joy vs Fear — Austen and Wells Side by Side
```{r joy-fear-comparison}
#| fig-cap: "Joy vs fear word proportions: Austen vs Wells (NRC)"
nrc_joy_fear <- get_sentiments("nrc") |>
  filter(sentiment %in% c("joy", "fear"))

austen_jf <- tidy_books |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_jf <- tidy_wells |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_jf, wells_jf) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "Joy vs Fear — Austen vs Wells",
    subtitle = "NRC lexicon · proportion of joy/fear matched words",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```
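Note that `proportion = n / sum(n)` is computed over the joy and fear matches only, so within each corpus the two bars sum to 1. The chart therefore shows the balance between the two emotions, not how often either occurs in the full text. A sketch with invented counts (`toy_counts` is hypothetical):

```r
# Hypothetical joy/fear match counts for two corpora
toy_counts <- data.frame(
  corpus    = c("Corpus A", "Corpus A", "Corpus B", "Corpus B"),
  sentiment = c("fear", "joy", "fear", "joy"),
  n         = c(200, 600, 450, 150)
)

# Divide each count by the total for its own corpus
toy_counts$proportion <- toy_counts$n /
  ave(toy_counts$n, toy_counts$corpus, FUN = sum)

toy_counts

# Within each corpus the proportions add up to 1
proportion_totals <- tapply(toy_counts$proportion, toy_counts$corpus, sum)
proportion_totals
```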
### 2.3 Sentiment Arc — Wells Novels (Bing)
```{r bing-arc-wells}
#| fig-cap: "Sentiment arc across H.G. Wells novels — Bing lexicon"
#| fig-height: 7
wells_sentiment_bing <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(title, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(wells_sentiment_bing, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title = "Sentiment Trajectory — H.G. Wells Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)
```
### 2.4 Most Common Fear Words by Wells Novel (NRC)
```{r nrc-fear-wells}
#| fig-cap: "Top fear words in each Wells novel — NRC lexicon"
#| fig-height: 7
nrc_fear <- get_sentiments("nrc") |> filter(sentiment == "fear")

tidy_wells |>
  inner_join(nrc_fear, by = "word") |>
  count(title, word, sort = TRUE) |>
  group_by(title) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, title), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free_y") +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title = "Most Common Fear Words — H.G. Wells",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 11)
```
### 2.5 Additional Lexicon: Loughran-McDonald
#### Background
The **Loughran-McDonald** lexicon [@loughran2011liability] was constructed
from SEC 10-K annual reports to identify words with consistent sentiment
signals in **financial** prose. It provides six categories:
| Category | Meaning in finance | Why it is interesting in fiction |
|---|---|---|
| **positive** | Favourable outlook | General optimism |
| **negative** | Unfavourable outlook | General pessimism |
| **uncertainty** | Hedging, speculation | Language of the unknown |
| **litigious** | Legal language | Conflict, authority |
| **constraining** | Obligation, restriction | Captivity, control |
| **superfluous** | Redundant filler | — |
Applying a financial lexicon to nineteenth-century fiction is deliberately
unconventional. The goal is not to claim Loughran is the *right* tool for
fiction, but to use its unique categories, especially *uncertainty*, to
surface linguistic patterns that Bing and NRC cannot detect.
```{r loughran-overview}
loughran <- get_sentiments("loughran")
loughran |> count(sentiment, sort = TRUE)
```
#### 2.5.1 Loughran Category Profile — Wells Novels
```{r loughran-wells-profile}
#| fig-cap: "Loughran-McDonald category proportions — H.G. Wells novels"
#| fig-height: 6
wells_loughran <- tidy_wells |>
  inner_join(loughran, by = "word") |>
  count(title, sentiment) |>
  group_by(title) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(wells_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Loughran-McDonald Category Proportions — H.G. Wells",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```
#### 2.5.2 Loughran Category Profile — Austen Novels
```{r loughran-austen-profile}
#| fig-cap: "Loughran-McDonald category proportions — Jane Austen novels"
#| fig-height: 7
austen_loughran <- tidy_books |>
  inner_join(loughran, by = "word") |>
  count(book, sentiment) |>
  group_by(book) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(austen_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Loughran-McDonald Category Proportions — Jane Austen",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```
#### 2.5.3 Uncertainty Language — The Signature of Science Fiction
The *uncertainty* category is the most analytically interesting when applied
to fiction. Financial uncertainty words ("possible," "might," "appears,"
"uncertain," "approximately") map naturally onto the language of characters
confronting the unknown.
```{r uncertainty-comparison}
#| fig-cap: "Top uncertainty words: Austen vs Wells (Loughran)"
loughran_uncertainty <- loughran |> filter(sentiment == "uncertainty")

austen_unc <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "Jane Austen")

wells_unc <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "H.G. Wells")

bind_rows(austen_unc, wells_unc) |>
  group_by(corpus) |>
  slice_max(n, n = 12) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, corpus), fill = corpus)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~corpus, scales = "free") +
  scale_y_reordered() +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "Top Uncertainty Words — Austen vs Wells",
    subtitle = "Loughran-McDonald uncertainty category",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)
```
#### 2.5.4 Bing vs Loughran Arc — *The War of the Worlds*
```{r bing-vs-loughran-arc}
#| fig-cap: "Bing vs Loughran (positive − negative) — *The War of the Worlds*"
#| fig-height: 6
wotw <- tidy_wells |> filter(title == "The War of the Worlds")

wotw_bing <- wotw |>
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative, method = "Bing")

# pivot_wider only creates columns that exist in the data, so we
# explicitly add any missing polarity column before computing net
wotw_loughran <- wotw |>
  inner_join(
    loughran |> filter(sentiment %in% c("positive", "negative")),
    by = "word",
    relationship = "many-to-many"
  ) |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  (function(df) {
    if (!"positive" %in% names(df)) df$positive <- 0L
    if (!"negative" %in% names(df)) df$negative <- 0L
    df
  })() |>
  mutate(net = positive - negative, method = "Loughran")

bind_rows(
  wotw_bing |> select(index, net, method),
  wotw_loughran |> select(index, net, method)
) |>
  ggplot(aes(index, net, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b")) +
  labs(
    title = "Bing vs Loughran — The War of the Worlds",
    subtitle = "Net sentiment (positive − negative) per 80-line chunk",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)
```
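The guard in the chunk above exists because a polarity that never occurs produces no column after widening. An alternative technique, sketched here in base R on invented data, is to fix the factor levels before tabulating so that both columns always exist. The names `toy_matches` and `toy_wide` are hypothetical, and this is a swapped-in approach rather than the method used in the chunk.

```r
# Hypothetical matches in which no positive word ever occurs
toy_matches <- data.frame(
  index     = c(0, 0, 0, 1, 1),
  sentiment = c("negative", "negative", "negative", "negative", "negative")
)

# Declaring both levels up front keeps the empty "positive" column
toy_matches$sentiment <- factor(
  toy_matches$sentiment,
  levels = c("negative", "positive")
)

# Cross-tabulate chunk index against polarity, then widen to a data frame
toy_wide <- as.data.frame.matrix(
  table(toy_matches$index, toy_matches$sentiment)
)
toy_wide$net <- toy_wide$positive - toy_wide$negative
toy_wide
```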
---
## Part 3 — Cross-Corpus Comparison
### 3.1 Overall Bing Polarity: Austen vs Wells
```{r polarity-comparison}
#| fig-cap: "Positive vs negative word balance: Austen vs Wells (Bing)"
austen_pol <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_pol <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_pol, wells_pol) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "Positive vs Negative Word Balance",
    subtitle = "Bing lexicon — Austen vs Wells",
    x = NULL, y = "Proportion of matched words", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```
### 3.2 Full NRC Emotion Profile: Austen vs Wells
```{r nrc-full-comparison}
#| fig-cap: "Full NRC emotion profile: Austen vs Wells"
#| fig-height: 6
nrc_all <- get_sentiments("nrc")

austen_nrc <- tidy_books |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_nrc <- tidy_wells |>
  inner_join(nrc_all, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_nrc, wells_nrc) |>
  ggplot(aes(reorder(sentiment, proportion), proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "NRC Emotion Profiles — Austen vs Wells",
    subtitle = "Proportion of matched words in each category",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)
```
### 3.3 Uncertainty Rate per 1,000 Words: Austen vs Wells
```{r uncertainty-rate}
#| fig-cap: "Loughran uncertainty word rate per 1,000 words"
# Count uncertainty matches first, then convert to a rate per 1,000 tokens
austen_unc_matches <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow()
austen_unc_rate <- austen_unc_matches / nrow(tidy_books) * 1000

wells_unc_matches <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow()
wells_unc_rate <- wells_unc_matches / nrow(tidy_wells) * 1000

tibble(
  corpus = c("Jane Austen", "H.G. Wells"),
  rate = c(austen_unc_rate, wells_unc_rate)
) |>
  ggplot(aes(corpus, rate, fill = corpus)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title = "Uncertainty Word Rate per 1,000 Words",
    subtitle = "Loughran-McDonald uncertainty category",
    x = NULL,
    y = "Uncertainty words per 1,000 words"
  ) +
  theme_minimal(base_size = 14)
```
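The rate itself is matched tokens divided by all tokens, scaled to 1,000. A toy calculation on ten invented tokens (`toy_tokens` and `toy_uncertainty` are hypothetical, not drawn from either corpus or from the real lexicon) makes the arithmetic explicit:

```r
# Hypothetical ten-token passage
toy_tokens <- c("it", "might", "be", "possible", "that",
                "the", "thing", "was", "alive", "perhaps")

# Invented stand-in for the uncertainty word list
toy_uncertainty <- c("might", "possible", "perhaps")

toy_matches <- sum(toy_tokens %in% toy_uncertainty)   # 3 matches
toy_rate    <- toy_matches / length(toy_tokens) * 1000

# 3 matches in 10 tokens scales to 300 per 1,000 tokens
toy_rate
```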
---
## Discussion
### Do the corpora differ in the expected direction?
Yes, but more subtly than expected. The **NRC emotion profile** (§3.2) shows
the clearest contrast: Wells registers higher *fear* and *anger*, while Austen
shows higher *trust* and *anticipation* — consistent with the difference
between invasion narratives and courtship narratives. *Surprise* is roughly
equal, perhaps because both genres rely on plot twists and revelation.
The **Bing polarity** comparison (§3.1) is more surprising: both corpora are
net-positive, and the gap between them is smaller than expected. This reflects
a known limitation of unigram methods — high-frequency, neutral-to-positive
words ("good," "great," "well") dominate raw counts and pull every corpus
toward positive regardless of genre.
### What does the Loughran lexicon add?
The **uncertainty rate** comparison (§3.3) is the most distinctive finding
from the extension lexicon. Wells uses more uncertainty language than Austen —
words like "perhaps," "appeared," "seemed," "possible," and "might" appear at
a higher rate in his science fiction. This makes intuitive sense: Wells's
protagonists are constantly reasoning about phenomena at the edge of human
understanding. Austen's characters may be socially uncertain, but they rarely
confront epistemological uncertainty about the nature of reality itself.
The **Loughran arc** on *The War of the Worlds* (§2.5.4) produces more
compressed swings than the Bing arc on the same text. Loughran's positive and
negative vocabulary was calibrated on financial prose and matches fewer fiction
words overall: fewer matches means less signal, but also fewer spurious
matches such as "miss."
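Coverage, meaning the share of tokens that find any lexicon entry, is one way to quantify this under-matching. The sketch below assumes two invented word lists and a hypothetical helper `lexicon_coverage()`; it illustrates the measure, not a result from this report.

```r
# Hypothetical ten-token passage of fiction
toy_tokens <- c("the", "martians", "were", "terrible", "and",
                "the", "loss", "was", "dreadful", "indeed")

# Invented stand-ins for a general-purpose and a finance-oriented word list
toy_general_lexicon <- c("terrible", "dreadful", "loss")
toy_finance_lexicon <- c("loss", "litigation", "impairment")

# Share of tokens that match at least one lexicon entry
lexicon_coverage <- function(tokens, lexicon_words) {
  mean(tokens %in% lexicon_words)
}

coverage_general <- lexicon_coverage(toy_tokens, toy_general_lexicon)
coverage_finance <- lexicon_coverage(toy_tokens, toy_finance_lexicon)

c(general = coverage_general, finance = coverage_finance)
```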
### Lexicon choice matters
| Lexicon | Strength | Limitation in this context |
|---|---|---|
| AFINN | Graded intensity | Small vocabulary |
| Bing | Large vocabulary | Binary only; spurious matches ("miss") |
| NRC | Rich emotion categories | Overlapping categories; inflated counts |
| Loughran | Unique uncertainty/litigious axes | Calibrated on finance; under-matches fiction |
No single lexicon is correct. The most informative analysis uses multiple
lexicons and treats disagreements between them as data rather than problems.
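The "inflated counts" entry for NRC follows from the join itself: a word listed under several categories yields one row per category. A base-R sketch with `merge()` on invented data (`toy_tokens` and `toy_nrc` are hypothetical, and the four-row lexicon is a made-up excerpt) shows the row count growing:

```r
# Hypothetical tokens: one word appears twice
toy_tokens <- data.frame(word = c("abandon", "garden", "abandon"))

# Invented excerpt in which one word carries three categories
toy_nrc <- data.frame(
  word      = c("abandon", "abandon", "abandon", "garden"),
  sentiment = c("fear", "negative", "sadness", "joy")
)

# Each token is paired with every category its word belongs to
toy_joined <- merge(toy_tokens, toy_nrc, by = "word")

# 3 tokens become 7 rows: 2 x 3 for "abandon" plus 1 for "garden"
c(tokens = nrow(toy_tokens), joined_rows = nrow(toy_joined))
```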
### Limitations
1. **No negation handling:** "not good" scores the same as "good."
2. **Historical vocabulary:** Some 19th-century words are absent from modern
lexicons, or have shifted meaning since they were written.
3. **Loughran genre mismatch:** Financial calibration means many emotionally
charged fiction words have no Loughran entry, reducing coverage.
4. **Raw counts vs normalisation:** Where books differ in length, proportions
(as used throughout Part 3) are more meaningful than raw counts.
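As a rough illustration of what negation handling could look like, the sketch below flips the sign of a score when the preceding token is a negator. Everything here is an assumption for illustration: the two-word `toy_lexicon`, the `negators` list, and the helper `score_with_negation()` are hypothetical, and a one-token lookback is a deliberately crude heuristic rather than a method used in this report.

```r
# Hypothetical AFINN-style lexicon and negator list (invented for illustration)
toy_lexicon <- c(good = 3, bad = -3)
negators <- c("not", "no", "never")

score_with_negation <- function(sentence, lexicon, negators) {
  tokens <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) == 0) return(0)

  # Look up each token; words without an entry score 0
  scores <- unname(lexicon[tokens])
  scores[is.na(scores)] <- 0

  # Flip the sign when the immediately preceding token is a negator
  previous <- c("", head(tokens, -1))
  flip <- previous %in% negators
  sum(ifelse(flip, -scores, scores))
}

negation_plain   <- score_with_negation("The dinner was good", toy_lexicon, negators)
negation_flipped <- score_with_negation("The dinner was not good", toy_lexicon, negators)

c(plain = negation_plain, negated = negation_flipped)
```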
---
## References
::: {#refs}
:::