Sentiment Analysis with Text Mining in R — Approach
Approach
This report reproduces the sentiment analysis example from Chapter 2 of Text Mining with R [@silge2017text] and extends it in two directions: a different corpus (H.G. Wells via Project Gutenberg) and an additional lexicon (Loughran-McDonald).
Step 1 — Base Example
The base example follows Chapter 2 of @silge2017text exactly. It uses the janeaustenr package as the corpus and demonstrates three core workflows:
- NRC lexicon — most common joy words in Emma
- Bing lexicon — sentiment arc across all six Austen novels in 80-line chunks
- Three-lexicon comparison — AFINN, Bing, and NRC applied to Pride & Prejudice side by side
All code in this section is adapted directly from @silge2017text, available at https://www.tidytextmining.com/sentiment.
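As a concrete reference point, the Bing workflow listed above can be sketched as follows. This is a minimal version in the style of the book's Chapter 2 code, not the report's exact script; the object names `tidy_books` and `austen_arc` are chosen here for illustration. It relies only on the Bing lexicon, which ships with tidytext, so no lexicon download is needed.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)

# One row per word, keeping the book and the original line number.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Net Bing sentiment (positive minus negative) per 80-line chunk of each novel.
austen_arc <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
```

Plotting `sentiment` against `index`, faceted by `book`, gives the sentiment arc for each novel. The integer division `linenumber %/% 80` is what assigns each line to its 80-line chunk.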
Step 2a — Different Corpus: H.G. Wells
Four novels are downloaded from Project Gutenberg via the gutenbergr package:
| Gutenberg ID | Title |
|---|---|
| 35 | The Time Machine (1895) |
| 36 | The War of the Worlds (1898) |
| 5230 | The Invisible Man (1897) |
| 159 | The Island of Doctor Moreau (1896) |
These contrast sharply with Austen: science fiction, existential dread, and violence rather than domestic social drama, which makes sentiment differences meaningful rather than superficial.
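A minimal sketch of the download-and-tidy step, assuming the gutenbergr and tidytext packages are installed. The helper names `tidy_lines()` and `fetch_and_tidy()` are hypothetical and introduced here only for illustration; the Gutenberg IDs are passed in as an argument rather than hard-coded.

```r
library(dplyr)
library(gutenbergr)
library(tidytext)

# Pure step (hypothetical helper): one row per word, with a per-book
# line number so the text can later be split into chunks.
# Expects columns `title` and `text`.
tidy_lines <- function(raw) {
  raw %>%
    group_by(title) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)
}

# Network step (hypothetical helper): download the texts for the given
# Gutenberg IDs, attach the title from the metadata, then tidy.
fetch_and_tidy <- function(ids) {
  gutenberg_download(ids, meta_fields = "title") %>%
    tidy_lines()
}
```

Calling `fetch_and_tidy()` with the four IDs from the table would return one row per word for the Wells corpus. Keeping the download separate from the tidying step means the tidying logic can be checked on a small hand-made data frame without network access.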
Step 2b — Additional Lexicon: Loughran-McDonald
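The Loughran-McDonald lexicon [@loughran2011liability] is accessed via get_sentiments("loughran") from the tidytext package, which downloads the lexicon through the textdata package. It was originally designed for SEC 10-K financial filings and provides six categories, four of which (uncertainty, litigious, constraining, superfluous) have no counterpart in NRC or Bing: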
| Category | Description |
|---|---|
| positive / negative | General polarity |
| uncertainty | Hedging and speculative language |
| litigious | Legal and dispute language |
| constraining | Language of obligation and restriction |
| superfluous | Filler and redundant language |
The key analytical angle: the uncertainty category, designed to flag financial hedging, turns out to be prominent in science fiction, because characters in Wells constantly reason about phenomena they cannot fully explain. This makes Loughran a revealing lens despite the genre mismatch, and also illustrates an important limitation: lexicons should be matched to genre.
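One way to quantify how prominent a category is would be the share of lexicon-matched words that fall in each category, per novel. The sketch below assumes a tidy data frame with `title` and `word` columns; the function name `category_profile()` is hypothetical, and the share metric is an illustrative choice, not necessarily the one used in the report.

```r
library(dplyr)
library(tidytext)

# Share of lexicon-matched words in each category, per book.
# `words` needs columns `title` and `word`; `lexicon` needs `word` and
# `sentiment`, e.g. the result of get_sentiments("loughran").
category_profile <- function(words, lexicon) {
  words %>%
    inner_join(lexicon, by = "word") %>%
    count(title, sentiment, name = "n") %>%
    group_by(title) %>%
    mutate(share = n / sum(n)) %>%
    ungroup()
}
```

With the real lexicon the call would be `category_profile(words, get_sentiments("loughran"))`. Filtering the result to `sentiment == "uncertainty"` then gives one comparable number per novel.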
Comparison & Discussion
The final section runs the same Bing polarity balance and NRC emotion profile on both corpora side by side. The central finding is that Wells scores surprisingly positive on Bing (general words like “great” and “strange” dominate raw counts), while Loughran’s uncertainty and negative categories tell a darker story. This gap motivates the broader discussion of why lexicon choice matters.
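The Bing polarity balance can be sketched as a single function applied to both corpora at once. This assumes a combined tidy data frame with a `corpus` column (for example "Austen" and "Wells") and a `word` column. The function name `polarity_balance()` and the `positive_share` metric are illustrative assumptions, not taken from the report.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Positive and negative word counts per corpus, plus the positive share.
# `words` needs columns `corpus` and `word`; `lexicon` needs `word` and a
# `sentiment` column containing "positive" and "negative".
polarity_balance <- function(words, lexicon = get_sentiments("bing")) {
  words %>%
    inner_join(lexicon, by = "word") %>%
    count(corpus, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(positive_share = positive / (positive + negative))
}
```

Because the result is built from raw counts, a handful of very frequent words can dominate it. Counting by `word` as well as `sentiment` before aggregating is a quick way to see which words drive the balance in each corpus.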