Sentiment Analysis with Text Mining in R — Approach

Author

Nana Kwasi Danquah

Published

April 16, 2026

Approach

This report reproduces the sentiment analysis example from Chapter 2 of Text Mining with R [@silge2017text] and extends it in two directions: a different corpus (H.G. Wells via Project Gutenberg) and an additional lexicon (Loughran-McDonald).


Step 1 — Base Example

The base example follows Chapter 2 of @silge2017text exactly. It uses the janeaustenr package as the corpus and demonstrates three core workflows:

  • NRC lexicon — most common joy words in Emma
  • Bing lexicon — sentiment arc across all six Austen novels in 80-line chunks
  • Three-lexicon comparison — AFINN, Bing, and NRC applied to Pride & Prejudice side by side

All code in this section is adapted directly from @silge2017text, available at https://www.tidytextmining.com/sentiment.
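
As a rough sketch of the second workflow, the Bing sentiment arc can be computed along the following lines. It mirrors the shape of the book's example rather than reproducing this report's exact code. The object names tidy_books and austen_arc are illustrative, and Bing is used here because it ships with tidytext and needs no separate download.

```r
library(janeaustenr)
library(dplyr)
library(tidyr)
library(tidytext)

# One row per word, keeping the line number each word came from
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Net sentiment (positive minus negative word count) per 80-line chunk.
# Recent dplyr versions may warn about a many-to-many join here, because
# a few Bing words carry both labels; the result is still computed.
austen_arc <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)
```

Plotting net against index, faceted by book, gives one arc per novel. The integer division by 80 is what defines the chunk size, so changing that constant changes the resolution of the arc.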


Step 2a — Different Corpus: H.G. Wells

Four novels are downloaded from Project Gutenberg via the gutenbergr package:

Gutenberg ID    Title
35              The Time Machine (1895)
36              The War of the Worlds (1898)
5230            The Invisible Man (1897)
159             The Island of Doctor Moreau (1896)

These novels contrast sharply with Austen: science fiction, existential dread, and violence rather than domestic social drama, which makes sentiment differences meaningful rather than superficial.
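
A minimal sketch of the tokenizing step for this corpus is shown below. It assumes the input has the shape returned by gutenbergr::gutenberg_download() with meta_fields = "title" (columns gutenberg_id, title, and text). The helper name tidy_corpus and the toy rows are hypothetical and stand in for the real download so the example runs offline.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Tokenize a data frame shaped like the output of
# gutenbergr::gutenberg_download(ids, meta_fields = "title"):
# one row per line of text, with gutenberg_id, title, and text columns.
tidy_corpus <- function(raw) {
  raw %>%
    group_by(title) %>%
    mutate(linenumber = row_number()) %>%  # line index within each novel
    ungroup() %>%
    unnest_tokens(word, text)              # lowercases and strips punctuation
}

# Hypothetical stand-in rows (not real excerpts from the novels)
toy_raw <- tibble(
  gutenberg_id = c(35L, 35L, 36L),
  title = c("The Time Machine", "The Time Machine", "The War of the Worlds"),
  text = c("A strange machine stood there.",
           "We waited in silence.",
           "The heat was terrible.")
)

wells_tidy <- tidy_corpus(toy_raw)
```

With network access, toy_raw would be replaced by the result of gutenberg_download() called on the four IDs in the table. Keeping linenumber per title is what allows the same 80-line chunking used for Austen.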


Step 2b — Additional Lexicon: Loughran-McDonald

The Loughran-McDonald lexicon [@loughran2011liability] is loaded with get_sentiments("loughran") from tidytext, which fetches the data through the textdata package. It was originally designed for SEC 10-K financial filings and provides six categories, four of which (uncertainty, litigious, constraining, superfluous) have no counterpart in NRC or Bing:

Category              Description
positive / negative   General polarity
uncertainty           Hedging and speculative language
litigious             Legal and dispute language
constraining          Language of obligation and restriction
superfluous           Filler and redundant language

The key analytical angle: the uncertainty category, designed to flag financial hedging, turns out to be prominent in science fiction, because characters in Wells constantly reason about phenomena they cannot fully explain. This makes Loughran a revealing lens despite the genre mismatch, and it also illustrates an important limitation: lexicons should be matched to genre.
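
One way to quantify how prominent a category is would be its share of all lexicon matches within each novel. The sketch below assumes a tokenized data frame with title and word columns and a lexicon with word and sentiment columns, which is the layout get_sentiments() returns. The helper category_share, the mini-lexicon, and the tokens are all hypothetical placeholders, not real Loughran entries or results from this report.

```r
library(dplyr)
library(tibble)

# Share of each lexicon category among matched words, per title.
# `lexicon` needs columns word and sentiment.
category_share <- function(tidy_words, lexicon) {
  tidy_words %>%
    inner_join(lexicon, by = "word") %>%
    count(title, sentiment) %>%
    group_by(title) %>%
    mutate(share = n / sum(n)) %>%
    ungroup()
}

# Hypothetical mini-lexicon in the Loughran layout (illustrative entries only)
toy_lexicon <- tibble(
  word = c("perhaps", "possibly", "terrible", "great"),
  sentiment = c("uncertainty", "uncertainty", "negative", "positive")
)

# Hypothetical tokens, one row per word
toy_words <- tibble(
  title = c(rep("Novel A", 4), rep("Novel B", 3)),
  word = c("perhaps", "possibly", "terrible", "machine",
           "great", "perhaps", "garden")
)

shares <- category_share(toy_words, toy_lexicon)
```

With the real lexicon, the second argument would be get_sentiments("loughran"). Normalizing by matched words rather than by total words keeps novels of different lengths comparable, at the cost of hiding how few words a finance-oriented lexicon matches in fiction.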


Comparison & Discussion

The final section runs the same Bing polarity balance and NRC emotion profile on both corpora side by side. The central finding is that Wells scores surprisingly positive on Bing (general words like “great” and “strange” dominate raw counts), while Loughran’s uncertainty and negative categories tell a darker story. This gap motivates the broader discussion of why lexicon choice matters.
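
The side-by-side Bing comparison could be laid out as in the sketch below. It assumes a single tokenized data frame with a corpus column and a word column. The function name polarity_side_by_side and the toy tokens are hypothetical, and the toy values say nothing about the actual corpora.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Positive and negative shares of Bing matches, one column per corpus.
polarity_side_by_side <- function(tidy_words) {
  tidy_words %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(corpus, sentiment) %>%
    group_by(corpus) %>%
    mutate(share = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = corpus, values_from = share, values_fill = 0)
}

# Hypothetical tokens; in practice these would be the two tidy corpora
# stacked with bind_rows() and labelled by corpus.
toy_words <- tibble(
  corpus = c(rep("Austen", 4), rep("Wells", 4)),
  word = c("good", "happy", "love", "bad",
           "good", "fear", "terrible", "bad")
)

balance <- polarity_side_by_side(toy_words)
```

The result has one row per polarity and one column per corpus, so the two balances can be read across a row. Swapping in a different lexicon only requires changing the get_sentiments() call, which makes it straightforward to see how far the picture depends on lexicon choice.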