Sentiment Analysis with Text Mining in R

Author

Nana Kwasi Danquah

Published

April 19, 2026

Overview

This report has three parts. Part 1 reproduces the primary sentiment analysis example from Chapter 2 of Text Mining with R (Silge and Robinson 2017) using the Jane Austen corpus and the three general-purpose lexicons (AFINN, Bing, NRC). Part 2 extends the analysis with a different corpus, four science-fiction novels by H.G. Wells downloaded from Project Gutenberg, and an additional sentiment lexicon: the Loughran-McDonald lexicon (Loughran and McDonald 2011), originally designed for financial documents. Part 3 compares the two corpora directly.

The central question driving the comparison is: how does the emotional texture of Regency-era domestic fiction (Austen) differ from that of early science fiction (Wells), and what does each lexicon reveal or conceal?


Setup

Code
library(tidyverse)    # data wrangling + ggplot2
library(tidytext)     # tidy text mining
library(textdata)     # sentiment lexicons (AFINN, NRC, Loughran)
library(janeaustenr)  # base corpus
library(gutenbergr)   # extension corpus
library(wordcloud)    # word cloud visualisation
library(reshape2)     # acast() for comparison clouds
library(scales)       # percent_format()

Part 1 — Reproducing the Base Example

All code in this section is adapted directly from Chapter 2 of Text Mining with R: A Tidy Approach by Silge and Robinson (2017), available at https://www.tidytextmining.com/sentiment.

1.1 The Three Sentiment Lexicons

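tidytext gives access to three general-purpose English-language sentiment lexicons via get_sentiments(). Bing ships with the package; AFINN and NRC are downloaded through textdata on first use. Each encodes sentiment differently: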

  • AFINN (Nielsen 2011) — integer scores from −5 (most negative) to +5 (most positive)
  • Bing (Liu 2012) — binary classification: positive or negative
  • NRC (Mohammad and Turney 2013) — ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust

All three are based on unigrams (individual words) and do not account for negation or context.

Code
get_sentiments("afinn") |> slice_head(n = 8)
Code
get_sentiments("bing")  |> slice_head(n = 8)
Code
get_sentiments("nrc")   |> slice_head(n = 8)

1.2 Tidying the Jane Austen Corpus

We load all six completed Austen novels from janeaustenr and convert them to a one-token-per-row format using unnest_tokens(). Line numbers and chapter markers are added before tokenising so that they are preserved for the later windowed analysis.

Code
tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter    = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_books
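
The chapter counter above relies on cumsum() over a logical vector: each heading line contributes 1, so every following line inherits the running total. A minimal sketch on a made-up five-line text (the toy tibble is hypothetical and only stands in for the real corpus) shows the effect:

```r
library(dplyr)
library(stringr)

# Hypothetical five-line "book", used only to illustrate the counter
toy <- tibble(
  text = c("PREFACE", "CHAPTER I", "It is a truth",
           "Chapter 2", "universally acknowledged")
)

toy_numbered <- toy |>
  mutate(
    linenumber = row_number(),
    # str_detect() flags heading lines; cumsum() turns the flags
    # into a running chapter number (0 = front matter)
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  )

toy_numbered
```

Lines before the first heading get chapter 0, which is why section 1.8 later filters out chapter 0.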

1.3 Most Common Joy Words in Emma (NRC Lexicon)

We filter the NRC lexicon to the “joy” category and inner-join it with the tokenised text of Emma to find the most frequent joy-associated words.

Code
nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")

tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy, by = "word") |>
  count(word, sort = TRUE) |>
  slice_head(n = 15) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word)) +
  geom_col(fill = "#4e79a7") +
  labs(
    title    = "Most Common Joy Words in Emma",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 13)

Top 15 joy words in Emma — NRC lexicon

1.4 Sentiment Arc Across All Six Novels (Bing Lexicon)

Each novel is sliced into 80-line windows; net sentiment (positive − negative word count) is computed per window and plotted as a bar chart, revealing the emotional trajectory of each narrative.

Code
jane_austen_sentiment <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Sentiment Trajectory — Jane Austen Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)

Sentiment arc across Jane Austen novels — Bing lexicon

Observation: Every novel shows a broadly positive arc with dips during crisis points — the Wickham scandal in Pride & Prejudice, Marianne’s illness in Sense & Sensibility — before resolving positively, consistent with social comedy conventions.
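
The chunking behind the arc is plain integer division, so the windows do not overlap. A minimal sketch with hypothetical, already-tagged tokens (the line numbers and labels are invented for illustration) shows how linenumber %/% 80 assigns chunk indices and how values_fill = 0 keeps chunks that lack one polarity:

```r
library(dplyr)
library(tidyr)

# Hypothetical sentiment-tagged tokens: a line number plus a Bing-style label
toy_tokens <- tibble(
  linenumber = c(1, 40, 79, 80, 120, 159, 160),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_arc <- toy_tokens |>
  # Lines 0-79 fall in chunk 0, lines 80-159 in chunk 1, and so on
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative)

toy_arc
```

Chunk 2 has no negative tokens at all; without values_fill = 0 its negative count would be NA and the net score would be missing.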

1.5 Comparing All Three Lexicons on Pride & Prejudice

To see whether lexicon choice materially changes the story, we apply all three to Pride & Prejudice and plot the net sentiment arcs together.

Code
pride_prejudice <- tidy_books |>
  filter(book == "Pride & Prejudice")

# AFINN: numeric scores summed per window
afinn_pp <- pride_prejudice |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(index = linenumber %/% 80) |>
  summarise(sentiment = sum(value)) |>
  mutate(method = "AFINN")

# Bing and NRC (positive/negative categories -> net count)
bing_nrc_pp <- bind_rows(
  pride_prejudice |>
    inner_join(get_sentiments("bing"), by = "word") |>
    mutate(method = "Bing"),
  pride_prejudice |>
    inner_join(
      get_sentiments("nrc") |>
        filter(sentiment %in% c("positive", "negative")),
      by = "word"
    ) |>
    mutate(method = "NRC")
) |>
  count(method, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

bind_rows(afinn_pp, bing_nrc_pp) |>
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7", "#59a14f")) +
  labs(
    title    = "Three Lexicons Compared — Pride & Prejudice",
    subtitle = "Each panel uses a different sentiment lexicon",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)

AFINN, Bing, and NRC compared on Pride & Prejudice

Observation: All three lexicons agree on the narrative shape (early optimism, a prolonged negative centre, a positive resolution), but AFINN produces the largest absolute swings because it sums graded integer scores from −5 to +5 rather than counting words. NRC runs higher overall, which is consistent with Bing containing a higher ratio of negative to positive entries than NRC does.
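
One way to probe why the lexicons sit at different overall levels is to look at their balance of positive and negative entries. The sketch below does this for Bing only, because it ships with tidytext; repeating it for NRC would assume the textdata download has already been accepted.

```r
library(dplyr)
library(tidytext)

# Count the positive and negative entries in the Bing lexicon
bing_balance <- get_sentiments("bing") |>
  count(sentiment) |>
  mutate(share = n / sum(n))

bing_balance
```

A lexicon with proportionally more negative entries has more chances to match negative words, which can pull its net scores down relative to a more balanced lexicon.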

1.6 Most Common Positive and Negative Words (Bing)

Code
bing_word_counts <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  ungroup()

bing_word_counts |>
  group_by(sentiment) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Top Positive & Negative Words — Jane Austen",
    subtitle = "Bing lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)

Top positive and negative words across all Austen novels — Bing

Note: “Miss” ranks among the negative words because Bing codes it as the verb “to miss,” whereas in Austen it is almost always an honorific. This illustrates a classic limitation of unigram lexicons: each word is scored without regard to context.
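
A possible workaround, sketched here under the assumption that dropping the word entirely is acceptable for this corpus, is to remove “miss” from the lexicon before joining. The custom_exclusions table is a hypothetical name introduced for illustration; it is not part of tidytext.

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Words to exclude for this corpus (an editorial choice, not part of Bing)
custom_exclusions <- tibble(word = c("miss"))

# Bing lexicon with the excluded words removed
bing_austen <- get_sentiments("bing") |>
  anti_join(custom_exclusions, by = "word")

top_negative <- austen_books() |>
  unnest_tokens(word, text) |>
  inner_join(bing_austen, by = "word", relationship = "many-to-many") |>
  filter(sentiment == "negative") |>
  count(word, sort = TRUE) |>
  slice_head(n = 10)

top_negative
```

The trade-off is that genuine uses of the verb are discarded too; for Austen that loss is probably small, but it should be checked for other corpora.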

1.7 Comparison Word Cloud

Code
tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(word, sentiment, sort = TRUE) |>
  acast(word ~ sentiment, value.var = "n", fill = 0) |>
  comparison.cloud(
    colors     = c("#e15759", "#4e79a7"),
    max.words  = 120,
    title.size = 1.5
  )

Positive (blue) vs negative (red) word cloud — Bing lexicon

1.8 Most Negative Chapter in Each Novel

Which chapter of each novel has the highest proportion of negative words under the Bing lexicon?

Code
bing_negative <- get_sentiments("bing") |>
  filter(sentiment == "negative")

word_counts <- tidy_books |>
  group_by(book, chapter) |>
  summarise(words = n(), .groups = "drop")

tidy_books |>
  semi_join(bing_negative, by = "word") |>
  group_by(book, chapter) |>
  summarise(negative_words = n(), .groups = "drop") |>
  left_join(word_counts, by = c("book", "chapter")) |>
  mutate(ratio = negative_words / words) |>
  filter(chapter != 0) |>
  group_by(book) |>
  slice_max(ratio, n = 1) |>
  ungroup() |>
  arrange(desc(ratio)) |>
  select(book, chapter, negative_words, words, ratio)

Part 2 — Extension

2.1 Extension Corpus: H.G. Wells

Rationale: Austen’s domestic social comedies are polite, bounded, and emotionally moderate. As a deliberate contrast, we use four H.G. Wells science-fiction novels dealing with invasion, mutation, time travel, and existential threat. Both authors were English novelists, but Austen wrote in the Regency era and Wells at the end of the Victorian period, and they worked in entirely different registers.

Code
wells_meta <- tibble(
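  # 159 is the ID Text Mining with R uses for The Island of Doctor Moreau;
  # IDs can be checked against gutenberg_works() before downloading
  gutenberg_id = c(35, 36, 5230, 159),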
  title = c(
    "The Time Machine",
    "The War of the Worlds",
    "The Invisible Man",
    "The Island of Doctor Moreau"
  )
)

# Download texts. Titles are joined from wells_meta below rather than
# requested via meta_fields, which may not work with all mirrors.
wells_raw <- gutenberg_download(wells_meta$gutenberg_id)

# Join with our local metadata to get titles
tidy_wells <- wells_raw |>
  left_join(wells_meta, by = "gutenberg_id") |>
  group_by(title) |>
  mutate(
    linenumber = row_number(),
    chapter    = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlci]+", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)

tidy_wells |> count(title, sort = TRUE)

2.2 NRC Joy vs Fear — Austen and Wells Side by Side

Code
nrc_joy_fear <- get_sentiments("nrc") |>
  filter(sentiment %in% c("joy", "fear"))

austen_jf <- tidy_books |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_jf <- tidy_wells |>
  inner_join(nrc_joy_fear, by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_jf, wells_jf) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Joy vs Fear — Austen vs Wells",
    subtitle = "NRC lexicon · proportion of joy/fear matched words",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Joy vs fear word proportions: Austen vs Wells (NRC)

2.3 Sentiment Arc — Wells Novels (Bing)

Code
wells_sentiment_bing <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(title, index = linenumber %/% 80, sentiment) |>
  pivot_wider(
    names_from  = sentiment,
    values_from = n,
    values_fill = 0
  ) |>
  mutate(sentiment = positive - negative)

ggplot(wells_sentiment_bing, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Sentiment Trajectory — H.G. Wells Novels",
    subtitle = "Bing lexicon · 80-line chunks",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment (positive − negative)"
  ) +
  theme_minimal(base_size = 12)

Sentiment arc across H.G. Wells novels — Bing lexicon

2.4 Most Common Fear Words by Wells Novel (NRC)

Code
nrc_fear <- get_sentiments("nrc") |> filter(sentiment == "fear")

tidy_wells |>
  inner_join(nrc_fear, by = "word") |>
  count(title, word, sort = TRUE) |>
  group_by(title) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, title), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, scales = "free_y") +
  scale_y_reordered() +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title    = "Most Common Fear Words — H.G. Wells",
    subtitle = "NRC Lexicon",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 11)

Top fear words in each Wells novel — NRC lexicon

2.5 Additional Lexicon: Loughran-McDonald

Background

The Loughran-McDonald lexicon (Loughran and McDonald 2011) was constructed from SEC 10-K annual reports to identify words with consistent sentiment signals in financial prose. It provides six categories:

Category     | Meaning in finance      | Why it is interesting in fiction
positive     | Favourable outlook      | General optimism
negative     | Unfavourable outlook    | General pessimism
uncertainty  | Hedging, speculation    | Language of the unknown
litigious    | Legal language          | Conflict, authority
constraining | Obligation, restriction | Captivity, control
superfluous  | Redundant filler        |

Applying a financial lexicon to nineteenth-century fiction is deliberately unconventional. The goal is not to claim Loughran is the right tool for fiction, but to use its unique categories, especially uncertainty, to surface linguistic patterns that Bing and NRC cannot detect.

Code
loughran <- get_sentiments("loughran")
loughran |> count(sentiment, sort = TRUE)

2.5.1 Loughran Category Profile — Wells Novels

Code
wells_loughran <- tidy_wells |>
  inner_join(loughran, by = "word", relationship = "many-to-many") |>
  count(title, sentiment) |>
  group_by(title) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(wells_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — H.G. Wells",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Loughran-McDonald category proportions — H.G. Wells novels

2.5.2 Loughran Category Profile — Austen Novels

Code
austen_loughran <- tidy_books |>
  inner_join(loughran, by = "word", relationship = "many-to-many") |>
  count(book, sentiment) |>
  group_by(book) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

ggplot(austen_loughran, aes(sentiment, proportion, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Loughran-McDonald Category Proportions — Jane Austen",
    subtitle = "Proportion of matched words falling into each category",
    x = NULL,
    y = "Proportion of matched sentiment words"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Loughran-McDonald category proportions — Jane Austen novels

2.5.3 Uncertainty Language — The Signature of Science Fiction

The uncertainty category is the most analytically interesting when applied to fiction. Financial uncertainty words (“possible,” “might,” “appears,” “uncertain,” “approximately”) map naturally onto the language of characters confronting the unknown.

Code
loughran_uncertainty <- loughran |> filter(sentiment == "uncertainty")

austen_unc <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "Jane Austen")

wells_unc <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  count(word, sort = TRUE) |>
  mutate(corpus = "H.G. Wells")

bind_rows(austen_unc, wells_unc) |>
  group_by(corpus) |>
  slice_max(n, n = 12) |>
  ungroup() |>
  ggplot(aes(n, reorder_within(word, n, corpus), fill = corpus)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~corpus, scales = "free") +
  scale_y_reordered() +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Top Uncertainty Words — Austen vs Wells",
    subtitle = "Loughran-McDonald uncertainty category",
    x = "Count", y = NULL
  ) +
  theme_minimal(base_size = 12)

Top uncertainty words: Austen vs Wells (Loughran)

2.5.4 Bing vs Loughran Arc — The War of the Worlds

Code
wotw <- tidy_wells |> filter(title == "The War of the Worlds")

wotw_bing <- wotw |>
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative, method = "Bing")

# pivot_wider only creates columns that exist in the data, so we
# explicitly add any missing polarity column before computing net
wotw_loughran <- wotw |>
  inner_join(
    loughran |> filter(sentiment %in% c("positive", "negative")),
    by = "word",
    relationship = "many-to-many"
  ) |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  (function(df) {
    if (!"positive" %in% names(df)) df$positive <- 0L
    if (!"negative" %in% names(df)) df$negative <- 0L
    df
  })() |>
  mutate(net = positive - negative, method = "Loughran")

bind_rows(
  wotw_bing     |> select(index, net, method),
  wotw_loughran |> select(index, net, method)
) |>
  ggplot(aes(index, net, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#4e79a7", "#f28e2b")) +
  labs(
    title    = "Bing vs Loughran — The War of the Worlds",
    subtitle = "Net sentiment (positive − negative) per 80-line chunk",
    x = "Narrative progress (chunk index)",
    y = "Net sentiment"
  ) +
  theme_minimal(base_size = 12)

Bing vs Loughran (positive − negative) — The War of the Worlds

Part 3 — Cross-Corpus Comparison

3.1 Overall Bing Polarity: Austen vs Wells

Code
austen_pol <- tidy_books |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_pol <- tidy_wells |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))

bind_rows(austen_pol, wells_pol) |>
  ggplot(aes(sentiment, proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Positive vs Negative Word Balance",
    subtitle = "Bing lexicon — Austen vs Wells",
    x = NULL, y = "Proportion of matched words", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Positive vs negative word balance: Austen vs Wells (Bing)

3.2 Full NRC Emotion Profile: Austen vs Wells

Code
nrc_all <- get_sentiments("nrc")

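# NRC lists many words under several categories, so the join is
# many-to-many by design; declaring it avoids the dplyr warning
austen_nrc <- tidy_books |>
  inner_join(nrc_all, by = "word", relationship = "many-to-many") |>
  count(sentiment) |>
  mutate(corpus = "Jane Austen", proportion = n / sum(n))

wells_nrc <- tidy_wells |>
  inner_join(nrc_all, by = "word", relationship = "many-to-many") |>
  count(sentiment) |>
  mutate(corpus = "H.G. Wells", proportion = n / sum(n))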

bind_rows(austen_nrc, wells_nrc) |>
  ggplot(aes(reorder(sentiment, proportion), proportion, fill = corpus)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "NRC Emotion Profiles — Austen vs Wells",
    subtitle = "Proportion of matched words in each category",
    x = NULL, y = "Proportion", fill = "Corpus"
  ) +
  theme_minimal(base_size = 13)

Full NRC emotion profile: Austen vs Wells

3.3 Uncertainty Rate per 1,000 Words: Austen vs Wells

Code
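# Rate = matched uncertainty tokens / all tokens * 1000.
# The match count is computed first so the arithmetic does not
# depend on the precedence of |> relative to /.
austen_unc_n <- tidy_books |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow()
austen_unc_rate <- austen_unc_n / nrow(tidy_books) * 1000

wells_unc_n <- tidy_wells |>
  inner_join(loughran_uncertainty, by = "word") |>
  nrow()
wells_unc_rate <- wells_unc_n / nrow(tidy_wells) * 1000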

tibble(
  corpus = c("Jane Austen", "H.G. Wells"),
  rate   = c(austen_unc_rate, wells_unc_rate)
) |>
  ggplot(aes(corpus, rate, fill = corpus)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  scale_fill_manual(values = c("#e15759", "#4e79a7")) +
  labs(
    title    = "Uncertainty Word Rate per 1,000 Words",
    subtitle = "Loughran-McDonald uncertainty category",
    x = NULL,
    y = "Uncertainty words per 1,000 words"
  ) +
  theme_minimal(base_size = 14)

Loughran uncertainty word rate per 1,000 words
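
The rate plotted above is the number of matched uncertainty tokens divided by the total token count, scaled to 1,000 words. A worked sketch on an eight-token toy sentence makes the arithmetic explicit; both tibbles are hypothetical stand-ins, and the two-word list is not the real Loughran uncertainty category.

```r
library(dplyr)

# Hypothetical token table and a two-word stand-in for the uncertainty list
toy_tokens <- tibble(
  word = c("perhaps", "the", "martians", "might",
           "return", "soon", "or", "not")
)
toy_uncertainty <- tibble(word = c("perhaps", "might"))

# semi_join() keeps each matching token exactly once
matched <- toy_tokens |>
  semi_join(toy_uncertainty, by = "word") |>
  nrow()
total <- nrow(toy_tokens)

rate_per_1000 <- (matched / total) * 1000
rate_per_1000
```

Here 2 of 8 tokens match, giving 250 per 1,000 words. Normalising by total tokens is what makes corpora of different lengths comparable.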

Discussion

Do the corpora differ in the expected direction?

Yes, but more subtly than expected. The NRC emotion profile (§3.2) shows the clearest contrast: Wells registers higher fear and anger, while Austen shows higher trust and anticipation — consistent with the difference between invasion narratives and courtship narratives. Surprise is roughly equal, perhaps because both genres rely on plot twists and revelation.

The Bing polarity comparison (§3.1) is more surprising: both corpora are net-positive, and the gap between them is smaller than expected. This reflects a known limitation of unigram methods — high-frequency, neutral-to-positive words (“good,” “great,” “well”) dominate raw counts and pull every corpus toward positive regardless of genre.

What does the Loughran lexicon add?

The uncertainty rate comparison (§3.3) is the most distinctive finding from the extension lexicon. Wells uses more uncertainty language than Austen — words like “perhaps,” “appeared,” “seemed,” “possible,” and “might” appear at a higher rate in his science fiction. This makes intuitive sense: Wells’s protagonists are constantly reasoning about phenomena at the edge of human understanding. Austen’s characters may be socially uncertain, but they rarely confront epistemological uncertainty about the nature of reality itself.

The Loughran arc on The War of the Worlds (§2.5.4) produces more compressed swings than the Bing arc on the same text. Loughran’s positive and negative vocabulary was calibrated on financial prose and matches fewer fiction words overall — fewer matches means less signal, but also fewer false positives like “miss.”

Lexicon choice matters

Lexicon  | Strength                          | Limitation in this context
AFINN    | Graded intensity                  | Small vocabulary
Bing     | Large vocabulary                  | Binary only; false positives (“miss”)
NRC      | Rich emotion categories           | Overlapping categories; inflated counts
Loughran | Unique uncertainty/litigious axes | Calibrated on finance; under-matches fiction

No single lexicon is correct. The most informative analysis uses multiple lexicons and treats disagreements between them as data rather than problems.

Limitations

  1. No negation handling: “not good” scores the same as “good.”
  2. Historical vocabulary: Some 19th-century words are absent from modern lexicons, or have shifted meaning since they were written.
  3. Loughran genre mismatch: Financial calibration means many emotionally charged fiction words have no Loughran entry, reducing coverage.
  4. Raw counts vs normalisation: Where books differ in length, proportions (as used throughout Part 3) are more meaningful than raw counts.
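
The first limitation can be partly probed with bigrams. The sketch below is not part of the analysis in this report: it uses a hypothetical two-sentence corpus and a toy AFINN-style lexicon, and it assumes, as a crude heuristic, that a preceding “not” simply flips the sign of the following word.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical corpus and toy lexicon, for illustration only
toy_text <- tibble(text = c("the dinner was not good",
                            "the walk was good"))
toy_lexicon <- tibble(word = c("good", "bad"), value = c(3, -3))

negated <- toy_text |>
  # Tokenise into overlapping two-word sequences
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  # Keep only sentiment words directly preceded by "not"
  filter(word1 == "not") |>
  inner_join(toy_lexicon, by = c("word2" = "word")) |>
  # Assumed heuristic: negation reverses the score
  mutate(adjusted = -value)

negated
```

Only “not good” is caught; the un-negated “good” in the second sentence is left alone. Counting how often sentiment words follow a negator gives a rough sense of how much a unigram score may be misattributed.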

References

Liu, Bing. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Loughran, Tim, and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66 (1): 35–65.
Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65.
Nielsen, Finn Årup. 2011. “A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs.” arXiv Preprint arXiv:1103.2903. https://arxiv.org/abs/1103.2903.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/.