Sentiment Analysis: 10A

Authors

Desiree Thomas, Denise Atherley, Kiera Griffiths

Approach:

For this assignment, we were asked to work through the original code in Chapter 2 of the Text Mining with R textbook and then extend it with a new text and an additional lexicon.

Load these packages before you begin, installing any that you do not already have:

# Setup and Dependency Loading
library(tidyverse)    # Core data manipulation and visualization
Warning: package 'dplyr' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)     # Text mining framework
Warning: package 'tidytext' was built under R version 4.5.3
library(textdata)     # Access to sentiment lexicons (AFINN, NRC)
Warning: package 'textdata' was built under R version 4.5.3
library(janeaustenr)  # Austen corpus
Warning: package 'janeaustenr' was built under R version 4.5.3
library(gutenbergr)   # Project Gutenberg corpus access
Warning: package 'gutenbergr' was built under R version 4.5.3
library(lexicon)      # Advanced sentiment dictionaries (Jockers)
Warning: package 'lexicon' was built under R version 4.5.3
library(wordcloud)    # Word cloud visualization
Warning: package 'wordcloud' was built under R version 4.5.3
Loading required package: RColorBrewer
library(reshape2)     # Required for comparison.cloud()

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths
library(scales)       # Formatting axes and labels (percents)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Methodology:

For this assignment, we reproduce the base sentiment analysis example from Chapter 2 of Text Mining with R, using the tidytext, dplyr, and stringr packages. The goal is to follow the tidy text philosophy, which relies on functions such as unnest_tokens() and inner_join(). We reproduce the sentiment trajectories of Jane Austen’s novels using the janeaustenr package, and we use the gutenbergr package to choose another work with a significantly different tone. We will also perform a comparative validation to determine how much the lexicons agree, by calculating the correlation between the sentiment scores that different lexicons produce for the same segments and seeing where they diverge. One data challenge we anticipate is the sparsity of the lexicons: they are finite, and many of the words in the corpus may not appear in them. We therefore calculate a coverage rate to judge whether the resulting sentiment score is actually representative of the text.

Step 1: Jane Austen

The function get_sentiments allows us to get specific sentiment lexicons with the appropriate measures for each one.

get_sentiments("afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows
get_sentiments("bing")
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows
get_sentiments("nrc")
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows

To look for words with a joy score from the NRC lexicon, we first need to convert the text of the novels to the tidy format using unnest_tokens(). The code below does that and also sets up columns to keep track of which line and chapter of the book each word comes from.

tidy_books <- austen_books() %>%
  group_by(book) %>% 
  mutate(
    linenumber = row_number(), 
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>% 
  ungroup() %>% 
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>% 
  filter(book == "Emma") %>% 
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows

Next, we’ll count up how many positive and negative words there are in defined sections of each book.

jane_austen_sentiment <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Now we are able to plot these sentiment scores across the plot trajectory of each novel.

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) + 
  geom_col(show.legend = FALSE) + 
             facet_wrap(~book, ncol = 2, scales = "free_x")
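
The binning step behind jane_austen_sentiment can be checked on a few toy rows. The sketch below is only an illustration: toy_hits and its values are made up, and it assumes the same 80-line bins and the same positive-minus-negative score used above.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one row per matched sentiment word, with its line number
toy_hits <- tibble(
  linenumber = c(3, 40, 79, 80, 95, 159, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive", "positive")
)

toy_net <- toy_hits %>%
  # Integer division: lines 0-79 fall in bin 0, 80-159 in bin 1, 160-239 in bin 2
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment label; bins without a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
```

Each row of toy_net is one 80-line bin, and net is the bar height that would be drawn for that bin.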

To choose only the words from one novel we’re interested in, we’ll use filter(). In this case we’re using all three sentiment lexicons to examine how the sentiment changes across the narrative arc of Pride and Prejudice.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
# A tibble: 122,204 × 4
   book              linenumber chapter word     
   <fct>                  <int>   <int> <chr>    
 1 Pride & Prejudice          1       0 pride    
 2 Pride & Prejudice          1       0 and      
 3 Pride & Prejudice          1       0 prejudice
 4 Pride & Prejudice          3       0 by       
 5 Pride & Prejudice          3       0 jane     
 6 Pride & Prejudice          3       0 austen   
 7 Pride & Prejudice          7       1 chapter  
 8 Pride & Prejudice          7       1 1        
 9 Pride & Prejudice         10       1 it       
10 Pride & Prejudice         10       1 is       
# ℹ 122,194 more rows
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Having estimated the net sentiment for each chunk of the novel with each lexicon, we bind the results together and visualize them.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
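
The methodology calls for a correlation between the scores that different lexicons assign to the same segments. One possible way to compute it is sketched below on made-up numbers; toy_scores is hypothetical and only mimics the long layout (method, index, sentiment) of the combined table, so the values say nothing about the real novels.

```r
library(dplyr)
library(tidyr)

# Toy per-bin scores (hypothetical numbers), one row per method and bin
toy_scores <- tibble(
  method    = rep(c("AFINN", "Bing et al.", "NRC"), each = 5),
  index     = rep(0:4, times = 3),
  sentiment = c(10, -4, 6, 12, -8,
                 5, -3, 2,  7, -6,
                 9,  1, 8, 11,  0)
)

# One column per method so that each row is the same text segment
score_wide <- toy_scores %>%
  pivot_wider(names_from = method, values_from = sentiment)

# Pairwise Pearson correlations between the methods' per-bin scores
agreement <- cor(select(score_wide, -index))
agreement
```

Values near 1 would indicate that two lexicons rise and fall together across the narrative, even if their absolute scores differ.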

We’ll now look at how many positive and negative words are in these lexicons.

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative   3316
2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative   4781
2 positive   2005
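
A related check on a lexicon is whether any word carries more than one label, which is the situation that the many-to-many join warnings in this report point to. The sketch below uses a small made-up lexicon (toy_lexicon is hypothetical); the same count-and-filter step could be run on a real lexicon table.

```r
library(dplyr)

# Toy lexicon (hypothetical): "envious" carries two labels
toy_lexicon <- tibble(
  word      = c("happy", "envious", "envious", "gloomy"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Words that occur under more than one label
ambiguous <- toy_lexicon %>%
  count(word, name = "n_labels") %>%
  filter(n_labels > 1)

ambiguous

# One possible resolution: drop ambiguous words so the join key is unique
unique_lexicon <- toy_lexicon %>%
  anti_join(ambiguous, by = "word")
```

Dropping ambiguous words is only one option; keeping them and accepting that they add to both the positive and negative counts is another.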

Having the data frame with both sentiment and word allows us to analyze word counts that contribute to each sentiment.

bing_word_counts  <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bing_word_counts
# A tibble: 2,585 × 3
   word     sentiment     n
   <chr>    <chr>     <int>
 1 miss     negative   1855
 2 well     positive   1523
 3 good     positive   1380
 4 great    positive    981
 5 like     positive    725
 6 better   positive    639
 7 enough   positive    613
 8 happy    positive    534
 9 love     positive    495
10 pleasure positive    462
# ℹ 2,575 more rows
bing_word_counts %>% 
  group_by(sentiment) %>% 
  slice_max(n, n = 10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(n, word, fill = sentiment)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~sentiment, scales = "free_y") + 
  labs(x = "Contribution to sentiment", 
       y = NULL)

The word “miss” is an anomaly: it is coded as negative, but in Austen’s novels it is mostly used as a title for young, unmarried women. We can add “miss” to a custom stop-words list using bind_rows(), as in the following code.

custom_stop_words <- bind_rows(tibble(word = c("miss"), 
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words
# A tibble: 1,150 × 2
   word        lexicon
   <chr>       <chr>  
 1 miss        custom 
 2 a           SMART  
 3 a's         SMART  
 4 able        SMART  
 5 about       SMART  
 6 above       SMART  
 7 according   SMART  
 8 accordingly SMART  
 9 across      SMART  
10 actually    SMART  
# ℹ 1,140 more rows

We can visualize this in different types of word clouds.

tidy_books %>% 
  anti_join(stop_words) %>% 
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100))
Joining with `by = join_by(word)`

tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  acast(word ~sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("gray20", "gray80"), 
                    max.words = 100)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Some sentiment analysis algorithms look beyond only unigrams to try to understand the sentiment of a sentence as a whole. In those scenarios, we may want to tokenize text into sentences.

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
[1] "by jane austen"

Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.

austen_chapters <- austen_books() %>% 
  group_by(book) %>% 
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>% 
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
# A tibble: 6 × 2
  book                chapters
  <fct>                  <int>
1 Sense & Sensibility       51
2 Pride & Prejudice         62
3 Mansfield Park            49
4 Emma                      56
5 Northanger Abbey          32
6 Persuasion                25
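
Because the novels use different heading styles ("Chapter 1" in some, "CHAPTER 1" in others), it can help to test a chapter pattern on a few sample lines before tokenizing. The sketch below uses base R and made-up lines; it applies the case-insensitive pattern from the chapter counter in tidy_books and is only a quick check, not part of the original pipeline.

```r
# Sample lines (hypothetical) in both heading styles, plus ordinary prose
sample_lines <- c(
  "CHAPTER 1",
  "The family of Dashwood had long been settled in Sussex.",
  "Chapter 2",
  "a chapter of accidents",
  "CHAPTER III",
  "It was a chapter in her life."
)

# Same pattern as the chapter counter used for tidy_books, case-insensitive
is_heading <- grepl("^chapter [\\divxlc]", sample_lines,
                    ignore.case = TRUE, perl = TRUE)

# Running chapter number: increases by one at each heading line
chapter <- cumsum(is_heading)

data.frame(line = sample_lines, is_heading, chapter)
```

Only the lines that start with a heading are flagged, so prose that merely mentions the word "chapter" does not advance the counter.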

We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels?

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>% 
  group_by(book, chapter) %>% 
  summarize(words = n())
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by book and chapter.
ℹ Output is grouped by book.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(book, chapter))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.
tidy_books %>% 
  semi_join(bingnegative) %>% 
  group_by(book, chapter) %>% 
  summarize(negativewords = n()) %>% 
  left_join(wordcounts, by = c("book", "chapter")) %>% 
  mutate(ratio = negativewords/words) %>% 
  filter(chapter != 0) %>% 
  slice_max(ratio, n = 1) %>% 
  ungroup()
Joining with `by = join_by(word)`
`summarise()` has regrouped the output.
# A tibble: 6 × 5
  book                chapter negativewords words  ratio
  <fct>                 <int>         <int> <int>  <dbl>
1 Sense & Sensibility      43           161  3405 0.0473
2 Pride & Prejudice        34           111  2104 0.0528
3 Mansfield Park           46           173  3685 0.0469
4 Emma                     15           151  3340 0.0452
5 Northanger Abbey         21           149  2982 0.0500
6 Persuasion                4            62  1807 0.0343

Step 2: ‘The War of the Worlds’ Extension

We use the novel ‘The War of the Worlds’ by H.G. Wells for the extended sentiment analysis.

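# Grab the text from gutenbergr & convert to tidy text
# Project Gutenberg ebook 36 is "The War of the Worlds"
# (ID 161 is Austen's "Sense and Sensibility")
wotw_tidy <- gutenberg_download(36) %>%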
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)
Using mirror https://aleph.pglaf.org.
# Load the Jockers hash table for literary sentiment
jockers_lex <- lexicon::hash_sentiment_jockers

# Create a Loughran-McDonald baseline dataframe
loughran_wotw <- wotw_tidy %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(method = "Loughran")
Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1354 of `x` matches multiple rows in `y`.
ℹ Row 2772 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Plot the Loughran baseline to prove the initial negative trajectory
loughran_wotw %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(fill = "midnightblue") +
  theme_minimal() +
  labs(title = "Baseline Sentiment: 'War of the Worlds' (Loughran)",
       subtitle = "Pre-cleaning trajectory using a financial/technical lexicon",
       x = "Narrative Progress (80-line bins)",
       y = "Net Sentiment Score")

We can see from this initial exploration that, under the Loughran-McDonald lexicon, the sentiment trajectory of this title is mostly negative throughout the plot of the book. In contrast, Pride and Prejudice leaned heavily towards positive sentiment throughout its plot.

Note that when we compared the Loughran-McDonald output with the context of the book, we noticed a discrepancy: the lexicon is not actually measuring the intensity of despair in Wells’s book. Although the Loughran-McDonald dictionary does ‘hit’ words that appear in our text, it is built for accounting and finance and is therefore largely unable to pick up the narrative context or tone of the text it is applied to. Some of the words that Wells uses happen to match Loughran’s “risk” vocabulary, which is why this analysis returns a negative overall sentiment.

However, a deeper look at standard word contributions reveals a significant amount of contextual noise. To combat this “noise”, we removed specific words due to the context in which they are used within the science fiction narrative.

Note that later on we switch to the Jockers lexicon and compare it against the Loughran-McDonald lexicon (out of curiosity).

Irrelevant Words in Context

wells_noise <- tibble(
  word = c("miss", "well", "object", "like", "great", "good", "enough", "perfectly"),
  lexicon = "custom"
)

# Run the tidy pipe with an anti_join to remove noise
wotw_cleaned <- wotw_tidy %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(wells_noise, by = "word") 

# Plot Word Contribution with External Labels to show the noise is gone
wotw_cleaned %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  mutate(percent = n / sum(n),
         word = reorder(word, n)) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(n, " (", percent(percent, accuracy = 0.1), ")")),
            hjust = -0.1,      
            size = 3.2,       
            color = "gray20",  
            fontface = "bold") +
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) +
  facet_wrap(~sentiment, scales = "free_y") +
  theme_minimal() +
  labs(title = "Word Contribution (Cleaned Context)",
       x = "Count (n)",
       y = NULL)

Jockers and Loughran Sentiment Trajectories

Here is the sentiment trajectory plot after we applied anti_join() to remove the contextual noise. We compare the literary Jockers lexicon against the financial Loughran-McDonald lexicon.

# Calculate Jockers Sentiment Trajectory 
jockers_trajectory <- wotw_cleaned %>%
  inner_join(jockers_lex, by = c("word" = "x")) %>%
  group_by(index = linenumber %/% 100) %>%
  summarise(sentiment = sum(y)) %>% 
  mutate(lexicon = "Jockers")

# Calculate Loughran Trajectory 
loughran_trajectory <- wotw_cleaned %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(score = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(index = linenumber %/% 100) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(lexicon = "Loughran")
Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 513 of `x` matches multiple rows in `y`.
ℹ Row 656 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Combine and Plot the Differences
bind_rows(jockers_trajectory, loughran_trajectory) %>%
  ggplot(aes(index, sentiment, color = lexicon)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed", color = "gray30") +
  facet_wrap(~lexicon, scales = "free_y", ncol = 1) +
  theme_minimal() +
  labs(title = "Sentiment Trajectory after Contextual Cleaning",
       subtitle = "Cleaned data excludes: miss, well, object, like, great, good, enough, perfectly",
       x = "Narrative Progress (100-line bins)",
       y = "Total Sentiment Score")
`geom_smooth()` using formula = 'y ~ x'
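
The two trajectories are on different scales: the Loughran scores above are sums of +1 and -1, while the Jockers scores are sums of the lexicon's numeric weights (which we understand to be fractional). The free y-axes handle this in the faceted plot. If the curves were to be overlaid on one axis, one option would be to standardize each lexicon's scores first, as in this sketch on made-up numbers (toy_traj is hypothetical):

```r
library(dplyr)

# Toy trajectories (hypothetical numbers) on different scales
toy_traj <- tibble(
  lexicon   = rep(c("Jockers", "Loughran"), each = 4),
  index     = rep(0:3, times = 2),
  sentiment = c(-1.5, 0.5, -2.5, -0.5,
                -6, -2, -10, -2)
)

toy_scaled <- toy_traj %>%
  group_by(lexicon) %>%
  # z-score within each lexicon: subtract its mean, divide by its sd
  mutate(z = (sentiment - mean(sentiment)) / sd(sentiment)) %>%
  ungroup()

toy_scaled
```

After standardizing, each lexicon's z column has mean 0 and standard deviation 1, so differences in shape are easier to see than differences in magnitude.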

Calculating Lexicon Coverage Rates

Here, we calculate the coverage rate of each lexicon to judge how well the resulting sentiment score represents the text.

# Gets the total number of words in the corpus
total_wotw_words <- nrow(wotw_tidy)

# Counts the Loughran matches
loughran_matches <- wotw_tidy %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  nrow()
Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1354 of `x` matches multiple rows in `y`.
ℹ Row 2772 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Counts the Jockers matches
jockers_matches <- wotw_tidy %>%
  inner_join(jockers_lex, by = c("word" = "x")) %>%
  nrow()

# Builds the coverage summary table
coverage_rates <- tibble(
  Lexicon = c("Loughran-McDonald", "Jockers"),
  Matched_Words = c(loughran_matches, jockers_matches),
  Total_Words = total_wotw_words,
  Coverage_Percentage = c((loughran_matches / total_wotw_words) * 100, 
                          (jockers_matches / total_wotw_words) * 100)
)

# Plots the Coverage Rates
coverage_rates %>%
  ggplot(aes(x = Lexicon, y = Coverage_Percentage / 100, fill = Lexicon)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(aes(label = percent(Coverage_Percentage / 100, accuracy = 0.1)), 
            vjust = -0.8, 
            size = 4.5, 
            color = "gray20",
            fontface = "bold") +
  scale_y_continuous(labels = percent_format(), 
                     expand = expansion(mult = c(0, 0.15))) +
  scale_fill_manual(values = c("Jockers" = "steelblue", 
                               "Loughran-McDonald" = "darkred")) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, face = "bold")) +
  labs(title = "Lexicon Sparsity: Model Coverage in 'War of the Worlds'",
       subtitle = "Percentage of total corpus words successfully mapped to a sentiment score",
       x = "Sentiment Dictionary",
       y = "Corpus Coverage Rate")

The resulting chart shows that Jockers has a substantially higher coverage rate than Loughran-McDonald. Loughran’s score is driven by the small subset of words that the lexicon contains, which leads to a coverage rate of only 5.2%, compared with 11.8% for Jockers.
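
One caveat about the counting step: the number of rows returned by inner_join() counts a token once per lexicon label, so a word listed under several categories is counted more than once. A possible alternative, sketched here on made-up data (toy_tokens and toy_lexicon are hypothetical), is semi_join(), which keeps each matching token exactly once.

```r
library(dplyr)

# Toy corpus tokens and toy lexicon (both hypothetical).
# "risk" is listed under two categories, as can happen in multi-category lexicons.
toy_tokens <- tibble(word = c("the", "risk", "of", "loss", "risk", "was", "great"))
toy_lexicon <- tibble(
  word      = c("risk", "risk", "loss"),
  sentiment = c("negative", "uncertainty", "negative")
)

# inner_join() returns one row per token-label pair, so each "risk" counts twice.
# relationship = "many-to-many" only acknowledges that this is expected.
joined_rows <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") %>%
  nrow()

# semi_join() keeps each matching token once, however many labels it has
matched_tokens <- toy_tokens %>%
  semi_join(toy_lexicon, by = "word") %>%
  nrow()

coverage_join <- joined_rows / nrow(toy_tokens)     # inflated by duplicate labels
coverage_semi <- matched_tokens / nrow(toy_tokens)  # share of tokens with any match
```

In this toy case the join-based figure is 5/7 while the token-based figure is 3/7, so the choice of counting method can change a reported coverage rate noticeably.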

Austen vs. Wells: The Inverted Valence Paradox

We wanted to demonstrate why The War of the Worlds (the extension) required customized data cleaning when the original analysis did not. To do this, we compared how a standard lexicon (Bing, used in the original Austen example) interprets positive sentiment in our two texts.

# Austen Data: Calculate the totals and percentages before slicing
austen_positive <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  mutate(total_positive = sum(n),               
         percent = n / total_positive,          
         Author = "Jane Austen (Emma)") %>%
  slice_max(n, n = 10)
Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 32813 of `x` matches multiple rows in `y`.
ℹ Row 4099 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Wells Data: Calculate the totals and percentages before slicing
wells_positive <- wotw_tidy %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  mutate(total_positive = sum(n),               
         percent = n / total_positive,          
         Author = "H.G. Wells (WotW)") %>%
  slice_max(n, n = 10)

# Combine and plot with external labels (labels inside the short bars would be cut off)
bind_rows(austen_positive, wells_positive) %>%
  mutate(word = reorder_within(word, n, Author)) %>%
  ggplot(aes(x = n, y = word, fill = Author)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(n, " (", percent(percent, accuracy = 0.1), ")")),
            hjust = -0.1, 
            size = 3.5, 
            color = "gray20") +
  scale_y_reordered() +
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) + 
  facet_wrap(~Author, scales = "free") +
  scale_fill_manual(values = c("Jane Austen (Emma)" = "darkgreen", 
                               "H.G. Wells (WotW)" = "darkred")) +
  theme_minimal() +
  labs(title = "The 'Inverted Valence' Paradox in Sentiment Analysis",
       subtitle = "Labeled with raw count and percentage of total positive sentiment",
       x = "Word Frequency",
       y = NULL)

Conclusion

The application of sentiment lexicons to H.G. Wells’s ‘The War of the Worlds’ revealed a stark contrast to the baseline Jane Austen example. This is primarily due to domain mismatch (two very different genres) and the “inverted valence” paradox. Austen’s texts aligned with standard lexicons to produce intuitive sentiment trajectories, which seemed to be driven heavily by “emotional” vocabulary. Wells’s science fiction narrative initially inflated the positive scores because its pseudo-journalistic narrator uses words like “great” and “well” to describe catastrophic events. This required us to create a custom stop-word list to mitigate the resulting contextual noise and correct the trajectory. Furthermore, although the Loughran-McDonald accounting lexicon produced a seemingly accurate negative overall sentiment, our coverage check showed that this rested on extremely sparse data (5.2% coverage): the lexicon matched only a fraction of the text, and it flagged words as financial risks rather than as indicators of narrative despair. Ultimately, our extension demonstrates that lexicon-based sentiment scores are highly sensitive to the corpus’s domain. Without contextual cleaning and coverage validation, raw lexicon scores can easily lead to false conclusions. Rather than simply accepting what the sentiment analysis gives you, you should check whether the output aligns with the context of the text or whether a mistake or discrepancy may lie behind it. Some “common sense” is required, and you should always make sure that you understand the output.

An interesting additional use case for this may be a book recommendation engine. It could compute a sentiment trajectory for each book and group books that fit a “theme” or a specific trajectory. Books could then be matched to users based on those scores, similar to the movie-recommendation algorithm we used earlier in the semester.
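
As a rough illustration of how trajectories might be compared for such an engine, the sketch below resamples trajectories of different lengths onto a common "narrative time" axis and correlates them. Everything here is hypothetical: the three toy trajectories, the helper names resample() and shape_similarity(), and the choice of correlation as the similarity measure are our assumptions, not an established method.

```r
# Toy net-sentiment trajectories (hypothetical numbers) for three books
# of different lengths
traj_a <- c(2, 4, 1, -3, -5, -2, 3, 6)
traj_b <- c(1, 3, 0, -2, -4, 0, 4)
traj_c <- c(-4, -2, 1, 3, 2, 0, -3, -5, -6, -4)

# Resample a trajectory to a fixed number of points over narrative time 0..1
# using linear interpolation
resample <- function(x, n_points = 20) {
  approx(x = seq(0, 1, length.out = length(x)),
         y = x,
         xout = seq(0, 1, length.out = n_points))$y
}

# Similarity of two story shapes: correlation of the resampled curves
shape_similarity <- function(x, y) cor(resample(x), resample(y))

sim_ab <- shape_similarity(traj_a, traj_b)
sim_ac <- shape_similarity(traj_a, traj_c)
c(a_vs_b = sim_ab, a_vs_c = sim_ac)
```

Books whose curves correlate strongly with a title the user liked could then be ranked higher, although a real system would need to validate that trajectory shape reflects what readers care about.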

Citations

Google DeepMind. (2026). Gemini 3 Flash [Large language model]. https://gemini.google.com. Accessed April 19th, 2026.

APA 7th Edition Citation (Software/Packages):

Jane Austen Corpus: Silge, J. (2020). janeaustenr: Jane Austen’s complete novels (Version 1.0.0) [Computer software]. CRAN. https://CRAN.R-project.org/package=janeaustenr

H.G. Wells Corpus: Robinson, D. (2020). gutenbergr: Download and process public domain works from Project Gutenberg (Version 0.2.0) [Computer software]. CRAN. https://CRAN.R-project.org/package=gutenbergr