For this assignment, we were asked to work through the original code in Chapter 2 of the Text Mining with R textbook and then extend it with a new text and an additional lexicon.
Load these before you begin. Please install if you do not already have the packages:
# Setup and Dependency Loading
library(tidyverse) # Core data manipulation and visualization
library(tidytext) # Text mining framework
library(textdata) # Access to sentiment lexicons (AFINN, NRC)
library(janeaustenr) # Austen corpus
library(gutenbergr) # Project Gutenberg corpus access
library(lexicon) # Jockers sentiment lexicon (hash_sentiment_jockers)
library(wordcloud) # Word cloud visualization
library(reshape2) # Required for comparison.cloud()
library(scales) # Formatting axes and labels (percents)
Methodology:
For this assignment, we reproduce the base sentiment analysis example from Chapter 2 of Text Mining with R, using the tidytext, dplyr and stringr packages. The goal is to follow the tidy text philosophy, which relies on functions such as unnest_tokens() and inner_join(). We reproduce the sentiment trajectories of Jane Austen’s novels using the janeaustenr package, and then use the gutenbergr package to select another work with a significantly different tone. We also perform a comparative validation of the lexicons: we calculate the correlation between the sentiment scores that different lexicons produce for the same segments of text, which shows how much they agree and where they diverge. One data challenge we anticipate is lexicon sparsity. The lexicons are finite, and many words in the corpus may not appear in them, so we calculate the coverage rate to determine whether the resulting sentiment score is actually representative of the text.
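The comparative validation described above can be sketched on made-up numbers. This is only an illustration: the names segment_scores, lexicon_a and lexicon_b are hypothetical, and the scores are invented rather than taken from either novel.

```r
library(dplyr)
library(tidyr)

# Hypothetical net-sentiment scores for the same five text segments,
# as two different lexicons might produce them (all values are made up).
segment_scores <- tibble::tibble(
  index   = rep(1:5, times = 2),
  lexicon = rep(c("lexicon_a", "lexicon_b"), each = 5),
  score   = c(4, -2, 1, -5, 3,   # lexicon_a
              6, -1, 0, -8, 2)   # lexicon_b
)

# One row per segment, one column per lexicon
wide_scores <- segment_scores %>%
  pivot_wider(names_from = lexicon, values_from = score)

# Pearson correlation: values near 1 mean the lexicons rise and fall together
agreement <- cor(wide_scores$lexicon_a, wide_scores$lexicon_b)

# Segments where the two lexicons disagree on the sign of the sentiment
diverging <- wide_scores %>%
  filter(sign(lexicon_a) != sign(lexicon_b))
```

A high correlation alone does not show that either lexicon is right; it only shows that they move together, which is why the coverage check is still needed.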
Step 1: Jane Austen
The function get_sentiments() retrieves a specific sentiment lexicon, each of which has its own scoring scheme. The NRC lexicon, for example, tags each word with one or more emotion categories:
# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
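As a quick check that needs no extra download, the Bing lexicon ships with tidytext itself (AFINN, NRC and Loughran are fetched through textdata). A minimal sketch, where the object name bing is our own choice:

```r
library(dplyr)
library(tidytext)

# The Bing lexicon labels each word as positive or negative
bing <- get_sentiments("bing")

# How many words fall under each label
bing_label_counts <- bing %>%
  count(sentiment)
```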
When looking for words with a joy score from the NRC lexicon, we first need to take the text of the novels and convert the text to the tidy format using unnest_tokens(). The function below does that and also sets up some other columns to keep track of which line and chapter of the book each word comes from.
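The function itself is not displayed in this rendered output. The version from Chapter 2 of the textbook, which we assume is what was run here because later code refers to tidy_books, looks like this:

```r
library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    # Line position within each book
    linenumber = row_number(),
    # Running chapter counter: increases whenever a line starts a new chapter
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) %>%
  ungroup() %>%
  # One row per word
  unnest_tokens(word, text)
```

Lines that appear before the first chapter heading get chapter 0, which is why later steps filter out chapter 0.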
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Now we are able to plot these sentiment scores across the plot trajectory of each novel.
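The trajectory is built by grouping sentiment words into fixed-size bins of lines with integer division (%/%) and subtracting negative counts from positive counts. A minimal sketch on invented data (toy_hits and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Made-up data: one row per sentiment-bearing word and the line it occurs on
toy_hits <- tibble::tibble(
  linenumber = c(3, 10, 79, 80, 95, 150, 161, 170),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "negative", "positive", "positive")
)

net_by_bin <- toy_hits %>%
  # Lines 0-79 fall in bin 0, lines 80-159 in bin 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per label; bins with no words for a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment per bin
  mutate(sentiment = positive - negative)
```

Here the three bins come out as 1, -3 and 2. Smaller bins give a noisier curve, while larger bins can wash out the narrative structure.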
To choose only the words from one novel we’re interested in, we’ll use filter(). In this case we’re using all three sentiment lexicons to examine how the sentiment changes across the narrative arc of Pride and Prejudice.
# A tibble: 122,204 × 4
book linenumber chapter word
<fct> <int> <int> <chr>
1 Pride & Prejudice 1 0 pride
2 Pride & Prejudice 1 0 and
3 Pride & Prejudice 1 0 prejudice
4 Pride & Prejudice 3 0 by
5 Pride & Prejudice 3 0 jane
6 Pride & Prejudice 3 0 austen
7 Pride & Prejudice 7 1 chapter
8 Pride & Prejudice 7 1 1
9 Pride & Prejudice 10 1 it
10 Pride & Prejudice 10 1 is
# ℹ 122,194 more rows
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Having estimated the net sentiment with each lexicon, we bind the estimates together and visualize them.
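A sketch of this binding step, using made-up per-bin scores in place of the real AFINN, Bing and NRC results (the *_toy objects are hypothetical stand-ins):

```r
library(dplyr)
library(ggplot2)

# Made-up net sentiment for four bins from three lexicon runs
afinn_toy <- tibble::tibble(index = 0:3, sentiment = c(12, -4, 7, 15), method = "AFINN")
bing_toy  <- tibble::tibble(index = 0:3, sentiment = c(5, -6, 2, 9),   method = "Bing et al.")
nrc_toy   <- tibble::tibble(index = 0:3, sentiment = c(14, 3, 10, 20), method = "NRC")

# Stack the three results; the method column records where each row came from
combined <- bind_rows(afinn_toy, bing_toy, nrc_toy)

# One panel per lexicon, each with its own y scale
p <- ggplot(combined, aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
```

Free y scales matter because the lexicons score on different ranges, so the shapes are comparable even when the absolute values are not.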
bing_word_counts
# A tibble: 2,585 × 3
word sentiment n
<chr> <chr> <int>
1 miss negative 1855
2 well positive 1523
3 good positive 1380
4 great positive 981
5 like positive 725
6 better positive 639
7 enough positive 613
8 happy positive 534
9 love positive 495
10 pleasure positive 462
# ℹ 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)
The word “miss” is an anomaly: the Bing lexicon codes it as negative, but in Austen’s novels it is mostly used as a title for young, unmarried women. We can add “miss” to a custom stop-word list using bind_rows():
# A tibble: 1,150 × 2
word lexicon
<chr> <chr>
1 miss custom
2 a SMART
3 a's SMART
4 able SMART
5 about SMART
6 above SMART
7 according SMART
8 accordingly SMART
9 across SMART
10 actually SMART
# ℹ 1,140 more rows
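The code that produced this table is not displayed. A sketch in the style of the textbook, assuming the list is called custom_stop_words, with a made-up token table to show the effect:

```r
library(dplyr)
library(tidytext)

# Prepend the custom entry to the stop_words table that ships with tidytext
custom_stop_words <- bind_rows(
  tibble::tibble(word = "miss", lexicon = "custom"),
  stop_words
)

# Made-up tokens to illustrate the filtering step
toy_tokens <- tibble::tibble(word = c("miss", "bennet", "the", "pemberley"))

# anti_join() keeps only the tokens that are NOT in the stop-word list
toy_cleaned <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")
```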
We can visualize this in different types of word clouds.
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
Some sentiment analysis algorithms look beyond only unigrams to try to understand the sentiment of a sentence as a whole. In those scenarios, we may want to tokenize text into sentences.
Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
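Both options can be sketched on a tiny invented text. The text, the object names and the exact pattern are assumptions for illustration, not the code that was run on the Austen corpus:

```r
library(dplyr)
library(tidytext)

# Made-up text held in a single string so that tokens can span the whole text
toy_text <- tibble::tibble(
  text = paste(
    "A Toy Story.",
    "CHAPTER 1 It was a bright day. Nobody was worried.",
    "CHAPTER 2 Then the storm came. Everyone ran inside!"
  )
)

# Sentence tokens: one row per sentence
toy_sentences <- toy_text %>%
  unnest_tokens(sentence, text, token = "sentences")

# Regex tokens: split wherever a chapter heading occurs
toy_chapters <- toy_text %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "CHAPTER [\\dIVXLC]+")
```

The first regex token holds whatever precedes the first heading (here the title), which corresponds to "chapter 0" in the Austen data.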
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by book and chapter.
ℹ Output is grouped by book.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(book, chapter))` for per-operation grouping
(`?dplyr::dplyr_by`) instead.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
Joining with `by = join_by(word)`
`summarise()` has regrouped the output.
# A tibble: 6 × 5
book chapter negativewords words ratio
<fct> <int> <int> <int> <dbl>
1 Sense & Sensibility 43 161 3405 0.0473
2 Pride & Prejudice 34 111 2104 0.0528
3 Mansfield Park 46 173 3685 0.0469
4 Emma 15 151 3340 0.0452
5 Northanger Abbey 21 149 2982 0.0500
6 Persuasion 4 62 1807 0.0343
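The ratio column is the number of negative words in a chapter divided by the total number of words in that chapter. A small sketch with invented tokens and an invented negative-word list (all toy_* names are hypothetical) shows the arithmetic:

```r
library(dplyr)

# Made-up tidy tokens for two chapters
toy_tokens <- tibble::tibble(
  chapter = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  word    = c("a", "sad", "day", "indeed", "no", "fear", "no", "pain", "here")
)

# Made-up list standing in for the negative half of a lexicon
toy_negative <- tibble::tibble(word = c("sad", "fear", "pain"))

# Denominator: total words per chapter
toy_wordcounts <- toy_tokens %>%
  count(chapter, name = "words")

toy_ratio <- toy_tokens %>%
  semi_join(toy_negative, by = "word") %>%   # keep only negative words
  count(chapter, name = "negativewords") %>% # numerator per chapter
  left_join(toy_wordcounts, by = "chapter") %>%
  mutate(ratio = negativewords / words)
```

Chapter 1 has 1 negative word out of 4 (0.25) and chapter 2 has 2 out of 5 (0.40). Normalizing by chapter length keeps long chapters from dominating simply because they contain more words.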
Step 2: ‘The War of the Worlds’ Extension
The novel ‘The War of the Worlds’ by H.G. Wells is the text for our extended sentiment analysis.
# Grab the text from gutenbergr & convert to tidy text
# (Project Gutenberg ebook 36 is The War of the Worlds)
wotw_tidy <- gutenberg_download(36) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))
  ) %>%
  unnest_tokens(word, text)
Using mirror https://aleph.pglaf.org.
# Load the Jockers hash table for literary sentiment
jockers_lex <- lexicon::hash_sentiment_jockers

# Create a Loughran-McDonald baseline dataframe
loughran_wotw <- wotw_tidy %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(method = "Loughran")
Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1354 of `x` matches multiple rows in `y`.
ℹ Row 2772 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# Plot the Loughran baseline to prove the initial negative trajectory
loughran_wotw %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(fill = "midnightblue") +
  theme_minimal() +
  labs(
    title = "Baseline Sentiment: 'War of the Worlds' (Loughran)",
    subtitle = "Pre-cleaning trajectory using a financial/technical lexicon",
    x = "Narrative Progress (80-line bins)",
    y = "Net Sentiment Score"
  )
This initial exploration shows that, under the Loughran-McDonald lexicon, the net sentiment of the novel is predominantly negative across the whole plot. In contrast, Pride and Prejudice leaned heavily positive throughout its plot.
Note that when we reviewed the Loughran-McDonald output against the context of the book, we noticed a discrepancy that did not make sense: when calculating the intensity of despair in Wells’s book, it returned a positive sentiment trajectory. On further investigation, we realized that although the Loughran-McDonald dictionary does ‘hit’ words that appear in our text, it is built for accounting and finance and is therefore largely unable to pick up the narrative context or tone of the text it is applied to. Some of the words that Wells uses match Loughran’s “risk” vocabulary, which is why this sentiment analysis yields a negative overall sentiment.
However, a deeper look at standard word contributions reveals a significant amount of contextual noise. To combat this “noise”, we removed specific words due to the context in which they are used within the science fiction narrative.
Note that later on we switch to the Jockers lexicon and compare it against Loughran-McDonald (out of curiosity).
Irrelevant words in Context
wells_noise <- tibble(
  word = c("miss", "well", "object", "like", "great", "good", "enough", "perfectly"),
  lexicon = "custom"
)

# Run the tidy pipe with an anti_join to remove noise
wotw_cleaned <- wotw_tidy %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(wells_noise, by = "word")

# Plot Word Contribution with External Labels to show the noise is gone
wotw_cleaned %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  mutate(percent = n / sum(n), word = reorder(word, n)) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  geom_text(
    aes(label = paste0(n, " (", percent(percent, accuracy = 0.1), ")")),
    hjust = -0.1, size = 3.2, color = "gray20", fontface = "bold"
  ) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) +
  facet_wrap(~sentiment, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "Word Contribution (Cleaned Context)",
    x = "Count (n)",
    y = NULL
  )
Jockers and Loughran Sentiment Trajectories
Here is the sentiment trajectory plot after we applied the anti_join to remove the contextual noise. We compare the literary Jockers lexicon against the financial Loughran-McDonald lexicon.
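The code that builds the two trajectories is not displayed. For a numeric lexicon such as Jockers, a trajectory can be sketched by summing word scores within each bin. The example below uses invented tokens and an invented lexicon; the x / y column layout is an assumption modeled on the by = c("word" = "x") join used elsewhere in this report, and the 100-line bin matches the plot's axis label.

```r
library(dplyr)

# Made-up tokens with their line numbers
toy_tokens <- tibble::tibble(
  linenumber = c(5, 40, 99, 120, 180, 250),
  word       = c("calm", "terror", "hope", "terror", "ruin", "hope")
)

# Made-up numeric lexicon: x holds the word, y holds its score
toy_numeric_lex <- tibble::tibble(
  x = c("calm", "hope", "terror", "ruin"),
  y = c(0.5, 0.75, -1, -0.75)
)

toy_trajectory <- toy_tokens %>%
  inner_join(toy_numeric_lex, by = c("word" = "x")) %>%
  # 100-line bins: lines 0-99 form bin 0, lines 100-199 form bin 1, ...
  group_by(index = linenumber %/% 100) %>%
  # Total score per bin
  summarise(sentiment = sum(y), .groups = "drop") %>%
  mutate(lexicon = "Toy numeric lexicon")
```

A categorical lexicon such as Loughran-McDonald would instead count positive and negative labels per bin and take the difference, so the two panels are on different scales.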
Warning in inner_join(., get_sentiments("loughran"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 513 of `x` matches multiple rows in `y`.
ℹ Row 656 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# Combine and Plot the Differences
bind_rows(jockers_trajectory, loughran_trajectory) %>%
  ggplot(aes(index, sentiment, color = lexicon)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed", color = "gray30") +
  facet_wrap(~lexicon, scales = "free_y", ncol = 1) +
  theme_minimal() +
  labs(
    title = "Sentiment Trajectory after Contextual Cleaning",
    subtitle = "Cleaned data excludes: miss, well, object, like, great, good, enough, perfectly",
    x = "Narrative Progress (100-line bins)",
    y = "Total Sentiment Score"
  )
`geom_smooth()` using formula = 'y ~ x'
Calculating Lexicon Coverage Rates
Here, we calculate each lexicon’s coverage rate, the share of tokens in the corpus that match a lexicon entry, to judge how representative the resulting sentiment scores are of the text.
# Gets the total number of words in the corpus
total_wotw_words <- nrow(wotw_tidy)

# Counts the Loughran matches
loughran_matches <- wotw_tidy %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  nrow()
# Counts the Jockers matches
jockers_matches <- wotw_tidy %>%
  inner_join(jockers_lex, by = c("word" = "x")) %>%
  nrow()

# Defines the data pipeline
coverage_rates <- tibble(
  Lexicon = c("Loughran-McDonald", "Jockers"),
  Matched_Words = c(loughran_matches, jockers_matches),
  Total_Words = total_wotw_words,
  Coverage_Percentage = c(
    (loughran_matches / total_wotw_words) * 100,
    (jockers_matches / total_wotw_words) * 100
  )
)

# Plots the Coverage Rates
coverage_rates %>%
  ggplot(aes(x = Lexicon, y = Coverage_Percentage / 100, fill = Lexicon)) +
  geom_col(show.legend = FALSE, width = 0.5) +
  geom_text(
    aes(label = percent(Coverage_Percentage / 100, accuracy = 0.1)),
    vjust = -0.8, size = 4.5, color = "gray20", fontface = "bold"
  ) +
  scale_y_continuous(labels = percent_format(), expand = expansion(mult = c(0, 0.15))) +
  scale_fill_manual(values = c("Jockers" = "steelblue", "Loughran-McDonald" = "darkred")) +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12, face = "bold")) +
  labs(
    title = "Lexicon Sparsity: Model Coverage in 'War of the Worlds'",
    subtitle = "Percentage of total corpus words successfully mapped to a sentiment score",
    x = "Sentiment Dictionary",
    y = "Corpus Coverage Rate"
  )
The resulting chart shows that Jockers covers more than twice as much of the text as Loughran-McDonald: 11.8% versus 5.2%. Loughran’s score is driven by the small subset of words that its finance-oriented vocabulary contains.
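One caveat, offered as a sketch rather than a correction of the figures above: the many-to-many warnings on the Loughran joins indicate that some words carry more than one category in that lexicon, so counting rows after an inner_join() counts such tokens once per category. A semi_join() counts each token at most once. The toy data below is invented to show the difference, and the relationship argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up tokens
toy_tokens <- tibble::tibble(
  word = c("the", "loss", "was", "great", "loss", "again")
)

# Made-up lexicon in which "loss" carries two categories
toy_lexicon <- tibble::tibble(
  word      = c("loss", "loss", "great"),
  sentiment = c("negative", "uncertainty", "positive")
)

total_tokens <- nrow(toy_tokens)

# inner_join: one row per token-category pair, so each "loss" counts twice
inner_matches <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many") %>%
  nrow()

# semi_join: keeps each matching token exactly once
semi_matches <- toy_tokens %>%
  semi_join(toy_lexicon, by = "word") %>%
  nrow()

coverage_inner <- inner_matches / total_tokens # 5 of 6 rows
coverage_semi  <- semi_matches / total_tokens  # 3 of 6 tokens
```

If the real lexicon behaves like this toy one, the Loughran-McDonald coverage reported above would be an upper bound, which would only strengthen the sparsity argument.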
Austen vs. Wells: The Inverted Valence Paradox
We wanted to demonstrate why The War of the Worlds (the extension) required customized data cleaning compared to the original analysis. To do this, we compared how a standard lexicon (Bing, which is used in the original Austen example) interprets positive sentiment in our two different texts.
# Austen Data: Calculate the totals and percentages before slicing
austen_positive <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  mutate(
    total_positive = sum(n),
    percent = n / total_positive,
    Author = "Jane Austen (Emma)"
  ) %>%
  slice_max(n, n = 10)
Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 32813 of `x` matches multiple rows in `y`.
ℹ Row 4099 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# Wells Data: Calculate the totals and percentages before slicing
wells_positive <- wotw_tidy %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE) %>%
  mutate(
    total_positive = sum(n),
    percent = n / total_positive,
    Author = "H.G. Wells (WotW)"
  ) %>%
  slice_max(n, n = 10)

# Combine and Plot with external labels (due to labels on the inside being cut off due to short bars)
bind_rows(austen_positive, wells_positive) %>%
  mutate(word = reorder_within(word, n, Author)) %>%
  ggplot(aes(x = n, y = word, fill = Author)) +
  geom_col(show.legend = FALSE) +
  geom_text(
    aes(label = paste0(n, " (", percent(percent, accuracy = 0.1), ")")),
    hjust = -0.1, size = 3.5, color = "gray20"
  ) +
  scale_y_reordered() +
  scale_x_continuous(expand = expansion(mult = c(0, 0.2))) +
  facet_wrap(~Author, scales = "free") +
  scale_fill_manual(values = c("Jane Austen (Emma)" = "darkgreen", "H.G. Wells (WotW)" = "darkred")) +
  theme_minimal() +
  labs(
    title = "The 'Inverted Valence' Paradox in Sentiment Analysis",
    subtitle = "Labeled with raw count and percentage of total positive sentiment",
    x = "Word Frequency",
    y = NULL
  )
Conclusion
In conclusion, applying sentiment lexicons to H.G. Wells’s ‘The War of the Worlds’ revealed a stark contrast to the baseline Jane Austen example. This is primarily due to domain mismatch (two very different genres) and the “inverted valence” paradox. Austen’s texts aligned with standard lexicons to produce intuitive sentiment trajectories, which seemed to be heavily driven by very “emotional” vocabulary. Wells’s science fiction narrative initially inflated the positive scores because its pseudo-journalistic narrator uses words like “great” and “well” to describe catastrophic events. This required us to create a custom stop-word dictionary to mitigate the resulting contextual noise and correct the model’s trajectory. Furthermore, although the Loughran-McDonald accounting lexicon produced a seemingly accurate negative overall sentiment, our validation testing revealed that this rested on extreme data sparsity (5.2% coverage): the lexicon covered only a fraction of the text, flagging words as financial risks rather than as indicators of narrative despair.
Ultimately, our extension demonstrates that sentiment models are highly sensitive to the domain of the corpus. Without contextual feature engineering and coverage validation, raw lexicon scores can easily lead to false analytical conclusions. Rather than simply accepting what the sentiment analysis returns, you should check whether the output aligns with the context of the text or whether a mistake or discrepancy lies behind it. Some “common sense” is required, and you should always make sure that you understand the output.
An interesting additional use case for this may be a book recommendation engine. It could build sentiment trajectories for books and group those that fit a “theme” or a specific trajectory. Then, depending on the recommendation engine, books could be matched to users based on the score given to each book. This would be similar to the movie-recommendation algorithm we used earlier in the semester.
Citations
Google DeepMind. (2026). Gemini 3 Flash [Large language model]. https://gemini.google.com. Accessed April 19th, 2026.
APA 7th Edition Citation (Software/Packages):
Jane Austen Corpus: Silge, J. (2020). janeaustenr: Jane Austen’s complete novels (Version 1.0.0) [Computer software]. CRAN. https://CRAN.R-project.org/package=janeaustenr
H.G. Wells Corpus: Robinson, D. (2020). gutenbergr: Download and process public domain works from Project Gutenberg (Version 0.2.0) [Computer software]. CRAN. https://CRAN.R-project.org/package=gutenbergr