This assignment is based on the sentiment analysis tutorial from Text Mining with R by Julia Silge and David Robinson. The primary example code in this document is adapted from Chapter 2: Sentiment Analysis.
Citation:
Silge, Julia, and David Robinson. Text Mining with R. O’Reilly
Media, 2017. https://www.tidytextmining.com/sentiment.html
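The goal is to first recreate the core example from the chapter and then extend it in two ways: by analyzing a different corpus downloaded with the gutenbergr package, and by incorporating an additional lexicon, the NRC Emotion Lexicon.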
The primary example uses the janeaustenr corpus and the
bing sentiment lexicon to analyze how sentiment changes
across the narrative of each of Jane Austen’s novels.
I begin by converting the Austen books into a tidy text format, one token per row.
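# Packages used throughout (assumed to be loaded in a setup chunk not shown here)
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)
library(ggplot2)
# Get the Jane Austen book data and add line & chapter numbers
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  # Tokenize the text into words
  unnest_tokens(word, text)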
# Display the structure of the resulting data frame
head(tidy_books)
## # A tibble: 6 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
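The chapter counter in the code above relies on a small trick: `cumsum()` over a logical vector increases by one at every `TRUE`. The following is a minimal base R sketch of that idea on a few made-up lines. The `toy_lines` vector is hypothetical, and `grepl(..., perl = TRUE)` stands in for `str_detect()` so the snippet runs without stringr.

```r
# Hypothetical lines standing in for the `text` column of one book
toy_lines <- c("SENSE AND SENSIBILITY", "", "CHAPTER 1",
               "The family of Dashwood", "had long been settled",
               "CHAPTER II", "The old gentleman died")

# TRUE wherever a line begins with a chapter heading
# (same pattern as above; perl = TRUE so that \d works inside [...])
is_heading <- grepl("^chapter [\\divxlc]", toy_lines,
                    ignore.case = TRUE, perl = TRUE)

# Running total of headings seen so far = chapter number of each line
chapter <- cumsum(is_heading)
print(data.frame(line = toy_lines, chapter = chapter))
```

Lines that come before the first heading get chapter 0, which is why the title lines in the output above show `chapter` 0.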
bing Lexicon
I now perform an inner join with the bing sentiment lexicon to assign a sentiment (positive/negative) to each word.
# Get the Bing sentiment lexicon
bing_sentiments <- get_sentiments("bing")
# Join the sentiments to the book words (reusing the lexicon loaded above)
jane_austen_sentiment <- tidy_books %>%
  inner_join(bing_sentiments, relationship = "many-to-many") %>%
  # Count the occurrences of each sentiment per book and index (80-line segments)
  count(book, index = linenumber %/% 80, sentiment) %>%
  # Spread the data to have separate columns for positive and negative
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Calculate net sentiment (positive - negative)
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
# Display the result
head(jane_austen_sentiment)
## # A tibble: 6 × 5
## book index negative positive sentiment
## <fct> <dbl> <int> <int> <int>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
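The `index` column comes from integer division of the line number by 80, and net sentiment is the positive count minus the negative count within each bin. Here is a small base R sketch of both steps. The line numbers and labels are invented for illustration, and `table()` replaces the `count()` and `pivot_wider()` steps.

```r
# Hypothetical sentiment-bearing words with the line each came from
linenumber <- c(1, 79, 80, 159, 160, 161)
sentiment  <- c("positive", "negative", "positive",
                "positive", "negative", "negative")

# Integer division assigns each line to an 80-line bin
index <- linenumber %/% 80

# Cross-tabulate bins against sentiment, then take positive minus negative
counts <- table(index, sentiment)
net <- counts[, "positive"] - counts[, "negative"]
print(net)
```

Because `row_number()` starts at 1, bin 0 covers lines 1 to 79 and is one line shorter than the later bins, which cover 80 lines each.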
Plot the sentiment across the progression of each novel.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  labs(title = "Sentiment Arc of Jane Austen's Novels",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment (Positive - Negative)") +
  theme_minimal()
Interpretation: This plot recreates the one from the book, showing how net sentiment rises and falls over the course of each novel. The swings plausibly track plot events, although this analysis does not test that link.
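For the first extension, I analyze a different corpus downloaded with the gutenbergr package. The corpus was meant to consist of H.G. Wells’s science fiction novels, as a contrast in genre and style to Jane Austen’s works. Of the four Project Gutenberg IDs used below, two return Wells novels (The Time Machine and The War of the Worlds). The other two return The Lost Continent and The Wandering Jew (Volume 11), which are not by Wells, so results for those titles should not be read as describing his work.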
# Install and load the gutenbergr package if not already installed
#install.packages("gutenbergr")
library(gutenbergr)
# Download four texts by Project Gutenberg ID (35 and 36 are H.G. Wells novels;
# the titles printed below show that 149 and 3349 are not)
hgwells_books <- gutenberg_download(c(35, 36, 149, 3349), meta_fields = "title")
## Determining mirror for Project Gutenberg from
## https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.
# Process the data similarly to the Austen example
tidy_hgwells <- hgwells_books %>%
  group_by(title) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
# Display the titles we are working with
unique(tidy_hgwells$title)
## [1] "The Time Machine" "The War of the Worlds"
## [3] "The Lost Continent" "The Wandering Jew — Volume 11"
I repeat the sentiment analysis process using the bing
lexicon on this new corpus.
hgwells_sentiment <- tidy_hgwells %>%
  # Declare the many-to-many match explicitly, as in the Austen join
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
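The `relationship = "many-to-many"` argument matters in these joins because a word can occur many times in the text and can also match more than one lexicon row. A rough base R sketch of how such a join multiplies rows is shown below. Both data frames are hypothetical toy data, and `merge()` is used in place of `inner_join()`.

```r
# Hypothetical tokens: "envious" occurs twice in the text
tokens <- data.frame(word = c("happy", "envious", "envious"))

# Hypothetical lexicon: "envious" is listed under both sentiments
lexicon <- data.frame(word = c("happy", "envious", "envious"),
                      sentiment = c("positive", "positive", "negative"))

# Inner join on `word`: each of the 2 "envious" tokens matches 2 lexicon rows
joined <- merge(tokens, lexicon, by = "word")
print(joined)
print(nrow(joined))
```

Three tokens become five joined rows here, so a word with two labels contributes to both the positive and the negative counts.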
# Plot the sentiment arc for H.G. Wells's novels
ggplot(hgwells_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  labs(title = "Sentiment Arc of H.G. Wells's Science Fiction Novels",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment (Positive - Negative)") +
  theme_minimal()
Interpretation: Negative sentiment dominates in all four texts. The War of the Worlds is the most negative, in keeping with its apocalyptic subject. The Lost Continent and The Wandering Jew are still negative overall but more balanced. The Time Machine shows deep negative spikes, indicating intense emotional lows at several points in its narrative.
# Count positive and negative words per title and express each as a percentage
hgwells_sentiment_totals <- tidy_hgwells %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(title, sentiment) %>%
  group_by(title) %>%
  mutate(total_words = sum(n),
         percentage = round(n / total_words * 100, 1)) %>%
  ungroup()
## Joining with `by = join_by(word)`
# Display the sentiment totals
hgwells_sentiment_totals
## # A tibble: 8 × 5
## title sentiment n total_words percentage
## <chr> <chr> <int> <int> <dbl>
## 1 The Lost Continent negative 1345 2396 56.1
## 2 The Lost Continent positive 1051 2396 43.9
## 3 The Time Machine negative 1360 2273 59.8
## 4 The Time Machine positive 913 2273 40.2
## 5 The Wandering Jew — Volume 11 negative 2527 4643 54.4
## 6 The Wandering Jew — Volume 11 positive 2116 4643 45.6
## 7 The War of the Worlds negative 2621 3852 68
## 8 The War of the Worlds positive 1231 3852 32
# Create a visualization of sentiment distribution
ggplot(hgwells_sentiment_totals, aes(x = title, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(n, " (", percentage, "%)")),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c("negative" = "#F8766D", "positive" = "#00BFC4")) +
  labs(title = "Total Positive vs Negative Sentiments in H.G. Wells's Novels",
       x = "Novel",
       y = "Word Count",
       fill = "Sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate overall sentiment ratios
hgwells_overall_ratio <- hgwells_sentiment_totals %>%
  select(title, sentiment, n) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(negative_ratio = round(negative / (positive + negative) * 100, 1),
         positive_ratio = round(positive / (positive + negative) * 100, 1))
hgwells_overall_ratio
## # A tibble: 4 × 5
## title negative positive negative_ratio positive_ratio
## <chr> <int> <int> <dbl> <dbl>
## 1 The Lost Continent 1345 1051 56.1 43.9
## 2 The Time Machine 1360 913 59.8 40.2
## 3 The Wandering Jew — Volume 11 2527 2116 54.4 45.6
## 4 The War of the Worlds 2621 1231 68 32
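As a quick arithmetic check, the ratios in the table can be reproduced from the raw counts alone. This base R sketch copies the counts from the output above; the vector names are only for illustration.

```r
# Counts copied from the table above, in the order:
# The Lost Continent, The Time Machine, The Wandering Jew, The War of the Worlds
negative <- c(1345, 1360, 2527, 2621)
positive <- c(1051,  913, 2116, 1231)

# Share of sentiment-bearing words that are negative, in percent
negative_ratio <- round(negative / (negative + positive) * 100, 1)
print(negative_ratio)

# Negative-to-positive ratio for The War of the Worlds
print(round(negative[4] / positive[4], 2))
```

The last value is above 2, meaning negative words outnumber positive words by more than two to one in The War of the Worlds.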
The quantitative analysis reveals clear differences in sentiment distribution across the four texts. The War of the Worlds is the most negatively charged, with 68% of sentiment-carrying words classified as negative, more than double the count of positive words (2,621 versus 1,231). This strong negative skew reflects the novel’s apocalyptic narrative of human civilization under existential threat.
The Time Machine follows with nearly 60% negative sentiment, consistent with its dystopian vision of humanity’s devolved future. The Lost Continent and The Wandering Jew (Volume 11) show more balanced but still negative-leaning distributions, with 56.1% and 54.4% negative sentiment respectively.
Overall, all four works show a clear predominance of negative sentiment, with an unweighted mean of 59.6% negative words across the four titles. This consistent pattern reflects the darker thematic concerns common to these texts, which focus on existential threats, dystopian futures, and human suffering rather than optimistic resolutions. The sentiment ratios are consistent with the critical and cautionary tones characteristic of 19th- and early 20th-century speculative and philosophical fiction.
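The 59.6% figure is the simple average of the four per-title percentages, which weights each title equally regardless of length. Pooling all sentiment-bearing words gives a slightly different number because the longer texts count for more. A short sketch of both calculations, using the counts reported above:

```r
# Negative counts and total sentiment-bearing words per title (from the table above)
negative <- c(1345, 1360, 2527, 2621)
total    <- c(2396, 2273, 4643, 3852)

# Unweighted mean: average of the four per-title shares
unweighted <- mean(negative / total) * 100

# Pooled share: all negative words over all sentiment-bearing words
pooled <- sum(negative) / sum(total) * 100

print(round(c(unweighted = unweighted, pooled = pooled), 1))
```

The two values differ by about a tenth of a percentage point, so the conclusion does not depend on which average is used.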
For the second extension, I incorporate the NRC Emotion Lexicon. The syuzhet package provides access to this lexicon, although the code below loads it through tidytext’s get_sentiments("nrc"). The lexicon categorizes words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and two sentiments (positive, negative).
# Install and load the syuzhet package to get the NRC lexicon
#install.packages("syuzhet")
library(syuzhet)
# The get_sentiments function from tidytext also supports "nrc"
nrc_sentiments <- get_sentiments("nrc")
# Let's see the unique emotions/sentiments in this lexicon
unique(nrc_sentiments$sentiment)
## [1] "trust" "fear" "negative" "sadness" "anger"
## [6] "surprise" "positive" "disgust" "joy" "anticipation"
Let’s analyze the prevalence of different emotions in H.G. Wells’s The War of the Worlds.
# Filter for "The War of the Worlds" and join with NRC lexicon
nrc_war_of_worlds <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  # One word can carry several NRC labels, so the match is many-to-many
  inner_join(nrc_sentiments, relationship = "many-to-many") %>%
  # We don't want the general "positive" and "negative" from NRC for this part
  filter(!sentiment %in% c("positive", "negative"))
## Joining with `by = join_by(word)`
# Count the emotions
emotion_counts <- nrc_war_of_worlds %>%
  count(sentiment, sort = TRUE)
# Plot the emotional profile
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel2") +
  coord_flip() +
  labs(title = "Emotional Profile of 'The War of the Worlds' using NRC Lexicon",
       x = "Emotion",
       y = "Word Count") +
  theme_minimal()
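The emotion counts above depend on two steps: dropping the two polarity labels and tallying what remains, where a single word may contribute to several emotions. A toy base R sketch of those steps follows. The `matches` data frame and its word-to-emotion labels are invented for illustration and are not taken from the NRC lexicon.

```r
# Hypothetical matches after joining tokens to an NRC-style lexicon:
# one word can carry several labels
matches <- data.frame(
  word = c("attack", "attack", "attack", "hope", "hope", "hope", "ruin", "ruin"),
  sentiment = c("anger", "fear", "negative",
                "anticipation", "joy", "positive",
                "fear", "negative")
)

# Drop the two polarity labels, keeping only the emotion labels
emotions <- matches[!matches$sentiment %in% c("positive", "negative"), ]

# Count each emotion and sort from most to least frequent
emotion_counts <- sort(table(emotions$sentiment), decreasing = TRUE)
print(emotion_counts)
```

Three distinct words yield five emotion labels in this toy case, so the bar heights in the plot count labels rather than distinct words.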
bing vs. nrc (Sentiment)
I can also compare the overall positive/negative sentiment trajectory using the nrc lexicon against the initial bing-based analysis for The War of the Worlds.
# Get sentiment using the NRC lexicon (filtering for positive/negative)
sentiment_nrc <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("nrc"), relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Use the same column name ('sentiment') as the Bing data so the two can be stacked
  mutate(sentiment = positive - negative, lexicon = "NRC") %>%
  select(index, sentiment, lexicon)
## Joining with `by = join_by(word)`
# Get sentiment using the Bing lexicon (from our earlier data)
sentiment_bing <- hgwells_sentiment %>%
  filter(title == "The War of the Worlds") %>%
  select(index, sentiment) %>%
  mutate(lexicon = "Bing")
# Combine the data for plotting
combined_sentiment <- bind_rows(sentiment_bing, sentiment_nrc)
# Plot the comparison
ggplot(combined_sentiment, aes(index, sentiment, color = lexicon)) +
  geom_line(linewidth = 1.2, alpha = 0.7) +
  facet_wrap(~lexicon, ncol = 1) +
  labs(title = "Sentiment in 'The War of the Worlds': Bing vs. NRC Lexicon",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment") +
  theme_minimal()
Interpretation: The NRC and Bing lexicons produce different sentiment trajectories. While they often agree on the direction of sentiment shifts (e.g., a peak or a valley), the magnitude and sometimes even the sign can differ. This highlights how the choice of lexicon can influence the results of a sentiment analysis.
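One way to put numbers on "agree on direction but differ in sign" is to compare the two net-sentiment series segment by segment. The sketch below is only an illustration: the two vectors are hypothetical values and not the output of the analysis above.

```r
# Hypothetical net sentiment per 80-line segment under two lexicons
bing_net <- c(-5, -12, 3, -8, -20, 2)
nrc_net  <- c( 2,  -6, 7, -1, -11, 6)

# Share of segments where both lexicons give the same sign
sign_agreement <- mean(sign(bing_net) == sign(nrc_net))

# Correlation measures how closely the rises and falls move together
trajectory_cor <- cor(bing_net, nrc_net)

print(sign_agreement)
print(round(trajectory_cor, 2))
```

A high correlation combined with imperfect sign agreement would correspond to the pattern described above: similar peaks and valleys, with one series shifted up or down relative to the other.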
# Calculate percentages for both lexicons
bing_percentages <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(sentiment) %>%
  mutate(
    total_words = sum(n),
    percentage = round(n / total_words * 100, 1),
    lexicon = "Bing"
  )
## Joining with `by = join_by(word)`
nrc_percentages <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("nrc"), relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  mutate(
    total_words = sum(n),
    percentage = round(n / total_words * 100, 1),
    lexicon = "NRC"
  )
## Joining with `by = join_by(word)`
# Combine both results
lexicon_comparison <- bind_rows(bing_percentages, nrc_percentages) %>%
  select(lexicon, sentiment, n, percentage, total_words) %>%
  arrange(lexicon, sentiment)
# Create a formatted table for better presentation
library(knitr)
kable(lexicon_comparison,
      caption = "Comparison of Positive/Negative Percentages: Bing vs NRC \n Lexicon for 'The War of the Worlds'",
      col.names = c("Lexicon", "Sentiment", "Count", "Percentage", "Total Words"))
| Lexicon | Sentiment | Count | Percentage | Total Words |
|---|---|---|---|---|
| Bing | negative | 2621 | 68.0 | 3852 |
| Bing | positive | 1231 | 32.0 | 3852 |
| NRC | negative | 2425 | 53.7 | 4513 |
| NRC | positive | 2088 | 46.3 | 4513 |
The comparison shows how much the two sentiment lexicons differ. The Bing lexicon classifies The War of the Worlds as substantially more negative (68.0% negative vs. 32.0% positive), portraying the novel as overwhelmingly pessimistic. In contrast, the NRC lexicon presents a more balanced profile (53.7% negative vs. 46.3% positive), suggesting a less extreme narrative tone. The NRC lexicon also matches more words (4,513 versus 3,852), so the two percentages rest on different sets of words.
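The size of the gap between the lexicons can be worked out directly from the counts in the table. A brief base R sketch, with the counts copied from the table above:

```r
# Positive/negative counts for The War of the Worlds (from the table above)
bing <- c(negative = 2621, positive = 1231)
nrc  <- c(negative = 2425, positive = 2088)

# Negative share under each lexicon, in percent
bing_neg_share <- bing[["negative"]] / sum(bing) * 100
nrc_neg_share  <- nrc[["negative"]] / sum(nrc) * 100

# Gap between the two lexicons, in percentage points
gap <- bing_neg_share - nrc_neg_share
print(round(c(bing = bing_neg_share, nrc = nrc_neg_share, gap = gap), 1))
```

The gap is roughly 14 percentage points, even though both lexicons are applied to the same text.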
This assignment replicated the core sentiment analysis example from Text Mining with R and extended it in two ways: applying the same pipeline to a new corpus, and adding the NRC Emotion Lexicon.
The NRC Emotion Lexicon allowed for a more nuanced analysis, breaking down the text into specific emotions and comparing its overall sentiment scoring with that of the bing lexicon. This comparison underscores the importance of understanding the tools and lexicons used in text analysis.
I learned that different sentiment lexicons can yield substantially different results, as shown by the Bing lexicon’s more pessimistic classification versus the NRC lexicon’s more balanced one. The analysis highlighted how methodological choices in text mining, from corpus selection to lexicon application, shape analytical outcomes and literary interpretations.