1. Introduction & Base Code Citation

This assignment is based on the sentiment analysis tutorial from Text Mining with R by Julia Silge and David Robinson. The primary example code in this document is adapted from Chapter 2: Sentiment Analysis.

Citation:
Silge, Julia, and David Robinson. Text Mining with R. O’Reilly Media, 2017. https://www.tidytextmining.com/sentiment.html

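The goal is to first recreate the core example from the chapter and then extend it in two ways: by applying the same analysis to a different corpus, and by incorporating an additional sentiment lexicon.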

2. Recreating the Primary Example

The primary example uses the janeaustenr corpus and the bing sentiment lexicon to analyze how sentiment changes across the narrative of each of Jane Austen’s novels.

2.1. Data Preparation and Tokenization

I begin by converting the Austen books into a tidy text format, one token per row.

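# Load the packages used throughout this report
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(tidytext)
library(janeaustenr)

# Get the Jane Austen book data and add line & chapter numbers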
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  # Tokenize the text into words
  unnest_tokens(word, text)

# Display the structure of the resulting data frame
head(tidy_books)
## # A tibble: 6 × 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen
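
The chapter column is 0 in the rows above because the title-page lines come before the first chapter heading. The following is a minimal base-R sketch of the same running-counter idea. It assumes a short hand-typed excerpt (the `lines` vector is illustrative, not the janeaustenr data) and uses `grepl(..., perl = TRUE)` in place of `str_detect()`:

```r
# Short hand-typed excerpt for illustration; not read from janeaustenr
lines <- c("SENSE AND SENSIBILITY", "by Jane Austen", "CHAPTER 1",
           "The family of Dashwood had long been settled in Sussex.",
           "CHAPTER II",
           "Mrs. John Dashwood now installed herself mistress of Norland.")

# TRUE where a line starts with "chapter" followed by a digit
# or a roman-numeral letter (case-insensitive)
is_heading <- grepl("^chapter [\\divxlc]", lines,
                    ignore.case = TRUE, perl = TRUE)

# The running total of headings seen so far is the chapter number
chapter <- cumsum(is_heading)
print(data.frame(linenumber = seq_along(lines), chapter = chapter))
```

Lines before the first heading keep a running total of zero, which is why the title and author lines are assigned to chapter 0.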

2.2. Performing Sentiment Analysis with the bing Lexicon

I now perform an inner join with the bing sentiment lexicon to assign a sentiment (positive/negative) to each word.

# Get the Bing sentiment lexicon
bing_sentiments <- get_sentiments("bing")

# Join the sentiments to the book words
jane_austen_sentiment <- tidy_books %>%
  inner_join(bing_sentiments, relationship = "many-to-many") %>%
  # Count the occurrences of each sentiment per book and index (80-line segments)
  count(book, index = linenumber %/% 80, sentiment) %>%
  # Spread the data to have separate columns for positive and negative
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  # Calculate net sentiment (positive - negative)
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
# Display the result
head(jane_austen_sentiment)
## # A tibble: 6 × 5
##   book                index negative positive sentiment
##   <fct>               <dbl>    <int>    <int>     <int>
## 1 Sense & Sensibility     0       16       32        16
## 2 Sense & Sensibility     1       19       53        34
## 3 Sense & Sensibility     2       12       31        19
## 4 Sense & Sensibility     3       15       31        16
## 5 Sense & Sensibility     4       16       34        18
## 6 Sense & Sensibility     5       16       51        35
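
The index column comes from integer division of the line number by 80. A small base-R sketch of that binning step, with line numbers chosen by hand to sit on the segment boundaries:

```r
# Line numbers near the segment boundaries (chosen for illustration)
linenumber <- c(1, 79, 80, 159, 160, 161)

# Integer division assigns each line to an 80-line segment
index <- linenumber %/% 80
print(index)

# Net sentiment for a segment is positive minus negative counts;
# these two counts are copied from the first row of the output above
positive <- 32
negative <- 16
net <- positive - negative
print(net)
```

Because line numbering starts at 1, segment 0 holds lines 1 to 79 and is one line shorter than the later segments.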

2.3. Visualizing the Narrative Arc of Sentiment

I now plot the net sentiment across the progression of each novel.

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  labs(title = "Sentiment Arc of Jane Austen's Novels",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment (Positive - Negative)") +
  theme_minimal()

Interpretation: This plot recreates the one from the book, showing how net sentiment fluctuates over the course of each novel, with rises and dips that plausibly track plot events.

3. Extension 1: A Different Corpus

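For the first extension, I analyze a different corpus: four novels downloaded from Project Gutenberg with the gutenbergr package. Two of them, The Time Machine and The War of the Worlds, are science fiction novels by H.G. Wells. The other two IDs returned The Lost Continent and The Wandering Jew (Volume 11), neither of which is a Wells novel (The Wandering Jew is by Eugène Sue), so the corpus is a mixed set of fiction rather than a pure Wells collection. It still provides a contrast in genre and style to Jane Austen’s works.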

# Install and load the gutenbergr package if not already installed
#install.packages("gutenbergr")
library(gutenbergr)

# Download four Project Gutenberg texts by ID, keeping each title as a column
hgwells_books <- gutenberg_download(c(35, 36, 149, 3349), meta_fields = "title")
## Determining mirror for Project Gutenberg from
## https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.
# Process the data similarly to the Austen example
tidy_hgwells <- hgwells_books %>%
  group_by(title) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Display the titles we are working with
unique(tidy_hgwells$title)
## [1] "The Time Machine"              "The War of the Worlds"        
## [3] "The Lost Continent"            "The Wandering Jew — Volume 11"

3.1. Sentiment Analysis on the New Corpus

I repeat the sentiment analysis process using the bing lexicon on this new corpus.

hgwells_sentiment <- tidy_hgwells %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
# Plot the sentiment arc for H.G. Wells's novels
ggplot(hgwells_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x") +
  labs(title = "Sentiment Arc of Four Project Gutenberg Novels",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment (Positive - Negative)") +
  theme_minimal()

Interpretation: Negative sentiment dominates in all four novels. The War of the Worlds is the most consistently negative, in keeping with its apocalyptic subject. The Lost Continent and The Wandering Jew are still net negative but more balanced. The Time Machine shows deep, pronounced negative spikes, indicating intense emotional lows at several points in its narrative.

Next, I calculate the total positive and negative word counts for each book.

hgwells_sentiment_totals <- tidy_hgwells %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(title, sentiment) %>%
  group_by(title) %>%
  mutate(total_words = sum(n),
         percentage = round(n / total_words * 100, 1)) %>%
  ungroup()
## Joining with `by = join_by(word)`
# Display the sentiment totals
hgwells_sentiment_totals
## # A tibble: 8 × 5
##   title                         sentiment     n total_words percentage
##   <chr>                         <chr>     <int>       <int>      <dbl>
## 1 The Lost Continent            negative   1345        2396       56.1
## 2 The Lost Continent            positive   1051        2396       43.9
## 3 The Time Machine              negative   1360        2273       59.8
## 4 The Time Machine              positive    913        2273       40.2
## 5 The Wandering Jew — Volume 11 negative   2527        4643       54.4
## 6 The Wandering Jew — Volume 11 positive   2116        4643       45.6
## 7 The War of the Worlds         negative   2621        3852       68  
## 8 The War of the Worlds         positive   1231        3852       32
# Create a visualization of sentiment distribution
ggplot(hgwells_sentiment_totals, aes(x = title, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = paste0(n, " (", percentage, "%)")), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c("negative" = "#F8766D", "positive" = "#00BFC4")) +
  labs(title = "Total Positive vs Negative Sentiment Words by Novel",
       x = "Novel",
       y = "Word Count",
       fill = "Sentiment") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate overall sentiment ratios
hgwells_overall_ratio <- hgwells_sentiment_totals %>%
  select(title, sentiment, n) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(negative_ratio = round(negative / (positive + negative) * 100, 1),
         positive_ratio = round(positive / (positive + negative) * 100, 1))

hgwells_overall_ratio
## # A tibble: 4 × 5
##   title                         negative positive negative_ratio positive_ratio
##   <chr>                            <int>    <int>          <dbl>          <dbl>
## 1 The Lost Continent                1345     1051           56.1           43.9
## 2 The Time Machine                  1360      913           59.8           40.2
## 3 The Wandering Jew — Volume 11     2527     2116           54.4           45.6
## 4 The War of the Worlds             2621     1231           68             32

Summary of Sentiment Analysis Results

The quantitative analysis shows a clear pattern in sentiment distribution across the four novels. The War of the Worlds is the most negatively charged work: 68% of its sentiment-carrying words are classified as negative (2,621 negative against 1,231 positive, more than double). This strong negative skew is consistent with the novel’s apocalyptic narrative of human civilization under existential threat.

The Time Machine follows with nearly 60% negative sentiment, consistent with its dystopian vision of humanity’s future. The Lost Continent and The Wandering Jew (Volume 11) show more balanced but still negative-leaning distributions, with 56.1% and 54.4% negative sentiment respectively.

Overall, all four works lean negative, with the per-book negative share averaging 59.6% (the unweighted mean of the four percentages above). This consistent pattern underscores the darker thematic concerns common to these texts, which focus on existential threats, dystopian futures, and human suffering rather than optimistic resolutions. The sentiment ratios align with the critical and cautionary tones of nineteenth- and early twentieth-century speculative fiction.
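
The 59.6% figure treats each book equally regardless of length. As a check, the following base-R sketch recomputes it from the counts in the totals table and compares it with a pooled share in which every sentiment word counts once. The vector names are chosen here for illustration:

```r
# Counts copied from the sentiment totals table above
negative <- c(lost_continent = 1345, time_machine = 1360,
              wandering_jew = 2527, war_of_worlds = 2621)
positive <- c(lost_continent = 1051, time_machine = 913,
              wandering_jew = 2116, war_of_worlds = 1231)

# Per-book share of negative words, in percent
pct_negative <- negative / (negative + positive) * 100

# Unweighted mean: every book counts equally
unweighted <- mean(pct_negative)

# Pooled share: every sentiment word counts equally,
# so longer books carry more weight
pooled <- sum(negative) / sum(negative + positive) * 100

print(round(c(unweighted = unweighted, pooled = pooled), 2))
```

The two summaries differ by less than a tenth of a percentage point here, so the conclusion does not depend on which one is reported.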

4. Extension 2: An Additional Sentiment Lexicon

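For the second extension, I incorporate the NRC Emotion Lexicon, which categorizes words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust) and two sentiments (positive, negative). The syuzhet package, loaded below, bundles its own copy of this lexicon, but the analysis here reads it through tidytext’s get_sentiments("nrc") so that it fits the same tidy workflow as the bing analysis.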

# Install and load the syuzhet package, which bundles the NRC lexicon
#install.packages("syuzhet")
library(syuzhet)

# This report reads NRC through tidytext's get_sentiments(), which
# fetches the lexicon via the textdata package on first use
nrc_sentiments <- get_sentiments("nrc")
# Let's see the unique emotions/sentiments in this lexicon
unique(nrc_sentiments$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

4.1. Emotional Analysis of a Single Novel

Let’s analyze the prevalence of different emotions in H.G. Wells’s The War of the Worlds.

# Filter for "The War of the Worlds" and join with NRC lexicon
nrc_war_of_worlds <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(nrc_sentiments, relationship = "many-to-many") %>%
  # We don't want the general "positive" and "negative" from NRC for this part
  filter(!sentiment %in% c("positive", "negative"))
## Joining with `by = join_by(word)`
# Count the emotions
emotion_counts <- nrc_war_of_worlds %>%
  count(sentiment, sort = TRUE)

# Plot the emotional profile
ggplot(emotion_counts, aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_fill_brewer(palette = "Pastel2")+
  coord_flip() +
  labs(title = "Emotional Profile of 'The War of the Worlds' using NRC Lexicon",
       x = "Emotion",
       y = "Word Count") +
  theme_minimal()
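
One point to keep in mind when reading these counts: a lexicon entry can carry more than one emotion tag, so a single token may be counted under several emotions after the join. The sketch below illustrates this with base-R `merge()` on a made-up lexicon. The words and tags are hypothetical and are not taken from NRC:

```r
# Toy tokens and a made-up lexicon (NOT real NRC entries) in which
# one word carries two emotion tags
tokens  <- data.frame(word = c("storm", "storm", "calm"))
lexicon <- data.frame(word      = c("storm", "storm", "calm"),
                      sentiment = c("fear", "anger", "trust"))

# Base-R inner join: each "storm" token matches both of its tags
joined <- merge(tokens, lexicon, by = "word")

# Three tokens become five tagged rows
print(nrow(joined))
print(table(joined$sentiment))
```

Since the tagged rows outnumber the matched tokens, the bars in an emotion profile are best read as relative frequencies of tags rather than as counts of distinct words.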

4.2. Comparing Lexicons: bing vs. nrc (Sentiment)

I can also compare the overall positive/negative sentiment trajectory using the nrc lexicon against our initial bing-based analysis for The War of the Worlds.

# Get sentiment using the NRC lexicon (filtering for positive/negative)
sentiment_nrc <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("nrc"), relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  # Name the net score 'sentiment' so it lines up with the Bing data in bind_rows()
  mutate(sentiment = positive - negative, lexicon = "NRC") %>%
  select(index, sentiment, lexicon)
## Joining with `by = join_by(word)`
# Get sentiment using the Bing lexicon (from our earlier data)
sentiment_bing <- hgwells_sentiment %>%
  filter(title == "The War of the Worlds") %>%
  select(index, sentiment) %>%
  mutate(lexicon = "Bing")

# Combine the data for plotting
combined_sentiment <- bind_rows(sentiment_bing, sentiment_nrc)

# Plot the comparison
ggplot(combined_sentiment, aes(index, sentiment, color = lexicon)) +
  geom_line(linewidth = 1.2, alpha = 0.7) +
  facet_wrap(~lexicon, ncol = 1) +
  labs(title = "Sentiment in 'The War of the Worlds': Bing vs. NRC Lexicon",
       x = "Narrative Index (80-line segments)",
       y = "Net Sentiment") +
  theme_minimal()

Interpretation: The NRC and Bing lexicons produce different sentiment trajectories. While they often agree on the direction of sentiment shifts (e.g., a peak or a valley), the magnitude and sometimes even the sign can differ. This highlights how the choice of lexicon can influence the results of a sentiment analysis.
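
Agreement between two trajectories can be put into numbers. The sketch below shows one possible approach, assuming two net-sentiment vectors aligned by segment. The values are made up for illustration and are not results from this report:

```r
# Made-up net sentiment per segment for two lexicons (illustrative only)
bing <- c(-12, -5, 3, -20, -8, 1)
nrc  <- c( -4,  2, 6, -11, -1, 4)

# Share of segments where both lexicons give the same sign
sign_agreement <- mean(sign(bing) == sign(nrc))

# Pearson correlation captures whether the curves rise and fall together
trajectory_cor <- cor(bing, nrc)

print(sign_agreement)
print(round(trajectory_cor, 2))
```

Applied to the real data, the same two measures could be computed after joining sentiment_bing and sentiment_nrc on index. A high correlation alongside imperfect sign agreement would match the pattern described above.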

# Calculate percentages for both lexicons
bing_percentages <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(sentiment) %>%
  mutate(
    total_words = sum(n),
    percentage = round(n / total_words * 100, 1),
    lexicon = "Bing"
  )
## Joining with `by = join_by(word)`
nrc_percentages <- tidy_hgwells %>%
  filter(title == "The War of the Worlds") %>%
  inner_join(get_sentiments("nrc"), relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  mutate(
    total_words = sum(n),
    percentage = round(n / total_words * 100, 1),
    lexicon = "NRC"
  )
## Joining with `by = join_by(word)`
# Combine both results
lexicon_comparison <- bind_rows(bing_percentages, nrc_percentages) %>%
  select(lexicon, sentiment, n, percentage, total_words) %>%
  arrange(lexicon, sentiment)


# Create a formatted table for better presentation
library(knitr)
kable(lexicon_comparison, 
      caption = "Comparison of Positive/Negative Percentages: Bing vs NRC \n Lexicon for 'The War of the Worlds'",
      col.names = c("Lexicon", "Sentiment", "Count", "Percentage", "Total Words"))
Table: Comparison of Positive/Negative Percentages: Bing vs NRC Lexicon for ‘The War of the Worlds’

|Lexicon |Sentiment | Count| Percentage| Total Words|
|:-------|:---------|-----:|----------:|-----------:|
|Bing    |negative  |  2621|       68.0|        3852|
|Bing    |positive  |  1231|       32.0|        3852|
|NRC     |negative  |  2425|       53.7|        4513|
|NRC     |positive  |  2088|       46.3|        4513|

Summary of Lexicon Comparison

The comparison shows how strongly the result depends on the lexicon. The Bing lexicon classifies The War of the Worlds as substantially more negative (68.0% negative vs. 32.0% positive), portraying the novel as overwhelmingly pessimistic. In contrast, the NRC lexicon presents a more balanced profile (53.7% negative vs. 46.3% positive), suggesting a less extreme narrative tone. The two lexicons also match different numbers of words (3,852 for Bing and 4,513 for NRC), so the percentages are computed over different sets of tokens.
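
The gap between the lexicons can be restated as a negative-to-positive ratio. This base-R sketch recomputes the table’s percentages from its counts. The data frame and column names are chosen here for illustration:

```r
# Counts copied from the comparison table above
counts <- data.frame(
  lexicon  = c("Bing", "NRC"),
  negative = c(2621, 2425),
  positive = c(1231, 2088)
)

# Total matched sentiment words per lexicon
counts$total <- counts$negative + counts$positive

# Share of negative words and negative-to-positive ratio
counts$pct_negative  <- round(counts$negative / counts$total * 100, 1)
counts$neg_pos_ratio <- round(counts$negative / counts$positive, 2)

print(counts)
```

Under Bing there are roughly two negative words for every positive one, while under NRC the two classes are much closer to parity.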

5. Conclusion

This assignment successfully replicated the core sentiment analysis example from Text Mining with R and extended it in two meaningful ways:

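  1. New Corpus: Applying the same technique to a second corpus, which includes H.G. Wells’s The Time Machine and The War of the Worlds, revealed predominantly negative sentiment (between 54% and 68% negative words per book), in contrast to the mostly positive segments seen in the Austen output. This demonstrates how sentiment analysis can reflect genre and thematic content.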
  2. New Lexicon: Incorporating the NRC Emotion Lexicon allowed for a more nuanced analysis, breaking down the text into specific emotions and comparing its overall sentiment scoring mechanism with the bing lexicon. This comparison underscores the importance of understanding the tools and lexicons used in text analysis.

I learned that different sentiment lexicons can yield significantly different results, as shown by the Bing lexicon’s more pessimistic classification versus the NRC’s nuanced emotional profiling. The analysis highlighted how methodological choices in text mining—from corpus selection to lexicon application—fundamentally shape analytical outcomes and literary interpretations.