This project follows the sentiment analysis workflow introduced in Chapter 2 of Text Mining with R. The original example is reproduced in a Quarto (.qmd) file and run to confirm that it executes correctly, with citations for both the book and the source of the example. The analysis is then extended by applying the same method to a different text corpus and adding a second sentiment lexicon.
Approach
This project is completed in two stages. In the first stage, the base sentiment analysis example from Chapter 2 of Text Mining with R is reproduced: the original code is placed in a Quarto (.qmd) file, executed to confirm that it runs, and cited.
In the second stage, the same workflow is applied to a different corpus: the novel Frankenstein by Mary Shelley, obtained from Project Gutenberg via the gutenbergr package. Following tidy text principles, the text is tokenized into one word per row and joined with sentiment lexicons to calculate sentiment.
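The tokenize-then-join step can be sketched on a toy input. This is a minimal illustration in base R, not the tidytext pipeline used in this project: the two sentences and the four-word `toy_lexicon` are made up for the example, and `strsplit()` plus `merge()` stand in for `unnest_tokens()` and `inner_join()`.

```r
# Toy text standing in for a few lines of a novel (made up for illustration)
text_lines <- c("I was happy and full of hope.",
                "Then misery and fear returned.")

# Tokenize: lowercase, then split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(text_lines), "[^a-z']+"))
words <- words[words != ""]
tidy_words <- data.frame(word = words, stringsAsFactors = FALSE)

# Hypothetical mini-lexicon (NOT the real Bing entries)
toy_lexicon <- data.frame(
  word      = c("happy", "hope", "misery", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: keep only the words that appear in the lexicon
scored <- merge(tidy_words, toy_lexicon, by = "word")

# Count matched words per sentiment label
sentiment_counts <- table(scored$sentiment)
print(sentiment_counts)
```

Words without a lexicon entry (such as "returned") drop out at the join, so the counts describe only the matched vocabulary, not the whole text.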
The AFINN lexicon is added alongside the Bing lexicon used in the reproduced example. This allows a comparison of how different lexicons classify sentiment in the same text.
Finally, the results from the original example and the extended analysis are compared to evaluate differences in sentiment across datasets and lexicons.
Code Base
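# Load libraries
library(tidyverse)
library(tidytext)     # unnest_tokens(), get_sentiments(), stop_words
library(janeaustenr)  # austen_books()
library(gutenbergr)   # gutenberg_download()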
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load Jane Austen books (used in Chapter 2 example)
austen_raw <- austen_books()

# Convert text into tidy format (one word per row)
tidy_austen <- austen_raw %>%
  unnest_tokens(word, text)

# Load Bing sentiment lexicon (positive / negative)
bing_lexicon <- get_sentiments("bing")

# Join words with sentiment labels
austen_sentiment <- tidy_austen %>%
  inner_join(bing_lexicon, by = "word")

# Count positive and negative words per book
austen_sentiment %>%
  count(book, sentiment)
Warning in inner_join(., bing_lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 12 × 3
book sentiment n
<fct> <chr> <int>
1 Sense & Sensibility negative 3671
2 Sense & Sensibility positive 4933
3 Pride & Prejudice negative 3652
4 Pride & Prejudice positive 5052
5 Mansfield Park negative 4828
6 Mansfield Park positive 6749
7 Emma negative 4809
8 Emma positive 7157
9 Northanger Abbey negative 2518
10 Northanger Abbey positive 3244
11 Persuasion negative 2201
12 Persuasion positive 3473
PART 2: Extended Analysis
# Download Frankenstein by Mary Shelley (Gutenberg ID: 84)
frankenstein <- gutenberg_download(84, meta_fields = c("title", "author"))
Mirror list unavailable. Falling back to <https://aleph.pglaf.org>.
# Convert text into tidy format
tidy_frankenstein <- frankenstein %>%
  unnest_tokens(word, text)

# Remove common stop words (e.g., "the", "and")
tidy_frankenstein <- tidy_frankenstein %>%
  anti_join(stop_words, by = "word")

# Preview the raw download
head(frankenstein)
# A tibble: 6 × 4
gutenberg_id text title author
<int> <chr> <chr> <chr>
1 84 "Frankenstein;" Frankenstein; o… Shell…
2 84 "" Frankenstein; o… Shell…
3 84 "or, the Modern Prometheus" Frankenstein; o… Shell…
4 84 "" Frankenstein; o… Shell…
5 84 "by Mary Wollstonecraft (Godwin) Shelley" Frankenstein; o… Shell…
6 84 "" Frankenstein; o… Shell…
Sentiment using Bing lexicon
frank_bing <- tidy_frankenstein %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(sentiment)

# View sentiment counts
frank_bing
# A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 3736
2 positive 2597
Additional Lexicon: AFINN
# Load AFINN lexicon (numeric sentiment scores)
afinn_lexicon <- get_sentiments("afinn")

# Join with Frankenstein words
frank_afinn <- tidy_frankenstein %>%
  inner_join(afinn_lexicon, by = "word")

# Calculate average sentiment score
afinn_summary <- frank_afinn %>%
  summarize(avg_sentiment = mean(value, na.rm = TRUE))

afinn_summary
# A tibble: 1 × 1
avg_sentiment
<dbl>
1 -0.116
Sentiment over sections (better analysis)
# Create an index that groups the text into chunks of 80 words
# (the Chapter 2 example chunks by 80 lines; here each row is one word
# remaining after stop-word removal)
frank_sections <- tidy_frankenstein %>%
  mutate(section = row_number() %/% 80)

# Count sentiment per section and compute net sentiment
frank_trend <- frank_sections %>%
  inner_join(bing_lexicon, by = "word") %>%
  count(section, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

# Plot sentiment trend
ggplot(frank_trend, aes(x = section, y = net_sentiment)) +
  geom_line() +
  labs(
    title = "Sentiment Trend in Frankenstein",
    x = "Text Section",
    y = "Net Sentiment"
  )
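The sectioning arithmetic is easier to follow on a small input. The sketch below uses base R only and a made-up vector of ten labels (`NA` marks a word with no lexicon match); the chunk size of 4 is an assumption chosen for readability, whereas the analysis above uses 80.

```r
# Hypothetical stream of sentiment labels, one per word in reading order;
# NA marks a word that has no match in the lexicon
labels <- c("negative", NA, "negative", "positive", "positive",
            NA, "positive", "negative", "negative", "positive")

chunk_size <- 4  # illustrative; the Frankenstein analysis uses 80

# Same idea as row_number() %/% 80: integer division of the 1-based row index.
# Because indexing starts at 1, section 0 ends up one row shorter than the rest.
section <- seq_along(labels) %/% chunk_size

# Count positive and negative words per section, then take the difference
positive <- tapply(labels == "positive", section, sum, na.rm = TRUE)
negative <- tapply(labels == "negative", section, sum, na.rm = TRUE)
net_sentiment <- positive - negative
print(net_sentiment)
```

With short chunks, a handful of words decides the sign of each section, so smaller chunk sizes tend to produce a more jagged trend line than larger ones.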
Simple comparison plot
# Plot Bing sentiment counts for Frankenstein
ggplot(frank_bing, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  labs(title = "Sentiment Counts in Frankenstein (Bing Lexicon)")
Conclusion
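The original example from Text Mining with R demonstrates sentiment analysis on Jane Austen’s novels. In the Bing counts reproduced above, positive words outnumber negative words in all six novels (for example, 7157 positive versus 4809 negative in Emma), and the sentiment trend in the original example shows relatively balanced fluctuations between positive and negative values, with smoother transitions across sections.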
In contrast, the analysis of Frankenstein by Mary Shelley reveals a noticeably different pattern. The Bing lexicon matches 3736 negative words against 2597 positive ones. The sentiment trend is more volatile, with sharper fluctuations, and the net sentiment line often falls below zero, indicating that negative words outnumber positive ones throughout much of the text. Part of the extra volatility may come from chunk size, since the Frankenstein sections hold 80 words each rather than the 80 lines used in Chapter 2. The overall negativity aligns with the novel’s darker themes, including isolation, fear, and tragedy.
The AFINN lexicon supports this observation. While the Bing lexicon classifies words as positive or negative, AFINN assigns each word a numeric score from -5 to 5. The average score for Frankenstein was approximately -0.12, indicating a slightly negative overall tone. This adds a measure of sentiment intensity and is consistent with the negativity observed in the Bing-based analysis.
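The difference between a categorical and a numeric lexicon can be illustrated with a toy table. Everything in this sketch is hypothetical: the five words, their labels, and their scores are made up for the example and are not taken from the real Bing or AFINN lexicons.

```r
# Hypothetical matched words with made-up labels and scores
# (illustrative values only, NOT real Bing or AFINN entries)
toy <- data.frame(
  word  = c("joy", "love", "death", "misery", "fear"),
  label = c("positive", "positive", "negative", "negative", "negative"),
  score = c(3, 3, -2, -2, -2),
  stringsAsFactors = FALSE
)

# Categorical view (Bing-style): share of matched words labelled negative
negative_share <- mean(toy$label == "negative")

# Numeric view (AFINN-style): mean score across matched words
avg_score <- mean(toy$score)

print(c(negative_share = negative_share, avg_score = avg_score))
```

In this toy case most words are labelled negative, yet the mean score is neutral because the two positive words carry larger magnitudes. A count-based and a score-based summary can therefore disagree on the same text.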
Overall, the comparison shows that sentiment results vary depending on both the text corpus and the lexicon used. While the original example presents a more balanced emotional structure, Frankenstein exhibits a more negative and unstable sentiment pattern. Additionally, the difference between lexicons highlights how methodological choices can influence the interpretation of sentiment in text analysis.