In this assignment, I will reproduce and extend the sentiment analysis example from Chapter 2 of *Text Mining with R* by Julia Silge and David Robinson. First, I will recreate the original example in a Quarto document using the same general workflow shown in the chapter. This includes tokenizing the text, joining it with sentiment lexicons, and analyzing the emotional tone of the text.
To reproduce the base example, I will use the `janeaustenr` package together with the `tidytext` and `textdata` packages. I will download the required sentiment lexicons such as AFINN, Bing, and NRC, and then use them to analyze the Jane Austen texts.
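The tokenize-and-join workflow described above can be sketched on a toy input. This is a minimal illustration, not the chapter's code: the `toy_text` tibble and its two lines are invented, while `unnest_tokens()` and `get_sentiments("bing")` come from `tidytext`.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input standing in for a novel's text column
toy_text <- tibble(
  linenumber = 1:2,
  text = c("What a happy and wonderful day", "The weather turned miserable")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_tokens <- toy_text |>
  unnest_tokens(word, text)

# Keep only the words found in the Bing lexicon, with their sentiment label
toy_scored <- toy_tokens |>
  inner_join(get_sentiments("bing"), by = "word")

toy_scored
```

Words that are absent from the lexicon (such as "the" or "weather") drop out of the inner join, so only lexicon words contribute to any later counts.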
To extend the analysis, I will use a different text corpus of my choice and apply at least one additional sentiment lexicon. I will then compare the results from the new corpus with the original Jane Austen example and explain how the sentiment patterns differ across lexicons and texts.
For the final report, I will include:
1. the reproduced Chapter 2 example,
2. a citation to the book and original example,
3. an extended analysis using a new corpus,
4. at least one additional sentiment lexicon,
5. and an explanation of how the new results differ from the original results.
Warning in inner_join(tidy_books, get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
library(ggplot2)
# A tibble: 122,204 × 4
   book              linenumber chapter word
   <fct>                  <int>   <int> <chr>
 1 Pride & Prejudice          1       0 pride
 2 Pride & Prejudice          1       0 and
 3 Pride & Prejudice          1       0 prejudice
 4 Pride & Prejudice          3       0 by
 5 Pride & Prejudice          3       0 jane
 6 Pride & Prejudice          3       0 austen
 7 Pride & Prejudice          7       1 chapter
 8 Pride & Prejudice          7       1 1
 9 Pride & Prejudice         10       1 it
10 Pride & Prejudice         10       1 is
# ℹ 122,194 more rows
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
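The many-to-many warnings in this output are raised when a word occurs in several rows of the text and is also listed in several rows of the lexicon. A minimal sketch of how that situation arises and how the expected relationship can be declared is shown here. Both tibbles are invented for illustration, and the `relationship` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical tokens: "envious" occurs twice in the text
toy_words <- tibble(word = c("envious", "calm", "envious"))

# Hypothetical lexicon: "envious" is listed under both sentiments
toy_lexicon <- tibble(
  word = c("envious", "envious", "calm"),
  sentiment = c("positive", "negative", "positive")
)

# Declaring the relationship tells dplyr that the duplicate matches are expected
joined <- inner_join(
  toy_words, toy_lexicon,
  by = "word",
  relationship = "many-to-many"
)

joined
```

Each occurrence of an ambiguous word is matched once per lexicon entry, so it contributes to both the positive and the negative counts. Silencing the warning does not change that behavior.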
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
The results show that sentiment changes over the course of each Jane Austen novel. The Bing lexicon provides a simple positive-versus-negative view, while AFINN gives weighted sentiment values and NRC offers a broader emotional classification. Comparing the three lexicons shows that they capture similar overall patterns, but differ in scale and interpretation.
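The difference in scale between a label-based lexicon and a weighted one can be sketched with a toy example. The words, labels, and values below are invented for illustration and are not taken from the real Bing or AFINN lexicons.

```r
library(dplyr)

# Hypothetical scored words: a Bing-style label and an AFINN-style value per word
toy_scores <- tibble(
  index = c(0, 0, 0, 1, 1),
  word  = c("happy", "awful", "good", "terrible", "fine"),
  label = c("positive", "negative", "positive", "negative", "positive"),
  value = c(3, -3, 3, -3, 2)
)

toy_summary <- toy_scores |>
  group_by(index) |>
  summarise(
    # Label-based score: positive word count minus negative word count
    net_label = sum(label == "positive") - sum(label == "negative"),
    # Weighted score: sum of the numeric values
    net_value = sum(value)
  )

toy_summary
```

In this toy data, block 1 is neutral when words are only counted (one positive, one negative) but slightly negative when the weights are summed, which is the kind of difference in scale and interpretation described above.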
Source
This analysis reproduces and extends examples from Chapter 2 of Text Mining with R by Julia Silge and David Robinson.
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.
Step 2: Extend the Analysis
Using a Different Text Corpus
For the extension, I will analyze a different text corpus and apply at least one additional sentiment lexicon. I will then compare the findings with the Jane Austen example from Chapter 2.
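library(gutenbergr)
library(tidytext)
library(dplyr)
library(stringr)  # provides str_detect()

# Search the Project Gutenberg metadata for matching titles
gutenberg_metadata |>
  filter(str_detect(title, "Alice in Wonderland"))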
# A tibble: 7 × 8
  gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
         <int> <chr>     <chr>                <int> <fct>    <chr>
1          114 "The Ten… Tenni…                  74 en       ""
2        19551 "Alice i… Carro…                   7 en       "Category: Childre…
3        19551 "Alice i… Gorha…                8742 en       "Category: Childre…
4        35688 "Alice i… Carro…                   7 en       "Category: Childre…
5        35688 "Alice i… Gerst…               37837 en       "Category: Childre…
6        35990 "The Sto… Bowma…               38017 en       "Category: Biograp…
7        36308 "Songs F… Carro…                   7 en       "Category: Childre…
# ℹ 2 more variables: rights <fct>, has_text <lgl>
# Project Gutenberg ID 11 is Alice's Adventures in Wonderland
alice <- gutenberg_download(11)
Mirror list unavailable. Falling back to <https://aleph.pglaf.org>.
Analysis
# Number the lines, then split the text into one word per row
tidy_alice <- alice |>
  mutate(linenumber = row_number()) |>
  unnest_tokens(word, text)

# Count positive and negative words using the Bing lexicon
alice_sentiment <- tidy_alice |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment)

library(ggplot2)

ggplot(alice_sentiment, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  labs(title = "Sentiment in Alice in Wonderland")

library(tidyr)  # provides pivot_wider()

# Net sentiment (positive minus negative) per 80-line block
alice_sentiment_time <- tidy_alice |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

ggplot(alice_sentiment_time, aes(x = index, y = sentiment)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Sentiment Trajectory in Alice in Wonderland",
    x = "Text Index",
    y = "Net Sentiment"
  )
For the extended analysis, I used Alice’s Adventures in Wonderland as a different text corpus. Using the Bing lexicon, I found that the text contains both positive and negative language, but the sentiment pattern differs from the Jane Austen example. This suggests that the tone and vocabulary of the two corpora are different.
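The trajectory plot groups lines into blocks with `linenumber %/% 80`. A short sketch of how integer division assigns block indices, using a handful of made-up line numbers:

```r
# Integer division assigns each line number to an 80-line block
linenumber <- c(1, 79, 80, 159, 160, 400)
index <- linenumber %/% 80
index
```

Because `row_number()` starts at 1, lines 1 to 79 fall in block 0 and lines 80 to 159 fall in block 1, so the first block is one line shorter than the others.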
Using One Additional Lexicon
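# install.packages("syuzhet")
library(syuzhet)

# Collapse the hard-wrapped lines into one string so that sentences spanning
# several lines are split as whole sentences
alice_text <- paste(alice$text, collapse = " ")
sentences <- get_sentences(alice_text)

# Score each sentence with the default "syuzhet" lexicon
sentiment_values <- get_sentiment(sentences)

plot(sentiment_values, type = "l",
     main = "Sentiment using Syuzhet",
     ylab = "Sentiment",
     xlab = "Text Progress")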
Using the syuzhet package, I scored sentiment at the sentence level, which produces a sentiment trajectory across the text. The package's default lexicon still assigns scores to individual words, but `get_sentiment()` sums them within each sentence, so every sentence receives a numeric score instead of a positive or negative label. The results show fluctuations in sentiment that reflect changes in tone and events within the story.
The Bing lexicon classifies words strictly as positive or negative, so its results are counts of words in each class. In contrast, the Syuzhet scores are numeric and computed per sentence. This gives a finer-grained view of how sentiment changes throughout the narrative than counting positive and negative words in fixed 80-line blocks.
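Sentence-level scoring can be sketched on a short passage. The passage below is invented for illustration; `get_sentences()` and `get_sentiment()` are the same `syuzhet` functions used above, with the method named explicitly.

```r
library(syuzhet)

# Hypothetical three-sentence passage
toy_passage <- paste(
  "Alice was delighted with the garden.",
  "Then the Queen shouted angrily.",
  "Everyone fell silent."
)

# Split into sentences, then score each sentence with the "syuzhet" lexicon
toy_sentences <- get_sentences(toy_passage)
toy_values <- get_sentiment(toy_sentences, method = "syuzhet")

data.frame(sentence = toy_sentences, value = toy_values)
```

Each sentence gets one numeric score, so plotting `toy_values` in order gives the same kind of trajectory as the full-text plot, only much shorter.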
Conclusion
This assignment was both challenging and informative. By reproducing and extending the sentiment analysis example from Chapter 2, I learned how to work with text data, apply different sentiment lexicons, and visualize patterns in language.
Using Alice’s Adventures in Wonderland as an alternative corpus, I observed that sentiment patterns vary noticeably depending on the structure and style of the text. The Bing lexicon provided a simple classification of positive and negative words, while the Syuzhet method offered a more detailed, sentence-by-sentence view of how sentiment evolves throughout the narrative.
Overall, this project improved my understanding of text mining, sentiment analysis, and data visualization in R, and demonstrated how different analytical approaches can lead to different insights.