For this project, I am reproducing the sentiment analysis workflow from Chapter 2 of Text Mining with R to learn how to map emotional trajectories in text. I will start by replicating the textbook’s analysis of Jane Austen. For my extension, I have chosen to analyze W. E. B. Du Bois’s The Souls of Black Folk (Project Gutenberg ID: 408), downloaded with the gutenbergr package. I want to compare the emotional vocabulary of this early 20th-century sociological work with the 19th-century fiction used in the base example.
Approach
I am going to do this using the tidy text workflow we have been discussing. My process will follow these steps:
I’ll use unnest_tokens() to break the text into individual words, following the “one-token-per-row” rule. This is called tokenization.
I will use inner_join() to connect these words to the Bing lexicon (for positive/negative counts) and the NRC lexicon. This is sentiment joining.
Then I will add the Loughran lexicon. This dictionary is often used for technical or financial text, and I want to see if it picks up on the specific language Du Bois uses regarding law and social structures that a standard “romance” lexicon might miss.
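The tokenization and joining steps above can be sketched on a toy example. This is a minimal sketch, not the project's actual pipeline: the two sentences in toy_text are hypothetical stand-ins for the book, and only the Bing lexicon is used because it ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line stand-in for a book; not taken from Du Bois.
toy_text <- tibble(
  line = 1:2,
  text = c("The free man found love", "A dark and hard road")
)

# Tokenization: one lowercase word per row, keeping the line number.
toy_tokens <- toy_text %>%
  unnest_tokens(word, text)

# Sentiment joining: keep only words that appear in the Bing lexicon.
toy_sentiment <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiment
```

Swapping in get_sentiments("nrc") or get_sentiments("loughran") follows the same pattern, though those lexicons are fetched through the textdata package the first time they are requested.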
Possible Challenges
-Sentiment lexicons score each word in isolation, without its context. I anticipate that Du Bois’s complex descriptions of the Black experience might be misread by a simple binary lexicon.
-I’ll need to filter out the Project Gutenberg header and footer so that the legal “boilerplate” text does not interfere with the actual sentiment of the book.
-Because this book was written in 1903, some of its vocabulary might be missing from modern sentiment dictionaries. Any word without a match is dropped by the inner join, so part of the text will not be scored.
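One way to gauge the data loss mentioned in the last point is to measure how many tokens survive the join. The sketch below is an assumption-laden illustration: toy_tokens is a hypothetical stand-in for the tokenized book, and "thou" and "veil" are included only as examples of words a lexicon may not contain.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens standing in for the tokenized book.
toy_tokens <- tibble(word = c("free", "dark", "thou", "veil", "love"))

bing <- get_sentiments("bing")

# Tokens that have at least one match in the lexicon.
matched <- toy_tokens %>%
  semi_join(bing, by = "word")

# Tokens that an inner join would silently drop.
dropped <- toy_tokens %>%
  anti_join(bing, by = "word")

# Share of tokens that receive a sentiment label.
coverage <- nrow(matched) / nrow(toy_tokens)
coverage
```

On the header and footer point, gutenberg_download() has a strip argument that defaults to TRUE, so much of the Project Gutenberg boilerplate should already be removed on download; it is still worth checking the first and last rows by eye.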
Code
This code loads the packages and downloads the text from Project Gutenberg.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(gutenbergr)

# Download 'The Souls of Black Folk' (ID: 408)
dubois_raw <- gutenberg_download(408)
Mirror list unavailable. Falling back to <https://aleph.pglaf.org>.
# Looking at the raw data structure
head(dubois_raw)
# A tibble: 6 × 2
  gutenberg_id text
         <int> <chr>
1          408 "The Souls of Black Folk"
2          408 ""
3          408 "by W. E. B. Du Bois"
4          408 ""
5          408 "Herein is Written"
6          408 ""
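The rendered output does not show the step that creates dubois_tidy, which the next chunk relies on. Below is a minimal sketch of what that step likely looks like, assuming the standard unnest_tokens() call followed by stop-word removal as described in the approach. The small dubois_raw_demo tibble is a hypothetical stand-in so the example runs without a download.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in with the same columns as dubois_raw.
dubois_raw_demo <- tibble(
  gutenberg_id = 408L,
  text = c("The Souls of Black Folk", "by W. E. B. Du Bois", "Herein is Written")
)

dubois_tidy_demo <- dubois_raw_demo %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop common function words

dubois_tidy_demo
```

Applied to the real data, the same two verbs would be piped from dubois_raw and the result assigned to dubois_tidy.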
# Let's see the most frequent words Du Bois used
dubois_tidy %>%
  count(word, sort = TRUE) %>%
  head(10)
# A tibble: 10 × 2
   word        n
   <chr>   <int>
 1 black     281
 2 negro     274
 3 life      173
 4 world     173
 5 south     163
 6 white     157
 7 day       155
 8 land      142
 9 negroes   109
10 half       93
This uses the Bing lexicon to label positive and negative words.
# Join the book words with the sentiment dictionary
dubois_sentiment <- dubois_tidy %>%
  inner_join(get_sentiments("bing"))
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 4914 of `x` matches multiple rows in `y`.
ℹ Row 2243 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# See the top positive and negative words
dubois_sentiment %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5)
# A tibble: 10 × 3
# Groups:   sentiment [2]
   word     sentiment     n
   <chr>    <chr>     <int>
 1 dark     negative     64
 2 slave    negative     54
 3 hard     negative     51
 4 death    negative     42
 5 slaves   negative     37
 6 free     positive     51
 7 freedom  positive     47
 8 striving positive     27
 9 rich     positive     26
10 love     positive     25
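The many-to-many warning printed by the Bing join means that at least one word is listed more than once in the lexicon and also occurs more than once in the text, so each occurrence is counted once per label. The sketch below reproduces the situation with a hypothetical toy lexicon. The word "envious" is only an illustrative stand-in; running the same count() on get_sentiments("bing") would show which words are actually duplicated.

```r
library(dplyr)

# Hypothetical lexicon in which one word carries two labels.
toy_lexicon <- tibble(
  word = c("free", "envious", "envious"),
  sentiment = c("positive", "positive", "negative")
)

# Hypothetical tokens in which that word appears twice.
toy_tokens <- tibble(word = c("free", "envious", "dark", "envious"))

# Words listed more than once in the lexicon.
dup_words <- toy_lexicon %>%
  count(word) %>%
  filter(n > 1)

# Declaring the relationship returns the same rows without the warning.
joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

joined
```

Silencing the warning does not change the counts, so whether the duplicated words should be kept, dropped, or given a single label remains a separate decision.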
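Explanation of calculations
Each run of 80 sentiment-matched words is grouped into one chunk (index), and the net score for a chunk is its count of positive words minus its count of negative words.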
dubois_arc <- dubois_sentiment %>%
  mutate(index = row_number() %/% 80) %>%
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

# Plot the Emotional Arc
ggplot(dubois_arc, aes(x = index, y = sentiment_score)) +
  geom_col(aes(fill = sentiment_score > 0)) +
  scale_fill_manual(values = c("firebrick", "darkgreen"), guide = "none") +
  theme_minimal() +
  labs(
    title = "Sentiment Arc of 'The Souls of Black Folk'",
    subtitle = "Positive vs. Negative sections through the book",
    x = "Progress Through Book",
    y = "Net Sentiment Score"
  )
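To make the chunking and the zero-fill visible, here is a small worked sketch of the same calculation. The six tagged words are hypothetical, and the chunk size is 3 instead of 80 so the result can be checked by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical run of six sentiment-tagged words, in reading order.
toy_sentiment <- tibble(
  word = c("free", "love", "dark", "hard", "rich", "death"),
  sentiment = c("positive", "positive", "negative",
                "negative", "positive", "negative")
)

toy_arc <- toy_sentiment %>%
  mutate(index = row_number() %/% 3) %>%   # rows 1-2 -> 0, rows 3-5 -> 1, row 6 -> 2
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

toy_arc
```

Chunk 0 contains no negative words and chunk 2 contains no positive words; values_fill = 0 turns those missing counts into zeros, so the subtraction gives 2, -1, and -1 rather than NA. Because row_number() starts at 1, the first chunk holds one fewer row than the others (79 rather than 80 in the code above); subtracting 1 before dividing would even this out.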
Conclusion
This workflow let me quantify the emotional weight of Du Bois’s prose.
The Bing lexicon gives a clear binary view of the text, though academic and historical language often carries nuances that a simple “positive/negative” dictionary might miss.
The visualization reveals a jagged emotional trajectory, which reflects the book’s structure as it moves between sociological analysis, personal sorrow, and hopeful calls for justice.
This project shows that unnesting isn’t just for JSON.
AI Usage Transcript: Sentiment Analysis of W.E.B. Du Bois
Task: Transform raw text into a sentiment-scored visualization using the tidytext framework.
Tokenization Strategy
The Collaboration: I worked with the AI to understand the transition from “sentences” to “tokens.” We discussed how unnest_tokens() acts as the text version of the unnest() function I used for JSON.
The Logic: I prompted the AI to help me filter out “stop words.” We realized that without removing common words like “the” or “and,” the sentiment analysis would be skewed and the word counts would be meaningless.
Relational Joins (The Lexicon)
The Collaboration: I asked the AI how to “attach” emotions to words. The AI introduced the concept of an inner_join using the Bing Lexicon.
The Logic: We walked through the logic of how R looks at my list of words and compares them to a dictionary, keeping only the matches. This allowed me to turn qualitative writing into quantitative data.
Debugging & Visualization
The Collaboration: When I wanted to show a “trajectory” or an “arc,” the AI suggested using the integer division operator (%/%) to create “chunks” or “index” points for the book.
The Logic: I worked with the AI to refine the ggplot2 code, specifically using a logical statement (fill = sentiment_score > 0) to color-code the “Positive” sections green and “Negative” sections red. This made the emotional “rollercoaster” of the book immediately visible to the reader.
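One caveat with the color logic: an unnamed vector of colors is matched by position to whichever logical values are present in the data. The sketch below shows a possible variant, assuming the same column names as the plot above, that names each color so the mapping holds even when every chunk has the same sign. The toy_arc data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical arc in which every chunk is net positive.
toy_arc <- data.frame(index = 0:2, sentiment_score = c(3, 1, 2))

# Naming the colors ties each one to a logical value rather than to position.
fill_colors <- c("TRUE" = "darkgreen", "FALSE" = "firebrick")

p <- ggplot(toy_arc, aes(x = index, y = sentiment_score)) +
  geom_col(aes(fill = sentiment_score > 0)) +
  scale_fill_manual(values = fill_colors, guide = "none") +
  theme_minimal()

p
```

With the unnamed version, a book whose chunks were all positive would have its bars drawn in the first color listed, which is the one intended for negative sections.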
Reflection on AI Assistance
The AI served as a technical consultant for the tidytext syntax. While I provided the direction (analyzing Du Bois’s narrative arc), the AI helped me write more resilient code for the pivot_wider() step: setting values_fill = 0 ensures that a section containing only positive or only negative words gets a count of 0 rather than NA, so the final subtraction still works.