10A Sentiment Analysis Approach

Author

Ciara Bonnett-Jones

Introduction

For this project, I am reproducing the sentiment analysis workflow from Chapter 2 of Text Mining with R to learn how to map emotional trajectories in text. I will start by replicating the textbook’s analysis of Jane Austen. For my extension, I have chosen to analyze W. E. B. Du Bois’s The Souls of Black Folk (Project Gutenberg ID 408), downloaded with the gutenbergr package. I want to compare the emotional vocabulary of this 20th-century sociological work with the 19th-century fiction used in the base example.

Approach

I will use the tidy text workflow we have been discussing. My process follows these steps:

I’ll use unnest_tokens() to break the text into individual words, following the “one-token-per-row” rule. This step is called tokenization.

I will use inner_join() to connect these words to the Bing lexicon (for positive/negative counts) and the NRC lexicon. This is sentiment joining.

Then I will add the Loughran lexicon. This dictionary was built for financial documents and includes categories such as “litigious” and “constraining,” so I want to see if it picks up on the specific language Du Bois uses regarding law and social structures that a general-purpose lexicon might miss.
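
As a minimal sketch of the tokenization and joining steps, the snippet below runs them on a made-up two-line text with a tiny hand-written lexicon. Both `toy_text` and `toy_lexicon` are illustrative assumptions, not data from the book or from any published lexicon; in the real analysis the lexicon comes from get_sentiments(). The Bing lexicon ships with tidytext, while get_sentiments("nrc") and get_sentiments("loughran") fetch their data through the textdata package the first time they are called.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line text standing in for the downloaded book
toy_text <- tibble(
  line = 1:2,
  text = c("The dark night gave way to freedom.",
           "Hard labor and love of learning.")
)

# Hypothetical mini-lexicon; a real run would use get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("dark", "hard", "freedom", "love"),
  sentiment = c("negative", "negative", "positive", "positive")
)

# Step 1: one token per row (unnest_tokens() lowercases and strips punctuation)
toy_tokens <- toy_text %>%
  unnest_tokens(word, text)

# Step 2: keep only the tokens that appear in the lexicon
toy_scored <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word")

toy_scored
```

Passing `by = "word"` explicitly avoids the “Joining with …” message and makes the join key obvious to a reader.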

Possible Challenges

- Sentiment lexicons often view words in isolation. I anticipate that Du Bois’s complex descriptions of the Black experience might be “mis-read” by a simple binary lexicon.

- I’ll need to filter out the Project Gutenberg header and footer so that the legal “boilerplate” text does not interfere with the actual sentiment of the book.

- Because this book was written in 1903, some vocabulary might be missing from modern sentiment dictionaries, which could lead to some data loss during the join.
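
One way to gauge the data loss mentioned in the last point is to list the tokens that find no match in the lexicon. The sketch below assumes a hypothetical token table and a hypothetical mini-lexicon (the words “aught” and “wist” merely stand in for archaic vocabulary); with the real data, `toy_lexicon` would be replaced by get_sentiments("bing").

```r
library(dplyr)

# Hypothetical tokens; "aught" and "wist" stand in for archaic vocabulary
toy_tokens <- tibble(
  word = c("sorrow", "aught", "freedom", "wist", "hope", "veil")
)

# Hypothetical mini-lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("sorrow", "freedom", "hope"),
  sentiment = c("negative", "positive", "positive")
)

# Tokens that an inner_join() against the lexicon would silently drop
unmatched <- toy_tokens %>%
  anti_join(toy_lexicon, by = "word")

# Share of tokens that receive a sentiment label
coverage <- 1 - nrow(unmatched) / nrow(toy_tokens)

unmatched
coverage
```

Most unmatched words in a real text are simply neutral rather than missing, so the unmatched list is best read by eye for words that clearly carry emotion.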

Code

This code downloads the text from Project Gutenberg and previews its structure.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(gutenbergr)

# Download 'The Souls of Black Folk' (ID: 408)
dubois_raw <- gutenberg_download(408)
Mirror list unavailable. Falling back to <https://aleph.pglaf.org>.
# looking at the raw data structure
head(dubois_raw)
# A tibble: 6 × 2
  gutenberg_id text                     
         <int> <chr>                    
1          408 "The Souls of Black Folk"
2          408 ""                       
3          408 "by W. E. B. Du Bois"    
4          408 ""                       
5          408 "Herein is Written"      
6          408 ""                       

This code tokenizes the text into words and removes stop words.

dubois_tidy <- dubois_raw %>%
  # Break sentences into individual words
  unnest_tokens(word, text) %>%
  # Remove 'stop words' 
  anti_join(stop_words)
Joining with `by = join_by(word)`
# Let's see the most frequent words Du Bois used
dubois_tidy %>%
  count(word, sort = TRUE) %>% 
  head(10)
# A tibble: 10 × 2
   word        n
   <chr>   <int>
 1 black     281
 2 negro     274
 3 life      173
 4 world     173
 5 south     163
 6 white     157
 7 day       155
 8 land      142
 9 negroes   109
10 half       93

This uses the Bing lexicon to label positive and negative words.

# Join the book words with the sentiment dictionary
dubois_sentiment <- dubois_tidy %>%
  inner_join(get_sentiments("bing"))
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 4914 of `x` matches multiple rows in `y`.
ℹ Row 2243 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# See the top positive and negative words 
dubois_sentiment %>% 
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5)
# A tibble: 10 × 3
# Groups:   sentiment [2]
   word     sentiment     n
   <chr>    <chr>     <int>
 1 dark     negative     64
 2 slave    negative     54
 3 hard     negative     51
 4 death    negative     42
 5 slaves   negative     37
 6 free     positive     51
 7 freedom  positive     47
 8 striving positive     27
 9 rich     positive     26
10 love     positive     25
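
The many-to-many warning above suggests that at least one word in the text matches more than one lexicon row. The sketch below reproduces that situation with hypothetical data, in which the made-up lexicon lists “envious” under both labels, and shows how to find such entries. The same count() and filter() steps can be run on get_sentiments("bing") to see which real entries are responsible.

```r
library(dplyr)

# Hypothetical tokens from a text; "envious" occurs twice
toy_tokens <- tibble(word = c("envious", "free", "envious"))

# Hypothetical lexicon in which one word carries two labels
toy_lexicon <- tibble(
  word      = c("envious", "envious", "free"),
  sentiment = c("positive", "negative", "positive")
)

# Words listed more than once in the lexicon
dup_words <- toy_lexicon %>%
  count(word) %>%
  filter(n > 1)

# Each occurrence of a duplicated word is matched twice, so 3 tokens become 5 rows.
# relationship = "many-to-many" only declares that this is expected;
# it does not change the result of the join.
joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

dup_words
joined
```

Because a duplicated word adds one positive and one negative row per occurrence, it cancels out in a net score but still inflates the raw counts.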

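Explanation of calculations: this code groups the matched sentiment words into consecutive chunks of 80, computes a net score (positive minus negative) for each chunk, and plots the result.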

dubois_arc <- dubois_sentiment %>%
  mutate(index = row_number() %/% 80) %>%
  count(index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

# Plot the Emotional Arc
ggplot(dubois_arc, aes(x = index, y = sentiment_score)) +
  geom_col(aes(fill = sentiment_score > 0)) +
  scale_fill_manual(values = c("FALSE" = "firebrick", "TRUE" = "darkgreen"), guide = "none") +
  theme_minimal() +
  labs(title = "Sentiment Arc of 'The Souls of Black Folk'",
       subtitle = "Positive vs. Negative sections through the book",
       x = "Progress Through Book",
       y = "Net Sentiment Score")
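
To make the chunking step concrete, here is a sketch on a hypothetical seven-word sequence with a chunk size of 3 instead of 80. The data and the column names `index_as_written` and `index_even` are assumptions for illustration only. Note that the index counts rows of the joined table, that is, sentiment-bearing words, rather than lines of the book.

```r
library(dplyr)

# Hypothetical sequence of 7 sentiment-bearing words, in book order
toy_sentiment <- tibble(
  word      = c("dark", "free", "hard", "love", "death", "rich", "slave"),
  sentiment = c("negative", "positive", "negative", "positive",
                "negative", "positive", "negative")
)

chunk_size <- 3  # the analysis above uses 80

toy_chunks <- toy_sentiment %>%
  mutate(
    # row_number() starts at 1, so the first chunk holds chunk_size - 1 rows
    index_as_written = row_number() %/% chunk_size,
    # subtracting 1 first gives chunks of exactly chunk_size rows
    index_even = (row_number() - 1) %/% chunk_size
  )

toy_chunks
```

With a chunk size of 80 the one-row difference in the first chunk is negligible, so either form works for the plot.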

Conclusion

With this workflow, I was able to quantify the emotional weight of Du Bois’s prose as a net sentiment score (positive minus negative word counts) for each section of the text.

Using the Bing lexicon allowed for a clear binary look at the text, though it is important to note that academic and historical language sometimes carries nuances that a simple “positive/negative” dictionary might miss.

The visualization reveals a jagged emotional trajectory. This reflects the book’s structure, which moves between sociological analysis, personal sorrow, and hopeful calls for justice.

This project shows that unnesting isn’t just for JSON.

AI Usage Transcript: Sentiment Analysis of W.E.B. Du Bois

Task: Transform raw text into a sentiment-scored visualization using the tidytext framework.

  1. Tokenization Strategy

The Collaboration: I worked with the AI to understand the transition from “sentences” to “tokens.” We discussed how unnest_tokens() acts as the text version of the unnest() function I used for JSON.

The Logic: I prompted the AI to help me filter out “stop words.” We realized that without removing common words like “the” or “and,” the sentiment analysis would be skewed and the word counts would be meaningless.

  2. Relational Joins (The Lexicon)

The Collaboration: I asked the AI how to “attach” emotions to words. The AI introduced the concept of an inner_join using the Bing Lexicon.

The Logic: We walked through the logic of how R looks at my list of words and compares them to a dictionary, keeping only the matches. This allowed me to turn qualitative writing into quantitative data.

  3. Debugging & Visualization

The Collaboration: When I wanted to show a “trajectory” or an “arc,” the AI suggested using the integer division operator (%/%) to create “chunks” or “index” points for the book.

The Logic: I worked with the AI to refine the ggplot2 code, specifically using a logical statement (fill = sentiment_score > 0) to color-code the “Positive” sections green and “Negative” sections red. This made the emotional “rollercoaster” of the book immediately visible to the reader.

  4. Reflection on AI Assistance

The AI served as a technical consultant for the tidytext syntax. While I provided the direction (analyzing Du Bois’s narrative arc), the AI helped me write more resilient code for the pivot_wider() step: setting values_fill = 0 ensures that a section with no positive (or no negative) words gets a count of 0 instead of NA, so the final subtraction still works.
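
A small sketch of why the pivot_wider() step needs a fill value, using hypothetical per-chunk counts in which the second chunk has no positive words at all. The `toy_counts` table is an assumption for illustration and is not taken from the book.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-chunk counts; chunk 1 has no positive words
toy_counts <- tibble(
  index     = c(0, 0, 1),
  sentiment = c("negative", "positive", "negative"),
  n         = c(4L, 2L, 5L)
)

# Without values_fill, the missing positive count becomes NA, and so does the score
with_na <- toy_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sentiment_score = positive - negative)

# With values_fill = 0, the missing count is 0 and the score is defined
with_zero <- toy_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)

with_na
with_zero
```

In a plot, an NA score would show up as a missing bar, which is easy to mistake for a neutral section.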