Assignment 10A Approach

Author

Theresa Benny

Approach Deliverable

For Assignment 10A, I will reproduce and extend the sentiment analysis example from Chapter 2 of Text Mining with R. The original example begins by converting Jane Austen’s novels into tidy text format, where each row contains a single word. The text is organized with variables such as book, line number, and chapter, and then tokenized using unnest_tokens(). After the text is in tidy form, sentiment analysis is performed by joining the words to a sentiment lexicon and then summarizing sentiment across sections of the text. The chapter shows this process using the bing, AFINN, and NRC lexicons.

My first step will be to reproduce the base example in a Quarto file. I will include the setup code, load the required libraries, create the tidy_books object, and run the same sentiment analysis workflow shown in the chapter. This includes creating the tidy text dataset, joining it to a sentiment lexicon, grouping the words into text sections, calculating sentiment scores, and visualizing the results. I will also include a citation to Text Mining with R and note that the code pattern is based on the Chapter 2 sentiment analysis example.

For the extension portion, I will apply the same workflow to a different text corpus: David Foster Wallace’s This Is Water speech. Since this text is a speech rather than a novel, I will adapt the original workflow so that the speech can still be analyzed in sections. Instead of chapters across multiple books, I will divide the speech into smaller chunks, such as groups of lines or paragraphs, so that I can observe how sentiment changes throughout the speech. This will allow me to preserve the main idea of the original example, which is to examine sentiment across the progression of a text. The core extension is therefore not inventing a new method, but applying the same Chapter 2 method to a different kind of text.
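The chunking idea can be sketched as follows. This is a minimal illustration, assuming the speech has already been read into one row per line; the toy tibble `toy_speech`, the helper `add_sections()`, and the chunk size of 2 are hypothetical and do not come from the chapter.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the speech: one row per line of text
toy_speech <- tibble(
  line = 1:5,
  text = c("first line", "second line", "third line",
           "fourth line", "fifth line")
)

# Assign each line to a section of `chunk_size` consecutive lines.
# Subtracting 1 keeps the first section the same size as the others,
# because `line` starts at 1 rather than 0.
add_sections <- function(df, chunk_size) {
  df %>% mutate(section = (line - 1) %/% chunk_size)
}

chunked <- add_sections(toy_speech, chunk_size = 2)
chunked$section
```

With `line %/% chunk_size` instead, the first section would hold one line fewer than the rest, which is a small effect in a long novel but more noticeable in a short speech.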

I will also extend the original analysis by adding at least one additional sentiment lexicon beyond those already used in Chapter 2. Since the chapter already uses bing, AFINN, and NRC, my added lexicon should come from another package or an externally researched source. I will compare the results from this added lexicon with the original lexicon results and explain whether the speech appears more positive, more negative, or more mixed depending on the dictionary used. This comparison is important because lexicons classify and score words differently, so sentiment results may vary even when analyzing the same text.
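One way to make that comparison concrete is to put the per-section scores from two lexicons side by side and check how often they point in the same direction. The sketch below uses made-up scores; `scores_a`, `scores_b`, and their values are hypothetical placeholders rather than results from any real lexicon.

```r
library(dplyr)
library(tibble)

# Made-up per-section net scores from two hypothetical lexicons
scores_a <- tibble(section = 0:3, net_a = c(2, -1, 4, -3))
scores_b <- tibble(section = 0:3, net_b = c(5, 3, 1, -6))

comparison <- scores_a %>%
  inner_join(scores_b, by = "section") %>%
  # TRUE when both lexicons give the section the same sign
  mutate(same_direction = sign(net_a) == sign(net_b))

# Share of sections on which the two lexicons agree in direction
agreement <- mean(comparison$same_direction)
agreement
```

Comparing signs rather than raw values avoids treating a count-based score and a weighted score as if they were on the same scale.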

The main goal of my report will be to clearly explain how the extension differs from the original example. The original analysis focuses on Jane Austen’s fiction and shows sentiment changing across narrative arcs. My extension uses a modern speech, which is shorter and more reflective, so the sentiment pattern may look less dramatic or may shift differently across sections. I also expect the additional lexicon to produce different results from the original lexicons because each dictionary is built with different word lists and scoring methods. Rather than treating one result as definitively correct, I will explain how the choice of text and the choice of lexicon both influence the interpretation.

One challenge in this assignment will be preparing This Is Water in a format that works well with tidy text analysis. Because the original chapter uses book structure and line numbers, I will need to create a similar grouping structure for the speech. Another challenge is that lexicon-based sentiment analysis works at the word level and may miss context, such as irony, negation, or phrases whose meaning depends on surrounding words. I will acknowledge these limitations in my discussion so that the results are interpreted carefully. Chapter 2 itself notes that unigram-based lexicon methods do not account for qualifiers such as “not good,” and that chunk size can affect the final sentiment pattern.
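The negation problem can be illustrated with a small bigram check. This is a sketch under stated assumptions: the two sentences in `toy_text` are invented, and the check only looks for sentiment words that directly follow "not", so it is not a full treatment of negation.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Invented example sentences, not taken from the speech
toy_text <- tibble(
  line = 1:2,
  text = c("This is not good at all", "The day was good")
)

# Tokenize into overlapping two-word sequences instead of single words
bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Sentiment words that directly follow "not"; a single-word join would
# count these with their original polarity
negated <- bigrams %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("bing"), by = c("word2" = "word"))

negated
```

Here the only match is "not good". A single-word join would count "good" as positive in both sentences, although the first use is negated.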

Citation

Silge, J., & Robinson, D. (2017). Text Mining with R: A tidy approach. O’Reilly Media. https://www.tidytextmining.com/

Wallace, D. F. (2005). This is water: Commencement address at Kenyon College. https://web.ics.purdue.edu/~drkelly/DFWKenyonAddress2005.pdf

Codebase

# Set up libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)
library(janeaustenr)
library(stringr)

# Reproduce base example
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_books

# A tibble: 725,055 × 4
   book                linenumber chapter word       
   <fct>                    <int>   <int> <chr>      
 1 Sense & Sensibility          1       0 sense      
 2 Sense & Sensibility          1       0 and        
 3 Sense & Sensibility          1       0 sensibility
 4 Sense & Sensibility          3       0 by         
 5 Sense & Sensibility          3       0 jane       
 6 Sense & Sensibility          3       0 austen     
 7 Sense & Sensibility          5       0 1811       
 8 Sense & Sensibility         10       1 chapter    
 9 Sense & Sensibility         10       1 1          
10 Sense & Sensibility         13       1 the        
# ℹ 725,045 more rows

## Sentiment Analysis with Bing Lexicon

bing <- get_sentiments("bing")

# Some words match more than one Bing entry, so declare the
# many-to-many join explicitly
tidy_books_sentiment <- tidy_books %>%
  inner_join(bing, by = "word", relationship = "many-to-many")

tidy_books_sentiment

# A tibble: 52,287 × 5
   book                linenumber chapter word        sentiment
   <fct>                    <int>   <int> <chr>       <chr>    
 1 Sense & Sensibility         16       1 respectable positive 
 2 Sense & Sensibility         16       1 good        positive 
 3 Sense & Sensibility         18       1 advanced    positive 
 4 Sense & Sensibility         20       1 death       negative 
 5 Sense & Sensibility         20       1 great       positive 
 6 Sense & Sensibility         21       1 loss        negative 
 7 Sense & Sensibility         25       1 comfortably positive 
 8 Sense & Sensibility         28       1 goodness    positive 
 9 Sense & Sensibility         28       1 solid       positive 
10 Sense & Sensibility         29       1 comfort     positive 
# ℹ 52,277 more rows

# Summarize sentiment across the text

sentiment_by_section <- tidy_books_sentiment %>%
  mutate(section = linenumber %/% 80) %>%
  count(book, section, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

sentiment_by_section

# A tibble: 920 × 5
   book                section negative positive net_sentiment
   <fct>                 <dbl>    <int>    <int>         <int>
 1 Sense & Sensibility       0       16       32            16
 2 Sense & Sensibility       1       19       53            34
 3 Sense & Sensibility       2       12       31            19
 4 Sense & Sensibility       3       15       31            16
 5 Sense & Sensibility       4       16       34            18
 6 Sense & Sensibility       5       16       51            35
 7 Sense & Sensibility       6       24       40            16
 8 Sense & Sensibility       7       23       51            28
 9 Sense & Sensibility       8       30       40            10
10 Sense & Sensibility       9       15       19             4
# ℹ 910 more rows

ggplot(sentiment_by_section, aes(x = section, y = net_sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free_x")

# Extend the base example

speech_lines <- readLines("thisiswater.txt")

Warning in readLines("thisiswater.txt"): incomplete final line found on
'thisiswater.txt'

# Split each line into sentence-like pieces at ". " (the period itself is dropped)
speech_lines <- unlist(str_split(speech_lines, "\\. "))

speech_df <- tibble(
  line = seq_along(speech_lines),
  text = speech_lines
)
speech_df

# A tibble: 154 × 2
    line text                                                                   
   <int> <chr>                                                                  
 1     1 "Transcription of the 2005 Kenyon Commencement Address - May 21, 2005 …
 2     2 "In fact I'm gonna [mumbles while pulling up his gown and taking out a…
 3     3 "There are these two young fish swimming along and they happen to meet…
 4     4 "How's the water?\" And the two young fish swim on for a bit, and then…
 5     5 "The story [\"thing\"] turns out to be one of the better, less bullshi…
 6     6 "I am not the wise old fish"                                           
 7     7 "The point of the fish story is merely that the most obvious, importan…
 8     8 "Stated as an English sentence, of course, this is just a banal platit…
 9     9 "Of course the main requirement of speeches like this is that I'm supp…
10    10 "So let's talk about the single most pervasive cliché in the commencem…
# ℹ 144 more rows

tidy_speech <- speech_df %>%
  unnest_tokens(word, text)

tidy_speech

# A tibble: 3,871 × 2
    line word         
   <int> <chr>        
 1     1 transcription
 2     1 of           
 3     1 the          
 4     1 2005         
 5     1 kenyon       
 6     1 commencement 
 7     1 address      
 8     1 may          
 9     1 21           
10     1 2005         
# ℹ 3,861 more rows

# Sentiment Analysis of *This Is Water* with Bing

# Join once, then reuse the joined words for the section summary
speech_bing <- tidy_speech %>%
  inner_join(get_sentiments("bing"), by = "word")

speech_sentiment <- speech_bing %>%
  mutate(section = line %/% 5) %>%   # group into chunks of about five lines
  count(section, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

speech_sentiment

# A tibble: 31 × 4
   section negative positive net_sentiment
     <dbl>    <int>    <int>         <int>
 1       0        1        2             1
 2       1        4        6             2
 3       2        5        7             2
 4       3        6        2            -4
 5       4        2        4             2
 6       5        3        2            -1
 7       6        7        0            -7
 8       7        4        4             0
 9       8        2        1            -1
10       9        3        8             5
# ℹ 21 more rows

# Visualize the sentiment
ggplot(data = speech_sentiment, aes(x = section, y = net_sentiment)) +
  geom_col()

speech_afinn <- tidy_speech %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(section = line %/% 5) %>%
  group_by(section) %>%
  summarize(net_sentiment = sum(value), .groups = "drop")

speech_afinn

# A tibble: 31 × 2
   section net_sentiment
     <dbl>         <dbl>
 1       0             2
 2       1             6
 3       2             6
 4       3            -5
 5       4             5
 6       5            10
 7       6            -7
 8       7             5
 9       8            -3
10       9             4
# ℹ 21 more rows

ggplot(data = speech_afinn, aes(x = section, y = net_sentiment)) +
  geom_col()

Comparison and Discussion

The original Chapter 2 example from Text Mining with R analyzes sentiment in Jane Austen’s novels by converting the text into tidy format, joining words to sentiment lexicons, and tracking sentiment across sections of each book. In my extension, I applied the same workflow to David Foster Wallace’s This Is Water speech. Because the speech is shorter and does not have chapters like a novel, I split it into sentence-length lines and grouped those lines into sections of about five so that I could examine how sentiment changes throughout the speech.

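Using the Bing lexicon, net sentiment (the number of positive words minus the number of negative words) varied from section to section: among the first ten sections shown above, it ranged from -7 in section 6 to 5 in section 9. I then ran the same analysis with the AFINN lexicon, which differs from Bing because it assigns each word a weighted numeric value rather than simply labeling it as positive or negative. This means AFINN can capture stronger or weaker sentiment intensity, while Bing gives a simpler overall balance of positive versus negative words. Among those first ten sections, the two lexicons agree on the most negative one (section 6 scores -7 under both) but disagree elsewhere: section 5 is slightly negative under Bing (-1) and clearly positive under AFINN (10).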
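The difference between label-based and weighted scoring can be sketched as follows. The two mini-lexicons are hypothetical: their labels and weights are made up for illustration and are not the real Bing or AFINN entries.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicons; labels and weights are invented
categorical_lex <- tibble(
  word = c("good", "wonderful", "bad", "terrible"),
  sentiment = c("positive", "positive", "negative", "negative")
)
weighted_lex <- tibble(
  word = c("good", "wonderful", "bad", "terrible"),
  value = c(1, 4, -1, -4)
)

# A toy section containing two mild positives and one strong negative
words <- tibble(word = c("good", "good", "terrible"))

# Label-based scoring: positive count minus negative count
categorical_net <- words %>%
  inner_join(categorical_lex, by = "word") %>%
  summarize(net = sum(sentiment == "positive") - sum(sentiment == "negative")) %>%
  pull(net)

# Weighted scoring: sum of the numeric values
weighted_net <- words %>%
  inner_join(weighted_lex, by = "word") %>%
  summarize(net = sum(value)) %>%
  pull(net)

c(categorical = categorical_net, weighted = weighted_net)
```

In this toy section the label-based score is positive while the weighted score is negative, which shows how two lexicons can disagree on the direction of the same passage.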

Compared with the original Jane Austen example, the sentiment pattern in This Is Water is less narrative and less dramatic, likely because the speech is reflective rather than fictional. The Austen example shows sentiment moving through story arcs across multiple novels, while the speech shows smaller shifts across a single modern text. Part of that difference in scale comes from chunk size: each Austen section covers 80 lines, while each speech section covers about five sentence-length lines, so a speech section contains far fewer scored words. The results also show that sentiment analysis depends on the lexicon being used, since different dictionaries classify and score words differently.

A limitation of this method is that lexicon-based sentiment analysis works at the single-word level and does not fully account for context, irony, negation, or phrases whose meaning depends on surrounding words.