Sentiment Analysis

Author

Guibril Ramde

Approach

In this assignment, I will reproduce and extend the sentiment analysis example from Chapter 2 of *Text Mining with R* by Julia Silge and David Robinson. First, I will recreate the original example in a Quarto document using the same general workflow shown in the chapter. This includes tokenizing the text, joining it with sentiment lexicons, and analyzing the emotional tone of the text.

To reproduce the base example, I will use the `janeaustenr` package together with the `tidytext` and `textdata` packages. I will download the required sentiment lexicons such as AFINN, Bing, and NRC, and then use them to analyze the Jane Austen texts.

To extend the analysis, I will use a different text corpus of my choice and apply at least one additional sentiment lexicon. I will then compare the results from the new corpus with the original Jane Austen example and explain how the sentiment patterns differ across lexicons and texts.

For the final report, I will include:
1. the reproduced Chapter 2 example
2. a citation to the book and the original example
3. an extended analysis using a new corpus
4. at least one additional sentiment lexicon
5. an explanation of how the new results differ from the original results

Step 1: Reproduce the Chapter 2 Example

#install.packages("textdata")
library(tidytext)
library(textdata)

get_sentiments("afinn")

# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

get_sentiments("bing")

# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows

get_sentiments("nrc")

# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows
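The three lexicons have different shapes: AFINN assigns a numeric value, Bing assigns a single positive or negative label, and NRC can assign several labels to one word. The following is a minimal base R sketch of how that changes the score for one sentence. It uses toy lexicons rather than the real ones. The entries for "abandon", "abnormal", and "abacus" are copied from the output above; the entries for "good" and the example sentence are assumptions made up for illustration.

```r
# Toy lexicons. "abandon", "abnormal" and "abacus" follow the printed output
# above; the "good" entries are illustrative assumptions.
afinn_toy <- c(abandon = -2, good = 3)
bing_toy  <- c(abnormal = "negative", good = "positive")
nrc_toy   <- data.frame(
  word      = c("abacus", "abandon", "abandon", "abandon"),
  sentiment = c("trust", "fear", "negative", "sadness")
)

# A made-up sentence, split into lowercase words.
sentence <- "They abandon the abnormal plan but the food is good"
words <- strsplit(tolower(sentence), " +")[[1]]

# AFINN style: sum the numeric values of the matched words.
afinn_score <- sum(afinn_toy[words], na.rm = TRUE)

# Bing style: count positive labels minus negative labels.
bing_labels <- bing_toy[words]
bing_net <- sum(bing_labels == "positive", na.rm = TRUE) -
  sum(bing_labels == "negative", na.rm = TRUE)

# NRC style: one matched word can contribute several labels.
nrc_hits <- nrc_toy$sentiment[nrc_toy$word %in% words]

afinn_score
bing_net
nrc_hits
```

Even on one sentence the lexicons can disagree, because each one matches a different subset of the words and scores them on a different scale.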

library(janeaustenr)
library(stringr)
library(dplyr)

tidy_books <- austen_books() |>
  group_by(book) |>
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(
      text,
      regex("^chapter [\\divxlc]", ignore_case = TRUE)
    ))
  ) |>
  ungroup() |>
  unnest_tokens(word, text)
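The chapter column is built by flagging lines that look like chapter headings and taking a running total of those flags. The following is a minimal base R sketch of that idea on a few made-up lines. It uses grepl() with perl = TRUE in place of str_detect(), which is a substitution for illustration, not the code used above.

```r
# Made-up lines that mimic the start of a novel.
lines <- c(
  "PRIDE AND PREJUDICE",
  "",
  "Chapter 1",
  "It is a truth universally acknowledged,",
  "CHAPTER II",
  "A second chapter begins here."
)

# TRUE where a line starts with "chapter " followed by a digit or a
# Roman-numeral letter, ignoring case.
is_heading <- grepl("^chapter [\\divxlc]", lines,
                    ignore.case = TRUE, perl = TRUE)

# The running total of headings gives the chapter number for every line;
# lines before the first heading get chapter 0.
chapter <- cumsum(is_heading)

data.frame(lines, is_heading, chapter)
```

This is why the title lines in the Pride & Prejudice output further down have chapter 0.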

nrc_joy <- get_sentiments("nrc") |>
  filter(sentiment == "joy")


tidy_books |>
  filter(book == "Emma") |>
  inner_join(nrc_joy) |>
  count(word, sort = TRUE)

Joining with `by = join_by(word)`

# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows

library(tidyr)

jane_austen_sentiment <- tidy_books |>
  inner_join(get_sentiments("bing")) |>
  count(book, index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`

Warning in inner_join(tidy_books, get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
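The index = linenumber %/% 80 step groups the text into sections of roughly 80 lines, and the net sentiment of each section is its positive count minus its negative count. The following is a minimal base R sketch of that calculation. The line numbers and labels are made up, and table() stands in for the count() and pivot_wider() steps used above.

```r
# Made-up matched words: the line each one came from and its Bing-style label.
matched <- data.frame(
  linenumber = c(1, 5, 79, 80, 81, 159, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive", "positive")
)

# Integer division assigns lines 1-79 to section 0, 80-159 to section 1, etc.
matched$index <- matched$linenumber %/% 80

# Cross-tabulate sections by label; empty cells are filled with 0 automatically.
counts <- table(matched$index, matched$sentiment)

# Net sentiment per section.
net <- counts[, "positive"] - counts[, "negative"]
net
```

Because line numbers start at 1, section 0 covers only 79 lines, while every later section covers 80.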

pride_prejudice <- tidy_books |>
  filter(book == "Pride & Prejudice")
pride_prejudice

# A tibble: 122,204 × 4
   book              linenumber chapter word     
   <fct>                  <int>   <int> <chr>    
 1 Pride & Prejudice          1       0 pride    
 2 Pride & Prejudice          1       0 and      
 3 Pride & Prejudice          1       0 prejudice
 4 Pride & Prejudice          3       0 by       
 5 Pride & Prejudice          3       0 jane     
 6 Pride & Prejudice          3       0 austen   
 7 Pride & Prejudice          7       1 chapter  
 8 Pride & Prejudice          7       1 1        
 9 Pride & Prejudice         10       1 it       
10 Pride & Prejudice         10       1 is       
# ℹ 122,194 more rows

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

Joining with `by = join_by(word)`

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

Joining with `by = join_by(word)`
Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 215 of `x` matches multiple rows in `y`.
ℹ Row 5178 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
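The many-to-many warnings printed above mean that at least one word appears more than once in the lexicon being joined, so each occurrence of that word in the text produces more than one row. The following is a minimal base R sketch of the effect, using merge() in place of inner_join(). The tokens and the lexicon are hypothetical; the repeated entry is an assumption for illustration.

```r
# Three hypothetical tokens from a text.
tokens <- data.frame(
  word       = c("happy", "envious", "sad"),
  linenumber = c(1, 2, 3)
)

# A hypothetical lexicon in which one word carries both labels.
lexicon <- data.frame(
  word      = c("happy", "envious", "envious", "sad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# The join returns one row per matching pair, so the repeated word is doubled.
joined <- merge(tokens, lexicon, by = "word")
nrow(joined)

# Words listed more than once in the lexicon are the source of the extra rows.
dup_words <- names(which(table(lexicon$word) > 1))
dup_words
```

The same duplicate check can be run on a real lexicon before joining. If the duplicates are expected, the warning text itself names the option that silences it: relationship = "many-to-many".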

The Bing-based plot shows net sentiment rising and falling over the course of each Jane Austen novel. For Pride & Prejudice, the three lexicons are scored differently: Bing counts positive words minus negative words, AFINN sums weighted values that range from -5 to 5, and NRC, which also tags emotions such as fear and sadness, is restricted here to its positive and negative labels. The three trajectories capture similar overall patterns but differ in scale, so their absolute values should not be compared directly across lexicons.
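One way to check how closely two lexicons track each other is to compare their per-section scores directly. The following is a minimal base R sketch of that check. The two score vectors are hypothetical numbers chosen for illustration, not results from the analysis above.

```r
# Hypothetical per-section scores for the same six sections of text.
afinn_bins <- c(12, -4, 20, 7, -9, 15)
bing_bins  <- c(5, -2, 9, 2, -5, 6)

# Correlation measures whether the two series rise and fall together,
# regardless of their scale.
r <- cor(afinn_bins, bing_bins)

# Share of sections where both methods agree on the direction of sentiment.
sign_agreement <- mean(sign(afinn_bins) == sign(bing_bins))

# Standardizing puts both series on a common scale for plotting side by side.
afinn_z <- as.numeric(scale(afinn_bins))
bing_z  <- as.numeric(scale(bing_bins))

r
sign_agreement
```

With the real data, afinn$sentiment and the Bing rows of bing_and_nrc could be used in place of the toy vectors, provided both are ordered by index and cover the same sections.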

Source

This analysis reproduces and extends examples from Chapter 2 of Text Mining with R by Julia Silge and David Robinson.

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.

Step 2: Extend the Analysis

Using a Different Text Corpus

For the extension, I will analyze a different text corpus and apply at least one additional sentiment lexicon. I will then compare the findings with the Jane Austen example from Chapter 2.

#install.packages("gutenbergr")
library(gutenbergr)

library(tidytext)
library(dplyr)


gutenberg_metadata |>
  filter(str_detect(title, "Alice in Wonderland"))

# A tibble: 7 × 8
  gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
         <int> <chr>     <chr>                <int> <fct>    <chr>              
1          114 "The Ten… Tenni…                  74 en       ""                 
2        19551 "Alice i… Carro…                   7 en       "Category: Childre…
3        19551 "Alice i… Gorha…                8742 en       "Category: Childre…
4        35688 "Alice i… Carro…                   7 en       "Category: Childre…
5        35688 "Alice i… Gerst…               37837 en       "Category: Childre…
6        35990 "The Sto… Bowma…               38017 en       "Category: Biograp…
7        36308 "Songs F… Carro…                   7 en       "Category: Childre…
# ℹ 2 more variables: rights <fct>, has_text <lgl>

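# The search above does not list the original novel because its catalog title
# is "Alice's Adventures in Wonderland"; that text is Project Gutenberg ID 11.
alice <- gutenberg_download(11)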

Mirror list unavailable. Falling back to <https://aleph.pglaf.org>.
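The metadata search above does not list ID 11 because the exact phrase "Alice in Wonderland" does not occur in the title "Alice's Adventures in Wonderland". The following is a minimal base R sketch of the difference between an exact phrase and a looser pattern. The title vector is a small made-up sample, not the real catalog, and grepl() stands in for str_detect().

```r
# A small made-up sample of titles.
titles <- c(
  "Alice's Adventures in Wonderland",
  "Alice in Wonderland",
  "Through the Looking-Glass"
)

# The exact phrase misses the original novel's full title.
exact_match <- grepl("Alice in Wonderland", titles, fixed = TRUE)

# A looser pattern allows any text between "Alice" and "Wonderland".
loose_match <- grepl("Alice.*Wonderland", titles)

data.frame(titles, exact_match, loose_match)
```

The same looser pattern could be passed to str_detect() in the metadata filter.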

# Analysis: tokenize the text, keeping line numbers so it can be binned later

tidy_alice <- alice |>
  mutate(linenumber = row_number()) |>
  unnest_tokens(word, text)

alice_sentiment <- tidy_alice |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment)

library(ggplot2)
ggplot(alice_sentiment, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  labs(title = "Sentiment in Alice in Wonderland")

alice_sentiment_time <- tidy_alice |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)

ggplot(alice_sentiment_time, aes(x = index, y = sentiment)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Sentiment Trajectory in Alice in Wonderland",
    x = "Text Index",
    y = "Net Sentiment"
  )

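For the extended analysis, I used Alice’s Adventures in Wonderland as a different text corpus. Using the Bing lexicon, I found that the text contains both positive and negative language, but the sentiment pattern differs from the Jane Austen example, which suggests that the tone and vocabulary of the two corpora are different. Because Alice is much shorter than any of the Austen novels, it produces far fewer 80-line sections, so the two corpora are better compared on the shape of the trajectory and on the proportion of positive and negative words than on raw counts.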
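When two corpora differ in length, raw counts of positive and negative words are hard to compare. The following is a minimal base R sketch that converts counts to shares first. The counts are hypothetical placeholders, not the results of this analysis.

```r
# Hypothetical Bing-style counts for two corpora of different sizes.
counts <- data.frame(
  corpus   = c("Corpus A (toy)", "Corpus B (toy)"),
  positive = c(600, 90),
  negative = c(400, 110)
)

# Share of matched words that are negative, which does not depend on length.
counts$share_negative <- counts$negative / (counts$positive + counts$negative)

counts
```

With the real data, the positive and negative totals from alice_sentiment and from the Jane Austen join could be placed in the same table.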

Using an Additional Sentiment Lexicon

#install.packages("syuzhet")
library(syuzhet)

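# Join the downloaded lines into one string first, so that sentences that
# wrap across lines are not cut at the line breaks, then split into sentences.
alice_text <- paste(alice$text, collapse = " ")
sentences <- get_sentences(alice_text)
# Score each sentence with the package's default "syuzhet" lexicon.
sentiment_values <- get_sentiment(sentences, method = "syuzhet")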

plot(sentiment_values, type = "l",
     main = "Sentiment using Syuzhet",
     ylab = "Sentiment",
     xlab = "Text Progress")

Using the syuzhet package, I analyzed sentiment at the sentence level, which produces a sentiment trajectory across the text. Like Bing, AFINN, and NRC, the default syuzhet method is dictionary-based: it looks up each word in its own lexicon. The difference is in the unit of analysis and the scoring. Bing classifies words strictly as positive or negative, and the earlier analysis counted those words in 80-line sections. Syuzhet assigns each word a numeric value and sums the values within each sentence, so every sentence receives its own score.

This gives a more detailed view of how sentiment changes throughout the narrative than a single count of positive and negative words. The results show fluctuations in sentiment that reflect changes in tone and events within the story.
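Sentence-level scoring depends on where the sentence boundaries fall. The following is a minimal base R sketch of the idea: join wrapped lines, split on sentence-ending punctuation, and sum word scores per sentence. It does not use the syuzhet package. The splitting rule, the lines, and the toy_lexicon values are all assumptions made up for illustration.

```r
# Hypothetical word scores; these are NOT taken from the syuzhet lexicon.
toy_lexicon <- c(tired = -1, bored = -1, lovely = 1)

# Three made-up lines, wrapped the way a plain-text book file might be.
wrapped_lines <- c(
  "Alice was very tired and",
  "bored. Then she saw a",
  "lovely garden."
)

# Split on whitespace that follows sentence-ending punctuation.
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

# Sum the toy scores of the words in one sentence; unknown words count as 0.
score_sentence <- function(s) {
  words <- strsplit(tolower(gsub("[^[:alpha:] ]", "", s)), " +")[[1]]
  sum(toy_lexicon[words], na.rm = TRUE)
}

# Splitting line by line cuts sentences at the line breaks.
by_line <- split_sentences(wrapped_lines)

# Joining the lines first keeps each sentence whole.
by_text <- split_sentences(paste(wrapped_lines, collapse = " "))

scores <- unname(vapply(by_text, score_sentence, numeric(1)))
by_text
scores
```

In this toy case, splitting line by line yields four fragments instead of two sentences, which would spread the score of the first sentence across separate pieces.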

Conclusion

This assignment was both challenging and informative. By reproducing and extending the sentiment analysis example from Chapter 2, I learned how to work with text data, apply different sentiment lexicons, and visualize patterns in language.

Using Alice’s Adventures in Wonderland as an alternative corpus, I observed that sentiment patterns differ noticeably depending on the structure and style of the text. The Bing lexicon provided a simple classification of positive and negative words, while the Syuzhet method scored each sentence and so offered a more detailed view of how sentiment evolves throughout the narrative.

Overall, this project improved my understanding of text mining, sentiment analysis, and data visualization in R, and demonstrated how different analytical approaches can lead to different insights.