Sentiment Analysis with Text Mining with R

Author

ZIHAO YU

1. How will I tackle the problem?

First, I will use the janeaustenr package to reproduce the original sample code from the book. I will then load and tidy the text of ‘The Great Gatsby’, split it into words, and apply at least two sentiment lexicons to trace the emotional shifts in the novel. Finally, I will compare the results produced by the different lexicons.

2. What data challenges do I anticipate?

Cleaning the full text of the book may require additional code and may still leave some unusual outliers, so follow-up analyses will need further processing. In addition, some words carry different meanings in different contexts, which can distort the overall sentiment curve.
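
The context problem can be illustrated with a minimal base R sketch. The three-word lexicon and the helper `score_words()` are hypothetical and exist only for this illustration; they are not part of tidytext or of any real lexicon. Because a word-by-word lookup ignores negation, both sentences below receive the same score.

```r
# Hypothetical three-word lexicon, for illustration only
lexicon <- c(good = 1, happy = 1, bad = -1)

# Score a sentence by summing the lexicon values of its words
score_words <- function(sentence) {
  # Lowercase, then split on anything that is not a letter
  words <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  # Words missing from the lexicon give NA and are dropped
  sum(lexicon[words], na.rm = TRUE)
}

plain   <- score_words("The party was good")
negated <- score_words("The party was not good")

# The negation is invisible to a single-word lookup
print(c(plain = plain, negated = negated))
```

Both calls return the same value because "not" is absent from the lexicon, which is one way a unigram approach can misread a passage.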

The source text is available at ‘https://www.gutenberg.org/ebooks/64317’.


1. Reproduce the Base Example

This section reproduces the Chapter 2 sentiment analysis example from ‘Text Mining with R: A Tidy Approach’ by Julia Silge and David Robinson.

library(textdata)
library(tidytext)
library(janeaustenr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows
library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
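
The step `index = linenumber %/% 80` in the pipeline above assigns each line to an 80-line section, and the net score of a section is its positive count minus its negative count. The following base R sketch shows the same arithmetic on a small made-up set of matched words; the data frame `hits` is hypothetical and is not drawn from either novel.

```r
# Hypothetical matched words: line number and Bing-style label
hits <- data.frame(
  linenumber = c(3, 40, 79, 80, 120, 159, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "negative", "negative")
)

# Integer division maps lines 0-79 to section 0, 80-159 to section 1, etc.
hits$index <- hits$linenumber %/% 80

# Count labels per section, then take positive minus negative
counts <- table(hits$index, hits$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
print(net)
```

Note that section 0 holds only 79 lines when line numbers start at 1, since line 80 already falls in section 1.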

Citation

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. Chapter 2: Sentiment analysis with tidy data. https://www.tidytextmining.com/sentiment


2. Extend the Analysis

I use a different text corpus: The Great Gatsby by F. Scott Fitzgerald.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ readr     2.1.5
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(gutenbergr)

gatsby_raw <- gutenberg_download(64317)
Using mirror https://aleph.pglaf.org.
tidy_gatsby <-
  gatsby_raw %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  unnest_tokens(word, text)

head(tidy_gatsby)
# A tibble: 6 × 4
  gutenberg_id linenumber chapter word  
         <int>      <int>   <int> <chr> 
1        64317          1       0 the   
2        64317          1       0 great 
3        64317          1       0 gatsby
4        64317          2       0 by    
5        64317          3       0 f     
6        64317          3       0 scott 
gatsby_sentiment <- 
  tidy_gatsby |>
  inner_join(get_sentiments("bing")) |>
  count(index = linenumber %/% 80, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
head(gatsby_sentiment, 10)
# A tibble: 10 × 4
   index negative positive sentiment
   <dbl>    <int>    <int>     <int>
 1     0       16       23         7
 2     1       18       30        12
 3     2       21       28         7
 4     3       21       28         7
 5     4       19        9       -10
 6     5       26       16       -10
 7     6       27       24        -3
 8     7       21       26         5
 9     8       12       14         2
10     9       29       10       -19
ggplot(
  gatsby_sentiment, 
  aes(index, sentiment)
) +
  geom_col(show.legend = FALSE, fill = "pink") +
  theme_minimal()

afinn <-
  tidy_gatsby %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
Joining with `by = join_by(word)`
bing_and_nrc <- 
  bind_rows(
  tidy_gatsby %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  tidy_gatsby %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 3613 of `x` matches multiple rows in `y`.
ℹ Row 2275 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bind_rows(afinn, bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  theme_minimal()
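
The many-to-many warning raised by the NRC join above indicates that at least one word appears more than once in the filtered lexicon, so each occurrence of that word in the text matches several lexicon rows. A base R sketch of how such words could be located is shown below. The mini-lexicon is hypothetical, and `merge()` stands in for `inner_join()` only to keep the example free of package dependencies.

```r
# Hypothetical mini-lexicon in which one word carries two labels
lexicon <- data.frame(
  word      = c("envy", "envy", "joy", "grief"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Words listed more than once in the lexicon
dup_words <- unique(lexicon$word[duplicated(lexicon$word)])
print(dup_words)

# A text that uses "envy" once still yields two joined rows, one per label
text_words <- data.frame(word = c("envy", "joy"))
joined <- merge(text_words, lexicon, by = "word")
print(joined)
```

A word labeled both positive and negative adds one to each count, so it cancels out in the `positive - negative` score. As the warning itself suggests, passing `relationship = "many-to-many"` to `inner_join()` silences the message when this behavior is intended.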

Conclusion based on the chart.

  1. The chart shows that negative sentiment dominates most of ‘The Great Gatsby’ and is especially concentrated toward the end, which is consistent with the story’s tragic ending. Overall, the text leans negative.

  2. A comparison of the three methods shows that AFINN registers more positive sentiment in the early sections, while NRC is mostly positive throughout. In all three methods, negative sentiment is concentrated in the final sections, which is consistent with the tragic ending of ‘The Great Gatsby’.
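
A claim such as "the text leans negative" can be checked numerically as well as visually. The sketch below uses only the net Bing scores for sections 0 to 9 from the table printed earlier, so it says nothing about the rest of the novel; the same two expressions could be applied to the full `gatsby_sentiment$sentiment` column.

```r
# Net Bing scores for sections 0-9, copied from the table printed earlier
first_ten <- c(7, 12, 7, 7, -10, -10, -3, 5, 2, -19)

# Overall balance and the share of sections with a negative net score
total_net      <- sum(first_ten)
share_negative <- mean(first_ten < 0)

print(total_net)
print(share_negative)
```

For these ten sections the net total is -2 and four of the ten sections are negative, so the opening of the novel is close to balanced under the Bing lexicon.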


Additional Sentiment Analysis Using Another R Package.

External source link: ‘https://rstudio-pubs-static.s3.amazonaws.com/676279_2fa8c2a7a3da4e7089e24442758e9d1b.html’

library(syuzhet)

jockers_sentiment <- 
  gatsby_raw %>%
  mutate(
    linenumber = row_number(),
    index = linenumber %/% 80,
    jockers = get_sentiment(text, method = "syuzhet")
  ) %>%
  group_by(index) %>%
  summarise(sentiment = sum(jockers, na.rm = TRUE), .groups = "drop")

head(jockers_sentiment, 10)
# A tibble: 10 × 2
   index sentiment
   <dbl>     <dbl>
 1     0    7.4   
 2     1   20.3   
 3     2   13.0   
 4     3   14.4   
 5     4    0.0500
 6     5    2.2   
 7     6    6.2   
 8     7    8.4   
 9     8   11.8   
10     9   -4.75  
ggplot(
  jockers_sentiment,
  aes(index, sentiment)
) +
  geom_col(show.legend = FALSE, fill = "pink") +
  theme_minimal() +
  labs(
    title = "Sentiment in The Great Gatsby Using Jockers/Syuzhet",
    x = "Section",
    y = "Sentiment"
)

Conclusion

Using the syuzhet package, the graph shows that sentiment scores for ‘The Great Gatsby’ are generally positive in the first half but fluctuate more in the second half, where negative sections become more frequent. In particular, the scores drop sharply toward the end, which is consistent with the plot’s gradual movement toward a tragic conclusion.

As with the earlier AFINN results, this shows the novel’s sentiment shifting from relatively positive to relatively negative. By contrast, the Bing analysis shows a consistently negative tone throughout the book, and the NRC analysis indicates mostly positive sentiment.
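
Section-level scores are noisy, so a trend such as "positive first, negative later" may be easier to see after smoothing. The sketch below applies a centered three-section moving average to the first ten syuzhet scores as printed in the table above. The window width of 3 is an arbitrary assumption, and `stats::filter()` is called with its namespace because dplyr masks `filter()`.

```r
# First ten syuzhet section scores, as printed in the table above
scores <- c(7.4, 20.3, 13.0, 14.4, 0.05, 2.2, 6.2, 8.4, 11.8, -4.75)

# Centered moving average; the window width is an assumption
window   <- 3
smoothed <- stats::filter(scores, rep(1 / window, window), sides = 2)

# The first and last positions are NA because the window is incomplete
print(round(as.numeric(smoothed), 2))
```

With the full `jockers_sentiment$sentiment` column in place of `scores`, the smoothed values could be drawn as a line over the existing bar chart.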

How the results differ from the original example

Unlike the original example, which covers several Jane Austen novels, this analysis covers a single novel, The Great Gatsby. The Gatsby results show a stronger negative shift near the end, which fits the novel’s tragic ending.