Introduction

For Assignment #10, I am building on an example from the Text Mining with R reading. In this example, I’ll extend the analysis to a new corpus and introduce a new sentiment lexicon, VADER. The first step involves retrieving sentiment scores using the AFINN, Bing, and NRC lexicons.

knitr::opts_chunk$set(echo = TRUE)
library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

This example uses Jane Austen’s novels. After the novels are imported, the data is tidied by grouping the text by book, numbering each line, and identifying chapter breaks with a regular expression that matches common chapter headings. Finally, the text is split into individual words, one per row, making it ready for text analysis.

library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
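
The chapter counter in the chunk above relies on cumsum() over a logical vector: each line that matches the chapter-heading pattern adds one to the running total, so every line inherits the number of the most recent heading. A minimal sketch on made-up lines (toy_lines is hypothetical and not taken from janeaustenr):

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the lines of one book
toy_lines <- tibble(
  text = c("A TITLE", "", "CHAPTER 1", "It is a truth", "Chapter II", "Another line")
)

toy_chapters <- toy_lines %>%
  mutate(
    linenumber = row_number(),
    # TRUE on heading lines; cumsum() turns the TRUE/FALSE flags into a running chapter number
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))
  )

toy_chapters
```

Lines that appear before the first heading get chapter 0, which is why a later chunk filters out chapter 0.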

Next, I use the NRC lexicon with filter() to keep only the joy words, then count how often each of them appears in the book Emma.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows
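
To make the filter-then-join pattern easier to follow, here is a small sketch on made-up data. The names toy_words and toy_lexicon are hypothetical, and the sentiment labels are invented for illustration rather than copied from NRC:

```r
library(dplyr)

# Hypothetical tokens, one word per row
toy_words <- tibble(word = c("good", "friend", "good", "the", "fear", "hope"))

# Hypothetical lexicon in the NRC layout (word, sentiment)
toy_lexicon <- tibble(
  word      = c("good", "friend", "hope", "fear"),
  sentiment = c("joy", "joy", "joy", "fear")
)

# Keep only the joy entries, as in the chunk above
toy_joy <- toy_lexicon %>%
  filter(sentiment == "joy")

# inner_join() drops every token that is not a joy word; count() tallies the rest
joy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Words without a lexicon entry ("the") and words tagged with a different sentiment ("fear") both disappear in the join, so only joy words are counted.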

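To perform the sentiment analysis, inner_join() matches each word against the Bing lexicon. Words are counted as positive or negative within 80-line sections of each book, and the net sentiment of a section is its positive count minus its negative count.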

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
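
Two details of the pipeline above can be illustrated on toy data: the integer division linenumber %/% 80 that assigns each line to an 80-line section, and the many-to-many warning, which appears when a word occurs more than once in the text and also has more than one row in the lexicon. The sketch below uses hypothetical tokens and a hypothetical lexicon (the duplicated "envious" row is invented to mimic a repeated key) and passes the relationship argument that the warning message itself suggests:

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens with their line numbers
toy_tokens <- tibble(
  linenumber = c(1, 40, 79, 80, 120, 165),
  word       = c("happy", "sad", "envious", "good", "bad", "happy")
)

# Hypothetical lexicon; "envious" is listed twice on purpose
toy_bing <- tibble(
  word      = c("happy", "good", "sad", "bad", "envious", "envious"),
  sentiment = c("positive", "positive", "negative", "negative", "positive", "negative")
)

toy_sentiment <- toy_tokens %>%
  # Declaring the relationship silences the warning (needs dplyr 1.1.0 or later)
  inner_join(toy_bing, by = "word", relationship = "many-to-many") %>%
  # Lines 0-79 fall in section 0, lines 80-159 in section 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sentiment
```

Silencing the warning does not remove the duplicate: a word listed as both positive and negative is counted once on each side, so it cancels out in the net score.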

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 × 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ℹ 122,194 more rows

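The three lexicons are compared on a single novel, Pride & Prejudice, which was extracted from tidy_books with filter() above. Bing and NRC both label words as positive or negative, so after inner_join() their words are counted and the net sentiment is the positive count minus the negative count. AFINN, by contrast, assigns each word a numeric score, so its values are summed within each 80-line section.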

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
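
The difference between summing numeric scores and counting binary labels can be seen on a handful of made-up words. In this sketch, toy_numeric and toy_binary are hypothetical lexicons, and the scores are invented for illustration rather than taken from AFINN:

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens with their line numbers
toy_tokens <- tibble(
  linenumber = c(5, 10, 50, 90, 100),
  word       = c("love", "hate", "good", "terrible", "nice")
)

# Hypothetical numeric lexicon (AFINN-style layout: word, value)
toy_numeric <- tibble(
  word  = c("love", "hate", "good", "terrible", "nice"),
  value = c(3, -1, 2, -3, 1)
)

# Hypothetical binary lexicon (Bing-style layout: word, sentiment)
toy_binary <- tibble(
  word      = c("love", "hate", "good", "terrible", "nice"),
  sentiment = c("positive", "negative", "positive", "negative", "positive")
)

# Numeric approach: sum the scores in each 80-line section
numeric_style <- toy_tokens %>%
  inner_join(toy_numeric, by = "word") %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "numeric")

# Binary approach: count the labels, then take positive minus negative
binary_style <- toy_tokens %>%
  inner_join(toy_binary, by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative, method = "binary") %>%
  select(index, sentiment, method)

bind_rows(numeric_style, binary_style)
```

Because the numeric approach weights strong words more heavily, the two methods can disagree on both the size and the sign of a section's score even when they match exactly the same words.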

The counts of positive and negative words in the NRC and Bing lexicons:

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Most Common Positive and Negative Words

The most common positive and negative words are found by counting each word together with its sentiment. The counts are then visualized with a bar chart and, further below, with word clouds.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ℹ 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ℹ 1,140 more rows
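
The custom stop word list is built above, but the later chunks still use the default stop_words. One way it could be applied is with anti_join(), sketched here on made-up data. The names toy_stop_words, toy_custom, and toy_tokens are hypothetical, and the assumption that "miss" should be excluded because it usually works as a title rather than as a negative word is mine:

```r
library(dplyr)

# Hypothetical, very short stand-in for tidytext's stop_words
toy_stop_words <- tibble(word = c("the", "a"), lexicon = "toy")

# Same construction as custom_stop_words above: prepend the custom entry
toy_custom <- bind_rows(tibble(word = "miss", lexicon = "custom"),
                        toy_stop_words)

# Hypothetical tokens, one word per row
toy_tokens <- tibble(word = c("miss", "bates", "the", "miss", "happy"))

# anti_join() keeps only the rows whose word is NOT in the stop list
kept <- toy_tokens %>%
  anti_join(toy_custom, by = "word")

kept
```

Running the same anti_join() with custom_stop_words before the sentiment join would keep "miss" from counting towards the negative totals.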

Word Cloud

#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Looking at Units beyond Just Words

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")


p_and_p_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343
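
The chunk above finds, for each book, the chapter with the highest share of negative words. The same steps on a tiny made-up example with a single book (toy_tokens and toy_negative are hypothetical):

```r
library(dplyr)

# Hypothetical tokens for one book; chapter 0 is the front matter
toy_tokens <- tibble(
  chapter = c(0, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  word    = c("title", "sad", "happy", "bad", "day",
              "poor", "walk", "home", "very", "late")
)

# Hypothetical list of negative words
toy_negative <- tibble(word = c("sad", "bad", "poor"))

# Denominator: total words per chapter
toy_wordcounts <- toy_tokens %>%
  group_by(chapter) %>%
  summarise(words = n())

toy_ratios <- toy_tokens %>%
  # semi_join() keeps the tokens that appear in the negative list, without adding columns
  semi_join(toy_negative, by = "word") %>%
  group_by(chapter) %>%
  summarise(negativewords = n()) %>%
  left_join(toy_wordcounts, by = "chapter") %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0)

# The chapter with the highest proportion of negative words
toy_ratios %>%
  slice_max(ratio, n = 1)
```

Dividing by the chapter's word count matters because a long chapter would otherwise look more negative simply for containing more words.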

VADER Lexicon

As an additional layer, I will include the VADER lexicon. While it is typically used for social media analysis, I thought it would fit Grimm’s Fairy Tales well because its ability to capture nuanced sentiment can reveal the underlying emotional tones in these classic stories. Given the complex characters and moral dilemmas within the tales, VADER’s scoring of both positive and negative sentiment should enhance the overall understanding of the narratives’ emotional landscape.

While trying to run VADER, I encountered several difficulties that slowed down the process significantly. Initially, the full dataset caused long processing times, likely due to its size and complexity, which made it challenging to manage and analyze efficiently. Additionally, I faced warnings regarding the data structure, such as the message indicating that the number of items to replace was not a multiple of the replacement length. This suggested there were mismatches in the expected data format, leading to further complications.

Given these challenges, I decided to simplify my approach and run the VADER sentiment analysis on only a sample of the data. By focusing on a smaller subset, I could more effectively troubleshoot any issues and achieve quicker results, ultimately allowing for a clearer understanding of the sentiment dynamics present in Grimm’s Fairy Tales without the overwhelming burden of processing the entire dataset at once.

# The code below is left commented out because it is slow to run
# Load the vader package, which provides vader_df()
#library(vader)

# Re-import the data so it can be scored line by line instead of word by word
#grimm_text <- readLines("C:/Users/tiffh/Downloads/Assignment#10/grimm.txt")
#grimm_df <- data.frame(text = grimm_text, stringsAsFactors = FALSE)
#print(head(grimm_df))

# Subset of the first 100 lines for faster processing
#sample_grimm_df <- grimm_df[1:100, ]

# Sentiment analysis on the sample
#vader_results_sample <- vader_df(sample_grimm_df)

#print(head(vader_results_sample))
#colnames(vader_results_sample)

# Summary statistics on the sample
#summary_statistics <- vader_results_sample %>%
#  summarise(
#    mean_positive = mean(pos, na.rm = TRUE),
#    mean_negative = mean(neg, na.rm = TRUE),
#    mean_neutral = mean(neu, na.rm = TRUE),
#    mean_compound = mean(compound, na.rm = TRUE)
#  )

#print(summary_statistics)

# Count the lines with a non-zero score of each type
#sentiment_counts <- vader_results_sample %>%
#  summarise(
#    Positive = sum(pos > 0, na.rm = TRUE),
#    Negative = sum(neg > 0, na.rm = TRUE),
#    Neutral = sum(neu > 0, na.rm = TRUE)
#  ) %>%
#  pivot_longer(cols = everything(), names_to = "Sentiment", values_to = "Count")

# Plot the counts
#ggplot(sentiment_counts, aes(x = Sentiment, y = Count, fill = Sentiment)) +
#  geom_bar(stat = "identity") +
#  labs(title = "Counts of Positive, Negative, and Neutral Sentiments",
#       x = "Sentiment Type",
#       y = "Count") +
#  theme_minimal()

Examining the sentiment scores for the sample uncovers some intriguing insights. The average positive score is around 0.066, indicating a modest level of positivity in the narratives. The mean negative score is slightly lower at 0.042, suggesting that negativity is less prevalent in these tales, which is surprising to me.

What stands out is the high neutral score of 0.892. This reflects a tendency for the language in the stories to be descriptive and straightforward, rather than heavily emotional. The overall compound score, which averages 0.054, supports this observation, indicating that while there is a hint of positivity, the overall sentiment leans more towards neutrality.

Overall, these findings provide a snapshot of the sentiment present in the sample analyzed, but they may not be representative of every story within Grimm’s Fairy Tales due to the limited nature of the sample.
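
For context on the compound score discussed above: as I understand the reference VADER implementation, the compound value is the sum of the word-level valence scores in a text, squashed into the range -1 to 1 by dividing by the square root of the squared sum plus a constant alpha = 15. The sketch below reproduces only that normalization step in base R. The function name normalize_compound and the raw sums are hypothetical, and I am assuming that the R vader package follows the same formula:

```r
# VADER-style normalization of a raw valence sum into (-1, 1).
# alpha = 15 is the constant in the reference implementation; that the
# R port uses the same value is an assumption.
normalize_compound <- function(raw_sum, alpha = 15) {
  raw_sum / sqrt(raw_sum^2 + alpha)
}

# Hypothetical raw valence sums for four lines of text
raw_sums <- c(-4, 0, 0.5, 4)

compound <- normalize_compound(raw_sums)
round(compound, 3)

# Averaging over lines, as summary_statistics does for the sample
mean(compound)
```

A line with no scored words has a raw sum of 0 and therefore a compound score of 0, so a sample dominated by neutral lines will have a mean compound score close to zero, which is consistent with the high neutral average reported above.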

Conclusion

This is my first time using lexicons, and it has been a highly insightful experience for word analysis. Exploring the various sentiment analysis packages available has opened my eyes to the diverse methods and tools at my disposal for understanding language in depth. The ability to quantify and visualize sentiments associated with specific words has enhanced my appreciation for the subtleties of text and provided valuable insights that can be applied to future analyses. I look forward to further exploring these resources and expanding my skills in text mining.

References

Silge, Julia, and David Robinson. “Sentiment Analysis.” Tidy Text Mining. Last modified August 21, 2023. https://www.tidytextmining.com/sentiment#most-positive-negative.

GeeksforGeeks. 2024. “Python Sentiment Analysis Using VADER.” Accessed November 2, 2024. https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/.

“VADER: Valence Aware Dictionary and sEntiment Reasoner.” 2024. Accessed November 2, 2024. https://cran.r-project.org/web/packages/vader/vader.pdf.