Sentiment Analysis



Getting Started

Problem Description and Approach

We’re going to recreate and analyze the primary code from Chapter 2, Sentiment analysis with tidy data, of Julia Silge and David Robinson’s Text Mining with R (last built 2022-02-07). We’ll then extend that code to a book or selection of books from Project Gutenberg and incorporate an additional sentiment lexicon.



Re-create base analysis


The following code is from Julia Silge and David Robinson’s (2017) Text Mining with R, Chapter 2: Sentiment analysis with tidy data


Load packages

We are coding in the Tidyverse with additional packages:
- tidytext | for text mining
- janeaustenr | to load a corpus of Jane Austen’s books
- wordcloud | for wordclouds
- reshape2 | to reshape word counts into a matrix for comparison wordclouds

# Load packages --------------------------------------
#install.packages('textdata')
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(wordcloud)
library(reshape2)


Pull sentiment lexicons

We’re pulling the following sentiment lexicons.

AFINN-111 This dataset was published in Finn Årup Nielsen (2011), “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages (2011) 93-98.

Bing This dataset was first published in Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.

NRC This dataset was published in Saif M. Mohammad and Peter Turney (2013), “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.

# Pull sentiment lexicons
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,865 more rows


Convert text to tidy format

We’re converting to tidy format (one word per row) using unnest_tokens() and adding columns for the chapter and line number.

# Convert to tidy format with chapter and line numbers
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)


Apply sentiment lexicons

Here we apply two examples of sentiment analysis.

Code Summary:
- Show ‘Joy’ words by frequency using nrc
- Graph the plot trajectory of each novel using bing

# Show 'Joy' words by frequency using nrc

# Filter the joy words from the nrc sentiment lexicon
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

# Tabulate the joy words from the book 'Emma' by frequency
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # … with 291 more rows
# Graph the plot trajectory of each novel using bing

# Create chunks of 80 lines for sentiment analysis with bing
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
# Graph the bins by how positive or negative they are
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")


Compare sentiment lexicons

Here we look into how the three sentiment lexicons perform differently when examining the sentiment changes over the narrative arc of the book Pride and Prejudice.

Code Summary:
- Isolate the book, Pride & Prejudice
- Find the net sentiment for each lexicon across the same chunks
- Visualize the net sentiment for each lexicon across the same chunks

# Isolate the book, Pride & Prejudice
pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
# Find the net sentiment for each lexicon across the same chunks

# AFINN requires a separate pattern since it measures sentiment between -5 and 5
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
# Bing and nrc use the same pattern
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
# Visualize the net sentiment for each lexicon across the same chunks
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

The chapter notes that the lexicons track the relative changes in sentiment across the novel similarly, but with different absolute values: NRC is shifted toward higher values, AFINN has more variance, and Bing et al. finds longer stretches of similarly coded text.

The chapter suggests this is because, while both NRC and Bing have more negative than positive words, Bing has a higher ratio of negative to positive words than NRC does.

Code Summary:
- Show negative to positive words in NRC
- Show negative to positive words in Bing

# Show negative to positive words in NRC 
get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3318
## 2 positive   2308
# Show negative to positive words in Bing
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005


Custom stop words

Here we track which words contributed to which sentiment, positive or negative in this case, to identify words that are confounding the results for this particular text.

Code Summary:
- Tabulate words by frequency that contribute to either sentiment
- Graph the counts for examination
- Create a custom stop-words list to address the confounding word “miss”

# Tabulate words by frequency that contribute to either sentiment
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
# Graph the counts for examination
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

# Create a custom stop-words list to address the confounding word "miss"
custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

# For demonstration only
custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
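
As a quick illustration (my own addition, not part of the chapter's flow), the custom list can be applied with anti_join() before recounting the sentiment contributions:

# Sketch: apply the custom stop words, then recount sentiment contributions
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)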


Wordclouds

Here we create two word clouds, one by frequency and one by contribution to net sentiment.

Code Summary:
- Wordcloud of most frequent words
- Wordcloud of most frequent words split by positive or negative sentiment

# Wordcloud of most frequent words
set.seed(2248)
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

# Wordcloud of most frequent words split by positive or negative sentiment
set.seed(2341)
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"


Other units of text

Code Summary:
- Example sentence after tokenizing text into sentences
- Split the series of books by chapter using a regex pattern
- Show the chapter from each book with the most negative net sentiment

# Tokenize text into sentences
p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

# Show example sentence
p_and_p_sentences$sentence[50]
## [1] "_may_ fall in love with one of them, and therefore you must visit him as"

The chapter suggests trying iconv(text, to = 'latin1') in a mutate statement before unnesting if the sentence tokenizing has trouble with UTF-8 encoded text, “especially with sections of dialogue”; the tokenizer does better with ASCII punctuation.
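
For illustration only, a minimal sketch of that workaround applied to the prideprejudice text loaded above:

# Sketch: convert encoding to latin1 before tokenizing into sentences
p_and_p_sentences_ascii <- tibble(text = prideprejudice) %>% 
  mutate(text = iconv(text, to = "latin1")) %>% 
  unnest_tokens(sentence, text, token = "sentences")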

# Split the series of books by chapter using a regex pattern
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

# Demonstrate split by chapters
austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
# Show the chapter from each book with the most negative net sentiment

# Identify just negative words from Bing
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

# Count words for each chapter
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups`
## argument.
# Tabulate chapters with the highest ratio of negative words to words
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343



New corpus and lexicon

Here we extend the code above to a new corpus and incorporate an additional lexicon for sentiment analysis.


New corpus

We are going to apply sentiment analysis to The Brothers Karamazov by Fyodor Dostoyevsky using the gutenbergr package. Visit [gutenberg.org](https://www.gutenberg.org/) to learn more about Project Gutenberg.

Fyodor Dostoyevsky (1879), The Brothers Karamazov, translated by Constance Garnett. [Gutenberg book ID 28054](https://www.gutenberg.org/ebooks/28054).

Code Summary:
- Load packages
- Download book
- Truncate book (not used in final knit)

# Load packages --------------------------------------
#install.packages('gutenbergr')
library(gutenbergr)
# Download book
TheBroKov <- gutenberg_download(28054)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# Truncate book (not used in final knit); keep only the text column
TheBroKovSmall <- TheBroKov[1:37250, "text"]


New sentiment lexicon

The syuzhet package contains the afinn, bing, and nrc lexicons used above, in addition to the syuzhet lexicon developed in the Nebraska Literary Lab. It also provides a way to call Stanford’s coreNLP sentiment parser, which could be a future extension.

Jockers ML (2015). Syuzhet: Extract Sentiment and Plot Arcs from Text. [https://github.com/mjockers/syuzhet](https://github.com/mjockers/syuzhet).

# Load packages --------------------------------------
#install.packages('syuzhet')
library(syuzhet)
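
As a quick smoke test of the new lexicon (my own toy sentences, not from the book), get_sentiment() can score a character vector with the syuzhet dictionary:

# Sketch: score two toy sentences with the syuzhet lexicon
get_sentiment(c("What a glorious, happy morning.",
                "The grief and dread were unbearable."),
              method = "syuzhet")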



New analysis

This section applies a hodgepodge of the functionality above to the new corpus.


Optional custom stop-words

‘Chapter’ appeared inordinately often in the first 1000 lines of text, so it has been added to the stop-words list. We’re leaving the chapter numerals, e.g. ‘iv’, ‘iii’, ‘vi’, in place.

# Add custom stop-words like "chapter"
custom_stop_words <- bind_rows(tibble(word = c("chapter"),  
                                      lexicon = c("custom")), 
                               stop_words)


Process text

Here we turn the text into tidy format with one word one row.

# Remove stop words
# Toggle TheBroKovSmall with TheBroKov for the final run
Tidy_tbk <- TheBroKovSmall %>% 
  unnest_tokens(word, text) %>% 
  anti_join(custom_stop_words)
## Joining, by = "word"


Wordclouds

Here we create two word clouds, one by frequency and one by nrc sentiment.

Code Summary:
- Wordcloud of most frequent words
- Wordcloud of most frequent words split by nrc sentiment

# Wordcloud of most frequent words
set.seed(4647)
Tidy_tbk %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 360, scale = c(3.5, 1)))

# Wordcloud of most frequent words split by nrc sentiment
set.seed(4749)
Tidy_tbk %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "orange", "yellow", "green", "blue", "darkblue", "purple", "black", "gray", "brown"),
                   max.words = 360)
## Joining, by = "word"


Bing word count

Here we visualize the most frequent words that contributed to net sentiment using the Bing lexicon.

Code Summary:
- Tabulate words by frequency that contribute to either sentiment
- Graph the counts for examination

# Tabulate words by frequency that contribute to either sentiment
bing_word_counts <- Tidy_tbk %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
# Graph the counts for examination
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)


Counts by nrc

Here we show the words contributing most to the joy sentiment with nrc. As a flex we could track this for the other seven nrc emotion sentiments and make a chart; a rough sketch of that idea follows the code below.

# Show 'Joy' words by frequency using nrc
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

# Tabulate the joy words from The Brothers Karamazov by frequency
Tidy_tbk %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 419 × 2
##    word         n
##    <chr>    <int>
##  1 love       467
##  2 money      429
##  3 god        373
##  4 mother     169
##  5 true       153
##  6 feeling    147
##  7 found      127
##  8 child      122
##  9 church     122
## 10 laughing   115
## # … with 409 more rows
# The eight nrc emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust
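
A rough sketch of that extension (my own, not from the chapter): count the top contributing words for every nrc sentiment and facet the result with tidytext’s reorder_within() helper.

# Sketch: top contributing words for every nrc sentiment, faceted
Tidy_tbk %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, sentiment)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)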

Syuzhet attempt

I could not get get_sentences() or get_sentiment() from the syuzhet package working on the tidy one-word-per-row data and will need to improve this; get_sentiment() expects a character vector rather than a tibble, so the call below only scores individual words.

# Score individual words with the syuzhet lexicon (per-word, not the intended sentence-level analysis)
stbk <- get_sentiment(Tidy_tbk$word, method = "syuzhet")
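
A likely path forward, sketched under the assumption that syuzhet works best on raw sentences rather than tidy tokens: collapse the downloaded text, split it with get_sentences(), score it with get_sentiment(), and plot the raw trajectory.

# Sketch: sentence-level syuzhet workflow on the untokenized text
tbk_sentences <- get_sentences(paste(TheBroKov$text, collapse = " "))
tbk_sentiment <- get_sentiment(tbk_sentences, method = "syuzhet")
plot(tbk_sentiment, type = "l", xlab = "Narrative time", ylab = "Sentiment")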



Conclusion

It’s tough to extend code to new problems without breaking down the original code into individual components. I’d like to spend more time with text mining and with breaking the code apart into first principles.

My favorite lexicons are bing for simplicity, and nrc for the emotional sentiments.

I’d like to go back and create line-number chunks from the new text for the plot-trajectory analysis, and truly duplicate the original analysis with the new corpus.
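
For reference, a minimal sketch of that idea (my own extension, reusing the 80-line chunks and the bing lexicon from the chapter; it assumes the TheBroKov download above):

# Sketch: sentiment trajectory for The Brothers Karamazov in 80-line chunks
TheBroKov %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE)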

I would also like to spend more time to work through the entire syuzhet vignette to explore its capabilities and introduce myself to the Stanford coreNLP functions.

Additionally, when I first tried to run the whole book it failed, but I don’t think the code was the issue. Using the truncated version I was successful with 10,000 lines, and then with the full 37,250 lines passed through that same truncated path.