The task

I’m going to take some code that gives a basic example of text mining with Jane Austen’s novels and extend it to a new corpus and a new lexicon; the new lexicon for sentiment analysis is brought in from another file.
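As a rough preview of what that extension could look like, here is a minimal sketch, assuming the gutenbergr package for fetching a new corpus and a plain CSV file for the new lexicon. The package choice, the Project Gutenberg ID, and the file name are placeholders of mine, not part of the example code.

library(gutenbergr)
library(readr)

# Placeholder corpus: any Project Gutenberg ID would do here
new_corpus <- gutenberg_download(768)

# Placeholder lexicon file, assumed to have columns word and sentiment
new_lexicon <- read_csv("my_lexicon.csv")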

The example code

We start by loading libraries and downloading the base sentiment analysis datasets.

library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(reshape2)

afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
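For reference, the three lexicons have slightly different shapes, which matters later: AFINN attaches a numeric score to each word, while Bing and NRC attach a categorical label. A quick look (my own addition, output omitted):

head(afinn, 3)  # columns: word, value (numeric score from -5 to 5)
head(bing, 3)   # columns: word, sentiment ("positive" or "negative")
head(nrc, 3)    # columns: word, sentiment (eight emotions plus positive/negative)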

We tidy a dataset containing Jane Austen’s novels: grouping by book, we add line numbers and a running chapter count, then tokenize the text into one word per row.

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Then we collect the words from the nrc sentiment set with connotations of joy.

nrc_joy <- nrc %>%
  filter(sentiment == "joy")

We use an inner join to see which words from Emma are joyful, according to the nrc set, and how often they occur.

emma_joy <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"

Now, we can use an inner join with the bing set to consider the sentiment changes from beginning to end. Floor division breaks up the text into chunks of 80 lines.

jane_austen_sentiment <- tidy_books %>%
  inner_join(bing) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
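In case the %/% operator is unfamiliar, this tiny check (my own illustration) shows how consecutive line numbers map onto 80-line chunk indices:

c(0, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2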

Then we plot the sentiment scores.

ggplot(jane_austen_sentiment,
       aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Let’s work with Pride and Prejudice.

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

Because the afinn sentiment set scores words on a numeric scale from -5 to 5, we have to compute its net sentiment per chunk differently from the other two sets, which categorize words in a binary positive/negative manner.

pp_afinn <- pride_prejudice %>%
  inner_join(afinn) %>%
  group_by(index = linenumber %/% 80) %>%
  summarize(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

pp_bing_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(bing) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(nrc %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)

Here is the estimated net sentiment for each chunk of Pride and Prejudice under each of the three lexicons.

bind_rows(pp_afinn,
          pp_bing_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

As described in the original tutorial and visible in the charts, NRC skews more positive, AFINN has more variance, and Bing et al. finds longer stretches of consistently signed text. Reportedly, this happens with other texts as well.

Why? Counting the positive and negative words in each lexicon gives a hint: both lexicons contain more negative than positive words, but the ratio of negative to positive words is higher in Bing et al. than in NRC, which helps explain why NRC yields higher sentiment values.

nrc %>%
  filter(sentiment %in% c("positive",
                          "negative")) %>%
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
count(bing, sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
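To make that comparison concrete, here is a quick ratio check, my own addition rather than part of the original example:

bind_rows(
  nrc %>% filter(sentiment %in% c("positive", "negative")) %>% mutate(lexicon = "NRC"),
  bing %>% mutate(lexicon = "Bing et al.")) %>%
  count(lexicon, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(negative_per_positive = negative / positive)
# From the counts above: roughly 1.4 negative words per positive word in NRC,
# versus roughly 2.4 in Bing et al.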
(bing_word_counts <- tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup())
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # … with 2,575 more rows

These are the words that contribute the most to positive and negative sentiment in the text:

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

It helps to define some custom stop words to avoid misusing certain vocabulary; for instance, "miss" is counted as negative by the bing set, but in Austen it is mostly used as a title for young women.

(custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                       lexicon = c("custom")),
                                stop_words))
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
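The custom list is not actually applied in the code above; as a sketch of how it might be used, one could drop those words with an anti_join before the sentiment join:

tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(bing, by = "word") %>%
  count(word, sentiment, sort = TRUE)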

And we can build a word cloud of the most common words, followed by a comparison cloud that separates positive and negative words.

tidy_books %>% 
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

We can also examine units larger than single words, since negation or sentence structure can change the true sentiment of a piece of text; packages such as coreNLP and cleanNLP attempt to handle this. Below we tokenize Pride and Prejudice into sentences, and we also cut the books into chapters so we can find, for each novel, the chapter with the highest proportion of negative words.

pp_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
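The pp_sentences object is not taken further here, but as a sketch of where it could go, one might score each sentence with the bing lexicon. This is still a bag-of-words approach, so negation is not handled, and the names sentence_id and sentence_sentiment are my own:

pp_sentences %>%
  mutate(sentence_id = row_number()) %>%
  unnest_tokens(word, sentence) %>%
  inner_join(bing, by = "word") %>%
  count(sentence_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentence_sentiment = positive - negative)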

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>%
  group_by(book) %>%
  summarize(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- bing %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343