library(tidyverse)
library(janeaustenr)
library(tidytext)
library(gutenbergr)
library(wordcloud)
library(reshape2)
library(lexicon)
Following the textbook Text Mining with R by Julia Silge and David Robinson, we explore sentiment analysis on text. We begin by reproducing code examples from the book and then extend them to a different corpus and additional sentiment lexicons.
As previously mentioned, the code in this section is sourced from Text Mining with R by Julia Silge and David Robinson. We will extend it later.
First we load in Jane Austen’s books through the janeaustenr package and transform them into a tidy format.
original_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup()
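As a quick aside (my own check, not from the textbook), the chapter regex matches headings written with Arabic or Roman numerals, but only at the start of a line:
# Expect TRUE, TRUE, FALSE:
str_detect(c("Chapter 1", "CHAPTER XII", "A chapter on manners"),
           regex("^chapter [\\divxlc]", ignore_case = TRUE))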
Then we tokenize our tidy dataframe and remove stop words, which are not useful for the analysis.
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books <- tidy_books %>%
anti_join(stop_words, by = join_by(word))
Here we begin the sentiment analysis with a simple initial example: counting the words in Emma that are classified as joyful in the NRC sentiment lexicon.
Something interesting to note here is the difference between our results and the textbook example: “young” has been removed from the latest version of the NRC dataset, and “good” was dropped earlier because it appears in the stop word list.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy, by = join_by(word)) %>%
count(word, sort = TRUE)
## # A tibble: 297 × 2
## word n
## <chr> <int>
## 1 friend 166
## 2 hope 143
## 3 happy 125
## 4 love 117
## 5 deal 92
## 6 found 92
## 7 happiness 76
## 8 pretty 68
## 9 true 66
## 10 comfort 65
## # ℹ 287 more rows
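We can verify these two differences directly (a quick sanity check, with results depending on the installed lexicon versions):
"young" %in% nrc_joy$word   # expected FALSE with the current NRC release
"good" %in% stop_words$word # expected TRUE, which is why the anti_join dropped it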
Next we analyze how sentiment fluctuates over the course of each book by computing a net sentiment score for consecutive 80-line sections. This is done with the Bing sentiment lexicon.
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing"), by = join_by(word)) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
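As a brief aside, the index used above comes from integer division (%/%), which assigns every 80 consecutive lines to the same chunk:
# Expect 0, 0, 1, 1, 2:
c(0, 79, 80, 159, 160) %/% 80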
Afterwards, we focus on how each sentiment lexicon quantifies sentiment for a single book, “Pride and Prejudice”.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn"), by = join_by(word)) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing"), by = join_by(word)) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
, by = join_by(word)) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
We can build a dataframe of word and sentiment counts to analyze which words contribute most to each sentiment, and then visualize it with ggplot2.
Another anomaly from removing stop words can be seen here: “well”, “good”, and “like”, which are commonly used for their non-sentiment meanings, no longer appear among the positive contributors.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing"), by = join_by(word)) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
On the other hand, “miss”, the largest contributor to negative sentiment, is most likely used as the formal title rather than in the sense of missing something. Thus, it can be added to a custom_stop_words dataset for removal.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
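The textbook stops at defining the list, but as a rough sketch of how it could be applied (my own addition, not from the book), we might re-run the stop word removal with the custom list before recounting:
# Hypothetical follow-up: redo the word counts with "miss" treated as a stop word
tidy_books %>%
  anti_join(custom_stop_words, by = join_by(word)) %>%
  inner_join(get_sentiments("bing"), by = join_by(word)) %>%
  count(word, sentiment, sort = TRUE)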
Pivoting to a new route of text analysis, we can consider building word clouds from the text we have.
Comparing our wordcloud to the textbook example, we have to reduce the maximum number of words in order to fit the common words on screen. Additionally, the layout of the wordcloud appears to be random each time it is built.
tidy_books %>%
count(word) %>%
with(wordcloud(word, n, max.words = 75))
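The wordcloud layout is randomized, so if a reproducible arrangement is wanted, one option (my suggestion, not from the textbook) is to fix the random seed before plotting:
set.seed(1234)
tidy_books %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 75))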
We can extend this wordcloud by visualizing which words are negative and
those that are positive.
tidy_books %>%
inner_join(get_sentiments("bing"), by = join_by(word)) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
Finally, let’s take a look at which chapter is the most negative in each Jane Austen novel.
Note yet again that, because we removed stop words at the beginning, we get different chapters for half of the books and larger ratios than the textbook example!
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 29 172 1135 0.152
## 2 Pride & Prejudice 34 108 646 0.167
## 3 Mansfield Park 45 132 884 0.149
## 4 Emma 15 147 1012 0.145
## 5 Northanger Abbey 27 55 337 0.163
## 6 Persuasion 21 215 1948 0.110
Here we follow the same general steps as the example code, but with a different text and a new sentiment lexicon.
Let’s use the Project Gutenberg library to build our own corpus to analyze. I’m a particular fan of John Steinbeck’s The Grapes of Wrath, so I searched for it; however, Project Gutenberg doesn’t seem to have it just yet. Instead, we’ll go with Grapes of Wrath by Boyd Cable, a work of fiction about a WWI soldier’s experiences.
Boyd Cable wrote a decent number of other books that fall into the war category as well, but we’ll analyze three of them for better visualization down the line.
boyd_works <- gutenberg_works() %>%
filter(grepl("cable, boyd", author, ignore.case=TRUE))%>%
select(gutenberg_id) %>%
head(3) %>%
pull(gutenberg_id) %>%
gutenberg_download(meta_fields = "title") %>%
select(text, title)
After downloading his works, we need to capture the chapter of each book in the dataframe. Unfortunately, only Grapes of Wrath follows a regex-matchable chapter pattern; for the other two books we match against their chapter title lists individually. Because each chapter heading also appears in the book’s table of contents, the first fourteen matches are subtracted off so that front-matter lines are assigned chapter 0. We also record a line number for each row.
grapes_book <- boyd_works %>%
filter(title == "Grapes of wrath") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE))))
btn_chp <- c(
"THE ADVANCED TRENCHES",
"SHELLS",
"THE MINE",
"ARTILLERY SUPPORT",
"'NOTHING TO REPORT'",
"THE PROMISE OF SPRING",
"THE ADVANCE",
"A CONVERT TO CONSCRIPTION",
"'BUSINESS AS USUAL'",
"A HYMN OF HATE",
"THE COST",
"A SMOKER'S COMPANION",
"THE JOB OF THE AM. COL.",
"THE SIGNALLER'S DAY"
)
btn_book <- boyd_works %>%
filter(title == "Between the Lines") %>%
mutate(linenumber = row_number(),
chapter = cumsum(grepl(paste(btn_chp, collapse = "|"),text))) %>%
mutate(chapter = ifelse(chapter > 14, chapter-14, 0))
actn_chp <- c(
"IN ENEMY HANDS",
"A BENEVOLENT NEUTRAL",
"DRILL",
"A NIGHT PATROL",
"AS OTHERS SEE",
"THE FEAR OF FEAR",
"ANTI-AIRCRAFT",
"A FRAGMENT",
"AN OPEN TOWN",
"THE SIGNALERS",
"CONSCRIPT COURAGE",
"SMASHING THE COUNTER-ATTACK",
"A GENERAL ACTION",
"AT LAST"
)
actn_book <- boyd_works %>%
filter(title == "Action Front") %>%
mutate(linenumber = row_number(),
chapter = cumsum(grepl(paste(actn_chp, collapse = "|"),text))) %>%
mutate(chapter = ifelse(chapter > 14, chapter-14, 0))
After preprocessing the data, we combine the three dataframes, tokenize the text, and remove stop words.
boyd_token <- rbind(grapes_book, btn_book, actn_book) %>%
unnest_tokens(word, text) %>%
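# str_extract below keeps only letters and apostrophes, dropping the
# underscores Project Gutenberg texts use to mark emphasis (e.g. _word_)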
mutate(word = str_extract(word, "[a-z']+")) %>%
anti_join(stop_words, by = join_by(word)) %>%
select(book = title, chapter, linenumber, word)
Scouring the internet for an additional unigram sentiment lexicon to work with, we end up with the sentiment dictionary used by the syuzhet package. However, since my attempts to access that dataset directly from the package were unsuccessful, we load it through the lexicon package instead.
The syuzhet sentiment dataset classifies the positivity of unigrams on a sliding scale from -1 to 1.
syuzhet <- lexicon::key_sentiment_jockers %>%
select(word, sentiment = value)
head(syuzhet)
## word sentiment
## 1 abandon -0.75
## 2 abandoned -0.50
## 3 abandoner -0.25
## 4 abandonment -0.25
## 5 abandons -1.00
## 6 abducted -1.00
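To confirm the scale, we can inspect the score values directly (a quick check I added, not part of the original example):
range(syuzhet$sentiment)        # expected to span -1 to 1
sort(unique(syuzhet$sentiment)) # and to fall on a small set of discrete steps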
Here we begin with the same initial example: a simple count of the words in Grapes of Wrath that have a positive value in the syuzhet lexicon.
The characteristics of a war story are on full display here, with “forward”, “sir”, “advance”, and other strategic action terms showing up.
boyd_token %>%
filter(book == "Grapes of wrath") %>%
inner_join(syuzhet %>% filter(sentiment > 0), by = join_by(word)) %>%
count(word, sort = TRUE)
## # A tibble: 550 × 2
## word n
## <chr> <int>
## 1 forward 57
## 2 chance 37
## 3 found 29
## 4 sir 27
## 5 ready 21
## 6 straight 21
## 7 action 18
## 8 advance 17
## 9 flying 17
## 10 shelter 17
## # ℹ 540 more rows
For comparison with the syuzhet scores, here is the Bing lexicon, which simply labels each word as positive or negative:
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
Next we analyze how sentiment fluctuates over the course of each book, this time scored with the syuzhet lexicon. Because syuzhet provides numeric scores rather than positive/negative labels, we sum the values within each 80-line section directly instead of counting words. Perhaps not so surprisingly for war fiction, the sentiment throughout these books is overwhelmingly negative. However, we do see glimpses of hope near the start, middle, and end, since a purely negative novel does not attract many readers.
boyd_sentiment <- boyd_token %>%
inner_join(syuzhet, by = join_by(word)) %>%
group_by(book, index = linenumber %/% 80) %>%
summarise(sentiment = sum(sentiment), .groups = 'drop')
ggplot(boyd_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Afterwards, we focus on how each sentiment lexicon quantifies sentiment for a single book, “Grapes of Wrath”. With this visualization we can see that the syuzhet lexicon is comparable to NRC in the amount of variation in sentiment it produces. This is because each syuzhet value falls on one of eight discrete steps between -1 and 1.
grapes_wrath <- boyd_token %>%
filter(book == "Grapes of wrath")
syuzhet_join <- grapes_wrath %>%
inner_join(syuzhet, by = join_by(word)) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(sentiment), .groups = 'drop') %>%
mutate(method = "Syuzhet")
afinn <- grapes_wrath %>%
inner_join(get_sentiments("afinn"), by = join_by(word)) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
bing_and_nrc <- bind_rows(
grapes_wrath %>%
inner_join(get_sentiments("bing"), by = join_by(word)) %>%
mutate(method = "Bing et al."),
grapes_wrath %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
, by = join_by(word)) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc, syuzhet_join) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
As before, we can build a dataframe of words and sentiment values to analyze which words contribute most to each sentiment, and then visualize it with ggplot2.
Here we can observe a quirk of applying the syuzhet lexicon to this corpus. Mentioning a gun, rifle, or attack might carry negative sentiment in many other stories; in a war story, however, weapons tend to appear in fairly neutral statements, and attacks can even be positive.
syuzhet_sentiment_counts <- boyd_token %>%
inner_join(syuzhet, by = join_by(word)) %>%
group_by(word) %>%
summarise(sentiment = sum(sentiment), .groups = 'drop') %>%
mutate(type = ifelse(sentiment>0, "positive", "negative"))
syuzhet_sentiment_counts %>%
group_by(type) %>%
slice_max(abs(sentiment), n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, abs(sentiment))) %>%
ggplot(aes(sentiment, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~type, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
We could remove words such as “fire” and “gun” by adding them to our custom stop words.
custom_stop_words <- bind_rows(tibble(word = c("fire","gun"),
lexicon = c("custom")),
stop_words)
Pivoting to a new route of text analysis, we can consider building word clouds from the text we have.
As we saw, “trench” is the highest contributor to negative sentiment, and since it is referred to so often, it might be wise to add it to the stop list as well. However, I have reservations about this: despite being a common setting for war novels, being in the trenches is definitely a negative thing.
boyd_token %>%
count(word) %>%
with(wordcloud(word, n, max.words = 75))
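Purely as a sketch of the effect (my own variant, combining the custom_stop_words defined above with a hypothetical “trench” entry), the cloud could be rebuilt with those words filtered out:
# Hypothetical variant: drop the custom stop words plus "trench" before counting
boyd_token %>%
  anti_join(bind_rows(custom_stop_words,
                      tibble(word = "trench", lexicon = "custom")),
            by = join_by(word)) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 75))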
Returning to the original word cloud, we can extend it by visualizing which words are negative and which are positive.
boyd_token %>%
inner_join(syuzhet, by = join_by(word)) %>%
group_by(word) %>%
summarise(sentiment = sum(sentiment), .groups = 'drop') %>%
mutate(type = ifelse(sentiment>0, "positive", "negative")) %>%
acast(word ~ type, value.var = "sentiment", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
Finally, let’s take a look at which chapter is the most negative in each Boyd Cable novel.
“Between the Lines” has its most negative chapter near the start of the novel, while “Action Front” and “Grapes of Wrath” both become more negative towards the end.
syuzhet_negative <- syuzhet %>%
filter(sentiment < 0)
wordcounts <- boyd_token %>%
group_by(book, chapter) %>%
summarize(words = n())
boyd_token %>%
semi_join(syuzhet_negative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 3 × 5
## book chapter negativewords words ratio
## <chr> <dbl> <int> <int> <dbl>
## 1 Action Front 11 303 1214 0.250
## 2 Between the Lines 1 224 971 0.231
## 3 Grapes of wrath 14 173 714 0.242
Sentiment mining is a good way to quantify aspects of text data that might otherwise resist measurement. However, the naive version of sentiment mining we have walked through in this assignment has many weak points. For example, negation words such as “not” are not taken into account. We also saw with the Boyd Cable novels that the existing sentiment dictionaries might not be well suited to charting the overall sentiment of a war novel, since many war-related terms are scored as negative. The corpus being analyzed must be considered when choosing a sentiment dictionary for sentiment mining.
If I were to extend this assignment further, I would like to try bigram analysis to capture negation words and compare the results to this unigram-based analysis.
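A minimal sketch of what that could look like, loosely following the bigram approach from the same textbook (not run as part of this analysis):
# Sketch: find words preceded by "not" and look up their AFINN scores
boyd_works %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  count(word2, value, sort = TRUE)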