Re-create the base analysis

The assignment is to re-create the R code from Chapter 2 of the textbook:

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/sentiment.html.

Here is a re-creation of that code.

The sentiments dataset

Import three sentiment lexicons. The first is the “afinn” lexicon.

library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows

The second is the “bing” lexicon.

get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows

The third is the “nrc” lexicon, published by Saif M. Mohammad and Peter Turney (2013), “Crowdsourcing a Word-Emotion Association Lexicon,” Computational Intelligence, 29(3): 436-465.

get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Inner Joins for Sentiment Analysis

Create a tidy dataset of the works of Jane Austen, indexed by line number and chapter, with one word per row.

library(janeaustenr)
library(dplyr)
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Create a dataset containing only the “joy” words from the “nrc” lexicon, then use it to count how often those “joy” words appear in Emma.

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

Compare the sentiment across all of the works of Jane Austen by joining to the “bing” lexicon, chopping the works into 80-line indexed sections, counting the positive and negative sentiments within each section, and taking the difference of the sentiments to derive a net sentiment for each section.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

head(jane_austen_sentiment)
## # A tibble: 6 x 5
##   book                index negative positive sentiment
##   <fct>               <dbl>    <dbl>    <dbl>     <dbl>
## 1 Sense & Sensibility     0       16       32        16
## 2 Sense & Sensibility     1       19       53        34
## 3 Sense & Sensibility     2       12       31        19
## 4 Sense & Sensibility     3       15       31        16
## 5 Sense & Sensibility     4       16       34        18
## 6 Sense & Sensibility     5       16       51        35

Plot the sentiments for each work.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Compare how the three different lexicons perform against the work Pride and Prejudice.

# filter Jane Austen's works for Pride & Prejudice
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

# join to the afinn lexicon
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

head(afinn)
## # A tibble: 6 x 3
##   index sentiment method
##   <dbl>     <dbl> <chr> 
## 1     0        29 AFINN 
## 2     1         0 AFINN 
## 3     2        20 AFINN 
## 4     3        30 AFINN 
## 5     4        62 AFINN 
## 6     5        66 AFINN

# join to the bing and nrc lexicons
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
      filter(sentiment %in% c(
        "positive",
        "negative"
      ))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

head(bing_and_nrc)
## # A tibble: 6 x 5
##   method      index negative positive sentiment
##   <chr>       <dbl>    <dbl>    <dbl>     <dbl>
## 1 Bing et al.     0        7       21        14
## 2 Bing et al.     1       20       19        -1
## 3 Bing et al.     2       16       20         4
## 4 Bing et al.     3       19       31        12
## 5 Bing et al.     4       23       47        24
## 6 Bing et al.     5       15       49        34

The Three Sentiment Lexicons

Visualize how the net sentiment trend differs depending on which lexicon is used.

bind_rows(
  afinn,
  bing_and_nrc
) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Compare the overall sentiment levels in two of the lexicons (“nrc” and “bing”).

nrc_sentiment <- get_sentiments("nrc") %>%
  filter(sentiment %in% c(
    "positive",
    "negative"
  )) %>%
  count(sentiment) %>%
  mutate(lexicon = "nrc")

bing_sentiment <- get_sentiments("bing") %>%
  count(sentiment) %>%
  mutate(lexicon = "bing")

lexicon_compare <- bind_rows(nrc_sentiment, bing_sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(pct_neg = negative/(negative + positive),
         pct_pos = positive/(negative + positive))
lexicon_compare
## # A tibble: 2 x 5
##   lexicon negative positive pct_neg pct_pos
##   <chr>      <dbl>    <dbl>   <dbl>   <dbl>
## 1 bing        4781     2005   0.705   0.295
## 2 nrc         3324     2312   0.590   0.410
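
The afinn lexicon is left out of this comparison because it scores words numerically rather than labeling them positive or negative. Below is a minimal sketch of how it could be folded in, assuming each score is classed by its sign (afinn_sentiment is a hypothetical object, not part of the original analysis).

# Sketch: class afinn's numeric scores by sign so it can be compared
# with the nrc and bing counts above (non-negative scores counted as positive)
afinn_sentiment <- get_sentiments("afinn") %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(sentiment) %>%
  mutate(lexicon = "afinn")

bind_rows(nrc_sentiment, bing_sentiment, afinn_sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(pct_neg = negative / (negative + positive),
         pct_pos = positive / (negative + positive))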

Positive and Negative Skew in the Lexicons

Identify the words that contribute most to each sentiment by joining the Austen text to the “bing” lexicon and counting each word and sentiment pair. Graph the top 10 words for each sentiment.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

The word “miss” shows up as a top negative word in the bing counts, but in Austen’s novels it is mostly a title for young women, so add it to a custom stop-word list rather than letting it skew the sentiment. (A sketch of applying this list follows the output below.)

custom_stop_words <- bind_rows(
  tibble(
    word = c("miss"),
    lexicon = c("custom")
  ),
  stop_words
)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows
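
The custom stop-word list is not actually applied at this point in the chapter; here is a minimal sketch of putting it to use, assuming we simply want “miss” excluded before counting bing sentiment words.

# Sketch: remove the custom stop words (including "miss") before
# counting word-sentiment pairs, so "miss" no longer dominates the negatives
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)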

Using Wordclouds

Use the wordcloud package to plot the most common words in the works of Jane Austen, after removing stop words.

library(wordcloud)
library(tidytext)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Create a sentiment analysis wordcloud using comparison.cloud(). Negative words appear in the darker shade of gray, positive words in the lighter shade.

library(reshape2)
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("gray20", "gray80"),
    max.words = 100
  )

Looking Beyond Words

Tokenize at the sentence level instead of the word level.

PandP_sentences <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(sentence, text, token = "sentences")

Tokenize at the chapter level.

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text,
    token = "regex",
    pattern = "Chapter|CHAPTER [\\dIVXLC]"
  ) %>%
  ungroup()

austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

Find the number and ratio of negative words in the most negative chapter of each book.

bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Using a different corpus and lexicon

Project Gutenberg offers readers (and researchers) the full text of over 60,000 eBooks for free. For the purposes of this assignment, I chose a sample of the works of Charles Dickens.

new corpus

Download the sample of books by Charles Dickens from the Project Gutenberg website.

library(gutenbergr)
dickens786 <- gutenberg_download(c(786)) %>%
    mutate(book = "Hard Times")
dickens1400 <- gutenberg_download(c(1400)) %>%
    mutate(book = "Great Expectations")
dickens730 <- gutenberg_download(c(730)) %>%
    mutate(book = "Oliver Twist")
dickens766 <- gutenberg_download(c(766)) %>%
    mutate(book = "David Copperfield")
dickens1023 <- gutenberg_download(c(1023)) %>%
    mutate(book = "Bleak House")
dickens564 <- gutenberg_download(c(564)) %>%
    mutate(book = "The Mystery of Edwin Drood")

dickens <- bind_rows(dickens786, dickens1400, dickens730, dickens766, dickens1023, dickens564)
dickens <- dickens %>% select(text, book)
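
As an aside, gutenbergr can attach metadata directly through its meta_fields argument. Here is a sketch of an alternative single-call download, assuming the Project Gutenberg titles are acceptable as book labels (they may differ slightly from the labels used above); dickens_alt is a hypothetical name.

# Sketch: download all six works in one call and keep the Gutenberg
# title column as the book label
dickens_alt <- gutenberg_download(
  c(786, 1400, 730, 766, 1023, 564),
  meta_fields = "title"
) %>%
  rename(book = title) %>%
  select(text, book)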

Index the books by chapter and line number and tokenize the words into a tidy dataset.

tidy_dickens <- dickens %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Remove the stop words from the tidy_dickens dataset, then identify the most common words in the works of Dickens by word count.

data(stop_words)

tidy_dickens <- tidy_dickens %>%
  anti_join(stop_words)

tidy_dickens %>%
  count(word, sort = TRUE)
## # A tibble: 25,701 x 2
##    word       n
##    <chr>  <int>
##  1 time    2385
##  2 sir     2238
##  3 dear    2045
##  4 miss    1956
##  5 hand    1710
##  6 head    1626
##  7 night   1442
##  8 house   1394
##  9 day     1363
## 10 looked  1243
## # ... with 25,691 more rows

new lexicon

Identify a new lexicon for sentiment analysis: the “loughran” (Loughran-McDonald) lexicon of words used in financial documents, which is available through tidytext’s get_sentiments().

get_sentiments("loughran")
## # A tibble: 4,150 x 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # ... with 4,140 more rows
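
Before joining it to the Dickens text, it is worth seeing how the lexicon’s words are distributed across its sentiment categories, since that distribution bears on the skew discussed below.

# Count how many loughran words fall into each sentiment category
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)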

Compare the sentiment across all of the works of Dickens by joining to the “loughran” lexicon, chopping the works into 80-line indexed sections, counting the positive and negative sentiments within each section, and taking the difference of the sentiments to derive a net sentiment for each section.

dickens_sentiment <- tidy_dickens %>%
  inner_join(get_sentiments("loughran")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative - constraining - litigious - uncertainty)

head(dickens_sentiment)
## # A tibble: 6 x 9
##   book  index constraining litigious negative positive superfluous uncertainty
##   <chr> <dbl>        <dbl>     <dbl>    <dbl>    <dbl>       <dbl>       <dbl>
## 1 Blea~     0            0         3        1        1           0           0
## 2 Blea~     1            0        10       19        4           0           4
## 3 Blea~     2            0         2        5        0           0           0
## 4 Blea~     3            0        16       16        0           0           1
## 5 Blea~     4            0         8       17        0           0           1
## 6 Blea~     5            2         7        2        2           0           3
## # ... with 1 more variable: sentiment <dbl>

The loughran lexicon has six categories of words: constraining, litigious, negative, positive, superfluous, and uncertainty. All of the categories except “positive” carry a negative connotation, which could skew the overall net sentiment heavily toward the negative.
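
One way to limit that skew, sketched below rather than taken from the textbook, is to treat only the explicitly “positive” and “negative” categories as signed sentiment and set the other four aside; dickens_sentiment_posneg is a hypothetical name used for illustration.

# Sketch: net sentiment from only the positive and negative categories,
# so constraining, litigious, superfluous, and uncertainty words
# do not pull every section downward
dickens_sentiment_posneg <- tidy_dickens %>%
  inner_join(get_sentiments("loughran")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)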

Let’s look further into the details of the sentiment lexicon itself to see what might be happening.

loughran_word_counts <- tidy_dickens %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

loughran_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()

Words such as “obliged” and “committed” appear in the constraining category, yet at the time Dickens wrote they could carry positive meanings; for example, “I am much obliged, madam” or “I have committed all of my love to you” read as positive. Likewise, the words “justice” and “consent” in the litigious category could be used in Dickens’ work in contexts such as “Justice has been done” or “I consent to grant you the hand of my daughter in marriage,” which would have been read at the time as positive phrases.
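
A quick spot check of how the lexicon codes those particular words (assuming the spellings used here) can confirm this.

# Spot check: how does the loughran lexicon classify these words?
get_sentiments("loughran") %>%
  filter(word %in% c("obliged", "committed", "justice", "consent"))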

Plot the sentiments for each work.

ggplot(dickens_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

As the examination of the loughran lexicon suggested, most of the works of Dickens come out with an overall negative net sentiment in this analysis.

Conclusion

Sentiment analysis is not a straightforward process. The choice of sentiment lexicon for a specific analysis is critical to its success or failure: a lexicon that does not fit the context can lead to misleading conclusions. So the moral of this story is: choose your lexicons wisely!

Overall for this assignment, I found the NRC lexicon to be the most useful, since of the lexicons compared it was the most balanced between positive and negative words.