Overview

The task at hand was to reproduce the primary example code from Chapter 2 on sentiment analysis in Text Mining with R1. We then had to extend that code by either working with a different corpus or with another sentiment lexicon.

Sentiments dataset

We load the sentiments data sets for the AFINN, Bing, and NRC lexicons. All three are based on single words (unigrams): each word is marked as positive or negative, and the NRC lexicon also assigns emotions such as joy, anger, and sadness, while AFINN assigns a numeric score between -5 and 5.
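
The setup chunk is not shown in the knitted output; a minimal sketch of the packages this report relies on is:

# Assumed setup (not shown in the original output): packages used throughout.
library(tidytext)    # get_sentiments(), unnest_tokens()
library(janeaustenr) # austen_books(), prideprejudice
library(dplyr)
library(stringr)
library(tidyr)       # spread(), separate()
library(ggplot2)
library(wordcloud)   # wordcloud(), comparison.cloud()
library(reshape2)    # acast()
library(gutenbergr)  # gutenberg_download()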

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Sentiment analysis with inner join

Sentiment analysis can be done with an inner join. Starting from the Jane Austen books, we create columns that record the line number and chapter each word belongs to. To find the words that are coded as "joy" in Emma, we first filter the NRC lexicon down to its joy words and filter our data set to that one book; an inner_join then lets us count each joy word. Common words can be removed by performing an anti_join with a stop-word data set. We can also track how sentiment changes from one section of a novel to the next by computing the net sentiment (positive minus negative counts) in chunks of 80 lines, using integer division (linenumber %/% 80) to define the index. This lets us see visually how the emotional trajectory changes throughout each novel.

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Comparing the three sentiment dictionaries

It is also interesting to see how the lexicons compare with one another. Focusing on Pride and Prejudice, we can see how the three lexicons track changes in sentiment throughout the book. AFINN gives the largest absolute values, with the most variance. NRC shifts the values higher, labeling more of the text as positive. Bing gives lower values and finds longer stretches of similarly positive or negative text. Overall, the three lexicons find similar sentiment trends through the novel. Bing produces larger negative values in part because its lexicon contains a higher proportion of negative words than the NRC lexicon, as the counts below show.

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` ungrouping output (override with `.groups` argument)
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
      filter(sentiment %in% c(
        "positive",
        "negative"
      ))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(
  afinn,
  bing_and_nrc
) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

get_sentiments("nrc") %>%
  filter(sentiment %in% c(
    "positive",
    "negative"
  )) %>%
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
get_sentiments("bing") %>%
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Most common positive and negative words

We can also find which words contribute the most to each sentiment. Some words can be removed from the analysis by defining custom stop words. For example, "miss" is coded as a negative word, but in Jane Austen's novels it is mostly used as a title for young, unmarried women. Removing it gives a more accurate sentiment estimate.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()
## Selecting by n

custom_stop_words <- bind_rows(
  tibble(
    word = c("miss"),
    lexicon = c("custom")
  ),
  stop_words
)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows
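
For instance, re-running the Bing word counts after an anti_join with the custom stop words drops "miss" (along with any other stop words) from the contributors; this is a quick sketch, not part of the original output:

tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%   # drop stop words, including "miss"
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)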

Wordclouds

We can also visualize the most common words with the wordcloud package. Furthermore, after reshaping the counts into a matrix with acast from reshape2, we can use comparison.cloud to contrast the most frequent positive and negative words.

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("gray20", "gray80"),
    max.words = 100
  )
## Joining, by = "word"

Looking at units beyond just words

Instead of looking at single words alone, it can be useful to look at n-grams, sentences, or paragraphs, since words can take on a different meaning in context, especially under negation: "bad" is negative on its own, but "not bad at all" is positive. We can also split the text into tokens with a regex pattern, for example to separate the novels into chapters. Furthermore, we can look for the chapter in each novel with the highest proportion of negative words, which can point to where something tragic happens.
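
As a quick illustration of the negation problem (a sketch that goes beyond the chapter's code), we can tokenize Pride and Prejudice into bigrams and count which Bing sentiment words most often follow "not":

# Sketch: bigrams from Pride & Prejudice, keeping pairs that start with "not"
# and joining the second word to the Bing lexicon.
tibble(text = prideprejudice) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word2, sentiment, sort = TRUE)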

PandP_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text,
    token = "regex",
    pattern = "Chapter|CHAPTER [\\dIVXLC]"
  ) %>%
  ungroup()

austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
## Selecting by ratio
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Extension

Similar to the Jane Austen analysis, we can look at books by Charles Dickens using the gutenbergr package. We look at A Tale of Two Cities, Oliver Twist, Our Mutual Friend, David Copperfield, Bleak House, and Little Dorrit. We then tidy the data and separate it into word tokens while removing the stop words. I used custom_stop_words here because Dickens also uses "miss" frequently.

dickens <- gutenberg_download(c(98, 730, 766, 883, 1023, 963))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_dickens <- dickens %>%
  group_by(gutenberg_id) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  anti_join(custom_stop_words) %>%
  mutate(book = case_when(
    gutenberg_id == 98 ~ "A Tale of Two Cities",
    gutenberg_id == 730 ~ "Oliver Twist",
    gutenberg_id == 766 ~ "David Copperfield",
    gutenberg_id == 883 ~ "Our Mutual Friend", 
    gutenberg_id == 1023 ~ "Bleak House",
    gutenberg_id == 963 ~ "Little Dorrit")) %>%
  select(-gutenberg_id) %>%
  select(book, everything())
## Joining, by = "word"
tidy_dickens
## # A tibble: 534,491 x 4
##    book                 linenumber chapter word      
##    <chr>                     <int>   <int> <chr>     
##  1 A Tale of Two Cities          1       0 tale      
##  2 A Tale of Two Cities          1       0 cities    
##  3 A Tale of Two Cities          3       0 story     
##  4 A Tale of Two Cities          3       0 french    
##  5 A Tale of Two Cities          3       0 revolution
##  6 A Tale of Two Cities          5       0 charles   
##  7 A Tale of Two Cities          5       0 dickens   
##  8 A Tale of Two Cities          8       0 contents  
##  9 A Tale of Two Cities         11       0 book      
## 10 A Tale of Two Cities         11       0 recalled  
## # ... with 534,481 more rows

Here are some of the most frequent words in these novels.

tidy_dickens %>%
  count(word, sort = TRUE)
## # A tibble: 27,833 x 2
##    word       n
##    <chr>  <int>
##  1 time    3008
##  2 sir     2861
##  3 dear    2779
##  4 hand    2383
##  5 head    2156
##  6 day     1814
##  7 night   1811
##  8 house   1808
##  9 eyes    1799
## 10 looked  1644
## # ... with 27,823 more rows

Here is a visualization of the sentiment trajectory across the plot of each of these books by Charles Dickens. We can see that there are more negative sections than positive ones.

tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
## Joining, by = "word"

Here are the top negative and positive words that Dickens uses in these six novels.

tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    y = "Contribution to sentiment",
    x = NULL
  ) +
  coord_flip()
## Joining, by = "word"
## Selecting by n

tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = c("gray20", "gray80"),
    max.words = 100
  )
## Joining, by = "word"

Conclusion

At first, I thought AFINN would be the best lexicon to use because it assigns numeric sentiment values, but I enjoyed using the Bing lexicon, since we often don't know the extent to which a word is positive or negative in its context. It was also interesting to see how to search for and pull down novels with the gutenbergr package.

References


  1. Silge, J., & Robinson, D. (2017). Chapter 2: Sentiment Analysis with Tidy Data. In Text Mining with R: A Tidy Approach (pp. 13-30). Sebastopol, CA: O'Reilly.