Sentiment analysis with tidy data

In this assignment I perform sentiment analysis on different corpora. As per the instructions, I first recreate the sentiment analysis example code from Chapter 2 of the book “Text Mining with R” (Julia Silge & David Robinson) using the novels of Jane Austen. Later, I apply a similar sentiment analysis to H. G. Wells’s The Time Machine using a different sentiment lexicon.

Load packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(textdata)
## Warning: package 'textdata' was built under R version 4.2.2
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.2.2
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.2.2
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.2.2
## Loading required package: RColorBrewer
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

Getting sentiment lexicons

Below is the code for retrieving the three sentiment lexicons:

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows

Sentiment Analysis with Jane Austen’s Novels

First, we convert the text into a tidy format with one word per row.

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Now that the text is in a tidy format with one word per row, we are ready for sentiment analysis. Let’s find the most common joy words in Emma by filtering the NRC lexicon for joy words and joining them to the text with inner_join().

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # … with 291 more rows

We see mostly positive, happy words about hope, friendship, and love here. We also see some words that may not be used joyfully by Austen (“found”, “present”).
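
If we wanted to set aside words like “found” and “present” that Austen does not necessarily use joyfully, one simple option (a sketch, not part of the original analysis) is to drop them before counting:

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  filter(!word %in% c("found", "present")) %>%  # drop ambiguous "joy" words
  count(word, sort = TRUE)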

Next, we perform sentiment analysis with the Bing lexicon, calculating a net sentiment (positive - negative) for each 80-line section of text.

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"

Now we can plot these sentiment scores across the plot trajectory of each novel. Notice that we are plotting against the index on the x-axis that keeps track of narrative time in sections of text.

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

We can see from the plot above how the sentiment of each novel changes toward more positive or negative over the trajectory of the story.
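
As a quick numeric check on this claim (a small sketch, not part of the original output), we can average the net sentiment per 80-line section for each novel:

jane_austen_sentiment %>%
  group_by(book) %>%
  summarise(mean_sentiment = mean(sentiment),  # overall positive/negative lean per novel
            sections = n())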

Comparing the three sentiment dictionaries

Let’s use all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice. First, let’s use filter() to choose only the words from the one novel we are interested in.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # … with 122,194 more rows

Calculating the net sentiment with three lexicons:

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"

Comparing the three sentiment lexicons using Pride and Prejudice

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

The three different lexicons for calculating sentiment give results that are different in an absolute sense but have similar relative trajectories through the novel. We see similar dips and peaks in sentiment at about the same places in the novel, but the absolute values are significantly different. The AFINN lexicon gives the largest absolute values, with high positive values. The lexicon from Bing et al. has lower absolute values and seems to label larger blocks of contiguous positive or negative text. The NRC results are shifted higher relative to the other two, labeling the text more positively, but detects similar relative changes in the text. We find similar differences between the methods when looking at other novels; the NRC sentiment is high, the AFINN sentiment has more variance, the Bing et al. sentiment appears to find longer stretches of similar text, but all three agree roughly on the overall trends in the sentiment through a narrative arc.
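
To put rough numbers on these observations (a sketch using the afinn and bing_and_nrc data frames created above), we can compare the mean and spread of the net sentiment for each method:

bind_rows(afinn, bing_and_nrc) %>%
  group_by(method) %>%
  summarise(mean_sentiment = mean(sentiment),  # how positively shifted each lexicon is
            sd_sentiment = sd(sentiment))      # how much the scores vary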

Why is, for example, the result for the NRC lexicon biased so high in sentiment compared to the Bing et al. result? Let’s look briefly at how many positive and negative words are in these lexicons.

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than the NRC lexicon. This will contribute to the effect we see in the plot above, as will any systematic difference in word matches, e.g. if the negative words in the NRC lexicon do not match the words that Jane Austen uses very well. Whatever the source of these differences, we see similar relative trajectories across the narrative arc, with similar changes in slope, but marked differences in absolute sentiment from lexicon to lexicon. This is all important context to keep in mind when choosing a sentiment lexicon for analysis.
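
As a quick check of this explanation (a minimal sketch, not part of the original code), we can compute the negative-to-positive ratio for the two lexicons directly:

bind_rows(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(sentiment) %>%
    mutate(lexicon = "NRC"),
  get_sentiments("bing") %>%
    count(sentiment) %>%
    mutate(lexicon = "Bing")
) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(neg_pos_ratio = negative / positive)  # higher for Bing than for NRC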

Most common positive and negative words

# Finding each word's contribution to each sentiment

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # … with 2,575 more rows
# Graphical representation of each word's contribution towards positive and negative sentiments

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

It is clear that the word “miss” contributes heavily to negative sentiment, but in Austen’s novels “miss” is used as a title for young, unmarried women, so its context is not negative. We can therefore create a custom stop word list to adjust for this anomaly in the sentiment lexicon.

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
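
The custom stop word list above is defined but not applied. A minimal sketch of how it could be used: remove “miss” (along with the standard stop words) before recounting each word’s contribution to sentiment.

tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%       # drops "miss" plus standard stop words
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)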

Wordclouds

# Looking at the most common words as a whole with a wordcloud

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

# Wordcloud of the most frequent words split by positive or negative sentiment

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Looking at units beyond just words

# Example of tokenizing text into sentences
p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"

Splitting the text of Jane Austen’s novels into a data frame of chapters using a regex pattern.

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

Finding the chapter that has the highest proportion of negative words in each book

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

New corpus and lexicon

Previously I used the janeaustenr package to explore tidying text. Now I am going to use the gutenbergr package (Robinson 2016). The gutenbergr package provides access to public domain works from the Project Gutenberg collection, and includes tools both for downloading books and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. Here, I perform sentiment analysis with different lexicons on The Time Machine, a novel by H. G. Wells.

Installing and loading the gutenbergr package

devtools::install_github("ropensci/gutenbergr")
## Skipping install of 'gutenbergr' from a github remote, the SHA1 (9d245385) has not changed since last install.
##   Use `force = TRUE` to force installation
library(gutenbergr)

Sentiment Analysis of H. G. Wells’s ‘The Time Machine’

# Load the book and tidy the text into word tokens

the_time_machine<- gutenberg_download(35)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_time_machine <- the_time_machine %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
# Most common words in the book

tidy_time_machine %>%
  count(word, sort = TRUE)
## # A tibble: 4,172 × 2
##    word          n
##    <chr>     <int>
##  1 time        207
##  2 machine      88
##  3 white        61
##  4 traveller    57
##  5 hand         49
##  6 morlocks     48
##  7 people       46
##  8 weena        46
##  9 found        44
## 10 light        43
## # … with 4,162 more rows
# Wordcloud of most frequent words

tidy_time_machine %>%
  count(word) %>%
  with(wordcloud(word, n, max.words =100))

Use of a new lexicon: “loughran”

The Loughran-McDonald lexicon was developed for financial text; it labels words with sentiments such as “positive”, “negative”, and “uncertainty”, which is why categories beyond positive and negative appear in the output below.

get_sentiments("loughran") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   2355
## 2 positive    354
sentiment_positive <- get_sentiments("loughran") %>% 
  filter(sentiment == "positive")

tidy_time_machine%>%
  inner_join(sentiment_positive) %>%
  count(word, sort=TRUE) %>% 
ungroup()
## Joining, by = "word"
## # A tibble: 69 × 2
##    word           n
##    <chr>      <int>
##  1 beautiful     12
##  2 strong        10
##  3 perfect        9
##  4 strength       9
##  5 abundant       8
##  6 assured        8
##  7 dream          7
##  8 pleasant       7
##  9 incredible     6
## 10 perfectly      5
## # … with 59 more rows
# Wordcloud of most frequent words split by loughran sentiment

tidy_time_machine %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Finding the contribution of words to different sentiments with the loughran lexicon

loughran_word_counts <- tidy_time_machine %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
loughran_word_counts
## # A tibble: 349 × 3
##    word      sentiment       n
##    <chr>     <chr>       <int>
##  1 fear      negative       24
##  2 suddenly  uncertainty    22
##  3 doubt     negative       13
##  4 doubt     uncertainty    13
##  5 sudden    uncertainty    13
##  6 beautiful positive       12
##  7 appeared  uncertainty    11
##  8 possibly  uncertainty    11
##  9 slower    negative       11
## 10 broken    negative       10
## # … with 339 more rows
loughran_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

The section below repeats the analysis with the Bing lexicon, categorizing words as positive or negative.

bing_word_counts <- tidy_time_machine %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 816 × 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 strange  negative     32
##  2 darkness negative     29
##  3 dark     negative     27
##  4 fear     negative     24
##  5 fell     negative     16
##  6 soft     positive     16
##  7 bright   positive     15
##  8 struck   negative     15
##  9 odd      negative     14
## 10 dim      negative     13
## # … with 806 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

Finding joy words using the NRC lexicon

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_time_machine %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 164 × 2
##    word          n
##    <chr>     <int>
##  1 found        44
##  2 sun          29
##  3 green        28
##  4 feeling      20
##  5 beautiful    12
##  6 save         11
##  7 comfort      10
##  8 humanity     10
##  9 perfect       9
## 10 abundant      8
## # … with 154 more rows
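
Finally, as a further sketch (not part of the original output), the net Bing sentiment could be tracked across the narrative of The Time Machine in 80-line sections, mirroring the Austen trajectory plots above:

the_time_machine %>%
  mutate(linenumber = row_number()) %>%               # keep track of position in the text
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE)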

References

  1. Silge, J., & Robinson, D. “Sentiment Analysis with Tidy Data.” In Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/sentiment.html

  2. Wells, H. G. The Time Machine. Project Gutenberg. https://www.gutenberg.org/ebooks/35