This document re-creates and analyzes the primary code from the textbook's sentiment analysis chapter. Citation (APA): Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.1.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons based on unigrams (i.e., single words) are:
1. AFINN - assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
2. nrc - categorizes words in a binary fashion ("yes"/"no") into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
3. bing - categorizes words in a binary fashion into positive and negative categories.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.1.3
austen_books()
## # A tibble: 73,422 x 2
## text book
## * <chr> <fct>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility
## 2 "" Sense & Sensibility
## 3 "by Jane Austen" Sense & Sensibility
## 4 "" Sense & Sensibility
## 5 "(1811)" Sense & Sensibility
## 6 "" Sense & Sensibility
## 7 "" Sense & Sensibility
## 8 "" Sense & Sensibility
## 9 "" Sense & Sensibility
## 10 "CHAPTER 1" Sense & Sensibility
## # ... with 73,412 more rows
library(janeaustenr)
library(dplyr)
library(stringr)
## Warning: package 'stringr' was built under R version 4.1.2
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
We chose the name ‘word’ for the output column from the unnest_tokens() function because the sentiment lexicons and stop-word datasets have columns named word; performing inner joins and anti-joins is thus easier.
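For example, because both tidy_books and the stop_words dataset from tidytext share a column named word, dplyr can detect the join key automatically; a quick illustration (not part of the textbook's sequence at this point):
tidy_books %>%
  anti_join(stop_words)   # no `by` needed; dplyr joins on the shared "word" column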
The text is now in a tidy format with one word per row, so we can perform sentiment analysis.
First, use the NRC lexicon and filter() for the joy words.
Then, filter() the data frame with the text from the books for the words from Emma and use inner_join() to perform the sentiment analysis.
Finally, use count() from dplyr to find the most common joy words in Emma.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
These are mostly positive, happy words.
First, find a sentiment score for each word using the Bing lexicon and inner_join().
Then, count how many positive and negative words there are in defined sections of each book.
Next, define an index that keeps track of which 80-line section of text we are counting negative and positive sentiment in; the %/% operator does integer division (x %/% y is equivalent to floor(x/y)), as illustrated in the short sketch after these steps.
Use pivot_wider() so that negative and positive sentiment are in separate columns.
Lastly, calculate a net sentiment (positive - negative).
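To make the 80-line sections concrete, here is a quick illustration of integer division (a small sketch with arbitrary line numbers, not part of the textbook's code):
c(1, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2
Lines 1-79 fall in section 0, lines 80-159 in section 1, and so on.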
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
Now, plot these sentiment scores across the trajectory of each novel, with the index on the x-axis keeping track of narrative time in sections of text.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Now, compare all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
Since the AFINN lexicon measures sentiment with a numeric score between -5 and 5, we need a different pattern for it than for the other two lexicons.
Use inner_join() to calculate the sentiment in different ways for each lexicon.
For AFINN, group by the same 80-line index (again using integer division with %/%) and sum the per-word scores to get a net sentiment for each section of text.
For Bing and NRC, count positive and negative words per section, use pivot_wider() to put them in separate columns, and calculate the net sentiment (positive - negative), just as before.
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
Now, we have calculated an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon.
The next step is to bind them together and compare them in a single visualization.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
### Analysis:
The three lexicons give results that differ in absolute terms, but they follow similar relative trajectories through the novel, with dips and peaks in roughly the same places.
The AFINN lexicon shows the largest absolute values, with high positive values; its sentiment scores have the most variance.
The Bing lexicon shows lower absolute values and seems to label larger blocks of contiguous positive or negative text.
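One way to make the variance observation concrete is to compare the spread of the per-section scores across methods; a small sketch using the objects created above (not from the textbook's code):
bind_rows(afinn, bing_and_nrc) %>%
  group_by(method) %>%
  summarise(sd_sentiment = sd(sentiment))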
Compared to the other two, the NRC results are shifted higher, labeling the text more positively, but they detect similar relative changes in the text. To understand why the NRC results are biased so high, look at how many positive and negative words each lexicon contains.
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3318
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than in the NRC lexicon.
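A quick check of those ratios, based on the counts printed above (an illustrative calculation, not from the textbook):
3318 / 2308   # NRC: about 1.4 negative words per positive word
4781 / 2005   # Bing: about 2.4 negative words per positive word
The lower negative-to-positive ratio in NRC helps explain why its results sit higher than Bing's in the comparison plot.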
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment, using the count() function.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
The word counts can be piped straight into ggplot2 to visualize the words that contribute most to positive and negative sentiment.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
The word “miss” is coded as negative but it is used as a title for young, unmarried women in Jane Austen’s works.
A custom stop-word list, built with the bind_rows() function, can be used to handle this anomaly.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
Now, visualize the most common words in Jane Austen's works again, this time as a word cloud.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.1.3
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
###### reshape2 Package:
Now, perform the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words; the counts are reshaped with acast() from the reshape2 package and passed to comparison.cloud() from the wordcloud package.
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.3
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"
Some sentiment analysis algorithms look beyond unigrams (i.e., single words) and try to understand the sentiment of a sentence as a whole. Such algorithms try to understand that:
I am not having a good day.
is a sad sentence because of the use of negation.
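Packages such as sentimentr take this sentence-level approach and account for valence shifters like negation; a minimal sketch, assuming the sentimentr package is installed (illustrative only, not part of the textbook's code):
library(sentimentr)
# sentiment() scores whole sentences; "not" acts as a valence shifter,
# which should pull the score for this sentence negative
sentiment(get_sentences("I am not having a good day."))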
Here, the text is tokenized into sentences, and it makes sense to use a new name for the output column in such a case.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences
## # A tibble: 15,545 x 1
## sentence
## <chr>
## 1 "pride and prejudice"
## 2 "by jane austen"
## 3 "chapter 1"
## 4 "it is a truth universally acknowledged, that a single man in possession"
## 5 "of a good fortune, must be in want of a wife."
## 6 "however little known the feelings or views of such a man may be on his"
## 7 "first entering a neighbourhood, this truth is so well fixed in the minds"
## 8 "of the surrounding families, that he is considered the rightful property"
## 9 "of some one or other of their daughters."
## 10 "\"my dear mr."
## # ... with 15,535 more rows
Looking at one sentence:
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
Drawback:
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII.
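One option, if the encoding matters for a given text, is to convert it with base R's iconv() inside a mutate() before unnesting; a minimal sketch (the target encoding here is just an example):
tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>%   # convert encoding before tokenizing
  unnest_tokens(sentence, text, token = "sentences")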
Next, split the text of Jane Austen's novels into a data frame of chapters, using a regex pattern that matches the chapter headings.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Now, find the number of negative words in each chapter and divide by the total words in each chapter. For each book, which chapter has the highest proportion of negative words?
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
These are the chapters with the most sad words in each book, normalized for number of words in the chapter.