Summary

Tidy data principles can be applied to natural language processing. When text is organized in a format with one token per row, tasks like removing stop words or calculating word frequencies are natural applications of familiar operations within the tidy tool ecosystem. Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. This assignment leverages sentiment analysis using tidy data principles: when text data is in a tidy data structure, sentiment analysis can be implemented as an inner join.
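
As a minimal sketch of that idea (the one-sentence corpus here is made up purely for illustration):

library(dplyr)
library(tidytext)

# One token per row via unnest_tokens(), then sentiment analysis as an inner join
tibble(text = "a truly happy ending after a terrible storm") %>%  # toy corpus
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")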


Acceptance Criteria

  • Re-create base analysis
  • Extend analysis to new corpus and new lexicon
  • Reproducibility and Submission
  • Workflow

Re-create base analysis

Sentiments datasets

There are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are

  • AFINN from Finn Årup Nielsen
  • bing from Bing Liu and collaborators
  • nrc from Saif Mohammad and Peter Turney.

Implementation

Load required libraries

library(stringr)
library(tidytext)
library(textdata)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(dplyr)
library(wordcloud)
library(janeaustenr)
library(reshape2)
library(ggwordcloud)
library(gutenbergr)
library(DT)

The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.

Get Sentiments for AFINN

sentiments.afinn <- get_sentiments("afinn")
datatable(sentiments.afinn, filter = 'bottom', options = list(pageLength = 10))

Get Sentiments for bing

sentiments.bing <- get_sentiments("bing")
datatable(sentiments.bing, filter = 'bottom', options = list(pageLength = 10))

Get Sentiments for nrc

sentiments.nrc <- get_sentiments("nrc")
datatable(sentiments.nrc, filter = 'bottom', options = list(pageLength = 10))

Sentiment analysis with inner join

Tidy the text of Jane Austen’s novels

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis.

Most common joy words in Emma

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

Small sections of text may not have enough words in them to get a good estimate of sentiment, while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts, how long the lines were to start with, etc. We then use pivot_wider() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).

Calculate a net sentiment (positive - negative)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
# plot these sentiment scores
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

The plot shows how each novel changes toward more positive or negative sentiment over the trajectory of the story.

Comparing the three sentiment dictionaries

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows

Let’s find the net sentiment in each of these sections of the novel using all three lexicons.

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

Let’s bind them together and visualize them.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Let’s look briefly at how many positive and negative words are in these lexicons.

# afinn
sentiments.afinn.negative <- sentiments.afinn %>% filter(value < 0)
count(sentiments.afinn.negative)
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1598
sentiments.afinn.neutral <- sentiments.afinn %>% filter(value == 0)
count(sentiments.afinn.neutral)
## # A tibble: 1 x 1
##       n
##   <int>
## 1     1
sentiments.afinn.positive <- sentiments.afinn %>% filter(value > 0)
count(sentiments.afinn.positive)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   878
# bing
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
# nrc
get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312

Most common positive and negative words

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

This lets us spot an anomaly in the sentiment analysis: the word “miss” is coded as negative, but it is used as a title for young, unmarried women in Jane Austen’s works. If it were appropriate for our purposes, we could easily add “miss” to a custom stop-words list using bind_rows(). We could implement that with a strategy such as this:

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)
count(custom_stop_words)
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1150
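
Note that custom_stop_words is defined above but never applied. If we wanted to use it, a sketch (assuming we want word counts with “miss” excluded) would be to swap it in for stop_words in an anti_join():

# Hypothetical usage: drop the custom stop words (including "miss") before counting
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)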

Word Cloud

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words.

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Looking at units beyond just words

p_and_p_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"
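
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text. If that matters for a given text, one workaround (a sketch; the target encoding "latin1" is an assumption that depends on the source file) is to re-encode with iconv() in a mutate() before unnesting:

# Hypothetical fix: re-encode the text before tokenizing into sentences
p_and_p_sentences <- tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>%
  unnest_tokens(sentence, text, token = "sentences")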

Another option is to tokenize with a regex pattern. Let’s try to split the text of Jane Austen’s novels into a data frame by chapter.

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

Let’s find the number of negative words in each chapter and divide by the total words in each chapter.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Summary

This was an interesting dive into sentiment analysis, which provides a way to understand the attitudes and opinions expressed in texts. We explored how to approach sentiment analysis using tidy data principles: when text data is in a tidy data structure, sentiment analysis can be implemented as an inner join. We can use sentiment analysis to understand how a narrative arc changes throughout its course, or which words with emotional and opinion content are important for a particular text.

Extend analysis to new corpus and new lexicon

Author - Jane Austen

# Download all books by Jane Austen (meta_fields adds the title)
austen <- gutenberg_works(author == "Austen, Jane") %>%
  gutenberg_download(meta_fields = "title")
# Download the same works again by id (text only, without titles)
janeausten <- gutenberg_download(unique(austen$gutenberg_id))

# Jane Austen metadata
jane_metadata <- gutenberg_metadata[
    which(gutenberg_metadata$gutenberg_id %in% c(unique(austen$gutenberg_id))),
    c("gutenberg_id","title")]
jane_metadata
## # A tibble: 10 x 2
##    gutenberg_id title                                                           
##           <int> <chr>                                                           
##  1          105 "Persuasion"                                                    
##  2          121 "Northanger Abbey"                                              
##  3          141 "Mansfield Park"                                                
##  4          158 "Emma"                                                          
##  5          161 "Sense and Sensibility"                                         
##  6          946 "Lady Susan"                                                    
##  7         1212 "Love and Freindship [sic]"                                     
##  8         1342 "Pride and Prejudice"                                           
##  9        31100 "The Complete Project Gutenberg Works of Jane Austen\nA Linked ~
## 10        42078 "The Letters of Jane Austen\r\nSelected from the compilation of~
# Add the book title to each Jane Austen work
jane_books <- merge(janeausten,jane_metadata,by="gutenberg_id")

# Rename title to book
jane_books <- rename(jane_books, book = title)

New lexicon - loughran

loughran_sent <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive","negative"))
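
As a quick check on the new lexicon (exact counts depend on the installed textdata version), we can count how many words fall into each Loughran-McDonald category:

# Count words per sentiment category in the full Loughran-McDonald lexicon
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)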

Tidy Work for the Jane Austen books

# Create the tidy Jane Austen data set
tidydata <- jane_books[,c("text","book")] %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(
           str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(jane_books)
##   gutenberg_id        text       book
## 1          105  Persuasion Persuasion
## 2          105             Persuasion
## 3          105             Persuasion
## 4          105          by Persuasion
## 5          105             Persuasion
## 6          105 Jane Austen Persuasion

Comparing lexicons

northanger_abbey <- tidydata %>%
  filter(book == "Northanger Abbey")

afinn2 <- northanger_abbey %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

bing_and_nrc2 <- bind_rows(
  northanger_abbey %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing"),
  northanger_abbey %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

loughran <- northanger_abbey %>%
  inner_join(loughran_sent) %>%
  mutate(method = "Loughran-McDonald") %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

NRC, AFINN, Bing, and Loughran-McDonald Sentiment

Binding the Loughran-McDonald estimate in with the others lets us compare all four methods on the same novel.

bind_rows(afinn2, 
          bing_and_nrc2,
          loughran) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Word Cloud

tidydata %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

Using Loughran-McDonald

Here we use loughran_sent (the positive and negative categories only), since comparison.cloud() expects one color per sentiment column.

tidydata %>%
  inner_join(loughran_sent) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Using Bing

tidydata %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Conclusion

When using the Loughran-McDonald lexicon to compare Jane Austen’s novels, the text showed, on average, positive sentiment in each 80-line section. That is, of course, a great simplification of the author and genre. Running the sentiment analysis with the Loughran-McDonald lexicon also surfaced a different set of sentiment words than the general-purpose lexicons.