We will use sentiment analysis to systematically identify, extract, quantify, and study affective states and subjective information in text. We will do this on a corpus of novels, using several sentiment lexicons, as discussed in the sections below.
Loading the required libraries:
#install.packages("tidytext")
library(tidytext)
#install.packages("textdata")
library(textdata)
#install.packages("janeaustenr")
library(janeaustenr)
library(dplyr)
library(stringr)
library(knitr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(reshape2)
# Below: additional packages for our use case:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
#devtools::install_github("mjockers/syuzhet")
library("syuzhet")The use case leverages the data provided in the harrypotter package. The package has been provided by bradleyboehmke.
The seven novels we are working with, all provided by the harrypotter package, are:

- philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
- chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
- prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
- goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
- order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
- half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
- deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each text is in a character vector, with each element representing a single chapter.
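A quick way to confirm that structure is to inspect one of the vectors directly (a minimal check; the preview length of 70 characters is just illustrative):
# Each book is a character vector with one element per chapter:
length(philosophers_stone)           # number of chapters
substr(philosophers_stone[1], 1, 70) # first characters of chapter 1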
#To perform sentiment analysis we need to have our data in a tidy format:
#Vector of title names:
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows")
#vector of books:
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince,
deathly_hallows)
# Creating the tidy dataset series.full:
series.full <- tibble()
for(i in seq_along(titles)) {
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) %>%
unnest_tokens(word, text) %>%
mutate(book = titles[i]) %>%
select(book, everything())
series.full <- rbind(series.full, clean)
}
# set factor to keep books in order of publication:
series.full$book <- factor(series.full$book, levels = rev(titles))
#final tidy dataset ready for Analysis:
series.full
## # A tibble: 1,089,386 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ... with 1,089,376 more rows
# Getting the Loughran sentiment lexicon:
loughran.sentiments <- get_sentiments("loughran")
str(loughran.sentiments)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4150 obs. of 2 variables:
## $ word : chr "abandon" "abandoned" "abandoning" "abandonment" ...
## $ sentiment: chr "negative" "negative" "negative" "negative" ...
We will first remove stop words from the book series dataset. This leaves a reduced, more focused set of words for our analysis:
#We will use the anti_join() function to remove all stop words from our series set:
series.main <- series.full %>%
anti_join(stop_words)
series.main
## # A tibble: 409,338 x 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 boy
## 2 Philosopher's Stone 1 lived
## 3 Philosopher's Stone 1 dursley
## 4 Philosopher's Stone 1 privet
## 5 Philosopher's Stone 1 drive
## 6 Philosopher's Stone 1 proud
## 7 Philosopher's Stone 1 perfectly
## 8 Philosopher's Stone 1 normal
## 9 Philosopher's Stone 1 people
## 10 Philosopher's Stone 1 expect
## # ... with 409,328 more rows
We can see the dataset has shrunk with the removal of stop words, from 1,089,386 to 409,338 rows.
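A quick sanity check of those row counts (assuming both tibbles are still in the session):
# Compare dataset sizes before and after stop-word removal:
nrow(series.full)
nrow(series.main)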
Checking for negative and positive sentiments in the first book, philosophers_stone:
# Creating a dataset for `negative` sentiment tokens:
loughran.sentiments.negative <- loughran.sentiments %>%
filter(sentiment == "negative")
# Creating a dataset for `positive` sentiment tokens:
loughran.sentiments.positive <- loughran.sentiments %>%
filter(sentiment == "positive")
# For negative tokens, we can use the inner_join() function to get all of the negative words from the book "Philosopher's Stone"; we will then count the frequency of each word and plot a wordcloud:
series.main %>%
filter(book == "Philosopher's Stone" ) %>%
inner_join(loughran.sentiments.negative) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
# We will repeat the above steps for positive tokens and plot them similarly on a wordcloud:
series.main %>%
filter(book == "Philosopher's Stone" ) %>%
inner_join(loughran.sentiments.positive) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
We can see there are more negative words than positive words in the first book.
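To put a number behind that impression, a minimal sketch (reusing the objects defined above) counts how many Loughran negative and positive tokens occur in the first book:
# Count negative vs. positive Loughran tokens in Philosopher's Stone:
series.main %>%
  filter(book == "Philosopher's Stone") %>%
  inner_join(loughran.sentiments) %>%
  filter(sentiment %in% c("negative", "positive")) %>%
  count(sentiment)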
Chapter-wise sentiment scores:
# We will take the positive and negative sentiments across all books in the series, grouped chapter-wise:
series.main.sentiment <- series.main %>%
inner_join(loughran.sentiments) %>%
count(book, index = chapter %/% 1, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
series.main.sentiment
## # A tibble: 200 x 9
## book index constraining litigious negative positive superfluous uncertainty
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Deat~ 1 4 2 47 20 0 10
## 2 Deat~ 2 1 11 85 26 0 16
## 3 Deat~ 3 0 2 45 10 0 7
## 4 Deat~ 4 2 1 59 7 0 9
## 5 Deat~ 5 1 4 85 5 0 16
## 6 Deat~ 6 6 3 92 25 0 19
## 7 Deat~ 7 2 5 75 33 0 18
## 8 Deat~ 8 1 4 63 30 0 15
## 9 Deat~ 9 1 3 51 4 0 4
## 10 Deat~ 10 8 2 75 28 0 11
## # ... with 190 more rows, and 1 more variable: sentiment <dbl>
# We can now plot it to visualize the spread of the sentiments:
ggplot(series.main.sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")From the ggplot and the table data above, we can see that the book overall is more negative then postive score for each chapter; the books are not for smaller children perhaps.
Top words across all sentiments in all books:
# Below runs across all the books in the series for all sentiment types and plots the top 15 words per sentiment category for the entire series:
series.main %>%
inner_join(loughran.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(15) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
We can conclude that Harry Potter had an adventurous as well as a thrilling life. The series rightly fits into the genres of fantasy, drama, young adult fiction, mystery, and thriller [wikipedia link].
The examples below follow the code provided in the book Text Mining with R; Chapter 2 of that book covers sentiment analysis.
The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are
- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.
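The str() outputs below presumably come from loading each lexicon with get_sentiments(); a minimal sketch, using the object names (afinn.sentiments, bing.sentiments, nrc.sentiments) that the later code in this section relies on:
# Load the three general-purpose lexicons into the names used later:
afinn.sentiments <- get_sentiments("afinn")
bing.sentiments <- get_sentiments("bing")
nrc.sentiments <- get_sentiments("nrc")
str(afinn.sentiments)
str(bing.sentiments)
str(nrc.sentiments)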
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of 2 variables:
## $ word : chr "abandon" "abandoned" "abandons" "abducted" ...
## $ value: num -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
## - attr(*, "spec")=
## .. cols(
## .. word = col_character(),
## .. value = col_double()
## .. )
## Classes 'tbl_df', 'tbl' and 'data.frame': 6786 obs. of 2 variables:
## $ word : chr "2-faces" "abnormal" "abolish" "abominable" ...
## $ sentiment: chr "negative" "negative" "negative" "negative" ...
## Classes 'tbl_df', 'tbl' and 'data.frame': 13901 obs. of 2 variables:
## $ word : chr "abacus" "abandon" "abandon" "abandon" ...
## $ sentiment: chr "trust" "fear" "negative" "sadness" ...
Joy score from the NRC lexicon:
Let’s look at the words with a joy score from the NRC lexicon and compare against the austen_books corpus.
# austen_books pull:
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
kable(head(tidy_books, 10))
| book | linenumber | chapter | word |
|---|---|---|---|
| Sense & Sensibility | 1 | 0 | sense |
| Sense & Sensibility | 1 | 0 | and |
| Sense & Sensibility | 1 | 0 | sensibility |
| Sense & Sensibility | 3 | 0 | by |
| Sense & Sensibility | 3 | 0 | jane |
| Sense & Sensibility | 3 | 0 | austen |
| Sense & Sensibility | 5 | 0 | 1811 |
| Sense & Sensibility | 10 | 1 | chapter |
| Sense & Sensibility | 10 | 1 | 1 |
| Sense & Sensibility | 13 | 1 | the |
# nrc sentiments joy:
nrc.sentiments.joy <- nrc.sentiments %>%
filter(sentiment == 'joy')
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc.sentiments.joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
# nrc sentiments sadness:
nrc.sentiments.sadness <- nrc.sentiments %>%
filter(sentiment == 'sadness')
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc.sentiments.sadness) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 347 x 2
## word n
## <chr> <int>
## 1 doubt 98
## 2 ill 72
## 3 bad 60
## 4 leave 58
## 5 mother 57
## 6 feeling 56
## 7 impossible 41
## 8 pain 34
## 9 evil 33
## 10 wanting 33
## # ... with 337 more rows
# jane_austen_sentiment:
jane_austen_sentiment <- tidy_books %>%
inner_join(bing.sentiments) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")Let’s use all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice. First, let’s use filter() to choose only the words from the one novel we are interested in.
# pride_prejudice:
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
kable(head(pride_prejudice, 10))
| book | linenumber | chapter | word |
|---|---|---|---|
| Pride & Prejudice | 1 | 0 | pride |
| Pride & Prejudice | 1 | 0 | and |
| Pride & Prejudice | 1 | 0 | prejudice |
| Pride & Prejudice | 3 | 0 | by |
| Pride & Prejudice | 3 | 0 | jane |
| Pride & Prejudice | 3 | 0 | austen |
| Pride & Prejudice | 7 | 1 | chapter |
| Pride & Prejudice | 7 | 1 | 1 |
| Pride & Prejudice | 10 | 1 | it |
| Pride & Prejudice | 10 | 1 | is |
# Get the split counts:
afinn <- pride_prejudice %>%
inner_join(afinn.sentiments) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>%
inner_join(bing.sentiments) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(nrc.sentiments %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
# Plot:
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")# Positive and negative words are in these lexicons:
nrc.sentiments %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
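The second table below presumably comes from the same count applied to the Bing lexicon; a minimal sketch, assuming the bing.sentiments object loaded earlier:
# Count positive and negative entries in the Bing lexicon:
bing.sentiments %>%
  count(sentiment)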
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
We can analyze word counts that contribute to each sentiment. By calling count() with both word and sentiment as arguments, we find out how much each word contributed to each sentiment.
# bing_word_counts:
bing_word_counts <- tidy_books %>%
inner_join(bing.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
# custom_stop_words:
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud. The size of a word’s text in the figure below is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
tidy_books %>%
inner_join(bing.sentiments) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
Looking at units beyond just words: some sentiment analysis algorithms look beyond unigrams (i.e. single words) and try to understand the sentiment of a sentence as a whole.
# token = "sentences"; we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case :
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
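Since the syuzhet package was loaded at the start, one option for scoring whole sentences is its get_sentiment() function; a minimal sketch (the "syuzhet" method here is just one of several it offers):
# Score each sentence of Pride and Prejudice with the syuzhet lexicon:
sentence_scores <- get_sentiment(PandP_sentences$sentence, method = "syuzhet")
head(sentence_scores)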
# chapters: another option for unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# bing.sentiments.negative: let’s find the number of negative words in each chapter and divide by the total words in each chapter, to see which chapter of each book has the highest proportion of negative words:
bing.sentiments.negative <- bing.sentiments %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bing.sentiments.negative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343