Sentiment Analysis

We will use sentiment analysis to systematically identify, extract, quantify, and study affective states and subjective information in text. We will do this on a corpus of novels, using different sentiment lexicons, as discussed in the sections below.


Loading the required libraries:
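
A minimal setup sketch; the exact package list is an assumption based on what this report uses (harrypotter is installed from GitHub, e.g. devtools::install_github("bradleyboehmke/harrypotter")):

library(tidyverse)   # dplyr, ggplot2, purrr, tidyr
library(tidytext)    # unnest_tokens(), get_sentiments(), stop_words
library(stringr)     # regex helpers for chapter detection
library(harrypotter) # the seven novels, one character vector per book
library(janeaustenr) # austen_books() and prideprejudice for the appendix
library(wordcloud)   # wordcloud() and comparison.cloud()
library(reshape2)    # acast() to reshape counts for comparison.cloud()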


Use Case: Corpus - Harry Potter / Sentiment lexicon - loughran

Corpus - Harry Potter

This use case leverages the data provided in the harrypotter package, which is maintained by bradleyboehmke.

The seven novels we are working with, all provided by the harrypotter package, are:

  • philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
  • chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
  • prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
  • goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
  • order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
  • half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
  • deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each text is in a character vector with each element representing a single chapter.
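
A sketch of how this tidy one-word-per-row tibble can be built; the object name hp_words and the exact pipeline are assumptions, not the report's own code:

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")
books  <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
               goblet_of_fire, order_of_the_phoenix, half_blood_prince,
               deathly_hallows)

# one row per word, tagged with book and chapter
hp_words <- map2_df(books, titles,
                    ~ tibble(chapter = seq_along(.x), text = .x, book = .y)) %>%
  mutate(book = factor(book, levels = titles)) %>%
  unnest_tokens(word, text) %>%
  select(book, chapter, word)

hp_words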

## # A tibble: 1,089,386 x 3
##    book                chapter word   
##    <fct>                 <int> <chr>  
##  1 Philosopher's Stone       1 the    
##  2 Philosopher's Stone       1 boy    
##  3 Philosopher's Stone       1 who    
##  4 Philosopher's Stone       1 lived  
##  5 Philosopher's Stone       1 mr     
##  6 Philosopher's Stone       1 and    
##  7 Philosopher's Stone       1 mrs    
##  8 Philosopher's Stone       1 dursley
##  9 Philosopher's Stone       1 of     
## 10 Philosopher's Stone       1 number 
## # ... with 1,089,376 more rows

Sentiment lexicon - loughran
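The structure below can be inspected with get_sentiments() from tidytext; on first use the lexicon may be downloaded via the textdata package:

str(get_sentiments("loughran"))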
## Classes 'tbl_df', 'tbl' and 'data.frame':    4150 obs. of  2 variables:
##  $ word     : chr  "abandon" "abandoned" "abandoning" "abandonment" ...
##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...

Score Analysis using the loughran lexicon
  1. We will first remove stop words from the book series dataset. This gives us a reduced, more focused word set to process in our analysis (a sketch of the code follows the output below):

    ## # A tibble: 409,338 x 3
    ##    book                chapter word     
    ##    <fct>                 <int> <chr>    
    ##  1 Philosopher's Stone       1 boy      
    ##  2 Philosopher's Stone       1 lived    
    ##  3 Philosopher's Stone       1 dursley  
    ##  4 Philosopher's Stone       1 privet   
    ##  5 Philosopher's Stone       1 drive    
    ##  6 Philosopher's Stone       1 proud    
    ##  7 Philosopher's Stone       1 perfectly
    ##  8 Philosopher's Stone       1 normal   
    ##  9 Philosopher's Stone       1 people   
    ## 10 Philosopher's Stone       1 expect   
    ## # ... with 409,328 more rows

    We can see the dataset has shrunk with the removal of stop words: from 1,089,386 to 409,338 rows.
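
    A minimal sketch of the stop-word removal, assuming the tidy tibble built earlier is named hp_words:

    hp_tidy <- hp_words %>%
      anti_join(stop_words, by = "word")  # stop_words ships with tidytext

    hp_tidy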


  2. Checking for negative and positive sentiments in the first book, philosophers_stone:

    We can see there are more negative words than positive words in the first book.
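
    A sketch of this check, assuming hp_tidy from the previous step:

    hp_tidy %>%
      filter(book == "Philosopher's Stone") %>%
      inner_join(get_sentiments("loughran"), by = "word") %>%
      filter(sentiment %in% c("negative", "positive")) %>%
      count(sentiment)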


  3. Chapter-wise sentiment scores:

    ## # A tibble: 200 x 9
    ##    book  index constraining litigious negative positive superfluous uncertainty
    ##    <fct> <dbl>        <dbl>     <dbl>    <dbl>    <dbl>       <dbl>       <dbl>
    ##  1 Deat~     1            4         2       47       20           0          10
    ##  2 Deat~     2            1        11       85       26           0          16
    ##  3 Deat~     3            0         2       45       10           0           7
    ##  4 Deat~     4            2         1       59        7           0           9
    ##  5 Deat~     5            1         4       85        5           0          16
    ##  6 Deat~     6            6         3       92       25           0          19
    ##  7 Deat~     7            2         5       75       33           0          18
    ##  8 Deat~     8            1         4       63       30           0          15
    ##  9 Deat~     9            1         3       51        4           0           4
    ## 10 Deat~    10            8         2       75       28           0          11
    ## # ... with 190 more rows, and 1 more variable: sentiment <dbl>

    From the ggplot and the table above, we can see that each chapter's negative score generally exceeds its positive score; perhaps these books are not for younger children.
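
    One way to produce a table like the one above (the widening with spread() and the plot details are assumptions):

    hp_sentiment <- hp_tidy %>%
      inner_join(get_sentiments("loughran"), by = "word") %>%
      count(book, index = chapter, sentiment) %>%
      spread(sentiment, n, fill = 0) %>%       # one column per sentiment class
      mutate(sentiment = positive - negative)  # net score per chapter

    hp_sentiment

    ggplot(hp_sentiment, aes(index, sentiment, fill = book)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ book, ncol = 2, scales = "free_x")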


  4. Top words across all sentiments in all books:

    We can conclude that Harry Potter did have an adventurous and thrilling life. The series rightly fits the genres of fantasy, drama, young adult fiction, mystery, and thriller [wikipedia link].
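
    A sketch of the top-words ranking behind this conclusion (the cutoff of 10 words per sentiment and the plot details are assumptions):

    hp_tidy %>%
      inner_join(get_sentiments("loughran"), by = "word") %>%
      count(word, sentiment, sort = TRUE) %>%
      group_by(sentiment) %>%
      top_n(10, n) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ sentiment, scales = "free_y") +
      coord_flip() +
      labs(x = NULL, y = "count")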


Appendix: The Sentiments Dataset

The examples below are from the code provided in the book Text Mining with R; Chapter 2 covers sentiment analysis.

Sentiment Lexicons

The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. They contain many English words, each assigned a score for positive/negative sentiment and, in some cases, for emotions such as joy, anger, and sadness.

The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one.
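
For example, the following calls produce the str() output below (AFINN, bing, and nrc, in that order):

str(get_sentiments("afinn"))
str(get_sentiments("bing"))
str(get_sentiments("nrc"))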

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of  2 variables:
##  $ word : chr  "abandon" "abandoned" "abandons" "abducted" ...
##  $ value: num  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   word = col_character(),
##   ..   value = col_double()
##   .. )
## Classes 'tbl_df', 'tbl' and 'data.frame':    6786 obs. of  2 variables:
##  $ word     : chr  "2-faces" "abnormal" "abolish" "abominable" ...
##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    13901 obs. of  2 variables:
##  $ word     : chr  "abacus" "abandon" "abandon" "abandon" ...
##  $ sentiment: chr  "trust" "fear" "negative" "sadness" ...

Joy score from the NRC lexicon

Let’s look at the words with a joy score from the NRC lexicon and compare them against the austen_books() corpus.
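
Following the book's code, the tidy corpus is built from austen_books() and joined against the joy subset of the NRC lexicon; the first count below is for Emma, and the second listing presumably applies the same pattern to another NRC sentiment (e.g. sadness):

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)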

book                 linenumber  chapter  word
Sense & Sensibility           1        0  sense
Sense & Sensibility           1        0  and
Sense & Sensibility           1        0  sensibility
Sense & Sensibility           3        0  by
Sense & Sensibility           3        0  jane
Sense & Sensibility           3        0  austen
Sense & Sensibility           5        0  1811
Sense & Sensibility          10        1  chapter
Sense & Sensibility          10        1  1
Sense & Sensibility          13        1  the
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
## Joining, by = "word"
## # A tibble: 347 x 2
##    word           n
##    <chr>      <int>
##  1 doubt         98
##  2 ill           72
##  3 bad           60
##  4 leave         58
##  5 mother        57
##  6 feeling       56
##  7 impossible    41
##  8 pain          34
##  9 evil          33
## 10 wanting       33
## # ... with 337 more rows
## Joining, by = "word"


The Three Sentiment Dictionaries

Let’s use all three sentiment lexicons and examine how the sentiment changes across the narrative arc of Pride and Prejudice. First, let’s use filter() to choose only the words from the one novel we are interested in.
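
A sketch of the filter step, together with the lexicon-size check whose counts appear after the sample rows below (the three "Joining" messages come from computing the per-lexicon sentiment trajectories):

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

# how many positive vs. negative entries each lexicon carries
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)

get_sentiments("bing") %>%
  count(sentiment)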

book               linenumber  chapter  word
Pride & Prejudice           1        0  pride
Pride & Prejudice           1        0  and
Pride & Prejudice           1        0  prejudice
Pride & Prejudice           3        0  by
Pride & Prejudice           3        0  jane
Pride & Prejudice           3        0  austen
Pride & Prejudice           7        1  chapter
Pride & Prejudice           7        1  1
Pride & Prejudice          10        1  it
Pride & Prejudice          10        1  is
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"

## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Common Positive and Negative Words

We can analyze word counts that contribute to each sentiment. By using count() with arguments of both word and sentiment, we find out how much each word contributes to each sentiment.
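
Likely along the lines of the book's code (the "Joining" message comes from the implicit by = "word"):

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

bing_word_counts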

## Joining, by = "word"
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
## Selecting by n
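
The word "miss" tops the negative counts, yet in Austen it is mostly a title for young women rather than a negative term; the book handles this anomaly by adding it to a custom stop-word list, shown below:

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)

custom_stop_words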

## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Wordclouds

Let’s look at the most common words in Jane Austen’s works as a whole again, but this time as a wordcloud. The size of a word’s text in the figure below is proportional to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
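
The clouds can be generated along the lines of the book's code; reshape2::acast() turns the tidy counts into the matrix that comparison.cloud() expects:

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)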

## Joining, by = "word"

## Joining, by = "word"


N-Grams

Looking at units beyond just words: some sentiment analysis algorithms look beyond unigrams (i.e., single words) to try to understand the sentiment of a sentence as a whole.
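
For example, tokenizing Pride and Prejudice into sentences (prideprejudice ships with janeaustenr); the second sentence is printed below:

PandP_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

PandP_sentences$sentence[2]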

## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
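
Finally, the book finds the most negative chapter of each novel by the ratio of bing negative words to total words per chapter; semi_join() keeps only the negative words, and top_n() picks the highest ratio per book:

bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()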
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343