1. Assignment Overview

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:
1) Work with a different corpus of your choosing, and
2) Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.

2. Sentiment analysis with tidy data

Sentiment analysis uses the tools of text mining to examine the emotional content of text. One way to analyze the sentiment of a text is to treat the text as a combination of its individual words, and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
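
As a tiny illustration of this word-level approach (the sentence below is made up, and the exact scores depend on the lexicon version), we can sum AFINN scores over a single sentence:

library(dplyr)
library(tidytext)

tibble(text = "this book is wonderful, though the ending is sad") %>%
  unnest_tokens(word, text) %>% #one word per row
  inner_join(get_sentiments("afinn"), by = "word") %>% #attach a score to each matched word
  summarise(sentiment = sum(value)) #whole-text sentiment = sum of the word scores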

2.1 The sentiments datasets

There are several ways to evaluate the emotion in text, but three general-purpose lexicons are the most commonly used:
- AFINN from Finn Årup Nielsen assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
- bing from Bing Liu and collaborators separates words into positive and negative categories in a binary fashion.
- nrc from Saif Mohammad and Peter Turney categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
They are all based on single words (unigrams).
The function get_sentiments() from the tidytext package returns these sentiment lexicons. The result is a tibble with two columns: the word and its value or sentiment.

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,862 more rows

2.2 Sentiment analysis with inner join

The sentiment analysis below is done with the help of an inner join.
We will perform it by finding the most common joy words in Emma using the NRC lexicon.
First, we add columns that keep track of which line and chapter of the book each word comes from, using the group_by() and mutate() functions. Next, we convert the text of the novels to the tidy format using the unnest_tokens() function.

library(janeaustenr) # library to get full texts for Jane Austen's 6 completed novels, ready for text analysis
library(dplyr) #library to load tools for working with data frames
library(stringr) #library to provide fast, correct implementations of common string manipulations

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate( #create two columns with row number and chapter number for each word
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) #split a column into tokens, flattening the table into one-token-per-row

As a result, we get a tibble with 725,055 observations of 4 variables (book, linenumber, chapter, word).

tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows

We will use the NRC lexicon and the filter() function to keep only the joy words and only the words from the book Emma. The inner_join() function performs the sentiment analysis, and count() counts how many times each word occurs in the book.
As a result, 301 distinct joy words are found in Emma; the word “good” appears 359 times. Some of these words are arguably not joy words in context, e.g. “found” and “present”.

nrc_joy <- get_sentiments("nrc") %>%  #get only joy words from NRC lexicon
  filter(sentiment == "joy") 

tidy_books %>% #get only joy words from Emma and count how many times they appear in the book
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) #sort=TRUE will sort the numbers in desc order
## Joining, by = "word"
## # A tibble: 301 x 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ... with 291 more rows

Next, we will use the bing lexicon to count how many positive and negative words there are in defined sections (80 lines of text each) of each book; a short illustration of how the 80-line index works follows the code.
The result is a tibble with 920 observations of 5 variables (book, index, negative, positive, sentiment).

library(tidyr) #library to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value

jane_austen_sentiment <- tidy_books %>% #use the tibble with words from Jane Austen's 6 completed novels
  inner_join(get_sentiments("bing")) %>% #get bing lexicon and perform sentiment analysis
  count(book, index = linenumber %/% 80, sentiment) %>% #index groups the text into 80-line sections (integer division)
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% #negative and positive sentiment are in separate columns
  mutate(sentiment = positive - negative) #net sentiment (positive minus negative) per section
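
As a quick illustration of the index, %/% is integer division, so every 80 consecutive lines share one index value (the line numbers below are just a toy example):

c(1, 79, 80, 159, 160) %/% 80 #lines 1 and 79 fall into section 0, lines 80 and 159 into section 1, and so on
## [1] 0 0 1 1 2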

The plot below shows how the sentiment of each novel changes toward more positive or negative over its trajectory. The sentiment is predominantly positive across all of these novels.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

2.3 Comparing the three sentiment dictionaries

We can check how each lexicon works for the Jane Austen’s novel “Pride and Prejudice”, if they show the same sentiment over the narrative.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice") #filter only Pride & Prejudice book

pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows

The sentiment analysis using the afinn lexicon starts as usual with inner_join(); we then sum the values to get a net sentiment for each index. As a result, we get a tibble with 163 observations of 3 variables (index, sentiment, method).
The results from the bing and nrc lexicons will be stored together in a second tibble.

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% #add column with sentiment value
  mutate(method = "AFINN") #add column with lexicon name

bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."), #add rows with bing method in column method
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>% ##add rows with with nrc method in column method
  count(method, index = linenumber %/% 80, sentiment) %>% #add column index
  pivot_wider(names_from = sentiment, #create columns with sentiment value
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)  #create new column with negative and positive sentiment

Now we can bind the results of the sentiment analysis for each lexicon. The plots look somewhat different in places (especially in their absolute values), but in general they all show predominantly positive sentiment through Pride & Prejudice, with similar dips and peaks at about the same places in the novel.

bind_rows(afinn, #combine three plots together
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

There is a difference in the absolute values of the sentiment between the lexicons.
It can be partly explained by the ratio of negative to positive words in each lexicon: in bing this ratio is about 2.38, which is higher than in nrc (about 1.44); a quick check of these ratios follows the counts below. There is also a systematic difference in how many words each lexicon matches in the text.

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
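
A quick check of the ratios mentioned above, using simple arithmetic on the counts returned by count():

4781 / 2005 #bing: negative-to-positive ratio, about 2.38
3316 / 2308 #nrc: negative-to-positive ratio, about 1.44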

2.4 Most common positive and negative words

The code below counts how much each word contributed to each sentiment. We again use the tibble with every word from all of Jane Austen’s novels, together with the bing lexicon. The result is a tibble with 2,585 observations of 3 variables (word, sentiment, n).

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>% #counts how many times each word appears together with its sentiment
  ungroup()

bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows

It is easier to interpret the results by visualizing them. The plot below shows the contribution of each word to each sentiment. “Miss” contributes the most to the negative sentiment, while “well” contributes the most to the positive.

bing_word_counts %>% #take the result of bing count
  group_by(sentiment) %>% #group by the sentiment
  slice_max(n, n = 10) %>% #keep the 10 most frequent words per sentiment
  ungroup() %>%
  mutate(word = reorder(word, n)) %>% #reorder words by frequency for plotting
  ggplot(aes(n, word, fill = sentiment)) + #plot the graph
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

But the word “miss” is not necessarily a negative word here; in Jane Austen’s novels it is mostly used as a title for young women. If needed, we could add “miss” to a custom stop-words list using bind_rows(); a short sketch of applying it follows the output below. The result is a tibble of the standard stop words, which are neutral words like “a”, “above”, etc., plus our custom entry.

custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows
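
If we actually wanted to apply it, a minimal sketch (not part of the base example) would be to anti_join() the custom list before counting, which drops “miss” along with the standard stop words:

tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>% #remove custom stop words, including "miss"
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)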

2.5 Wordclouds

Using the wordcloud library, we can visualize the most common words in Jane Austen’s works. It seems that “miss” and “time” are among the most common words.

library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

The comparison.cloud() function visualizes the most common positive and negative words. The size of each word is proportional to its frequency within its sentiment. We see “miss” and “good” as the most common again.

library(reshape2)

tidy_books %>%
  inner_join(get_sentiments("bing")) %>% #sentiment analysis
  count(word, sentiment, sort = TRUE) %>% #count sentiment for each word
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% #turn the data frame into a matrix
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

2.6 Looking at units beyond just words

There are other ways to analyze the emotion of text. For example, we may want to score whole sentences such as “I am not having a good day”, where the negation matters. For packages such as coreNLP, cleanNLP, and sentimentr, we need to tokenize the text into sentences; a short sentimentr sketch follows the example below.

p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]
## [1] "by jane austen"

Alternatively, we can split the text of Jane Austen’s novels into chapters, using a regex pattern in unnest_tokens(). The result is a tibble from which we can count the number of chapters in each book.

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

The code below answers the question “What are the most negative chapters in each of Jane Austen’s novels?” using the bing lexicon. The result is a tibble with the most negative chapter of each book: the book name, the chapter number, the number of negative words in the chapter, the total number of words in the chapter, and the ratio of negative words to total words.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative") #get all the negative words in bing

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n()) #count how many words in each chapter

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>% #get all negative words in each chapter
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>% #divide the number of negative words by the total words in each chapter
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

3. Sentiment analysis of Harry Potter books

We are going to analyze the Harry Potter series. The books can be loaded from the GitHub repository ‘bradleyboehmke/harrypotter’.
Each book is a character vector in which each element is a chapter.

library(devtools)
install_github("bradleyboehmke/harrypotter") #install the package with the books from GitHub
library(harrypotter) #load the seven books as character vectors
class(philosophers_stone)
## [1] "character"

3.1 Transform data

Before using the lexicons, we need to transform each character vector into a tibble with 3 columns: the name of the book, the chapter number, and the word, so that each word of each book has its own row.
We will first combine all the books in one list, so we can convert each to a data frame and merge them into a single data frame.

library(purrr) #library that provides set_names() and map_df()

book_names <- c("Philosophers stone", "Chamber of secrets",
              "Prisoner of Azkaban", "Goblet of fire",
              "Order of the Phoenix", "Half-blood prince",
              "Deathly hallows")

hp_books <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) %>%
  set_names(book_names) %>%
  map_df(as_tibble, .id = "book") %>% # convert each book to a data frame and merge into a single data frame
  mutate(book = factor(book, levels = book_names)) %>%
  drop_na(value) %>%
  group_by(book) %>%
  mutate(chapter = row_number(book)) %>% #add column with chapter number
  ungroup() %>%
  unnest_tokens(word, value)

hp_books
## # A tibble: 1,089,386 x 3
##    book               chapter word   
##    <fct>                <int> <chr>  
##  1 Philosophers stone       1 the    
##  2 Philosophers stone       1 boy    
##  3 Philosophers stone       1 who    
##  4 Philosophers stone       1 lived  
##  5 Philosophers stone       1 mr     
##  6 Philosophers stone       1 and    
##  7 Philosophers stone       1 mrs    
##  8 Philosophers stone       1 dursley
##  9 Philosophers stone       1 of     
## 10 Philosophers stone       1 number 
## # ... with 1,089,376 more rows

3.2 NRC Lexicon

We can check the most common words in the books that match the nrc lexicon. The most frequent is, of course, “harry”.

hp_books %>% 
  inner_join(get_sentiments("nrc")) %>%
  count(word, sort = TRUE) #sort=TRUE will sort the numbers in desc order
## # A tibble: 3,702 x 2
##    word          n
##    <chr>     <int>
##  1 harry     49671
##  2 good       5325
##  3 death      5299
##  4 professor  4012
##  5 feeling    3910
##  6 found      1842
##  7 ministry   1728
##  8 time       1713
##  9 mother     1704
## 10 lord       1564
## # ... with 3,692 more rows

With the same nrc lexicon, we can count how many word matches fall into each sentiment category. The result suggests that the Harry Potter books lean negative: there are about 55k negative matches versus about 38k positive ones.

nrc <- hp_books  %>%
        right_join(get_sentiments("nrc")) %>%
        filter(!is.na(sentiment)) %>%
        count(sentiment, sort = TRUE)
nrc
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 negative     55091
##  2 positive     37758
##  3 sadness      34878
##  4 anger        32742
##  5 trust        23154
##  6 fear         21536
##  7 anticipation 20625
##  8 joy          13800
##  9 disgust      12861
## 10 surprise     12817

3.3 Bing lexicon

hp_sentiment <- hp_books %>% #use the tibble with words from all the books
        group_by(book) %>% 
        mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>% 
  inner_join(get_sentiments("bing")) %>% #get bing lexicon and perform sentiment analysis
  count(book, index = index, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% #negative and positive sentiment are in separate columns
  mutate(sentiment = positive - negative, #net sentiment (positive minus negative) per 500-word section
               book = factor(book, levels = book_names))

The plot below shows how the sentiment of each book changes toward more positive or negative over its trajectory. The sections are predominantly negative across all of these stories, and the Deathly Hallows looks like the most negative book.

ggplot(hp_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

The code below answers the question “What are the most negative chapters in each of the books?” using the bing lexicon. The result is a tibble with the most negative chapter of each book: the book name, the chapter number, the number of negative words in the chapter, the total number of words in the chapter, and the ratio of negative words to total words. A sketch that ranks all chapters across the series follows the table.
By this ratio, the most negative chapter overall is chapter 28 of the Half-blood prince, while chapter 35 of the Order of the Phoenix has the largest raw count of negative words.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative") #get all the negative words in bing

wordcounts <- hp_books %>%
  group_by(book, chapter) %>%
  summarize(words = n()) #count how many words in each chapter

hp_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>% #get all negative words in each chapter
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>% #divide the number of negative words by the total words in each chapter
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## # A tibble: 7 x 5
##   book                 chapter negativewords words  ratio
##   <fct>                  <int>         <int> <int>  <dbl>
## 1 Philosophers stone        15           215  5104 0.0421
## 2 Chamber of secrets        10           221  5325 0.0415
## 3 Prisoner of Azkaban       17           181  4192 0.0432
## 4 Goblet of fire            35           273  5963 0.0458
## 5 Order of the Phoenix      35           353  7773 0.0454
## 6 Half-blood prince         28           200  3597 0.0556
## 7 Deathly hallows           18           172  3484 0.0494
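
To rank all chapters across the whole series (rather than keeping one per book), a small sketch reusing the same objects:

hp_books %>%
  semi_join(bingnegative, by = "word") %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n(), .groups = "drop") %>% #negative words per chapter
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  arrange(desc(ratio)) %>% #most negative chapters first, across all seven books
  head(5)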

By building a histogram with the bing lexicon, we can find the most negative and positive words in all the books. “Like” and “well” are the most common positive words, “dark” and “death” the most common negative ones.

bing_word_counts <- hp_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 3,313 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 like   positive   2416
##  2 well   positive   1969
##  3 right  positive   1643
##  4 good   positive   1065
##  5 dark   negative   1034
##  6 great  positive    877
##  7 death  negative    757
##  8 magic  positive    606
##  9 better positive    533
## 10 enough positive    509
## # ... with 3,303 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

We can visualize the same results with a comparison word cloud.

hp_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

3.4 Additional lexicon

We will use another common lexicon for sentiment analysis called Loughran. This lexicon labels words with six sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, and “superfluous”.
We will apply the lexicon to the Philosophers stone book. The word “could” is one of the most common matches and is labeled “uncertainty”.

ph <- hp_books %>% 
  filter(book == "Philosophers stone")

ph
## # A tibble: 77,875 x 3
##    book               chapter word   
##    <fct>                <int> <chr>  
##  1 Philosophers stone       1 the    
##  2 Philosophers stone       1 boy    
##  3 Philosophers stone       1 who    
##  4 Philosophers stone       1 lived  
##  5 Philosophers stone       1 mr     
##  6 Philosophers stone       1 and    
##  7 Philosophers stone       1 mrs    
##  8 Philosophers stone       1 dursley
##  9 Philosophers stone       1 of     
## 10 Philosophers stone       1 number 
## # ... with 77,865 more rows
loughran_counts <- ph %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

loughran_counts 
## # A tibble: 328 x 3
##    word     sentiment       n
##    <chr>    <chr>       <int>
##  1 could    uncertainty   195
##  2 good     positive       88
##  3 great    positive       72
##  4 suddenly uncertainty    69
##  5 might    uncertainty    54
##  6 almost   uncertainty    47
##  7 better   positive       41
##  8 nearly   uncertainty    37
##  9 against  negative       34
## 10 best     positive       34
## # ... with 318 more rows

We can also visualize the results of the Loughran lexicon for each sentiment. The uncertainty category clearly dominates.

loughran_counts  %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

3.5 Comparing lexicons

As the last step, we will compare the results of the lexicons using the Philosophers stone book.

The AFINN lexicon.

afinn <- ph %>% 
          mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = word_count %/% 500 + 1) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
afinn 
## # A tibble: 156 x 3
##    index sentiment method
##    <dbl>     <dbl> <chr> 
##  1     1        11 AFINN 
##  2     2         4 AFINN 
##  3     3         7 AFINN 
##  4     4        12 AFINN 
##  5     5         3 AFINN 
##  6     6        21 AFINN 
##  7     7        -7 AFINN 
##  8     8         8 AFINN 
##  9     9        17 AFINN 
## 10    10        13 AFINN 
## # ... with 146 more rows

The bing and nrc lexicons together.

bing_and_nrc <- bind_rows(
  ph %>% 
    mutate(word_count = 1:n(),
                         index = word_count %/% 500 + 1) %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  ph %>% 
        mutate(word_count = 1:n(),
                         index = word_count %/% 500 + 1) %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = word_count %/% 500 + 1, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

The plot below compares the results. Overall, all the lexicons convey the rather dark mood of the Harry Potter books, though the AFINN lexicon shows more positive sections while nrc shows more negative ones.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

4. Conclusion

Using lexicons for sentiment analysis, we can understand the “mood” of a text and how it changes over its course. In the document above, we considered two sets of books: the novels of Jane Austen and the Harry Potter series. The Jane Austen books are mostly positive, while the Harry Potter books can make you feel sad by the end. We also compared the three most common lexicons; the overall results are similar for both series, though there are differences in the absolute values of the sentiment between lexicons. From the analysis, we also found the most negative chapters: by the ratio of negative words to total words, it is chapter 34 of Pride & Prejudice for Jane Austen and chapter 28 of the Half-blood prince for Harry Potter (chapter 35 of the Order of the Phoenix has the largest raw count of negative words). The additional lexicon we used, Loughran, showed similar results: uncertainty and negative words are the most common in the books.