Sentiment Analysis

“Sentiment Analysis is the process of computationally identifying and categorizing opinions in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.” - Oxford Dictionary

The purpose of this project is two-fold: first, to take a deep dive into the mechanics and application of sentiment analysis by following an example provided by Julia Silge and David Robinson in their book “Text Mining with R: A Tidy Approach”; second, to choose another corpus and incorporate a lexicon not used in that example to perform sentiment analysis.

Part I - The Example

The following code is from Chapter 2 of “Text Mining with R - A Tidy Approach”, entitled “Sentiment Analysis with Tidy Data”. A full citation of the code can be found at the end of the code excerpt.

2.1 - The Sentiments Dataset

library(janeaustenr)
library(tidyverse)
library(stringr)
library(tidytext)

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

2.2 Sentiment Analysis with Inner Join

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # ... with 725,045 more rows
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
jane_austen_sentiment
## # A tibble: 920 x 5
##    book                index negative positive sentiment
##    <fct>               <dbl>    <dbl>    <dbl>     <dbl>
##  1 Sense & Sensibility     0       16       32        16
##  2 Sense & Sensibility     1       19       53        34
##  3 Sense & Sensibility     2       12       31        19
##  4 Sense & Sensibility     3       15       31        16
##  5 Sense & Sensibility     4       16       34        18
##  6 Sense & Sensibility     5       16       51        35
##  7 Sense & Sensibility     6       24       40        16
##  8 Sense & Sensibility     7       23       51        28
##  9 Sense & Sensibility     8       30       40        10
## 10 Sense & Sensibility     9       15       19         4
## # ... with 910 more rows
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

2.3 Comparing the three sentiment dictionaries

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"Joining, by = "word"
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

get_sentiments("nrc") %>% 
     filter(sentiment %in% c("positive", 
                             "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

2.4 Most Common Positive and Negative Words

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

2.5 Wordclouds

library(wordcloud)
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

library(reshape2)
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"


The code above was sourced from: Silge, Julia, and David Robinson. “Text Mining with R: A Tidy Approach.” O’Reilly Media, 2017. Chapter 2, “Sentiment Analysis with Tidy Data.” https://www.tidytextmining.com/sentiment.html

Part II - Taking the Reins

For part II of this project, I have chosen to analyze text from the book “The Count of Monte Cristo” by Alexandre Dumas, which was written in 1844. To get the text of this book, I used the gutenbergr library, which allows for the search and download of public domain texts.

library(gutenbergr)
count_of_monte_cristo <- gutenberg_download(1184) 
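
The Gutenberg ID (1184 here) can be looked up through the package’s metadata. A minimal sketch, assuming the title in the metadata matches exactly (str_detect() could be used for fuzzier matching):

# search the Project Gutenberg metadata for the book's ID
gutenberg_works(title == "The Count of Monte Cristo")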

Now that we have the book downloaded, let’s take a look at its structure.

count_of_monte_cristo
## # A tibble: 61,339 x 2
##    gutenberg_id text                       
##           <int> <chr>                      
##  1         1184 "THE COUNT OF MONTE CRISTO"
##  2         1184 ""                         
##  3         1184 ""                         
##  4         1184 ""                         
##  5         1184 "by Alexandre Dumas [père]"
##  6         1184 ""                         
##  7         1184 ""                         
##  8         1184 ""                         
##  9         1184 ""                         
## 10         1184 ""                         
## # ... with 61,329 more rows

Looking at the above output, this text is not in a “tidy text” format and will need to be transformed before analysis can take place. In addition, the first 158 lines of this text contain the table of contents, so we will exclude them. We will then use the unnest_tokens() function from the tidytext library to break the text into one word per row.

# remove the first 158 rows of text, which contain the table of contents
count_cristo <- count_of_monte_cristo %>% slice(159:n())

# use unnest_tokens() to break each line into individual word rows
comc <- count_cristo %>% unnest_tokens(word, text)
comc
## # A tibble: 464,757 x 2
##    gutenberg_id word      
##           <int> <chr>     
##  1         1184 chapter   
##  2         1184 1         
##  3         1184 marseilles
##  4         1184 the       
##  5         1184 arrival   
##  6         1184 on        
##  7         1184 the       
##  8         1184 24th      
##  9         1184 of        
## 10         1184 february  
## # ... with 464,747 more rows

Now that our text is “tidy”, we can begin our analysis. Having never read the book (I have seen the movie, though), I have a lot to discover. Let’s first take a high-level look at the text: is this book more positive or negative?

comc_sentiment <- comc %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(sentiment) %>%
  mutate(total = n / sum(n))

ggplot(comc_sentiment, aes(x = sentiment, y = total)) + 
  geom_col() + 
  geom_text(aes(label = round(total, 4) * 100), vjust = -0.5) + 
  labs(title = "Overall Sentiment of Count of Monte Cristo",
       x = "Sentiment", y = "Percent") + 
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

It appears that, on the whole, this book is split almost 50/50 between positive and negative sentiment. Let’s see which words contribute most to the negative side.

comc %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE) %>% 
  top_n(10) %>%
  mutate(word = reorder(word, desc(n))) %>%
  ggplot(aes(x = word, y = n)) + 
  geom_col() + 
  geom_text(aes(label = n), vjust = -0.5) + 
  labs(title = "Most Frequent Negative Words adding to Negative Sentiment",
       x = "Word", y = "Count") + 
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Looking at the chart above, it is clear that all of these words contribute to the novel’s negative sentiment, with “death” and “poor” affecting it more than the others. We can also see that the lexicon has categorized each of these words correctly. Now let’s look at the positive words to see if any stick out.

comc %>% 
  inner_join(get_sentiments("bing")) %>% 
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE) %>% 
  top_n(10) %>%
  mutate(word = reorder(word, desc(n))) %>%
  ggplot(aes(x = word, y = n)) + 
  geom_col() + 
  geom_text(aes(label = n), vjust = -0.5) + 
  labs(title = "Most Frequent Positive Words adding to Positive Sentiment",
       x = "Word", y = "Count") + 
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

In contrast to the negative chart above, the word “well” is the obvious winner here, adding the most positive sentiment to the novel, with “like” contributing more than the remaining words. One caveat: both words often appear in non-sentiment senses (“well” as an interjection, “like” as a comparison), so it is worth spot-checking the raw lines.
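
A sketch of such a spot check, pulling a few raw lines that contain “like”:

# peek at raw lines containing "like" to see how the word is actually used
count_cristo %>% 
  filter(str_detect(text, regex("\\blike\\b", ignore_case = TRUE))) %>% 
  head(5)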

Now that we’ve got a feel for the overall novel as a whole, let’s see if we can get a feel for how the sentiment changes over the course of the novel. To do this, we’ll experiment with a different lexicon, “afinn”, which gives a value from -5 to 5 for each word. In addition, we’ll start this analysis by looking at our original data set (excluding the table of contents), which shows full lines of text as seen below.

count_cristo
## # A tibble: 61,181 x 2
##    gutenberg_id text                                                            
##           <int> <chr>                                                           
##  1         1184 " Chapter 1. Marseilles—The Arrival"                            
##  2         1184 ""                                                              
##  3         1184 "On the 24th of February, 1815, the look-out at Notre-Dame de l~
##  4         1184 "signalled the three-master, the _Pharaon_ from Smyrna, Trieste~
##  5         1184 "Naples."                                                       
##  6         1184 ""                                                              
##  7         1184 "As usual, a pilot put off immediately, and rounding the Châtea~
##  8         1184 "got on board the vessel between Cape Morgiou and Rion island." 
##  9         1184 ""                                                              
## 10         1184 "Immediately, and according to custom, the ramparts of Fort Sai~
## # ... with 61,171 more rows

We will create two new columns, one showing the original row number, and the other showing which chapter the row came from. This will give us the flexibility to look at the sentiment over the course of both chapters and line numbers.

# chapter headings look like " Chapter 1.", so a match on "chapter" followed
# by a number marks the start of each chapter
comc_index <- count_cristo %>% 
  filter(text != "") %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("chapter \\d+",
                                                 ignore_case = TRUE))))

comc_index
## # A tibble: 45,040 x 4
##    gutenberg_id text                                          linenumber chapter
##           <int> <chr>                                              <int>   <int>
##  1         1184 " Chapter 1. Marseilles—The Arrival"                   1       1
##  2         1184 "On the 24th of February, 1815, the look-out~          2       1
##  3         1184 "signalled the three-master, the _Pharaon_ f~          3       1
##  4         1184 "Naples."                                              4       1
##  5         1184 "As usual, a pilot put off immediately, and ~          5       1
##  6         1184 "got on board the vessel between Cape Morgio~          6       1
##  7         1184 "Immediately, and according to custom, the r~          7       1
##  8         1184 "were covered with spectators; it is always ~          8       1
##  9         1184 "ship to come into port, especially when thi~          9       1
## 10         1184 "has been built, rigged, and laden at the ol~         10       1
## # ... with 45,030 more rows
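
Since this running chapter count drives the rest of the analysis, a quick sanity check is worthwhile: the novel has 117 chapters, so the counter should top out there.

# the maximum of the running chapter count should equal 117
comc_index %>% summarise(total_chapters = max(chapter))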

Now let’s use unnest_tokens() again to get this data set into a “tidy” structure and join in the “afinn” lexicon.

comc_tidy <- comc_index %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("afinn")) 
## Joining, by = "word"

Now that we have a clean data set, let’s move forward with our analysis of sentiment change over the course of the book. As there are 117 chapters in The Count of Monte Cristo, let’s break the book into sections of 2 chapters each so it’s easier to visualize.

chapter_sentiment <- comc_tidy %>% 
  select(chapter, value) %>%
  group_by(chapter = chapter %/% 2) %>% 
  summarize(net_sentiment = sum(value))

ggplot(chapter_sentiment, aes(x = as.factor(chapter), y = net_sentiment)) + 
  geom_col(fill = "dodgerblue2") +
  labs(title = "Net Sentiment From Beginning to End of Count of Monte Cristo",
       x = "Index - Each Index Includes Two Chapters", y = "Net Sentiment") + 
  theme(
    panel.background = element_rect(fill = "grey95", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Looking at the above chart, we can get a feel for how the sentiment changes throughout the novel. The most negative stretch falls around chapters 7 and 8. A quick Google search tells me that this is the part of the book where the main character is betrayed and falsely condemned to life in prison, along with his first few days there. We can also see quite a bit of positive sentiment toward the middle of the book, and then, in the latter half, the sentiment begins to shift back toward negativity.
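
To double-check that reading, we can sort the two-chapter groups by net sentiment, reusing the summary table built above:

# the most negative two-chapter groups, lowest net sentiment first
chapter_sentiment %>% 
  arrange(net_sentiment) %>% 
  head(5)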

Now let’s get a little more adventurous. There is a less widely used lexicon called “loughran”. This lexicon maps words to the six sentiment categories you can see below.

sent <- get_sentiments("loughran") 

unique(sent$sentiment)
## [1] "negative"     "positive"     "uncertainty"  "litigious"    "constraining"
## [6] "superfluous"

Some of these category names are less familiar, so let’s define them:

  1. Litigious: unreasonably prone to go to law to settle disputes.
  2. Superfluous: unnecessary, especially through being more than enough.

Having defined these terms, let’s explore this lexicon and see what types of words in The Count of Monte Cristo are litigious and superfluous.

comc_index %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("loughran")) %>%
  filter(sentiment %in% c("litigious", "superfluous")) %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(word, desc(n)), y = n)) + 
  geom_col() +
  facet_grid(~sentiment, scales = "free_x") + 
  geom_text(aes(label = n), vjust = -0.5) + 
  labs(title = "Looking at Words Associated with Litigious and Superfluous",
       x = "Word", y = "Count") + 
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

In the chart above, we can see that “litigious” does map strongly onto words with a legal connotation. For “superfluous”, I’m having a hard time seeing where a breakout of this sentiment would be helpful, except perhaps in contexts where you want to detect arrogant or inflated speech. This suggests the Loughran lexicon is more of a “domain-specific” lexicon (it was, in fact, developed for financial documents). A similar and possibly more useful lexicon for looking at emotion is the NRC lexicon, which contains the following emotions:

sent <- get_sentiments("nrc") 

unique(sent$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

You can already see that this lexicon is likely better suited to everyday use than the Loughran lexicon. To conclude our brief analysis of The Count of Monte Cristo, let’s take a look at which words have the highest counts for each emotion in the NRC lexicon.

comc_index %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(word, desc(n)), y = n)) + 
  geom_col() +
  facet_wrap(~sentiment, ncol = 2, scales = "free_x") + 
  geom_text(aes(label = n), vjust = -0.5) + 
  labs(title = "Top Words Associated with Each NRC Emotion",
       x = "Word", y = "Count") + 
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    axis.text.y = element_blank(), 
    axis.ticks.y = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

Looking at the above charts, you can get a good feel for which words tie to which emotions. We can also see that the lexicon is picking up the word “count”, probably as a term of trust, as in “you can count on me”. However, the title of our book is The Count of Monte Cristo, and the main character is addressed as “Count” throughout, which is definitely skewing our analysis here. Were we to extend this, we would want to use anti_join() to remove the word “count”, as sketched below.
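
A minimal sketch of that fix, treating “count” as a custom stop word:

# drop the title word "count" before joining the lexicon
custom_stops <- tibble(word = "count")

comc_index %>% 
  unnest_tokens(word, text) %>% 
  anti_join(custom_stops, by = "word") %>% 
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(word, sentiment, sort = TRUE)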

Conclusion

Sentiment analysis can be incredibly powerful if done correctly. In order to have a powerful story to tell about the text you are analyzing, you need to make sure you are asking the right questions about it. By this I mean: are you measuring how positive or negative a text is? Are you trying to see what emotions it contains? Or are you looking for something else? Answering these questions will help you answer the most important question of your analysis: which lexicon should I use?

Selecting the proper lexicon will make or break your analysis. It is important to spend time understanding the strengths and weaknesses of each lexicon, and to do some exploratory data analysis with several of them, to make sure you select the right one for the questions you have asked. Additionally, you will want to check the top words contributing to your sentiment scores to make sure you don’t have false positives, such as our “count” example above.
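
As a rough starting point for that exploratory work, here is a sketch that checks what share of the novel’s distinct words each lexicon scores at all, using the comc tokens from Part II:

# proportion of the book's distinct words covered by each lexicon
lexicons <- c("afinn", "bing", "nrc", "loughran")
sapply(lexicons, function(lex) {
  mean(unique(comc$word) %in% get_sentiments(lex)$word)
})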