if(!require("tidyverse")) {install.packages("tidyverse"); library("tidyverse")}
if(!require("tidytext")) {install.packages("tidytext"); library("tidytext")}
if(!require("quanteda")) {install.packages("quanteda"); library("quanteda")}
if(!require("textstem")) {install.packages("textstem"); library("textstem")}
if(!require("gutenbergr")) {install.packages("gutenbergr"); library("gutenbergr")}
if(!require("quanteda.sentiment")) {install.packages("quanteda.sentiment"); library("quanteda.sentiment")}
if(!require("scales")) {install.packages("scales"); library("scales")}
if(!require("ggplot2")) {install.packages("ggplot2"); library("ggplot2")}Sentinment Analysis
Sentiment Analysis with tidy data
The tidytext package provides access to several sentiment lexicons. The three that are used in Text Mining with R, Chapter 2 - Sentiment Analysis are:
AFINN from Finn Årup Nielsen, bing from Bing Liu and collaborators, and nrc from Saif Mohammad and Peter Turney.
Besides the lexicons used in the text, I have also incorporated three other lexicons, the latter two of which come from the quanteda.sentiment package. Those lexicons are:
loughran from Loughran, T. and McDonald, B., huliu from Hu and Liu, and gi from the Harvard General Inquirer.
I also changed the corpus to books by Charles Darwin, obtained from the gutenbergr package.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one. (Silge and Robinson 2024)
# View the lexicon data-frames
get_sentiments("afinn")# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
get_sentiments("bing")# A tibble: 6,786 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# ℹ 6,776 more rows
get_sentiments("nrc")# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
get_sentiments("loughran")# A tibble: 4,150 × 2
word sentiment
<chr> <chr>
1 abandon negative
2 abandoned negative
3 abandoning negative
4 abandonment negative
5 abandonments negative
6 abandons negative
7 abdicated negative
8 abdicates negative
9 abdicating negative
10 abdication negative
# ℹ 4,140 more rows
Since the gi and huliu lexicons are returned as quanteda dictionaries, I wanted to convert them to data-frames in order to implement and compare them easily with the other lexicons later on. (Flynn 2023)
# Create lists from the data dictionaries
huliu <- data_dictionary_HuLiu %>% as.list()
gi <- data_dictionary_geninqposneg %>% as.list()
# Split each list into positive and negative sentiment data-frames
huliu_pos <- data.frame(huliu[1], sentiment = "positive")
names(huliu_pos)[1] <- "word"
huliu_neg <- data.frame(huliu[2], sentiment = "negative")
names(huliu_neg)[1] <- "word"
gi_pos <- data.frame(gi[1], sentiment = "positive")
names(gi_pos)[1] <- "word"
gi_neg <- data.frame(gi[2], sentiment = "negative")
names(gi_neg)[1] <- "word"
# Combine the Data Frames
huliu <- rbind(huliu_pos, huliu_neg)
gi <- rbind(gi_pos, gi_neg)
# Display the data-frames
as_tibble(huliu)
# A tibble: 6,789 × 2
word sentiment
<chr> <chr>
1 a+ positive
2 abound positive
3 abounds positive
4 abundance positive
5 abundant positive
6 accessable positive
7 accessible positive
8 acclaim positive
9 acclaimed positive
10 acclamation positive
# ℹ 6,779 more rows
as_tibble(gi)
# A tibble: 3,663 × 2
word sentiment
<chr> <chr>
1 abide positive
2 ability positive
3 able positive
4 abound positive
5 absolve positive
6 absorbent positive
7 absorption positive
8 abundance positive
9 abundant positive
10 accede positive
# ℹ 3,653 more rows
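As an aside, the same dictionary-to-data-frame conversion can be wrapped in a small reusable helper. Below is a sketch (the helper name dict_to_tibble is mine, not part of any package) that is equivalent to the rbind() approach above:
# Convert a quanteda dictionary (a named list of word vectors) to a tidy tibble
dict_to_tibble <- function(dict) {
  dict |>
    as.list() |>
    tibble::enframe(name = "sentiment", value = "word") |>
    tidyr::unnest(word) |>
    dplyr::select(word, sentiment)
}
dict_to_tibble(data_dictionary_HuLiu)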
Sentiment analysis with inner join
What are the most common fear words in The Voyage of the Beagle?
First, we need to take the text of the books and convert it to the tidy format using unnest_tokens(). Let’s also set up some other columns to keep track of which line and chapter of the book each word comes from; we use group_by() and mutate() to construct those columns.
# Load Charles Darwin's top books using gutenbergr
my_mirror <- "http://mirror.csclub.uwaterloo.ca/gutenberg/"
darwin_books <- gutenberg_download(c(944, 1228, 2300, 1227), mirror = my_mirror)
as_tibble(darwin_books)
# A tibble: 79,084 × 2
gutenberg_id text
<int> <chr>
1 944 " THE VOYAGE OF THE BEAGLE BY"
2 944 " CHARLES DARWIN"
3 944 ""
4 944 ""
5 944 ""
6 944 ""
7 944 ""
8 944 "About the online edition."
9 944 ""
10 944 "The degree symbol is represented as \"degs.\" Italics are repr…
# ℹ 79,074 more rows
# Identify line numbers, chapters, and books. Delete the ID column. Tokenize the text.
darwin_books <- darwin_books |>
group_by(gutenberg_id) |>
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE))),
book = case_when(
gutenberg_id == 944 ~ "The Voyage of the Beagle",
gutenberg_id == 1228 ~ "On the Origin of Species",
gutenberg_id == 2300 ~ "The Descent of Man, and Selection in Relation to Sex",
gutenberg_id == 1227 ~ "The Expression of the Emotions in Man and Animals"
)) |>
ungroup() |>
select(-gutenberg_id) |>
unnest_tokens(word, text)
as_tibble(darwin_books)
# A tibble: 786,575 × 4
linenumber chapter book word
<int> <int> <chr> <chr>
1 1 0 The Voyage of the Beagle the
2 1 0 The Voyage of the Beagle voyage
3 1 0 The Voyage of the Beagle of
4 1 0 The Voyage of the Beagle the
5 1 0 The Voyage of the Beagle beagle
6 1 0 The Voyage of the Beagle by
7 2 0 The Voyage of the Beagle charles
8 2 0 The Voyage of the Beagle darwin
9 8 0 The Voyage of the Beagle about
10 8 0 The Voyage of the Beagle the
# ℹ 786,565 more rows
First, let’s use the NRC lexicon and filter() for the fear words. Next, let’s filter() the data frame with the text from the books for the words from The Voyage of the Beagle and then use inner_join() to perform the sentiment analysis. What are the most common fear words in The Voyage of the Beagle? Let’s use count() from dplyr.
# Use nrc lexicon to filter the fear words
nrc_fear <- get_sentiments("nrc") |>
filter(sentiment == "fear")
darwin_books |>
filter(book == "The Voyage of the Beagle") |>
inner_join(nrc_fear) |>
count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 555 × 2
word n
<chr> <int>
1 case 129
2 doubt 80
3 broken 74
4 elevation 60
5 owing 60
6 fire 59
7 change 56
8 difficulty 55
9 lines 55
10 earthquake 52
# ℹ 545 more rows
We see the counts of the words that can evoke fear in The Voyage of the Beagle.
Next, we use the bing lexicon to find a sentiment score for each section of text. We use integer division (%/%) to define larger sections of text that span multiple lines; linenumber %/% 80, for example, assigns the same index to every consecutive block of 80 lines. We can use the same pattern with count(), pivot_wider(), and mutate() to find the net sentiment in each of these sections of text.
# Use Bing lexicon to find sentiment score for each section of text
charles_darwin_sentiment <- darwin_books |>
inner_join(get_sentiments("bing")) |>
count(book, index = linenumber %/% 80, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
# Plot the sentiment score across each novel
ggplot(charles_darwin_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Comparing the 5 sentiment dictionaries
Let’s use all 5 sentiment lexicons and examine how the sentiment changes across the narrative arc of On the Origin of Species.
# Filter the book "On the Origin of Species"
origin_species <- darwin_books |>
filter(book == "On the Origin of Species")
origin_species
# A tibble: 157,002 × 4
linenumber chapter book word
<int> <int> <chr> <chr>
1 1 0 On the Origin of Species click
2 1 0 On the Origin of Species on
3 1 0 On the Origin of Species any
4 1 0 On the Origin of Species of
5 1 0 On the Origin of Species the
6 1 0 On the Origin of Species filenumbers
7 1 0 On the Origin of Species below
8 1 0 On the Origin of Species to
9 1 0 On the Origin of Species quickly
10 1 0 On the Origin of Species view
# ℹ 156,992 more rows
# Use inner_join() to calculate the sentiment in different ways
afinn <- origin_species |>
inner_join(get_sentiments("afinn")) |>
group_by(index = linenumber %/% 80) |>
summarise(sentiment = sum(value)) |>
mutate(method = "AFINN")Joining with `by = join_by(word)`
bing_nrc_loughran_gi_huliu <- bind_rows(
origin_species |>
inner_join(get_sentiments("bing")) |>
mutate(method = "Bing et al."),
origin_species |>
inner_join(get_sentiments("loughran")) |>
mutate(method = "Loughran"),
origin_species |>
inner_join(get_sentiments("nrc") |>
filter(sentiment %in% c("positive", "negative"))) |>
mutate(method = "NRC"),
origin_species |>
inner_join(gi) |>
mutate(method = "GI"),
origin_species |>
inner_join(huliu) |>
mutate(method = "HuLiu")
) |>
count(method, index = linenumber %/% 80, sentiment) |>
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) |>
mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 343 of `x` matches multiple rows in `y`.
ℹ Row 2998 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, filter(get_sentiments("nrc"), sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 182 of `x` matches multiple rows in `y`.
ℹ Row 4873 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, gi): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 114 of `x` matches multiple rows in `y`.
ℹ Row 3443 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
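These join messages and many-to-many warnings are expected rather than errors: some lexicon words carry more than one sentiment label, and a word can occur many times in the text. Assuming dplyr 1.1.0 or later, the relationship can be declared explicitly to silence the warning; a sketch for the loughran join:
# Declare the many-to-many relationship explicitly (dplyr >= 1.1.0);
# alternatively, deduplicate the lexicon first with distinct(word, .keep_all = TRUE)
origin_species |>
  inner_join(get_sentiments("loughran"), relationship = "many-to-many") |>
  mutate(method = "Loughran")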
# Bind and Visualize the sentiment score across the book
bind_rows(afinn, bing_nrc_loughran_gi_huliu) |>
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ method, ncol = 1, scales = "free_y")
The 5 different lexicons show different sentiment scores across the narrative arc of On the Origin of Species. We see similar dips and peaks in sentiment at about the same places in the book, but the absolute scores differ from lexicon to lexicon. The Loughran lexicon seems to produce the most negative sentiment scores.
Why does the Loughran lexicon have the most negative sentiment scores? Let’s look briefly at how many positive and negative words are in these lexicons.
The Loughran lexicon has the most negative sentiment scores because it has the highest ratio of negative words, with roughly 87% of its positive/negative entries being negative. This is probably because Loughran is meant for financial text and therefore contains many negative terms specific to that domain.
# Count the positive and negative words in the lexicons and add a ratio for more clarity
for (i in c("nrc", "bing", "loughran")) {
print(get_sentiments(i) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n)))
}
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 3316 0.590
2 positive 2308 0.410
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 4781 0.705
2 positive 2005 0.295
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 2355 0.869
2 positive 354 0.131
as_tibble(huliu) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n))
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 4783 0.705
2 positive 2006 0.295
as_tibble(gi) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n))
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 2010 0.549
2 positive 1653 0.451
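For a side-by-side comparison, the five positive/negative ratios can also be computed in a single pipeline; a sketch using the huliu and gi data frames built earlier:
# Combine all 5 lexicons, keep positive/negative words, and compute ratios
list(nrc = get_sentiments("nrc"),
     bing = get_sentiments("bing"),
     loughran = get_sentiments("loughran"),
     huliu = as_tibble(huliu),
     gi = as_tibble(gi)) |>
  bind_rows(.id = "lexicon") |>
  filter(sentiment %in% c("positive", "negative")) |>
  count(lexicon, sentiment) |>
  group_by(lexicon) |>
  mutate(ratio = n / sum(n)) |>
  ungroup()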
Conclusion
I was able to compare the sentiment scores across the narrative arc of On the Origin of Species using five different sentiment lexicons, and I found that the Loughran lexicon produced the most negative scores because of its high proportion of negative words. I was also able to examine the proportion of positive and negative words in each lexicon. I would have liked to incorporate another lexicon, lang15, but due to time constraints I could not get it to function properly. I would also have liked a more in-depth analysis of the sentiment scores across the narrative arcs of all of Charles Darwin’s books. All in all, I learned how to use several packages and how to perform sentiment analysis in R.
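If lang15 refers to the Lexicoder Sentiment Dictionary 2015 (available in the quanteda packages loaded above as data_dictionary_LSD2015; this identification is my assumption), one likely obstacle is that its entries are wildcard patterns such as abandon*, which an exact inner_join() on single words will rarely match. A sketch of how it might be converted, under that assumption:
# Assumption: "lang15" = data_dictionary_LSD2015 (Lexicoder Sentiment Dictionary 2015)
# Its entries are glob patterns (e.g., "abandon*"), so strip the trailing
# wildcard before attempting an exact-match join as done above
lsd <- data_dictionary_LSD2015 |> as.list()
lsd <- rbind(data.frame(word = lsd[["positive"]], sentiment = "positive"),
             data.frame(word = lsd[["negative"]], sentiment = "negative"))
lsd$word <- sub("\\*$", "", lsd$word)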