if(!require("tidyverse")) {install.packages("tidyverse"); library("tidyverse")}
if(!require("tidytext")) {install.packages("tidytext"); library("tidytext")}
if(!require("quanteda")) {install.packages("quanteda"); library("quanteda")}
if(!require("textstem")) {install.packages("textstem"); library("textstem")}
if(!require("gutenbergr")) {install.packages("gutenbergr"); library("gutenbergr")}
if(!require("quanteda.sentiment")) {install.packages("quanteda.sentiment"); library("quanteda.sentiment")}
if(!require("scales")) {install.packages("scales"); library("scales")}
if(!require("ggplot2")) {install.packages("ggplot2"); library("ggplot2")}Sentinment Analysis
Sentiment Analysis with tidy data
The tidytext package provides access to several sentiment lexicons. The three that are used in Text Mining with R, Chapter 2 - Sentiment Analysis are:
AFINN from Finn Årup Nielsen, bing from Bing Liu and collaborators, and nrc from Saif Mohammad and Peter Turney.
Besides the lexicons used in the text, I have also incorporated three other lexicons, the latter two of which come from the quanteda.sentiment package. Those lexicons are:
loughran from Loughran, T. and McDonald, B., huliu from Hu and Liu, and gi from the Harvard General Inquirer.
I also changed the corpus to books by Charles Darwin, obtained from the gutenbergr package.
The function get_sentiments() allows us to get specific sentiment lexicons with the appropriate measures for each one. (Silge and Robinson 2024)
# View the lexicon data-frames
get_sentiments("afinn")# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
get_sentiments("bing")# A tibble: 6,786 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# ℹ 6,776 more rows
get_sentiments("nrc")# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
get_sentiments("loughran")# A tibble: 4,150 × 2
word sentiment
<chr> <chr>
1 abandon negative
2 abandoned negative
3 abandoning negative
4 abandonment negative
5 abandonments negative
6 abandons negative
7 abdicated negative
8 abdicates negative
9 abdicating negative
10 abdication negative
# ℹ 4,140 more rows
Since the gi and huliu lexicons are returned as quanteda dictionaries, I wanted to convert them to data-frames in order to implement and compare them easily with the other lexicons later on. (Flynn 2023)
# Create lists from the data dictionaries
huliu <- data_dictionary_HuLiu %>% as.list()
gi <- data_dictionary_geninqposneg %>% as.list()
# Split each list into positive and negative sentiment data-frames
huliu_pos <- data.frame(huliu[1], sentiment = "positive")
names(huliu_pos)[1] <- "word"
huliu_neg <- data.frame(huliu[2], sentiment = "negative")
names(huliu_neg)[1] <- "word"
gi_pos <- data.frame(gi[1], sentiment = "positive")
names(gi_pos)[1] <- "word"
gi_neg <- data.frame(gi[2], sentiment = "negative")
names(gi_neg)[1] <- "word"
# Combine the Data Frames
huliu <- rbind(huliu_pos, huliu_neg)
gi <- rbind(gi_pos, gi_neg)
# Display the data-frames
as_tibble(huliu)
# A tibble: 6,789 × 2
word sentiment
<chr> <chr>
1 a+ positive
2 abound positive
3 abounds positive
4 abundance positive
5 abundant positive
6 accessable positive
7 accessible positive
8 acclaim positive
9 acclaimed positive
10 acclamation positive
# ℹ 6,779 more rows
as_tibble(gi)
# A tibble: 3,663 × 2
word sentiment
<chr> <chr>
1 abide positive
2 ability positive
3 able positive
4 abound positive
5 absolve positive
6 absorbent positive
7 absorption positive
8 abundance positive
9 abundant positive
10 accede positive
# ℹ 3,653 more rows
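As an aside, the same dictionary-to-data-frame conversion can be wrapped in a small reusable helper. Below is a sketch (the helper name dict_to_tibble is mine, not part of any package) that is equivalent to the rbind() approach above:
# Convert a quanteda dictionary (a named list of word vectors) to a tidy tibble
dict_to_tibble <- function(dict) {
  dict |>
    as.list() |>
    tibble::enframe(name = "sentiment", value = "word") |>
    tidyr::unnest(word) |>
    dplyr::select(word, sentiment)
}
dict_to_tibble(data_dictionary_HuLiu)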
Sentiment analysis with inner join
What are the most common fear words in The Voyage of the Beagle?
First, we need to take the text of the books and convert it to the tidy format using unnest_tokens(). Let’s also set up some other columns to keep track of which line and chapter of the book each word comes from; we use group_by() and mutate() to construct those columns.
# Load Charles Darwin's top books using gutenbergr
my_mirror <- "http://mirror.csclub.uwaterloo.ca/gutenberg/"
darwin_books <- gutenberg_download(c(944, 1228, 2300, 1227), mirror = my_mirror)
as_tibble(darwin_books)
# A tibble: 79,084 × 2
gutenberg_id text
<int> <chr>
1 944 " THE VOYAGE OF THE BEAGLE BY"
2 944 " CHARLES DARWIN"
3 944 ""
4 944 ""
5 944 ""
6 944 ""
7 944 ""
8 944 "About the online edition."
9 944 ""
10 944 "The degree symbol is represented as \"degs.\" Italics are repr…
# ℹ 79,074 more rows
# Identify line numbers, chapters, and books. Delete the ID column. Tokenize the text.
darwin_books <- darwin_books |>
group_by(gutenberg_id) |>
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE))),
book = case_when(
gutenberg_id == 944 ~ "The Voyage of the Beagle",
gutenberg_id == 1228 ~ "On the Origin of Species",
gutenberg_id == 2300 ~ "The Descent of Man, and Selection in Relation to Sex",
gutenberg_id == 1227 ~ "The Expression of the Emotions in Man and Animals"
)) |>
ungroup() |>
select(-gutenberg_id) |>
unnest_tokens(word, text)
as_tibble(darwin_books)
# A tibble: 786,575 × 4
linenumber chapter book word
<int> <int> <chr> <chr>
1 1 0 The Voyage of the Beagle the
2 1 0 The Voyage of the Beagle voyage
3 1 0 The Voyage of the Beagle of
4 1 0 The Voyage of the Beagle the
5 1 0 The Voyage of the Beagle beagle
6 1 0 The Voyage of the Beagle by
7 2 0 The Voyage of the Beagle charles
8 2 0 The Voyage of the Beagle darwin
9 8 0 The Voyage of the Beagle about
10 8 0 The Voyage of the Beagle the
# ℹ 786,565 more rows
First, let’s use the NRC lexicon and filter() for the fear words. Next, let’s filter() the data frame with the text from the books for the words from The Voyage of the Beagle and then use inner_join() to perform the sentiment analysis. What are the most common fear words in The Voyage of the Beagle? Let’s use count() from dplyr.
# Use nrc lexicon to filter the fear words
nrc_fear <- get_sentiments("nrc") |>
filter(sentiment == "fear")
darwin_books |>
filter(book == "The Voyage of the Beagle") |>
inner_join(nrc_fear) |>
count(word, sort = TRUE)
Joining with `by = join_by(word)`
# A tibble: 555 × 2
word n
<chr> <int>
1 case 129
2 doubt 80
3 broken 74
4 elevation 60
5 owing 60
6 fire 59
7 change 56
8 difficulty 55
9 lines 55
10 earthquake 52
# ℹ 545 more rows
We see the counts of the words that can evoke fear in The Voyage of the Beagle.
Next, we use the bing lexicon to find a sentiment score for each section of text. We use integer division (%/%) to define larger sections of text that span multiple lines; linenumber %/% 80, for example, assigns the same index to every consecutive block of 80 lines. We can use the same pattern with count(), pivot_wider(), and mutate() to find the net sentiment in each of these sections of text.
# Use Bing lexicon to find sentiment score for each section of text
charles_darwin_sentiment <- darwin_books |>
inner_join(get_sentiments("bing")) |>
count(book, index = linenumber %/% 80, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
# Plot the sentiment score across each novel
ggplot(charles_darwin_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Comparing the 5 sentiment dictionaries
Let’s use all 5 sentiment lexicons and examine how the sentiment changes across the narrative arc of On the Origin of Species.
# Filter the book "On the Origin of Species"
origin_species <- darwin_books |>
filter(book == "On the Origin of Species")
origin_species
# A tibble: 157,002 × 4
linenumber chapter book word
<int> <int> <chr> <chr>
1 1 0 On the Origin of Species click
2 1 0 On the Origin of Species on
3 1 0 On the Origin of Species any
4 1 0 On the Origin of Species of
5 1 0 On the Origin of Species the
6 1 0 On the Origin of Species filenumbers
7 1 0 On the Origin of Species below
8 1 0 On the Origin of Species to
9 1 0 On the Origin of Species quickly
10 1 0 On the Origin of Species view
# ℹ 156,992 more rows
# Use inner_join() to calculate the sentiment in different ways
afinn <- origin_species |>
inner_join(get_sentiments("afinn")) |>
group_by(index = linenumber %/% 80) |>
summarise(sentiment = sum(value)) |>
mutate(method = "AFINN")Joining with `by = join_by(word)`
bing_nrc_loughran_gi_huliu <- bind_rows(
origin_species |>
inner_join(get_sentiments("bing")) |>
mutate(method = "Bing et al."),
origin_species |>
inner_join(get_sentiments("loughran")) |>
mutate(method = "Loughran"),
origin_species |>
inner_join(get_sentiments("nrc") |>
filter(sentiment %in% c("positive", "negative"))) |>
mutate(method = "NRC"),
origin_species |>
inner_join(gi) |>
mutate(method = "GI"),
origin_species |>
inner_join(huliu) |>
mutate(method = "HuLiu")
) |>
count(method, index = linenumber %/% 80, sentiment) |>
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) |>
mutate(sentiment = positive - negative)
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 343 of `x` matches multiple rows in `y`.
ℹ Row 2998 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, filter(get_sentiments("nrc"), sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 182 of `x` matches multiple rows in `y`.
ℹ Row 4873 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
Warning in inner_join(origin_species, gi): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 114 of `x` matches multiple rows in `y`.
ℹ Row 3443 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Joining with `by = join_by(word)`
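These join messages and many-to-many warnings are expected rather than errors: some lexicon words carry more than one sentiment label, and a word can occur many times in the text. Assuming dplyr 1.1.0 or later, the relationship can be declared explicitly to silence the warning; a sketch for the loughran join:
# Declare the many-to-many relationship explicitly (dplyr >= 1.1.0);
# alternatively, deduplicate the lexicon first with distinct(word, .keep_all = TRUE)
origin_species |>
  inner_join(get_sentiments("loughran"), relationship = "many-to-many") |>
  mutate(method = "Loughran")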
# Bind and Visualize the sentiment score across the book
bind_rows(afinn, bing_nrc_loughran_gi_huliu) |>
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ method, ncol = 1, scales = "free_y")
The 5 different lexicons show different sentiment scores across the narrative arc of On the Origin of Species. We see similar dips and peaks in sentiment at about the same places in the book, but the absolute scores differ from lexicon to lexicon. The Loughran lexicon seems to produce the most negative sentiment scores.
Why does the Loughran lexicon have the most negative sentiment scores? Let’s look briefly at how many positive and negative words are in these lexicons.
The Loughran lexicon has the most negative sentiment scores because it has the highest ratio of negative words, with roughly 87% of its positive/negative entries being negative. This is probably because Loughran is meant for financial text and therefore contains many negative terms specific to that domain.
# Count the positive and negative words in the lexicons and add a ratio for more clarity
for (i in c("nrc", "bing", "loughran")) {
print(get_sentiments(i) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n)))
}
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 3316 0.590
2 positive 2308 0.410
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 4781 0.705
2 positive 2005 0.295
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 2355 0.869
2 positive 354 0.131
as_tibble(huliu) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n))
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 4783 0.705
2 positive 2006 0.295
as_tibble(gi) |>
filter(sentiment %in% c("positive", "negative")) |>
count(sentiment) |>
mutate(ratio = n / sum(n))
# A tibble: 2 × 3
sentiment n ratio
<chr> <int> <dbl>
1 negative 2010 0.549
2 positive 1653 0.451
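For a side-by-side comparison, the five positive/negative ratios can also be computed in a single pipeline; a sketch using the huliu and gi data frames built earlier:
# Combine all 5 lexicons, keep positive/negative words, and compute ratios
list(nrc = get_sentiments("nrc"),
     bing = get_sentiments("bing"),
     loughran = get_sentiments("loughran"),
     huliu = as_tibble(huliu),
     gi = as_tibble(gi)) |>
  bind_rows(.id = "lexicon") |>
  filter(sentiment %in% c("positive", "negative")) |>
  count(lexicon, sentiment) |>
  group_by(lexicon) |>
  mutate(ratio = n / sum(n)) |>
  ungroup()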
Conclusion
I was able to compare the sentiment scores across the narrative arc of On the Origin of Species using five different sentiment lexicons, and I found that the Loughran lexicon produced the most negative scores because of its high proportion of negative words. I was also able to examine the proportion of positive and negative words in each lexicon. I would have liked to incorporate another lexicon, lang15, but due to time constraints I could not get it to function properly. I would also have liked a more in-depth analysis of the sentiment scores across the narrative arcs of all of Charles Darwin’s books. All in all, I learned how to use several packages and how to perform sentiment analysis in R.
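If lang15 refers to the Lexicoder Sentiment Dictionary 2015 (available in the quanteda packages loaded above as data_dictionary_LSD2015; this identification is my assumption), one likely obstacle is that its entries are wildcard patterns such as abandon*, which an exact inner_join() on single words will rarely match. A sketch of how it might be converted, under that assumption:
# Assumption: "lang15" = data_dictionary_LSD2015 (Lexicoder Sentiment Dictionary 2015)
# Its entries are glob patterns (e.g., "abandon*"), so strip the trailing
# wildcard before attempting an exact-match join as done above
lsd <- data_dictionary_LSD2015 |> as.list()
lsd <- rbind(data.frame(word = lsd[["positive"]], sentiment = "positive"),
             data.frame(word = lsd[["negative"]], sentiment = "negative"))
lsd$word <- sub("\\*$", "", lsd$word)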