For this assignment, I downloaded “The Odyssey” from Project Gutenberg using the gutenbergr package and performed sentiment analysis on its contents. The book is a translation of the Greek epic poem composed by Homer in the 8th century BC. It follows the story of Odysseus, king of Ithaca, as he journeys home to his wife after the Trojan War. When I read this book in high school, I remember it as a tragedy: Odysseus lost most of his men on the way home, and his wife grieved because she believed he had died during the war. I would therefore expect to see mostly negative sentiment throughout the book.
The first part of this R Markdown document is code from chapter 2 of Text Mining with R by Julia Silge and David Robinson. The primary example code analyzes the sentiment of books by Jane Austen. It also compares three sentiment lexicons (AFINN, Bing, and NRC) across the book “Pride and Prejudice”. The section following this example is a sentiment analysis of “The Odyssey”. For the additional sentiment lexicon, I decided to go with the “loughran” lexicon, which was developed as a tool for financial sentiment analysis. It will be interesting to see how closely this finance lexicon agrees with the other three lexicons.
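One way to quantify how similar two lexicons are is to look at the words they share and check how often they agree on polarity. The sketch below is only an illustration: `toy_bing` and `toy_loughran` are hypothetical miniature lexicons invented for this example, standing in for the tibbles that `get_sentiments("bing")` and `get_sentiments("loughran")` would return.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature lexicons, invented for illustration only.
# In the real analysis these would come from get_sentiments().
toy_bing <- tribble(
  ~word,   ~sentiment,
  "loss",  "negative",
  "grief", "negative",
  "glory", "positive",
  "brave", "positive"
)

toy_loughran <- tribble(
  ~word,   ~sentiment,
  "loss",  "negative",
  "claim", "litigious",
  "glory", "positive",
  "doubt", "uncertainty"
)

# Keep only the polarity categories that both lexicons use
loughran_polarity <- toy_loughran %>%
  filter(sentiment %in% c("positive", "negative"))

# Words present in both lexicons, with each lexicon's label side by side
shared <- inner_join(toy_bing, loughran_polarity,
                     by = "word",
                     suffix = c("_bing", "_loughran"))

# Share of Bing words also found in Loughran, and agreement on shared words
overlap_summary <- tibble(
  bing_words     = nrow(toy_bing),
  shared_words   = nrow(shared),
  share_covered  = nrow(shared) / nrow(toy_bing),
  share_agreeing = mean(shared$sentiment_bing == shared$sentiment_loughran)
)

overlap_summary
```

Swapping the toy tibbles for the real lexicons should give a rough sense of how much of a general-purpose lexicon the finance lexicon covers.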
library(textdata)
library(tidytext)
library(tidyverse)
library(janeaustenr)
library(gutenbergr)
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
# Download the book (Project Gutenberg ID 3160) with the gutenbergr package
the_odyssey <- gutenberg_download(3160)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
head(the_odyssey,10)
## # A tibble: 10 × 2
## gutenberg_id text
## <int> <chr>
## 1 3160 "cover"
## 2 3160 ""
## 3 3160 ""
## 4 3160 ""
## 5 3160 ""
## 6 3160 "The Odyssey"
## 7 3160 ""
## 8 3160 "by Homer"
## 9 3160 ""
## 10 3160 "Translated by Alexander Pope"
# Create a data frame with line numbers and one token (word) per row.
# Stop words are kept; the commented-out anti_join() would remove them.
tidy_odyssey <- the_odyssey %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) # %>%
  # anti_join(stop_words)

head(tidy_odyssey, 10)
## # A tibble: 10 × 3
## gutenberg_id linenumber word
## <int> <int> <chr>
## 1 3160 1 cover
## 2 3160 6 the
## 3 3160 6 odyssey
## 4 3160 8 by
## 5 3160 8 homer
## 6 3160 10 translated
## 7 3160 10 by
## 8 3160 10 alexander
## 9 3160 10 pope
## 10 3160 13 contents
# Comparing the sentiment analysis dictionaries
afinn <- tidy_odyssey %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  tidy_odyssey %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  tidy_odyssey %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

# The NRC plot seems to show less negative sentiment than the other two plots,
# so count the tokens that fall into each NRC category
tidy_odyssey %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment) %>%
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 positive 9034
## 2 negative 6966
## 3 trust 4623
## 4 fear 4134
## 5 anticipation 4020
## 6 joy 3620
## 7 anger 3226
## 8 sadness 3225
## 9 disgust 2040
## 10 surprise 1634
tidy_odyssey %>%
  inner_join(get_sentiments("loughran")) %>%
  count(sentiment) %>%
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 6 × 2
## sentiment n
## <chr> <int>
## 1 negative 1816
## 2 positive 960
## 3 litigious 613
## 4 uncertainty 507
## 5 constraining 180
## 6 superfluous 1
# Many tokens appear to be missing: far fewer tokens match the
# Loughran lexicon than matched NRC above

# Plotting the sentiment
plot_odyssey <- tidy_odyssey %>%
  inner_join(get_sentiments("loughran") %>%
               filter(sentiment %in% c("positive",
                                       "negative"))
  ) %>%
  mutate(method = "Loughran") %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
plot_odyssey %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE)
In the story, Odysseus is trying to return home to his wife, but he faces many challenges along the way. I was expecting the sentiment to be mostly negative throughout the story, similar to the “Bing et al.” plot. For the “AFINN” method, I can understand the generally positive sentiment: it sums a numeric rating for each word rather than counting words, and some words carry higher ratings than others. For the “NRC” method, I was expecting a plot similar to that of “Bing et al.”. However, the NRC plot counts only the words tagged “positive” or “negative”, so the tokens tagged with negative emotions such as fear, anger, sadness, and disgust are not included in its score.
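The difference between a summed AFINN score and a Bing-style count can be illustrated with a small sketch. The tokens, ratings, and labels below are hypothetical values invented for this example, not the real AFINN or Bing entries. They show how one strongly rated positive word can outweigh several mildly negative ones in a sum, while a count of positive minus negative words stays negative.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens from a single 80-line chunk (invented for illustration)
tokens <- tibble(word = c("glory", "grief", "loss", "woe"))

# Hypothetical AFINN-style ratings (not the real AFINN values)
toy_afinn <- tribble(
  ~word,   ~value,
  "glory",  4,
  "grief", -1,
  "loss",  -1,
  "woe",   -1
)

# Hypothetical Bing-style labels (not the real Bing entries)
toy_bing <- tribble(
  ~word,   ~sentiment,
  "glory", "positive",
  "grief", "negative",
  "loss",  "negative",
  "woe",   "negative"
)

# AFINN-style score: sum of the numeric ratings
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(sentiment = sum(value)) %>%
  pull(sentiment)

# Bing-style score: number of positive words minus number of negative words
bing_score <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  summarise(sentiment = sum(sentiment == "positive") -
              sum(sentiment == "negative")) %>%
  pull(sentiment)

afinn_score  # 4 - 1 - 1 - 1 = 1
bing_score   # 1 - 3 = -2
```

The same chunk of text is scored as slightly positive by the weighted sum and as negative by the count, which is one possible reason the two plots can disagree.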
In the sentiment analysis using the “loughran” lexicon, many words were not assigned a sentiment. There were almost twice as many tokens with negative sentiment as with positive sentiment (1,816 versus 960). Although the plot differs from the plots for the other three lexicons, it was what I had expected from my experience reading the book. However, I might hold some bias, as the book was extremely hard to understand because it is written as one very long poem.
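To put a number on how many words a lexicon leaves unscored, one option is to compute its coverage: the share of tokens that match at least one lexicon entry. The sketch below is a minimal illustration with hypothetical data. `toy_tokens` and `toy_lexicon` are invented stand-ins for `tidy_odyssey` and `get_sentiments("loughran")`.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens and lexicon, invented for illustration only
toy_tokens <- tibble(
  word = c("the", "hero", "suffered", "a", "loss", "at", "sea", "loss")
)
toy_lexicon <- tibble(
  word      = c("loss", "suffered"),
  sentiment = c("negative", "negative")
)

# semi_join() keeps each matching token once, even if the lexicon
# lists a word under several categories
matched <- semi_join(toy_tokens, toy_lexicon, by = "word")

# anti_join() keeps the tokens the lexicon does not know
unmatched <- anti_join(toy_tokens, toy_lexicon, by = "word")

# Share of all tokens that received a sentiment label
coverage <- nrow(matched) / nrow(toy_tokens)
coverage  # 3 of 8 tokens match, i.e. 0.375

# Most frequent words that were left without a sentiment
unmatched %>%
  count(word, sort = TRUE)
```

Running the same steps for each lexicon on the real tokens would show whether the finance lexicon matches a noticeably smaller share of the text than the general-purpose ones.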