This assignment replicates and extends the sentiment analysis code from Chapter 2 of Text Mining with R: A Tidy Approach. We start by getting the provided code to work and then extend it in two ways: by adding the Jockers lexicon to the lexicon comparison, and by applying the same techniques to the Harry Potter series.
The Jane Austen book data set is loaded from the janeaustenr library using austen_books(), and the text is then tokenized.
library(janeaustenr)  # austen_books()
library(dplyr)
library(stringr)
library(tidytext)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
We use the NRC lexicon for sentiment analysis to determine the most common joy words in Emma. We first get the joy sentiment words from the NRC lexicon and then inner join this to the tokenized books data set where the book is Emma.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
Next, we use the Bing lexicon to see how the sentiment changes throughout each book in the data set. The code calculates the net sentiment (positive minus negative) in 80-line segments and then plots it for each novel. From the plots, we can see how the sentiment shifts between positive and negative over the course of each book.
library(tidyr)    # pivot_wider()
library(ggplot2)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Next, we compare the three sentiment lexicons on a single novel. First, we filter the data set for lines of text from Pride and Prejudice.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

rmarkdown::paged_table(pride_prejudice)
We then compare the sentiments from the AFINN, NRC, and Bing lexicons. AFINN scores words on a scale from -5 to 5, NRC assigns words to several categories (here we use only its positive and negative categories), and Bing classifies words as simply positive or negative.
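As a quick reference, each lexicon can be inspected directly with get_sentiments(). This is a small sketch that is not part of the original chapter; it assumes the lexicons have already been downloaded (the textdata package prompts for this on first use).

# Inspect how each lexicon encodes sentiment
get_sentiments("afinn") %>% count(value)      # integer scores from -5 to 5
get_sentiments("nrc") %>% count(sentiment)    # eight emotions plus positive/negative
get_sentiments("bing") %>% count(sentiment)   # positive/negative only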
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive", "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
We can then compare the sentiments by plotting them.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
We can see that the absolute sentiment values differ between the lexicons; however, the plots all follow a similar pattern of dips and peaks.
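As a small optional check that is not part of the original chapter, we can correlate the AFINN scores with the Bing net sentiment across the 80-line chunks to put a number on how similar the trajectories are.

# Correlation between AFINN and Bing net sentiment per 80-line chunk
afinn %>%
  select(index, afinn = sentiment) %>%
  inner_join(bing_and_nrc %>%
               filter(method == "Bing et al.") %>%
               select(index, bing = sentiment),
             by = "index") %>%
  summarise(correlation = cor(afinn, bing))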
Using the Bing lexicon to split the texts into positive and negative words, we can see how much each word contributes to the positive or negative sentiment.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # … with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
From the graph, we can see that “miss” is classified as a negative word. However, Jane Austen uses “Miss” as a title for young ladies. We can remove this word from the sentiment analysis by adding our own custom stop words.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)

custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
We could also visualize the most used words by creating a wordcloud.
library(wordcloud)

tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
We see a lot of the characters’ names appear here. We also see that “time” seems to be the most used word, as it is the largest in the cloud.
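As a quick sanity check on the word cloud (an extra step, not in the original chapter), the most frequent non-stop words can also be listed directly:

# Top non-stop words, for comparison with the word cloud
tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word, sort = TRUE) %>%
  head(10)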
We can also use reshape2’s acast() function to turn the information into a matrix and then use comparison.cloud() to compare the positive and negative words.
library(reshape2)  # acast()

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  anti_join(custom_stop_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
Words that do not have a positive or negative connotation, or that simply are not included in the Bing lexicon, do not appear in this comparison wordcloud. The words that stand out the most are “happy”, “love”, and “pleasure”, all classified as positive. The most used negative word is “poor”.
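As a rough, optional way to see how much of the vocabulary the lexicon actually covers (an addition, not part of the original chapter), we can compute the share of distinct non-stop words that appear in Bing at all:

# Share of distinct non-stop words that have an entry in the Bing lexicon
tidy_books %>%
  anti_join(custom_stop_words) %>%
  distinct(word) %>%
  mutate(in_bing = word %in% get_sentiments("bing")$word) %>%
  summarise(coverage = mean(in_bing))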
Information on the Jockers lexicon can be found here. It is available as hash_sentiment_jockers in the lexicon package and rates sentiment on a continuous scale from -1 to 1.
library(lexicon)  # hash_sentiment_jockers

# rename the lexicon's x/y columns to word/sentiment for joining
jockers_sentiments <- hash_sentiment_jockers %>%
  select(word = x, sentiment = y)

jockers_word_counts <- pride_prejudice %>%
  inner_join(jockers_sentiments) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

jockers <- pride_prejudice %>%
  inner_join(jockers_sentiments) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(sentiment)) %>%
  mutate(method = "Jockers")
Let’s see how this new method compares to the previous methods by adding it to the comparison plot.
bind_rows(afinn,
          bing_and_nrc,
          jockers) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
The new method follows the same trends as the other methods.
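To back up that visual impression (an extra check, not in the original chapter), we can correlate the Jockers and AFINN trajectories over the same 80-line chunks:

# Correlation between the Jockers and AFINN sentiment trajectories
inner_join(afinn %>% select(index, afinn = sentiment),
           jockers %>% select(index, jockers = sentiment),
           by = "index") %>%
  summarise(correlation = cor(afinn, jockers))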
As an avid Harry Potter reader, I was curious to track the sentiment across the seven books.
The Harry Potter book data sets can be found here; each book is stored with one row of text per chapter.
First, I created a data set with all seven books.
# the book objects (philosophers_stone, ..., deathly_hallows) come from the
# Harry Potter data sets linked above
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
              goblet_of_fire, order_of_the_phoenix, half_blood_prince,
              deathly_hallows)

hp_books <- tibble()
for (i in 1:length(books)) {
  temp <- as.data.frame(books[[i]]) %>%
    mutate(book = titles[[i]])
  # each element of the book vector is one chapter; strip the all-caps chapter
  # title from the start of the text and record the chapter number
  temp <- temp %>%
    mutate(chapter_title = str_extract(temp[, 1], "[A-Z -]+"),
           text = str_remove(temp[, 1], chapter_title),
           chapter = c(1:nrow(temp))) %>%
    select(book, chapter, text)
  if (i == 1) {
    hp_books <- temp
  } else {
    hp_books <- rbind(hp_books, temp)
  }
}

hp_books <- hp_books %>%
  mutate(book = factor(book, levels = titles))
I then wanted to see which words contribute most to the sentiment of the series, both overall and split into positive and negative sentiment.
tidy_hp <- hp_books %>%
  unnest_tokens(word, text)

tidy_hp %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  with(wordcloud(word, n, max.words = 100))
tidy_hp %>%
  inner_join(get_sentiments("bing")) %>%
  anti_join(stop_words) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("pink", "lightgreen"),
                   max.words = 100)
We can see that “fudge” and “moody” are both words that contribute to negative sentiment. However, in Harry Potter, these are character names. We can filter these out using custom stop words.
hp_stop_words <- data.frame(word = c("fudge", "moody"),
                            lexicon = c("custom", "custom"))

tidy_hp <- tidy_hp %>%
  anti_join(hp_stop_words)
Let’s use the NRC lexicon to see the breakdown of sentiment types within each book. First, I want to see the sentiment broken down into the eight emotions included in the NRC lexicon: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
# count sentiments
hp_sentiment_counts_nrc <- tidy_hp %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(book) %>%
  count(sentiment)

# total number of words in sentiment count for each book
hp_sentiment_totals_nrc <- hp_sentiment_counts_nrc %>%
  group_by(book) %>%
  summarize(total = sum(n))

# join the 2 and calculate percentage
hp_sentiment_counts_nrc <- hp_sentiment_counts_nrc %>%
  inner_join(hp_sentiment_totals_nrc) %>%
  mutate(prop = n / total)
hp_sentiment_counts_nrc %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  ggplot(aes(x = sentiment, y = prop, fill = sentiment)) +
  geom_bar(stat = "identity") +
  facet_wrap(~book) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Sentiment", y = "Percentage of Sentiment") +
  theme(legend.position = "none", axis.text.x = element_text(angle = 90))
It is interesting that each of these seven books has an almost identical breakdown of these emotions. The most notable differences are an increase in fear in the fourth, fifth, and seventh books, and a decrease in trust in the seventh book.
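To put numbers on the fear observation (an extra check, not part of the original analysis), we can pull the fear proportions directly from the table computed above; note that prop here is the share of all NRC-matched words, including the positive and negative categories.

# Proportion of NRC-matched words tagged as "fear", by book
hp_sentiment_counts_nrc %>%
  filter(sentiment == "fear") %>%
  select(book, prop) %>%
  arrange(desc(prop))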
Next, I will look only at positive and negative sentiment to see how the books compare, and whether the Bing and NRC lexicons differ here.
################ bing ################
hp_sentiment_counts_bing <- tidy_hp %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book) %>%
  count(sentiment)

hp_sentiment_totals_bing <- hp_sentiment_counts_bing %>%
  group_by(book) %>%
  summarize(total = sum(n))

hp_sentiment_counts_bing <- hp_sentiment_counts_bing %>%
  inner_join(hp_sentiment_totals_bing) %>%
  mutate(prop = n / total, lexicon = "Bing")

############# nrc pos and neg ################
hp_sentiment_counts_nrc2 <- hp_sentiment_counts_nrc %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  select(book, sentiment, n)

hp_sentiment_totals_nrc2 <- hp_sentiment_counts_nrc2 %>%
  group_by(book) %>%
  summarize(total = sum(n))

hp_sentiment_counts_nrc2 <- hp_sentiment_counts_nrc2 %>%
  inner_join(hp_sentiment_totals_nrc2) %>%
  mutate(prop = n / total, lexicon = "NRC")
hp_sentiment_counts_nrc2 %>%
  rbind(hp_sentiment_counts_bing) %>%
  ggplot(aes(x = sentiment, y = prop, fill = lexicon)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~book) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Sentiment", y = "Percentage of Sentiment")
It seems that the NRC lexicon tracks more negative sentiment, while the Bing lexicon tracks more positive sentiment. Once again, the breakdown of sentiment across the books is almost identical. The Deathly Hallows stands out as the least positive of the seven.
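To summarize the dodged bars numerically (an optional addition, not in the original analysis), we can average the proportions across the seven books for each lexicon:

# Average share of positive vs negative matches under each lexicon
hp_sentiment_counts_nrc2 %>%
  rbind(hp_sentiment_counts_bing) %>%
  group_by(lexicon, sentiment) %>%
  summarise(mean_prop = mean(prop), .groups = "drop")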
Now let’s track the sentiment throughout the books. The Harry Potter data sets include a row for each chapter, so we will track how the sentiment changes from chapter to chapter. We will do this using the Jockers lexicon.
tidy_hp %>%
  inner_join(jockers_sentiments) %>%
  group_by(book, index = chapter) %>%
  summarise(sentiment = sum(sentiment)) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
We can track the overall sentiment for each chapter of each book. The last book seems to have the most overall negative sentiment.
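As an additional check on that observation (not shown in the chapter-level plot above), the total Jockers sentiment per book can be compared directly:

# Total Jockers sentiment per book, lowest first
tidy_hp %>%
  inner_join(jockers_sentiments) %>%
  group_by(book) %>%
  summarise(sentiment = sum(sentiment)) %>%
  arrange(sentiment)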
In this assignment, I got the sample code from the text working for the Jane Austen books. I also used the Jockers lexicon to extend the analysis and compared it against the plots tracking sentiment with the AFINN, NRC, and Bing lexicons.
I also analyzed the sentiment of the Harry Potter books. I used the Bing and NRC lexicons to track positive and negative sentiment, as well as the other emotions in the NRC lexicon. Based on the plots, the breakdown of sentiment for each book is practically identical, with minor differences for some books. I also used the Jockers lexicon to track the sentiment of each chapter. If I were to extend this, I would want to track the sentiment over a fixed number of words rather than whole chapters, to really see how the sentiment changes over the course of each book; a sketch of that idea is shown below.
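A minimal sketch of that extension, assuming 500-word chunks per book (the chunk size is arbitrary and chosen here only for illustration):

# Index words into fixed-size chunks per book, then sum Jockers sentiment per chunk
tidy_hp %>%
  group_by(book) %>%
  mutate(word_index = row_number() %/% 500) %>%  # hypothetical chunk size of 500 words
  ungroup() %>%
  inner_join(jockers_sentiments) %>%
  group_by(book, word_index) %>%
  summarise(sentiment = sum(sentiment), .groups = "drop") %>%
  ggplot(aes(word_index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")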