Data 607 Assignment 10A: Sentiment Analysis
Introduction
For this assignment, I start by using the example base code for sentiment analysis from Chapter 2 of Text Mining with R by Julia Silge and David Robinson.
I will then replicate the code to conduct sentiment analysis on Bram Stoker’s Dracula using the gutenbergr package to download the text. I will also incorporate the Syuzhet sentiment lexicon in my analysis as it is meant to be used on plot arcs.
We will use the following libraries:
- The tidytext library
- The janeaustenr library
- The gutenbergr library
- The wordcloud library
- The reshape2 library
- The kableExtra library
- The syuzhet library
Sentiment Analysis of Jane Austen’s Works
The following base code comes from
Silge, Julia, and David Robinson. Text Mining with R: A Tidy
Approach.
O’Reilly Media, 2017. https://www.tidytextmining.com/
Getting the AFINN, Bing and NRC Sentiment Lexicons
The base code in the book first gets the AFINN, Bing and NRC sentiment lexicons from the tidytext package using the get_sentiments() function.
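The calls that produced the three previews below are not shown in the rendered output; based on the book’s example, they were likely the following (the order matches the output: AFINN, then NRC, then Bing). Note that the AFINN and NRC lexicons are distributed through the textdata package, which may prompt for a download on first use.
# preview each lexicon (AFINN and NRC may prompt to download via textdata)
get_sentiments("afinn")
get_sentiments("nrc")
get_sentiments("bing")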
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
In the book, once the janeaustenr package is loaded, a data frame is created to store the text of Jane Austen’s six novels. The text is then tidied using the unnest_tokens() function, and columns are created to store the chapter and line number of each word in each novel.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Next, the book uses the NRC lexicon to find the most common joy words in Jane Austen’s Emma.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
Next, an index is created (every 80 lines of text is one chunk) using the Bing lexicon to track the trajectory of each novel. Then a plot of sentiment scores along the trajectory of each of Jane Austen’s novels is created.
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
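As the warning itself suggests, the message can be silenced by declaring the join relationship explicitly. A minimal sketch (the relationship argument is available in dplyr 1.1.0+ and is not part of the book’s original chunk):
# same pipeline, with the many-to-many relationship declared to silence the warning
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)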
Next, the base code in the book plots these sentiment scores across the plot trajectory of each novel.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Pride and Prejudice Sentiment Changes by Sentiment Lexicon
The base code in the book also compares how each sentiment lexicon calculates sentiment changes in Jane Austen’s Pride and Prejudice by estimating the net sentiment in every 80 lines of the novel.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The Most Common Positive and Negative Words
The book analyzes the word counts that contributed to the total positive and negative sentiments.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
The book then uses ggplot2 to visualize how many times each of the top ten negative and positive words appears in Jane Austen’s novels.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
The book notes that the word miss is counted as negative by the bing lexicon even though it is not a negative word in the context of Jane Austen’s writing; it can therefore be treated as a stop word by creating a custom stop-words list.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
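The custom list is not applied in any of the chunks shown here; to use it, one would swap it in for stop_words when filtering, for example (a sketch, not from the chunks above):
# drop standard stop words plus the custom "miss" entry before counting
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)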
Wordclouds
The book uses wordclouds to visualize the most common words that appear in Jane Austen’s novels.
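The chunk that generated the wordcloud is not shown in the rendered output; it was likely similar to the book’s example, which removes stop words before counting:
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))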
## Joining with `by = join_by(word)`
The book also uses a regex pattern to split each of Jane Austen’s novels into chapters.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Once the novels are split into chapters, it is possible to get a count of the negative words in each chapter of Jane Austen’s novels and the ratio of negative words to all words in each chapter, so as to find the chapter with the highest ratio of negative words in each novel.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'book'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
Sentiment Analysis for Bram Stoker’s Dracula
We are going to modify Julia Silge and David Robinson’s base code from Text Mining with R so that we can run a sentiment analysis on Bram Stoker’s Dracula. We will download the text using the gutenbergr package.
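The download chunk is not shown in the rendered output; the mirror messages below suggest a call like the following (345 is the Project Gutenberg ID that appears in the gutenberg_id column later on):
# download Dracula (Project Gutenberg ID 345) as a tibble of text lines
dracula <- gutenberg_download(345)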
## Determining mirror for Project Gutenberg from
## https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.
Tidying the Text
The Gutenberg Dracula text I am using lists the contents by chapter, which is followed by a short paragraph:
“How these papers have been placed in sequence will be made manifest in the reading of them. All needless matters have been eliminated, so that a history almost at variance with the possibilities of later-day belief may stand forth as simple fact. There is throughout no statement of past things wherein memory may err, for all the records chosen are exactly contemporary, given from the standpoints and within the range of knowledge of those who made them.”
and then the first chapter begins. Because Bram Stoker’s Dracula is a single novel and not a body of work like the six Jane Austen novels the base code was written for, I will be comparing sentiment per chapter and overall for Dracula. To do this I have to remove the contents list of chapters, as I do not want to accidentally have more chapters in my data frame than there really are. Dracula has 27 chapters.
start_string <- which(str_detect(dracula$text, "^\\s*of knowledge of those who made them\\.?\\s*$"))
dracula_clean <- if (length(start_string) > 0) {
dracula %>% slice(start_string[1]:n())
} else {
dracula
}
head(dracula_clean)
## # A tibble: 6 × 2
## gutenberg_id text
## <int> <chr>
## 1 345 "of knowledge of those who made them."
## 2 345 ""
## 3 345 ""
## 4 345 ""
## 5 345 ""
## 6 345 "DRACULA"
Now that the contents list of chapters has been removed, we are free to split the novel into chapters.
This and the rest of the code that follows is modified from
Silge, Julia, and David Robinson. Text Mining with R: A Tidy
Approach.
O’Reilly Media, 2017. https://www.tidytextmining.com/
tidy_stoker2 <- dracula_clean %>%
mutate(
linenum = row_number(),
chapters = cumsum(str_detect(text,
regex("^CHAPTER [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Before I continue, I am going to verify that the correct number of chapters is counted.
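The verification chunk is not shown; a minimal sketch that would produce the output below (assuming the chapters column built above):
# list the distinct chapter indices; 0 is the front matter before CHAPTER I
unique(tidy_stoker2$chapters)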
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [26] 25 26 27
Okay, great!
Joy Words
As in the base code example, we are going to find the most common joy words in Dracula using the NRC lexicon.
# finding joy words in dracula using nrc lexicon
tidy_stoker2 %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 305 × 2
## word n
## <chr> <int>
## 1 good 258
## 2 friend 183
## 3 found 153
## 4 god 150
## 5 diary 98
## 6 love 84
## 7 child 73
## 8 hope 66
## 9 sweet 66
## 10 kind 60
## # ℹ 295 more rows
We will also follow the base code to plot sentiment scores along the novel’s chapters using the bing lexicon; here each index chunk consists of 100 lines of text.
dracula_sentiment <- tidy_stoker2 %>%
inner_join(get_sentiments("bing")) %>%
count(chapters, index = linenum %/% 100, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
ggplot(dracula_sentiment, aes(index, sentiment, fill = chapters)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapters, ncol = 6, scales = "free_x")
It’s no surprise that the sentiment scores are largely negative for most chapters, with many remaining negative throughout.
Overall Sentiment Per Dracula Chapter
We will also look at the overall sentiment trajectory of Bram Stoker’s novel, continuing to use the bing lexicon.
chapter_sentiment <- dracula_sentiment %>%
group_by(chapters) %>%
summarise(sentiment = sum(sentiment))
ggplot(chapter_sentiment, aes(x = chapters, y = sentiment)) +
geom_col(fill = "red", color = "white") +
labs(title = "Overall Sentiment Per Dracula Chapter",
x = "Chapter",
y = "Overall sentiment") +
theme_dark()
Again, it is no surprise that the overall sentiment over the trajectory of Dracula trends negative.
Comparing the AFINN, Bing and NRC Lexicons
Now, we will modify the base code so that we can also compare how each sentiment lexicon calculates sentiment changes in Dracula by estimating the net sentiment in every 80 lines of the novel.
# afinn, bing and nrc
afinn_dracula <- tidy_stoker2 %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenum %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_nrc_dracula <- bind_rows(
tidy_stoker2 %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_stoker2 %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenum %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 496 of `x` matches multiple rows in `y`.
## ℹ Row 3013 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn_dracula,
bing_nrc_dracula) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The NRC lexicon labels far more text chunks as positive than the other two lexicons do. Unlike in the Pride and Prejudice example from the book, the three lexicons agree somewhat less on the overall sentiment trends in Dracula.
Dracula Positive and Negative Word Counts
As in the base code example, we will count the words in Dracula that match the bing lexicon and label them positive or negative; the ten most frequent are shown below.
bing_word_count <- tidy_stoker2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## # A tibble: 10 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 like positive 292
## 2 good positive 258
## 3 well positive 245
## 4 poor negative 193
## 5 great positive 183
## 6 work positive 146
## 7 fear negative 137
## 8 dead negative 109
## 9 right positive 99
## 10 terrible negative 99
Again, as in the base code, let us visualize the top ten negative and positive words that appear in Dracula side by side.
bing_word_count %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment in Dracula",
y = NULL)
Most of the words make sense in their respective categories, except for miss, which appears as a negative word here just as it did in the analysis of Jane Austen’s novels.
Wordclouds for Dracula
As in the base example, let us visualize the words that appear most often in Dracula using a wordcloud.
tepes_colors <- c("black", "gray40", "slategray", "firebrick1", "firebrick3", "red3", "red4")
tidy_stoker2 %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 80,
scale = c(2, 0.4), #could not fit larger words otherwise
rot.per = 0.35,
random.order = FALSE,
colors = tepes_colors)) # put the most frequent words in the center
## Joining with `by = join_by(word)`
Now let’s take a look at the most common positive and negative words; as in the base code, we will display them in a comparison word cloud.
tidy_stoker2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
As in the base code example, we are going to get a count of the negative words in each chapter of Dracula and the ratio of negative words to all words in each chapter, so as to find which chapters have the highest ratios of negative words.
#group word count by chapter first
wordcounts_dracula <- tidy_stoker2 %>%
group_by(chapters) %>%
summarize(words = n())
#now we will find the negative word ratio per chapter
negative_count <- tidy_stoker2 %>%
semi_join(bingnegative) %>%
group_by(chapters) %>%
summarize(negativewords = n(), .groups = "drop") %>%
left_join(wordcounts_dracula, by = "chapters") %>%
mutate(ratio = negativewords/words) %>%
filter(chapters != 0) %>%
ungroup()
## Joining with `by = join_by(word)`
Now we can display the negative word ratio for each chapter, sorted in ascending order so that the chapter with the highest negative word ratio appears last.
negative_rows <- nrow(negative_count)
negative_count %>%
arrange(ratio) %>%
kable("html", caption = "Dracula Chapters by Negative Word Ratio") %>%
kable_styling(full_width = FALSE, position = "center") %>%
row_spec(0, background = "red", color = "white", bold = TRUE) %>%
row_spec(1:negative_rows, background = "black", color = "white")

| chapters | negativewords | words | ratio |
|---|---|---|---|
| 20 | 141 | 5958 | 0.0236657 |
| 2 | 139 | 5525 | 0.0251584 |
| 5 | 94 | 3607 | 0.0260604 |
| 18 | 184 | 6996 | 0.0263007 |
| 24 | 170 | 6339 | 0.0268181 |
| 26 | 204 | 7162 | 0.0284837 |
| 17 | 164 | 5626 | 0.0291504 |
| 6 | 173 | 5744 | 0.0301184 |
| 10 | 188 | 5999 | 0.0313386 |
| 14 | 205 | 6499 | 0.0315433 |
| 3 | 186 | 5747 | 0.0323647 |
| 8 | 207 | 6356 | 0.0325677 |
| 22 | 179 | 5496 | 0.0325691 |
| 9 | 198 | 5977 | 0.0331270 |
| 11 | 175 | 5209 | 0.0335957 |
| 25 | 213 | 6316 | 0.0337239 |
| 1 | 195 | 5770 | 0.0337955 |
| 12 | 266 | 7330 | 0.0362892 |
| 4 | 218 | 5885 | 0.0370433 |
| 15 | 224 | 5873 | 0.0381406 |
| 23 | 220 | 5722 | 0.0384481 |
| 13 | 259 | 6659 | 0.0388947 |
| 7 | 223 | 5648 | 0.0394830 |
| 21 | 254 | 6227 | 0.0407901 |
| 19 | 235 | 5717 | 0.0411055 |
| 16 | 193 | 4628 | 0.0417027 |
| 27 | 344 | 8234 | 0.0417780 |
The Syuzhet Lexicon
The syuzhet package is an R package for extracting sentiment and sentiment-based plot arcs from text, which makes it fitting for our sentiment analysis of Dracula.
Comparing Sentiment Scores Across the AFINN, Bing, NRC and Syuzhet Lexicons
The base code example compared how each sentiment lexicon calculates sentiment changes along Dracula’s trajectory; now we will add the syuzhet lexicon to the comparison.
Running the entirety of Dracula’s text through the syuzhet lexicon to get sentiment scores can take a long time (from experience), so I found it best to run only the distinct words through the lexicon to save time.
# get only unique words
unique_words <- tidy_stoker2 %>% select(word) %>% distinct()
# get syuzhet sentiment scores
syuzhet_scores <- unique_words %>%
rowwise() %>%
mutate(syuzhet_sentiment = get_sentiment(word, method = "syuzhet")) %>%
ungroup()
Now we’re ready to make the comparison!
syuzhet_dracula <- tidy_stoker2 %>%
left_join(syuzhet_scores, by = "word") %>%
group_by(index = linenum %/% 80) %>%
summarise(sentiment = sum(syuzhet_sentiment, na.rm = TRUE)) %>%
mutate(method = "Syuzhet")
Note: when we join the scores computed for every unique word back to all the words in the text, each occurrence of a word in Dracula receives that word’s score. Scoring only the distinct words therefore loses no information; it simply avoids re-scoring a word every time it appears.
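A toy illustration of that join behavior (the words and scores here are invented purely for demonstration):
# hypothetical mini-text and unique-word scores; the values are made up
text_toy   <- tibble(word = c("dark", "blood", "dark"))
scores_toy <- tibble(word = c("dark", "blood"), syuzhet_sentiment = c(-0.5, -0.75))
left_join(text_toy, scores_toy, by = "word")
# three rows: both occurrences of "dark" receive the same -0.5 score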
bind_rows(afinn_dracula,
bing_nrc_dracula, syuzhet_dracula) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
As we can see from the comparison, the Syuzhet lexicon identifies sentiment trends similar to those of the other lexicons, but it finds more positive sentiment than the AFINN and Bing lexicons do, though not as much as the NRC lexicon does.
Overall Sentiment Per Chapter in Dracula Using Syuzhet
Now let us take a look at the overall sentiment across Dracula’s story flow using the syuzhet lexicon’s overall sentiment score per chapter.
tidy_syuzhet <- tidy_stoker2 %>%
left_join(syuzhet_scores, by = "word")
chapter_sentiment_syuzhet <- tidy_syuzhet %>%
group_by(chapters) %>%
summarise(
total_sentiment = sum(syuzhet_sentiment),
avg_sentiment = mean(syuzhet_sentiment),
.groups = "drop"
)
ggplot(chapter_sentiment_syuzhet, aes(x = chapters, y = total_sentiment)) +
geom_col(fill = "darkred") +
labs(title = "Syuzhet Sentiment Across Dracula Chapters",
x = "Chapter",
y = "Net sentiment") +
theme_minimal()
The overall sentiment per chapter across Dracula’s story flow using the syuzhet lexicon is positive! This is a surprise and perhaps speaks to some limitations of the lexicon, as Dracula is not generally perceived as a book full of positive sentiment (it is horror, after all).