I’m going to use some code that gives a basic example of text mining with Jane Austen’s novels and extend it to a new corpus and a new lexicon. The new lexicon for sentiment analysis is introduced in another file.
The Jane Austen code is from Chapter 2 (Sentiment Analysis with Tidy Data) of Text Mining with R by Julia Silge and David Robinson.
Silge, J. and Robinson, D., 2017. Text Mining with R: A Tidy Approach. 1st ed. Sebastopol, CA, USA: O’Reilly Media.
The “bing” lexicon was first published in Minqing Hu and Bing Liu, “Mining and Summarizing Customer Reviews,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004. It is available for use with attribution.
The AFINN sentiment lexicon is available under the ODbL v1.0 license, and the NRC Word-Emotion Association Lexicon (also known as EmoLex) is available for non-commercial research use.
We start by loading libraries and downloading the base sentiment analysis datasets.
library(tidytext)
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(wordcloud)
library(reshape2)
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
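Since the plan is to bring in another lexicon from a separate file, it is worth noting that any lexicon shaped as a one-word-per-row table can be dropped into the same joins used below. Here is a minimal sketch, assuming a hypothetical CSV named custom_lexicon.csv with word and sentiment columns (not part of the original example):
library(readr)
# Hypothetical custom lexicon: a CSV with columns `word` and `sentiment`.
# Once loaded, it can be used anywhere bing or nrc is used below.
custom_lexicon <- read_csv("custom_lexicon.csv") %>%
  distinct(word, sentiment)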
We tidy the dataset containing Jane Austen’s novels, tracking line and chapter numbers and splitting the text into one word per row.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
Then we collect the words from the nrc sentiment set with connotations of joy.
nrc_joy <- nrc %>%
  filter(sentiment == "joy")
We use an inner join to see which words from Emma are joyful, according to the nrc set, and how often they occur.
emma_joy <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
Now we can use an inner join with the bing set to track how sentiment changes from the beginning to the end of each novel. Integer (floor) division with %/% breaks each text into chunks of 80 lines.
jane_austen_sentiment <- tidy_books %>%
  inner_join(bing) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
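As a quick illustration of how the %/% operator assigns line numbers to 80-line chunks (these particular line numbers are just made-up examples):
# Lines 1-79 land in chunk 0, lines 80-159 in chunk 1, and so on.
c(1, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2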
Then we plot the sentiment scores.
ggplot(jane_austen_sentiment,
       aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Let’s work with Pride and Prejudice.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
Because the AFINN lexicon scores words on a scale of -5 to 5, we need a slightly different computation for it than for the other two lexicons, which classify words as simply positive or negative.
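To make that difference concrete, here is a quick look (added for illustration) at the columns each lexicon provides:
head(afinn)  # columns: word, value  (integer score from -5 to 5)
head(bing)   # columns: word, sentiment  ("positive" or "negative")
head(nrc)    # columns: word, sentiment  (labels such as "joy", "trust", "positive")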
pp_afinn <- pride_prejudice %>%
  inner_join(afinn) %>%
  group_by(index = linenumber %/% 80) %>%
  summarize(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
pp_bing_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(bing) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(nrc %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)
Here is a net sentiment estimation for each chunk of Pride and Prejudice.
bind_rows(pp_afinn,
          pp_bing_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
According to the book, and visibly in the charts, the NRC results skew more positive, AFINN shows more variance, and Bing et al. labels longer stretches of similarly positive or negative text. Reportedly, this happens with other texts as well.
Why? As the counts below show, both lexicons contain more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon, which pushes its net sentiment lower than NRC’s.
nrc %>%
  filter(sentiment %in% c("positive",
                          "negative")) %>%
  count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
count(bing, sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
(bing_word_counts <- tidy_books %>%
   inner_join(bing) %>%
   count(word, sentiment, sort = TRUE) %>%
   ungroup())
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # … with 2,575 more rows
These are the words that contribute the most to positive and negative sentiment in the text:
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
It helps to define some custom stop words to avoid misusing certain vocabulary: the Bing lexicon codes “miss” as negative, but in Jane Austen’s novels it is mostly a title for young, unmarried women.
(custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                       lexicon = c("custom")),
                                stop_words))
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
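The custom list is not applied automatically; here is a minimal sketch (my addition, reusing the objects defined above) of how it could be anti-joined out before recounting the sentiment words:
# Remove the custom stop words (including "miss") before counting again.
bing_word_counts_clean <- tidy_books %>%
  anti_join(custom_stop_words) %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)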
And we can build a word cloud of the most common words (after removing stop words), as well as a comparison cloud split by positive and negative sentiment.
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
We can also examine units larger than single words, since negation or sentence complexity can flip the true sentiment of a piece of text; packages such as coreNLP and cleanNLP attempt just this. Here we tokenize Pride and Prejudice into sentences and then split all the books into chapters with a regex.
pp_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarize(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# Negative words from the Bing lexicon, used to find each book's most negative chapter.
bingnegative <- bing %>%
  filter(sentiment == "negative")
# Total word count per chapter, used as the denominator for the ratio below.
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
# For each book, the chapter with the highest proportion of negative words.
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
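Finally, the same pipeline should carry over to the new corpus mentioned at the start. A minimal sketch, assuming a hypothetical tibble new_corpus with one line of raw text per row in a column called text (not part of the original example):
# Hypothetical new corpus: net Bing sentiment per 80-line chunk, as above.
new_corpus_sentiment <- new_corpus %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(bing) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)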