Problem Description and Approach
We’re going to recreate and analyze the primary code from Chapter 2, Sentiment analysis with tidy data, of Julia Silge & David Robinson’s Text Mining with R (last built 2022-02-07). We’ll extend that code to a book or selection of books from Project Gutenberg and incorporate an additional sentiment lexicon.
The following code is from Julia Silge and David Robinson’s (2017) Text Mining with R, Chapter 2: Sentiment analysis with tidy data
We are coding in the Tidyverse with additional packages:
- tidytext | for text mining
- janeaustenr | to load a corpus of Jane Austen’s books
- wordcloud | for wordclouds
- reshape2 | allows more control over wordcloud shape
# Load packages --------------------------------------
#install.packages('textdata')
library(tidyverse)
library(tidytext)
library(janeaustenr)
library(wordcloud)
library(reshape2)
We’re pulling the following sentiment lexicons.
AFINN-111 This dataset was published in Finn Årup Nielsen (2011), “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages (2011) 93-98.
Bing This dataset was first published in Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.
nrc This dataset was published in Saif M. Mohammad and Peter Turney (2013), “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.
# Pull sentiment lexicons
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,865 more rows
We’re converting to tidy format (one word per row) using unnest_tokens() and adding columns for the chapter and line number.
# Convert to tidy format with chapter and line numbers
tidy_books <- austen_books() %>%
  group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
Here we apply two examples of sentiment analysis.
Code Summary:
- Show ‘Joy’ words by frequency using nrc
- Graph the plot trajectory of each novel using bing
# Show 'Joy' words by frequency using nrc
# Inner join joy words from the nrc sentiment lexicon to the text
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Tabulate the joy words from the book 'Emma' by frequency
tidy_books %>%
  filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # … with 291 more rows
# Graph the plot trajectory of each novel using bing
# Create chunks of 80 lines for sentiment analysis with bing
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
# Graph the bins by how positive or negative they are
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Here we look at how the three sentiment lexicons perform differently when tracking sentiment changes over the narrative arc of Pride and Prejudice.
Code Summary:
- Isolate the book, Pride & Prejudice
- Find the net sentiment for each lexicon across the same chunks
- Visualize the net sentiment for each lexicon across the same chunks
# Isolate the book, Pride & Prejudice
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
# Find the net sentiment for each lexicon across the same chunks
# AFINN requires a separate pattern since it measures sentiment between -5 and 5
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
# Bing and nrc use the same pattern
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
# Visualize the net sentiment for each lexicon across the same chunks
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The chapter notes that the three lexicons track the relative changes in sentiment across the novel similarly, but on different absolute scales: the NRC sentiment sits higher, AFINN has more variance, and Bing et al. finds longer stretches of similar text.
The chapter suggests this is because, while both NRC and Bing contain more negative than positive words, Bing has a higher ratio of negative to positive words than NRC does. The counts below, and the small ratio sketch that follows them, bear this out.
Code Summary:
- Show negative to positive words in NRC
- Show negative to positive words in Bing
# Show negative to positive words in NRC
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3318
## 2 positive 2308
# Show negative to positive words in Bing
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
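To make the comparison concrete, here is a quick sketch (not part of the chapter) that computes the negative-to-positive ratio for each lexicon; with the counts above it works out to roughly 2.4 for Bing and 1.4 for NRC.
# A sketch (not from the chapter): negative-to-positive ratio for each lexicon
bind_rows(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    mutate(lexicon = "NRC"),
  get_sentiments("bing") %>%
    mutate(lexicon = "Bing")) %>%
  count(lexicon, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(neg_to_pos = negative / positive)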
Here we track which words contributed to which sentiment, positive or negative in this case, to identify words that are confounding the results for this particular text.
Code Summary:
- Tabulate words by frequency that contribute to either sentiment
- Graph the counts for examination
- Create a custom stop-words list to address the confounding word “miss”
# Tabulate words by frequency that contribute to either sentiment
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
# Graph the counts for examination
bing_word_counts %>%
  group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
# Create a custom stop-words list to address the confounding word "miss"
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
# For demonstration only
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # … with 1,140 more rows
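The custom list is only displayed above, not applied. If we actually wanted to drop “miss” before the sentiment join, one possible pattern (an assumption, mirroring the anti_join used with the standard stop words later on) would be:
# A sketch: remove the custom stop words (including "miss") before
# counting sentiment contributions
tidy_books %>%
  anti_join(custom_stop_words) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)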
Here we create two word clouds, one by frequency and one by contribution to net sentiment.
Code Summary:
- Wordcloud of most frequent words
- Wordcloud of most frequent words split by positive or negative sentiment
# Wordcloud of most frequent words
set.seed(2248)
tidy_books %>%
  anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
# Wordcloud of most frequent words split by positive or negative sentiment
set.seed(2341)
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining, by = "word"
Code Summary:
- Example sentence after tokenizing text into sentences
- Split the series of books by chapter using a regex pattern
- Show the chapter from each book with the most negative net sentiment
# Tokenize text into sentences
p_and_p_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
# Show example sentence
p_and_p_sentences$sentence[50]
## [1] "_may_ fall in love with one of them, and therefore you must visit him as"
The chapter suggests trying “iconv(text, to = ‘latin’) in a mutate statement before unnesting” if sentence tokenizing is having trouble with UTF-8 encoded text, which can be improved by converting to ASCII punctuation, “especially with sections of dialogue”.
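A minimal sketch of what that suggestion might look like; the object name p_and_p_sentences_latin1 is my own, and "latin1" is the standard iconv name for that encoding rather than a quote from the chapter.
# A sketch of the chapter's iconv() suggestion: re-encode in a mutate
# before tokenizing into sentences
p_and_p_sentences_latin1 <- tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>%
  unnest_tokens(sentence, text, token = "sentences")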
# Split the series of books by chapter using a regex pattern
austen_chapters <- austen_books() %>%
  group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
# Demonstrate split by chapters
austen_chapters %>%
  group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
# Show the chapter from each book with the most negative net sentiment
# Identify just negative words from Bing
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
# Count words for each chapter
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'book'. You can override using the `.groups`
## argument.
# Tabulate chapters with the highest ratio of negative words to words
tidy_books %>%
  semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()
## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
Here we extend the code above to a new corpus and incorporate an additional lexicon for sentiment analysis.
We are going to apply sentiment analysis to The Brothers Karamazov by Fyodor Dostoyevsky using the gutenbergr package. Visit [gutenberg.org](https://www.gutenberg.org/) to learn more about Project Gutenberg.
Fyodor Dostoyevsky (1879), translated by Constance Garnett, The Brothers Karamazov, [Gutenberg book ID 28054](https://www.gutenberg.org/ebooks/28054).
Code Summary:
- Load packages
- Download book
- Truncate book (not used in final knit)
# Load packages --------------------------------------
#install.packages('gutenbergr')
library(gutenbergr)
# Download book
TheBroKov <- gutenberg_download(28054)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
# Truncate book (not used in final knit)
TheBroKovSmall <- data.frame(c(1:37250))
TheBroKovSmall <- TheBroKov[c(1:37250), c(2)]
The syuzhet package contains the afinn, bing, and nrc lexicons above, in addition to the syuzhet lexicon developed in the Nebraska Literary Lab. It also provides a way to call Stanford’s coreNLP sentiment parser, which could be a future flex.
Jockers ML (2015). Syuzhet: Extract Sentiment and Plot Arcs from Text. https://github.com/mjockers/syuzhet.
# Load packages --------------------------------------
#install.packages('syuzhet')
library(syuzhet)
This is a hodgepodge of functionality.
The word ‘chapter’ was inordinately present in the first 1,000 lines of text and so has been added to the stop-words list. We’re leaving the chapter numerals (e.g. ‘iv’, ‘iii’, ‘vi’) in place; a sketch for filtering them out follows the tidying step below.
# Add custom stop-words like "chapter"
custom_stop_words <- bind_rows(tibble(word = c("chapter"),
                                      lexicon = c("custom")),
                               stop_words)
Here we turn the text into tidy format, with one word per row.
# Remove stop words
# Toggle TheBroKovSmall with TheBroKov$text for final run
Tidy_tbk <- TheBroKovSmall %>%
  unnest_tokens(word, text) %>%
  anti_join(custom_stop_words)
## Joining, by = "word"
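If the chapter numerals also needed to go, one assumed approach (the object names roman_numerals and Tidy_tbk_nonum below are illustrative) is to filter them out after tokenizing rather than adding each numeral to the stop-words list:
# A sketch: drop tokens that are exactly a roman numeral up to 100
# (e.g. "iv", "xii") without touching ordinary words
roman_numerals <- tolower(as.character(utils::as.roman(1:100)))
Tidy_tbk_nonum <- Tidy_tbk %>%
  filter(!word %in% roman_numerals)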
Here we create two word clouds, one by frequency and one by nrc sentiment.
Code Summary:
- Wordcloud of most frequent words
- Wordcloud of most frequent words split by nrc sentiment
# Wordcloud of most frequent words
set.seed(4647)
Tidy_tbk %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 360, scale = c(3.5, 1)))
# Wordcloud of most frequent words split by nrc sentiment
set.seed(4749)
Tidy_tbk %>%
  inner_join(get_sentiments("nrc")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "orange", "yellow", "green", "blue", "darkblue", "purple", "black", "gray", "brown"),
max.words = 360)
## Joining, by = "word"
Here we visualize the most frequent words that contributed to net sentiment using the Bing lexicon.
Code Summary:
- Tabulate words by frequency that contribute to either sentiment
- Graph the counts for examination
# Tabulate words by frequency that contribute to either sentiment
bing_word_counts <- Tidy_tbk %>%
  inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
# Graph the counts for examination
bing_word_counts %>%
  group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Here we show the words contributing most to the joy sentiment with nrc. As a flex we could track this for the other seven nrc emotion sentiments and make a chart, but we won’t attempt that yet; a rough sketch of what it might look like follows the output below.
# Show 'Joy' words by frequency using nrc
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
# Tabulate the joy words from The Brothers Karamazov by frequency
Tidy_tbk %>%
  inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 419 × 2
## word n
## <chr> <int>
## 1 love 467
## 2 money 429
## 3 god 373
## 4 mother 169
## 5 true 153
## 6 feeling 147
## 7 found 127
## 8 child 122
## 9 church 122
## 10 laughing 115
## # … with 409 more rows
#anger, fear, anticipation, trust, surprise, sadness, joy, and disgust
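The chart itself is not attempted here, but a rough sketch (an assumption, not run in this document) of how it might look, counting the top words for every NRC emotion rather than just joy; nrc_emotions is an illustrative name:
# A sketch: top five words for each of the eight NRC emotions,
# faceted by emotion
nrc_emotions <- get_sentiments("nrc") %>%
  filter(!sentiment %in% c("positive", "negative"))

Tidy_tbk %>%
  inner_join(nrc_emotions) %>%
  count(sentiment, word, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 5) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Word count", y = NULL)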
I could not get the sentence and sentiment functions from the syuzhet package working (they are get_sentences() and get_sentiment(), with underscores) and will need to improve this.
# This call still fails as written: get_sentiment() expects a character
# vector rather than a tibble (see the sketch below)
stbk <- get_sentiment(Tidy_tbk, method = "syuzhet")
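As an assumed fix (not verified in this knit), pulling the word column out as a character vector may be all get_sentiment() needs, and get_sentences() similarly wants a text string. The names tbk_sentences and tbk_sentence_scores below are my own:
# A sketch: score each token with the syuzhet lexicon by passing
# a character vector instead of a tibble
stbk <- get_sentiment(Tidy_tbk$word, method = "syuzhet")
sum(stbk)  # net syuzhet-lexicon sentiment of the (truncated) book

# Sentence-level alternative: rebuild the raw text, split it into
# sentences, then score each sentence
tbk_sentences <- get_sentences(paste(TheBroKov$text, collapse = " "))
tbk_sentence_scores <- get_sentiment(tbk_sentences, method = "syuzhet")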
It’s tough to extend code to new problems without breaking down the original code into individual components. I’d like to spend more time with text mining and with breaking the code apart into first principles.
My favorite lexicons are bing for simplicity, and nrc for the emotional sentiments.
I’d like to go back and create chunks of code from my text for the plot analysis, and truly duplicate the original analysis with the new text.
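A possible starting point, mirroring the 80-line chunking used for the Austen novels earlier (tbk_sentiment is my own name, and this is a sketch rather than something run here):
# A sketch: the same 80-line chunking and Bing net sentiment,
# applied to The Brothers Karamazov
tbk_sentiment <- TheBroKov %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(tbk_sentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)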
I would also like to spend more time to work through the entire syuzhet vignette to explore its capabilities and introduce myself to the Stanford coreNLP functions.
Additionally, when I first tried to run the whole book it failed, but I don’t think the code was the issue. Using the truncated version of the code I was successful with 10,000 lines, and then with the full 37,250 lines run through the same truncated version.
The following pages were helpful in navigating this assignment.
A resource for all of the NLP packages in R:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The vignette illustrating the syuzhet package:
https://CRAN.R-project.org/package=syuzhet
A blog post by Julia Silge, co-author of Text Mining with R:
https://juliasilge.com/blog/if-i-loved-nlp-less/
A Medium article by Namitha Deshpande (2020):
https://medium.com/analytics-vidhya/text-mining-with-r-d5606b3d7bec