The assignment is to re-create the R code from Chapter 2 of the textbook:
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/sentiment.html.
Here is a re-creation of that code.
Import three sentiment lexicons from the sentiments datasets available through tidytext. The first is the “afinn” lexicon.
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.3
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
The second is the “bing” lexicon.
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
The third is the “nrc” lexicon, published by Saif M. Mohammad and Peter Turney (2013), “Crowdsourcing a Word-Emotion Association Lexicon,” Computational Intelligence, 29(3): 436–465.
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
Create a dataset that includes the works of Jane Austen.
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.0.3
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)
Create a dataset containing only the “joy” words from the “nrc” lexicon, then use it to count the “joy” words in Emma.
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
Compare the sentiment across all of the works of Jane Austen by joining to the “bing” lexicon, chopping the works into 80-line indexed sections, counting the positive and negative sentiments within each section, and taking the difference of the sentiments to derive a net sentiment for each section.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head(jane_austen_sentiment)
## # A tibble: 6 x 5
## book index negative positive sentiment
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
Plot the sentiments for each work.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Compare how the three different lexicons perform against the work Pride and Prejudice.
# filter Jane Austen's works for Pride & Prejudice
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
# join to the afinn lexicon
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
head(afinn)
## # A tibble: 6 x 3
## index sentiment method
## <dbl> <dbl> <chr>
## 1 0 29 AFINN
## 2 1 0 AFINN
## 3 2 20 AFINN
## 4 3 30 AFINN
## 5 4 62 AFINN
## 6 5 66 AFINN
# join to the bing and nrc lexicons
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
      filter(sentiment %in% c(
        "positive",
        "negative"
      ))) %>%
    mutate(method = "NRC")
) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
head(bing_and_nrc)
## # A tibble: 6 x 5
## method index negative positive sentiment
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bing et al. 0 7 21 14
## 2 Bing et al. 1 20 19 -1
## 3 Bing et al. 2 16 20 4
## 4 Bing et al. 3 19 31 12
## 5 Bing et al. 4 23 47 24
## 6 Bing et al. 5 15 49 34
Visualize how the net sentiment trend differs depending on the lexicon used.
bind_rows(
afinn,
bing_and_nrc
) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Compare the overall sentiment levels in two of the lexicons (“nrc” and “bing”).
nrc_sentiment <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  mutate(lexicon = "nrc")
bing_sentiment <- get_sentiments("bing") %>%
  count(sentiment) %>%
  mutate(lexicon = "bing")
lexicon_compare <- bind_rows(nrc_sentiment, bing_sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(
    pct_neg = negative / (negative + positive),
    pct_pos = positive / (negative + positive)
  )
lexicon_compare
## # A tibble: 2 x 5
## lexicon negative positive pct_neg pct_pos
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 bing 4781 2005 0.705 0.295
## 2 nrc 3324 2312 0.590 0.410
Identify the most common sentiment words in Austen's works according to the “bing” lexicon, along with the sentiment assigned to each word. Graph the top 10 results for each sentiment.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Contribution to sentiment",
x = NULL
) +
coord_flip()
Add the word “miss” to the stop-word list as a custom stop word, since in Austen's novels it is most often a title for young women rather than a negative term.
custom_stop_words <- bind_rows(
tibble(
word = c("miss"),
lexicon = c("custom")
),
stop_words
)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
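The custom list can then be used in place of the default stop words. As a minimal sketch (an addition here, not part of the original text), “miss” and the standard stop words could be filtered out of the tidy Austen data before counting:
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>% # drops "miss" along with the SMART/snowball/onix stop words
  count(word, sort = TRUE)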
Create a word cloud of the most common words in the works of Jane Austen, after removing stop words.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.3
## Warning: package 'RColorBrewer' was built under R version 4.0.3
library(tidytext)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
Create a sentiment word cloud using comparison.cloud(); negative words appear in the darker gray, positive words in the lighter gray.
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.0.3
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("gray20", "gray80"),
max.words = 100
)
### Looking Beyond Words
Tokenize at the sentence level instead of the word level.
PandP_sentences <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(sentence, text, token = "sentences")
Tokenize at the chapter level.
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text,
token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]"
) %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
Find the number and ratio of negative words in the most negative chapter of each book.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords / words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
Project Gutenberg offers readers (and researchers) the full text of over 60,000 eBooks for free. For the purposes of this assignment, I chose a sample of the works of Charles Dickens.
Download the sampling of books by Charles Dickens from the Project Gutenberg website.
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.0.3
dickens786 <- gutenberg_download(c(786)) %>%
mutate(book = "Hard Times")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
dickens1400 <- gutenberg_download(c(1400)) %>%
mutate(book = "Great Expectations")
dickens730 <- gutenberg_download(c(730)) %>%
mutate(book = "Oliver Twist")
dickens766 <- gutenberg_download(c(766)) %>%
mutate(book = "David Copperfield")
dickens1023 <- gutenberg_download(c(1023)) %>%
mutate(book = "Bleak House")
dickens564 <- gutenberg_download(c(564)) %>%
mutate(book = "The Mystery of Edwin Drood")
dickens <- bind_rows(dickens786, dickens1400, dickens730, dickens766, dickens1023, dickens564)
dickens <- subset(dickens, select = c(text,book))
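As an aside, gutenbergr can also fetch several IDs in a single call and attach each book's title via its meta_fields argument. A rough sketch of that alternative (the object name dickens_alt is illustrative, and the titles returned by the Gutenberg metadata may differ slightly from the short names used above):
dickens_alt <- gutenberg_download(
  c(786, 1400, 730, 766, 1023, 564),
  meta_fields = "title"
) %>%
  rename(book = title) %>% # title column comes from the Gutenberg metadata
  select(text, book)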
Index the books by chapter and line number and tokenize the words into a tidy dataset.
tidy_dickens <- dickens %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
      ignore_case = TRUE
    )))
  ) %>%
  ungroup() %>%
  unnest_tokens(word, text)
Remove the stop words from the tidy_dickens dataset, then identify the most common words in the works of Dickens by wordcount.
data(stop_words)
tidy_dickens <- tidy_dickens %>%
anti_join(stop_words)
tidy_dickens %>%
count(word, sort = TRUE)
## # A tibble: 25,701 x 2
## word n
## <chr> <int>
## 1 time 2385
## 2 sir 2238
## 3 dear 2045
## 4 miss 1956
## 5 hand 1710
## 6 head 1626
## 7 night 1442
## 8 house 1394
## 9 day 1363
## 10 looked 1243
## # ... with 25,691 more rows
Identify a new lexicon for sentiment analysis: the “loughran” (Loughran-McDonald) lexicon, which is available through tidytext's get_sentiments() function.
get_sentiments("loughran")
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ... with 4,140 more rows
Compare the sentiment across all of the works of Dickens by joining to the “loughran” lexicon, chopping the works into 80-line indexed sections, counting the sentiment categories within each section, and subtracting the negative-leaning categories (negative, constraining, litigious, and uncertainty) from the positive count to derive a net sentiment for each section.
dickens_sentiment <- tidy_dickens %>%
inner_join(get_sentiments("loughran")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative - constraining - litigious - uncertainty)
head(dickens_sentiment)
## # A tibble: 6 x 9
## book index constraining litigious negative positive superfluous uncertainty
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Blea~ 0 0 3 1 1 0 0
## 2 Blea~ 1 0 10 19 4 0 4
## 3 Blea~ 2 0 2 5 0 0 0
## 4 Blea~ 3 0 16 16 0 0 1
## 5 Blea~ 4 0 8 17 0 0 1
## 6 Blea~ 5 2 7 2 2 0 3
## # ... with 1 more variable: sentiment <dbl>
The loughran lexicon has six categories of words: constraining, litigious, negative, positive, superfluous, and uncertainty. All of the categories other than “positive” appear to lean negative, which could skew the overall net sentiment heavily toward the negative.
Let’s look further into the details of the sentiment lexicon itself to see what might be happening.
loughran_word_counts <- tidy_dickens %>%
inner_join(get_sentiments("loughran")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
loughran_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Contribution to sentiment",
x = NULL
) +
coord_flip()
Words such as “obliged” and “committed” in the constraining category could have had positive connotations at the time Dickens wrote, as in “I am much obliged, madam” or “I have committed all of my love to you,” both of which read as positive. Likewise, the words “justice” and “consent” in the litigious category could appear in Dickens' work in contexts such as “Justice has been done” and “I consent to grant you the hand of my daughter in marriage,” which readers of the time would have taken as positive.
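One way to test that hunch is to treat such words as neutral for this corpus and recompute the net sentiment. A minimal sketch follows; the object name era_neutral and the word list are illustrative (just the handful of words discussed above), not the result of any systematic review.
# words whose Loughran categories may not fit Victorian-era usage (illustrative list)
era_neutral <- c("obliged", "committed", "justice", "consent")
dickens_sentiment_adj <- tidy_dickens %>%
  filter(!word %in% era_neutral) %>%
  inner_join(get_sentiments("loughran")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative - constraining - litigious - uncertainty)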
Plot the sentiments for each work.
ggplot(dickens_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
As expected given the makeup of the loughran lexicon, most of the works of Dickens come out with an overall negative sentiment in this analysis.
The process of sentiment analysis is not a straightforward one. The choice of sentiment lexicon is vital to the success or failure of the analysis, and using a lexicon poorly suited to the context can lead to misleading conclusions. So the moral of this story is: choose your lexicons wisely!
Overall for this assignment, I found the NRC lexicon to be the most useful, since it was the one most balanced between positive and negative words.
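That impression can be checked by extending the earlier positive/negative balance table to all four lexicons. A rough sketch (treating AFINN scores above zero as positive and all others as negative, and reusing the nrc_sentiment and bing_sentiment summaries built earlier; the names afinn_sentiment and loughran_sentiment are illustrative):
afinn_sentiment <- get_sentiments("afinn") %>%
  mutate(sentiment = if_else(value > 0, "positive", "negative")) %>% # numeric scores mapped to labels
  count(sentiment) %>%
  mutate(lexicon = "afinn")
loughran_sentiment <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive", "negative")) %>% # drop the other four Loughran categories
  count(sentiment) %>%
  mutate(lexicon = "loughran")
bind_rows(nrc_sentiment, bing_sentiment, afinn_sentiment, loughran_sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(pct_pos = positive / (positive + negative))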