Primary Example Code - Jane Austen Corpus
Loading the three sentiment lexicons used in the example
library(textdata)
library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
The example texts are six novels by the 19th-century author Jane Austen, loaded from the janeaustenr package. The books are then converted to a tidy format: the text is grouped by book, mutate() adds columns that record the original line number and the chapter (detected with a regular expression), and unnest_tokens() splits each line into one word per row.
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
austen_books()
## # A tibble: 73,422 x 2
## text book
## * <chr> <fct>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility
## 2 "" Sense & Sensibility
## 3 "by Jane Austen" Sense & Sensibility
## 4 "" Sense & Sensibility
## 5 "(1811)" Sense & Sensibility
## 6 "" Sense & Sensibility
## 7 "" Sense & Sensibility
## 8 "" Sense & Sensibility
## 9 "" Sense & Sensibility
## 10 "CHAPTER 1" Sense & Sensibility
## # ... with 73,412 more rows
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
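The same steps can be tried on a few lines of made-up text to see what each column ends up holding. This is only a sketch: the `toy` tibble and its contents are invented for illustration and are not part of the janeaustenr data.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical six-line "book" (invented text, not from janeaustenr)
toy <- tibble(
  book = "Toy Book",
  text = c("TOY BOOK", "", "CHAPTER 1", "It is a truth",
           "CHAPTER II", "Universally acknowledged")
)

toy_tidy <- toy %>%
  group_by(book) %>%
  # linenumber: position of the line within its book
  # chapter: running count of lines that look like a chapter heading
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  # one lowercased word per row; empty lines contribute no rows
  unnest_tokens(word, text)

toy_tidy
```

Lines before the first chapter heading get chapter 0, which is why the title lines of Pride & Prejudice appear with chapter 0 in the output further down.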
Here are the most frequent words in Austen’s novel Emma that the NRC lexicon associates with joy.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
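The filter-then-join pattern can be sketched on a tiny stand-in lexicon that needs no download. The `toy_nrc` rows reuse word and sentiment pairs that appear in the output above; the `toy_emma` word list is invented for illustration.

```r
library(dplyr)

# Stand-in for get_sentiments("nrc"): a word may appear once per sentiment
toy_nrc <- tibble(
  word      = c("good", "happy", "abandon", "abandon"),
  sentiment = c("joy",  "joy",   "fear",    "sadness")
)

# Keep only the rows tagged "joy"
toy_joy <- toy_nrc %>%
  filter(sentiment == "joy")

# Hypothetical tokenised text (one word per row)
toy_emma <- tibble(word = c("good", "good", "happy", "abandon", "emma"))

# inner_join keeps only words present in both tables,
# so "abandon" and "emma" drop out before counting
joy_counts <- toy_emma %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Passing `by = "word"` explicitly gives the same join as in the example and suppresses the "Joining, by" message.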
Using the Bing lexicon, the net sentiment of the six books is plotted across the course of each novel. The sentiment words are identified using an inner join, the text is divided into sections of 80 lines using integer division of the line number, and the net sentiment of each section is calculated by subtracting the number of negative words from the number of positive words. Last, the net sentiment per section of the six novels is graphed using ggplot and the facet_wrap function.
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Generally, it appears that Austen novels have more positive than negative sentiment. Some especially negative patterns are evident halfway through Pride and Prejudice as well as at the end of Mansfield Park.
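The sectioning and net-sentiment arithmetic can be checked by hand on a small made-up table. This is a sketch only: the line numbers and labels in `toy_words` are invented, and the table stands in for the result of joining the text to the Bing lexicon.

```r
library(dplyr)
library(tidyr)

# Hypothetical words already matched to a sentiment label
toy_words <- tibble(
  linenumber = c(1, 5, 79, 80, 81, 160, 161),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive", "positive")
)

toy_net <- toy_words %>%
  # integer division: lines 0-79 -> section 0, 80-159 -> 1, 160-239 -> 2
  count(index = linenumber %/% 80, sentiment) %>%
  # one column per sentiment; sections with no words of a kind get 0
  spread(sentiment, n, fill = 0) %>%
  # net sentiment = positive count minus negative count
  mutate(sentiment = positive - negative)

toy_net
```

Section 0 has two positive words and one negative word, so its net sentiment is 1; section 1 has only negative words, so it comes out at -2. In newer versions of tidyr, spread() is superseded by pivot_wider(), which could be used here instead.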
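Next, using the AFINN lexicon, the example looks at the sentiment of Pride and Prejudice. A similar technique is used: the lexicon is inner joined to the words of that one book. Because AFINN scores each word numerically, the values are summed within each 80-line section. The Bing and NRC lexicons are then applied to the same book for comparison.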
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>%
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>%
                            inner_join(get_sentiments("nrc") %>%
                                         filter(sentiment %in% c("positive",
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
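The difference between a numeric lexicon and a categorical one is easiest to see on a toy case. The sketch below uses a three-word stand-in for AFINN: the scores for "abandon" and "abhor" match the listing at the top of this example, while the score for "joyful" and the whole `toy_text` table are assumptions made up for illustration.

```r
library(dplyr)

# Stand-in for get_sentiments("afinn"); the "joyful" score is hypothetical
toy_lexicon <- tibble(
  word  = c("abandon", "abhor", "joyful"),
  value = c(-2, -3, 3)
)

# Hypothetical tokenised text with original line numbers
toy_text <- tibble(
  linenumber = c(10, 20, 90, 95),
  word       = c("abandon", "joyful", "abhor", "the")
)

toy_afinn <- toy_text %>%
  inner_join(toy_lexicon, by = "word") %>%   # "the" has no score and is dropped
  group_by(index = linenumber %/% 80) %>%    # 80-line sections
  summarise(sentiment = sum(value)) %>%      # sum scores instead of counting labels
  mutate(method = "AFINN (toy)")

toy_afinn
```

With a numeric lexicon the section score is a sum of word scores, whereas the Bing and NRC versions count positive and negative labels and subtract one count from the other.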
The sentiments can then be plotted for comparison of each lexicon.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
It appears that NRC estimates greater positive sentiment for this particular book, while only Bing predicts a net negative area halfway through the book.
Finally, using the Bing lexicon, the example obtains a word count of Austen’s works. The resulting data frame gives the positive or negative sentiment of each word in addition to its frequency.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
The results can then be compared graphically. One outlier is the word ‘miss’, which the lexicon classifies as negative, presumably in the sense of the opposite of a ‘hit’ or of an unfulfilled expectation. In Austen’s novels, it is more commonly used as the title of an unmarried woman - Miss Bingley, for example. While being an unmarried woman is generally seen as a negative in Jane Austen’s novels, for the purposes of this example it is an anomalous result.
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
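The data preparation that feeds this plot (the top words per sentiment, then an ordering by count) can be sketched without the plotting step. The counts in `toy_counts` are invented for illustration and are not taken from the novels.

```r
library(dplyr)

# Hypothetical word counts, three words per sentiment
toy_counts <- tibble(
  word      = c("miss", "poor", "doubt", "well", "good", "great"),
  sentiment = c("negative", "negative", "negative",
                "positive", "positive", "positive"),
  n         = c(5, 3, 1, 9, 7, 2)
)

toy_top <- toy_counts %>%
  group_by(sentiment) %>%
  top_n(2, n) %>%                      # two most frequent words per sentiment
  ungroup() %>%
  mutate(word = reorder(word, n))      # factor levels sorted by count

toy_top
levels(toy_top$word)
```

Naming the weighting column in `top_n(2, n)` avoids the "Selecting by n" message. The reorder() call is what makes the bars appear sorted by frequency rather than alphabetically.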
This bit of code extends the stop word list with a custom entry, adding ‘miss’ to the words that are excluded when generating a word count.
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
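How the custom list is then used to drop words can be sketched with a two-word stand-in for stop_words. Both `toy_stop` and `toy_tokens` are invented for illustration.

```r
library(dplyr)

# Stand-in for tidytext::stop_words (the real list is much longer)
toy_stop <- tibble(word = c("a", "the"), lexicon = "SMART")

# Same construction as above: custom entry first, then the standard list
toy_custom <- bind_rows(tibble(word = "miss", lexicon = "custom"),
                        toy_stop)

# Hypothetical tokenised text
toy_tokens <- tibble(word = c("miss", "the", "happy", "miss", "happy", "love"))

# anti_join keeps only the rows whose word is NOT in the stop list
toy_filtered <- toy_tokens %>%
  anti_join(toy_custom, by = "word") %>%
  count(word, sort = TRUE)

toy_filtered
```

An anti_join is the mirror image of the inner_join used for the lexicons: it removes matching rows instead of keeping them.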
Below is a word cloud of the most common words, with the stop words, including ‘miss’, omitted.
library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): happiness could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): happy could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): spirits could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): suppose could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): hope could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): heard could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): hear could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): subject could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): people could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): character could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): minutes could not be fit on
## page. It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): left could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): letter could not be fit on page.
## It will not be plotted.
## Warning in wordcloud(word, n, max.words = 100): comfort could not be fit on
## page. It will not be plotted.