tidytext
packageThis experiment/ playground activity is based on the content of chapter 3 of the “Tidy Text Mining with R” book and it is related to a blog post available at my personal site. The code is available in the following repository.
The required packages used for this activity are
require(tidytext)
require(tidyverse)
require(janeaustenr)
require(stringr)
require(wordcloud)
require(reshape2)
sentiment
datasetThe tidytext
package contains 3 sentiment lexicons in the sentiments
dataset.
`Three lexicons for sentiment analysis are combined here in a tidy data frame. The lexicons are the NRC Emotion Lexicon from Saif Mohammad and Peter Turney, the sentiment lexicon from Bing Liu and collaborators, and the lexicon of Finn Arup Nielsen. Words with non-ASCII characters were removed from the lexicons.`
The sentiments
dataset includes the following features
word
, an english word (unigram)sentiment
, one of either positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, trust, or NA.
lexicon
, the source either “nrc”, “bing” or “AFINN”score
, A numerical score for the sentiment. This value is NA for the Bing and NRC lexicons, and runs between -5 and 5 for the AFINN lexicon.The bing
lexicon categorizes words into a positive
or negative
flag; the nrc
lexicon categorizes words into emotions like anger, disgust, fear, joy, ...
; the AFINN
lexicon categorizes words using a score, with negative scores indicating a negative sentiment. Lexicons do not take into account qualifiers before a word, such as in “no good” or “not true”.
`These dictionary-based methods find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text. Not every English word is in the lexicons because many English words are pretty neutral. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only. For many kinds of text (like the narrative examples below), there are not sustained sections of sarcasm or negated text, so this is not an important effect.`
Some more information about the dataset and its content…
#dataset structure
str(sentiments)
## Classes 'tbl_df', 'tbl' and 'data.frame': 23165 obs. of 4 variables:
## $ word : chr "abacus" "abandon" "abandon" "abandon" ...
## $ sentiment: chr "trust" "fear" "negative" "sadness" ...
## $ lexicon : chr "nrc" "nrc" "nrc" "nrc" ...
## $ score : int NA NA NA NA NA NA NA NA NA NA ...
#no of entries for the different sources
sentiments %>%
group_by(lexicon) %>%
summarise(noOfWords = n())
## # A tibble: 3 × 2
## lexicon noOfWords
## <chr> <int>
## 1 AFINN 2476
## 2 bing 6788
## 3 nrc 13901
#NRC & BING lexicons use the `sentiment` feature
#no of words by sentiment
sentiments %>%
filter(lexicon %in% c("nrc", "bing")) %>%
group_by(lexicon, sentiment) %>%
summarise(noOfWords = n())
## Source: local data frame [12 x 3]
## Groups: lexicon [?]
##
## lexicon sentiment noOfWords
## <chr> <chr> <int>
## 1 bing negative 4782
## 2 bing positive 2006
## 3 nrc anger 1247
## 4 nrc anticipation 839
## 5 nrc disgust 1058
## 6 nrc fear 1476
## 7 nrc joy 689
## 8 nrc negative 3324
## 9 nrc positive 2312
## 10 nrc sadness 1191
## 11 nrc surprise 534
## 12 nrc trust 1231
#AFINN lexicon uses the `score` feature
#no of words by score
sentiments %>%
filter(lexicon %in% c("AFINN")) %>%
group_by(lexicon) %>%
count(cut_width(score, 1))
## Source: local data frame [11 x 3]
## Groups: lexicon [?]
##
## lexicon `cut_width(score, 1)` n
## <chr> <fctr> <int>
## 1 AFINN [-5.5,-4.5] 16
## 2 AFINN (-4.5,-3.5] 43
## 3 AFINN (-3.5,-2.5] 264
## 4 AFINN (-2.5,-1.5] 965
## 5 AFINN (-1.5,-0.5] 309
## 6 AFINN (-0.5,0.5] 1
## 7 AFINN (0.5,1.5] 208
## 8 AFINN (1.5,2.5] 448
## 9 AFINN (2.5,3.5] 172
## 10 AFINN (3.5,4.5] 45
## 11 AFINN (4.5,5.5] 5
get_sentiments()
functionThe get_sentiments()
function can be used ot get a tidy dataset for a specific lexicon…
#AFINN
afinn_lexicon <- get_sentiments("afinn")
head(afinn_lexicon)
## # A tibble: 6 × 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
#NRC
nrc_lexicon <- get_sentiments("nrc")
head(nrc_lexicon)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
#BING
bing_lexicon <- get_sentiments("bing")
head(bing_lexicon)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
Assumptions:
Emma
?Emma
book and transform it into a tidy datasetnrc
lexicon and consider only the words tagged as joy
joy
words used in Emma
book#Get the emma book and transform it into a tidy dataset
tidy_books <- austen_books() %>%
filter(book == "Emma") %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
#Using the nrc lexicon, only the words that are associated to a sentiment of `joy`
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
#Summarize the usage of `joy` words
tidy_books %>%
semi_join(nrc_joy) %>%
count(word, sort = T)
## # A tibble: 303 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
bing
lexicon and use for each word a positive or negative contributionindex = row_number %/% 100
, index 0 represents the first 100 lines, index 1 represents the 2nd 100 lines, etc. - and use the index as the atomic unit to evaluate the sentimentno of positive words - no of negative words
, and use this index to represent the sentiment changes within the novel.tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
janeaustensentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 100, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
head(janeaustensentiment)
## Source: local data frame [6 x 5]
## Groups: book, index [6]
##
## book index negative positive sentiment
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 Sense & Sensibility 0 20 47 27
## 2 Sense & Sensibility 1 22 54 32
## 3 Sense & Sensibility 2 16 35 19
## 4 Sense & Sensibility 3 20 45 25
## 5 Sense & Sensibility 4 21 63 42
## 6 Sense & Sensibility 5 25 63 38
ggplot(data = janeaustensentiment, mapping = aes(x = index, y = sentiment, fill = book)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(facets = ~ book, ncol = 2, scales = "free_x")
The plot for each novel allows us to see how sentiment changes toward more positive or negative sentiment during the narration.
It all depends on the purposes of the sentiment analysis…
Pride & Prejudice
and transform into a tidy datasetafinn
way, an overall score per index based on words used in each indexbing
& nrc
ways, an overall index based on different nuomber of words used in each index positive - negative)pride_and_prejudice <- austen_books() %>%
filter(book == "Pride & Prejudice") %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
afinn <- pride_and_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 100) %>%
summarise(sentiment = sum(score)) %>%
mutate(method = "AFINN")
bing <- pride_and_prejudice %>%
inner_join(get_sentiments("bing")) %>%
group_by(index = linenumber %/% 100) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(method = "BING", sentiment = positive - negative)
nrc <- pride_and_prejudice %>%
inner_join(get_sentiments("nrc")) %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(index = linenumber %/% 100) %>%
count(sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(method = "NRC", sentiment = positive - negative)
cols_idx = c("index", "sentiment", "method")
sentiment_evaluations <- bind_rows(afinn[,cols_idx], bing[,cols_idx], nrc[,cols_idx])
ggplot(data = sentiment_evaluations, mapping = aes(x = index, y = sentiment, fill = method)) +
geom_bar(alpha = 0.7, stat = "identity", show.legend = F) +
facet_wrap(facets = ~ method, ncol = 1, scales = "free_y")
`The three different lexicons for calculating sentiment give results that are different in an absolute sense but have similar relative trajectories through the novel. We see similar dips and peaks in sentiment at about the same places in the novel, but the absolute values are significantly different.`
Something to keep in mind when comparing different lexicons, especially when using BING
or NRC
…
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2006
`Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than the NRC lexicon. This will contribute to the effect we see in the plot above, as will any systematic difference in word matches, e.g. if the negative words in the NRC lexicon do not match the words that Jane Austen uses very well.`
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = T) %>%
ungroup()
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
tmp <- bing_word_counts %>%
filter(n > 150) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n))
ggplot(data = tmp, mapping = aes(x = word, y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
Visualization is a powerful tool for investigation even when working with unstructured data. It helps to spot anomalies in the analysis with a glance. For example
`the word “miss” is coded as negative but it is used as a title for young, unmarried women in Jane Austen’s works. If it were appropriate for our purposes, we could easily add “miss” to a custom stop-words list.`
Another way to visualize information connected with text mining is to use wordclouds (alias tag clouds).
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word,n,max.words = 100))
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(max.words = 150)