The task at hand was to reproduce the primary example code from Chapter 2 on sentiment analysis in Text Mining with R (Silge & Robinson, 2017), and then to extend it by working with either a different corpus or another sentiment lexicon.
After loading the required packages, we pull in the AFINN, Bing, and NRC lexicons with get_sentiments(). All three are based on single words: AFINN assigns each word a numeric score between -5 and 5, Bing labels words as positive or negative, and NRC additionally tags emotions such as trust, fear, and sadness.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
Sentiment analysis can be done with an inner join. We tidy the Jane Austen books into one word per row and add columns for the line number and chapter that each word belongs to. To find the words tagged as "joy" in Emma, we first filter the NRC lexicon down to its joy words and filter the tidy data set to that one book; an inner_join then lets us count how often each joy word appears. (Common words could also be removed with an anti_join against a stop-word data set, as we do later.) Finally, we can compute a net sentiment score (positive minus negative) for 80-line sections of each book and plot it, which shows visually how sentiment changes over the course of each novel.
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE
)))
) %>%
ungroup() %>%
unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
jane_austen_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
It is also interesting to compare the lexicons with one another. Focusing on Pride and Prejudice, we can see how the three lexicons track changes in sentiment through the book. AFINN produces the largest absolute values and the most variance. NRC skews more positive, with generally higher sentiment. Bing produces lower values and finds longer stretches of text with the same positive or negative association. Overall, the three lexicons find similar sentiment trends through the novel. Bing yields larger negative values in part because its lexicon contains a higher ratio of negative to positive words than NRC, as the counts below show.
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ... with 122,194 more rows
afinn <- pride_prejudice %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` ungrouping output (override with `.groups` argument)
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c(
"positive",
"negative"
))) %>%
mutate(method = "NRC")
) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
bind_rows(
afinn,
bing_and_nrc
) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
get_sentiments("nrc") %>%
filter(sentiment %in% c(
"positive",
"negative"
)) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
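As a small follow-up of my own (not part of the chapter code), the negative-to-positive ratio of each lexicon can be computed directly from these counts (roughly 1.4 for NRC and 2.4 for Bing), which helps explain why Bing yields lower net sentiment values:
bind_rows(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(sentiment) %>%
    mutate(lexicon = "NRC"),
  get_sentiments("bing") %>%
    count(sentiment) %>%
    mutate(lexicon = "Bing")
) %>%
  # one row per lexicon, with negative and positive counts as columns
  spread(sentiment, n) %>%
  mutate(neg_pos_ratio = negative / positive)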
We can also find which words contribute the most to each sentiment. Some of these words can then be excluded by defining custom stop words. For example, "miss" is coded as a negative word, but in Jane Austen's novels it is overwhelmingly used as a title for young, unmarried women. Removing it gives a more accurate sentiment estimate.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Contribution to sentiment",
x = NULL
) +
coord_flip()
## Selecting by n
custom_stop_words <- bind_rows(
tibble(
word = c("miss"),
lexicon = c("custom")
),
stop_words
)
custom_stop_words
## # A tibble: 1,150 x 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ... with 1,140 more rows
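To see the effect, one option (a small sketch of my own, not from the chapter; here I drop only the word "miss" rather than the full custom stop-word list) is to filter it out before joining the Bing lexicon and recomputing the counts:
tidy_books %>%
  # drop the custom word before scoring, leaving everything else intact
  filter(word != "miss") %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)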
We can also visualize word frequencies with the wordcloud package, and comparison.cloud lets us contrast the frequencies of negative and positive words in a single cloud.
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("gray20", "gray80"),
max.words = 100
)
## Joining, by = "word"
Instead of looking at single words, it can be useful to look at n-grams, sentences, or paragraphs, since words can take on a different meaning in context, especially with negation: "bad" is negative, but "not bad at all" is positive (a small bigram sketch of this idea follows). We can also split the text into tokens with a regex, for example to separate the books into chapters. Finally, we can find the chapter in each book with the highest proportion of negative words, which may point to where something tragic happens in the plot.
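As a quick illustration of the negation point (this sketch is my own addition, not code from the chapter; the vector negation_words is defined here just for the example), we can tokenize Pride and Prejudice into bigrams and look for AFINN-scored words preceded by a negation term:
# negation terms to look for (chosen for illustration only)
negation_words <- c("not", "no", "never", "without")

tibble(text = prideprejudice) %>%
  # tokenize into two-word sequences instead of single words
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # keep only bigrams whose first word is a negation term
  filter(word1 %in% negation_words) %>%
  # score the second word with AFINN
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE)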
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text,
token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]"
) %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords / words) %>%
filter(chapter != 0) %>%
top_n(1) %>%
ungroup()
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
## Selecting by ratio
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
Similar to the Jane Austen analysis, we can look at books by Charles Dickens using the gutenbergr package. Here we use A Tale of Two Cities, Oliver Twist, Our Mutual Friend, David Copperfield, Bleak House, and Little Dorrit. We then tidy the data, split it into word tokens, and remove stop words. I used custom_stop_words here because Dickens also uses "miss" frequently.
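As an aside (my own sketch, not part of the original write-up), the Gutenberg IDs used below could be looked up in the gutenbergr metadata, for example by filtering gutenberg_works() on the author:
# list Dickens works available as text, with their IDs and titles
gutenberg_works(author == "Dickens, Charles") %>%
  select(gutenberg_id, title)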
dickens <- gutenberg_download(c(98, 730, 766, 883, 1023, 963))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_dickens <- dickens %>%
group_by(gutenberg_id) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE
)))
) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
anti_join(custom_stop_words) %>%
mutate(book = case_when(
gutenberg_id == 98 ~ "A Tale of Two Cities",
gutenberg_id == 730 ~ "Oliver Twist",
gutenberg_id == 766 ~ "David Copperfield",
gutenberg_id == 883 ~ "Our Mutual Friend",
gutenberg_id == 1023 ~ "Bleak House",
gutenberg_id == 963 ~ "Little Dorrit")) %>%
select(-gutenberg_id) %>%
select(book, everything())
## Joining, by = "word"
tidy_dickens
## # A tibble: 534,491 x 4
## book linenumber chapter word
## <chr> <int> <int> <chr>
## 1 A Tale of Two Cities 1 0 tale
## 2 A Tale of Two Cities 1 0 cities
## 3 A Tale of Two Cities 3 0 story
## 4 A Tale of Two Cities 3 0 french
## 5 A Tale of Two Cities 3 0 revolution
## 6 A Tale of Two Cities 5 0 charles
## 7 A Tale of Two Cities 5 0 dickens
## 8 A Tale of Two Cities 8 0 contents
## 9 A Tale of Two Cities 11 0 book
## 10 A Tale of Two Cities 11 0 recalled
## # ... with 534,481 more rows
Here are some of the most frequent words in these novels.
tidy_dickens %>%
count(word, sort = TRUE)
## # A tibble: 27,833 x 2
## word n
## <chr> <int>
## 1 time 3008
## 2 sir 2861
## 3 dear 2779
## 4 hand 2383
## 5 head 2156
## 6 day 1814
## 7 night 1811
## 8 house 1808
## 9 eyes 1799
## 10 looked 1644
## # ... with 27,823 more rows
Here is a visualization of how sentiment changes across the plot of each of these Charles Dickens novels. We can see that there are more negative sections than positive ones.
tidy_dickens %>%
inner_join(get_sentiments("bing")) %>%
count(book, index = linenumber %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(., aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
## Joining, by = "word"
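To back up the claim that negative sections outnumber positive ones, a quick check (my own addition, reusing the same pipeline as the plot above) counts the 80-line sections with negative versus positive net sentiment in each book:
tidy_dickens %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net = positive - negative) %>%
  group_by(book) %>%
  # how many sections lean negative vs. positive in each novel
  summarise(
    negative_sections = sum(net < 0),
    positive_sections = sum(net > 0)
  )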
Here are the top negative and positive words that Dickens uses in these six novels, first as a bar chart and then as a comparison word cloud.
tidy_dickens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(
y = "Contribution to sentiment",
x = NULL
) +
coord_flip()
## Joining, by = "word"
## Selecting by n
tidy_dickens %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(
colors = c("gray20", "gray80"),
max.words = 100
)
## Joining, by = "word"
At first, I thought that AFINN would be the best lexicon to use because it assigns numeric sentiment values, but I ended up preferring the Bing lexicon, since in practice we do not know how strongly positive or negative a word is in its context anyway. It was also interesting to learn how to search for and download novels with the gutenbergr package.
Silge, J., & Robinson, D. (2017). Chapter 2: Sentiment Analysis with Tidy Data. In Text mining with R: A tidy approach (pp. 13-30). Sebastopol, CA: O’Reilly.↩︎