Overview: This week's assignment is on text mining. I start by getting the primary example code from Chapter 2 of Text Mining with R working in an R Markdown document. I will then extend the code in two ways:
- Work with a different corpus of my choosing, and
- Incorporate at least one additional sentiment lexicon
All of the code below is based on the primary example code from the book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson (2020-03-07). That work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
Load all the required packages.
library(tidyverse)
library(tidytext)
library(textdata)
library(tidyr)
library(gutenbergr)
library(wordcloud)

Load all the sentiment lexicons used below. (The loading calls were missing from the original document, but the later code references the objects afinn, bing and nrc, so I restore them here with get_sentiments().)

afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
As I don't want to repeat the same corpus and data as Text Mining with R: A Tidy Approach, I run the same example on a new corpus. Project Gutenberg offers plenty of texts to use as a corpus; I chose one of my favourite books, Gulliver's Travels into Several Remote Nations of the World by Jonathan Swift.
gulliver <- gutenberg_download(829)
gbooks <- gulliver %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)

I also use the following example code to compare sentiment across the three lexicons.
afinn2 <- gbooks %>%
  inner_join(afinn) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
bing_and_nrc2 <- bind_rows(
  gbooks %>%
    inner_join(bing) %>%
    mutate(method = "Bing et al."),
  gbooks %>%
    inner_join(nrc) %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
bind_rows(afinn2, bing_and_nrc2) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

It looks like most of the book carries positive sentiment, except around indexes 93 to 95.
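To see which words drive that negative stretch, here is a quick check of my own (not part of the book's example); it assumes the same 80-line index binning used above and the bing lexicon loaded earlier.

gbooks %>%
  mutate(index = linenumber %/% 80) %>%
  filter(index >= 93, index <= 95) %>%   # the dip visible in the plot above
  inner_join(bing) %>%                   # joins on the word column
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE)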
There is also the Loughran lexicon, which was not used in the example code.
loughran <- get_sentiments("loughran")

loughran2 <- gbooks %>%
  inner_join(loughran) %>%
  mutate(method = "Loughran") %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
bind_rows(afinn2, bing_and_nrc2, loughran2) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Judging by the plot alone, Loughran finds far more negative sentiment than the other three lexicons.
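One likely explanation is that the Loughran-McDonald lexicon was built for financial documents and sorts words into six categories (positive, negative, uncertainty, litigious, constraining, superfluous), with a much larger negative vocabulary than positive. A quick sketch of my own to see how its categories match this corpus:

gbooks %>%
  inner_join(loughran) %>%
  count(sentiment, sort = TRUE)   # matches per Loughran category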
I would like to remove all the stop_words and see the most frequent words in the book. I also want to see a wordcloud generated from the data.
data(stop_words)

gbooksStop <- gbooks %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
gbooksStop

## # A tibble: 7,766 x 2
## word n
## <chr> <int>
## 1 country 203
## 2 time 168
## 3 people 141
## 4 master 134
## 5 feet 121
## 6 majesty 111
## 7 found 110
## 8 hundred 109
## 9 court 99
## 10 _yahoos_ 97
## # … with 7,756 more rows
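The wordcloud call itself did not make it into this document; here is a minimal sketch, patterned on the book's Chapter 2 example (max.words = 100 is my choice). Setting a seed fixes the otherwise random layout.

set.seed(1234)   # make the layout reproducible between runs
gbooksStop %>%
  with(wordcloud(word, n, max.words = 100))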
It's really interesting to see that country, time and people are the top three words in the book! It is also interesting that the wordcloud function generates a different arrangement, and sometimes a different subset of words, every time it is re-run (unless the seed is fixed, as above), so the visual result may not align exactly with the counts.