In this week’s assignment we work with sentiment analysis.
First, we read Chapter 2 of Text Mining with R, which covers sentiment analysis. The goal is to take the primary example code from that chapter and replicate it in R Markdown.
We then extend the code in two ways:
Work with a different corpus of your choosing, and
Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).
At the end, an .Rmd file will be posted in your GitHub repository and to rpubs.com.
## Code Initiation
Here I load the required libraries and check that all required packages are installed before running the code blocks that follow.
## [1] "All required packages are installed"
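One way such a check could look is sketched below. The package list is an assumption inferred from the functions called later in this report (for example `acast()` from reshape2 and `Corpus()` from tm); it is not necessarily the exact list used here.

```r
# Packages assumed from the functions used later in this report
required_packages <- c("tidytext", "dplyr", "tidyr", "stringr", "ggplot2",
                       "gutenbergr", "janeaustenr", "syuzhet",
                       "RColorBrewer", "wordcloud", "reshape2", "tm")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) == 0) {
  print("All required packages are installed")
} else {
  message("Install before knitting: ", paste(missing_packages, collapse = ", "))
}
```

Checking with `requireNamespace()` rather than calling `library()` directly lets the report name every missing package at once instead of stopping at the first one.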
In this section, we replicate the code from Chapter 2 of the book mentioned above. I chose Mark Twain as the author and downloaded his books from Project Gutenberg. I first narrowed the list to books that appear under a single title and then picked stories from among them.
I also considered downloading his best-known books, but I had difficulty downloading all of them from Project Gutenberg and stopped after some time.
Our analysis follows the sentiment analysis approach outlined in the textbook (Silge & Robinson, 2017).
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
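As the listings above show, AFINN gives each word a number while Bing and NRC attach labels, so the two kinds of lexicon are scored differently. A minimal base R sketch of both scoring styles on a made-up sentence follows. The tiny lexicons are invented for illustration (only `abandon = -2` is taken from the AFINN listing above) and are not the real data.

```r
# Toy lexicons (hypothetical values, except abandon = -2 from the AFINN output above)
toy_afinn <- data.frame(word = c("abandon", "good", "awful"),
                        value = c(-2, 3, -3))
toy_bing <- data.frame(word = c("abandon", "good", "awful"),
                       sentiment = c("negative", "positive", "negative"))

# A made-up sentence, split into lower-case words
sentence <- "A good day turned awful then we abandon hope"
words <- data.frame(word = strsplit(tolower(sentence), " ")[[1]])

# Numeric lexicon: join on word, then sum the values
afinn_score <- sum(merge(words, toy_afinn, by = "word")$value)

# Categorical lexicon: join on word, then count positive minus negative
bing_hits <- merge(words, toy_bing, by = "word")
bing_score <- sum(bing_hits$sentiment == "positive") -
  sum(bing_hits$sentiment == "negative")

afinn_score  # -2: 3 (good) - 3 (awful) - 2 (abandon)
bing_score   # -1: one positive word, two negative words
```

The two scores differ in size because a numeric lexicon weighs words by intensity, whereas a categorical one counts every matched word equally.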
library(janeaustenr)
# Load Mark Twain books from Project Gutenberg
mark_twain_books <- gutenberg_works(author == "Twain, Mark")
# Add a column named book that keeps only the part of the title before the first em dash, colon, or comma
mark_twain_books <- mark_twain_books %>%
  mutate(book = gsub("[—:,].*", "", title))
# Select only the books that have one title
mark_twain_books_sel <- mark_twain_books %>%
count(book) %>%
filter(n == 1) %>%
inner_join(mark_twain_books, by = "book")
# Choose six Mark Twain books by title (the random sample_n() draw further down is commented out)
set.seed(2014)
MT_best_book_list <- c("The Innocents Abroad",
"Life on the Mississippi",
"A Connecticut Yankee in King Arthur's Court",
"The Prince and the Pauper",
"Adventures of Huckleberry Finn",
"The Adventures of Tom Sawyer")
mark_twain_books_sel_6 <- mark_twain_books %>%
filter(grepl(paste(MT_best_book_list, collapse = "|"), title)) %>%
select(book, title, everything())
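# This second list replaces the selection above: these six single-title books are the ones analyzed
MT_one_book_list <- c("A Horse's Tale",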
"The Adventures of Huckleberry Finn \\(Tom Sawyer's Comrade\\)",
"The Man That Corrupted Hadleyburg",
"Tom Sawyer Abroad",
"Tom Sawyer, Detective",
"The American Claimant")
mark_twain_books_sel_6 <- mark_twain_books %>%
filter(grepl(paste(MT_one_book_list, collapse = "|"), title)) %>%
select(book, title, everything())
#mark_twain_books_sel_6 <- sample_n(mark_twain_books_sel, 6)
mark_t_books_downloaded <- list()
DF_book <- tibble(book = character(0), title = character(0), text = character(0))
# Download each of the 6 selected books
for (i in seq_along(mark_twain_books_sel_6$gutenberg_id)) {
  # Get the ID and title of the book
  book_id <- mark_twain_books_sel_6$gutenberg_id[i]
  book_title <- mark_twain_books_sel_6$title[i]
  # Download from Project Gutenberg
  mark_t_books_downloaded[[i]] <- gutenberg_download(book_id)
  rep_size <- length(mark_t_books_downloaded[[i]]$text)
  # Combine all text paragraphs into a single text
  #book_text <- paste(mark_t_books_downloaded[[i]]$text, collapse = " ")
  # Append one row per line of text, tagged with the book ID and title
  DF_book <- rbind(DF_book, tibble(book = rep(book_id, rep_size), title = rep(book_title, rep_size), text = mark_t_books_downloaded[[i]]$text))
}
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
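The title clean-up step above keeps only the text before the first em dash, colon, or comma. A small sketch of that substitution on example titles follows. The `titles` vector is made up for illustration, and `\u2014` is the escaped form of the em dash used in the pattern.

```r
titles <- c("Tom Sawyer, Detective",
            "Following the Equator: A Journey Around the World",
            "Roughing It")

# Drop everything from the first em dash, colon, or comma onward
book <- gsub("[\u2014:,].*", "", titles)
book

# Titles that collapse to the same short name can then be counted
table(book)
```

Note that "Tom Sawyer, Detective" is shortened to "Tom Sawyer", so a title whose distinguishing words come after a comma loses them. This is worth keeping in mind when reading the `book` column.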
tidy_MT_book <- DF_book %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
#Get sentences
tidy_MT_book_sentence <- DF_book %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(sentence, text, token = "sentences")
#tidy_books <- austen_books() %>%
# group_by(book) %>%
# mutate(
# linenumber = row_number(),
# chapter = cumsum(str_detect(text,
# regex("^chapter [\\divxlc]",
# ignore_case = TRUE)))) %>%
# ungroup() %>%
# unnest_tokens(word, text)
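The running chapter number in the tokenizing code above comes from `cumsum()` over a logical vector: every line that looks like a chapter heading adds one. A base R sketch of the same idea, using `grepl()` in place of `str_detect()` and a few invented lines of text:

```r
text <- c("THE BOOK",            # front matter, before any chapter
          "CHAPTER I",
          "Some opening text",
          "Chapter 2",
          "More text",
          "chapters are fun")    # not a heading: no space and numeral after "chapter"

# TRUE where a line starts with "chapter", a space, then a digit or roman numeral letter
is_heading <- grepl("^chapter [0-9ivxlc]", text, ignore.case = TRUE)

# cumsum() turns the TRUE/FALSE flags into a running chapter counter
chapter <- cumsum(is_heading)
chapter
```

Lines before the first heading get chapter 0, which is why chapter 0 can be filtered out when chapters are compared. Books without matching headings stay at 0 throughout.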
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
# Choose a random book
random_book <- sample(unique(tidy_MT_book$title),1)
tidy_MT_book %>%
filter(title == random_book) %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 149 × 2
## word n
## <chr> <int>
## 1 good 45
## 2 child 28
## 3 mother 23
## 4 beautiful 16
## 5 music 12
## 6 glad 9
## 7 kind 9
## 8 love 8
## 9 pretty 8
## 10 salute 8
## # ℹ 139 more rows
mark_twain_sentiment <- tidy_MT_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
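The many-to-many warning above appears when a word occurs more than once on both sides of the join: repeatedly in the text and on more than one row of the lexicon. The sketch below reproduces that situation with an invented lexicon in which one word carries two labels. It uses the `relationship` argument named in the warning (available in dplyr 1.1.0 and later) to state that the duplication is expected.

```r
library(dplyr)

tokens <- tibble(word = c("envy", "good", "envy"))

# Invented lexicon: "envy" is listed as both positive and negative
toy_lexicon <- tibble(
  word      = c("envy", "envy", "good"),
  sentiment = c("positive", "negative", "positive")
)

# Declaring the relationship silences the warning without changing the result
joined <- tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

# Each "envy" token now counts once as positive and once as negative
count(joined, word, sentiment)
```

Whether a doubly labeled word should count on both sides is an analysis decision. Removing such words from the lexicon before joining is one possible alternative.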
mark_twain_sentiment_2 <- tidy_MT_book %>%
group_by(title) %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")))%>%
mutate(method = "NRC")%>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1169 of `x` matches multiple rows in `y`.
## ℹ Row 4848 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
ggplot(mark_twain_sentiment, aes(index, sentiment, fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 2, scales = "free_x")
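The per-block scores plotted above come from three steps: integer division of the line number by 80, a count per block and label, and the difference between the positive and negative columns. A sketch of those steps on invented word hits (the `hits` table is made up for illustration):

```r
library(dplyr)
library(tidyr)

# Invented word hits: the line each word came from and its lexicon label
hits <- tibble(
  linenumber = c(3, 40, 79, 80, 95, 160, 161, 170),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive", "positive")
)

net_sentiment <- hits %>%
  # Integer division puts lines 0-79 in index 0, 80-159 in index 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per label; values_fill = 0 covers blocks that lack a label
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

net_sentiment
```

Without `values_fill = 0`, a block containing only positive or only negative words would produce `NA` in the subtraction.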
# Use get_nrc_sentiment() to evaluate all books
MT_sentiment_scores <- tidy_MT_book %>%
group_by(title) %>%
summarize(words = toString(word)) %>%
ungroup() %>%
mutate(sentiment = get_nrc_sentiment(words, language = "english"))
# Flatten the nested data
flattened_data <- MT_sentiment_scores %>%
unnest(cols = sentiment)
barplot(
colSums(prop.table(flattened_data[, 3:12])),
space = 0.2,
horiz = FALSE,
las = 1,
cex.names = 0.7,
col = brewer.pal(n = 8, name = "Set3"),
main = "A Few of Mark Twain's Books",
sub = "Analysis by KP",
xlab="emotions", ylab = NULL)
# First, let's reshape the data into a long format
flattened_data_long <- flattened_data %>%
pivot_longer(cols = 3:12, names_to = "sentiment", values_to = "score")
# Now, create the plot
ggplot(flattened_data_long, aes(x = title, y = score, fill = title)) +
geom_bar(stat = "identity") +
facet_wrap(~ sentiment, scales = "free_y", ncol = 2) +
labs(x = "Title", y = "Sentiment Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The code in this part is adapted from the textbook (Silge & Robinson, 2017). It has been replicated and modified as needed.
Huckleberry_Finn <- tidy_MT_book %>%
filter(title == "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)")
afinn <- Huckleberry_Finn %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
Huckleberry_Finn %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
Huckleberry_Finn %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))
) %>%
mutate(method = "NRC")) %>%
count(method, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 6045 of `x` matches multiple rows in `y`.
## ℹ Row 1441 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")+
labs(title ="The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)") +
xlab("Sentiment over pages")
#get_sentiments("nrc") %>%
# filter(sentiment %in% c("positive", "negative")) %>%
# count(sentiment)
#get_sentiments("bing") %>%
# count(sentiment)
The code below is adapted from the sentiment analysis chapter of Text Mining with R (tidytextmining.com). It has been changed to analyze part of Mark Twain’s body of work.
bing_word_counts <- tidy_MT_book %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
bing_word_counts %>%
filter(n > 80) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
coord_flip() +
labs(y = "Contribution to sentiment")
custom_stop_words <- bind_rows(tibble(word = c("miss"),
lexicon = c("custom")),
stop_words)
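`custom_stop_words` is defined above but only takes effect once it is passed to `anti_join()`. A sketch with a made-up stop word list (standing in for tidytext's `stop_words`, which has the same two columns) and invented tokens:

```r
library(dplyr)

# Stand-in for tidytext::stop_words
toy_stop_words <- tibble(word = c("the", "and"), lexicon = "toy")

# Prepend a custom entry, as in the code above
custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),
  toy_stop_words
)

tokens <- tibble(word = c("miss", "watson", "and", "the", "widow", "miss"))

# anti_join() keeps only the tokens that do not appear in the stop word list
kept <- tokens %>%
  anti_join(custom_stop_words, by = "word")

kept$word
```

Adding "miss" matters when it is mostly used as a form of address rather than as a negative word.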
The word cloud code below is likewise adapted from the sentiment analysis chapter of Text Mining with R, changed to analyze part of Mark Twain’s body of work.
library(wordcloud)
tidy_MT_book %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
tidy_MT_book %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"),
max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
The sentence and chapter code below is likewise adapted from the sentiment analysis chapter of Text Mining with R, changed to analyze part of Mark Twain’s body of work.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
MT_chapters <- DF_book %>%
group_by(title) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
MT_chapters %>%
group_by(title) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## title chapters
## <chr> <int>
## 1 A Horse's Tale 1
## 2 The Adventures of Huckleberry Finn (Tom Sawyer's Comrade) 44
## 3 The American Claimant 51
## 4 The Man That Corrupted Hadleyburg 1
## 5 Tom Sawyer Abroad 27
## 6 Tom Sawyer, Detective 23
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_MT_book %>%
group_by(title, chapter) %>%
summarize(words = n())
## `summarise()` has grouped output by 'title'. You can override using the
## `.groups` argument.
tidy_MT_book %>%
  semi_join(bingnegative) %>%
  group_by(title, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("title", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>%
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'title'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 5
## title chapter negativewords words ratio
## <chr> <int> <int> <int> <dbl>
## 1 The Adventures of Huckleberry Finn (Tom Sa… 13 76 2057 0.0369
## 2 The American Claimant 22 3 29 0.103
## 3 Tom Sawyer Abroad 12 1 5 0.2
## 4 Tom Sawyer, Detective 5 1 7 0.143
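The ratio above divides the number of negative words in a chapter by that chapter's total word count. A compact sketch of the same steps on invented tokens (the `toy_negative` list is hypothetical, not the Bing lexicon):

```r
library(dplyr)

tokens <- tibble(
  chapter = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  word    = c("dark", "river", "raft", "bad",
              "sick", "dark", "kill", "home", "raft")
)
toy_negative <- tibble(word = c("dark", "bad", "sick", "kill"))

# Total words per chapter
wordcounts <- tokens %>%
  group_by(chapter) %>%
  summarize(words = n())

negative_ratio <- tokens %>%
  semi_join(toy_negative, by = "word") %>%   # keep negative words only
  group_by(chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = "chapter") %>%
  mutate(ratio = negativewords / words) %>%
  slice_max(ratio, n = 1)                    # chapter with the highest share

negative_ratio
```

Very short chapters can dominate a ranking like this, as the rows with 5 and 7 words in the output above show. A minimum word count filter before `slice_max()` is one possible guard.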
In this section, we use the emotions listed in the NRC lexicon to learn more about how emotions change across the books.
The code in this section is adapted from the Programming Historian.
Huckleberry_Finn <- tidy_MT_book %>%
filter(title == "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)")
text_words <- get_tokens(Huckleberry_Finn$word)
sentiment_scores_sum <- get_nrc_sentiment(toString(text_words), language = "english")
sentiment_scores <- get_nrc_sentiment(text_words, language = "english")
barplot(
colSums(prop.table(sentiment_scores[, 1:8])),
space = 0.2,
horiz = FALSE,
las = 1,
cex.names = 0.7,
col = brewer.pal(n = 8, name = "Set3"),
main = "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)",
sub = "Analysis by KP",
xlab="emotions", ylab = NULL)
sad_words <- text_words[sentiment_scores$sadness> 0]
sad_word_order <- sort(table(unlist(sad_words)), decreasing = TRUE)
head(sad_word_order, n = 12)
##
## dark widow bad awful kill leave sick black shot steal
## 71 47 40 34 33 33 33 32 32 30
## runaway broke
## 28 26
cloud_emotions_data <- c(
paste(text_words[sentiment_scores$sadness> 0], collapse = " "),
paste(text_words[sentiment_scores$joy > 0], collapse = " "),
paste(text_words[sentiment_scores$anger > 0], collapse = " "),
paste(text_words[sentiment_scores$fear > 0], collapse = " "))
cloud_corpus <- Corpus(VectorSource(cloud_emotions_data))
cloud_tdm <- TermDocumentMatrix(cloud_corpus)
cloud_tdm <- as.matrix(cloud_tdm)
head(cloud_tdm)
## Docs
## Terms 1 2 3 4
## _bang 1 0 1 1
## _bang_ 2 0 2 2
## _beg_ 1 0 0 0
## _case 1 0 0 1
## _dark 1 0 0 0
## _leave_ 2 0 0 0
# Label the columns in the order the documents were built: sadness, joy, anger, fear
colnames(cloud_tdm) <- c('sadness', 'joy', 'anger', 'fear')
head(cloud_tdm)
## Docs
## Terms sadness joy anger fear
## _bang 1 0 1 1
## _bang_ 2 0 2 2
## _beg_ 1 0 0 0
## _case 1 0 0 1
## _dark 1 0 0 0
## _leave_ 2 0 0 0
set.seed(2014) # this can be set to any integer
comparison.cloud(cloud_tdm, random.order = FALSE,
colors = c("green", "red", "orange", "blue"),
title.size = 1.0, max.words = 60, scale = c(2.5, 0.8), rot.per =0.3)
sentiment_valence <- (sentiment_scores$negative *-1) + sentiment_scores$positive
simple_plot(sentiment_valence)
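The valence vector above is positive minus negative, word by word. The sketch below builds such a vector from invented scores and smooths it with a centered moving average from base R's `stats::filter()`. The moving average is a stand-in chosen for illustration; it is not a description of what `simple_plot()` computes internally.

```r
# Invented per-word scores, shaped like the positive/negative columns used above
scores <- data.frame(positive = c(1, 0, 0, 1, 1, 0, 0),
                     negative = c(0, 1, 1, 0, 0, 0, 1))

# Same arithmetic as sentiment_valence above
valence <- (scores$negative * -1) + scores$positive
valence

# Centered 3-point moving average; the first and last values are NA
smoothed <- stats::filter(valence, rep(1 / 3, 3), sides = 2)
as.numeric(smoothed)
```

Smoothing of some kind is useful here because most individual words score 0, so the raw valence is too jagged to read as a trend.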
Reference:
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. Retrieved from https://www.tidytextmining.com/sentiment.html