library(gutenbergr)
library(dplyr)
library(tidyverse)
library(tidytext)
library(textdata)
library(stringr)
The Data
The two novels I found interesting on Project Gutenberg are “The Secret Garden” and “The Scarlet Letter”, which I look up by title. The first step is to obtain the data. Several entries share each title, but entries with the same title also have the same author, so picking any one of them should be fine.
# get the novel ID from gutenberg data
gutenberg_metadata %>%
filter(title %in% c("The Secret Garden", "The Scarlet Letter")) %>%
dplyr::select(gutenberg_id, title, author)
## # A tibble: 6 x 3
## gutenberg_id title author
## <int> <chr> <chr>
## 1 33 The Scarlet Letter Hawthorne, Nathaniel
## 2 113 The Secret Garden Burnett, Frances Hodgson
## 3 8812 The Secret Garden Burnett, Frances Hodgson
## 4 17396 The Secret Garden Burnett, Frances Hodgson
## 5 21585 The Secret Garden Burnett, Frances Hodgson
## 6 25344 The Scarlet Letter Hawthorne, Nathaniel
# based on the IDs, download the novels
novels <- gutenberg_download(c(25344, 113))
Tidy: one token per row
Here we split each line of text into word tokens and remove stop words.
scarlet_letter <- novels %>%
filter(gutenberg_id==25344) %>%
mutate(linenumber = row_number()) %>% # have novel with line number
unnest_tokens(word, text) %>% # tokenization
anti_join(stop_words) # remove stop words
## Joining, by = "word"
secret_garden <- novels %>%
filter(gutenberg_id == 113) %>%
mutate(linenumber = row_number())%>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
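To make the tokenization step concrete, here is a minimal sketch on a single made-up line of text. The sentence and the names `toy` and `toy_tokens` are hypothetical and only for illustration; the real input is the `novels` data frame downloaded above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical one-line "novel"; the real data comes from gutenberg_download()
toy <- tibble(text = "Mary looked at the garden")

toy_tokens <- toy %>%
  mutate(linenumber = row_number()) %>%  # remember which line each word came from
  unnest_tokens(word, text) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word")     # drop common words such as "at" and "the"

toy_tokens$word
```

Passing `by = "word"` explicitly also suppresses the `Joining, by = "word"` message seen in the output above.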
# remove special characters and numbers
# (str_replace_all() is vectorized, so it can be applied to the whole column)
scarlet_letter <- scarlet_letter %>%
mutate(word = str_replace_all(word, "[-|_|*]|[0-9]|[:space:]", ""))
# for The Secret Garden, also drop tokens made up only of Roman-numeral letters (e.g. chapter numbers)
secret_garden <- secret_garden %>%
mutate(word = str_replace_all(word, "[-|_|*]|[0-9]|[:space:]|^[mdclxvi]+$", ""))
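As a quick check of what the cleanup pattern does, the sketch below applies the pattern used for The Secret Garden to a few hypothetical tokens. The tokens are invented to exercise each branch of the pattern; they are not taken from the novels.

```r
library(stringr)

# The pattern used for The Secret Garden above
pattern <- "[-|_|*]|[0-9]|[:space:]|^[mdclxvi]+$"

# Hypothetical tokens chosen to exercise each branch of the pattern
tokens <- c("_garden_", "1850", "xiv", "mild", "mary")

cleaned <- str_replace_all(tokens, pattern, "")

# Side-by-side view of each token and its cleaned form
data.frame(token = tokens, cleaned = cleaned)
```

Note that the last branch removes any token made up only of the letters m, d, c, l, x, v and i. That catches Roman numerals such as "xiv", but it also removes ordinary words such as "mild", which may slightly lower some word counts.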
# remove empty words
scarlet_letter <- scarlet_letter %>% filter(word != "")
secret_garden <- secret_garden %>% filter(word != "")
Sentiment analysis
The word count of both novels
As the word counts show, character names and words for roles or professions (such as “minister”) lead the lists.
# scarlet letter
scarlet_letter %>% count(word, sort = TRUE)
## # A tibble: 8,108 x 2
## word n
## <chr> <int>
## 1 hester 368
## 2 thou 242
## 3 pearl 225
## 4 child 197
## 5 life 162
## 6 minister 159
## 7 prynne 152
## 8 letter 142
## 9 heart 136
## 10 mother 133
## # … with 8,098 more rows
# secret garden
secret_garden %>% count(word, sort = TRUE)
## # A tibble: 4,467 x 2
## word n
## <chr> <int>
## 1 mary 674
## 2 colin 304
## 3 dickon 281
## 4 garden 232
## 5 tha 228
## 6 looked 224
## 7 martha 182
## 8 it’s 142
## 9 answered 136
## 10 eyes 135
## # … with 4,457 more rows
Joy words in both novels
Surprisingly, both novels have joy words such as “mother” and “child” in their top 10.
# get the joy words from the NRC lexicon
good_words <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
scarlet_letter %>%
inner_join(good_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 331 x 2
## word n
## <chr> <int>
## 1 child 197
## 2 mother 133
## 3 smile 53
## 4 reverend 46
## 5 found 38
## 6 love 38
## 7 true 38
## 8 sunshine 36
## 9 art 35
## 10 friend 33
## # … with 321 more rows
secret_garden %>%
inner_join(good_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 190 x 2
## word n
## <chr> <int>
## 1 garden 232
## 2 mother 103
## 3 found 92
## 4 tree 65
## 5 child 60
## 6 green 51
## 7 grow 45
## 8 alive 40
## 9 white 39
## 10 sun 38
## # … with 180 more rows
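One reason words such as “mother”, “child”, or “garden” rank so high is that a lexicon join matches words one at a time, without looking at context. The sketch below shows the mechanics with a tiny hand-made lexicon. Both `toy_words` and `toy_joy` are hypothetical stand-ins; the real joy list comes from `get_sentiments("nrc")`, which has to be downloaded through the textdata package.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens, standing in for a tidied novel
toy_words <- tibble(word = c("mother", "garden", "door", "mother", "key"))

# Hypothetical mini-lexicon, standing in for the NRC "joy" words
toy_joy <- tibble(word = c("mother", "garden", "smile"), sentiment = "joy")

toy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%  # keep only words that appear in the lexicon
  count(word, sort = TRUE)              # most frequent joy words first

toy_counts
```

Words that are not in the lexicon ("door", "key") simply drop out, and every occurrence of a listed word counts as joy, whatever the surrounding sentence says.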
Positive and negative words in these two novels
Using the Bing lexicon to count positive and negative words in blocks of 80 lines, the plot shows that both novels have more negative words than positive ones. Compared with Gutenberg ID 113 (The Secret Garden), ID 25344 (The Scarlet Letter) shows three peaks of positive words, while in The Secret Garden the positive words appear mostly in roughly the final third of the book.
# data with two novels
both_novel <- rbind(scarlet_letter, secret_garden)
# use bing sentiment lexicon
both_novel <- both_novel %>%
inner_join(get_sentiments("bing")) %>%
count(gutenberg_id, index = linenumber %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
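The `linenumber %/% 80` step is what turns individual lines into plot bars, so a small worked example may help. The sketch below uses a hypothetical data frame `toy` with invented line numbers and sentiment labels; only the pipeline itself mirrors the code above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical sentiment-tagged words with the line they came from
toy <- tibble(
  gutenberg_id = 1L,
  linenumber   = c(1, 50, 79, 80, 81, 160),
  sentiment    = c("positive", "negative", "negative",
                   "positive", "positive", "negative")
)

toy_net <- toy %>%
  # integer division groups lines 1-79 into index 0, 80-159 into index 1, and so on
  count(gutenberg_id, index = linenumber %/% 80, sentiment) %>%
  # one column per sentiment; blocks with no words of a kind get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net sentiment per block: positive minus negative
  mutate(sentiment = positive - negative)

toy_net
```

Because `row_number()` starts at 1, the first block holds 79 lines rather than 80. A negative `sentiment` value means that block contains more negative than positive lexicon words.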
# plot the result
ggplot(both_novel, aes(x = index, y = sentiment, fill = gutenberg_id)) +
geom_col(show.legend = FALSE)+
facet_wrap(~gutenberg_id, scales = "free_x")
Would you like these novels based on word analysis?
Based on the positive word rate, both novels seem to tell sad stories, so I would not expect to like either of them.
# sum of positive and negative words in each novel
both_stat <- both_novel %>%
group_by(gutenberg_id) %>%
summarize(sum_pos = sum(positive),
sum_neg = sum(negative)) %>%
mutate(pos_rate = round(sum_pos/(sum_pos+sum_neg), 3),
novel = ifelse(gutenberg_id == 113, "The Secret Garden", "The Scarlet Letter"))
DT::datatable(both_stat)
What if you have to pick one of them to read, which one would you pick?
Based on the statistics calculated in the previous chunk, I would like The Secret Garden more than The Scarlet Letter. The plot shows that the positive rates of the two novels are very close, but if I must pick one to read, I would pick The Secret Garden.
ggplot(both_stat, aes(x = novel, y = pos_rate, fill = gutenberg_id)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(x = "novel names",
y = "positive rate",
title = "comparison between two novels")
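For readers who want to see how the positive rate is read, here is a minimal sketch with invented totals. The counts and the names `toy_stat`, "Novel A" and "Novel B" are hypothetical and are not the real values for the two novels.

```r
library(dplyr)
library(tibble)

# Hypothetical totals, NOT the real counts from the two novels
toy_stat <- tibble(
  novel   = c("Novel A", "Novel B"),
  sum_pos = c(400, 450),
  sum_neg = c(600, 550)
) %>%
  # share of positive words among all words matched by the lexicon
  mutate(pos_rate = round(sum_pos / (sum_pos + sum_neg), 3))

toy_stat
```

A rate below 0.5 means the negative lexicon words outnumber the positive ones. Only words that appear in the Bing lexicon enter the calculation, so neutral words do not affect the rate.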