a_10: Sentiment analysis

library(gutenbergr)
library(dplyr)
library(tidyverse)
library(tidytext)
library(textdata)
library(stringr)

The Data

The two novels I picked from Project Gutenberg are “The Secret Garden” and “The Scarlet Letter”, which I look up by title. The first step is to obtain the data. Several Gutenberg entries share each title, but entries with the same title all have the same author, so picking any one of them should be fine.

# look up the novel IDs in the Gutenberg metadata
gutenberg_metadata %>%
  filter(title %in% c("The Secret Garden", "The Scarlet Letter")) %>%
  dplyr::select(gutenberg_id, title, author)

## # A tibble: 6 x 3
##   gutenberg_id title              author                  
##          <int> <chr>              <chr>                   
## 1           33 The Scarlet Letter Hawthorne, Nathaniel    
## 2          113 The Secret Garden  Burnett, Frances Hodgson
## 3         8812 The Secret Garden  Burnett, Frances Hodgson
## 4        17396 The Secret Garden  Burnett, Frances Hodgson
## 5        21585 The Secret Garden  Burnett, Frances Hodgson
## 6        25344 The Scarlet Letter Hawthorne, Nathaniel

# download the novels by ID
novels <- gutenberg_download(c(25344,113))

Tidy: one token per row

Here we split each line of text into single-word tokens and remove the stop words.

scarlet_letter <- novels %>%
  filter(gutenberg_id==25344) %>%
  mutate(linenumber = row_number()) %>%  # have novel with line number
  unnest_tokens(word, text) %>%  # tokenization
  anti_join(stop_words)  # remove stop words

## Joining, by = "word"

secret_garden <- novels %>%
  filter(gutenberg_id == 113) %>%
  mutate(linenumber = row_number())%>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

## Joining, by = "word"
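To see what the two chunks above do on a small scale, here is a minimal sketch on a made-up text. The tibble `toy` and its sentences are assumptions for illustration only and are not taken from the novels.

```r
library(dplyr)
library(tibble)
library(tidytext)

# made-up text, for illustration only
toy <- tibble(text = c("Mary looked at the garden",
                       "The child found the mother"))

toy_tokens <- toy %>%
  mutate(linenumber = row_number()) %>%  # remember which line each word came from
  unnest_tokens(word, text) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word")     # drop every word listed in stop_words

toy_tokens
```

Passing `by = "word"` names the join column explicitly, which also suppresses the "Joining, by" message.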

# remove special characters and numbers
# (str_replace_all is vectorized, so no map() is needed)
scarlet_letter$word <- str_replace_all(scarlet_letter$word,
  "[-|_|*]|[0-9]|[:space:]", "")

# same for secret_garden, plus tokens made only of roman-numeral letters
secret_garden$word <- str_replace_all(secret_garden$word,
  "[-|_|*]|[0-9]|[:space:]|^[mdclxvi]+$", "")

# remove empty words
scarlet_letter <- scarlet_letter %>% filter(word != "")
secret_garden <- secret_garden %>% filter(word != "")
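The cleaning pattern is easier to follow on a handful of tokens. The sketch below applies the pattern used for secret_garden to a made-up vector; `toy_words` is an assumption for illustration only.

```r
library(stringr)

# the pattern applied to secret_garden above
pattern <- "[-|_|*]|[0-9]|[:space:]|^[mdclxvi]+$"

# made-up tokens, for illustration only
toy_words <- c("_garden_", "1850", "xvi", "pearl", "mild")

# underscores and digits are stripped; a token made only of the
# letters m, d, c, l, x, v, i is removed as a whole
cleaned <- str_replace_all(toy_words, pattern, "")
cleaned

# the empty strings are what the filter(word != "") step drops
cleaned[cleaned != ""]
```

One caveat of the roman-numeral branch: an ordinary word such as "mild" is built only from those letters, so it is removed along with "xvi". Whether that matters depends on how often such words occur in the novel.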

Sentiment analysis

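Word counts for both novels

As the word counts show, character names lead both lists (Hester, Pearl and Prynne in one; Mary, Colin, Dickon and Martha in the other). In The Scarlet Letter they are followed by role words such as “child”, “minister” and “mother”.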

# scarlet letter
scarlet_letter %>% count(word, sort = TRUE)

## # A tibble: 8,108 x 2
##    word         n
##    <chr>    <int>
##  1 hester     368
##  2 thou       242
##  3 pearl      225
##  4 child      197
##  5 life       162
##  6 minister   159
##  7 prynne     152
##  8 letter     142
##  9 heart      136
## 10 mother     133
## # … with 8,098 more rows

# secret garden
secret_garden %>% count(word, sort = TRUE)

## # A tibble: 4,467 x 2
##    word         n
##    <chr>    <int>
##  1 mary       674
##  2 colin      304
##  3 dickon     281
##  4 garden     232
##  5 tha        228
##  6 looked     224
##  7 martha     182
##  8 it’s       142
##  9 answered   136
## 10 eyes       135
## # … with 4,457 more rows

Joy words in both novels

Surprisingly, “mother” and “child” appear among the ten most frequent joy words in both novels.

# keep only the NRC words tagged as joy
good_words <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

scarlet_letter %>% 
  inner_join(good_words) %>%
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 331 x 2
##    word         n
##    <chr>    <int>
##  1 child      197
##  2 mother     133
##  3 smile       53
##  4 reverend    46
##  5 found       38
##  6 love        38
##  7 true        38
##  8 sunshine    36
##  9 art         35
## 10 friend      33
## # … with 321 more rows

secret_garden %>%
  inner_join(good_words) %>%
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 190 x 2
##    word       n
##    <chr>  <int>
##  1 garden   232
##  2 mother   103
##  3 found     92
##  4 tree      65
##  5 child     60
##  6 green     51
##  7 grow      45
##  8 alive     40
##  9 white     39
## 10 sun       38
## # … with 180 more rows
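The lexicon lookup above is an inner join followed by a count. Here is a minimal sketch of that idea with a made-up two-word lexicon; `toy_tokens` and `toy_lexicon` are assumptions for illustration only and do not reproduce actual NRC entries.

```r
library(dplyr)
library(tibble)

# made-up tokens and a made-up lexicon, for illustration only
toy_tokens  <- tibble(word = c("garden", "mary", "garden", "sun", "door"))
toy_lexicon <- tibble(word = c("garden", "sun"), sentiment = "joy")

toy_joy <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  count(word, sort = TRUE)                  # most frequent first

toy_joy
```

The join matches on spelling alone, so a word is counted as a joy word wherever it appears, regardless of its context in the sentence.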

Positive and negative words in the two novels

Using the Bing lexicon to count positive and negative words, we can see from the plot that both novels have more negative words than positive ones. Each bar covers 80 lines of text, and its height is the number of positive words minus the number of negative words. Compared with Gutenberg ID 113 (The Secret Garden), ID 25344 (The Scarlet Letter) shows three peaks of positive words, while in The Secret Garden the positive words appear mostly in roughly the last third of the book.

# data with two novels
both_novel <- rbind(scarlet_letter, secret_garden)

# use bing sentiment lexicon
both_novel <- both_novel %>%
  inner_join(get_sentiments("bing")) %>%
  count(gutenberg_id, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining, by = "word"

# plot the result
ggplot(both_novel, aes(x = index, y = sentiment, fill = gutenberg_id)) +
  geom_col(show.legend = FALSE)+
  facet_wrap(~gutenberg_id, scales = "free_x")
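The binning step can be traced by hand on a tiny example. The sketch below uses bins of 2 lines instead of 80 so that the result fits on screen; the tibble `toy` and the column name `net` are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# made-up sentiment hits, one row per matched word, for illustration only
toy <- tibble(
  linenumber = c(1, 1, 1, 2, 3, 4),
  sentiment  = c("positive", "positive", "negative",
                 "negative", "negative", "positive")
)

binned <- toy %>%
  # integer division groups lines into bins: line 1 -> 0, lines 2-3 -> 1, line 4 -> 2
  count(index = linenumber %/% 2, sentiment) %>%
  # one column per sentiment; bins missing a sentiment get 0 instead of NA
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

binned
```

Without `values_fill = 0`, a bin that has no positive (or no negative) words would hold `NA`, and the subtraction would return `NA` for that bin.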

Would you like these novels based on the word analysis?

Based on the positive word rate, both novels seem to tell sad stories, so I would probably not like either of them.

# total positive and negative words, and the positive rate, for each novel
both_stat <- both_novel %>%
  group_by(gutenberg_id) %>%
  summarize(sum_pos = sum(positive),
            sum_neg = sum(negative)) %>%
  mutate(pos_rate = round(sum_pos/(sum_pos+sum_neg), 3),
         novel = ifelse(gutenberg_id == 113, "The Secret Garden", "The Scarlet Letter"))

DT::datatable(both_stat)

If you had to pick one of them to read, which one would you pick?

Based on the statistics calculated in the previous chunk, I would prefer The Secret Garden to The Scarlet Letter. The plot shows that the positive rates of the two novels are very close, but if I must pick one to read, I would choose The Secret Garden.

ggplot(both_stat, aes(x = novel, y = pos_rate, fill = gutenberg_id)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(x = "novel names",
       y = "positive rate",
       title = "comparison between two novels")