Week 10 Assignment: Introduction

In this week’s assignment we will be working with sentiment analysis.

First, we read Text Mining with R; Chapter 2 covers sentiment analysis. The goal of this assignment is to take the primary example code from Chapter 2 and replicate it in an R Markdown document.

Then we extend the code in two ways:

  1. Work with a different corpus of your choosing, and

  2. Incorporate at least one additional sentiment lexicon (possibly from another R package found through research).

At the end, the .Rmd file will be posted to my GitHub repository and to rpubs.com.

Code Initiation

Here I load the required libraries and verify that all the required packages are installed before running the code blocks that follow.

## [1] "All required packages are installed"
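
The setup chunk that produces this message is not echoed above; a minimal sketch of such a check, assuming the packages used throughout this document, could look like the following.

# Sketch of a setup chunk: install any missing packages, then load them
required_pkgs <- c("tidytext", "textdata", "gutenbergr", "dplyr", "tidyr",
                   "stringr", "ggplot2", "syuzhet", "wordcloud", "reshape2",
                   "tm", "RColorBrewer", "janeaustenr")
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(required_pkgs, library, character.only = TRUE))
print("All required packages are installed")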

Chapter 2 Code Replication

In this section, we replicate the code in Chapter 2 of the above-mentioned book. I chose Mark Twain and used Project Gutenberg to download his works, keeping only the works whose titles appear once in the catalog and, among those, selecting the ones that are stories.

I also considered downloading his most famous books, but I had difficulty retrieving all of them from Project Gutenberg and stopped after some time.

In our analysis, we followed the sentiment analysis approach outlined in the textbook (Silge & Robinson, 2017).
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows
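
Beyond the three lexicons above, one way to satisfy the extension requirement would be the Loughran-McDonald lexicon available through tidytext and its companion textdata package; it is shown here only as an illustration, since the additional lexicon actually used later in this document is the NRC implementation from the syuzhet package.

# Illustration only (not used below): a fourth lexicon available via tidytext + textdata
get_sentiments("loughran")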
# janeaustenr provides the Pride and Prejudice text used in a later textbook example
library(janeaustenr)


# Load Mark Twain books from Project Gutenberg
mark_twain_books <- gutenberg_works(author == "Twain, Mark")

# Make a new column named book: keep only the part of the title before the first em dash, colon, or comma
mark_twain_books <- mark_twain_books %>%
  mutate(book = gsub("[—:,].*", "", title))

# Keep only the trimmed titles that appear exactly once in the catalog
mark_twain_books_sel <- mark_twain_books %>%
  count(book) %>%
  filter(n == 1) %>%
  inner_join(mark_twain_books, by = "book")

# Select 6 Mark Twain books to download; set a seed for reproducibility
# (the random-sample alternative further below is commented out)
set.seed(2014)

# First attempt: Twain's best-known books (this selection is overridden by the list below)
MT_best_book_list <- c("The Innocents Abroad", 
                       "Life on the Mississippi",
                       "A Connecticut Yankee in King Arthur's Court",
                       "The Prince and the Pauper",
                       "Adventures of Huckleberry Finn",
                       "The Adventures of Tom Sawyer")

mark_twain_books_sel_6 <- mark_twain_books %>%
  filter(grepl(paste(MT_best_book_list, collapse = "|"), title)) %>%
  select(book, title, everything())


# Final selection: six single-title works used in the rest of the analysis
MT_one_book_list <- c("A Horse's Tale", 
                      "The Adventures of Huckleberry Finn \\(Tom Sawyer's Comrade\\)",
                      "The Man That Corrupted Hadleyburg",
                      "Tom Sawyer Abroad",
                      "Tom Sawyer, Detective",
                      "The American Claimant")

mark_twain_books_sel_6 <- mark_twain_books %>%
  filter(grepl(paste(MT_one_book_list, collapse = "|"), title)) %>%
  select(book, title, everything())


# Alternative: a random sample of 6 of the single-title books
#mark_twain_books_sel_6 <- sample_n(mark_twain_books_sel, 6)


mark_t_books_downloaded <- list() 
DF_book <- tibble(book = character(0), title = character(0), text = character(0))

# Run for each of the 6 selected books and download them
for (i in seq_along(mark_twain_books_sel_6$gutenberg_id)) {
  #get the ID and title of the book 
  book_id <- mark_twain_books_sel_6$gutenberg_id[i]
  book_title <- mark_twain_books_sel_6$title[i]
  #download from Project Gutenberg 
  mark_t_books_downloaded[[i]] <- gutenberg_download(book_id)
  
  rep_size <- length(mark_t_books_downloaded[[i]]$text)
  # Alternative: combine all text lines into a single string
  #book_text <- paste(mark_t_books_downloaded[[i]]$text, collapse = " ")
  
  # Append this book's lines to the combined tibble
  DF_book <- rbind(DF_book,
                   tibble(book = rep(book_id, rep_size),
                          title = rep(book_title, rep_size),
                          text = mark_t_books_downloaded[[i]]$text))
}
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
tidy_MT_book <- DF_book %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

#Get sentences 
tidy_MT_book_sentence <- DF_book %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(sentence, text, token = "sentences") 
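
The sentence-level tibble is not used again below; as a quick sanity check (a minimal sketch using the columns created above), the number of sentences per book could be counted.

# Sketch: number of sentences detected per book
tidy_MT_book_sentence %>%
  count(title, name = "sentences")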



#tidy_books <- austen_books() %>%
#  group_by(book) %>%
#  mutate(
#    linenumber = row_number(),
#    chapter = cumsum(str_detect(text, 
#                                regex("^chapter [\\divxlc]", 
#                                      ignore_case = TRUE)))) %>%
#  ungroup() %>%
#  unnest_tokens(word, text)


nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

# Choose a random book
random_book <- sample(unique(tidy_MT_book$title),1)

tidy_MT_book %>%
  filter(title == random_book) %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 149 × 2
##    word          n
##    <chr>     <int>
##  1 good         45
##  2 child        28
##  3 mother       23
##  4 beautiful    16
##  5 music        12
##  6 glad          9
##  7 kind          9
##  8 love          8
##  9 pretty        8
## 10 salute        8
## # ℹ 139 more rows
mark_twain_sentiment <- tidy_MT_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
mark_twain_sentiment_2 <- tidy_MT_book %>%
  group_by(title) %>%
  inner_join(get_sentiments("nrc") %>% 
               filter(sentiment %in% c("positive","negative")))%>%
  mutate(method = "NRC")%>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1169 of `x` matches multiple rows in `y`.
## ℹ Row 4848 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
ggplot(mark_twain_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")
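
The many-to-many join warnings above come from a few words that appear more than once across the joined tables; they are harmless here, but they could be handled explicitly, for example as in this minimal sketch.

# Option 1: declare the expected relationship when joining
# inner_join(get_sentiments("bing"), relationship = "many-to-many")

# Option 2: de-duplicate the lexicon before joining
bing_dedup <- get_sentiments("bing") %>%
  distinct(word, .keep_all = TRUE)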

# Use syuzhet's get_nrc_sentiment() to evaluate all the selected books
MT_sentiment_scores <- tidy_MT_book %>%
  group_by(title) %>%
  summarize(words = toString(word)) %>%
  ungroup() %>%
  mutate(sentiment = get_nrc_sentiment(words, language = "english"))

# Flatten the nested data
flattened_data <- MT_sentiment_scores %>%
  unnest(cols = sentiment)


# Proportions of the eight NRC emotions plus negative/positive across the selected books
barplot(
  colSums(prop.table(flattened_data[, 3:12])),
  space = 0.2,
  horiz = FALSE,
  las = 1,
  cex.names = 0.7,
  col = brewer.pal(n = 8, name = "Set3"),
  main = "A Few of Mark Twain's Books",
  sub = "Analysis by KP",
  xlab = "emotions", ylab = NULL)

# First, let's reshape the data into a long format
flattened_data_long <- flattened_data %>%
  pivot_longer(cols = 3:12, names_to = "sentiment", values_to = "score")

# Now, create the plot
ggplot(flattened_data_long, aes(x = title, y = score, fill = title)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ sentiment, scales = "free_y", ncol = 2) +
  labs(x = "Title", y = "Sentiment Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

2.3 Comparing the three sentiment dictionaries

This part copies code from the textbook; it has been replicated and modified as needed.

Huckleberry_Finn <- tidy_MT_book %>% 
  filter(title == "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)")


afinn <- Huckleberry_Finn %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  Huckleberry_Finn %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  Huckleberry_Finn %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 6045 of `x` matches multiple rows in `y`.
## ℹ Row 1441 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")+
  labs(title ="The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)") + 
  xlab("Narrative position (80-line sections)")

#get_sentiments("nrc") %>% 
#  filter(sentiment %in% c("positive", "negative")) %>% 
#  count(sentiment)

#get_sentiments("bing") %>% 
#  count(sentiment)

2.4 Most common positive and negative words

The code is mostly copied from the tidytextmining.com sentiment analysis chapter; changes have been made to analyze the selected Mark Twain works.

bing_word_counts <- tidy_MT_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() 
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

bing_word_counts %>%
  filter(n > 80) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment")

# "miss" is coded as negative in the Bing lexicon but is mostly used as a title
# (e.g., "Miss Watson"), so add it to a custom stop-word list
custom_stop_words <- bind_rows(tibble(word = c("miss"),  
                                      lexicon = c("custom")), 
                               stop_words)
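
custom_stop_words is defined here but not applied above; if desired, it could be used to drop "miss" (together with the standard stop words) before recounting, as in this sketch.

# Sketch: recount Bing sentiment words after removing custom stop words
tidy_MT_book %>%
  anti_join(custom_stop_words, by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE)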

2.5 Wordclouds

The code is mostly copied from the tidytextmining.com sentiment analysis chapter; changes have been made to analyze the selected Mark Twain works.

library(wordcloud)

tidy_MT_book %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`

tidy_MT_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 80518 of `x` matches multiple rows in `y`.
## ℹ Row 5229 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

2.6 Looking at units beyond just words

The code is mostly copied from the tidytextmining.com sentiment analysis chapter; changes have been made to analyze the selected Mark Twain works.

# Textbook example: split Pride and Prejudice (from janeaustenr) into sentences
p_and_p_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")


MT_chapters <- DF_book %>%
  group_by(title) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

MT_chapters %>% 
  group_by(title) %>% 
  summarise(chapters = n())
## # A tibble: 6 × 2
##   title                                                     chapters
##   <chr>                                                        <int>
## 1 A Horse's Tale                                                   1
## 2 The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)       44
## 3 The American Claimant                                           51
## 4 The Man That Corrupted Hadleyburg                                1
## 5 Tom Sawyer Abroad                                               27
## 6 Tom Sawyer, Detective                                           23
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_MT_book %>%
  group_by(title, chapter) %>%
  summarize(words = n())
## `summarise()` has grouped output by 'title'. You can override using the
## `.groups` argument.
tidy_MT_book %>%
  semi_join(bingnegative) %>%
  group_by(title, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("title", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  slice_max(ratio, n = 1) %>% 
  ungroup()
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'title'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 5
##   title                                       chapter negativewords words  ratio
##   <chr>                                         <int>         <int> <int>  <dbl>
## 1 The Adventures of Huckleberry Finn (Tom Sa…      13            76  2057 0.0369
## 2 The American Claimant                            22             3    29 0.103 
## 3 Tom Sawyer Abroad                                12             1     5 0.2   
## 4 Tom Sawyer, Detective                             5             1     7 0.143

Analyzing Emotion in One of Mark Twain’s Books

In this section, we use the emotions in the NRC lexicon to learn more about how emotion changes within one of the books.

This section uses code from the Programming Historian tutorial on sentiment analysis with the syuzhet package.

Huckleberry_Finn <- tidy_MT_book %>% 
  filter(title == "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)")

# Tokenize the book's words with syuzhet
text_words <- get_tokens(Huckleberry_Finn$word)

# Document-level NRC scores for the whole book (one row of totals)
sentiment_scores_sum <- get_nrc_sentiment(toString(text_words), language = "english")

# Per-token NRC scores (one row per word)
sentiment_scores <- get_nrc_sentiment(text_words, language = "english")


barplot(
  colSums(prop.table(sentiment_scores[, 1:8])),
  space = 0.2,
  horiz = FALSE,
  las = 1,
  cex.names = 0.7,
  col = brewer.pal(n = 8, name = "Set3"),
  main = "The Adventures of Huckleberry Finn (Tom Sawyer's Comrade)",
  sub = "Analysis by KP",
  xlab="emotions", ylab = NULL)

sad_words <- text_words[sentiment_scores$sadness> 0]

sad_word_order <- sort(table(unlist(sad_words)), decreasing = TRUE)
head(sad_word_order, n = 12)
## 
##    dark   widow     bad   awful    kill   leave    sick   black    shot   steal 
##      71      47      40      34      33      33      33      32      32      30 
## runaway   broke 
##      28      26
cloud_emotions_data <- c(
  paste(text_words[sentiment_scores$sadness> 0], collapse = " "),
  paste(text_words[sentiment_scores$joy > 0], collapse = " "),
  paste(text_words[sentiment_scores$anger > 0], collapse = " "),
  paste(text_words[sentiment_scores$fear > 0], collapse = " "))

cloud_corpus <- Corpus(VectorSource(cloud_emotions_data))

cloud_tdm <- TermDocumentMatrix(cloud_corpus)
cloud_tdm <- as.matrix(cloud_tdm)
head(cloud_tdm)
##          Docs
## Terms     1 2 3 4
##   _bang   1 0 1 1
##   _bang_  2 0 2 2
##   _beg_   1 0 0 0
##   _case   1 0 0 1
##   _dark   1 0 0 0
##   _leave_ 2 0 0 0
# Label the four documents with the emotions they were built from above
colnames(cloud_tdm) <- c('sadness', 'joy', 'anger', 'fear')
head(cloud_tdm)
##          Docs
## Terms     sadness joy anger fear
##   _bang         1   0     1    1
##   _bang_        2   0     2    2
##   _beg_         1   0     0    0
##   _case         1   0     0    1
##   _dark         1   0     0    0
##   _leave_       2   0     0    0
set.seed(2014) # this can be set to any integer
comparison.cloud(cloud_tdm, random.order = FALSE,
                 colors = c("green", "red", "orange", "blue"),
                 title.size = 1.0, max.words = 60, scale = c(2.5, 0.8), rot.per =0.3)

# Overall valence per token: positive minus negative
sentiment_valence <- (sentiment_scores$negative * -1) + sentiment_scores$positive

# Plot the book's emotional trajectory with syuzhet's simple_plot()
simple_plot(sentiment_valence)

References:

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. Retrieved from https://www.tidytextmining.com/sentiment.html

Programming Historian: Sentiment Analysis with Syuzhet