Today’s assignment is to work through chapter two of Text Mining with R and extend the code in two ways:

  1. Work with a different corpus of your choosing, and
  2. Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

The code can be found in Chapter Two of the book.

Citation:

  * Title: Sentiment analysis with tidy data
  * Author: Julia Silge and David Robinson
  * Date: 4/6/21
  * Code version: a912425
  * Availability: https://github.com/dgrtwo/tidy-text-mining/blob/master/02-sentiment-analysis.Rmd

Step one

The first thing we need to do is install the textdata package and load the sentiment lexicons.

library(tidytext)
#install.packages('textdata')
library(textdata)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
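
The assignment also calls for at least one additional lexicon. One option that tidytext can fetch through the same textdata package is the Loughran-McDonald lexicon (note that textdata will ask you to confirm the download and license terms the first time). We’ll fold it into the lexicon comparison later on.

# The Loughran-McDonald lexicon tags words with six categories
# (positive, negative, uncertainty, litigious, constraining, superfluous)
get_sentiments("loughran")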

The next step is to pick an author. I like F. Scott Fitzgerald, and since he has a reputation as a cynical writer, I want to see if the sentiment analysis picks up on that. I will download his books from Project Gutenberg (the ones out of copyright, at least).

library(stringr)
#install.packages('gutenbergr')
library(gutenbergr)

## This shows us the works by F Scott
gutenberg_works(str_detect(author, "Fitzgerald, F. Scott"))
## # A tibble: 4 x 8
##   gutenberg_id title  author   gutenberg_autho~ language gutenberg_books~ rights
##          <int> <chr>  <chr>               <int> <chr>    <chr>            <chr> 
## 1          805 This ~ Fitzger~              420 en       <NA>             Publi~
## 2         4368 Flapp~ Fitzger~              420 en       <NA>             Publi~
## 3         6695 Tales~ Fitzger~              420 en       <NA>             Publi~
## 4         9830 The B~ Fitzger~              420 en       <NA>             Publi~
## # ... with 1 more variable: has_text <lgl>
paradise <- gutenberg_download(805)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
paradise$book <- 'This Side of Paradise'

flappers <- gutenberg_download(4368)
flappers$book <- 'Flappers and Philosophers'

tales <- gutenberg_download(6695)
tales$book <- 'Tales of the Jazz Age'

beautiful <- gutenberg_download(9830)
beautiful$book <- 'The Beautiful and Damned'

# The Great Gatsby didn't appear in the gutenberg_works() results above,
# so I looked up its ID on the Project Gutenberg site
gatsby <- gutenberg_download(64317)
gatsby$book <- 'The Great Gatsby'


books <- rbind(
                paradise,
                flappers,
                tales,
                beautiful,
                gatsby
                )
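
As an aside, gutenbergr can attach the metadata for us through its meta_fields argument, which would avoid the five manual book assignments above. A sketch; the resulting title column holds the titles as recorded in the Gutenberg metadata, which may differ slightly from the hand-typed names:

# One call downloads all five novels and adds a title column
books_alt <- gutenberg_download(c(805, 4368, 6695, 9830, 64317),
                                meta_fields = "title")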

Now that we have the books in a single dataframe, let’s add a line number and a chapter number for each line, then split the text into one word per row.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
tidy_books <- books %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Let’s see how this looks and pull all of the angry/bitter words used in The Great Gatsby.

nrc_anger <- get_sentiments("nrc") %>% 
  filter(sentiment == "anger")

tidy_books %>%
  filter(book == "The Great Gatsby") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 225 x 2
##    word         n
##    <chr>    <int>
##  1 hot         25
##  2 money       23
##  3 words       15
##  4 crazy       14
##  5 bad         11
##  6 broken      10
##  7 darkness     9
##  8 feeling      9
##  9 terrible     9
## 10 strained     8
## # ... with 215 more rows

Now we can compute a net sentiment (positive minus negative) for every 50-line chunk of text; the book uses 80-line chunks, but these books run shorter than Jane Austen’s novels.

library(tidyr)

f_scott_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 50, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"

Let’s graph the results.

library(ggplot2)

ggplot(f_scott_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

I’m curious: did F. Scott become more cynical over time?

f_scott_sentiment %>% 
  group_by(book) %>% 
  summarise(total_sentiment = sum(sentiment)) %>% 
  arrange(desc(total_sentiment))
## # A tibble: 5 x 2
##   book                      total_sentiment
##   <chr>                               <int>
## 1 Flappers and Philosophers            -164
## 2 The Great Gatsby                     -282
## 3 Tales of the Jazz Age                -287
## 4 This Side of Paradise                -297
## 5 The Beautiful and Damned             -797

I would have thought yes, but according to this model he did not: the most negative totals belong to This Side of Paradise (1920) and The Beautiful and Damned (1922), while The Great Gatsby (1925) comes out comparatively mild. The picture isn’t entirely clean, though. Flappers and Philosophers, also from 1920, is the least negative of all, and these are raw totals, so longer books have more 50-line chunks in which to accumulate negative sentiment.
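
To read the totals in publication order instead, we can join in the years by hand (a quick sketch; the years are typed in manually, since they aren’t in the data we pulled):

# Publication years, entered manually, to sort the totals chronologically
years <- tibble(
  book = c("This Side of Paradise", "Flappers and Philosophers",
           "The Beautiful and Damned", "Tales of the Jazz Age",
           "The Great Gatsby"),
  year = c(1920, 1920, 1922, 1922, 1925)
)

f_scott_sentiment %>% 
  group_by(book) %>% 
  summarise(total_sentiment = sum(sentiment)) %>% 
  left_join(years, by = "book") %>% 
  arrange(year)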

Now, let’s look at The Great Gatsby on its own.

great_gatsby <- tidy_books %>% 
  filter(book == "The Great Gatsby")

Let’s see how all three lexicons compare when looking at this book.

afinn <- great_gatsby %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(
  great_gatsby %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  great_gatsby %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"

And let’s see how that looks plotted out.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
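
As the extra lexicon for this assignment, we can bolt the Loughran-McDonald scores from earlier onto the same comparison. This sketch follows the same chunk-and-count pattern as above; since Loughran-McDonald was built for financial filings, expect it to match far fewer words in a novel:

loughran <- great_gatsby %>% 
  inner_join(get_sentiments("loughran") %>% 
               filter(sentiment %in% c("positive", "negative"))) %>%
  mutate(method = "Loughran-McDonald") %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

bind_rows(afinn, bing_and_nrc, loughran) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")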

Finally, let’s add a wordcloud to see the most popular words in the book.

library(wordcloud)
## Loading required package: RColorBrewer
great_gatsby %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
great_gatsby %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"