Overview

The base code for this assignment is originally from “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson, Chapter 2: https://www.tidytextmining.com/sentiment.html#sentiment

This assignment focuses on sentiment analysis. To quote the original text, “We can use the tools of text mining to approach the emotional content of text programmatically”.

Base Code

library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.3.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows
library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
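
To make the chunking step above concrete, here is a minimal sketch of the same count / pivot / subtract pattern on toy data. The names `toy_words` and `toy_lexicon` and all of their contents are made up for illustration; the tiny lexicon only stands in for `get_sentiments("bing")`, and the chunk size of 2 replaces the 80 used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per word, tagged with the line it came from.
toy_words <- tibble(
  linenumber = c(1, 1, 2, 3, 3, 4, 5, 5),
  word = c("happy", "sad", "good", "bad", "awful", "happy", "good", "sad")
)

# Made-up two-class lexicon standing in for get_sentiments("bing").
toy_lexicon <- tibble(
  word = c("happy", "good", "sad", "bad", "awful"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Integer division maps lines 0-1 to chunk 0, 2-3 to chunk 1, 4-5 to chunk 2.
  count(index = linenumber %/% 2, sentiment) %>%
  # One row per chunk, with a column of counts for each sentiment class.
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment: positive matches minus negative matches in the chunk.
  mutate(sentiment = positive - negative)

toy_sentiment
```

Each row of `toy_sentiment` is one chunk, which is what each bar in the plot above represents.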

New Corpus

As a lover of Tolkien, I’m curious about the sentiment in his most popular work, “The Lord of the Rings”. Let’s add the loughran sentiment lexicon and use that for our analysis. I’ve pulled the text from an existing source I found on GitHub.

library(tibble)
library(RCurl)
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
lotr <- as_tibble(getURI("https://raw.githubusercontent.com/wess/iotr/master/lotr.txt"))
lotrWords <- lotr %>% unnest_tokens(word,value) %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
head(lotrWords)
## # A tibble: 6 × 1
##   word        
##   <chr>       
## 1 special     
## 2 note        
## 3 reprint     
## 4 minor       
## 5 inaccuracies
## 6 noted
loughran_pos <- get_sentiments("loughran") %>% 
  filter(sentiment == "positive")
lotrWords %>% 
  inner_join(loughran_pos) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 121 × 2
##    word          n
##    <chr>     <int>
##  1 strong      167
##  2 strength    161
##  3 dream        84
##  4 beautiful    77
##  5 easy         73
##  6 leading      69
##  7 smooth       47
##  8 pleased      45
##  9 stronger     41
## 10 pleasant     38
## # ℹ 111 more rows
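# The text arrives as a single string, so this is a running word index
# (after stop-word removal) rather than a true line number.
lotrWords$linenumber <- seq_len(nrow(lotrWords))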
lotrSentiment <- lotrWords %>% 
  inner_join(get_sentiments("loughran")) %>%
  count(index = linenumber %/% 500, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("loughran")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 89 of `x` matches multiple rows in `y`.
## ℹ Row 2173 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
ggplot(lotrSentiment, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)
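
Because the whole book is read in as one string, the chunks above are 500 words wide rather than a fixed number of lines as in the Austen example. If real line numbers were wanted, one possible approach is to split the string on newlines before tokenizing. This is only a sketch: `raw_text` is a hypothetical stand-in for the downloaded string, and it assumes the file separates lines with `\n` or `\r\n`.

```r
# Hypothetical stand-in for the single string returned by getURI().
raw_text <- "first line of text\nsecond line\n\nfourth line"

# strsplit() returns a list with one element per input string; take the first.
lines <- strsplit(raw_text, "\r?\n")[[1]]

# One row per line, numbered in order, mirroring the shape of austen_books().
text_lines <- data.frame(
  linenumber = seq_along(lines),
  text = lines,
  stringsAsFactors = FALSE
)

text_lines
```

A data frame shaped like `text_lines` could then be passed to `unnest_tokens(word, text)`, which keeps the `linenumber` column alongside each word.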

I honestly expected more positive sentiment towards the beginning, though I suppose less negative sentiment will have to do. Compared to Jane Austen, Tolkien comes across as much more negative.
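
The comparison with Austen is only rough, since the two plots use different lexicons (bing vs. loughran) and different chunk sizes (80 lines vs. 500 words). One possible way to put chunks on a common scale is to divide net sentiment by the number of matched words. The sketch below uses made-up counts in a hypothetical `chunk_counts` data frame; they are not results from either corpus.

```r
# Made-up per-chunk counts for two hypothetical corpora, "A" and "B".
chunk_counts <- data.frame(
  corpus   = c("A", "A", "B", "B"),
  index    = c(0, 1, 0, 1),
  positive = c(12, 8, 30, 10),
  negative = c(4, 8, 50, 30)
)

# Net sentiment as a share of all matched words, bounded between -1 and 1.
chunk_counts$net_share <- with(
  chunk_counts,
  (positive - negative) / (positive + negative)
)

chunk_counts
```

A chunk with no matched words would give `NaN` here, so such chunks would need to be dropped or handled separately.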