library(tidytext)
library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(gutenbergr)
library(wordcloud)
## Loading required package: RColorBrewer

Introduction

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work on a small team on this assignment.

In this assignment I extend the primary code from Chapter 2 of Text Mining with R by applying the AFINN, bing, NRC, and Loughran sentiment lexicons to Alice's Adventures in Wonderland, downloaded with the gutenbergr package in R.

Using the get_sentiments() function in R, I was able to download each sentiment lexicon along with its sentiment measure. Some lexicons required that I agree to a license before downloading. I downloaded the AFINN lexicon from Finn Årup Nielsen after agreeing to its license: http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html.

get_sentiments("afinn")
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows

I downloaded the bing lexicon from Bing Liu and collaborators: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

get_sentiments("bing")
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows

I downloaded the NRC lexicon from Saif Mohammad and Peter Turney after agreeing to its license: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

get_sentiments("nrc")
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows

The janeaustenr package in R contains Jane Austen's six completed novels. To find the most common joy words in the book “Emma”, the text is first converted to a tidy format: group_by() and mutate() construct columns for the line number and chapter, and unnest_tokens() then splits the text into one word per row.

For more information: https://github.com/juliasilge/janeaustenr

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

head(tidy_books)
## # A tibble: 6 × 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen
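
The chapter column relies on a running count of lines that look like chapter headings. A minimal sketch of that idea on a few made-up lines (the `toy_text` vector is hypothetical and not taken from the Austen text):

```r
library(stringr)

# Hypothetical lines standing in for the text column of a book
toy_text <- c("CHAPTER I", "It was a fine day.", "",
              "Chapter 2", "More text here.",
              "chapter x", "The end.")

# TRUE on lines that start with "chapter" followed by a digit or Roman numeral
is_heading <- str_detect(toy_text,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# cumsum() turns the TRUE/FALSE flags into a running chapter number
chapter <- cumsum(is_heading)

data.frame(text = toy_text, chapter = chapter)
```

Lines that come before the first heading get chapter 0, which is why the title lines in the output above show chapter 0.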

Next, I filtered the NRC lexicon down to its joy words with filter(), used inner_join() to match them against the words in “Emma”, and used count() to find how many times each joy word appears.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ℹ 291 more rows
head(nrc_joy)
## # A tibble: 6 × 2
##   word          sentiment
##   <chr>         <chr>    
## 1 absolution    joy      
## 2 abundance     joy      
## 3 abundant      joy      
## 4 accolade      joy      
## 5 accompaniment joy      
## 6 accomplish    joy
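
As a rough illustration of what the join and count are doing, here is a sketch on toy data. The `toy_words` and `toy_joy` tibbles are made up for this example and stand in for the tokenized book and the joy subset of the lexicon:

```r
library(dplyr)

# Hypothetical tokens, one word per row
toy_words <- tibble(word = c("good", "friend", "the", "good", "sad", "hope"))

# Hypothetical joy lexicon
toy_joy <- tibble(word = c("good", "friend", "hope"), sentiment = "joy")

# inner_join() keeps only tokens that appear in the lexicon;
# count() then tallies each remaining word, most frequent first
joy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

joy_counts
```

Words that are not in the lexicon ("the", "sad") drop out at the join, so they never reach the count.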

I used bing to find the negative and positive words in each book by Austen, counted them in sections of 80 lines, and calculated the net sentiment (positive minus negative) for each section.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(jane_austen_sentiment)
## # A tibble: 6 × 5
##   book                index negative positive sentiment
##   <fct>               <dbl>    <int>    <int>     <int>
## 1 Sense & Sensibility     0       16       32        16
## 2 Sense & Sensibility     1       19       53        34
## 3 Sense & Sensibility     2       12       31        19
## 4 Sense & Sensibility     3       15       31        16
## 5 Sense & Sensibility     4       16       34        18
## 6 Sense & Sensibility     5       16       51        35
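
The index column comes from integer division of the line number by 80, so every 80 lines of text share one index. A small sketch with made-up matches (the `toy_hits` tibble is hypothetical) shows how the sections and the net score are formed:

```r
library(dplyr)
library(tidyr)

# Hypothetical sentiment matches with the line each word came from
toy_hits <- tibble(
  linenumber = c(1, 40, 79, 80, 120, 161),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

toy_net <- toy_hits %>%
  # lines 0-79 fall in index 0, lines 80-159 in index 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # one column per sentiment; sections with no matches get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
```

A smaller divisor gives more, noisier sections; a larger one smooths the trajectory but can hide local swings.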

This visualization plots the net sentiment for each book against the index on the x-axis, which lets us see how sentiment changes over the trajectory of each story.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Comparing the three sentiment dictionaries on “Pride & Prejudice”

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice") # keep only words from the book "Pride & Prejudice"

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
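
The AFINN score above is a sum of numeric values, while the bing and NRC scores are differences of word counts, which is one reason their scales differ. A sketch of the numeric case on toy data (both tibbles are hypothetical, and the scores are illustrative rather than read from AFINN):

```r
library(dplyr)

# Hypothetical tokens with line numbers (one word per row)
toy_tokens <- tibble(
  linenumber = c(5, 10, 85, 90, 95),
  word       = c("happy", "abandon", "love", "abhor", "good")
)

# Toy numeric lexicon; the scores are illustrative, not read from AFINN
toy_scores <- tibble(
  word  = c("happy", "abandon", "love", "abhor", "good"),
  value = c(3, -2, 3, -3, 3)
)

toy_afinn <- toy_tokens %>%
  inner_join(toy_scores, by = "word") %>%
  group_by(index = linenumber %/% 80) %>%
  # a strongly scored word moves the section total more than a mild one
  summarise(sentiment = sum(value)) %>%
  mutate(method = "toy numeric lexicon")

toy_afinn
```

With a categorical lexicon the same section would only register as counts of positive and negative words, regardless of how strong each word is.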

Visual comparison of the three lexicons. The three sentiment dictionaries give different results: AFINN gives the highest positive values and shows more variance, bing has the lowest positive values, and NRC has the fewest negative values.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Comparing NRC to bing, bing contains more negative words (4,781 versus 3,316) and NRC contains more positive words (2,308 versus 2,005).

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

Extended Analysis

I looked up the works by Lewis Carroll that are available through the gutenbergr package, which downloads and processes public domain works from Project Gutenberg.

Read more: https://github.com/ropensci/gutenbergr

gutenberg_works(author == "Carroll, Lewis")
## # A tibble: 15 × 8
##    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
##           <int> <chr>    <chr>                <int> <chr>    <chr>              
##  1           11 "Alice'… Carro…                   7 en       "Children's Litera…
##  2           13 "The Hu… Carro…                   7 en       "Children's Litera…
##  3          620 "Sylvie… Carro…                   7 en       ""                 
##  4         4763 "The Ga… Carro…                   7 en       "Philosophy"       
##  5        19551 "Alice … Carro…                   7 en       ""                 
##  6        28696 "Symbol… Carro…                   7 en       "Philosophy"       
##  7        28885 "Alice'… Carro…                   7 en       "Banned Books from…
##  8        29042 "A Tang… Carro…                   7 en       "Mathematics"      
##  9        29888 "The Hu… Carro…                   7 en       ""                 
## 10        33582 "Rhyme?… Carro…                   7 en       ""                 
## 11        35497 "Three … Carro…                   7 en       ""                 
## 12        35535 "Feedin… Carro…                   7 en       ""                 
## 13        35688 "Alice … Carro…                   7 en       ""                 
## 14        36308 "Songs … Carro…                   7 en       ""                 
## 15        38065 "Eight … Carro…                   7 en       ""                 
## # ℹ 2 more variables: rights <chr>, has_text <lgl>

I downloaded Alice’s Adventures in Wonderland by Lewis Carroll (Gutenberg ID 28885) using the gutenbergr package in R.

Alice_in_wonderland_Adv <- gutenberg_download(28885)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

Converting the text to a tidy format, with one word per row, so it can be analyzed

Alice_in_wonderland_Adv_tidy <- Alice_in_wonderland_Adv %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)

I compared the NRC, bing, and AFINN lexicons in a plot for Alice's Adventures in Wonderland. NRC had the fewest negative values, AFINN had the highest positive values, and bing had the most negative values.

afinn2 <- Alice_in_wonderland_Adv_tidy %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc2 <- bind_rows(
  Alice_in_wonderland_Adv_tidy %>% 
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  Alice_in_wonderland_Adv_tidy %>% 
    inner_join(get_sentiments("nrc") %>% 
                 filter(sentiment %in% c("positive", 
                                         "negative"))
    ) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2229 of `x` matches multiple rows in `y`.
## ℹ Row 5004 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
bind_rows(afinn2, 
          bing_and_nrc2) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

I used Loughran as the additional lexicon and kept only its positive and negative words. For Loughran I had to agree to a license before downloading.

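# keep only the positive and negative categories of the Loughran lexicon
loughran_posneg <- get_sentiments("loughran") %>% 
  filter(sentiment %in% c("positive", "negative"))

# net sentiment per 80-line section; pivot_wider() replaces the superseded spread()
AIWA_loughran <- Alice_in_wonderland_Adv_tidy %>%
  inner_join(loughran_posneg) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)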
## Joining with `by = join_by(word)`

Data visualization for loughran

# par(mfrow = ...) only affects base graphics and has no effect on ggplot2, so it is omitted
ggplot(AIWA_loughran, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)

The Loughran lexicon contains far more negative words than positive words (2,355 versus 354), which helps explain why the Loughran plot for Alice's Adventures in Wonderland has so many negative sentiment values. The bing plot had more positive values for the same book than the Loughran plot.

get_sentiments("loughran") %>% 
     filter(sentiment %in% c("positive", 
                             "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   2355
## 2 positive    354
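
To put the lexicon sizes on a common scale, the counts printed in this document can be turned into the share of negative words in each lexicon. This is only a sketch: the numbers are copied by hand from the outputs above rather than recomputed from the lexicons, and the `lexicon_counts` data frame is introduced just for this example.

```r
# Positive and negative word counts copied from the outputs above
lexicon_counts <- data.frame(
  lexicon  = c("nrc", "bing", "loughran"),
  negative = c(3316, 4781, 2355),
  positive = c(2308, 2005, 354)
)

# Share of negative words among the positive and negative entries
lexicon_counts$negative_share <- round(
  lexicon_counts$negative /
    (lexicon_counts$negative + lexicon_counts$positive),
  3
)

lexicon_counts
```

A share near 0.5 means the lexicon is roughly balanced; a share well above 0.5 means more of its entries can register as negative than as positive.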

Conclusion

Using sentiment lexicons, I was able to analyze which of the most frequently used words in a document are categorized as positive or negative. In my opinion, I would use NRC, because its counts of positive and negative words are the closest to each other, and I feel this would help avoid biasing the results toward one sentiment. For example, Loughran has a far higher count of negative words than positive words, so an analysis using Loughran will tend to show more negative sentiment.

Citation:

Base code: Silge, J. & Robinson, D. (2016). Welcome to Text Mining with R. O’Reilly Media.

Extended analysis:

citation('gutenbergr')
## To cite package 'gutenbergr' in publications use:
## 
##   Johnston M, Robinson D (2023). _gutenbergr: Download and Process
##   Public Domain Works from Project Gutenberg_. R package version 0.2.4,
##   <https://CRAN.R-project.org/package=gutenbergr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {gutenbergr: Download and Process Public Domain Works from Project Gutenberg},
##     author = {Myfanwy Johnston and David Robinson},
##     year = {2023},
##     note = {R package version 0.2.4},
##     url = {https://CRAN.R-project.org/package=gutenbergr},
##   }