Introduction

This Week 10 assignment is based on Chapter 2 of Text Mining with R, which covers sentiment analysis.
We start by getting the primary example code from Chapter 2 working in an R Markdown document.
The assignment also asks for a citation to this base code and for the code to be extended in two ways:

Later in the demonstration we work with a different corpus of our choosing, and we incorporate at least one additional sentiment lexicon.

Section 1 - Using Lexicons with the Tidytext Package in R [Re-create base analysis]

Sources: AFINN from Finn Årup Nielsen; bing from Bing Liu and collaborators; nrc from Saif Mohammad and Peter Turney.

The Tidytext package draws upon three main lexicons for sentiment analysis: “Bing,” “AFINN,” and “NRC.”

library(tidytext)

# The AFINN lexicon scores words from -5 to 5 (negative scores indicate negative sentiment, positive scores indicate positive sentiment).
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
# The NRC lexicon labels words as positive or negative and with eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
# The Bing lexicon uses a binary model that sorts words into positive or negative categories.
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
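
As a minimal sketch of how these lexicon shapes are used, the example below scores a handful of tokens against two tiny hand-made lexicons. The names toy_afinn, toy_bing, and tokens, and the scores in them, are made up for illustration and are not taken from the real lexicons. A numeric lexicon is summed, while a categorical lexicon is counted per label.

```r
library(dplyr)

# Toy stand-ins for the real lexicons (hypothetical values, illustration only)
toy_afinn <- tibble(word = c("happy", "awful", "good"),
                    value = c(3, -3, 3))
toy_bing  <- tibble(word = c("happy", "awful", "good"),
                    sentiment = c("positive", "negative", "positive"))

# One word per row, as unnest_tokens() would produce
tokens <- tibble(word = c("a", "happy", "but", "awful", "day"))

# Numeric lexicon: sum the scores of the matched words
afinn_score <- tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  summarise(score = sum(value)) %>%
  pull(score)

# Categorical lexicon: count the matched words per label
bing_counts <- tokens %>%
  inner_join(toy_bing, by = "word") %>%
  count(sentiment)

afinn_score
bing_counts
```

Words that are missing from a lexicon ("a", "but", "day" here) simply drop out of the inner join, so they contribute nothing to either result.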

Section 2 - janeaustenr - An R Package for Jane Austen’s Complete Novels

This package provides access to the full texts of Jane Austen’s 6 completed, published novels. The UTF-8 plain text for each novel was sourced from Project Gutenberg, processed a bit, and is ready for text analysis. Each text is in a character vector with elements of about 70 characters.

Below we use dplyr to organize and tidy the data for analysis, then count the most common joy words in Emma.

Emma, by Jane Austen, is a novel about youthful hubris and romantic misunderstandings. It is set in the fictional country village of Highbury and the surrounding estates of Hartfield, Randalls and Donwell Abbey, and involves the relationships among people from a small number of families.

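library(janeaustenr)
library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)

# Tokenize every novel into one word per row, keeping line number and chapter
tidy_books123 <- austen_books() %>%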
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books123 %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE) %>%
  head()
## Joining, by = "word"
## # A tibble: 6 x 2
##   word       n
##   <chr>  <int>
## 1 good     359
## 2 young    192
## 3 friend   166
## 4 hope     143
## 5 happy    125
## 6 love     117
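
The chapter column in tidy_books123 comes from a running count of chapter headings. A small base R sketch of that idea follows, with grepl standing in for str_detect and a few made-up lines standing in for the book text. Lines before the first heading get chapter 0.

```r
# Toy lines standing in for the `text` column of one book (hypothetical data)
text <- c("EMMA", "", "CHAPTER I",
          "Emma Woodhouse, handsome, clever, and rich,",
          "Chapter 2",
          "Mr. Weston was a native of Highbury,")

# TRUE on heading lines; perl = TRUE so that \d inside [...] means "digit"
is_heading <- grepl("^chapter [\\divxlc]", text,
                    ignore.case = TRUE, perl = TRUE)

# A running total of headings labels each line with its chapter number
chapter <- cumsum(is_heading)
linenumber <- seq_along(text)

data.frame(linenumber, chapter, text)
```

The pattern accepts either a digit or a Roman numeral character after the word "chapter", which is why both heading styles are counted.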

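Next we build jane_austen_sentiment1 with the Bing lexicon: matched words are counted as positive or negative within 80-line sections of each book, and net sentiment is positive minus negative.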

jane_austen_sentiment1 <- tidy_books123 %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
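
The indexing and net-sentiment steps above can be sketched on a few made-up rows. Integer division by 80 assigns each line to a section, and the score is positive minus negative within each section. This sketch uses pivot_wider, which tidyr documents as the replacement for the superseded spread; the matched tibble and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words: a line number plus a Bing-style label
matched <- tibble(
  linenumber = c(3, 40, 79, 80, 95, 161),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

net <- matched %>%
  # lines 0-79 fall in section 0, lines 80-159 in section 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # one column per label; sections with no words for a label get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

net
```

A larger divisor smooths the series at the cost of detail, while a very small one leaves too few matched words per section for a stable estimate.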

A ggplot of jane_austen_sentiment1 showing the highs and lows of net sentiment across each novel

ggplot(jane_austen_sentiment1, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Section 3 - Pride & Prejudice

We use dplyr's filter to select Pride & Prejudice, and head to show the first few rows with their line number, chapter, and word.

Pride and Prejudice is a novel of manners by Jane Austen, first published in 1813. The story follows the main character, Elizabeth Bennet, as she deals with issues of manners, upbringing, morality, education, and marriage in the society of the landed gentry of the British Regency.

pride_prejudice1 <- tidy_books123 %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice1 %>%
  head()
## # A tibble: 6 x 4
##   book              linenumber chapter word     
##   <fct>                  <int>   <int> <chr>    
## 1 Pride & Prejudice          1       0 pride    
## 2 Pride & Prejudice          1       0 and      
## 3 Pride & Prejudice          1       0 prejudice
## 4 Pride & Prejudice          3       0 by       
## 5 Pride & Prejudice          3       0 jane     
## 6 Pride & Prejudice          3       0 austen

Next we use inner_join, mutate, and bind_rows to compute net sentiment (positive minus negative) per 80-line section of Pride & Prejudice under each lexicon. The final count shows how many words the NRC lexicon itself labels negative (3,324) and positive (2,312); these are lexicon sizes, not counts of words in the novel.

afinn1 <- pride_prejudice1 %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` ungrouping output (override with `.groups` argument)
bing_and_nrc1 <- bind_rows(pride_prejudice1 %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice1 %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
get_sentiments("nrc") %>% 
     filter(sentiment %in% c("positive", 
                             "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
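
The count above describes the lexicon itself. To count sentiment words that actually occur in a text, the tokens have to be joined to the lexicon first. A minimal sketch with made-up data (toy_nrc and tokens are hypothetical names, not the real lexicon or novel):

```r
library(dplyr)

# Hypothetical four-word lexicon with NRC-style labels
toy_nrc <- tibble(
  word      = c("abandon", "good", "happy", "gloom"),
  sentiment = c("negative", "positive", "positive", "negative")
)

# Hypothetical tokens from a text, one word per row
tokens <- tibble(word = c("good", "good", "happy", "gloom", "the"))

# Size of the lexicon per label (what the count above reports)
lexicon_counts <- toy_nrc %>%
  count(sentiment)

# Occurrences in the text per label (repeated words count every time)
text_counts <- tokens %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment)

lexicon_counts
text_counts
```

The two tables can differ in both directions: a lexicon may hold more negative entries than positive ones while the text still uses positive words more often.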

Finally we bind the rows of afinn1 and bing_and_nrc1 and use ggplot with facet_wrap to plot net sentiment per section for each lexicon

bind_rows(afinn1, 
          bing_and_nrc1) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

Section 4 - New Corpus - Philosopher’s Stone [Harry Potter library]

Here we extend the analysis to a new corpus and a new lexicon. The new corpus for sentiment analysis is Philosopher’s Stone, and an additional sentiment lexicon is introduced further below.

Below we inspect philosophers_stone, a character vector with one element per chapter (17 in total), and view it as a tibble

str(philosophers_stone)
##  chr [1:17] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfect"| __truncated__ ...
tibble(philosophers_stone)
## # A tibble: 17 x 1
##    philosophers_stone                                                           
##    <chr>                                                                        
##  1 "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were~
##  2 "THE VANISHING GLASS  Nearly ten years had passed since the Dursleys had wok~
##  3 "THE LETTERS FROM NO ONE  The escape of the Brazilian boa constrictor earned~
##  4 "THE KEEPER OF THE KEYS  BOOM. They knocked again. Dudley jerked awake. \"Wh~
##  5 "DIAGON ALLEY  Harry woke early the next morning. Although he could tell it ~
##  6 "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS  Harry's last month with ~
##  7 "THE SORTING HAT  The door swung open at once. A tall, black-haired witch in~
##  8 "THE POTIONS MASTER  There, look.\"  \"Where?\"  \"Next to the tall kid with~
##  9 "THE MIDNIGHT DUEL  Harry had never believed he would meet a boy he hated mo~
## 10 "HALLOWEEN  Malfoy couldn't believe his eyes when he saw that Harry and Ron ~
## 11 "QUIDDITCH  As they entered November, the weather turned very cold. The moun~
## 12 "THE MIRROR OF ERISED  Christmas was coming. One morning in mid-December, Ho~
## 13 "NICOLAS FLAMEL  Dumbledore had convinced Harry not to go looking for the Mi~
## 14 "NORBERT THE NORWEGIAN RIDGEBACK  Quirrell, however, must have been braver t~
## 15 "THE FORIBIDDEN FOREST  Things couldn't have been worse.  Filch took them do~
## 16 "THROUGH THE TRAPDOOR  In years to come, Harry would never quite remember ho~
## 17 "THE MAN WITH TWO FACES  It was Quirrell.  \"You!\" gasped Harry.  Quirrell ~

Next we tokenize Philosopher’s Stone into one word per row, keeping the chapter number so that words stay in the order they appear in the novel


titles1 <- c("philosophers_stone")
books1 <- list(philosophers_stone)
series1 <- tibble()

for(i in seq_along(titles1)) {
  
  temp1 <- tibble(chapter = seq_along(books1[[i]]),
                  text = books1[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles1[i]) %>%
    select(book, everything())
  
  series1 <- rbind(series1, temp1)
}
# set factor to keep books in order of publication
series1$book <- factor(series1$book, levels = rev(titles1))


series1
## # A tibble: 77,875 x 3
##    book               chapter word   
##    <fct>                <int> <chr>  
##  1 philosophers_stone       1 the    
##  2 philosophers_stone       1 boy    
##  3 philosophers_stone       1 who    
##  4 philosophers_stone       1 lived  
##  5 philosophers_stone       1 mr     
##  6 philosophers_stone       1 and    
##  7 philosophers_stone       1 mrs    
##  8 philosophers_stone       1 dursley
##  9 philosophers_stone       1 of     
## 10 philosophers_stone       1 number 
## # ... with 77,865 more rows

Next we compute the AFINN sentiment score for each 500-word section of the book and label the method

afinn1 <- series1 %>%
        group_by(book) %>% 
        mutate(word_count = 1:n(),
               index = word_count %/% 500 + 1) %>% 
        inner_join(get_sentiments("afinn")) %>%
        group_by(book, index) %>%
        summarise(sentiment = sum(value)) %>%
        mutate(method = "AFINN")
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
afinn1
## # A tibble: 156 x 4
## # Groups:   book [1]
##    book               index sentiment method
##    <fct>              <dbl>     <dbl> <chr> 
##  1 philosophers_stone     1        11 AFINN 
##  2 philosophers_stone     2         4 AFINN 
##  3 philosophers_stone     3         7 AFINN 
##  4 philosophers_stone     4        12 AFINN 
##  5 philosophers_stone     5         3 AFINN 
##  6 philosophers_stone     6        21 AFINN 
##  7 philosophers_stone     7        -7 AFINN 
##  8 philosophers_stone     8         8 AFINN 
##  9 philosophers_stone     9        17 AFINN 
## 10 philosophers_stone    10        13 AFINN 
## # ... with 146 more rows

A word cloud of the most frequent words in Philosopher’s Stone after removing stop words (at most 105 words are drawn)

library(wordcloud)

series1 %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 105))
## Joining, by = "word"

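Section 4 - New Lexicon - Loughran

The Loughran & McDonald (2016) lexicon, as provided through tidytext, labels words as negative, positive, litigious, uncertainty, constraining, or superfluous.

A tibble of the Loughran sentiment categories and their counts. Note that right_join keeps every row of the lexicon, so these counts also include lexicon words that never occur in the text; inner_join would count only words that appear in the book.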

loughran1 <- series1 %>%
  right_join(get_sentiments("loughran")) %>%
  filter(!is.na(sentiment)) %>%
  count(sentiment, sort = TRUE)
## Joining, by = "word"
loughran1
## # A tibble: 6 x 2
##   sentiment        n
##   <chr>        <int>
## 1 negative      3027
## 2 litigious      919
## 3 uncertainty    888
## 4 positive       767
## 5 constraining   205
## 6 superfluous     56
loughran1 <- bind_rows(series1 %>%
                  group_by(book) %>% 
                  mutate(word_count = 1:n(),
                         index = word_count %/% 500 + 1) %>%
                  inner_join(get_sentiments("loughran") %>%
                                     filter(sentiment %in% c("positive", "negative"))) %>%
                  mutate(method = "Loughran")) %>%
        count(book, method, index = index , sentiment) %>%
        ungroup() %>%
        spread(sentiment, n, fill = 0) %>%
        mutate(sentiment = positive - negative) %>%
        select(book, index, method, sentiment)
## Joining, by = "word"

A visual comparison of afinn1 and loughran1. The highs and lows differ between methods: AFINN shows much larger swings in sentiment across the index, while Loughran shows a more compact distribution of sentiment values. Part of this difference is scale, since AFINN sums word scores ranging from -5 to 5, whereas the Loughran series is a count of positive minus negative words.

bind_rows(afinn1,
          loughran1) %>%
        ungroup() %>%
        mutate(book = factor(book, levels = titles1)) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_grid(book ~ method)

Citation - APA Style

Austen, J., & Stafford, F. (2003). Emma (Penguin Classics) (Reissue ed.). Penguin Classics.

Austen, J., & Tanner, T. (2002). Pride and Prejudice (Reprint. ed.). Penguin Books.

Rowling, J. K. (2018). Harry Potter and the Philosopher’s Stone: Slytherin Edition; Black and Green (Anniversary ed.). Educa Books.

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.

Conclusion

In conclusion, sentiment analysis helps us make sense of qualitative data such as novels, tweets, product reviews, and support tickets, and extract insights from it. By detecting positive, neutral, and negative opinions within text, we can understand the general feeling about a novel, brand, product, or service and make data-driven decisions. The lexicons differ in what they capture: NRC, for example, associates English words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).