Instructions

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

Work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research). As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work in a small team on this assignment.

Primary Example Code - Jane Austen Corpus
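The base code below is adapted from Chapter 2, “Sentiment Analysis with Tidy Data,” of Text Mining with R: A Tidy Approach by Julia Silge and David Robinson (https://www.tidytextmining.com/sentiment.html).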

Loading the three sentiment lexicons used in the example

library(textdata)
## Warning: package 'textdata' was built under R version 3.6.3
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.3
get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
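One note that isn’t in the base code: the afinn and nrc lexicons are distributed through the textdata package and require a one-time download and license acceptance. If knitting fails at this step, running the downloads once in the console beforehand should fix it:

textdata::lexicon_afinn()  # triggers the one-time download prompt if not already cached
textdata::lexicon_nrc()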

Loading the example texts - six novels by the 19th-century author Jane Austen. The books are then converted to a tidy format: the text is grouped by book, a mutate() call adds columns tracking the original line number and a running chapter count, and unnest_tokens() splits each line of text into one word per row.

library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 3.6.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.3
library(stringr)

austen_books()
## # A tibble: 73,422 x 2
##    text                    book               
##  * <chr>                   <fct>              
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility
##  2 ""                      Sense & Sensibility
##  3 "by Jane Austen"        Sense & Sensibility
##  4 ""                      Sense & Sensibility
##  5 "(1811)"                Sense & Sensibility
##  6 ""                      Sense & Sensibility
##  7 ""                      Sense & Sensibility
##  8 ""                      Sense & Sensibility
##  9 ""                      Sense & Sensibility
## 10 "CHAPTER 1"             Sense & Sensibility
## # ... with 73,412 more rows
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
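As a quick sanity check (my addition, not part of the base example), counting tokens per book confirms that all six novels were unnested:

tidy_books %>%
  count(book, sort = TRUE)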

Here is the corresponding list of joy words from the NRC lexicon found in Austen’s novel Emma.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

Using the bing lexicon, the six books are plotted according to the net sentiment of each 80-line section of text. The sentiment-bearing words are identified using an inner join, then the net sentiment is calculated by subtracting the count of negative words from the count of positive words. Last, the net sentiment per section of the six novels is graphed using ggplot and the facet_wrap function.

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Generally, it appears that Austen novels have more positive than negative sentiment. Some especially negative patterns are evident halfway through Pride and Prejudice as well as at the end of Mansfield Park.
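To pinpoint those dips (a small addition to the base example), the 80-line sections can be sorted by net sentiment:

jane_austen_sentiment %>%
  arrange(sentiment) %>%
  head()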

Next, using the afinn lexicon, the example looks at the sentiment of Pride and Prejudice. Because AFINN assigns each word an integer score from -5 (most negative) to 5 (most positive) rather than a positive/negative label, net sentiment per section is computed by summing the scores. Again, using a similar technique, the lexicon is inner joined to the particular book.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"

The sentiments can then be plotted for comparison of each lexicon.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

It appears that NRC estimates greater positive sentiment for this particular book, while only Bing predicts a net negative area halfway through the book.
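One explanation, noted in the source chapter, is that the lexicons differ in their mix of positive and negative entries - the Bing lexicon has a higher ratio of negative to positive words than NRC. The composition of each can be checked directly:

get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)

get_sentiments("bing") %>%
  count(sentiment)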

Finally, using the Bing lexicon, the example obtains word counts across Austen’s works. The resulting data frame indicates the positive or negative sentiment in addition to the frequency.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows

The results can then be compared graphically. One outlier is the word ‘miss’, which is part of the negative sentiment lexicon because it can indicate the opposite of a ‘hit’, or unfulfilled expectations. In Austen’s novels, it’s more commonly used as the title of an unmarried woman - Miss Bingley, for example. While being an unmarried woman is generally seen as a negative in Jane Austen’s novels, for the purposes of this example it’s an anomalous result.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
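A side note on the code above: in dplyr 1.0 and later, top_n() is superseded by slice_max(), so the selection step could equivalently be written as:

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup()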

This bit of code allows us to edit the stop words, adding ‘miss’ to the list of words excluded from the word counts that follow.

custom_stop_words <- bind_rows(tibble(word = c("miss"), 
                                          lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Below is a word cloud with the word ‘miss’ omitted.

library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): several words (happiness,
## happy, spirits, suppose, hope, heard, hear, subject, people, character,
## minutes, left, letter, comfort) could not be fit on page and were not plotted.
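A caveat the example doesn’t mention: wordcloud() places words with a randomized layout, and anything that doesn’t fit is dropped with a warning, as seen above. Fixing the seed makes the cloud reproducible, and a smaller scale lets more words fit; a sketch:

set.seed(1234)  # any fixed seed gives a reproducible layout
tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, scale = c(3, 0.5)))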

Extension - Victor Hugo Sentiment Analysis

For the sentiment analysis extension, I used Project Gutenberg, a library of ebooks that are in the public domain. I’m going to look at two books by another 19th-century author, Victor Hugo. They are available in several formats here: http://www.gutenberg.org/ebooks/135, http://www.gutenberg.org/ebooks/2610.
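As an aside, the gutenbergr package can download these texts directly by their Gutenberg IDs; a minimal sketch (assuming gutenbergr is installed and a mirror is reachable), though I work from pre-downloaded .txt files below:

library(gutenbergr)

# 135 = Les Miserables, 2610 = The Hunchback of Notre Dame
hugo_raw <- gutenberg_download(c(135, 2610), meta_fields = "title")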

For purposes of using the same general methods as the example, I will start out by importing the .txt files and preparing them in the same format as austen_books().

les_mis <- read.delim("https://raw.githubusercontent.com/hillt5/DATA607-Assignment-4-5-20/master/Hugo-Les-Mis.txt", stringsAsFactors = FALSE)  # note: apostrophes in the text trip the quote parser; quote = "" would avoid the warning below
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
les_mis$book <- "Les Miserables"  # label every row with the title
names(les_mis)[1] <- "text"  # the ebook's long header line became the column name; rename it

hunchback <- read.delim("https://raw.githubusercontent.com/hillt5/DATA607-Assignment-4-5-20/master/Hugo-Hunchback.txt", stringsAsFactors = FALSE)
hunchback$book <- "The Hunchback of Notre Dame"
names(hunchback)[1] <- "text"  # the first column name here is a stray byte-order mark ("ï..")
hugo_books <- rbind(les_mis, hunchback)
tidy_hugo <- hugo_books %>%
  group_by(book) %>%  # group by book so line numbers restart per title, matching the Austen example
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_hugo$book <- as.factor(tidy_hugo$book)
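A quick token count per book (again my addition) confirms both novels imported and tokenized as expected:

tidy_hugo %>%
  count(book)

With the corpus tidied, the same joy word analysis can be run on Les Miserables: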
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_hugo %>%
  filter(book == "Les Miserables") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 493 x 2
##    word       n
##    <chr>  <int>
##  1 good     744
##  2 child    429
##  3 young    412
##  4 saint    394
##  5 love     364
##  6 god      332
##  7 mother   319
##  8 found    276
##  9 white    241
## 10 garden   231
## # ... with 483 more rows
library(tidyr)

hugo_sentiment <- tidy_hugo %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
library(ggplot2)

ggplot(hugo_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

There appears to be a very negative stretch of Les Miserables in the last act. Let’s look at this further in the next step:

hugo_word_counts <- tidy_hugo %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
hugo_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

As with the Austen corpus, everyday words like ‘miss’ may be skewing the sentiment counts. We can reuse the same custom stop words from earlier and generate a word cloud.

library(wordcloud)

tidy_hugo %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"

This word cloud also reveals a stray ‘â’ character that is very prevalent - an artifact of the French text’s accented characters being mis-decoded on import. I’ll add it to our custom list.

custom_stop_words <- bind_rows(tibble(word = c("â"), 
                                          lexicon = c("custom")), 
                               custom_stop_words)
tidy_hugo %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 150): many words (jean, marius,
## chapter, quasimodo, love, mother, bishop, child, madame, priest, gringoire,
## madeleine, fauchelevent, and others) could not be fit on page and were not
## plotted.
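A likely cleaner fix (an assumption about the source files, not verified here) would be to declare the file encoding at import time rather than filtering the garbled token after the fact:

les_mis <- read.delim("https://raw.githubusercontent.com/hillt5/DATA607-Assignment-4-5-20/master/Hugo-Les-Mis.txt",
                      stringsAsFactors = FALSE, fileEncoding = "UTF-8")

With the stop word list updated, the bing net sentiment can be recalculated and re-plotted: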

hugo_sentiment_update <- tidy_hugo %>%
  inner_join(get_sentiments("bing")) %>%
  anti_join(custom_stop_words) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
ggplot(hugo_sentiment_update, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

It appears that Les Miserables has become even more negative after filtering out the stop words. As a final step, let’s compare sentiment for Les Miserables across the three lexicons used above, plus SentiWordNet from the lexicon package. One caveat: SentiWordNet assigns continuous polarity scores between -1 and 1 rather than integer scores like AFINN’s, so the magnitudes are not directly comparable across methods.

library(lexicon)
## Warning: package 'lexicon' was built under R version 3.6.3
tidy_les_mis <- tidy_hugo %>% 
  filter(book == "Les Miserables")


afinn_hugo <- tidy_les_mis %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc_hugo <- bind_rows(tidy_les_mis %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          tidy_les_mis %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
sentiword <- hash_sentiment_sentiword  # a data.table of words (x) and polarity scores (y)

names(sentiword)[names(sentiword) == "x"] <- "word"
names(sentiword)[names(sentiword) == "y"] <- "score"

sentiword_hugo <- tidy_les_mis %>%
  inner_join(sentiword, by = 'word') %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(score)) %>%
  mutate(method = "SentiWordNet")
bind_rows(afinn_hugo, 
          bing_and_nrc_hugo, sentiword_hugo) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")