Loading Libraries

library(tidyverse)
library(tidytext)
library(janeaustenr)
library(stringr)
library(wordcloud)
library(reshape2)
library(gutenbergr)

Part I - Sentiment analysis with tidy data

The following section reproduces code from Chapter 2 of Text Mining with R: A Tidy Approach by Julia Silge and David Robinson [1].

2.1 The sentiments dataset

get_sentiments("afinn")
## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

2.2 Sentiment analysis with inner join

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
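
The scoring step above packs two ideas into a few lines: `linenumber %/% 80` assigns each word to an 80-line block, and `positive - negative` turns the word counts in that block into a single net score. A minimal sketch on toy data may make this easier to follow. The tibble `toy_words` and its values are hypothetical and stand in for `tidy_books` after the join with a lexicon.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one row per matched word, tagged with its source line
toy_words <- tibble(
  linenumber = c(1, 5, 79, 80, 81, 120, 160, 161),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "positive", "positive")
)

toy_sentiment <- toy_words %>%
  # Integer division puts lines 0-79 in block 0, 80-159 in block 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment; a block with no words of a sentiment gets 0
  spread(sentiment, n, fill = 0) %>%
  # Net sentiment: positive words minus negative words in the block
  mutate(sentiment = positive - negative)

toy_sentiment
```

Block 0 contains two positive words and one negative word, so its net score is 1; block 2 has no negative words at all, which is why `fill = 0` is needed before subtracting.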

2.3 Comparing the three sentiment dictionaries

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

pride_prejudice
## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # ... with 122,194 more rows
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(pride_prejudice %>%
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive",
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", 
                          "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

2.4 Most common positive and negative words

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # ... with 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows
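
The custom stop-word list is built above but not applied in the code shown. A minimal sketch of how such a list could be used, assuming the goal is to drop "miss" (a title in Austen, not a negative word) before counting. The tibbles `toy_stop_words`, `toy_custom` and `toy_tokens` are hypothetical stand-ins for `stop_words`, `custom_stop_words` and `tidy_books`.

```r
library(dplyr)

# Hypothetical miniature versions of stop_words and custom_stop_words
toy_stop_words <- tibble(word = c("the", "a"), lexicon = "SMART")
toy_custom <- bind_rows(tibble(word = "miss", lexicon = "custom"),
                        toy_stop_words)

# Hypothetical tokens, one word per row as unnest_tokens() would produce
toy_tokens <- tibble(word = c("miss", "bennet", "the", "happy",
                              "miss", "a", "ball"))

toy_filtered <- toy_tokens %>%
  # Keep only the tokens that do NOT appear in the stop-word list
  anti_join(toy_custom, by = "word")

toy_filtered
```

Passing `by = "word"` explicitly also silences the "Joining, by" message that the joins in this document print.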

2.5 Wordclouds

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

2.6 Looking at units beyond just words

PandP_sentences <- tibble(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

Part II - Jane Austen vs Jules Verne

In this part I want to see whether Jules Verne was a more positive or more negative writer than Jane Austen. To do this I will repeat the analysis from Chapter 2 of Text Mining with R with the authors substituted. I also want to see whether using the Loughran lexicon, instead of one of the lexicons used in Chapter 2, makes a difference.

Get the top 6 works of Jules Verne [2] and a new lexicon

#Jules Verne works
julesverne <- gutenberg_download(c(164,103,18857,3526,1268,2083))

#Verne metadata
verne_metadata <- gutenberg_metadata[
    which(gutenberg_metadata$gutenberg_id %in% c(164,103,18857,3526,1268,2083)),
    c("gutenberg_id","title")]

#Add the title to each Jules Verne work
verne_books <- merge(julesverne,verne_metadata,by="gutenberg_id")

#Rename title to book
verne_books <- rename(verne_books, book = title)

#New lexicon
loughran_sent <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive","negative"))
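
Because `linenumber` is later derived from row position, the order of the text lines matters when titles are attached. The code above uses `merge()`; as an alternative sketch, `left_join()` from dplyr keeps the rows of its first argument in their original order. The tibbles `toy_lines` and `toy_meta` below are hypothetical stand-ins for `julesverne` and `verne_metadata`, and the titles are placeholders.

```r
library(dplyr)

# Hypothetical miniature of a gutenberg_download() result, in reading order
toy_lines <- tibble(
  gutenberg_id = c(164, 164, 103, 103),
  text = c("first line of 164", "second line of 164",
           "first line of 103", "second line of 103")
)

# Hypothetical metadata: one title per id, deliberately in a different order
toy_meta <- tibble(gutenberg_id = c(103, 164),
                   title = c("Book B", "Book A"))

toy_books <- toy_lines %>%
  # left_join() preserves the row order of toy_lines
  left_join(toy_meta, by = "gutenberg_id") %>%
  # Same renaming step as in the document: title becomes book
  rename(book = title)

toy_books
```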

Tidy the Verne library and only select the top six books

#Creating the tidy jules verne data set
tidy_verne <- verne_books[,c("text","book")] %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(
           str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Updating titles for three books
tidy_verne$book <-
  gsub("In Search of the Castaways;.*","In Search of the Castaways",tidy_verne$book)
tidy_verne$book <- gsub("Five Weeks in a Balloon.*","Five Weeks in a Balloon",tidy_verne$book)
tidy_verne$book <- gsub("Twenty Thousand Leagues.*","Twenty Thousand Leagues",tidy_verne$book)
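
The chapter counter used above relies on a running sum of regex matches: every line that looks like a chapter heading adds one, so all following lines inherit that chapter number. A minimal sketch on hypothetical lines (`toy_text` is invented for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical lines of a book, one row per line
toy_text <- tibble(text = c("A TITLE PAGE",
                            "CHAPTER I",
                            "some prose",
                            "more prose",
                            "Chapter 2",
                            "a chapter of accidents",
                            "CHAPTER X"))

toy_chapters <- toy_text %>%
  mutate(linenumber = row_number(),
         # TRUE on heading lines; cumsum() turns the TRUEs into a running count
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE))))

toy_chapters
```

Lines before the first heading get chapter 0, which is why the Austen analysis filters out `chapter != 0`. One caveat of this pattern: because `c` and `l` are in the character class and matching is case-insensitive, a prose line that happens to start with something like "Chapter closed" would also be counted as a heading.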

Comparing lexicons using a Verne novel

twenty_leagues <- tidy_verne %>%
  filter(book == "Twenty Thousand Leagues")

afinn2 <- twenty_leagues %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

bing_and_nrc2 <-
  bind_rows(
    twenty_leagues %>%
      inner_join(get_sentiments("bing")) %>%
      mutate(method = "Bing et al."),
    twenty_leagues %>%
      inner_join(get_sentiments("nrc") %>%
                   filter(sentiment %in% c("positive","negative"))) %>%
      mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

loughran <- twenty_leagues %>%
  inner_join(loughran_sent) %>%
  mutate(method = "Loughran-McDonald") %>%
  count(method,index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive-negative)

Comparing Verne and Austen book sentiments using the new lexicon

Because the Loughran lexicon appears to score text quite negatively, while NRC appears to score it quite positively, combining the two might give a more balanced picture.

#Combine NRC and Loughran
loughran_nrc <- rbind(
  get_sentiments("nrc") %>%
    filter(sentiment %in% c("positive","negative")),
  loughran_sent)

#Remove duplicate rows
loughran_nrc <- loughran_nrc %>%
  distinct()
# Adding the combined NRC-Loughran sentiment to Verne
jules_verne_sentiment <- tidy_verne %>%
  anti_join(stop_words) %>%
  inner_join(loughran_nrc) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
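
One thing worth checking when two lexicons are stacked this way: `distinct()` only removes rows that are identical in both columns, so a word labelled positive in one lexicon and negative in the other stays in the combined table twice. A minimal sketch with hypothetical mini-lexicons (the labels below are illustrative and not taken from NRC or Loughran):

```r
library(dplyr)

# Hypothetical mini-lexicons; words and labels are invented for illustration
toy_nrc <- tibble(word = c("good", "risk", "loss"),
                  sentiment = c("positive", "positive", "negative"))
toy_loughran <- tibble(word = c("risk", "loss"),
                       sentiment = c("negative", "negative"))

toy_combined <- bind_rows(toy_nrc, toy_loughran) %>%
  distinct()   # drops the duplicated loss/negative row only

# Words that still carry more than one label after distinct()
toy_conflicts <- toy_combined %>%
  count(word, name = "labels") %>%
  filter(labels > 1)

toy_conflicts
```

In an `inner_join()`, each occurrence of such a word would match both rows and contribute one positive and one negative count, so it nets out to zero in `positive - negative`. Whether that is acceptable, or whether one lexicon should take precedence, is a design choice the analysis would need to make.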

When comparing the Austen sentiment plots and the Verne sentiment plots by book, the Verne books appear more negative than the Austen books. The sentiment column makes this concrete: for every 80 lines it holds the number of positive words minus the number of negative words, and in this analysis those values range from -47 to 62.

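#Create combined Verne-Austen sentiment data frame
#NOTE: jane_austen_sentiment_2 is not defined in the code shown here; it is
#presumably built from tidy_books with the same steps as jules_verne_sentiment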
verne_austen_sentiment <- rbind(
    jules_verne_sentiment %>%
      mutate(author="Jules Verne"),
    jane_austen_sentiment_2 %>%
      mutate(author="Jane Austen"))

Below we can see that the assigned values for Jane Austen novels are generally positive, while those for Jules Verne novels are more often negative. On average, every 80 lines of a Jane Austen novel has a net sentiment of 17.9, while every 80 lines of a Jules Verne novel has a net sentiment of 4.22.
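
The code that produced these averages is not shown, so the following is only a sketch of how per-author summaries could be computed from a data frame shaped like `verne_austen_sentiment`. The tibble `toy_scores` and its numbers are invented for illustration and are not the results reported above.

```r
library(dplyr)

# Hypothetical net-sentiment scores, one row per 80-line block
toy_scores <- tibble(
  author = c("Jane Austen", "Jane Austen", "Jane Austen",
             "Jules Verne", "Jules Verne", "Jules Verne"),
  sentiment = c(20, 12, 22, 6, -3, 9)
)

toy_summary <- toy_scores %>%
  group_by(author) %>%
  # Average, minimum and maximum net sentiment per 80-line block
  summarise(mean_sentiment = mean(sentiment),
            min_sentiment  = min(sentiment),
            max_sentiment  = max(sentiment))

toy_summary
```

Reporting the minimum and maximum next to the mean shows how an author can have a positive average while still having individual blocks that score negative.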

Conclusion

When using the Loughran-McDonald lexicon combined with NRC to compare Jane Austen novels with Jules Verne novels, we see that Jules Verne was markedly less positive in his stories. While both authors showed a positive average sentiment per 80 lines, Jane Austen’s was about 4 times more positive than Jules Verne’s. Does this have something to do with the genres they wrote in? Jules Verne is known for his science-fiction and adventure novels, while Jane Austen is known for her romantic novels. I don’t believe that the protagonists in Jane Austen’s novels went through fewer negative feelings than the protagonists in Jules Verne’s novels. That would be a grave simplification of each genre and each author. It’s possible that the sentiment of authors in the later 19th century, when Verne wrote, differed from that of authors in the early 19th century, when Austen’s novels were published. Answering why these authors used different sentiments throughout their novels would require more data and research.



  1. Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. 2017. Internet resource.

  2. Selected from the following website: Top 10 Books by Jules Verne