Sentiment Analysis

This assignment looks at sentiment analysis using the examples given by Julia Silge and David Robinson in Chapter 2 of their book Text Mining with R. The main examples center on Jane Austen’s novels.

I will also complement this assignment with a book downloaded from Project Gutenberg, a library of over 60,000 free eBooks. Its focus is on older works for which U.S. copyright has expired, digitized and proofread by thousands of volunteers.

To start with, the usual installation and library loading.

options(repos=structure(c(CRAN="http://cloud.r-project.org/")))
install.packages("textdata")
install.packages("gutenbergr")
library("gutenbergr")
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(tidytext)
library(textdata)
library(janeaustenr)

Jane Austen sentiment analysis

Here I reproduce the examples given in the book.

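First, the authors take the text of all of Jane Austen’s novels and convert it to the tidy format (one word per row) using unnest_tokens(). Before tokenizing, they record each line’s number and the chapter it belongs to.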

# conversion using tidy function unnest_tokens
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
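
To make the transformation easier to follow, here is a minimal sketch of the same steps on a four-line toy text. The toy lines and the names toy and toy_tidy are made up for illustration and are not part of the book's example.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical toy text, not taken from the novels
toy <- tibble(text = c("CHAPTER I",
                       "It was a fine day.",
                       "CHAPTER II",
                       "She was not happy."))

toy_tidy <- toy %>%
  mutate(linenumber = row_number(),
         # every line that starts a chapter increments the running chapter count
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  # one row per word; words are lower-cased and punctuation is dropped
  unnest_tokens(word, text)

toy_tidy
```

Each word keeps the linenumber and chapter of the line it came from, which is what later makes it possible to place sentiment along the narrative.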

The authors next use the NRC lexicon. Examining this lexicon shows ten distinct sentiment groups:

# number of words in each NRC sentiment group
(nrc_groups <- get_sentiments("nrc") %>%
  group_by(sentiment) %>%
  summarize(no_words=n())  %>%
  arrange(desc(no_words)))
## # A tibble: 10 x 2
##    sentiment    no_words
##    <chr>           <int>
##  1 negative         3324
##  2 positive         2312
##  3 fear             1476
##  4 anger            1247
##  5 trust            1231
##  6 sadness          1191
##  7 disgust          1058
##  8 anticipation      839
##  9 joy               689
## 10 surprise          534

The analysis here filters the lexicon for joy words. Next, they use filter() again, this time on the data frame of book text, to keep only the words from Emma, and then use inner_join() with the joy lexicon to perform the sentiment analysis.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

(emma_tidy<- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE))
## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # … with 293 more rows
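
The join itself can be sketched on toy data. The tokens and the two-word lexicon below are invented for illustration; the real NRC lexicon is much larger.

```r
library(dplyr)

# Hypothetical tokens and a made-up mini lexicon, for illustration only
tokens  <- tibble(word = c("good", "friend", "the", "good", "rain"))
lexicon <- tibble(word = c("good", "friend"), sentiment = "joy")

joy_counts <- tokens %>%
  inner_join(lexicon, by = "word") %>%  # keeps only words present in the lexicon
  count(word, sort = TRUE)              # most frequent matches first

joy_counts
```

Words that are missing from the lexicon ("the", "rain") simply drop out, so the result depends entirely on the lexicon's coverage.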

Next, they count up how many positive and negative words there are in defined sections of each book. They define an index to keep track of where they are in the narrative; this index (using integer division) counts up sections of 80 lines of text.

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
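
As a small sketch of how the index works, integer division by 80 assigns lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on. The toy data frame below is hypothetical and stands in for words that have already been matched against the bing lexicon.

```r
library(dplyr)
library(tidyr)

# Integer division maps line numbers to 80-line sections
c(1, 79, 80, 159, 160) %/% 80
#> [1] 0 0 1 1 2

# Hypothetical matched words with their line numbers and bing-style labels
toy <- tibble(linenumber = c(5, 40, 79, 90, 120, 170),
              sentiment  = c("positive", "negative", "positive",
                             "negative", "negative", "positive"))

toy_sections <- toy %>%
  count(index = linenumber %/% 80, sentiment) %>%  # words per section and label
  spread(sentiment, n, fill = 0) %>%               # one column per label
  mutate(sentiment = positive - negative)          # net sentiment per section

toy_sections
```

The fill = 0 argument matters: a section with no negative (or no positive) words would otherwise produce NA and break the subtraction.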

Now they plot these sentiment scores across the plot trajectory of each novel.

(ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x"))

Analysis of Dracula

I’ll perform a similar analysis on Dracula by Bram Stoker, downloaded from Project Gutenberg with the gutenbergr R package.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

y<-gutenberg_metadata
head(y)
## # A tibble: 6 x 8
##   gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
##          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
## 1            0  <NA> <NA>                 NA en       <NA>             Publi…
## 2            1 "The… Jeffe…             1638 en       United States L… Publi…
## 3            2 "The… Unite…                1 en       American Revolu… Publi…
## 4            3 "Joh… Kenne…             1666 en       <NA>             Publi…
## 5            4 "Lin… Linco…                3 en       US Civil War     Publi…
## 6            5 "The… Unite…                1 en       American Revolu… Publi…
## # … with 1 more variable: has_text <lgl>

The grouping below keeps only works written in English (language == "en") that have text available (has_text == TRUE), and then counts how many such works each author has:

# group by authors

(tidy_books_authors <- y %>%
  filter(language=="en", has_text == TRUE) %>%
  group_by(author) %>%
  summarize(n=n()) %>%
  arrange(desc(n)))
## # A tibble: 13,080 x 2
##    author                                  n
##    <chr>                               <int>
##  1 Various                              2855
##  2 <NA>                                 2835
##  3 Anonymous                             583
##  4 Lytton, Edward Bulwer Lytton, Baron   215
##  5 Shakespeare, William                  176
##  6 Ebers, Georg                          164
##  7 Twain, Mark                           147
##  8 Kingston, William Henry Giles         132
##  9 Parker, Gilbert                       132
## 10 Fenn, George Manville                 128
## # … with 13,070 more rows
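
The same filter-and-count pattern can be sketched more compactly with count(). The toy metadata below is hypothetical and only borrows the column names of gutenberg_metadata.

```r
library(dplyr)

# Hypothetical metadata with the same column names as gutenberg_metadata
toy_meta <- tibble(author   = c("Stoker, Bram", "Stoker, Bram", "Twain, Mark", NA),
                   language = c("en", "en", "en", "fr"),
                   has_text = c(TRUE, TRUE, TRUE, TRUE))

by_author <- toy_meta %>%
  filter(language == "en", has_text) %>%
  # count(sort = TRUE) is shorthand for group_by() + summarize(n = n()) + arrange(desc(n))
  count(author, sort = TRUE)

by_author
```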

Here we filter by title to list all available editions of Dracula, and then choose one of them to download and analyze.

# Dracula books
(books_dracula <- y %>%
  filter(language =="en", has_text == TRUE, title == "Dracula"))
## # A tibble: 3 x 8
##   gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
##          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
## 1          345 Drac… Stoke…              190 en       Gothic Fiction/… Publi…
## 2        19797 Drac… Stoke…              190 en       Horror/Movie Bo… Publi…
## 3        45839 Drac… Stoke…              190 en       <NA>             Publi…
## # … with 1 more variable: has_text <lgl>
# download dracula book
dracula_book<-gutenberg_download(345)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org

From here on, I replicate the analysis performed above and later compare Dracula with Jane Austen’s sentiments:

# add a book column plus line and chapter numbers so the data frame matches tidy_books

drac_book <- dracula_book %>%
  mutate(book = "Dracula", linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  # skip the first 161 lines (apparently front matter before the narrative starts)
  filter(linenumber>161) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# drop the Gutenberg ID column, which tidy_books does not have
drac_book$gutenberg_id<-NULL

# bing sentiments
dracula_sentiment <- drac_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"

In the plot below, it seems negative sentiments are more prominent than positive ones.

(ggplot(dracula_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) )

Merging data frames

The two tables are merged so that Dracula’s sentiment can be compared with that of Jane Austen’s books. Not unexpectedly, the plot shows many more net-negative sections in Dracula than in any of Austen’s novels.

sent<- full_join(jane_austen_sentiment, dracula_sentiment)  
## Joining, by = c("book", "index", "negative", "positive", "sentiment")
## Warning: Column `book` joining factor and character vector, coercing into
## character vector
(ggplot(sent, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x"))
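
The visual comparison could also be backed by a per-book summary. The following is a minimal sketch, assuming a data frame with the same book, negative and positive columns as sent; the helper name summarise_sentiment and the toy numbers are hypothetical and are not results from the novels.

```r
library(dplyr)

# Hypothetical helper: summarise per-section counts by book
summarise_sentiment <- function(sections) {
  sections %>%
    group_by(book) %>%
    summarize(n_sections     = n(),
              mean_net       = mean(positive - negative),
              share_negative = sum(negative) / sum(negative + positive)) %>%
    arrange(mean_net)  # most negative book first
}

# Toy input with the same columns as `sent`; the numbers are made up
toy_sent <- tibble(book     = c("A", "A", "B", "B"),
                   index    = c(0, 1, 0, 1),
                   negative = c(10, 20, 5, 5),
                   positive = c(5, 5, 15, 25))

toy_summary <- summarise_sentiment(toy_sent)
toy_summary
```

Calling summarise_sentiment(sent) on the merged table would give one row per book, which makes a claim such as "more negative" checkable rather than read off the plot.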

Conclusion

I was able to replicate the primary code provided in Chapter 2 of Text Mining with R, and I also compared the sentiments found in Dracula with those of Jane Austen’s books.

As expected, the plots show that Dracula contains considerably more negative sentiment than Austen’s books.