This assignment looks at sentiment analysis using the examples given by Julia Silge and David Robinson in Chapter 2 of their book Text Mining with R. The main examples center on Jane Austen’s novels.
I will also complement this assignment with a book downloaded from Project Gutenberg, a library of over 60,000 free eBooks that focuses on older works for which U.S. copyright has expired. The eBooks were digitized and proofread by volunteers.
To start with, the usual installation and library loading.
options(repos=structure(c(CRAN="http://cloud.r-project.org/")))
install.packages("textdata")
##
## The downloaded binary packages are in
## /var/folders/9x/w6h9t9dn5fv3j_2_c_57wh5r0000gn/T//RtmpOEUE5U/downloaded_packages
install.packages("gutenbergr")
##
## The downloaded binary packages are in
## /var/folders/9x/w6h9t9dn5fv3j_2_c_57wh5r0000gn/T//RtmpOEUE5U/downloaded_packages
library("gutenbergr")
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(tidytext)
library(textdata)
library(janeaustenr)
Here I reproduce the examples given in the book.
First, the authors take the text of Jane Austen’s novels and convert it to the tidy format (one word per row) using unnest_tokens().
# convert to the tidy format with unnest_tokens()
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
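To see what the chapter counter and unnest_tokens() do, here is a minimal sketch on a toy input. The five lines of text and the names toy and toy_tidy are made up for illustration; only the regular expression and the function calls come from the code above.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical input lines, not taken from the novels
toy <- tibble(text = c("A Toy Novel",
                       "CHAPTER I",
                       "It was a fine day.",
                       "Chapter 2",
                       "The chapter ended well."))

toy_tidy <- toy %>%
  mutate(linenumber = row_number(),
         # TRUE only where a line starts with "chapter " followed by a digit
         # or a Roman-numeral letter; cumsum() turns that into a running count
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  unnest_tokens(word, text)  # one lowercased word per row, punctuation removed

toy_tidy
```

The last line contains the word "chapter" but does not start with it, so it stays in chapter 2 rather than opening a third one.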
The authors then use the NRC lexicon. Examining this lexicon shows ten distinct sentiment groups:
# number of words in each NRC sentiment group
(nrc_groups <- get_sentiments("nrc") %>%
  group_by(sentiment) %>%
  summarize(no_words = n()) %>%
  arrange(desc(no_words)))
## # A tibble: 10 x 2
## sentiment no_words
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
## 3 fear 1476
## 4 anger 1247
## 5 trust 1231
## 6 sadness 1191
## 7 disgust 1058
## 8 anticipation 839
## 9 joy 689
## 10 surprise 534
The analysis focuses on joy words. The authors first filter the NRC lexicon for joy, then use filter() again to keep only the words from Emma in the tidy data frame, and finally use inner_join() to keep the words that appear in both, counting the most common joy words in the novel.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
(emma_tidy <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE))
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # … with 293 more rows
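The join is what does the sentiment analysis here: inner_join() keeps only the rows whose word appears in the lexicon. A minimal sketch, assuming a made-up three-word lexicon (toy_lexicon) and a made-up token list (toy_words) in place of the real data:

```r
library(dplyr)

# Hypothetical three-word lexicon standing in for the NRC joy list
toy_lexicon <- tibble(word = c("good", "happy", "love"),
                      sentiment = "joy")

# Hypothetical tokens, one word per row as produced by unnest_tokens()
toy_words <- tibble(word = c("good", "the", "happy", "good",
                             "house", "love", "good"))

toy_counts <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%  # drops "the" and "house"
  count(word, sort = TRUE)                  # most frequent joy words first

toy_counts
```

Passing by = "word" explicitly is optional, but it suppresses the "Joining, by" message seen in the output above.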
Next, they count up how many positive and negative words there are in defined sections of each book. They define an index to keep track of where they are in the narrative; this index (using integer division) counts up sections of 80 lines of text.
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
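The indexing step can be sketched on a handful of made-up rows. The data frame toy_hits and its values are hypothetical; the pipeline is the same count / spread / mutate sequence used above.

```r
library(dplyr)
library(tidyr)

# Integer division maps each line number to an 80-line section
c(1, 79, 80, 159, 160) %/% 80
# gives 0 0 1 1 2

# Hypothetical words that already matched the lexicon, with their line numbers
toy_hits <- tibble(
  linenumber = c(3, 10, 75, 85, 90, 150, 170),
  sentiment  = c("positive", "negative", "positive", "negative",
                 "negative", "positive", "negative"))

toy_sections <- toy_hits %>%
  count(index = linenumber %/% 80, sentiment) %>%  # words per section and sentiment
  spread(sentiment, n, fill = 0) %>%               # one column per sentiment
  mutate(sentiment = positive - negative)          # net score per section

toy_sections
```

The fill = 0 argument matters for the last section, which has no positive words at all; without it the net score there would be NA.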
Now they plot these sentiment scores across the plot trajectory of each novel.
(ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x"))
I’ll perform a similar analysis on Dracula by Bram Stoker, downloaded from Project Gutenberg with the gutenbergr R package.
The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:
y<-gutenberg_metadata
head(y)
## # A tibble: 6 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 0 <NA> <NA> NA en <NA> Publi…
## 2 1 "The… Jeffe… 1638 en United States L… Publi…
## 3 2 "The… Unite… 1 en American Revolu… Publi…
## 4 3 "Joh… Kenne… 1666 en <NA> Publi…
## 5 4 "Lin… Linco… 3 en US Civil War Publi…
## 6 5 "The… Unite… 1 en American Revolu… Publi…
## # … with 1 more variable: has_text <lgl>
The grouping below lists every author with at least one book written in English (language == "en") that has text available (has_text == TRUE), together with the number of such books per author:
# group by authors
(tidy_books_authors <- y %>%
  filter(language == "en", has_text == TRUE) %>%
  group_by(author) %>%
  summarize(n = n()) %>%
  arrange(desc(n)))
## # A tibble: 13,080 x 2
## author n
## <chr> <int>
## 1 Various 2855
## 2 <NA> 2835
## 3 Anonymous 583
## 4 Lytton, Edward Bulwer Lytton, Baron 215
## 5 Shakespeare, William 176
## 6 Ebers, Georg 164
## 7 Twain, Mark 147
## 8 Kingston, William Henry Giles 132
## 9 Parker, Gilbert 132
## 10 Fenn, George Manville 128
## # … with 13,070 more rows
Here we filter by title to list all available editions of Dracula, then choose one of them to download and analyse.
# Dracula books
(books_dracula <- y %>%
  filter(language == "en", has_text == TRUE, title == "Dracula"))
## # A tibble: 3 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 345 Drac… Stoke… 190 en Gothic Fiction/… Publi…
## 2 19797 Drac… Stoke… 190 en Horror/Movie Bo… Publi…
## 3 45839 Drac… Stoke… 190 en <NA> Publi…
## # … with 1 more variable: has_text <lgl>
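As an alternative to filtering gutenberg_metadata by hand, gutenbergr provides gutenberg_works(), which by default keeps English-language works that have text. This is only a sketch: as far as I know the defaults also filter on rights and drop duplicate title/author pairs, so the helper may return fewer rows than the manual filter above, and the result depends on the installed package version. The name dracula_works is my own.

```r
library(gutenbergr)
library(dplyr)

# Filtered view of the metadata; by default restricted to English works with text
dracula_works <- gutenberg_works(title == "Dracula")

# Show the columns needed to pick an ID for gutenberg_download()
dracula_works %>%
  select(gutenberg_id, title, author)
```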
# download dracula book
dracula_book<-gutenberg_download(345)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
From here on, I replicate the analysis performed above and later compare Dracula with Jane Austen’s sentiments:
# add a book column and drop the gutenberg_id column so the data frame matches tidy_books
drac_book <- dracula_book %>%
  mutate(book = "Dracula",
         linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  filter(linenumber > 161) %>%  # drop the first 161 lines (presumably front matter)
  select(-gutenberg_id) %>%
  unnest_tokens(word, text)
# bing sentiments
dracula_sentiment <- drac_book %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
In the plot below, it seems negative sentiments are more prominent than positive ones.
(ggplot(dracula_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE))
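One way to back up the visual impression with numbers is to summarise the per-section scores. The sketch below uses made-up values in a hypothetical toy_sentiment table with the same columns as dracula_sentiment; the same summary could be applied to the real table, but I have not run it on the real data.

```r
library(dplyr)

# Hypothetical per-section counts, same columns as dracula_sentiment
toy_sentiment <- tibble(book = "Toy book",
                        index = 0:4,
                        negative = c(12, 20, 9, 15, 30),
                        positive = c(10, 8, 14, 15, 11)) %>%
  mutate(sentiment = positive - negative)

toy_summary <- toy_sentiment %>%
  group_by(book) %>%
  summarize(sections = n(),
            share_negative = mean(sentiment < 0),  # fraction of net-negative sections
            mean_sentiment = mean(sentiment))      # average net score per section

toy_summary
```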
The two tables are merged so that Dracula’s sentiment can be compared with that of Jane Austen’s novels. Not unexpectedly, Dracula shows markedly more negative sentiment than Austen’s novels.
sent<- full_join(jane_austen_sentiment, dracula_sentiment)
## Joining, by = c("book", "index", "negative", "positive", "sentiment")
## Warning: Column `book` joining factor and character vector, coercing into
## character vector
(ggplot(sent, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x"))
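The coercion warning arises because book is a factor in the Austen table and a character vector in the Dracula table. A minimal sketch of one way to avoid it, using made-up stand-in tables (toy_austen, toy_dracula): convert the factor explicitly and stack the rows with bind_rows(). Because the two tables share no rows, stacking gives the same rows as a full join on all columns.

```r
library(dplyr)

# Hypothetical stand-ins: book is a factor in one table, character in the other
toy_austen <- tibble(book = factor(c("Emma", "Emma")),
                     index = 0:1,
                     negative = c(5, 7),
                     positive = c(9, 6),
                     sentiment = positive - negative)

toy_dracula <- tibble(book = "Dracula",
                      index = 0:1,
                      negative = c(12, 10),
                      positive = c(4, 8),
                      sentiment = positive - negative)

toy_all <- bind_rows(
  toy_austen %>% mutate(book = as.character(book)),  # explicit, so no coercion warning
  toy_dracula)

toy_all
```

With character labels the facets are ordered alphabetically, which differs from the factor order of the Austen-only plot; convert back to a factor with the desired levels if the panel order matters.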
I was able to replicate the primary code provided in Chapter 2 of Text Mining with R, and I also compared the sentiment found in Dracula with that of Jane Austen’s novels.
As expected, Dracula contains considerably more negative sentiment than Austen’s novels.