In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways: work with a different corpus, and incorporate at least one additional sentiment lexicon.
Silge, Julia, and David Robinson. “2 Sentiment Analysis with Tidy Data.” Text Mining with R, O’Reilly, 2017, www.tidytextmining.com/sentiment.html.
library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,875 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,865 more rows
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 301 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ... with 291 more rows
The get_sentiments() function from the tidytext package provides access to four lexicons: “bing”, “afinn”, “loughran”, and “nrc”. The textbook example used three of the four (“bing”, “afinn”, “nrc”). I will use the remaining lexicon, “loughran”, in my analysis.
get_sentiments("loughran")
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ... with 4,140 more rows
I intend to conduct a sentiment analysis of the classic horror novel Dracula. We tend to consider scary words to be negative, so I would like to see whether this book uses very “negative” language.
To acquire the text of Dracula, I will use the gutenbergr package, which downloads public domain works from the Project Gutenberg collection. Dracula has Project Gutenberg ID 345, which we can pass to gutenberg_download().
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v readr 2.0.1
## v tibble 3.1.4 v purrr 0.3.4
## v tidyr 1.1.3 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gutenbergr)
# metadata for the works available through Project Gutenberg
books <- gutenberg_metadata
# sort by title to make it easier to find books of interest
books1 <- books[order(books$title), ]
# Book of interest:
# id 345: Dracula
# download book
dracula <- gutenberg_download(345)
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
I am removing the rows that precede the beginning of Chapter 1. I am also assigning a line number to each row and recording which chapter the text is from.
dracula1 <- dracula %>%
  slice(-c(1:79)) %>%
  mutate(line_num = row_number()) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^CHAPTER [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()
glimpse(dracula)
## Rows: 15,486
## Columns: 2
## $ gutenberg_id <int> 345, 345, 345, 345, 345, 345, 345, 345, 345, 345, 345, 34~
## $ text <chr> " DRACULA", "", "", "", ""~
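The chapter column above comes from a cumulative sum over a logical vector: each line that matches the heading pattern adds one to the running count, so every later line inherits the number of the most recent heading. A minimal sketch of the idea, assuming a handful of made-up lines in place of the downloaded text, and using base R's grepl() in place of str_detect() so that it runs without extra packages:

```r
# Made-up lines standing in for the downloaded text (hypothetical data)
toy_lines <- c("CHAPTER I", "first line of text", "second line of text",
               "Chapter II", "third line of text", "a chapter ends here")

# TRUE only where a line starts with "chapter" followed by a digit
# or a roman-numeral letter, ignoring case
is_heading <- grepl("^chapter [0-9ivxlc]", toy_lines, ignore.case = TRUE)

# Running count of headings seen so far gives the chapter number
chapter <- cumsum(is_heading)

print(data.frame(line_num = seq_along(toy_lines), chapter, text = toy_lines))
```

Note that the character class also admits ordinary words beginning with i, v, x, l, or c, so a line such as "Chapter closed" would be counted as a heading. A quick look at the matched lines is a cheap safeguard.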
I am creating a column where each row represents one word.
dracula_tidy <- dracula1 %>%
  unnest_tokens(word, text) %>%
  # strip every underscore (Project Gutenberg's emphasis markers), not only the first
  mutate(word = str_replace_all(word, "_", ""))
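Project Gutenberg plain-text files mark emphasis with underscores around a word, so a token can carry more than one underscore. Whether the first or every underscore is removed then matters. A small base R sketch of the difference, using made-up tokens (sub() and gsub() stand in here for stringr's str_replace() and str_replace_all()):

```r
# Made-up tokens; two of them carry emphasis markers on both sides
tokens <- c("_the_", "count", "_never_")

# Removes only the first underscore in each token
first_only <- sub("_", "", tokens, fixed = TRUE)

# Removes every underscore in each token
every_one <- gsub("_", "", tokens, fixed = TRUE)

print(first_only)
print(every_one)
```

A token that keeps a trailing underscore would match neither a stop-word list nor a sentiment lexicon, so it would slip through both steps unnoticed.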
Removing stop words (words like ‘a’, ‘the’, ‘is’, etc.) from the data.
dracula.data <- dracula_tidy %>%
  anti_join(stop_words, by = "word")
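An anti-join keeps the rows of the left table that have no match in the right table. A rough base R equivalent is sketched below, assuming a tiny hypothetical stop list rather than the stop_words data set that ships with tidytext:

```r
# Made-up tokens and a hypothetical stop list (not tidytext's stop_words)
tokens <- data.frame(word = c("the", "count", "is", "in", "the", "castle"))
stop_list <- data.frame(word = c("the", "is", "in", "a"))

# Keep only the rows whose word does not appear in the stop list
kept <- tokens[!tokens$word %in% stop_list$word, , drop = FALSE]

print(kept)
```

Repeated tokens are filtered row by row, so both occurrences of "the" are dropped and the order of the remaining words is preserved.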
Generate the Loughran lexicon sentiment results
loughran.data <- dracula.data %>%
  # number the remaining words, then bin them by integer division by 80
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>%
  inner_join(get_sentiments("loughran")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  mutate(method = "Loughran") %>%
  count(method, index = index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  # net sentiment per bin: positive word count minus negative word count
  mutate(sentiment = positive - negative) %>%
  select(index, method, sentiment)
## Joining, by = "word"
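The binning step can be easier to follow on a small example. The sketch below is an illustration under stated assumptions, not the pipeline itself: it uses a bin size of 3 instead of 80, made-up tokens, and a hypothetical five-word lexicon (these are not actual Loughran entries), with base R's merge() and table() standing in for inner_join() and count().

```r
# Made-up tokens and a hypothetical lexicon (illustration only)
tokens <- data.frame(
  word = c("fear", "castle", "good", "danger", "death", "night", "happy", "wolf")
)
lexicon <- data.frame(
  word = c("fear", "danger", "death", "good", "happy"),
  sentiment = c("negative", "negative", "negative", "positive", "positive")
)

bin_size <- 3  # the analysis above uses 80

# Number the tokens from 1 and assign each to a bin by integer division
tokens$word_count <- seq_len(nrow(tokens))
tokens$index <- tokens$word_count %/% bin_size

# Keep only tokens that appear in the lexicon (inner join on word)
matched <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per bin, then take the difference
counts <- table(index = matched$index, sentiment = matched$sentiment)
net <- counts[, "positive"] - counts[, "negative"]

print(net)
```

Because the counter starts at 1, the first bin holds one token fewer than the others (79 rather than 80 in the analysis above). Bins with no lexicon matches do not appear in the result at all, which is worth remembering when reading the plot.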
Plot the Loughran Sentiment Analysis
ggplot(loughran.data, aes(x = index, y = sentiment)) +
  # fill (rather than color) sets the interior of each bar
  geom_col(aes(fill = sentiment)) +
  scale_fill_gradient(low = "red", high = "green") +
  ggtitle("Dracula: Sentiment Analysis using Loughran Lexicon") +
  xlab("Index") +
  ylab("Sentiment") +
  theme_minimal()
As expected, Dracula contains a large amount of negative sentiment throughout the novel, with sparse moments of positive sentiment. This creates the horror atmosphere expected of scary books.
Generate the AFINN lexicon sentiment results
afinn.data <- dracula.data %>%
  # same 80-word bins as for the Loughran lexicon
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index) %>%
  # AFINN scores are numeric, so sum them within each bin
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")
## Joining, by = "word"
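AFINN attaches a numeric score to each word, so the per-bin sentiment is a sum of scores rather than a difference of word counts. The sketch below, with made-up bin numbers and scores, contrasts the two ways of aggregating the same matched words; it uses base R's tapply() in place of group_by() and summarise().

```r
# Made-up matched words: the bin each falls in and a hypothetical score
scored <- data.frame(
  index = c(0, 0, 1, 1, 1, 2),
  value = c(-2, 3, -3, -1, 2, 4)
)

# Score-based aggregation: sum the numeric scores within each bin
net_by_score <- tapply(scored$value, scored$index, sum)

# Count-based aggregation: each word contributes +1 or -1 regardless of strength
net_by_count <- tapply(sign(scored$value), scored$index, sum)

print(net_by_score)
print(net_by_count)
```

The two results can differ in size even where they agree in direction, which is one reason absolute values from a score-based lexicon and a category-based lexicon are not directly comparable.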
Plot the AFINN Sentiment Analysis
ggplot(afinn.data, aes(x = index, y = sentiment)) +
  # fill (rather than color) sets the interior of each bar
  geom_col(aes(fill = sentiment)) +
  scale_fill_gradient(low = "red", high = "green") +
  ggtitle("Dracula: Sentiment Analysis using AFINN Lexicon") +
  xlab("Index") +
  ylab("Sentiment") +
  theme_minimal()
There is a large amount of negative sentiment throughout the book, with a moderate number of positive sentiment spikes. This indicates that Dracula uses a substantial number of negative words to convey horror throughout the book.
Though the results for the ‘afinn’ and ‘loughran’ lexicons differ considerably in absolute magnitude, they follow a similar relative sentiment trajectory through the book. As expected, the sentiment-bearing vocabulary of Dracula is mostly scary or negative. The differences between the two sets of results are likely due to the lexicons themselves: they contain different vocabularies (2,477 AFINN entries versus 4,150 Loughran entries in the output above), and AFINN scores are summed whereas the Loughran results count positive and negative words.
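One way to probe the vocabulary explanation would be to measure how much the two lexicons overlap and how much of the text each one covers. The sketch below shows the calculation on made-up word lists (hypothetical, not the actual lexicon contents). With the real data, the word columns returned by get_sentiments("afinn") and get_sentiments("loughran") and the word column of dracula.data could presumably be substituted.

```r
# Hypothetical word lists standing in for two lexicons and the text
lexicon_a <- c("abandon", "fear", "good", "happy", "terrible")
lexicon_b <- c("abandon", "litigation", "good", "loss")
text_words <- c("fear", "good", "loss", "castle", "abandon", "happy")

# Words that both lexicons score
shared <- intersect(lexicon_a, lexicon_b)

# Share of the text's words that each lexicon can score at all
coverage_a <- mean(text_words %in% lexicon_a)
coverage_b <- mean(text_words %in% lexicon_b)

print(shared)
print(c(a = coverage_a, b = coverage_b))
```

A low overlap combined with very different coverage would support the idea that the lexicons are scoring largely different subsets of the novel's words.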