Task

In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

Work with a different corpus of your choosing, and
Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Recreating Base Analysis from Textbook

Silge, Julia, and David Robinson. “2 Sentiment Analysis with Tidy Data.” Text Mining with R, O’Reilly, 2017, www.tidytextmining.com/sentiment.html.

library(tidytext)

get_sentiments("afinn")

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows

get_sentiments("bing")

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows

get_sentiments("nrc")

## # A tibble: 13,875 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,865 more rows

library(janeaustenr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text, 
                                regex("^chapter [\\divxlc]", 
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

## Joining, by = "word"

## # A tibble: 301 x 2
##    word          n
##    <chr>     <int>
##  1 good        359
##  2 friend      166
##  3 hope        143
##  4 happy       125
##  5 love        117
##  6 deal         92
##  7 found        92
##  8 present      89
##  9 kind         82
## 10 happiness    76
## # ... with 291 more rows

Lexicon: Loughran

The get_sentiments function from the tidytext package provides access to four lexicons: “bing”, “afinn”, “loughran”, and “nrc”. The textbook example used three of them (“bing”, “afinn”, and “nrc”). I will use the remaining one, “loughran”, in my analysis.

get_sentiments("loughran")

## # A tibble: 4,150 x 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # ... with 4,140 more rows
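The Loughran lexicon labels words with six categories ("negative", "positive", "uncertainty", "litigious", "constraining", and "superfluous"), so a polarity analysis has to keep only the first two. A minimal sketch of that filtering step follows. The object `toy_loughran` and its word-to-category assignments are hypothetical stand-ins made up for illustration, not rows copied from the real lexicon.

```r
library(dplyr)

# Hypothetical miniature lexicon; the category assignments are illustrative only
toy_loughran <- tibble(
  word = c("abandon", "achieve", "allegedly", "approximately", "benefit", "loss"),
  sentiment = c("negative", "positive", "litigious", "uncertainty", "positive", "negative")
)

# Tabulate how many words fall into each category
category_counts <- toy_loughran %>%
  count(sentiment, sort = TRUE)

# Keep only the two polarity categories used for a net-sentiment score
polarity_only <- toy_loughran %>%
  filter(sentiment %in% c("positive", "negative"))

print(category_counts)
print(polarity_only)
```

The same count() call applied to get_sentiments("loughran") would show how the real lexicon is distributed across its categories.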

Corpus: Dracula by Bram Stoker

I intend to conduct a sentiment analysis of the classic horror novel Dracula. We tend to consider scary words to be negative, so I would like to see whether this book uses predominantly “negative” language.

To acquire the text of Dracula, I will use the gutenbergr package, which downloads public domain works from the Project Gutenberg collection. Dracula has Project Gutenberg ID 345, which we can pass to gutenberg_download().

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v readr   2.0.1
## v tibble  3.1.4     v purrr   0.3.4
## v tidyr   1.1.3     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(gutenbergr)

# metadata table listing the works available from Project Gutenberg
books <- gutenberg_metadata

# sort by title to more easily find books of interest for analysis
books1 <- books %>% arrange(title)

# Book of interest:
# id 345: Dracula

# download book
dracula <- gutenberg_download(345)

## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org
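As an alternative to browsing the sorted metadata table, the ID can be looked up by filtering on the title. This is a minimal sketch that assumes the gutenbergr package is installed and that the bundled gutenberg_metadata table lists the novel under the exact title "Dracula". The object name `dracula_ids` is my own.

```r
library(dplyr)
library(gutenbergr)

# Filter the bundled metadata table for an exact title match
dracula_ids <- gutenberg_metadata %>%
  filter(title == "Dracula") %>%
  select(gutenberg_id, title, author)

print(dracula_ids)
```

More than one row may come back, since Project Gutenberg can hold several editions of the same work, so the author column is worth checking before downloading.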

Tidy Data

I am removing the rows that precede the beginning of ‘Chapter 1’. I am also assigning a line number to each row and recording which chapter each line belongs to.

dracula1 <- dracula %>%
  slice(-c(1:79)) %>%
  mutate(line_num = row_number()) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^CHAPTER [\\divxlc]", ignore_case = TRUE)))) %>% 
  ungroup()

glimpse(dracula)

## Rows: 15,486
## Columns: 2
## $ gutenberg_id <int> 345, 345, 345, 345, 345, 345, 345, 345, 345, 345, 345, 34~
## $ text         <chr> "                                DRACULA", "", "", "", ""~
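The chapter column relies on a running total: str_detect() returns TRUE on each chapter heading, and cumsum() turns those into a counter that increases by one at every heading. A minimal sketch on made-up lines follows (`toy_text` is hypothetical and not taken from the book).

```r
library(dplyr)
library(stringr)

# Made-up lines standing in for the downloaded text
toy_text <- tibble(
  text = c("CHAPTER I", "A first line of prose.", "A second line of prose.",
           "CHAPTER II", "Another line of prose.")
)

toy_numbered <- toy_text %>%
  mutate(
    line_num = row_number(),
    # TRUE on heading lines; the cumulative sum gives the current chapter number
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  )

print(toy_numbered)
```

Any line that comes before the first heading would get chapter 0, which is why the front matter is sliced off first.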

I am tokenizing the text so that each row holds one word.

dracula_tidy <- dracula1 %>% 
  unnest_tokens(word, text) %>%
  mutate(word = str_replace_all(word, "_", ""))
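Project Gutenberg plain text often marks emphasis with an underscore on each side of a word, so a single token can carry more than one underscore. A small check on made-up tokens shows how stringr's str_replace() and str_replace_all() differ on such input.

```r
library(stringr)

# Made-up tokens: one wrapped in underscores, one clean, one with a leading underscore
tokens <- c("_very_", "castle", "_count")

first_only <- str_replace(tokens, "_", "")      # removes only the first match per token
all_matches <- str_replace_all(tokens, "_", "") # removes every match per token

print(first_only)
print(all_matches)
```

Only the second call leaves "_very_" fully cleaned, which matters because a token with a leftover underscore will not match any lexicon entry.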

Removing stop words (such as ‘a’, ‘the’, and ‘is’) from the data.

dracula.data <- dracula_tidy %>%
  anti_join(stop_words, by = "word")
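The anti_join() keeps only the rows of the left table whose word has no match in the stop word table. A minimal sketch with a tiny, made-up stop list follows (`toy_tokens` and `toy_stop_words` are hypothetical; the real stop_words table from tidytext is much larger).

```r
library(dplyr)

# Made-up tokens and a made-up three-word stop list
toy_tokens <- tibble(word = c("the", "count", "is", "a", "vampire"))
toy_stop_words <- tibble(word = c("a", "the", "is"))

# Rows with a match in the stop list are dropped; the rest keep their order
content_words <- toy_tokens %>%
  anti_join(toy_stop_words, by = "word")

print(content_words)
```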

Sentiment Analysis

Loughran Lexicon

Generate the Loughran lexicon sentiment results

loughran.data <- dracula.data %>% 
      mutate(word_count = 1:n(),
      index = word_count %/% 80) %>% 
      inner_join(get_sentiments("loughran")) %>%
      filter(sentiment %in% c("positive", "negative")) %>%
      mutate(method = "Loughran") %>%
      count(method, index = index , sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative) %>%
      select(index, method, sentiment)

## Joining, by = "word"
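To make the binning and net-sentiment arithmetic concrete, here is a minimal sketch on made-up words with a bin size of 3 instead of 80. The word list and `toy_lexicon` are hypothetical, and pivot_wider() is used here in place of spread() as the current tidyr equivalent.

```r
library(dplyr)
library(tidyr)

# Made-up word sequence and a made-up polarity lexicon
toy_words <- tibble(word = c("dark", "fear", "hope", "kind", "dread", "grave", "joy"))
toy_lexicon <- tibble(
  word = c("dark", "fear", "dread", "grave", "hope", "kind", "joy"),
  sentiment = c("negative", "negative", "negative", "negative",
                "positive", "positive", "positive")
)

toy_net <- toy_words %>%
  # Integer division assigns each word to a bin
  mutate(word_count = row_number(),
         index = word_count %/% 3) %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(index, sentiment) %>%
  # One column per polarity; bins with no words of a polarity get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

print(toy_net)
```

Because the counter starts at 1, the first bin (index 0) holds one word fewer than the others. The same holds for the 80-word bins above, where the effect is negligible.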

Plot the Loughran Sentiment Analysis

ggplot(loughran.data, aes(x = index, sentiment)) +
  geom_col(aes(fill = sentiment)) +
  scale_fill_gradient(low = "red", high = "green") +
  ggtitle("Dracula: Sentiment Analysis using Loughran Lexicon") +
  xlab("Index") +
  ylab("Sentiment") +
  theme_minimal()

Conclusion: Loughran Lexicon

As expected, Dracula contains a large amount of negative sentiment throughout the novel, with only sparse moments of positive sentiment. This is consistent with the horror atmosphere expected of a scary book.

AFINN (for comparison)

Generate the AFINN lexicon sentiment results

afinn.data <- dracula.data %>% 
        mutate(word_count = 1:n(),
        index = word_count %/% 80) %>% 
        inner_join(get_sentiments("afinn")) %>%
        group_by(index) %>%
        summarise(sentiment = sum(value)) %>%
        mutate(method = "AFINN")

## Joining, by = "word"
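Unlike a count-based lexicon, AFINN attaches a signed integer score to each word, so the per-bin sentiment is a sum of scores and not a difference of counts. A minimal sketch with made-up words and made-up scores follows (`toy_afinn` is hypothetical and its values are not taken from the real lexicon), again with a bin size of 3.

```r
library(dplyr)

# Made-up word sequence and made-up AFINN-style scores
toy_words <- tibble(word = c("dark", "fear", "hope", "kind", "dread", "grave", "joy"))
toy_afinn <- tibble(
  word = c("fear", "hope", "kind", "dread", "joy"),
  value = c(-2, 2, 2, -3, 3)
)

toy_afinn_scores <- toy_words %>%
  mutate(word_count = row_number(),
         index = word_count %/% 3) %>%
  # Words missing from the lexicon ("dark", "grave") drop out of the join
  inner_join(toy_afinn, by = "word") %>%
  group_by(index) %>%
  summarise(sentiment = sum(value))

print(toy_afinn_scores)
```

A single strongly scored word can dominate a bin here, which is one reason the AFINN values can swing further than count-based scores.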

Plot the AFINN Sentiment Analysis

ggplot(afinn.data, aes(x = index, sentiment)) +
  geom_col(aes(fill = sentiment)) +
  scale_fill_gradient(low = "red", high = "green") +
  ggtitle("Dracula: Sentiment Analysis using AFINN Lexicon") +
  xlab("Index") +
  ylab("Sentiment") +
  theme_minimal()

Conclusion: AFINN Lexicon

There is a large amount of negative sentiment throughout the book, with a moderate number of positive sentiment spikes. This suggests that Dracula uses a substantial number of negative words to convey the horror element throughout the book.

Conclusion:

Although the ‘afinn’ and ‘loughran’ results differ considerably in absolute magnitude, they follow a similar relative sentiment trajectory through the book. As expected, the sentiment-bearing words in Dracula are mostly scary or negative. The differences between the two sets of results are likely due to how the lexicons are built: they differ substantially in vocabulary, and AFINN scores each word on a scale from -5 to 5, whereas the Loughran results here only count positive and negative words.
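One way to put a number on "similar trajectory" would be to join the two per-bin scores on the index and compute their correlation. The sketch below uses made-up scores only (`toy_loughran_scores` and `toy_afinn_scores` are hypothetical and are not results from the book), so it shows the mechanics and not a finding.

```r
library(dplyr)

# Made-up per-bin net sentiment for two lexicons
toy_loughran_scores <- tibble(index = 0:4, sentiment = c(-3, -1, -4, 0, -2))
toy_afinn_scores <- tibble(index = 0:4, sentiment = c(-6, 1, -9, 4, -2))

# Align the two series bin by bin; the suffixes keep the two score columns apart
compared <- inner_join(
  toy_loughran_scores, toy_afinn_scores,
  by = "index", suffix = c("_loughran", "_afinn")
)

# Pearson correlation between the two trajectories
trajectory_cor <- cor(compared$sentiment_loughran, compared$sentiment_afinn)

print(compared)
print(trajectory_cor)
```

Applied to loughran.data and afinn.data, the same join would keep only the bins in which both lexicons matched at least one word.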

DATA 607: Week 10 Assignment

Eric Lehmphul

10/30/2021