Edgar Allan Poe’s short stories captivated readers in newspapers and beyond, and many of them dwell on gloom and death. This project analyzes the happiness of his work across three classic texts.
I argued that, throughout three of Poe’s stories, the words he used may have skewed the mean sentiment score higher or lower than it appears. So I made three column charts for each story: a positive sentiment chart, a negative sentiment chart, and an n-grams sentiment chart using the words ‘not’ or ‘no’.
The short stories included are The Fall of the House of Usher (1839), The Masque of the Red Death (1842), and The Cask of Amontillado (1846). These stories were processed as follows:
library(devtools)
## Loading required package: usethis
library(tm)
## Loading required package: NLP
library(rmarkdown)
library(readr)
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.2.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(textdata)
library(dplyr)
library(stringr)
library(ggthemes)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
To start this project I collected and downloaded each story from the Gutenberg Project. I chose these stories as some of the dreariest according to https://www.sparknotes.com/blog/edgar-allan-poe-stories-ranked-by-how-creepy-they-are/. I then downloaded each story from the following Gutenberg Project links (a programmatic alternative is sketched after the links):
The Fall of the House of Usher: https://www.gutenberg.org/ebooks/932
The Masque of the Red Death: https://www.gutenberg.org/ebooks/1064
The Cask of Amontillado: https://www.gutenberg.org/ebooks/1063
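As a sketch of an alternative to downloading the files by hand (not what was done in this project), the gutenbergr package can fetch the same texts using the IDs from the URLs above:
library(gutenbergr)
# IDs 932, 1064, and 1063 correspond to the three Gutenberg links above
poe_texts <- gutenberg_download(c(932, 1064, 1063))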
Usher <- read_csv("Usher.txt",
col_names = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 470 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Usher %>%
unnest_tokens(word, X1) -> usher_words
count(usher_words)
## # A tibble: 1 × 1
## n
## <int>
## 1 7198
usher_words %>%
inner_join(get_sentiments('afinn')) %>%
arrange(desc(value)) -> usher_sentiment
## Joining, by = "word"
MasqueRedDeath <- read_csv("MasqueRedDeath.txt",
col_names = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 191 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(MasqueRedDeath)
MasqueRedDeath %>%
unnest_tokens(word, X1) -> RedDeath_words
count(RedDeath_words)
## # A tibble: 1 × 1
## n
## <int>
## 1 2439
RedDeath_words %>%
inner_join(get_sentiments('afinn')) %>%
arrange(desc(value)) -> MasqueRedDeath_sentiment
## Joining, by = "word"
Amontillado <- read_csv("Amontillado.txt",
col_names = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 212 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Amontillado %>%
unnest_tokens(word, X1) -> amontillado_words
count(amontillado_words)
## # A tibble: 1 × 1
## n
## <int>
## 1 2349
amontillado_words %>%
inner_join(get_sentiments('afinn')) %>%
arrange(desc(value)) -> amontillado_sentiment
## Joining, by = "word"
Word counts:
Usher: 7198
Red Death: 2439
Amontillado: 2349
Mean AFINN sentiment scores:
Usher: -0.387012987012987
Red Death: -0.490909090909091
Amontillado: 0.258741258741259
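The mean scores listed above can be reproduced from the sentiment data frames created earlier; a minimal sketch (the exact figures depend on whether stop words were removed before the join):
usher_sentiment %>% summarize(mean_value = mean(value))
MasqueRedDeath_sentiment %>% summarize(mean_value = mean(value))
amontillado_sentiment %>% summarize(mean_value = mean(value))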
As we can see here, Usher and Red Death are negative while Amontillado is positive. Note that Usher has over 7,000 words, compared to roughly 2,400 each for the other two stories.
Next, I wanted to see the frequency of each word while also measuring its AFINN value.
usher_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
filter(value <= 0) %>%
count(word, value, sort = TRUE)%>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("The Fall of the House of Usher Negative Sentiment")
## Joining, by = "word"
## Joining, by = "word"
usher_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
filter(value > 0) %>%
count(word, value,sort = TRUE) -> usher_positive
## Joining, by = "word"
## Joining, by = "word"
usher_positive %>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("The Fall of the House of Usher Positive Sentiment")
The most negative and most frequent words in The Fall of the House of Usher are “terror” and “terrible,” each occurring 6 times and each scoring -3 on the AFINN scale. On the positive graph, the most common words are spirit, intense, and earnest, which all fall between 1 and 2 on the AFINN value scale. This suggests that the story uses strongly negative words more frequently, earning it a slightly negative mean value score (about -0.39, as reported above).
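To make this claim concrete, each word’s total pull on the score can be measured as its count times its AFINN value; a small sketch reusing usher_words from above:
usher_words %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("afinn")) %>%
  count(word, value, sort = TRUE) %>%
  mutate(contribution = n * value) %>%   # total pull of each word on the score
  arrange(desc(abs(contribution))) %>%
  head(10)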
Usher %>%
unnest_tokens(bigram, X1, token="ngrams", n=2) -> Usher_bigrams
Usher_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 5,301 × 2
## bigram n
## <chr> <int>
## 1 of the 164
## 2 in the 41
## 3 upon the 30
## 4 of his 27
## 5 of a 24
## 6 and the 23
## 7 to the 22
## 8 it was 17
## 9 of my 16
## 10 had been 15
## # … with 5,291 more rows
Usher_filtered <- Usher_bigrams %>%
separate(bigram, c("word1", "word2"), sep=" ")
usher_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) -> UsherAFINN
## Joining, by = "word"
## Joining, by = "word"
no_words <- Usher_filtered %>%
filter(word1 == "no") %>%
inner_join(UsherAFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
Usher_filtered %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) -> Usher_filtered2
Usher_bigram_counts <- Usher_filtered2 %>%
count(word1, word2, sort = TRUE )
no_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"no\"")
The n-grams chart suggests that ‘no’ precedes both negative and positive words, thus ‘evening out’ their effect on the sentiment score.
## Masque of the Red Death
RedDeath_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
filter(value <= 0) %>%
count(word, value, sort = TRUE)%>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("Masque of the Red Death Negative Sentiment")
## Joining, by = "word"
## Joining, by = "word"
RedDeath_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
count(word, value,sort = TRUE) %>%
filter(n >0) %>%
filter(value > 0) %>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("The Masque of the Red Death Positive Sentiment")
## Joining, by = "word"
## Joining, by = "word"
## Explanation
The most common negative word in The Masque of the Red Death is death, which scores roughly -2 on the AFINN scale. It is followed by terror and mad, which both carry a strongly negative score of -3 and occur 3 times each.
The story also shares the words terror and death with the other texts.
On the positive side, the word dreams occurs the most but has the lowest positive value. It is followed by bold and strong, which share a similar value. Two words that stand out are excited and fantastic, with values between 3 and 4.
MasqueRedDeath %>%
unnest_tokens(bigram, X1, token="ngrams", n=2) -> MasqueRedDeath_bigrams
MasqueRedDeath_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 1,842 × 2
## bigram n
## <chr> <int>
## 1 of the 60
## 2 in the 20
## 3 to the 15
## 4 from the 13
## 5 it was 12
## 6 and the 11
## 7 there were 10
## 8 the prince 9
## 9 there was 8
## 10 of his 7
## # … with 1,832 more rows
MasqueRedDeath_filtered <- MasqueRedDeath_bigrams %>%
separate(bigram, c("word1", "word2"), sep=" ")
RedDeath_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) -> MasqueRedDeathAFINN
## Joining, by = "word"
## Joining, by = "word"
not_words1 <- MasqueRedDeath_filtered %>%
filter(word1 == "not") %>%
inner_join(MasqueRedDeathAFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
MasqueRedDeath_filtered %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) -> MasqueRedDeath_filtered2
MasqueRedDeath_bigram_counts <- MasqueRedDeath_filtered2 %>%
count(word1, word2, sort = TRUE )
not_words1 %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")
The word approved is slightly positive, but a single negated occurrence is not substantial enough to change the overall score.
amontillado_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
filter(value <= 0) %>%
count(word, value, sort = TRUE)%>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("Amontillado Negative Sentiment")
## Joining, by = "word"
## Joining, by = "word"
amontillado_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments('afinn')) %>%
count(word, value,sort = TRUE) %>%
filter(n >0) %>%
filter(value > 0) %>%
head(20) %>%
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + coord_flip() + ylab("Number of Occurrences") +
xlab("Words") + ggtitle("Amontillado Positive Sentiment")
## Joining, by = "word"
## Joining, by = "word"
This story’s most common negative word, doubts, is relatively low on the value scale, as is cry, which is tied for second most common. Die and arrested are the most negative words in this chart, but each occurs only once.
The most common positive word in Amontillado is true, which has an AFINN score of 2.0 and occurs four times. Matter comes next, but is relatively weak at a value of 1.0. The most frequent high-value word is love, followed by happy, excited, beloved, astounded, and admired.
Amontillado %>%
unnest_tokens(bigram, X1, token="ngrams", n=2) -> Amontillado_bigrams
Amontillado_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 1,779 × 2
## bigram n
## <chr> <int>
## 1 of the 30
## 2 ugh ugh 14
## 3 he he 12
## 4 i said 12
## 5 it is 10
## 6 <NA> 10
## 7 and the 9
## 8 you are 9
## 9 he said 8
## 10 i had 8
## # … with 1,769 more rows
Amontillado_filtered <- Amontillado_bigrams %>%
separate(bigram, c("word1", "word2"), sep=" ")
amontillado_words %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) -> AmontilladoAFINN
## Joining, by = "word"
## Joining, by = "word"
no_words2 <- Amontillado_filtered %>%
filter(word1 == "no") %>%
inner_join(AmontilladoAFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
Amontillado_filtered %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) -> Amontillado_filtered2
Amontillado_bigram_counts <- Amontillado_filtered2 %>%
count(word1, word2, sort = TRUE )
no_words2 %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"no\"")
‘Matter’ is a slightly positive word, so preceding it with ‘no’ turns that contribution negative.
I expected the skew in some of these stories to be substantial, given Poe’s creative use of language.
I saw three different scenarios across the stories. In the first bi-gram chart, for The Fall of the House of Usher, the word ‘no’ preceded both negative and positive words. However, the word count shows this story had over 7,000 words, increasing the number of sentiment-bearing words and diluting the effect of any single negated term.
In the second bi-gram chart, using ‘not’ instead of ‘no’, I found that ‘not’ occurred only once before ‘approved’, a positive word that negation turns negative. The original mean score was already quite negative, so this did not adjust the score.
The third bi-gram chart used ‘no’ again, and ‘matter’ was the only word that appeared, making it read more negative than it really is. Since this story’s mean sentiment score is positive, perhaps this (and other negated words) points to a slight decrease; however, more research would have to be done.
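To put a rough number on how much negation could move the score, one could flip the AFINN value of every word that follows ‘no’ or ‘not’ and recompute the mean; a hedged sketch for The Cask of Amontillado, reusing Amontillado_filtered and amontillado_sentiment from above (the object names negated, adjustment, and adjusted_mean are illustrative):
negated <- Amontillado_filtered %>%
  filter(word1 %in% c("no", "not")) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word"))

adjustment <- sum(-2 * negated$value)   # flipping a value changes the total by -2 * value

adjusted_mean <- (sum(amontillado_sentiment$value) + adjustment) /
  nrow(amontillado_sentiment)

mean(amontillado_sentiment$value)       # original mean
adjusted_mean                           # mean after roughly accounting for negation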
Poe used more harsh words than positive words in The Masque of the Red Death and The Fall of the House of Usher, while in The Cask of Amontillado he used more positive (although not highly valued) words than negative ones. The most positive word Poe used was valued at 4 and the most negative at -3; he never reached the extreme values (5 or -5) in these texts. This could explain why the mean sentiment scores range only between about 0.5 and -2.
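This range can be checked directly from the matched AFINN values; a quick sketch using the sentiment data frames built earlier:
bind_rows(usher_sentiment, MasqueRedDeath_sentiment, amontillado_sentiment) %>%
  summarize(most_positive = max(value), most_negative = min(value))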
Across my column charts, no clear effect could be inferred, since the bi-grams would only slightly have adjusted the mean sentiment score. More research into n-grams would be needed to fully analyze the sentiment of these texts.
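As a possible next step, the same bi-gram approach extends to a broader set of negation words; a sketch for Usher, with an illustrative (not exhaustive) word list:
negation_words <- c("not", "no", "never", "without")   # illustrative list

Usher_filtered %>%
  filter(word1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE)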