In this project I will analyze the novel The Scarlet Letter by Nathaniel Hawthorne, specifically the book's sentiment. The Scarlet Letter is a work of historical fiction about sin and punishment, in which the main character, Hester Prynne, has to deal with the consequences of adultery.
I predict that this novel will have a negative sentiment. This is because Nathaniel Hawthorne is known for writing stories on the darker side of history, morality and romance. Hawthorne can be distinguished from other authors by the emphasis he places on human fallibility, which gives rise to his characters' lapses in judgement. These lapses often cause good men and women to drift toward sin and self-destruction. (https://americanliterature.com/author/nathaniel-hawthorne#:~:text=Along%20with%20Herman%20Melville%20and,toward%20sin%20and%20self%2Ddestruction.)
First, I loaded all the packages needed for this text analysis. Without these packages, the code would not work.
library(tidyverse)
library(tidytext)
library(ggplot2)
library(readr)
library(wordcloud2)
library(textdata)
library(ggthemes)
As stated earlier, I will be focusing on sentiment within this novel.
I will use three different sentiment lexicons: bing, AFINN and nrc. All three are based on unigrams (single words) but serve different purposes. (https://www.tidytextmining.com/sentiment.html)
The bing lexicon categorizes words in a binary fashion into positive and negative categories.
The AFINN lexicon assigns each word a score between -5 and 5. A negative score means the word has a negative meaning or connotation, and a positive score means it has a positive one; the further the score is from 0, the stronger the sentiment.
The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into emotions of anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. It also categorizes the words into categories of positive and negative.
These three lexicons are central to analyzing the sentiment over the course of this novel.
One thing to keep in mind about the bing and nrc lexicons is that both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the bing lexicon than the nrc lexicon.
The bing lexicon has 4781 negative words and 2005 positive words. The nrc lexicon has 3318 negative words and 2308 positive words. (The code below gives the exact numbers.)
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3318
## 2 positive 2308
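The ratios mentioned above can be worked out directly from these counts. The following is a minimal sketch that uses only the numbers printed in the output; the data frame name `lexicon_counts` is just an illustrative choice.

```r
# Counts copied from the get_sentiments() output above
lexicon_counts <- data.frame(
  lexicon  = c("bing", "nrc"),
  negative = c(4781, 3318),
  positive = c(2005, 2308)
)

# Negative-to-positive ratio for each lexicon
lexicon_counts$neg_to_pos <- lexicon_counts$negative / lexicon_counts$positive

print(lexicon_counts)
```

The bing ratio comes out at roughly 2.4 negative words for every positive word, while the nrc ratio is roughly 1.4, which is why bing is more likely to tilt a result toward the negative side.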
I first had to download the entire novel. I obtained a .txt file from the Project Gutenberg website. I then imported this .txt file into RStudio and went through the novel to remove any symbols, words or text that were not necessary for the analysis.
This is the code for importing the entire book and cleaning the data. The code anti_join(stop_words) takes out words like “and”, “the”, “a”, etc. These words would not be beneficial for the analysis, so it was necessary to take them out.
library(readr)
scarletletter <- read_csv("scarletletter.txt",
col_names = FALSE)
scarletletter %>%
unnest_tokens(word, X1) %>%
anti_join(stop_words) -> scarletletterwords
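One thing that may be worth checking in the import step: read_csv() treats commas as column separators, so a line of prose that contains commas may be split across several columns, and X1 may not hold the whole line. The sketch below shows a line-based alternative using read_lines() from readr. It is only an illustration under stated assumptions: the two sentences, the temporary file, and the names `tmp`, `sample_lines` and `sample_words` are made up for the example and are not part of the original analysis.

```r
library(dplyr)
library(tibble)
library(readr)
library(tidytext)

# Two made-up lines containing commas, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("Hester Prynne, with the child, stood on the scaffold.",
    "The minister, pale and trembling, looked away."),
  tmp
)

# read_lines() returns one string per line and does not split on commas
sample_lines <- tibble(text = read_lines(tmp))

sample_words <- sample_lines %>%
  unnest_tokens(word, text) %>%         # one lowercase word per row
  anti_join(stop_words, by = "word")    # drop common stop words

print(sample_words)
```

Words that appear after a comma, such as "scaffold", are kept, while stop words such as "the" and "and" are removed. Swapping this approach into the real import would likely change the word counts reported in this analysis, so the numbers would need to be re-run.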
Now that I have clean data for The Scarlet Letter, I will begin my analysis.
To start, I counted the number of words that are being analyzed in the book.
The Scarlet Letter has 26,217 words (not including stop words). I counted them to get a sense of how much text the analysis covers.
count(scarletletterwords)
## # A tibble: 1 × 1
## n
## <int>
## 1 26217
This graph shows the most popular words in the book, meaning those used more than 75 times. I removed the words Hester, Prynne, Pearl, Roger, Scarlet, Letter and Dimmesdale. These words would make my analysis inaccurate, as they appear so often only because they are the names of characters and the main topic of the book.
scarletletterwords %>%
group_by(word) %>%
count() %>%
arrange(desc(n)) %>%
filter(n>75) %>%
filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter' , 'dimmesdale' , 'roger')) %>%
ggplot(aes(reorder(word, n), n, fill=word)) + geom_col() + coord_flip() + theme_light() + xlab("Word") +
ylab("Number of Times Word is Used") + ggtitle("Scarlet Letter Popular Words")
This table shows the specific number of times the popular words are used.
scarletletterwords %>%
group_by(word) %>%
count() %>%
arrange(desc(n)) %>%
filter(n>75) %>%
filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter' , 'dimmesdale' , 'roger'))
## # A tibble: 11 × 2
## # Groups: word [11]
## word n
## <chr> <int>
## 1 thou 240
## 2 child 192
## 3 minister 156
## 4 mother 132
## 5 heart 127
## 6 life 125
## 7 hand 101
## 8 thee 95
## 9 eyes 91
## 10 woman 86
## 11 thy 85
This is a Wordcloud for all the words in the book. The bigger the word, the more times the word is used.
scarletletterwords %>%
anti_join(stop_words) %>%
filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter', 'dimmesdale','roger')) %>%
group_by(word) %>%
count() %>%
arrange(desc(n)) %>%
wordcloud2()
As you can see, Nathaniel Hawthorne uses words like “thou,” “child,” “minister,” “mother,” and “heart.” From these words, it seems that Nathaniel Hawthorne was an elegant writer of his time.
Now, onto sentiment. I first analyzed the sentiment in this novel using the bing Lexicon.
Using the bing lexicon, I was able to determine how many words are considered “positive” and how many words are considered “negative” in this novel.
scarletletterwords %>%
inner_join(get_sentiments('bing')) %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3042
## 2 positive 2018
In The Scarlet Letter there are 3042 negative words and 2018 positive words.
As you can tell, Nathaniel Hawthorne’s book seems to be more negative than positive.
It is also important to keep in mind what I said at the beginning: the bing lexicon has a higher ratio of negative to positive words than the nrc lexicon (which we will see later).
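To put the bing result in context, the share of negative words among the matched words in the book can be set against the share of negative words in the bing lexicon itself. This is a rough sketch using only the counts reported earlier in this analysis; the variable names are illustrative, and the lexicon share is only a loose baseline, since words are not used equally often.

```r
# Counts reported in this analysis
book_bing    <- c(negative = 3042, positive = 2018)  # matched words in the novel
lexicon_bing <- c(negative = 4781, positive = 2005)  # entries in the bing lexicon

# Share of negative words in each
book_share    <- book_bing[["negative"]] / sum(book_bing)
lexicon_share <- lexicon_bing[["negative"]] / sum(lexicon_bing)

# Express both as percentages, rounded to one decimal place
shares <- round(100 * c(book = book_share, lexicon = lexicon_share), 1)
print(shares)
```

About 60% of the bing-matched words in the novel are negative, compared with about 70% of the entries in the lexicon, so the negative majority in the book should be read with the lexicon's own imbalance in mind.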
I then wanted to dive deeper and see the most common positive and negative words.
scarletletterwords %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## # A tibble: 1,386 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 smile positive 51
## 2 wild negative 46
## 3 sin negative 44
## 4 strange negative 43
## 5 poor negative 41
## 6 dark negative 40
## 7 evil negative 39
## 8 death negative 38
## 9 shame negative 38
## 10 love positive 36
## # … with 1,376 more rows
As shown by this table, there are definitely more negative words among the most common ones. It is interesting, though, that the single most used word, “smile,” has a positive sentiment!
The bar graphs below show more clearly just how many more negative than positive words there are.
scarletletterwords %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup() -> slgraph
slgraph %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
To satisfy my curiosity further, I will use the AFINN lexicon. AFINN scores words from -5 to 5, with 0 representing a neutral word. In the code below I will determine how many words the lexicon analyzed, how many were positive, how many were negative, and the mean AFINN value of all the words analyzed.
In this clean text, the AFINN lexicon analyzed 3,347 words, of which 1,550 were positive and 1,797 were negative.
scarletletterwords %>%
inner_join(get_sentiments("afinn")) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 3347
scarletletterwords %>%
inner_join(get_sentiments("afinn")) %>%
filter(value > 0) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1550
scarletletterwords %>%
inner_join(get_sentiments("afinn")) %>%
filter(value < 0) %>%
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1797
scarletletterwords %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) -> scarletletterafinn
mean(scarletletterafinn$value)
## [1] -0.1855393
Additionally, the mean AFINN value of -0.1855393 suggests that this novel has a slightly negative average sentiment.
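How the mean is formed can be seen on a very small scale. The sketch below is only an illustration: `toy_afinn` is a hand-made four-word stand-in for get_sentiments("afinn"), using scores that appear in the AFINN tables in this report, and `toy_words` is a made-up list of tokens.

```r
library(dplyr)
library(tibble)

# Hand-made stand-in for the AFINN lexicon (scores as shown in this report)
toy_afinn <- tibble(
  word  = c("smile", "strange", "poor", "evil"),
  value = c(2, -1, -2, -3)
)

# Made-up tokens; "scaffold" has no score in the toy lexicon
toy_words <- tibble(word = c("smile", "evil", "poor", "strange", "scaffold"))

# inner_join() keeps only the words that have a score
scored <- toy_words %>%
  inner_join(toy_afinn, by = "word")

mean_value <- mean(scored$value)
print(mean_value)
```

Words without a score, like "scaffold" here, are dropped by the join and do not affect the mean. The four remaining scores (2, -3, -2, -1) average to -1.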
The table below illustrates the most commonly used words in the AFINN lexicon within this text, and their values.
scarletletterwords %>%
anti_join(stop_words) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(value) %>%
count(word, sort=TRUE)
## # A tibble: 742 × 3
## # Groups: value [9]
## value word n
## <dbl> <chr> <int>
## 1 2 smile 51
## 2 -1 strange 43
## 3 -2 poor 41
## 4 1 spirit 40
## 5 -3 evil 39
## 6 -2 death 38
## 7 -2 shame 38
## 8 3 love 36
## 9 -2 cried 34
## 10 2 true 31
## # … with 732 more rows
To look more closely, I then used AFINN to examine the positive and negative words and their specific values.
Positive Words - (Table and Wordcloud)
scarletletterwords %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
arrange(desc(value)) %>%
head(100)
## # A tibble: 100 × 3
## word n value
## <chr> <int> <dbl>
## 1 brilliant 9 4
## 2 heavenly 9 4
## 3 triumph 9 4
## 4 win 9 4
## 5 fantastic 7 4
## 6 miracle 5 4
## 7 triumphant 4 4
## 8 terrific 3 4
## 9 wonderful 3 4
## 10 rejoice 2 4
## # … with 90 more rows
scarletletterwords %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
arrange(desc(value)) %>%
head(100) %>%
wordcloud2()
Negative Words - (Table and Wordcloud)
scarletletterwords %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
arrange(value) %>%
head(100)
## # A tibble: 100 × 3
## word n value
## <chr> <int> <dbl>
## 1 prick 1 -5
## 2 torture 15 -4
## 3 tortured 6 -4
## 4 tortures 1 -4
## 5 evil 39 -3
## 6 guilty 19 -3
## 7 miserable 18 -3
## 8 anguish 17 -3
## 9 dead 16 -3
## 10 lost 16 -3
## # … with 90 more rows
scarletletterwords %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments("afinn")) %>%
arrange(value) %>%
head(100) %>%
wordcloud2()
The most strongly positive words were “brilliant,” “heavenly,” and “triumph,” whereas the most strongly negative words were “prick,” “torture,” and “tortured.”
I think the negative words are interesting to look at. If you have ever read The Scarlet Letter before, these words describe how Hester is feeling and other people’s feelings towards Hester throughout the novel as a result of her sins.
The nrc Lexicon sorts the words into 8 different emotions and 2 categories. The 8 emotions are anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The 2 categories are positive and negative.
The table below shows the exact count of words in each category and emotion.
scarletletterwords %>%
inner_join(get_sentiments("nrc")) %>%
count(sentiment) %>%
arrange(desc(n))
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 positive 4051
## 2 negative 3031
## 3 trust 2153
## 4 anticipation 1886
## 5 sadness 1859
## 6 joy 1851
## 7 fear 1844
## 8 anger 1275
## 9 disgust 1043
## 10 surprise 824
Interestingly enough, the nrc lexicon shows that The Scarlet Letter has more positive words than negative. Once again though, the nrc lexicon has a smaller negative to positive ratio than the bing lexicon.
Emotion-wise, the nrc lexicon assigns the most words to trust and anticipation and the fewest to disgust and surprise.
These results make sense. The Scarlet Letter is about a character (Hester) who does not know what is going to happen in her future as a result of her adultery. This causes anticipation. Additionally, many words are related to trust because Hester does not know who to trust after she is forced to wear the letter A around. Both these emotions can cause sadness, which is the third most popular emotion.
Below is a bar graph to show these emotions and categories more clearly.
scarletletterwords %>%
inner_join(get_sentiments("nrc")) %>%
count(sentiment) %>%
arrange(desc(n)) %>%
ggplot(aes(sentiment, n)) + geom_col() +
xlab("Sentiment Category") + ylab("Number of Words") + ggtitle("Number of Words in Sentiment Categories (nrc)") + theme_light()
Using three different sentiment lexicons, I was able to achieve a good understanding of the overall sentiment in the novel The Scarlet Letter.
After running many tests, I can say that two of the three lexicons (bing and AFINN) point to this book having a more negative sentiment than positive sentiment. The book has a negative undertone, and this is shown by the dark word choice and the classification of words into categories like anticipation, trust and sadness. I think it is very interesting, though, that the nrc lexicon tells us that there are more positive words than negative words. If I had more time, I would dive deeper into the nrc lexicon and try to find out the specific words that are increasing this positive count!
Thanks for reading my text analysis. I hope you enjoyed it!
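A possible starting point for that follow-up is sketched below. It is an assumption-laden illustration, not a result: `toy_nrc` is a hypothetical stand-in with made-up entries rather than the real nrc lexicon, and `toy_words` is a made-up list of tokens. With the real data, get_sentiments("nrc") and scarletletterwords would take their places.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for get_sentiments("nrc"); NOT the real lexicon entries.
# As in nrc, one word can appear under several sentiments.
toy_nrc <- tibble(
  word      = c("mother", "mother", "child", "child", "sin", "sin"),
  sentiment = c("positive", "trust", "positive", "joy", "negative", "fear")
)

# Made-up tokens standing in for the words of the novel
toy_words <- tibble(
  word = c("mother", "child", "mother", "sin", "child", "mother", "scaffold")
)

# Keep only the positive entries so each word matches at most once
positive_lexicon <- toy_nrc %>%
  filter(sentiment == "positive")

# Count how often each positive word occurs, most frequent first
top_positive <- toy_words %>%
  inner_join(positive_lexicon, by = "word") %>%
  count(word, sort = TRUE)

print(top_positive)
```

Because a single word can sit in several nrc categories at once, the ten counts in the nrc table overlap, and a handful of very frequent words tagged as positive could account for much of the positive total.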
A work by Kendall Gilbert - Media Analytics Student