Introduction

In this project I will analyze the sentiment of the novel The Scarlet Letter by Nathaniel Hawthorne. The Scarlet Letter is a work of historical fiction about sin and punishment, in which the main character, Hester Prynne, has to deal with the consequences of adultery.

Prediction / Hypothesis

I predict that this novel will have a negative sentiment, because Nathaniel Hawthorne is known for writing darker stories about history, morality and romance. Hawthorne can be distinguished from other authors by the emphasis he places on human fallibility, which gives rise to his characters' lapses in judgement. These lapses often cause good men and women to drift toward sin and self-destruction. (https://americanliterature.com/author/nathaniel-hawthorne#:~:text=Along%20with%20Herman%20Melville%20and,toward%20sin%20and%20self%2Ddestruction.)

ScarletLetter

First, I loaded all the packages needed for this text analysis. Without these packages, the code would not work.

library(tidyverse)
library(tidytext)
library(ggplot2)
library(readr)
library(wordcloud2)
library(textdata)
library(ggthemes)

Sentiment Lexicons

As stated earlier, I will be focusing on sentiment within this novel.

I will use three different sentiment lexicons: bing, AFINN and nrc. All three are based on unigrams (single words) but have different purposes. (https://www.tidytextmining.com/sentiment.html)

The bing lexicon categorizes words in a binary fashion into positive and negative categories.

The AFINN lexicon assigns each word a score between -5 and 5. A negative score means the word has a negative meaning/connotation and a positive score means the word has a positive meaning/connotation; the further the score is from 0, the stronger the sentiment.

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into emotions of anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. It also categorizes the words into categories of positive and negative.

These three lexicons are central to analyzing the sentiment over the course of this novel.
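To make the difference between the three lexicons concrete, here is a minimal sketch using small hand-written stand-ins rather than the real data. The names toy_bing, toy_afinn, toy_nrc and toy_tokens are hypothetical. The bing labels and AFINN scores for "smile" and "shame" match the output shown later in this analysis, while the nrc rows are invented purely to illustrate that one word can carry several labels.

```r
library(tibble)
library(dplyr)

# Toy stand-ins for the three lexicons (a few illustrative rows, not the full data)
toy_bing <- tribble(
  ~word,   ~sentiment,
  "smile", "positive",
  "shame", "negative"
)

toy_afinn <- tribble(
  ~word,   ~value,
  "smile",  2,
  "shame", -2
)

# Invented rows: in an nrc-style lexicon a word can appear once per label
toy_nrc <- tribble(
  ~word,   ~sentiment,
  "smile", "joy",
  "smile", "positive",
  "shame", "sadness",
  "shame", "negative"
)

# Three tokens; "minister" is in none of the toy lexicons
toy_tokens <- tibble(word = c("smile", "shame", "minister"))

bing_matches  <- inner_join(toy_tokens, toy_bing,  by = "word")  # one label per word
afinn_matches <- inner_join(toy_tokens, toy_afinn, by = "word")  # one score per word
nrc_matches   <- inner_join(toy_tokens, toy_nrc,   by = "word")  # one row per word and label
```

Because an nrc-style join returns one row per word and label, its counts can add up to more than the number of matched words, which is worth keeping in mind when comparing totals across lexicons.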

One thing to keep in mind about the bing and nrc lexicons is that both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the bing lexicon than the nrc lexicon.

The bing lexicon has 4781 negative words and 2005 positive words. The nrc lexicon has 3318 negative words and 2308 positive words. (The code below gives the exact numbers.)

get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
get_sentiments("nrc") %>% 
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3318
## 2 positive   2308
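The ratios mentioned above can be worked out directly from these counts. This is a small arithmetic sketch; the variable names bing_ratio and nrc_ratio are my own.

```r
# Counts copied from the output above
bing_ratio <- 4781 / 2005  # negative words per positive word in bing
nrc_ratio  <- 3318 / 2308  # negative words per positive word in nrc

round(c(bing = bing_ratio, nrc = nrc_ratio), 2)
```

This gives roughly 2.38 negative words per positive word in bing, against roughly 1.44 in nrc.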

Installing and Cleaning the Text

I first had to download the entire novel. I obtained a .txt file from the Project Gutenberg website. I then imported this .txt file into RStudio and went through the novel to remove any symbols, words or text that were not necessary for the analysis.

This is the code for importing the entire book and cleaning the data. The code anti_join(stop_words) takes out words like “and”, “the”, “a”, etc. These words would not be beneficial for the analysis, so it was necessary to take them out.

library(readr)
scarletletter <- read_csv("scarletletter.txt", 
                          col_names = FALSE)
scarletletter %>%
  unnest_tokens(word, X1) %>% 
  anti_join(stop_words) -> scarletletterwords
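One caveat with the import step: read_csv() treats commas as column separators, so in a plain-text novel any text after the first comma in a line may end up outside the X1 column and never reach unnest_tokens(). Below is a minimal sketch of an alternative, assuming readr's read_lines() is acceptable here. The sample file and the names sample_path, sample_text and sample_words are hypothetical, and the counts reported in this analysis come from the read_csv() import, so they may differ with this approach.

```r
library(readr)
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical two-line sample standing in for scarletletter.txt
sample_path <- tempfile(fileext = ".txt")
writeLines(
  c("The door of the jail, being flung open, revealed a figure.",
    "She bore in her arms a child."),
  sample_path
)

# read_lines() returns one string per line and does not split on commas
sample_text <- tibble(X1 = read_lines(sample_path))

# Same tokenizing and stop-word removal as in the analysis
sample_words <- sample_text %>%
  unnest_tokens(word, X1) %>%
  anti_join(stop_words, by = "word")
```

With this import, words that follow a comma (such as "flung" in the sample) are kept as tokens.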

Now that I have clean data for The Scarlet Letter, I will begin my analysis.


Most Common Words

To start, I counted the number of words being analyzed in the book.

The Scarlet Letter has 26,217 words (not including stop words). I counted them to get a sense of how much text the analysis covers.

count(scarletletterwords)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 26217

This graph shows the most common words in the book, each used more than 75 times. I removed the words Hester, Prynne, Pearl, Roger, Scarlet, Letter and Dimmesdale. These words would skew my analysis, as they appear so often only because they are the names of characters and the main subject of the book.

scarletletterwords %>% 
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  filter(n>75) %>% 
  filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter' , 'dimmesdale' , 'roger')) %>% 
  ggplot(aes(reorder(word, n), n, fill=word)) + geom_col() +  coord_flip() + theme_light() + xlab("Word") + 
  ylab("Number of Times Word is Used") + ggtitle("Scarlet Letter Popular Words")

This table shows the specific number of times the popular words are used.

scarletletterwords %>% 
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  filter(n>75) %>% 
  filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter' , 'dimmesdale' , 'roger'))
## # A tibble: 11 × 2
## # Groups:   word [11]
##    word         n
##    <chr>    <int>
##  1 thou       240
##  2 child      192
##  3 minister   156
##  4 mother     132
##  5 heart      127
##  6 life       125
##  7 hand       101
##  8 thee        95
##  9 eyes        91
## 10 woman       86
## 11 thy         85

This is a word cloud of the words in the book, with the same names removed. The bigger the word, the more times the word is used.

scarletletterwords %>% 
  anti_join(stop_words) %>% 
  filter(!word %in% c('hester', 'pearl', 'prynne' , 'scarlet' , 'letter', 'dimmesdale','roger')) %>% 
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  wordcloud2()

As you can see, Nathaniel Hawthorne uses words like “thou,” “child,” “minister,” “mother,” and “heart.” From these words it can be understood that Nathaniel Hawthorne was an elegant writer of his time.


Sentiment in the Scarlet Letter using bing

Now, onto sentiment. I first analyzed the sentiment in this novel using the bing Lexicon.

Using the bing lexicon, I was able to determine how many words are considered “positive” and how many words are considered “negative” in this novel.

scarletletterwords %>% 
  inner_join(get_sentiments('bing')) %>%  
  filter(sentiment %in% c("positive", "negative")) %>% 
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3042
## 2 positive   2018

In The Scarlet Letter there are 3042 negative words and 2018 positive words.

As you can tell, Nathaniel Hawthorne’s book seems to be more negative than positive.

It is also important to keep in mind what I said at the beginning: the bing lexicon has a higher ratio of negative to positive words than the nrc lexicon (which we will see later).
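To put the two ratios side by side, here is a small arithmetic sketch using only counts already reported in this analysis. The variable names are my own.

```r
# Counts of bing-matched words in the novel (from the output above)
novel_negative <- 3042
novel_positive <- 2018

# Counts of words in the bing lexicon itself (from the lexicon section)
lexicon_negative <- 4781
lexicon_positive <- 2005

novel_ratio    <- novel_negative / novel_positive      # negative per positive in the novel
lexicon_ratio  <- lexicon_negative / lexicon_positive  # negative per positive in bing
negative_share <- novel_negative / (novel_negative + novel_positive)

round(c(novel = novel_ratio, lexicon = lexicon_ratio, share = negative_share), 2)
```

About 60% of the bing-matched words in the novel are negative, a ratio of roughly 1.51 negative words per positive word, compared with roughly 2.38 in the lexicon itself.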

I then wanted to dive deeper and see the most common positive and negative words.

scarletletterwords %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() 
## # A tibble: 1,386 × 3
##    word    sentiment     n
##    <chr>   <chr>     <int>
##  1 smile   positive     51
##  2 wild    negative     46
##  3 sin     negative     44
##  4 strange negative     43
##  5 poor    negative     41
##  6 dark    negative     40
##  7 evil    negative     39
##  8 death   negative     38
##  9 shame   negative     38
## 10 love    positive     36
## # … with 1,376 more rows

As shown by this table, most of the top words are negative. It is interesting, though, that “smile,” a positive word, is the most used of all!

The bar graphs below show more clearly how the most common negative and positive words compare.

scarletletterwords %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() -> slgraph

slgraph %>% 
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>% 
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)


Sentiment in the Scarlet Letter using AFINN

To dig further, I will use the AFINN lexicon. AFINN scores words from -5 to 5, and 0 represents a neutral word. In the code below I will determine how many words the lexicon analyzed, how many were positive, how many were negative, and the mean AFINN value of all the words analyzed.

In this clean text, the AFINN lexicon matched 3,347 words, of which 1,550 were positive and 1,797 were negative.

scarletletterwords %>% 
  inner_join(get_sentiments("afinn")) %>% 
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  3347
scarletletterwords %>% 
  inner_join(get_sentiments("afinn")) %>% 
  filter(value > 0) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1550
scarletletterwords %>% 
  inner_join(get_sentiments("afinn")) %>% 
  filter(value < 0) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1797
scarletletterwords %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) -> scarletletterafinn

mean(scarletletterafinn$value)
## [1] -0.1855393

Additionally, the mean AFINN value of -0.1855393 suggests that this novel has a slightly negative average sentiment.
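To illustrate how the mean is formed, here is a minimal sketch on toy data. The four scores match the AFINN values that appear in this analysis's output, while the token list and the names toy_afinn, toy_tokens and scored are hypothetical.

```r
library(tibble)
library(dplyr)

# Toy lexicon: four words with the AFINN scores shown in this analysis
toy_afinn <- tribble(
  ~word,   ~value,
  "smile",  2,
  "evil",  -3,
  "love",   3,
  "death", -2
)

# Toy tokens: "evil" occurs twice, "minister" has no score
toy_tokens <- tibble(word = c("smile", "evil", "death", "love", "minister", "evil"))

# inner_join() keeps only scored words, one row per occurrence
scored <- toy_tokens %>%
  inner_join(toy_afinn, by = "word")

nrow(scored)        # 5 scored tokens; "minister" is dropped
mean(scored$value)  # (2 - 3 - 2 + 3 - 3) / 5 = -0.6
```

Each occurrence of a word counts separately, so frequent words pull the mean more strongly, and words that are not in the lexicon have no influence at all.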

The table below illustrates the most commonly used words in the AFINN lexicon within this text, and their values.

scarletletterwords %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(value) %>% 
  count(word, sort=TRUE)
## # A tibble: 742 × 3
## # Groups:   value [9]
##    value word        n
##    <dbl> <chr>   <int>
##  1     2 smile      51
##  2    -1 strange    43
##  3    -2 poor       41
##  4     1 spirit     40
##  5    -3 evil       39
##  6    -2 death      38
##  7    -2 shame      38
##  8     3 love       36
##  9    -2 cried      34
## 10     2 true       31
## # … with 732 more rows

To look more closely, I then used AFINN to examine the most strongly positive and most strongly negative words and their specific values.

Positive Words - (Table and Wordcloud)

scarletletterwords %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) %>%
  arrange(desc(value)) %>% 
  head(100)
## # A tibble: 100 × 3
##    word           n value
##    <chr>      <int> <dbl>
##  1 brilliant      9     4
##  2 heavenly       9     4
##  3 triumph        9     4
##  4 win            9     4
##  5 fantastic      7     4
##  6 miracle        5     4
##  7 triumphant     4     4
##  8 terrific       3     4
##  9 wonderful      3     4
## 10 rejoice        2     4
## # … with 90 more rows
scarletletterwords %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) %>%
  arrange(desc(value)) %>%
  head(100) %>% 
  wordcloud2()

Negative Words - (Table and Wordcloud)

scarletletterwords %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) %>%
  arrange(value) %>% 
  head(100)
## # A tibble: 100 × 3
##    word          n value
##    <chr>     <int> <dbl>
##  1 prick         1    -5
##  2 torture      15    -4
##  3 tortured      6    -4
##  4 tortures      1    -4
##  5 evil         39    -3
##  6 guilty       19    -3
##  7 miserable    18    -3
##  8 anguish      17    -3
##  9 dead         16    -3
## 10 lost         16    -3
## # … with 90 more rows
scarletletterwords %>% 
  count(word, sort = TRUE) %>% 
  inner_join(get_sentiments("afinn")) %>%
  arrange(value) %>%
  head(100) %>% 
  wordcloud2()

The most strongly positive words were “brilliant,” “heavenly,” and “triumph,” whereas the most strongly negative words were “prick,” “torture,” and “tortured.”

I think the negative words are interesting to look at. If you have read The Scarlet Letter, you will recognize that these words describe how Hester feels, and how other people feel towards Hester, throughout the novel as a result of her sins.


Sentiment in the Scarlet Letter using nrc

The nrc Lexicon sorts the words into 8 different emotions and 2 categories. The 8 emotions are anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The 2 categories are positive and negative.

The table below shows the exact count of words in each category and emotion.

scarletletterwords %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment) %>% 
  arrange(desc(n))
## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 positive      4051
##  2 negative      3031
##  3 trust         2153
##  4 anticipation  1886
##  5 sadness       1859
##  6 joy           1851
##  7 fear          1844
##  8 anger         1275
##  9 disgust       1043
## 10 surprise       824

Interestingly enough, the nrc lexicon shows that The Scarlet Letter has more positive words than negative. Once again, though, the nrc lexicon has a smaller negative to positive ratio than the bing lexicon.

Emotion-wise, the nrc lexicon assigns the most words to trust and anticipation, and the fewest to disgust and surprise.

These results make sense. The Scarlet Letter is about a character (Hester) who does not know what is going to happen in her future as a result of her adultery. This causes anticipation. Additionally, many words are related to trust because Hester does not know who to trust after she is forced to wear the letter A around. Both these emotions can cause sadness, which is the third most popular emotion.

Below is a bar graph to show these emotions and categories more clearly.

scarletletterwords %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment) %>% 
  # reorder() sorts the bars from the most words to the fewest
  ggplot(aes(reorder(sentiment, -n), n)) + geom_col() + theme_light() + 
  xlab("Sentiment Category") + ylab("Number of Words") + ggtitle("Number of Words in Sentiment Categories (nrc)")


Conclusions

Using three different sentiment lexicons, I was able to gain a good understanding of the overall sentiment in the novel The Scarlet Letter.

After running many tests, I can confidently say that this book has a more negative sentiment than positive sentiment. The book has a negative undertone, and this is shown by the dark word choice and the classification of words in categories like anticipation, trust and sadness. I think it is very interesting, though, that the nrc lexicon tells us that there are more positive words than negative words. If I had more time, I would dive deeper into the nrc lexicon and try to find out the specific words that are increasing this positive count!

Thanks for reading my text analysis. I hope you enjoyed it!
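As a starting point for that follow-up, here is a minimal sketch of how the words behind one nrc label could be listed. The helper top_words_for_label() and the toy data are hypothetical, not part of the analysis above, and the toy lexicon rows are invented for illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical helper: most frequent words carrying one lexicon label
top_words_for_label <- function(words, lexicon, label, n_top = 10) {
  words %>%
    inner_join(lexicon, by = "word") %>%  # keep only words found in the lexicon
    filter(sentiment == label) %>%        # keep only the requested label
    count(word, sort = TRUE) %>%          # count occurrences, most frequent first
    head(n_top)
}

# Toy tokens and toy lexicon (invented labels, for illustration only)
toy_words <- tibble(word = c("mother", "child", "mother", "sin", "minister"))
toy_lexicon <- tribble(
  ~word,    ~sentiment,
  "mother", "positive",
  "child",  "positive",
  "sin",    "negative"
)

top_positive <- top_words_for_label(toy_words, toy_lexicon, "positive")
```

With the objects from this analysis, the equivalent call would be top_words_for_label(scarletletterwords, get_sentiments("nrc"), "positive"), which would list the words contributing most to the positive count.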


A work by Kendall Gilbert - Media Analytics Student