Jane Austen Novels: Word Frequency and Sentiment Analysis

Jane Austen was a British author who penned six major novels: Sense and Sensibility (1811), Pride and Prejudice (1813), Mansfield Park (1814), Emma (1815), Northanger Abbey (1818, posthumous), and Persuasion (1818, posthumous). Pride and Prejudice is my favorite novel, so this project was a way to delve a little further and explore the works of a favorite novelist from a linguistic and quantitative perspective. It looks at word frequency and analyzes sentiment across all six of the novels. Originally I planned on downloading the novels from Project Gutenberg, but fortunately for me, there is an R package called “janeaustenr” that contains the six major Jane Austen novels in easy-to-use data frames.

Word Frequency

One way to analyze the language that is used by a particular author is to look at the word frequency. Word frequency, simply put, is looking at how often a word appears in a collection of words.
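As a small illustration of the idea, the base R sketch below counts word frequency in the opening sentence of Pride and Prejudice. The tokenization rule (split on anything that is not a letter or apostrophe) is a simplifying assumption made for this example and is not the tokenizer used in the rest of the analysis.

```r
# Opening sentence of Pride and Prejudice, used as a tiny corpus.
sentence <- paste(
  "It is a truth universally acknowledged, that a single man in",
  "possession of a good fortune, must be in want of a wife."
)

# Lowercase the text and split on anything that is not a letter or apostrophe.
words <- tolower(unlist(strsplit(sentence, "[^A-Za-z']+")))
words <- words[words != ""]

# Raw counts per word, most frequent first.
freq <- sort(table(words), decreasing = TRUE)

# Relative frequency: count divided by the total number of words.
rel_freq <- freq / length(words)

head(freq, 3)
head(round(rel_freq, 3), 3)
```

Even in a single sentence, the most frequent words are short function words such as “a”, which previews the pattern seen across the full novels.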

To compute word frequencies for the novels, the count of each word and the total number of words per novel need to be found.

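library(dplyr)
library(tidytext)
library(janeaustenr)

# Tokenize every novel into one word per row, then count each word per book.
all_austen_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)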

total_words <- all_austen_words %>%
  group_by(book) %>%
  summarize(total = sum(n))
total_words

The above table shows the total number of words in each Jane Austen novel.

Next, the per-novel totals are joined to the word counts. “n” is the number of times a word occurs in a novel, and “total” is the total number of words in that novel. The table is ordered from most occurrences to least.

all_austen_words <- left_join(all_austen_words, total_words, by = "book")
all_austen_words

The above table shows that the most common words in Jane Austen novels are words like “the” and “of”, along with other closed-class (function) words such as articles and prepositions.

This can also be shown in a histogram.

library(ggplot2)

# Distribution of each word's share of its novel (n / total), one panel per book.
ggplot(all_austen_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE, bins = 30) +
  xlim(NA, 0.0009) +
  scale_fill_brewer() +
  theme_light() +
  ggtitle("Figure 1: Word Count across Total Words of each Jane Austen Novel") +
  facet_wrap(~book, ncol = 2)

Figure 1 is a histogram that shows the count of a word divided by the total number of words in the particular novel.

All six novels show the same trend, with a long tail to the right: most words occur rarely, while a small number of very common words account for a large share of each novel. This is to be expected, but those very common words are not specifically related to the content of each story.

Word Frequency and TF-IDF

In order to look at the words that are more meaningful in terms of content of each novel, the data needs to be transformed.

Zipf’s Law states that the frequency of a word is inversely proportional to its rank. In practice, this means that a small number of top-ranked words, mostly function words, make up a large share of any long text while saying little about what the text is about.
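One way to check this relationship is to plot or fit frequency against rank on a log-log scale, where Zipf’s Law predicts a straight line with a slope near -1. The base R sketch below uses made-up counts that follow the law exactly (count = 1200 / rank), so the words and numbers are purely illustrative; real text only follows the law approximately.

```r
# Hypothetical counts that follow Zipf's law exactly: count = 1200 / rank.
counts <- c(the = 1200, to = 600, and = 400, of = 300, her = 240, a = 200)

# Rank 1 is the most frequent word; frequency is each word's share of all words.
zipf <- data.frame(
  word = names(counts),
  rank = unname(rank(-counts, ties.method = "first")),
  frequency = as.numeric(counts) / sum(counts)
)

# On a log-log scale, Zipf's law is a straight line with slope -1.
fit <- lm(log10(frequency) ~ log10(rank), data = zipf)
slope <- unname(coef(fit)[2])
slope
```

The same fit could be run on the “n” column of a single novel's word counts to see how closely that novel tracks the idealized slope.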

One way to find the words that occur frequently in a specific novel yet are not used throughout all of Jane Austen’s novels is to find the tf-idf values.

novel_tf_idf <- all_austen_words %>%
  bind_tf_idf(word, book, n)
novel_tf_idf

The above table is still sorted by count, so it starts with the words that have the highest term frequency (tf) in the Austen novels.

tf = n/total

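Next is the inverse document frequency (idf), which is the natural log of the number of novels divided by the number of novels that contain the word. A word that appears in all six novels therefore has an idf of ln(6/6) = 0. The bind_tf_idf() call above already computes tf, idf, and tf-idf for each word. The tf-idf value equals the term frequency multiplied by the inverse document frequency.

idf = ln(number of novels / number of novels containing the word)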

tf-idf = tf * idf
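As a minimal sketch of how these quantities fit together, the base R snippet below works through tf, idf, and tf-idf by hand on a made-up corpus of two books. The data frame, book labels, and counts are hypothetical. The idf uses the standard definition, which to my understanding is also what bind_tf_idf() computes: the natural log of the number of documents divided by the number of documents containing the word.

```r
# Hypothetical word counts for two books (one row per book-word pair).
counts <- data.frame(
  book = c("A", "A", "A", "B", "B", "B"),
  word = c("the", "emma", "ball", "the", "darcy", "ball"),
  n    = c(6, 3, 1, 5, 4, 1),
  stringsAsFactors = FALSE
)

# Term frequency: a word's count divided by the total words in its book.
total <- ave(counts$n, counts$book, FUN = sum)
counts$tf <- counts$n / total

# Inverse document frequency: ln(number of books / books containing the word).
n_books <- length(unique(counts$book))
books_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_books / books_with_word)

# tf-idf is the product of the two.
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

Words that appear in both books (“the”, “ball”) get an idf of zero and drop out, while words unique to one book (“emma”, “darcy”) rise to the top, which mirrors why character names dominate the tf-idf rankings.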

novel_tf_idf %>%
  arrange(desc(tf_idf))

After sorting the table in descending order of tf-idf, the words that have the highest tf-idf values appear at the top.

library(forcats)

# Keep the 15 highest tf-idf words per novel and plot them as bars.
novel_tf_idf %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_y") +
  labs(x = "tf-idf", y = NULL) +
  scale_fill_brewer() +
  theme_light() +
  ggtitle("Figure 2: Top 15 TF-IDF Values for each Austen Novel")

Figure 2 shows the words that are common within one Austen novel but rare or absent in the other five. The terms that have the highest tf-idf values are character names and locations.

Sentiment Analysis

Conducting sentiment analysis on a corpus of work can provide insight into the tone that the author uses. In this project, the Bing sentiment lexicon was used. This lexicon, first published in 2004, contains more than 6700 words that have been tagged as either positive or negative.

library(stringr)
library(tidyr)

sentiment_bing <- get_sentiments("bing")

books_sentiment <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

After numbering the lines of each novel and splitting them into words, the sentiment can be calculated by tagging the words that have positive or negative sentiment and counting them within sections of 80 lines.

bing_austen_sentiments <- books_sentiment %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  # index groups the text into sections of 80 lines
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
bing_austen_sentiments

The sentiment value for each section is calculated by subtracting the number of negative words from the number of positive words.
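To make the calculation concrete, the base R sketch below applies the same idea to a handful of made-up tokens: tag each word using a lexicon, bin lines into sections with integer division, and subtract negative counts from positive counts. The four-word lexicon, the tokens, and the section size of 3 lines are all hypothetical stand-ins chosen to keep the example small.

```r
# Hypothetical mini-lexicon standing in for the Bing lexicon.
lexicon <- c(happy = "positive", love = "positive",
             miserable = "negative", angry = "negative")

# Hypothetical tokens, each with the line number it came from.
words <- data.frame(
  linenumber = c(1, 1, 2, 3, 4, 4, 5),
  word = c("happy", "love", "angry", "the", "miserable", "angry", "love"),
  stringsAsFactors = FALSE
)

# Tag each word; words not in the lexicon get NA and are dropped.
words$sentiment <- unname(lexicon[words$word])
tagged <- words[!is.na(words$sentiment), ]

# Integer division groups lines into sections (3 lines here, 80 in the analysis).
tagged$index <- tagged$linenumber %/% 3

# Count positive and negative words per section, then take the difference.
tab <- table(tagged$index, tagged$sentiment)
net <- tab[, "positive"] - tab[, "negative"]
net
```

A positive value means a section has more positive than negative words, and a negative value means the opposite. Words that are not in the lexicon do not affect the score at all.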

ggplot(bing_austen_sentiments, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free") +
  scale_fill_brewer()+
  theme_light()+
  ggtitle("Figure 3: Measure of Sentiment over the Course of each Austen Novel")

Figure 3 shows how the sentiment changes over the progression of each Austen novel.

Overall, it appears that the six Jane Austen novels are generally positive, with most sections containing more positive than negative words.

Word Clouds

Word clouds are a way to visualize words that are used most often. The larger the font of the word, the more often it is used.

austen_count_wordcloud <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

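library(wordcloud)

# Drop stop words, count the remaining words, and draw the 120 most frequent.
austen_count_wordcloud %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 120))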

This word cloud shows the words that are most frequently used in Jane Austen novels, after filtering out stop words. Some of the more prominent words are “dear”, “lady”, and “house”, which, given that Austen wrote romance novels set in the Regency era, were important in the quest to be happily married and settled. Other words in a larger font include prominent character names, especially those that appear in multiple novels, like “fanny” and “elizabeth”.

# Count how often each sentiment-tagged word appears across all six novels.
bing_word_count <- books_sentiment %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)
bing_word_count

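library(reshape2)

# Reshape to a word-by-sentiment matrix, as comparison.cloud() expects.
books_sentiment %>%
  inner_join(get_sentiments("bing"), by = "word") %>%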
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(random.order = FALSE, colors = c("dark blue", "light blue"),
                   max.words = 120)

This word cloud shows the most frequently used sentiment-bearing words in the Austen novels. The negative words are colored in dark blue toward the top of the cloud, while the positive words are in light blue toward the bottom. The word “miss” stands out among the negative words, but its size is misleading: the lexicon tags “miss” as negative, as in missing someone, while in Jane Austen’s works it is mostly the common title used for unmarried women. Its high frequency therefore likely inflates the negative counts.
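One possible way to handle a word like “miss” is to remove it from the lexicon before tagging. The base R sketch below shows the effect on a few made-up tokens. The small lexicon, the tokens, and the names custom_stop_words and count_sentiment are hypothetical and introduced only for illustration; this step was not part of the analysis above.

```r
# Hypothetical stand-in for the Bing lexicon (the real one has thousands of rows).
lexicon <- data.frame(
  word = c("miss", "happy", "poor", "love"),
  sentiment = c("negative", "positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens from a novel.
tokens <- c("miss", "bennet", "was", "happy", "miss", "poor", "love", "miss")

# Words to exclude because their lexicon tag does not match Austen's usage.
custom_stop_words <- c("miss")
lexicon_adjusted <- lexicon[!lexicon$word %in% custom_stop_words, ]

# Count negative and positive tokens under a given lexicon.
count_sentiment <- function(tokens, lexicon) {
  tagged <- lexicon$sentiment[match(tokens, lexicon$word)]
  table(factor(tagged, levels = c("negative", "positive")))
}

before <- count_sentiment(tokens, lexicon)
after <- count_sentiment(tokens, lexicon_adjusted)
before
after
```

Removing the word lowers only the negative count and leaves the positive count untouched, so section-level sentiment scores would shift upward wherever “miss” appears as a title.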

Conclusions

This was a fun little project aimed at exploring natural language processing techniques and visualizing the results in R with RStudio. Using word frequency and sentiment analysis to look at Jane Austen’s novels was valuable: these methods give more insight into the kind of language an author chooses to use and allow some generalizations about the author’s style.

Further research could certainly be done. If there were more time to devote to this project, other sentiment lexicons could be used to analyze this set of novels. These methods could also be used to examine the work of other authors and compare language styles.

Resources

R Packages

cite_packages()

  - Emil Hvitfeldt (2020). textdata: Download and Load Various Text Datasets. R package version 0.4.1. https://CRAN.R-project.org/package=textdata
  - Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
  - H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  - Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
  - Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
  - Hadley Wickham (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1. https://CRAN.R-project.org/package=forcats
  - Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr
  - Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7. https://CRAN.R-project.org/package=dplyr
  - Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud
  - Julia Silge (2017). janeaustenr: Jane Austen's Complete Novels. R package version 0.1.5. https://CRAN.R-project.org/package=janeaustenr
  - Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020). Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption. CRAN. Available from https://github.com/easystats/report.
  - R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  - Silge J, Robinson D (2016). "tidytext: Text Mining and Analysis Using Tidy Data Principles in R." JOSS, 1(3). doi: 10.21105/joss.00037. URL https://doi.org/10.21105/joss.00037

Other Sources

Austen, J. (1994, August). Emma. The Project Gutenberg eBook of Emma, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/158/pg158-images.html

Austen, J. (1994, June). Mansfield Park. The Project Gutenberg eBook of Mansfield Park, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/141/pg141-images.html

Austen, J. (1994, April). Northanger Abbey. The Project Gutenberg eBook of Northanger Abbey, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/121/pg121-images.html

Austen, J. (2019, February 15). Persuasion. The Project Gutenberg E-text of Persuasion, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/105/pg105-images.html

Austen, J. (1998, July). Pride and Prejudice. The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/1342/pg1342-images.html

Austen, J. (1994, September). Sense and Sensibility. The Project Gutenberg eBook of Sense and Sensibility, by Jane Austen. Retrieved December 15, 2021, from https://www.gutenberg.org/cache/epub/161/pg161-images.html