Question

When Charles Dickens was born in the early 1800s, language was very different from what it is today. For this project I wanted to look at the most common words in three of his most famous books: A Christmas Carol, Oliver Twist, and The Pickwick Papers.

Load Required Packages

library(ggplot2)
library(dplyr)
library(tidyverse)
library(wordcloud2)
library(tidytext)
library(ggthemes)
library(rmarkdown)
library(readr)
library(textdata)
library(gridExtra)
library(gtools)
#load("~/.RData")
# Read each book from Project Gutenberg, one line of text per row (column V1)
# pg19337 is A Christmas Carol
pg19337 <- read.delim("https://www.gutenberg.org/cache/epub/19337/pg19337.txt", header = FALSE)
Oliver_Twist <- read.delim("https://www.gutenberg.org/files/730/730-0.txt", header = FALSE)
Pickwick <- read.delim("https://www.gutenberg.org/files/580/580-0.txt", header = FALSE)

Setup

To start this project, I looked up Charles Dickens and his most famous books. I chose these three because I knew them offhand, so I hope they are familiar enough for everyone to recognize.

First I downloaded the three books from Project Gutenberg (gutenberg.org), bound them into one data frame, unnested each line into individual words, and removed the stop words.

Dickens <- rbind(pg19337, Oliver_Twist, Pickwick)

Exclude_words <- c( "don't", "project", "tm", "1", "electronic", "terms", "license", "foundation", "agreement", "copyright", "ebook", "copy", "trademark", "access", "paragraph", "literary", "archive", "donations", "copies", "laws", "ugh", "united", "fee", "set", "charge", "permission", "paid", "www.gutenberg.org", "3", "public", "ebooks", "domain", "distribute", "refund", "posted", "individual", "â", "replied")

Dickens %>% 
  unnest_tokens(word, V1) %>% 
  anti_join(stop_words)-> Dickens_words

## Joining, by = "word"

Dickens_words %>% 
  # stop words were already removed when Dickens_words was built
  filter(!word %in% Exclude_words) %>% 
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  filter(n > 100) -> Dickens_Save
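
The Exclude_words list above removes Project Gutenberg license vocabulary word by word. An alternative is to drop the license text before tokenizing. The following is only a sketch: `strip_gutenberg` is a hypothetical helper, and it assumes each file wraps the book between lines containing "*** START OF" and "*** END OF", which should be checked against the actual files because the marker wording varies.

```r
# Hypothetical helper: keep only the lines between the Project Gutenberg
# start and end markers. The marker wording is an assumption.
strip_gutenberg <- function(lines) {
  start <- grep("\\*\\*\\* ?START OF", lines)
  end   <- grep("\\*\\*\\* ?END OF", lines)
  # If either marker is missing, return the text unchanged
  if (length(start) == 0 || length(end) == 0) return(lines)
  first <- start[1] + 1
  last  <- end[1] - 1
  # Nothing between the markers
  if (first > last) return(character(0))
  lines[first:last]
}

# Toy input standing in for one downloaded book
toy <- c("License header",
         "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
         "Marley was dead: to begin with.",
         "There is no doubt whatever about that.",
         "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
         "License footer")

body <- strip_gutenberg(toy)
print(body)
```

Applied to the data here, this might look like `data.frame(V1 = strip_gutenberg(pg19337$V1))` for each book before the `rbind()`, assuming the text sits in column V1. Words such as "replied" or "don't" would still need the manual list.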

Method

After cleaning the data, I wanted to find the 15 most common words across the three books. Since these books were written in the 1800s, I wanted to see how those words compare to our everyday speech today.

Dickens_Save %>% 
  head(15) %>% 
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#FFFF00") +  # a single yellow bar layer
  geom_text(aes(label = n), hjust = 2, vjust = 0, color = "black", size = 3.5) +
  coord_flip() +
  theme_light() +
  ggtitle("Dickens 15 Most Popular Words in the Three Books") +
  xlab("Word") +
  ylab("Count")

Dickens_Save %>% 
  wordcloud2()

Top Words

The top fifteen words turn out to be mostly names and words that were common back in the day. Pickwick, from The Pickwick Papers, tops the list with 1007 mentions. Time and gentleman come in a close second and third. Time is still a very common word, but gentleman is definitely not as common as it once was, and it is interesting to see that here. Another interesting word is Weller, a character name from The Pickwick Papers. Today the name may be more familiar from W. L. Weller, a popular Kentucky bourbon, although that brand is named after the distiller William Larue Weller rather than the Dickens character. One last word I find interesting is lady. Lady was commonly used as a way to address a woman back in the day, but it is not used as often now because it seems too formal or proper.

Sentiment

Dickens_words %>% 
  # stop words were already removed when Dickens_words was built
  inner_join(get_sentiments("afinn")) -> Dickens_afinn

## Joining, by = "word"

Dickens_afinn %>%   
  filter(!word %in% Exclude_words) %>% 
  group_by(word) %>% 
  count() %>% 
  arrange(desc(n)) -> Dickens_Sentiment

Dickens_Sentiment %>% 
  filter(n > 75) %>% 
  head(15) %>% 
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#0000CD") +  # a single blue bar layer
  geom_text(aes(label = n), hjust = 2, vjust = 0, color = "white", size = 3.5) +
  coord_flip() +
  theme_classic() +
  ggtitle("Charles Dickens Afinn Lexicon") +
  xlab("Word") +
  ylab("Count")

Sentiment Analysis

Looking at this graph, I get more callbacks to old-timey speech. “Dear” has the largest count, and dear is again a word I don't hear that often anymore. I find that it is only really used by your grandma when you walk in the house and nowhere else. Another interesting thing is that most of the popular words in this sentiment analysis seem to be negative: cried, fire, poor, dead, leave, stop, and death are all among the top words. This suggests that life in these books wasn't that happy and that life in general back then was difficult.
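
The graph counts how often each matched word appears, but it does not use the sentiment scores themselves. One way to check the impression that negative words dominate is to weight each count by its lexicon score. This is a minimal sketch, assuming the lexicon supplies a numeric `value` column per word (my understanding of recent tidytext versions of AFINN). The counts and scores below are made up for illustration and are not real AFINN values or results from the books.

```r
library(dplyr)

# Toy word counts; in the report these would come from Dickens_afinn
word_counts <- data.frame(
  word = c("dear", "poor", "cried", "happy"),
  n    = c(10, 6, 4, 5)
)

# Illustrative scores only, NOT the real AFINN values
toy_lexicon <- data.frame(
  word  = c("dear", "poor", "cried", "happy"),
  value = c(2, -2, -2, 3)
)

# Weight each count by its score, most negative contribution first
scored <- word_counts %>%
  inner_join(toy_lexicon, by = "word") %>%
  mutate(contribution = n * value) %>%
  arrange(contribution)

# Net score: above zero leans positive, below zero leans negative
net_score <- sum(scored$contribution)
print(scored)
print(net_score)
```

With real data, a net score below zero would support the reading that the vocabulary leans negative, while a score above zero would suggest that frequent positive words such as dear outweigh the negative ones.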

Conclusion

To conclude, comparing the 1800s with today shows major differences in our speech patterns and in what we talk about. It seems to me that life has become more positive over the last 200 years: life is less about where your next meal is coming from and more about who called you out on Instagram last week.

Charles Dickens Analysis

Ben Muse