The book “Moby Dick” by Herman Melville describes the epic battle of a gloomy captain against his personal nemesis, the white whale. Which of the two is mentioned more often in the book?
The book is available from the Project Gutenberg web site.
We can use the gutenbergr package to download the text directly.
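The download call itself is not shown in the text. A minimal sketch, assuming the `gutenberg_download()` function from gutenbergr and the ID 2701 that appears in the `gutenberg_id` column of the output; the variable name `moby` is taken from the code further down. It needs an internet connection.

```r
library(gutenbergr)

# 2701 is the Project Gutenberg ID that appears in the gutenberg_id column
moby <- gutenberg_download(2701)

# Printing the tibble shows the first rows of the book, one line of text per row
moby
```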
# A tibble: 21,712 x 2
gutenberg_id text
<int> <chr>
1 2701 MOBY DICK; OR THE WHALE
2 2701 ""
3 2701 By Herman Melville
4 2701 ""
5 2701 ""
6 2701 ""
7 2701 ""
8 2701 Original Transcriber's Notes:
9 2701 ""
10 2701 This text is a combination of etexts, one from the now-de…
# ... with 21,702 more rows
First, we can remove all blank lines (having only "" as text).
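The filtering code is not shown. One way to do it is sketched here in base R on a small stand-in data frame; the helper `remove_blank_lines()` and the toy data are illustrative assumptions, not the author's code.

```r
# Hypothetical helper: keep only the rows whose text is not the empty string
remove_blank_lines <- function(df) {
  df[df$text != "", , drop = FALSE]
}

# Toy stand-in with the same two columns as the downloaded tibble
toy <- data.frame(
  gutenberg_id = rep(2701L, 4),
  text = c("MOBY DICK; OR THE WHALE", "", "By Herman Melville", ""),
  stringsAsFactors = FALSE
)

toy_clean <- remove_blank_lines(toy)
nrow(toy_clean)
```

Applied to the real data this would read `moby <- remove_blank_lines(moby)`. A plain data frame keeps its original row names after subsetting, whereas a tibble (as returned by gutenbergr) always reports row names 1 to n, which is what the `doc_id` step relies on.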
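Next, we have to reshape the data frame so that it meets the requirements for importing a data frame as a source in the tm package: the first column must be named doc_id and hold a unique identifier, and the second must be named text.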
moby_df <- moby
# rename the columns as necessary
colnames(moby_df) <- c("doc_id", "text")
# fill the 'doc_id' column with a unique identifier (the row number)
moby_df[, "doc_id"] <- rownames(moby_df)
head(moby_df)
# A tibble: 6 x 2
doc_id text
<chr> <chr>
1 1 MOBY DICK; OR THE WHALE
2 2 By Herman Melville
3 3 Original Transcriber's Notes:
4 4 This text is a combination of etexts, one from the now-defunct E…
5 5 project at Virginia Tech and one from Project Gutenberg's archiv…
6 6 proofreaders of this version are indebted to The University of A…
library(tm)
# 1- create a source from a data frame
moby_source <- DataframeSource(moby_df)
# 2- create a corpus
moby_corpus <- VCorpus(moby_source)
moby_corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 18563
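The lines that follow look like the content and metadata of the tenth document. A sketch of how a single document can be inspected, using a two-row stand-in data frame (the toy data and the names `toy_df`, `toy_corpus` and `doc` are assumptions for illustration):

```r
library(tm)

# Toy stand-in with the two columns DataframeSource expects
toy_df <- data.frame(
  doc_id = c("1", "2"),
  text = c("By Herman Melville", "for the British pound, a unit of currency."),
  stringsAsFactors = FALSE
)
toy_corpus <- VCorpus(DataframeSource(toy_df))

# Double brackets extract one document from the corpus
doc <- toy_corpus[[2]]
doc$content  # the text of the document
meta(doc)    # its metadata (author, datetimestamp, id, language, ...)
```

For the real corpus the equivalent calls would be `moby_corpus[[10]]$content` and `meta(moby_corpus[[10]])`.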
[1] "for the British pound, a unit of currency."
author : character(0)
datetimestamp: 2018-09-25 16:01:31
description : character(0)
heading : character(0)
id : 10
language : en
origin : character(0)
We will create a function that cleans the corpus.
# clean the corpus by removing punctuation, white space, numbers, and
# common English stop words (the, if, ...)
tm_clean <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))     # convert to lower case
  corpus <- tm_map(corpus, removePunctuation)                # remove punctuation
  corpus <- tm_map(corpus, removeWords, c(stopwords("en")))  # remove common English stop words
  corpus <- tm_map(corpus, removeNumbers)                    # remove numbers
  corpus <- tm_map(corpus, stemDocument)                     # reduce words to their stem (merges 'whale' and 'whales')
  corpus <- tm_map(corpus, stripWhitespace)                  # remove extra spaces
  return(corpus)
}
# apply the function to our corpus
moby_clean <- tm_clean(moby_corpus)
# Compare the original and the cleaned version of document 10
moby_corpus[[10]]$content
[1] "for the British pound, a unit of currency."
moby_clean[[10]]$content
[1] "british pound unit currenc"
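The truncated token "currenc" comes from the stemming step. A small sketch of what `stemDocument()` does to individual words, assuming tm's method for character vectors (which relies on the SnowballC package being installed):

```r
library(tm)

# Stem a few words on their own; singular and plural collapse to the same stem
stems <- stemDocument(c("whale", "whales", "currency"))
stems
```

Because of this, the count for a stem such as "whale" covers all of its inflected forms.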
We now have a cleaned corpus.
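The term-document matrix `moby_tdm` used in the next step is not created in the code shown. It was presumably built with tm's `TermDocumentMatrix()`; the sketch below demonstrates the call on a three-document toy corpus (the toy texts and the `toy_` names are assumptions).

```r
library(tm)

# Toy stand-in for the cleaned corpus (hypothetical data, not the book)
toy_corpus <- VCorpus(VectorSource(c(
  "white whale",
  "captain ahab whale",
  "whale harpoon"
)))

# A term-document matrix has one row per term and one column per document
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Row sums give the total count of each term across all documents
toy_freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)
toy_freq

# In this walkthrough the corresponding step would presumably be:
# moby_tdm <- TermDocumentMatrix(moby_clean)
```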
For visualization, we need to reshape the term-document matrix (TDM) built from the cleaned corpus a little.
# transform the result into a matrix
moby_tdm_m <- as.matrix(moby_tdm)
# sum each row to get the total frequency of each term, and sort the results
term_frequency <- rowSums(moby_tdm_m)
term_frequency <- sort(term_frequency, decreasing = TRUE)
# and turn it into a data frame
term_frequency <- data.frame(term_frequency)
term_frequency$word <- rownames(term_frequency)
Plot the 25 most frequent terms for the book.
library(ggplot2)
library(forcats)  # fct_reorder()
library(dplyr)    # %>%, filter(), select()
ggplot(term_frequency[1:25, ],
       aes(x = fct_reorder(word, term_frequency), y = term_frequency)) +
  geom_col(width = 0.5, fill = "tan") +
  coord_flip() +
  labs(title = "Top 25 words used in `Moby Dick`", subtitle = "",
       caption = "Data from Project Gutenberg (gutenberg.org)") +
  geom_text(aes(label = term_frequency), hjust = -0.5, family = "Bookman",
            fontface = "italic", size = 3) +
  scale_y_continuous(limits = c(0, 1600), expand = c(0, 0)) +
  theme(axis.text.x = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_text(family = "Bookman", face = "italic", size = 12),
        panel.grid = element_blank(), panel.background = element_blank(),
        plot.title = element_text(family = "Bookman", face = "bold", size = 20, hjust = 0.5),
        plot.caption = element_text(family = "Bookman", face = "italic", size = 8))

term_frequency %>%
  filter(word %in% c("whale", "captain", "harpoon")) %>%
  select(word, term_frequency) %>%
  knitr::kable(col.names = c("word", "frequency"), caption = "Which one appears most?")
| word | frequency |
|---|---|
| whale | 1460 |
| captain | 337 |
| harpoon | 242 |