Coding Challenge: Count words in Moby Dick

The book “Moby Dick” by Herman Melville describes an epic battle of a gloomy captain against his personal nemesis, the white whale. Which of the two is mentioned more often in the book?

Preparation

# load libraries
library(gutenbergr)  # download the book
library(tidyverse)  # data manipulation, plotting
library(qdap)  # text mining
library(tm)  # text mining

Data

The book is available from the Project Gutenberg web site.
We can use the gutenbergr library to directly download our data.

Download data

# download the book : 'Moby Dick'
moby <- gutenberg_download(gutenberg_id = 2701)
moby

# A tibble: 21,712 x 2
   gutenberg_id text                                                      
          <int> <chr>                                                     
 1         2701 MOBY DICK; OR THE WHALE                                   
 2         2701 ""                                                        
 3         2701 By Herman Melville                                        
 4         2701 ""                                                        
 5         2701 ""                                                        
 6         2701 ""                                                        
 7         2701 ""                                                        
 8         2701 Original Transcriber's Notes:                             
 9         2701 ""                                                        
10         2701 This text is a combination of etexts, one from the now-de…
# ... with 21,702 more rows

First, we can remove all blank lines (having only "" as text).

# remove blank lines
moby <- moby %>% filter(text != "")

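The tm package can only build a source from a data frame whose first column is named doc_id (a unique identifier for each document) and whose second column is named text, so we reshape the data frame accordingly.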

moby_df <- moby

# rename the columns as necessary
colnames(moby_df) <- c("doc_id", "text")

# provide a unique identifier to the column 'doc_id'
moby_df[, "doc_id"] <- rownames(moby_df)

head(moby_df)

# A tibble: 6 x 2
  doc_id text                                                             
  <chr>  <chr>                                                            
1 1      MOBY DICK; OR THE WHALE                                          
2 2      By Herman Melville                                               
3 3      Original Transcriber's Notes:                                    
4 4      This text is a combination of etexts, one from the now-defunct E…
5 5      project at Virginia Tech and one from Project Gutenberg's archiv…
6 6      proofreaders of this version are indebted to The University of A…
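If the document identifiers are not needed later, the reshaping step can probably be skipped by building the source from the text column directly. The following is a minimal sketch on a made-up two-line vector (`toy_text` and `toy_corpus` are illustrative names, not part of the original analysis); on the real data one would pass `moby$text` instead.

```r
library(tm)

# Toy stand-in for moby$text (hypothetical two-line "book")
toy_text <- c("Call me Ishmael.", "The whale, the whale!")

# VectorSource treats each element of the vector as one document,
# so no 'doc_id'/'text' column layout is needed
toy_corpus <- VCorpus(VectorSource(toy_text))

length(toy_corpus)        # number of documents in the corpus
toy_corpus[[2]]$content   # text of the second document
```

The data-frame route used in this post keeps an explicit identifier per line, which can be handy when tracing a term back to its source line.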

Corpus

# 1- create a source from a data frame
moby_source <- DataframeSource(moby_df)

# 2- create a corpus
moby_corpus <- VCorpus(moby_source)
moby_corpus

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 18563

# Look at a sample
moby_corpus[[10]]$content

[1] "for the British pound, a unit of currency."

moby_corpus[[10]]$meta

  author       : character(0)
  datetimestamp: 2018-09-25 16:01:31
  description  : character(0)
  heading      : character(0)
  id           : 10
  language     : en
  origin       : character(0)

We will create a function that cleans the corpus.

# clean the corpus by removing punctuation, white space, numbers, and
# common English words (the, if, ...)
tm_clean <- function(corpus) {
    corpus <- tm_map(corpus, content_transformer(tolower))  # convert to lower case
    corpus <- tm_map(corpus, removePunctuation)  # remove punctuation
    corpus <- tm_map(corpus, removeWords, c(stopwords("en")))  # remove common English stop words
    corpus <- tm_map(corpus, removeNumbers)  # remove numbers
    corpus <- tm_map(corpus, stemDocument)  # stem words so that variants (like 'whale' and 'whales') are counted together
    corpus <- tm_map(corpus, stripWhitespace)  # remove extra spaces
    return(corpus)
}

# apply the function to our corpus
moby_clean <- tm_clean(moby_corpus)

# Compare the original and the cleaned version
moby_corpus[[10]]$content

[1] "for the British pound, a unit of currency."

moby_clean[[10]]$content

[1] "british pound unit currenc"

We now have a cleaned corpus.
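The stemming step is the reason "currency" shows up as "currenc" above. As a rough illustration, the sketch below calls the stemmer from the SnowballC package directly on a few words; it assumes SnowballC is installed (tm relies on it for stemming) and that the "english" stemmer is the one applied to this corpus.

```r
library(SnowballC)

# A few words to stem; 'whale' and 'whales' should collapse to one term
words <- c("whale", "whales", "currency")

# wordStem() returns one stem per input word
stems <- wordStem(words, language = "english")

# show each word next to its stem
data.frame(word = words, stem = stems)
```

Because variants are merged this way, the frequencies computed later count word families rather than exact spellings.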

Term Document Matrix

# create a TDM (Term Document Matrix), to count words
moby_tdm <- TermDocumentMatrix(moby_clean)

For visualization, we need to reshape this TDM a little.

# transform the result into a matrix
moby_tdm_m <- as.matrix(moby_tdm)

# sum each row to get the frequency of each term, and sort the results
term_frequency <- rowSums(moby_tdm_m)
term_frequency <- sort(term_frequency, decreasing = TRUE)

# and turn the result into a data frame
term_frequency <- data.frame(term_frequency)
term_frequency$word <- rownames(term_frequency)
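Converting the TDM with as.matrix() produces a dense matrix with one column per line of the book, which can take a lot of memory. A possible alternative, sketched here on a made-up two-document corpus, is to sum the rows of the sparse matrix directly with row_sums() from the slam package (the package tm uses for its sparse matrices). The names `toy_corpus`, `toy_tdm`, `toy_counts`, and `toy_freq` are illustrative only.

```r
library(tm)
library(slam)

# Toy corpus standing in for moby_clean (hypothetical data)
toy_corpus <- VCorpus(VectorSource(c("whale whale captain",
                                     "captain harpoon whale")))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Sum each row of the sparse TDM without converting it to a dense matrix
toy_counts <- row_sums(toy_tdm)

# Build the same kind of data frame as above and sort by frequency
toy_freq <- data.frame(word = Terms(toy_tdm),
                       term_frequency = as.vector(toy_counts),
                       stringsAsFactors = FALSE)
toy_freq <- toy_freq[order(-toy_freq$term_frequency), ]
toy_freq
```

On the real data, the same calls would be applied to `moby_tdm` in place of `toy_tdm`.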

Plot

Plot the 25 most frequent terms in the book.

ggplot(term_frequency[1:25, ],
       aes(x = fct_reorder(word, term_frequency), y = term_frequency)) +
    geom_col(width = 0.5, fill = "tan") +
    coord_flip() +
    labs(title = "Top 25 words used in `Moby Dick`",
         subtitle = "",
         caption = "Data from Project Gutenberg (gutenberg.org)") +
    geom_text(aes(label = term_frequency), hjust = -0.5, family = "Bookman",
              fontface = "italic", size = 3) +
    scale_y_continuous(limits = c(0, 1600), expand = c(0, 0)) +
    theme(axis.text.x = element_blank(),
          axis.title = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_text(family = "Bookman", face = "italic", size = 12),
          panel.grid = element_blank(),
          panel.background = element_blank(),
          plot.title = element_text(family = "Bookman", face = "bold", size = 20, hjust = 0.5),
          plot.caption = element_text(family = "Bookman", face = "italic", size = 8))

term_frequency %>%
    filter(word %in% c("whale", "captain", "harpoon")) %>%
    select(word, term_frequency) %>%
    knitr::kable(col.names = c("word", "frequency"), caption = "Which one appears most?")

Which one appears most?
word	frequency
whale	1460
captain	337
harpoon	242
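Because the table is based on stemmed terms, it can be useful to cross-check a count against the raw text. The sketch below does this on a made-up two-line vector with stringr (loaded as part of the tidyverse); `count_word` is a hypothetical helper, and the prefix pattern it uses also matches longer words that start with the same letters, so it is only a rough check.

```r
library(stringr)

# Toy stand-in for moby$text (hypothetical data)
toy_text <- c("The whale, the whale!", "The captain and the whales.")

# Count occurrences of words that start with `prefix`, ignoring case
count_word <- function(text, prefix) {
    pattern <- paste0("\\b", prefix)
    sum(str_count(str_to_lower(text), pattern))
}

whale_count <- count_word(toy_text, "whale")      # whale, whale, whales
captain_count <- count_word(toy_text, "captain")  # captain
c(whale = whale_count, captain = captain_count)
```

On the real data, the call would look like `count_word(moby$text, "whale")`.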

‘Moby Dick’ word count
