Coding Challenge: Count words in Moby Dick

The book “Moby Dick” by Herman Melville describes the epic battle of a gloomy captain against his personal nemesis, the white whale. Which of the two is mentioned more often in the book?

Data

The book is available from the Project Gutenberg website.
We can use the gutenbergr package to download the text directly.

Download data
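
The code for this step is not shown, so the following is only a minimal sketch of how the download might look. It assumes the gutenbergr function gutenberg_download() and the Project Gutenberg ID 2701, which appears in the gutenberg_id column of the output. The variable name moby is a hypothetical choice, and the call needs network access to a Gutenberg mirror.

```r
library(gutenbergr)

# 2701 is the Project Gutenberg ID seen in the gutenberg_id column.
# `moby` is an illustrative name, not one taken from the original code.
moby <- gutenberg_download(2701)

# Printing the tibble shows one row per line of the book.
moby
```

The number of rows may differ from the output shown here if the text on Project Gutenberg has been revised since.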

# A tibble: 21,712 x 2
   gutenberg_id text                                                      
          <int> <chr>                                                     
 1         2701 MOBY DICK; OR THE WHALE                                   
 2         2701 ""                                                        
 3         2701 By Herman Melville                                        
 4         2701 ""                                                        
 5         2701 ""                                                        
 6         2701 ""                                                        
 7         2701 ""                                                        
 8         2701 Original Transcriber's Notes:                             
 9         2701 ""                                                        
10         2701 This text is a combination of etexts, one from the now-de…
# ... with 21,702 more rows

First, we remove all blank lines (rows whose text is the empty string "").
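
A minimal sketch of this filtering step, assuming dplyr is used. The small tibble lines_raw is a hypothetical stand-in for the downloaded book, so that the example runs without network access.

```r
library(dplyr)

# Hypothetical stand-in for the first rows of the downloaded book.
lines_raw <- tibble(
  gutenberg_id = 2701L,
  text = c("MOBY DICK; OR THE WHALE", "", "By Herman Melville", "", "")
)

# Keep only the rows whose text is not the empty string.
lines_clean <- lines_raw %>%
  filter(text != "")

lines_clean
```

This keeps only rows that are not exactly empty. Lines consisting of whitespace alone would survive; if those should be dropped as well, trimming the text first (for example with trimws()) is one option.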

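Next, we have to reshape the data frame so that it meets the requirements for importing a data frame as a source in the tm package: the first column must be named doc_id and hold a unique identifier for each document, and the second column must be named text.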
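
One way to sketch this transformation, assuming dplyr and using the row number as document identifier, as the output suggests. The tibble lines_clean is a hypothetical stand-in for the book after the blank lines have been removed.

```r
library(dplyr)

# Hypothetical stand-in for the book with blank lines already removed.
lines_clean <- tibble(
  gutenberg_id = 2701L,
  text = c("MOBY DICK; OR THE WHALE", "By Herman Melville")
)

# Replace gutenberg_id with a character doc_id built from the row number
# and keep the text column, giving the two columns doc_id and text.
docs <- lines_clean %>%
  transmute(doc_id = as.character(row_number()), text)

docs
```

A data frame of this shape can then be passed to tm, for example through DataframeSource(), which treats every row as a separate document.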

# A tibble: 6 x 2
  doc_id text                                                             
  <chr>  <chr>                                                            
1 1      MOBY DICK; OR THE WHALE                                          
2 2      By Herman Melville                                               
3 3      Original Transcriber's Notes:                                    
4 4      This text is a combination of etexts, one from the now-defunct E…
5 5      project at Virginia Tech and one from Project Gutenberg's archiv…
6 6      proofreaders of this version are indebted to The University of A…