In this vignette, we use the gutenbergr package to download “Little Women” by Louisa May Alcott, and public MonkeyLearn modules to learn a bit about its contents without reading it.

Note that the MonkeyLearn modules used here were not tested on books, so the results are probably not optimal.

library("monkeylearn")
library("gutenbergr")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.2.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
little_women <- gutenberg_download(c(514),
                                 meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://www.gutenberg.lib.md.us

We now use the tidytext package to get whole paragraphs.

We then paste the paragraphs together into one string and split that string at the word “chapter”, which gives a reasonable number of text fragments. Each fragment can be sent to the MonkeyLearn API as a single text, provided it is smaller than 50 kB.
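To make the splitting step concrete, here is a minimal base R sketch on a made-up string (not text from the book). It shows that strsplit removes the matched word and keeps whatever precedes the first match as the first fragment. Any occurrence of the word inside the running text would also cause a split, so the fragments are only approximately chapters.

```r
# Made-up string standing in for the pasted book text
toy_text <- "preface chapter one playing pilgrims chapter two a merry christmas"

# Same pattern as used for the book: split at "chapter" or "Chapter"
fragments <- strsplit(toy_text, "[Cc]hapter")[[1]]

fragments
# [1] "preface "                " one playing pilgrims "  " two a merry christmas"
length(fragments)
# [1] 3
```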

library("tidytext")
little_women <- little_women %>%
  unnest_tokens(paragraph, text, token = "paragraphs") %>%
  summarize(whole_text = paste(paragraph, collapse = " "))

chapters <- strsplit(little_women$whole_text, "[Cc]hapter")[[1]]

little_women_chapters <- tibble::tibble(
  chapter = seq_along(chapters),
  text = chapters
)

all(nchar(little_women_chapters$text, type = "bytes") < 50000)
## [1] TRUE
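The check above counts bytes rather than characters because the size limit is expressed in bytes, and non-ASCII characters take more than one byte in UTF-8. A small base R illustration with a made-up string:

```r
# Made-up string with one accented character ("café")
accent_text <- "caf\u00e9"

# Characters vs. bytes once the string is encoded as UTF-8
n_chars <- nchar(accent_text, type = "chars")
n_bytes <- nchar(enc2utf8(accent_text), type = "bytes")

n_chars
# [1] 4
n_bytes
# [1] 5
```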

Every fragment is small enough to be sent to the API. The API accepts up to 20 texts per call, but the monkeylearn functions split a vector of texts into batches automatically, so we can submit the whole vector little_women_chapters$text without further ado.
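As an illustration of what such batching could look like, here is a base R sketch. It is not the package's actual implementation, and split_into_batches is a hypothetical helper written for this example only.

```r
# Hypothetical helper: group texts into batches of at most `batch_size`
split_into_batches <- function(texts, batch_size = 20) {
  batch_index <- ceiling(seq_along(texts) / batch_size)
  unname(split(texts, batch_index))
}

# 45 placeholder texts end up in three batches
toy_texts <- paste("text", 1:45)
batches <- split_into_batches(toy_texts)

lengths(batches)
# [1] 20 20  5
```

With this scheme, any number of texts between 41 and 60 gives three batches, hence three requests.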

Entity extractor

A first question we could ask ourselves about the book is who its main characters are, and where it takes place.

entities <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_isnnZRbS",
                                verbose = TRUE)
## Processing request number 1 out of 3
## Processing request number 2 out of 3
## Processing request number 3 out of 3
entities %>%
  group_by(entity, tag) %>%
  summarize(n_occurences = n()) %>%
  arrange(desc(n_occurences)) %>%
  filter(n_occurences > 5) %>%
  knitr::kable()
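| entity | tag | n_occurences |
|:-------------|:---------|----:|
| amy | PERSON | 40 |
| laurie | PERSON | 32 |
| mr. laurence | PERSON | 14 |
| mr. brooke | PERSON | 13 |
| meg | PERSON | 9 |
| paris | LOCATION | 9 |
| american | LOCATION | 7 |
| washington | LOCATION | 7 |
| john | PERSON | 6 |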

Keywords?

keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurences = sum(count)) %>%
  arrange(desc(n_occurences)) %>%
  filter(n_occurences > 10) %>%
  knitr::kable()
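| keyword | n_occurences |
|:--------------|----:|
| meg | 627 |
| amy | 468 |
| laurie | 460 |
| beth | 309 |
| john | 103 |
| mr. bhaer | 62 |
| aunt march | 54 |
| demi | 43 |
| mr. brooke | 43 |
| mother | 38 |
| mrs. march | 35 |
| boys | 33 |
| table | 31 |
| old gentleman | 23 |
| mr. laurence | 20 |
| mr. dashwood | 18 |
| letters | 17 |
| professor | 17 |
| heart | 16 |
| limes | 15 |
| story | 14 |
| flo | 13 |
| hand | 13 |
| mr. davis | 13 |
| thing | 12 |
| young ladies | 12 |
| gloves | 11 |
| snodgrass | 11 |
| uncle | 11 |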

Interestingly, keyword extraction does a better job than the entity extractor at finding the main characters (yes, I have read the book).

In this table, the number of occurrences is the keyword's count summed over all the text fragments in which it was extracted.
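The difference between summing the count column and simply counting rows can be seen on a made-up data frame. The keyword and count columns mirror the ones used above. The text_id column and all values are invented for the example.

```r
library("dplyr")

# Made-up extractor output: one row per keyword per text fragment
toy_keywords <- tibble::tibble(
  text_id = c(1, 1, 2, 2, 3),   # hypothetical fragment identifier
  keyword = c("meg", "amy", "meg", "laurie", "amy"),
  count   = c(12, 7, 9, 4, 5)
)

keyword_totals <- toy_keywords %>%
  group_by(keyword) %>%
  summarize(n_occurences = sum(count),  # total count over fragments
            n_fragments  = n()) %>%     # number of fragments listing the keyword
  arrange(desc(n_occurences))

keyword_totals
```

Here "meg" gets n_occurences = 21 from only 2 fragments, which is why the totals in the keyword table are much larger than the row counts in the entity table.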

Topics?

topics <- monkeylearn_classify(little_women_chapters$text,
                               classifier_id = "cl_5icAVzKR")
topics %>%
  group_by(label) %>%
  summarize(n_occurences = n()) %>%
  filter(n_occurences > 1) %>%
  arrange(desc(n_occurences)) %>%
  knitr::kable()
| label | n_occurences |
|:---------------------------|----:|
| Society | 48 |
| Special Occasions | 46 |
| Entertainment & Recreation | 3 |
| Jokes | 2 |

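Here, the number of occurrences is the number of rows of the classifier output that carry the label, that is, the number of times the topic was assigned across all text fragments.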

In summary, these three modules reminded me of the book and of the movie, but I doubt I could have talked about the book from these results alone without having read it.

Further work: I could actually use tidytext a lot more, for instance to count words or to do a sentiment analysis as explained here.
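A minimal sketch of what that further work could look like, on a made-up two-row data frame standing in for little_women_chapters. It assumes the stop_words data set and the Bing lexicon returned by get_sentiments("bing") that ship with tidytext. The sentences are invented, and a real analysis would need more care.

```r
library("dplyr")
library("tidytext")

# Made-up stand-in for little_women_chapters
toy_chapters <- tibble::tibble(
  chapter = 1:2,
  text = c("Jo was happy to write her story.",
           "Beth played the piano and Jo wrote another story.")
)

# Word counts: one lowercase word per row, common stop words removed
word_counts <- toy_chapters %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Sentiment: keep only words found in the Bing lexicon, count per chapter
chapter_sentiment <- toy_chapters %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter, sentiment)
```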