In this vignette, we shall use the gutenbergr package to download “Little Women” by Louisa May Alcott, and public MonkeyLearn modules to learn a bit about its contents without reading it.
Note that the MonkeyLearn modules we use here were not tested on books, so the results are not optimal.
library("monkeylearn")
library("gutenbergr")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
little_women <- gutenberg_download(c(514),
                                   meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://www.gutenberg.lib.md.us
We will now use the tidytext package to extract whole paragraphs.
We will then paste them together into one string and split it on the word “chapter” in order to get a reasonable number of text fragments. We hope to send each fragment as a single text to the MonkeyLearn API, which is possible as long as each string is smaller than 50 kB.
library("tidytext")
little_women <- little_women %>%
  unnest_tokens(paragraph, text, token = "paragraphs") %>%
  summarize(whole_text = paste(paragraph, collapse = " "))
chapters <- strsplit(little_women$whole_text, "[Cc]hapter")[[1]]
little_women_chapters <- tibble::tibble(
  chapter = 1:length(chapters),
  text = chapters
)
all(nchar(little_women_chapters$text, type = "bytes") < 50000)
## [1] TRUE
All chapters have the right size to be sent to the API. The API accepts at most 20 texts per call, but the monkeylearn functions split a vector of texts into batches automatically, so we can submit the whole vector little_women_chapters$text without further ado.
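As a quick back-of-the-envelope check, we can estimate how many API calls this will translate into for our chapters (this small calculation is just an illustration of the batching described above):
# Expected number of API calls, given at most 20 texts per call
ceiling(length(little_women_chapters$text) / 20)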
A first question we could ask ourselves about the book is who its main characters are, and where it takes place.
entities <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_isnnZRbS",
                                verbose = TRUE)
## Processing request number 1 out of 3
## Processing request number 2 out of 3
## Processing request number 3 out of 3
entities %>%
  group_by(entity, tag) %>%
  summarize(n_occurrences = n()) %>%
  arrange(desc(n_occurrences)) %>%
  filter(n_occurrences > 5) %>%
  knitr::kable()
| entity | tag | n_occurrences |
|---|---|---|
| amy | PERSON | 40 |
| laurie | PERSON | 32 |
| mr. laurence | PERSON | 14 |
| mr. brooke | PERSON | 13 |
| meg | PERSON | 9 |
| paris | LOCATION | 9 |
| american | LOCATION | 7 |
| washington | LOCATION | 7 |
| john | PERSON | 6 |
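This table mixes people and places. To answer the “where does it take place” part of the question more directly, one possible follow-up (a small sketch of my own, reusing the entity and tag columns from the results above) is to keep only the LOCATION entities:
# Locations only, counted and sorted from most to least frequent
entities %>%
  filter(tag == "LOCATION") %>%
  count(entity, sort = TRUE)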
keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurrences = sum(count)) %>%
  arrange(desc(n_occurrences)) %>%
  filter(n_occurrences > 10) %>%
  knitr::kable()
| keyword | n_occurrences |
|---|---|
| meg | 627 |
| amy | 468 |
| laurie | 460 |
| beth | 309 |
| john | 103 |
| mr. bhaer | 62 |
| aunt march | 54 |
| demi | 43 |
| mr. brooke | 43 |
| mother | 38 |
| mrs. march | 35 |
| boys | 33 |
| table | 31 |
| old gentleman | 23 |
| mr. laurence | 20 |
| mr. dashwood | 18 |
| letters | 17 |
| professor | 17 |
| heart | 16 |
| limes | 15 |
| story | 14 |
| flo | 13 |
| hand | 13 |
| mr. davis | 13 |
| thing | 12 |
| young ladies | 12 |
| gloves | 11 |
| snodgrass | 11 |
| uncle | 11 |
Interestingly, the keyword extraction is better at finding who the main characters are (yes, I have read the book).
In this table the number of occurrences is the total count for the keyword in the book.
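One way to compare the two extractors (a sketch of my own rather than part of the analysis above) is to look for frequent keywords that the entity extractor never tagged as a PERSON:
# Frequent keywords that never show up as PERSON entities
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurrences = sum(count)) %>%
  filter(n_occurrences > 10) %>%
  anti_join(filter(entities, tag == "PERSON"),
            by = c("keyword" = "entity"))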
topics <- monkeylearn_classify(little_women_chapters$text,
                               classifier_id = "cl_5icAVzKR")
topics %>%
  group_by(label) %>%
  summarize(n_occurrences = n()) %>%
  filter(n_occurrences > 1) %>%
  arrange(desc(n_occurrences)) %>%
  knitr::kable()
| label | n_occurrences |
|---|---|
| Society | 48 |
| Special Occasions | 46 |
| Entertainment & Recreation | 3 |
| Jokes | 2 |
Here, the number of occurrences is the number of times the topic was returned across all the chapters we sent to the classifier.
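If we wanted to be stricter, we could keep only confident classifications before counting; here is a possible sketch, assuming the results include a probability column as the other MonkeyLearn outputs do:
# Count labels again, keeping only reasonably confident classifications
# (the 0.5 threshold is arbitrary, for illustration only)
topics %>%
  filter(probability > 0.5) %>%
  group_by(label) %>%
  summarize(n_occurrences = n()) %>%
  arrange(desc(n_occurrences))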
In summary, using these three modules reminded me of the book and of the movie, but I am less sure I would have been able to talk about the book using only these results without having read it.
Further work: I could use tidytext a lot more, for instance to count words or to do a sentiment analysis, as explained here.
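As an illustration of that last idea, here is a minimal sketch of what such a sentiment analysis could look like, assuming the Bing lexicon is available through tidytext::get_sentiments():
# Count positive and negative words per chapter fragment using the Bing lexicon
little_women_chapters %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter, sentiment)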