In this vignette, we shall use the gutenbergr package to download “Little Women” by Louisa May Alcott, and public MonkeyLearn modules to learn a bit about its contents without reading it.
Note that the MonkeyLearn modules we use here were not tested on books, so the results are not optimal.
library("monkeylearn")
library("gutenbergr")
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
little_women <- gutenberg_download(c(514),
                                   meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://www.gutenberg.lib.md.us
We will now use the tidytext package to extract whole paragraphs.
We will then paste them together into one string and split it on the word “chapter” in order to get a reasonable number of text fragments. We hope to send each fragment as a single text to the MonkeyLearn API, which is possible as long as each string is smaller than 50 kB.
library("tidytext")
little_women <- little_women %>%
  unnest_tokens(paragraph, text, token = "paragraphs") %>%
  summarize(whole_text = paste(paragraph, collapse = " "))
chapters <- strsplit(little_women$whole_text, "[Cc]hapter")[[1]]
little_women_chapters <- tibble::tibble(
  chapter = 1:length(chapters),
  text = chapters
)
all(nchar(little_women_chapters$text, type = "bytes") < 50000)
## [1] TRUE
All chapters have the right size to be sent to the API. The API accepts at most 20 texts per call, but the monkeylearn functions split a vector of texts into batches automatically, so we can submit the whole vector little_women_chapters$text without further ado.
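As a quick back-of-the-envelope check, we can estimate how many API calls this will translate into for our chapters (this small calculation is just an illustration of the batching described above):
# Expected number of API calls, given at most 20 texts per call
ceiling(length(little_women_chapters$text) / 20)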
A first question we could ask ourselves about the book is who its main characters are, and where it takes place.
entities <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_isnnZRbS",
                                verbose = TRUE)
## Processing request number 1 out of 3
## Processing request number 2 out of 3
## Processing request number 3 out of 3
entities %>%
  group_by(entity, tag) %>%
  summarize(n_occurrences = n()) %>%
  arrange(desc(n_occurrences)) %>%
  filter(n_occurrences > 5) %>%
  knitr::kable()
| entity | tag | n_occurrences |
|---|---|---|
| amy | PERSON | 40 |
| laurie | PERSON | 32 |
| mr. laurence | PERSON | 14 |
| mr. brooke | PERSON | 13 |
| meg | PERSON | 9 |
| paris | LOCATION | 9 |
| american | LOCATION | 7 |
| washington | LOCATION | 7 |
| john | PERSON | 6 |
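This table mixes people and places. To answer the “where does it take place” part of the question more directly, one possible follow-up (a small sketch of my own, reusing the entity and tag columns from the results above) is to keep only the LOCATION entities:
# Locations only, counted and sorted from most to least frequent
entities %>%
  filter(tag == "LOCATION") %>%
  count(entity, sort = TRUE)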
keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurrences = sum(count)) %>%
  arrange(desc(n_occurrences)) %>%
  filter(n_occurrences > 10) %>%
  knitr::kable()
| keyword | n_occurrences |
|---|---|
| meg | 627 |
| amy | 468 |
| laurie | 460 |
| beth | 309 |
| john | 103 |
| mr. bhaer | 62 |
| aunt march | 54 |
| demi | 43 |
| mr. brooke | 43 |
| mother | 38 |
| mrs. march | 35 |
| boys | 33 |
| table | 31 |
| old gentleman | 23 |
| mr. laurence | 20 |
| mr. dashwood | 18 |
| letters | 17 |
| professor | 17 |
| heart | 16 |
| limes | 15 |
| story | 14 |
| flo | 13 |
| hand | 13 |
| mr. davis | 13 |
| thing | 12 |
| young ladies | 12 |
| gloves | 11 |
| snodgrass | 11 |
| uncle | 11 |
Interestingly, the keyword extraction is better at finding who the main characters are (yes, I have read the book).
In this table the number of occurrences is the total count for the keyword in the book.
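One way to compare the two extractors (a sketch of my own rather than part of the analysis above) is to look for frequent keywords that the entity extractor never tagged as a PERSON:
# Frequent keywords that never show up as PERSON entities
keywords %>%
  group_by(keyword) %>%
  summarize(n_occurrences = sum(count)) %>%
  filter(n_occurrences > 10) %>%
  anti_join(filter(entities, tag == "PERSON"),
            by = c("keyword" = "entity"))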
topics <- monkeylearn_classify(little_women_chapters$text,
                               classifier_id = "cl_5icAVzKR")
topics %>%
  group_by(label) %>%
  summarize(n_occurrences = n()) %>%
  filter(n_occurrences > 1) %>%
  arrange(desc(n_occurrences)) %>%
  knitr::kable()
| label | n_occurrences |
|---|---|
| Society | 48 |
| Special Occasions | 46 |
| Entertainment & Recreation | 3 |
| Jokes | 2 |
Here, the number of occurrences is the number of times the topic was returned across all the chapters we sent to the classifier.
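If we wanted to be stricter, we could keep only confident classifications before counting; here is a possible sketch, assuming the results include a probability column as the other MonkeyLearn outputs do:
# Count labels again, keeping only reasonably confident classifications
# (the 0.5 threshold is arbitrary, for illustration only)
topics %>%
  filter(probability > 0.5) %>%
  group_by(label) %>%
  summarize(n_occurrences = n()) %>%
  arrange(desc(n_occurrences))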
In summary, using these three modules reminded me of the book and of the movie, but I am less sure I would have been able to talk about the book using only these results without having read it.
Further work: I could use tidytext a lot more, for instance to count words or to do a sentiment analysis, as explained here.
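As an illustration of that last idea, here is a minimal sketch of what such a sentiment analysis could look like, assuming the Bing lexicon is available through tidytext::get_sentiments():
# Count positive and negative words per chapter fragment using the Bing lexicon
little_women_chapters %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter, sentiment)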