Text Mining in R had quite the boost in 2016. David Robinson’s fascinating analysis of Donald Trump’s real and ‘official’ tweets got a lot of publicity (something the president-elect was probably all too happy with), and his collaboration with Julia Silge resulted in one of the best books yet published using the bookdown package, Text Mining with R: A Tidy Approach.
Silge has also released a couple of R packages (a minimal example of the two working together is sketched below):

* tidytext - useful for tidying text for subsequent analyses
* janeaustenr - a dataset of Jane Austen’s novels
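As a minimal sketch of the two packages together (assuming both are installed), unnest_tokens() from tidytext splits the janeaustenr novels into one word per row:

library(dplyr)
library(tidytext)
library(janeaustenr)

# one row per word across the six Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text)

# most frequent words per novel
austen_words %>%
  count(book, word, sort = TRUE)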
I’m not completely sold on the value of textual analysis for this purpose, at least at its current stage of development, though I’m prepared to be convinced otherwise. To me it is the equivalent of perusing the list of ingredients on a packaged good in order to assess its taste. When I want to know whether to read a novel, I’m interested in themes, settings, characters, quality of writing and so on, which I doubt this can provide.
Nevertheless, it is now a lot easier (and more fun) to process novels - at least those in the public domain on Project Gutenberg - thanks again to David Robinson and his gutenbergr package.
An interesting comparison to Jane Austen is Charles Dickens. His books are more wide-ranging than Austen’s and have many memorable characters mixed in with social comment on Victorian England.
First, we load the libraries and see what titles are available.
#load libraries
library(tidyverse)
library(tidytext)
library(gutenbergr)
library(plotly)
library(stringr)
library(wordcloud2)
dickens <- gutenberg_works(author == "Dickens, Charles")
glimpse(dickens)
Observations: 74
Variables: 8
$ gutenberg_id <int> 46, 98, 564, 580, 588, 644, 650, 653, 675, 678, 699, 700, 730, 766, 786, 807, 80...
$ title <chr> "A Christmas Carol in Prose; Being a Ghost Story of Christmas", "A Tale of Two C...
$ author <chr> "Dickens, Charles", "Dickens, Charles", "Dickens, Charles", "Dickens, Charles", ...
$ gutenberg_author_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, ...
$ language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "e...
$ gutenberg_bookshelf <chr> "Christmas/Children's Literature", "Historical Fiction", "Mystery Fiction", "Bes...
$ rights <chr> "Public domain in the USA.", "Public domain in the USA.", "Public domain in the ...
$ has_text <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
(unique(dickens$gutenberg_bookshelf))
[1] "Christmas/Children's Literature" "Historical Fiction" "Mystery Fiction"
[4] "Best Books Ever Listings" NA "Christmas"
[7] "Children's Literature" "Children's History/United Kingdom" "Harvard Classics"
[10] "Detective Fiction" "Children's Picture Books"
So, extremely prolific and wide-ranging. I will probably want to limit any broader analysis to his novels (a possible filter is sketched below) but will start with one of his most highly regarded works, Great Expectations.
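A rough, hypothetical sketch of that narrowing step, filtering on the gutenberg_bookshelf values listed above (the exact shelves to keep would need checking against the full list):

# keep only the fiction-related bookshelves seen in the output above
fiction_shelves <- c("Historical Fiction", "Mystery Fiction",
                     "Best Books Ever Listings", "Detective Fiction")

dickens_novels <- dickens %>%
  filter(gutenberg_bookshelf %in% fiction_shelves)

dickens_novels %>%
  select(gutenberg_id, title)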
I probably read the book as a child, but I definitely remember a BBC adaptation and the excellent 1946 film version, which differs somewhat from the novel.
We can download its text, via the id, in barely a second, then follow (OK, plagiarise) the book’s code to get it into a ‘tidy’ format.
expectations <- gutenberg_download(1400)
glimpse(expectations)
Observations: 20,024
Variables: 2
$ gutenberg_id <int> 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 1400, 140...
$ text <chr> "GREAT EXPECTATIONS", "", "[1867 Edition]", "", "by Charles Dickens", "", "", "[Project...
tidy_expectations <- expectations %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  unnest_tokens(word, text) %>% # 186,000+ rows
  # remove the most common words (the, a, etc.)
  anti_join(stop_words) # c. 55,000 rows remain
tidy_expectations
We now have a tidy data frame with each row a single word by linenumber/chapter.
Interestingly, the word ‘expectations’ does not first appear until Chapter 18, when the lawyer Jaggers informs Joe Gargery and Pip that the latter ‘will come into a handsome property’.
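That is easy to verify from the tidy data frame; for example:

# chapter and line of the first occurrence of 'expectations'
tidy_expectations %>%
  filter(word == "expectations") %>%
  slice(1)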
We can now visualize the most common words in a couple of ways. Hover over the plots for exact numbers.
word_count <- tidy_expectations %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n))

word_count %>%
  head(10) %>%
  plot_ly(x = ~n, y = ~word) %>%
  layout(title = "Most common words (excluding stop-words) in Great Expectations",
         xaxis = list(title = "Total Occurrences"),
         yaxis = list(title = "")) %>%
  config(displayModeBar = F, showLink = F)
word_count %>%
  head(100) %>%
  wordcloud2()
As is often the case in novels, character names predominate, but it is of interest that Joe is so far in the lead. ‘expectations’ ranks in the low 200s and ‘great’ is a stop word.
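Both of those observations can be checked directly; a quick sketch:

# rank of 'expectations' in the sorted word counts
which(word_count$word == "expectations")

# 'great' is in the tidytext stop-word list
"great" %in% stop_words$word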
Let’s have a look at the occurrence of Joe throughout the story.
tidy_expectations %>%
  filter(word == "joe") %>%
  group_by(chapter) %>%
  count(word) %>%
  plot_ly(x = ~chapter, y = ~n) %>%
  add_bars(color = I("blue"), alpha = 0.5) %>%
  layout(title = "Occurrences of word 'Joe' by Chapter",
         yaxis = list(title = "Occurrences"),
         xaxis = list(title = "Chapter")) %>%
  config(displayModeBar = F, showLink = F)
As you may recall, Joe is Pip’s brother-in-law and surrogate father, a strong, positive influence on Pip as a boy. Chapter 27 is when Joe visits a mortified Pip in London, which brings out the worst in our ‘hero’, and Chapter 57 is when Joe comforts Pip in his illness, by which point Pip realizes how badly he has treated a true friend.
We can use the tools of text mining to approach the emotional content of the text programmatically. The tidytext package provides three sentiment lexicons for evaluating opinion or emotion in text. Here I will replicate some of the work in the book, with the occasional addition.
# let's look at how one of the lexicons classifies words
nrc <- get_sentiments("nrc")
unique(nrc$sentiment)
[1] "trust" "fear" "negative" "sadness" "anger" "surprise" "positive"
[8] "disgust" "joy" "anticipation"
get_sentiments("nrc") %>%
  filter(sentiment == "positive")
Good to see that ‘academic’ is positive! However, I will leave the positive and negative categories out at this stage.
Let’s look at the other emotions as a percentage of all words in each chapter.
# first, total words per chapter
words_chapter <- tidy_expectations %>%
  group_by(chapter) %>%
  count() %>%
  rename(total = n)

# sentiments to exclude
chuck <- c("negative", "positive")

tidy_expectations %>%
  inner_join(nrc) %>%
  filter(!sentiment %in% chuck) %>%
  group_by(sentiment, chapter) %>%
  count() %>%
  inner_join(words_chapter) %>%
  mutate(pc = round(100 * n / total, 1)) %>%
  filter(chapter != 0) %>%
  plot_ly(x = ~chapter, y = ~pc, color = ~sentiment) %>%
  add_bars() %>%
  layout(barmode = 'stack',
         title = "Great Expectations - % of each chapter with words of varying emotions",
         yaxis = list(title = "Percentage"))
Stacked bar charts are often not the best method of visualization, but just toggle the legend to remove/add emotions. For instance, the fear factor peaks in the chapter in which Pip has just attempted to rescue Miss Havisham from the fire and determines that Estella is Magwitch’s daughter.
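To pin down that chapter without eyeballing the stacked bars, the same aggregation can be reused; a quick sketch:

# chapter with the highest proportion of 'fear' words
tidy_expectations %>%
  inner_join(nrc) %>%
  filter(sentiment == "fear", chapter != 0) %>%
  count(chapter) %>%
  inner_join(words_chapter) %>%
  mutate(pc = round(100 * n / total, 1)) %>%
  arrange(desc(pc)) %>%
  head(1)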
Another use of sentiment analysis is to examine the flow of sentiment throughout the novel by breaking the word count into equal chunks, this time using the Bing lexicon, which simply splits words into a binary positive/negative. The Bing lexicon has more negative words than positive ones, so some wariness should be applied to a single novel; a trajectory over time, or a comparison with other novels, would be more robust.
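The imbalance itself is easy to confirm:

# the Bing lexicon skews negative
get_sentiments("bing") %>%
  count(sentiment)

With that caveat noted, here is the trajectory: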
tidy_expectations %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  plot_ly(x = ~index, y = ~sentiment) %>%
  add_bars()
Joining, by = "word"
Even with the caveat above, this is a bit of a downer, especially given that apparently (as referenced in Wikipedia) G.K. Chesterton admired the novel’s optimism.
Here is the tidytext book’s outcome for the Jane Austen novels.
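For anyone wanting to reproduce that comparison, a sketch along the lines of the book’s code (using janeaustenr and ggplot2, with the book’s chunk size of 80 lines):

library(janeaustenr)

austen_sentiment <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# one panel per novel, mirroring the book's figure
ggplot(austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")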
I guess Dickens’s novel is a little grittier than life in upper-middle-class country homes.
Julia (I hope I am not being over-familiar) has extended her analysis in a blog post on readability. If you want to read more about the technique (and you should), head over there, but suffice to say it starts from the premise that measures such as sentence length are useful.
Let’s have a look at sentences first. The notebook style means all outputs are easily selectable.
# easiest just to download again
ge <- gutenberg_download(c(1400), meta_fields = "title")

tidy_ge <- ge %>%
  mutate(text = iconv(text, to = 'latin1')) %>%
  nest(-title) %>%
  mutate(tidied = map(data, unnest_tokens, 'sentence', 'text', token = 'sentences'))
tidy_ge
# we are only interested in the tidied column, which should now hold sentences. Let's check
tidy_ge <- tidy_ge %>%
  unnest(tidied)

tidy_ge %>%
  sample_n(5) %>%
  select(sentence)
# Mine look good
# What does the distribution look like?
sentences_ge <- tidy_ge %>%
  unnest_tokens(word, sentence, drop = FALSE) %>% # 5893748 rows seems too many
  unique() %>%                                    # 160179 seems more likely
  group_by(sentence) %>%
  summarize(length = n())
summary(sentences_ge)
sentence length
Length:10033 Min. : 1.00
Class :character 1st Qu.: 6.00
Mode :character Median : 13.00
Mean : 15.97
3rd Qu.: 23.00
Max. :105.00
sentences_ge %>%
  plot_ly(x = ~length)

# and the longest
sentences_ge %>%
  arrange(desc(length)) %>%
  head(1) %>%
  .$sentence
[1] "again among the tiers of shipping, in and out, avoiding rusty chain-cables frayed hempen hawsers and bobbing buoys, sinking for the moment floating broken baskets, scattering floating chips of wood and shaving, cleaving floating scum of coal, in and out, under the figure-head of the john of sunderland making a speech to the winds (as is done by many johns), and the betsy of yarmouth with a firm formality of bosom and her knobby eyes starting two inches out of her head; in and out, hammers going in ship-builders' yards, saws going at timber, clashing engines going at things unknown, pumps going in leaky ships, capstans going, ships going out to sea, and unintelligible sea-creatures roaring curses over the bulwarks at respondent lightermen, in and out,--out at last upon the clearer river, where the ships' boys might take their fenders in, no longer fishing in troubled waters with them over the side, and where the festooned sails might fly out to the wind."
The longest sentence is a reference to the River Thames, when they are trying to effect Magwitch’s escape.