Natural Language Processing (NLP) is one of the fastest-growing fields today, and its applications across various domains are innumerable. Virtual assistants such as Google Home and Amazon Alexa, now household names, are applications of NLP. Just as these devices become more skillful, or “human-like”, each day, the field of NLP is also growing rapidly, and the ability of models to infer meaning from language could propel our existing forms of communication into a new age. As such, studying this field has become paramount to establishing a promising career as an academic or as a data science professional.
This paper applies some of the important techniques in NLP, such as sentiment analysis and topic modeling, to a work of fictional literature in order to study its main themes. Research in this area has flourished, especially with the growth of computational power and the understanding of how to apply mathematics beyond summary statistics to language.
Ashok, Feng, and Choi (2013) were able to predict the commercial success of a novel based on its writing style. Egbert (2012) used multi-dimensional analysis to compare styles of nineteenth-century fiction writing among authors, as well as style variations among novels by individual authors. Jautze (2014) examined the extent to which the distribution of the most frequent words in two novelistic genres, chick lit and literature, gives insight into genre styles. Solorio, Montes-y-Gomez, Maharjan, Ovalle, and Gonzalez (2017) used feature engineering and neural network models to predict the likability of books from the Gutenberg corpus. Anvari and Amirkhani (2018) created a neural network based embedding approach called book2vec for creating book representations using Google’s word2vec model.
The fictional work chosen for this paper was ‘Thirty Strange Stories’ by H.G. Wells, a famous fiction writer of the late nineteenth century, well known for books such as The War of the Worlds, The Invisible Man, and The Time Machine. As the name suggests, ‘Thirty Strange Stories’ is a collection of 30 stories, each with the overarching theme of mystery.
Below is a list of the 30 stories covered in this book:
1. The Strange Orchid
2. Æpyornis Island
3. The Plattner Story
4. The Argonauts Of The Air
5. The Story Of The Late Mr. Elvesham
6. The Stolen Bacillus
7. The Red Room
8. A Moth (Genus Unknown)
9. In The Abyss
10. Under The Knife
11. The Reconciliation
12. A Slip Under The Microscope
13. In The Avu Observatory
14. The Triumphs Of A Taxidermist
15. A Deal In Ostriches
16. The Rajah’s Treasure
17. The Story Of Davidson’s Eyes
18. The Cone
19. The Purple Pileus
20. A Catastrophe
21. Le Mari Terrible
22. The Apple
23. The Sad Story Of A Dramatic Critic
24. The Jilting Of Jane
25. The Lost Inheritance
26. Pollock And The Porroh Man
27. The Sea Raiders
28. In The Modern Vein
29. The Lord Of The Dynamos
30. The Treasure In The Forest
Since this literary collection comprises thirty different mystery stories, it is tedious to try to understand each one separately. However, it might be interesting to see how these stories cluster together to form sub-themes within the broader mystery theme. Another interesting question is whether a mystery story always represents negative sentiment. Lastly, the study will try to understand these stories better through some exploratory data analysis.
Text data such as this requires significant cleaning, so data processing techniques will be used to transform the raw text into tidy datasets.
After data preparation, exploratory data analysis will be performed on the combined data from the thirty different stories. This will include looking at the most frequent words and collocates, the most strongly correlated pairs of words, and a network representation of the strongest pairs used in the collection of stories. We will then look at word cloud representations of each story to understand them a little better.
This will be followed by a sentiment analysis of each story to test the hypothesis that a mystery story always represents negative sentiment.
Finally, topic models will be created to understand the sub-themes within the mystery theme in the context of these stories.
This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is part of the Gutenberg corpus, and its text was sourced for analysis using the ‘gutenbergr’ package in R. The book includes thirty chapters, representing thirty different stories. The raw text also includes a copyright page and a table of contents at the beginning and some transcriber’s notes at the end.
For data preparation, the first few rows, the last few rows, and empty rows from the dataset were removed to focus on just the main text from each of the chapters. Additional variables for chapter number and line number were then created to prepare the main dataset for analysis. For secondary analysis, the words were unnested and another dataset was created.
library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)
library(Rling)
library(modeest)
library(scales)
library(widyr)
library(tokenizers)
# download the entire book
data <- gutenberg_download(59774)
# Remove the first and last unwanted rows
data <- data[96:12231,]
# check for UPPER case
data$check <- data$text == toupper(data$text)
# filter out empty rows
data <- data %>% filter(text != "" )
# create row number
data <- data %>% mutate(row_num = row_number())
# remove incorrectly detected chapter headings
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]
# Create Chapter
data$chapter <- cumsum(data$check)
# Create a separate dataset for chapter headings
chapters_headings <- filter(data, check == TRUE) %>% rename(chapter_name = text) %>%
select(chapter, chapter_name)
# Clean up and join with chapter headings
data <- data %>% mutate(title = "Thirty Strange Stories") %>% mutate(row_num = row_number()) %>%
select(title, text, row_num, chapter) %>% left_join(chapters_headings)
# Remove leading and trailing white spaces
data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE)
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)
# Words for initial analysis
words <- data %>% unnest_tokens(word, text)
# Remove leading and trailing white spaces from chapters_heading for later use
chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
Frequently occurring words
The most common words used in the entire book are listed below. It’s not surprising that ‘the’ has the highest count, since it is one of the most commonly used words in English. What stands out in the table below is that character names are not at the top. This makes sense, as the book contains different stories with different characters and plots.
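The chunk that produced this frequency table is not shown above; a minimal sketch of how such counts could be obtained from the unnested ‘words’ data frame (an assumption, not the author’s original chunk) is:
# Sketch: most frequent words across the whole book
words %>%
count(word, sort = TRUE) %>%
head(10)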
Collocates
Below are the most commonly occurring collocates. Interestingly, the top collocates have the pronoun ‘it’ rather than ‘he’ or ‘she’.
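The code behind the collocate table is likewise not included; one possible way to count two-word collocates (a sketch based on simple bigram counts over the prepared ‘data’ frame, which may differ from the author’s approach) is:
# Sketch: most frequent bigrams (collocates) in the full text
data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE) %>%
head(10)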
Strongest pairs
Below are the correlations of pairs of words in the book “Thirty Strange Stories”. Three pairs of words that have an interesting correlation are ‘before - from’, ‘before - they’ and ‘from - they’.
keyword_cors = words %>%
group_by(word) %>%
filter(n() >= 50) %>%
pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)
keyword_cors
The graph below shows a network plot consisting of the strongest pairs with correlations greater than 0.8.
library(ggplot2)
library(igraph)
library(ggraph)
keyword_cors %>%
filter(correlation > .8) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Aubrey seems to have the highest correlation with Vair, which is her last name in the chapter ‘In the Modern Vein’. She is the main protagonist, with the story-line revolving around her life, and it is interesting that her last name is always used with her first name.
The rest of the correlations are between general terms such as ‘thought’ and ‘came’, ‘thing’ and ‘me’, and ‘still’ and ‘face’. Since the word-pair analysis was done on the entire collection, which is composed of different stories, no specific pattern was identified.
This section provides a word cloud of the most frequently used words in each short story in the book.
The Strange Orchid
words %>% filter(chapter == 1) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Æpyornis Island
words %>% filter(chapter == 2) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Plattner Story
words %>% filter(chapter == 3) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Argonauts Of The Air
words %>% filter(chapter == 4) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of The Late Mr. Elvesham
words %>% filter(chapter == 5) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Stolen Bacillus
words %>% filter(chapter == 6) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Red Room
words %>% filter(chapter == 7) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Moth (Genus Unknown)
words %>% filter(chapter == 8) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Abyss
words %>% filter(chapter == 9) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Under The Knife
words %>% filter(chapter == 10) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Reconciliation
words %>% filter(chapter == 11) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Slip Under The Microscope
words %>% filter(chapter == 12) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Avu Observatory
words %>% filter(chapter == 13) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Triumphs Of A Taxidermist
words %>% filter(chapter == 14) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Deal In Ostriches
words %>% filter(chapter == 15) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Rajah’s Treasure
words %>% filter(chapter == 16) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of Davidson’s Eyes
words %>% filter(chapter == 17) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Cone
words %>% filter(chapter == 18) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Purple Pileus
words %>% filter(chapter == 19) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Catastrophe
words %>% filter(chapter == 20) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Le Mari Terrible
words %>% filter(chapter == 21) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Apple
words %>% filter(chapter == 22) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sad Story Of A Dramatic Critic
words %>% filter(chapter == 23) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Jilting Of Jane
words %>% filter(chapter == 24) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lost Inheritance
words %>% filter(chapter == 25) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Pollock And The Porroh Man
words %>% filter(chapter == 26) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sea Raiders
words %>% filter(chapter == 27) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Modern Vein
words %>% filter(chapter == 28) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lord Of The Dynamos
words %>% filter(chapter == 29) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Treasure In The Forest
words %>% filter(chapter == 30) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The most commonly used words from each story are presented below:
Looking at each of these word clouds and the most commonly used words in each story, it was seen that each chapter has a completely different set of words, most of which are character names. Some of the most commonly occurring words also come from the title of the story. Each story seems to have some profession or job associated with it, such as painter, hooker, or commissioner.
H.G. Wells is known for his science fiction works, which are filled with horror and dark mysteries. As such, the negative sentiment in these stories is expected to be high. However, the study intends to test the hypothesis that a mystery story always represents negative sentiment.
This section looks at the sentiments represented by each of the 30 stories based on the positive and negative scores of every 20 lines from each story, using the ‘bing’ sentiment lexicon.
sentiment_books <- words %>%
inner_join(get_sentiments("bing")) %>%
count(chapter_name, index = row_num %/% 20, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,])
sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,])
sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,])
sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,])
sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,])
ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
A vast majority of the stories overwhelmingly represent negative sentiment throughout, as expected. However, not all stories do. In fact, stories such as ‘The Triumphs of a Taxidermist’ and ‘Le Mari Terrible’ mostly represent positive sentiment. This disproves the hypothesis that a mystery story always represents negative sentiment.
The following observations can be made with respect to each story.
In this section, the 30 mystery stories were clustered into five different themes to understand some of the sub-themes within the mystery theme. To this end, various topic models, namely LDA Fit, LDA Fixed, LDA Gibbs, and CTM, were run and their alpha and entropy values were compared. Based on these values, the LDA Fit model was picked for further analysis, and the gamma values of the short stories and the beta values of the terms were explored visually.
A corpus was first created using all the short stories and the stories were considered as documents.
topics_data <- data %>% select(chapter_name, text, chapter) %>%
unite(document, chapter_name, chapter)
by_chapter = topics_data %>%
group_by(document) %>%
summarise(text=paste(text,collapse=' '))
import_corpus = Corpus(VectorSource(by_chapter$text))
The text was then cleaned up to remove numbers, punctuation, stop words, and words of length less than 4, in order to create a document-term matrix.
import_mat =
DocumentTermMatrix(import_corpus,
control = list(stemming = FALSE,
stopwords = TRUE, #remove stop words
minWordLength = 4, #cut out small words
removeNumbers = TRUE, #take out the numbers
removePunctuation = TRUE)) #take out the punctuation
The document-term matrix was then weighted to control for the sparsity of the matrix. This was done because not all words occur in every document and some words are very frequent. It is necessary to control for both ends of the spectrum, that is, the words with zero frequency as well as the very frequent words.
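Reading the chunk below (a restatement of the code, not text from the original), the weight of term $j$ is its mean relative frequency over the documents that contain it, scaled by an inverse-document-frequency factor:

$$ w_j = \left( \frac{1}{|D_j|} \sum_{d \in D_j} \frac{n_{d,j}}{n_d} \right) \log_2 \frac{N}{|D_j|} $$

where $D_j$ is the set of documents containing term $j$, $n_{d,j}$ is the count of term $j$ in document $d$, $n_d$ is the length of document $d$, and $N$ is the number of documents. Terms with $w_j < 0.015$ are then dropped, as are documents left with no remaining terms.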
#weight the space
import_weight = tapply(import_mat$v/row_sums(import_mat)[import_mat$i],
import_mat$j,
mean) *
log2(nDocs(import_mat)/col_sums(import_mat > 0))
#ignore the very frequent and 0 terms
import_mat = import_mat[ , import_weight >= 0.015]
import_mat = import_mat[row_sums(import_mat) > 0, ]
The following models were run: LDA Fit (VEM with alpha estimated), LDA Fixed (VEM with alpha held fixed), LDA Gibbs (Gibbs sampling), and CTM.
The number of topics for the analysis was set to 5 to see if these mystery stories can be clustered into five sub-themes.
#set the number of topics
k = 5
#set a random number for seed
SEED = 12345
LDA_fit = LDA(import_mat, k = k,
control = list(seed = SEED))
LDA_fixed = LDA(import_mat, k = k,
control = list(estimate.alpha = FALSE, seed = SEED))
LDA_gibbs = LDA(import_mat, k = k, method = "Gibbs",
control = list(seed = SEED, burnin = 1000,
thin = 100, iter = 1000))
CTM_fit = CTM(import_mat, k = k,
control = list(seed = SEED,
var = list(tol = 10^-4),
em = list(tol = 10^-3)))
Alpha Values
Alpha is the Dirichlet parameter that controls how topics are spread across documents; in effect it measures the predominance of topics. Low alpha values indicate that only a few topics are predominant per story, and high values indicate that more topics are predominant per story.
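The chunk that printed the three values below is not included; a minimal sketch, assuming the fitted objects from the previous chunk, is to read each LDA model’s alpha slot (CTM has no alpha):
# Sketch (not the original chunk): alpha values of the three LDA models
LDA_fit@alpha
LDA_fixed@alpha
LDA_gibbs@alpha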
## [1] 0.016422
## [1] 10
## [1] 10
The LDA Fit model has a very low alpha value, indicating that a single topic is predominant within each story and there is not much spread. The higher alpha values for the LDA Fixed and LDA Gibbs models indicate a higher spread across topics.
Entropy Values
Entropy is a measure of randomness. Low entropy values indicate low randomness, that is, fewer topics or more coherence within a document. High entropy values indicate high randomness, that is, the topics are spread all over the place.
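The chunk below computes, for each model, the average over stories of the Shannon entropy of the per-story topic distribution gamma (a restatement of the code that follows, not text from the original):

$$ H(d) = -\sum_{k=1}^{K} \gamma_{d,k} \log \gamma_{d,k} $$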
sapply(list(LDA_fit, LDA_fixed, LDA_gibbs, CTM_fit),
function (x)
mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## [1] 0.04223961 1.14582955 1.15065869 0.13283934
The LDA Fit model and CTM have low entropy values, indicating low randomness, or coherence, within each story. The LDA Fixed and LDA Gibbs models have very high entropy values, indicating that the topics are all over the place, that is, very little coherence within the stories.
Based on the alpha values and the entropy values, the LDA fit model was picked for further analysis.
In this section, the LDA Fit Model, which has the lowest entropy, is explored in detail.
First, the most frequent terms within each of the topics were examined. Most of these are names of characters or their professions, and hence it is difficult to interpret the themes of the topics. However, it seems that topic 1 is related to mystery in an indoor setting or in a boat.
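The chunk that produced the table below is not shown; one way to list the ten most probable terms per topic with the topicmodels package (a sketch, not necessarily the author’s exact call) is:
# Sketch: top 10 terms per topic from the LDA Fit model
terms(LDA_fit, 10)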
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "woodhouse" "plattner" "hapley" "aubrey"
## [2,] "monson" "golam" "pawkins" "vair"
## [3,] "boat" "sphere" "evans" "horrocks"
## [4,] "davidson" "deputycommissioner" "findlay" "raut"
## [5,] "fison" "azim" "hinchcliff" "azumazi"
## [6,] "canoe" "winslow" "temple" "jane"
## [7,] "telescope" "rajah" "hooker" "holroyd"
## [8,] "observatory" "elstead" "moth" "dynamo"
## [9,] "candles" "plattner’s" "findlay’s" "william"
## [10,] "housekeeper" "samud" "elvesham" "diamond"
## Topic 5
## [1,] "pollock"
## [2,] "wedderburn"
## [3,] "coombes"
## [4,] "porroh"
## [5,] "waterhouse"
## [6,] "jennie"
## [7,] "perera"
## [8,] "bacteriologist"
## [9,] "haysman"
## [10,] "clarence"
The stories that represent these topics the most and the least were then looked at.
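The generating chunk is again not included; the ranking matrix below could be reproduced with the topicmodels topics() helper, which returns the k most likely topics for each document (a sketch, assuming the fitted LDA_fit object):
# Sketch: rank all 5 topics for each of the 30 stories, most likely first
topics(LDA_fit, 5)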
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [1,] 2 4 3 5 1 2 1 4 1 5 3 1 4 4 4 1 2 5 2 3 1 1 1 5 1 3
## [2,] 1 1 1 1 5 1 3 1 2 3 1 2 1 1 1 5 1 4 1 1 2 2 2 2 2 1
## [3,] 3 3 2 4 2 3 2 2 3 1 2 3 2 2 2 2 4 1 3 2 3 3 3 4 3 2
## [4,] 4 5 4 2 4 4 4 3 4 2 4 4 3 3 3 3 5 2 5 4 4 4 4 1 4 4
## [5,] 5 2 5 3 3 5 5 5 5 4 5 5 5 5 5 4 3 3 4 5 5 5 5 3 5 5
## 27 28 29 30
## [1,] 1 3 5 2
## [2,] 5 1 1 1
## [3,] 2 2 2 3
## [4,] 3 4 3 4
## [5,] 4 5 4 5
Each of these five topics is represented the most by the following stories:
Based on the stories represented by topic 1, it was clear that topic 1 is related to mystery in an indoor setting or in a boat. The boat theme is probably coming from The Sea Raiders.
The beta values represent the weight of each word with respect to each topic. Let us plot the most frequent terms for each topic and visualize their beta values.
#use tidyverse to clean up the fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")
#create top terms
top_terms = LDA_fit_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
#clean up ggplot2 defaults
cleanup = theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
legend.key = element_rect(fill = "white"),
text = element_text(size = 10))
#make the plot
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
cleanup +
coord_flip()
The terms with the highest beta values (> 0.1) and the topics they represent are listed below:
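The list referred to here is not included in the text; a sketch of how the high-beta terms could be pulled from the tidied model (using the LDA_fit_topics data frame created above) is:
# Sketch: terms with beta greater than 0.1 and the topics they belong to
LDA_fit_topics %>%
filter(beta > 0.1) %>%
arrange(topic, -beta)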
All these seem to be names of characters from each of these stories.
The gamma values represent the probability of each topic within each story. The gamma matrix from the LDA Fit Model was taken and their gamma values were visualized.
LDA_gamma = tidy(LDA_fit, matrix = "gamma") %>%
left_join(chapters_headings, by = c("document" = "chapter"))
LDA_gamma %>%
ggplot(aes(factor(topic), gamma)) +
geom_point() +
geom_text(aes(label=ifelse(gamma < 0.9 & gamma > 0.1, as.character(LDA_gamma$chapter_name),''),
hjust=0,
vjust=0)) +
cleanup
Many points with a gamma of 1 and many points with a gamma of 0 are seen. This is in tune with the low entropy score for this model: either a topic is highly probable for a story or it is highly improbable.
The only exception was the story ‘The Sea Raiders’, which seemed to have a 75% probability of representing Topic 1 and a 25% probability of representing Topic 5.
This was in tune with the earlier interpretation that topic 1 is related to mystery in an indoor setting or in a boat. The boat setting probably came from ‘The Sea Raiders’, which struggled to be completely represented by topic 1, a topic mostly related to an indoor setting.
Natural Language Processing is a rapidly evolving field, and its study is essential for anyone involved in data analysis. Motivated by prior research applying NLP to fictional literature, this study explored NLP themes such as sentiment analysis and topic modeling using text data from ‘Thirty Strange Stories’ by H.G. Wells.
Initial data exploration revealed some of the most commonly used words, collocates, and the relationships between pairs of words used in these stories. Sentiment analysis helped disprove the hypothesis that a mystery story always represents negative sentiment. Topic models were then built to explore the major sub-themes within these mystery stories. Although mystery in an indoor setting was interpreted as one of the main sub-themes, other topics couldn’t be explored further due to the high prevalence of names and designations of characters.
Further study of this fictional work could involve retesting the hypothesis that a mystery story always represents negative sentiment using other sentiment lexicons, either individually or combined in a weighted manner. Distinctive collexeme analysis could also be performed to help understand how interesting words in the book are used in positive and negative contexts. Using packages to remove non-dictionary words, the names of characters could be removed and the topic models rerun; this would help in better interpretation of the sub-themes within these mystery stories. Finally, additional analysis could include using the LIWC2015 Text Processing Module to extract the psychometric properties of the language used in this fictional work and to build models such as MDS, PCA, and EFA for deriving more insights about the style of writing.
Anvari, S., & Amirkhani, H. (2018). Book2Vec: Representing Books in Vector Space Without Using the Contents. 2018 8th International Conference on Computer and Knowledge Engineering (ICCKE), 176-182.
Ashok, V.G., Feng, S., & Choi, Y. (2013). Success with Style: Using Writing Style to Predict the Success of Novels. EMNLP.
Egbert, J. (2012). Style in nineteenth century fiction: A multi-dimensional analysis. Scientific Study of Literature, 2(2). doi:10.1075/ssol.2.2.01egb
Jautze, K.J. (2014). Measuring the style of chick lit and literature. DH.
Solorio, T., Montes-y-Gomez, M., Maharjan, S., Ovalle, J.E., & Gonzalez, F.A. (2017). A Multi-task Approach to Predict Likability of Books. EACL.