Introduction

Natural Language Processing using computational means is one of the fastest growing fields today and its applications across various domains are innumerable. Virtual assistants such as Google Home and Amazon Alexa, which have become household names, are applications of NLP. Just as these devices are getting more skillful or “human-like” each day, the field of NLP is also growing rapidly and the ability of models to infer meaning from language can propel our existing forms of communication to a new age. As such, the study of this field has become paramount to establish a promising career as an academic or as a data science professional.

This paper applies some of the important themes surrounding NLP, such as sentiment analysis and topic modeling, to a work of fictional literature in order to study its main themes. Research in this area has flourished, especially with the growth of computational power and a better understanding of how to apply mathematics beyond summary statistics to language.

Ashok, Feng, and Choi (2013) were able to predict the commercial success of a novel based on its writing style. Egbert (2012) used multi-dimensional analysis to compare styles of 19th century fiction writing among authors, as well as the style variations among the novels of individual authors. Jautze (2014) examined the extent to which the distribution of the most frequent words in two novelistic genres, chick lit and literature, gave insights into genre styles. Solorio, Montes-y-Gomez, Maharjan, Ovalle, and Gonzalez (2017) used feature engineering and neural network models to predict the likability of books from the Gutenberg corpus. Anvari and Amirkhani (2018) created a neural network based embedding approach called book2vec for creating book representations using Google’s word2vec model.

The fictional work chosen for this paper was ‘Thirty Strange Stories’, written by H.G. Wells. H.G. Wells is a famous fiction writer of the late nineteenth century, well known for books such as The War of the Worlds, The Invisible Man and The Time Machine. As the name suggests, ‘Thirty Strange Stories’ is a collection of 30 stories, each with the overarching theme of mystery.

Below is a list of the 30 stories covered in this book:

  • The Strange Orchid
  • Æpyornis Island
  • The Plattner Story
  • The Argonauts Of The Air
  • The Story Of The Late Mr. Elvesham
  • The Stolen Bacillus
  • The Red Room
  • A Moth (Genus Unknown)
  • In The Abyss
  • Under The Knife
  • The Reconciliation
  • A Slip Under The Microscope
  • In The Avu Observatory
  • The Triumphs Of A Taxidermist
  • A Deal In Ostriches
  • The Rajah’s Treasure
  • The Story Of Davidson’s Eyes
  • The Cone
  • The Purple Pileus
  • A Catastrophe
  • Le Mari Terrible
  • The Apple
  • The Sad Story Of A Dramatic Critic
  • The Jilting Of Jane
  • The Lost Inheritance
  • Pollock And The Porroh Man
  • The Sea Raiders
  • In The Modern Vein
  • The Lord Of The Dynamos
  • The Treasure In The Forest

Problem Statement

Since this literary collection represents thirty different mystery stories, it is tedious to try to understand each one of them separately. However, it might be interesting to see how these stories cluster together to form sub-themes within the broader mystery theme. Another interesting hypothesis is to test whether a mystery story always represents negative sentiment. Lastly, the study will try to understand these stories better through some exploratory data analysis.

Statistical Analysis Plan

Text data such as this would require significant cleaning and as such data processing techniques will be used to transform the raw data into tidy datasets.

After data preparation, exploratory data analysis will be performed on the combined data from the thirty different stories. This will include looking at the most frequent words and collocates, the strongest pairs in terms of their correlations and the spatial representation of the strongest pairs of words used in the collection of stories. We will then look at the word cloud representations of each story to understand them a little better.

This will be followed by a sentiment analysis of each story to test the hypothesis that a mystery story always represents negative sentiment.

Finally, topic models will be created to understand the sub-themes within the mystery theme in the context of these stories.

Method and Statistical Analysis

Data

This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is a part of the Gutenberg corpus and its text was sourced for analysis using the ‘gutenbergr’ package in R. The book includes thirty chapters, representing thirty different stories. The raw text also includes a copyright page and a table of contents at the beginning and some transcriber’s notes at the end.

Data Preparation

For data preparation, the first few rows, the last few rows, and empty rows from the dataset were removed to focus on just the main text from each of the chapters. Additional variables for chapter number and line number were then created to prepare the main dataset for analysis. For secondary analysis, the words were unnested and another dataset was created.
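The preparation steps above can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study; the function names, the `n_head` and `n_tail` row counts, and the idea of detecting a new chapter by matching a known set of story titles are all assumptions made for the example.

```python
import re

def prepare_lines(raw_lines, n_head, n_tail, titles):
    """Drop front/back matter and empty rows, then tag each row with
    a chapter number and a running line number.

    n_head / n_tail (hypothetical): rows to discard at the start and end.
    titles (hypothetical): upper-cased story titles that mark a new chapter.
    """
    body = raw_lines[n_head:len(raw_lines) - n_tail]
    rows, chapter, line_no = [], 0, 0
    for text in body:
        text = text.strip()
        if not text:                 # skip empty rows
            continue
        if text.upper() in titles:   # a title row starts a new chapter
            chapter += 1
        line_no += 1
        rows.append({"chapter": chapter, "line": line_no, "text": text})
    return rows

def unnest_words(rows):
    """One row per word, keeping the chapter and line it came from."""
    pattern = r"[^\W\d_]+(?:['’][^\W\d_]+)*"   # letters, optional inner apostrophe
    return [
        {"chapter": r["chapter"], "line": r["line"], "word": w}
        for r in rows
        for w in re.findall(pattern, r["text"].lower())
    ]

# Toy input standing in for the downloaded text.
raw = [
    "Copyright page", "",
    "THE STRANGE ORCHID", "The buying of orchids.", "",
    "THE RED ROOM", "I can assure you.",
    "Transcriber's note",
]
titles = {"THE STRANGE ORCHID", "THE RED ROOM"}
rows = prepare_lines(raw, n_head=2, n_tail=1, titles=titles)
words = unnest_words(rows)
```

The first function yields the line-level dataset and the second yields the word-level dataset used for the secondary analysis.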

Exploratory data analysis

Word Usage

Frequently occurring words

The most common words used in the entire book are listed below. It’s not surprising to see that ‘the’ has the highest count since it is one of the most commonly used words in English. What stands out in the below table is that character names are not on top. This makes sense as the book has different stories with different characters and story plots.
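A word-frequency count of this kind can be sketched in a few lines. The snippet below is an illustrative Python version on a made-up sentence, not the study's R code; `word_counts` and `sample` are hypothetical names.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased word tokens (letters only) in a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

# Made-up sentence; in the study the input would be the whole book.
sample = "The candle went out and the door of the red room closed."
counts = word_counts(sample)
top = counts.most_common(1)   # the single most frequent word with its count
```

Because no stop words are removed at this stage, function words such as ‘the’ naturally dominate the top of the list.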

Collocates

Below are the most commonly occurring collocates. Interestingly, the top collocates have the pronoun ‘it’ rather than ‘he’ or ‘she’.
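Collocates can be approximated by counting adjacent word pairs (bigrams). The sketch below assumes that definition, which the text does not state explicitly, and uses a made-up sentence; `bigram_counts` is a hypothetical helper.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

# Made-up sentence used only to exercise the function.
sample_text = "it was dark and it was cold and it seemed quiet"
pairs = bigram_counts(sample_text)
it_was = pairs[("it", "was")]
```

This simple version ignores sentence boundaries, so a pair can span the end of one sentence and the start of the next.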

Strongest pairs

Below are the correlations of pairs of words in the book “Thirty Strange Stories”. Three pairs of words that seem to have an interesting correlation are ‘before - from’, ‘before - they’ and ‘from - they’.

The graph below shows a network plot consisting of the strongest pairs, those with correlations greater than 0.8.

Aubrey seems to have the highest correlation with Vair, which is his last name in the story ‘In the Modern Vein’. He is the main protagonist, with the storyline revolving around his life. Still, it is interesting that his last name is always used with his first name.

The rest of the correlations are between general terms such as ‘thought’ and ‘came’, ‘thing’ and ‘me’, and ‘still’ and ‘face’. As the grammar and word pair analysis was done on the entire collection composed of different stories, no specific pattern was identified.
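One common way to compute such word-pair correlations is the phi coefficient on word presence within sections of text. The sketch below assumes that measure (the text does not say which one was used) and runs it on a toy set of sections; `phi`, `strong_pairs`, and the example sections are hypothetical.

```python
import math
from itertools import combinations

def phi(sections, a, b):
    """Phi coefficient of the presence of words a and b across sections."""
    n11 = sum(1 for s in sections if a in s and b in s)
    n10 = sum(1 for s in sections if a in s and b not in s)
    n01 = sum(1 for s in sections if a not in s and b in s)
    n00 = sum(1 for s in sections if a not in s and b not in s)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def strong_pairs(sections, threshold=0.8):
    """All word pairs whose phi coefficient exceeds the threshold."""
    vocab = sorted(set().union(*sections))
    result = []
    for a, b in combinations(vocab, 2):
        r = phi(sections, a, b)
        if r > threshold:
            result.append((a, b, r))
    return result

# Toy sections, each represented as the set of words it contains.
sections = [
    {"aubrey", "vair", "love"},
    {"aubrey", "vair"},
    {"thought", "came"},
    {"love", "came"},
]
strong = strong_pairs(sections)
```

Two words that always appear in the same sections, like a first name and a surname, reach a coefficient of 1, which is why name pairs tend to dominate a plot thresholded at 0.8.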

Word Clouds

This section provides a word cloud of the most frequently used words in each short story in the book.

The Strange Orchid

Æpyornis Island

The Plattner Story

The Argonauts Of The Air

The Story Of The Late Mr. Elvesham

The Stolen Bacillus

The Red Room

A Moth (Genus Unknown)

In The Abyss

Under The Knife

The Reconciliation

A Slip Under The Microscope

In The Avu Observatory

The Triumphs Of A Taxidermist

A Deal In Ostriches

The Rajah’s Treasure

The Story Of Davidson’s Eyes

The Cone

The Purple Pileus

A Catastrophe

Le Mari Terrible

The Apple

The Sad Story Of A Dramatic Critic

The Jilting Of Jane

The Lost Inheritance

Pollock And The Porroh Man

The Sea Raiders

In The Modern Vein

The Lord Of The Dynamos

The Treasure In The Forest

The most commonly used words from each story are presented below:

  • The Strange Orchid - wedderburn, housekeeper, orchid
  • Æpyornis Island - water, canoe, time
  • The Plattner Story - plattner, green, world
  • The Argonauts Of The Air - monson, woodhouse, flying machine
  • The Story Of The Late Mr. Elvesham - mind, body, eyes, head, street
  • The Stolen Bacillus - bacteriologist, cab, anarchist
  • The Red Room - candle, door, table
  • A Moth (Genus Unknown) - hapley, moth, pawkins
  • In The Abyss - light, sphere, water
  • Under The Knife - earth, light, black
  • The Reconciliation - temple, findlay, hand
  • A Slip Under The Microscope - hill, wedderburn, laboratory
  • In The Avu Observatory - woodhouse, telescope, observatory
  • The Triumphs Of A Taxidermist - birds, stuffed, taxidermy
  • A Deal In Ostriches - diamond, padishah, potter
  • The Rajah’s Treasure - deputy, commissioner, golam
  • The Story Of Davidson’s Eyes - davidson, eyes, bellows
  • The Cone - horrocks, raut, suddenly
  • The Purple Pileus - coombes, jennie, clarence
  • A Catastrophe - winslow, minnie, shop
  • Le Mari Terrible - hot, tea, people, bellows
  • The Apple - hinchcliff, fruit, stranger
  • The Sad Story Of A Dramatic Critic - dalia, dramatic, hand
  • The Jilting Of Jane - Jane, William, Ma’am
  • The Lost Inheritance - Ted, uncle, book, eye
  • Pollock And The Porroh Man - pollock, porroh, waterhouse
  • The Sea Raiders - fison, tentacles, creatures
  • In The Modern Vein - vair, aubrey, love
  • The Lord Of The Dynamos - azuma, zi, holroyd
  • The Treasure In The Forest - hooker, evans, canoe

Looking at each of these word clouds and the most commonly used words in each story, it was seen that each chapter has a completely different set of words and most of them are character names. Some of the most commonly occurring words also happen to be from the title of the story. Several stories also have a profession or job among their most common words, such as bacteriologist, housekeeper, or commissioner.

Sentiment Analysis

H.G. Wells is known for his science fiction works, which are filled with horror and dark mysteries. As such, the negative sentiment in these stories is expected to be high. However, the study intends to test the hypothesis that a mystery story always represents negative sentiment.

This section looks at the sentiments represented by each of the 30 stories based on the positive and negative scores of every 20 lines from each story using the ‘bing’ sentiment lexicon.
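The scoring scheme can be sketched as below: each block of 20 lines gets the number of positive words minus the number of negative words. This is an illustrative Python version, not the study's R code; the two tiny word sets are placeholders standing in for the Bing lexicon, and `chunk_sentiment` is a hypothetical name.

```python
import re

# Placeholder word lists; the study used the full 'bing' lexicon instead.
POSITIVE = {"good", "happy", "bright"}
NEGATIVE = {"dark", "fear", "dead"}

def chunk_sentiment(lines, chunk_size=20):
    """Net sentiment (positive minus negative word count) per block of lines."""
    scores = []
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        words = re.findall(r"[^\W\d_]+", " ".join(chunk).lower())
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        scores.append(pos - neg)
    return scores

# Made-up story: 20 cheerful lines followed by 20 gloomy ones.
story = ["a bright good morning"] * 20 + ["dark fear in the dead room"] * 20
scores = chunk_sentiment(story)
```

A story is then read as mostly negative when most of its blocks have a score below zero.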

A vast majority of the stories overwhelmingly represent negative sentiment throughout as expected. However, not all stories represent negative sentiment. In fact, stories such as ‘The Triumphs of a Taxidermist’ and ‘Le Mari Terrible’ represent positive sentiment mostly. This disproves the hypothesis that a mystery story always represents negative sentiment.

The following observations can be made with respect to each story.

  • The Strange Orchid - Starts off on a fairly positive note, but ends on a negative one.
  • Æpyornis Island - Negative sentiment mostly except in the middle
  • The Plattner Story - After a positive opening, the story is negative for about four-fifths of its length.
  • The Argonauts Of The Air - Starts and ends negatively, with a few sections of positivity in between.
  • The Story Of The Late Mr. Elvesham - First third is positive and the remaining is negative.
  • The Stolen Bacillus - Negative throughout
  • The Red Room - Mostly Negative
  • A Moth (Genus Unknown) - Mostly Negative
  • In The Abyss - Mostly Negative except in the beginning and slightly towards the end
  • Under The Knife - Mostly Negative
  • The Reconciliation - Mostly Negative
  • A Slip Under The Microscope - Mix of negative and positive till the last third. Last third is mostly negative.
  • In The Avu Observatory - Mostly negative except at the end.
  • The Triumphs Of A Taxidermist - Mostly positive
  • A Deal In Ostriches - Starts off negatively, but is mostly positive.
  • The Rajah’s Treasure - Mostly positive
  • The Story Of Davidson’s Eyes - Mostly negative except at the end.
  • The Cone - Mostly negative
  • The Purple Pileus - Mostly negative except at the end.
  • A Catastrophe - Mostly negative
  • Le Mari Terrible - Mostly positive
  • The Apple - Mostly negative
  • The Sad Story Of A Dramatic Critic - Mostly negative except at the beginning
  • The Jilting Of Jane - Mostly positive
  • The Lost Inheritance - Positive in the middle and negative towards the end
  • Pollock And The Porroh Man - Mostly negative
  • The Sea Raiders - Mostly negative
  • In The Modern Vein - Starts off positively, but ends negatively
  • The Lord Of The Dynamos - Mostly negative
  • The Treasure In The Forest - Mostly negative

Topic Modeling

In this section, the 30 mystery stories were clustered into five different themes to understand some of the sub-themes within the mystery theme. To this end, various topic models such as LDA Fit, LDA Fixed, LDA Gibbs, and CTM were run and their alpha and entropy values were compared. Based on these values, the LDA Fit model was picked for further analysis, and the gamma values for the short stories and the beta values of the terms were explored visually.

Building the Semantic Vector Space

A corpus was first created using all the short stories and the stories were considered as documents.

The text was then cleaned up to remove numbers, punctuation, stop words, and words of length less than 4 in order to create a term document matrix.

The term document matrix was then weighted to control for the sparsity of the matrix. This was done because not all words are in each document and some words are very frequent. It is required to control for both ends of the spectrum, that is the words with zero frequency as well as the very frequent words.
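A rough sketch of the cleaning and weighting steps follows. It assumes the weighting is a tf-idf score per term, which the text does not state; the stop-word list, document texts, and function names are hypothetical, and the code is Python rather than the R used in the study.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"that", "with", "this", "from", "they", "were"}  # tiny placeholder list

def clean_tokens(text):
    """Lower-case, drop numbers and punctuation, stop words, and words shorter than 4."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= 4 and t not in STOP_WORDS]

def term_document_counts(docs):
    """Term counts per document (a sparse term-document matrix)."""
    return {name: Counter(clean_tokens(text)) for name, text in docs.items()}

def mean_tfidf(counts):
    """Mean term frequency over documents containing the term, times log2 idf."""
    n_docs = len(counts)
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for term, df in doc_freq.items():
        tfs = [c[term] / sum(c.values()) for c in counts.values() if c[term] > 0]
        scores[term] = (sum(tfs) / len(tfs)) * math.log2(n_docs / df)
    return scores

# Made-up snippets standing in for two of the stories.
docs = {
    "The Red Room": "The candle and the door; 3 candles by that door.",
    "The Sea Raiders": "Tentacles rose from the water near the door.",
}
counts = term_document_counts(docs)
scores = mean_tfidf(counts)
```

Under this weighting, a term that occurs in every document scores zero, while a term concentrated in one document scores high, so thresholding on the score trims the very frequent words.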

Modeling

The following models were run:

  • LDA Fit Model: This model uses the VEM (variational expectation maximization) algorithm and estimates alpha.
  • LDA Fixed Model: This model also uses the VEM algorithm but with a fixed alpha value.
  • LDA Gibbs Model: This model uses Gibbs sampling, a Bayesian approach, instead of the VEM algorithm.
  • CTM: The correlated topic model allows for correlation between topics and uses the VEM algorithm.

The number of topics for the analysis was set to 5 to see if these mystery stories can be clustered into five sub-themes.

Comparison of Models

Alpha Values

Alpha is the concentration parameter of the Dirichlet prior on each story's topic distribution, so it governs how many topics are predominant in a story. Low alpha values indicate that only a few topics are predominant per story, and high values indicate that more topics are predominant per story.

## [1] 0.016422
## [1] 10
## [1] 10

The LDA Fit Model has a very low alpha value, indicating that each story is dominated by a single topic and there is not much spread across the others. The higher alpha values for the LDA Fixed and LDA Gibbs models indicate a wider spread across the topics.

Entropy Values

Entropy is a measure of randomness. Low entropy values indicate low randomness, that is, fewer topics and more coherence within a document. High entropy values indicate high randomness, that is, the topics are all over the place.
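One way to compute such a score is the mean Shannon entropy of the per-story topic probabilities. The sketch below assumes that definition and natural logarithms, neither of which is stated in the text; `mean_topic_entropy` and the two toy matrices are hypothetical.

```python
import math

def mean_topic_entropy(gamma):
    """Mean Shannon entropy (natural log) of the rows of a topic-probability matrix."""
    total = 0.0
    for row in gamma:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / len(gamma)

# Toy matrices: one story fully assigned to a single topic,
# and one story spread evenly over five topics.
peaked = [[1.0, 0.0, 0.0, 0.0, 0.0]]
uniform = [[0.2, 0.2, 0.2, 0.2, 0.2]]
low = mean_topic_entropy(peaked)
high = mean_topic_entropy(uniform)
```

With five topics the score ranges from 0, when every story belongs to exactly one topic, up to ln 5 (about 1.61), when every story is spread evenly over all topics.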

## [1] 0.04223961 1.14582955 1.15065869 0.13283934

The LDA Fit Model and CTM have low entropy values, indicating low randomness, or high coherence, in each story. The LDA Fixed and LDA Gibbs models have very high entropy values, indicating that the topics are all over the place, that is, there is very little coherence within the stories.

Based on the alpha values and the entropy values, the LDA fit model was picked for further analysis.

Deep-dive of LDA Fit Model

In this section, the LDA Fit Model, which has the lowest entropy, is explored in detail.

Topic Representation

First, the most frequent terms within each of the topics were looked at. Most of these were names of characters or their professions, and hence it is difficult to understand what the themes of these topics were. However, it seems that topic 1 is related to mystery in an indoor setting or in a boat.

##       Topic 1       Topic 2              Topic 3      Topic 4   
##  [1,] "woodhouse"   "plattner"           "hapley"     "aubrey"  
##  [2,] "monson"      "golam"              "pawkins"    "vair"    
##  [3,] "boat"        "sphere"             "evans"      "horrocks"
##  [4,] "davidson"    "deputycommissioner" "findlay"    "raut"    
##  [5,] "fison"       "azim"               "hinchcliff" "azumazi" 
##  [6,] "canoe"       "winslow"            "temple"     "jane"    
##  [7,] "telescope"   "rajah"              "hooker"     "holroyd" 
##  [8,] "observatory" "elstead"            "moth"       "dynamo"  
##  [9,] "candles"     "plattner’s"         "findlay’s"  "william" 
## [10,] "housekeeper" "samud"              "elvesham"   "diamond" 
##       Topic 5         
##  [1,] "pollock"       
##  [2,] "wedderburn"    
##  [3,] "coombes"       
##  [4,] "porroh"        
##  [5,] "waterhouse"    
##  [6,] "jennie"        
##  [7,] "perera"        
##  [8,] "bacteriologist"
##  [9,] "haysman"       
## [10,] "clarence"

The stories that represent these topics the most and the least were then looked at.

##      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [1,] 2 4 3 5 1 2 1 4 1  5  3  1  4  4  4  1  2  5  2  3  1  1  1  5  1  3
## [2,] 1 1 1 1 5 1 3 1 2  3  1  2  1  1  1  5  1  4  1  1  2  2  2  2  2  1
## [3,] 3 3 2 4 2 3 2 2 3  1  2  3  2  2  2  2  4  1  3  2  3  3  3  4  3  2
## [4,] 4 5 4 2 4 4 4 3 4  2  4  4  3  3  3  3  5  2  5  4  4  4  4  1  4  4
## [5,] 5 2 5 3 3 5 5 5 5  4  5  5  5  5  5  4  3  3  4  5  5  5  5  3  5  5
##      27 28 29 30
## [1,]  1  3  5  2
## [2,]  5  1  1  1
## [3,]  2  2  2  3
## [4,]  3  4  3  4
## [5,]  4  5  4  5

Each of these five topics is represented the most by the following stories:

  • Topic 1:
    • The Story Of The Late Mr. Elvesham
    • The Red Room
    • In The Abyss
    • A Slip Under The Microscope
    • The Rajah’s Treasure
    • Le Mari Terrible
    • The Apple
    • The Sad Story Of A Dramatic Critic
    • The Lost Inheritance
    • The Sea Raiders
  • Topic 2:
    • The Strange Orchid
    • The Stolen Bacillus
    • The Story Of Davidson’s Eyes
    • The Purple Pileus
    • The Treasure In The Forest
  • Topic 3:
    • The Plattner Story
    • The Reconciliation
    • A Catastrophe
    • Pollock And The Porroh Man
    • In The Modern Vein
  • Topic 4:
    • Æpyornis Island
    • A Moth (Genus Unknown)
    • In The Avu Observatory
    • The Triumphs Of A Taxidermist
    • A Deal In Ostriches
  • Topic 5:
    • The Argonauts Of The Air
    • Under The Knife
    • The Cone
    • The Jilting Of Jane
    • The Lord Of The Dynamos
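The ranking of topics per story and the grouping above can be derived from the gamma matrix roughly as follows. This is an illustrative Python sketch; the function names are hypothetical, the gamma row for ‘The Sea Raiders’ uses the approximate 75%/25% split reported in this paper, and the second row is invented for the example.

```python
def rank_topics(gamma_row):
    """1-based topic numbers ordered from most to least probable."""
    order = sorted(range(len(gamma_row)), key=lambda k: gamma_row[k], reverse=True)
    return [k + 1 for k in order]

def group_by_top_topic(gamma):
    """Map each topic to the stories for which it is the most probable topic."""
    groups = {}
    for story, row in gamma.items():
        groups.setdefault(rank_topics(row)[0], []).append(story)
    return groups

gamma = {
    "The Sea Raiders": [0.75, 0.0, 0.0, 0.0, 0.25],    # approximate values from the text
    "Hypothetical story": [0.0, 1.0, 0.0, 0.0, 0.0],   # invented for illustration
}
ranks = {story: rank_topics(row) for story, row in gamma.items()}
groups = group_by_top_topic(gamma)
```

Ties between equally probable topics are resolved in favor of the lower topic number, since Python's sort is stable.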

Based on the stories represented by topic 1, it was clear that topic 1 is related to mystery in an indoor setting or in a boat. The boat theme is probably coming from The Sea Raiders.

Gamma Values

The gamma values represent the probability of each topic within each story. The gamma matrix from the LDA Fit Model was taken and its values were visualized.

A lot of points with gamma of 1 and a lot of points with gamma of 0 are seen. This is in tune with the low entropy score for this model. Either a topic is highly probable for a story or it is highly improbable.

The only exception was the story ‘The Sea Raiders’, which seemed to have a 75% probability of representing Topic 1 and a 25% probability of representing Topic 5.

This was in tune with the earlier interpretation that topic 1 is related to mystery in an indoor setting or a boat. The boat setting probably came from ‘The Sea Raiders’, which struggled to be completely represented by topic 1, a topic mostly related to an indoor setting.

Conclusion

Natural Language Processing is a rapidly evolving field and as such the study of it is essential to anyone involved in the field of data analysis. Using motivation from prior research in the field of applying NLP to fictional literature, this study explored NLP themes such as sentiment analysis and topic modeling using text data from ‘Thirty Strange Stories’ by H.G. Wells.

Initial data exploration revealed some of the most commonly used words, collocates, and the relationship between pairs of words used in these stories. Sentiment analysis helped disprove the hypothesis that a mystery story always represents negative sentiment. Topic models were then built to explore the major sub-themes within these mystery stories. Although mystery in an indoor setting was interpreted as one of the main sub-themes, the other topics could not be explored further due to the high prevalence of names and designations of characters.

Further study of this fictional work could involve retesting the hypothesis that a mystery story always represents negative sentiment using other sentiment lexicons, either individually or as a weighted combination. Distinctive Collexeme Analysis can also be performed to help understand how interesting words in the book are used in positive and negative contexts. Using packages to remove non-dictionary words, the names of characters can also be removed and the topic models rerun. This would help in better interpretation of the sub-themes within these mystery stories. Finally, additional analysis could include using the LIWC2015 Text Processing Module to extract the psychometric properties of the language used in this fictional work to build models such as MDS, PCA and EFA for deriving more insights about the style of writing.

References

Anvari, S., & Amirkhani, H. (2018). Book2Vec: Representing Books in Vector Space Without Using the Contents. 2018 8th International Conference on Computer and Knowledge Engineering (ICCKE), 176-182.

Ashok, V.G., Feng, S., & Choi, Y. (2013). Success with Style: Using Writing Style to Predict the Success of Novels. EMNLP.

Egbert, J. (2012). Style in nineteenth century fiction: A Multi-Dimensional analysis. Scientific Study of Literature, 2. doi:10.1075/ssol.2.2.01egb

Jautze, K.J. (2014). Measuring the style of chick lit and literature. DH.

Solorio, T., Montes-y-Gomez, M., Maharjan, S., Ovalle, J.E., & Gonzalez, F.A. (2017). A Multi-task Approach to Predict Likability of Books. EACL.