Natural Language Processing (NLP) is one of the fastest-growing fields today, and its applications across various domains are innumerable. Virtual assistants such as Google Home and Amazon Alexa, now household names, are applications of NLP. Just as these devices become more skillful, or “human-like”, each day, the field of NLP is also growing rapidly, and the ability of models to infer meaning from language could propel our existing forms of communication into a new age. As such, studying this field has become paramount to establishing a promising career as an academic or as a data science professional.
This paper applies some of the important techniques in NLP, such as sentiment analysis and topic modeling, to a work of fictional literature in order to study its main themes. Research in this area has flourished, especially with the growth of computational power and the understanding of how to apply mathematics beyond summary statistics to language.
Ashok, Feng, and Choi (2013) were able to predict the commercial success of a novel based on its writing style. Egbert (2012) used multi-dimensional analysis to compare styles of nineteenth-century fiction writing among authors, as well as style variations among novels by individual authors. Jautze (2014) examined the extent to which the distribution of the most frequent words in two novelistic genres, chick lit and literature, gives insight into genre styles. Solorio, Montes-y-Gomez, Maharjan, Ovalle, and Gonzalez (2017) used feature engineering and neural network models to predict the likability of books from the Gutenberg corpus. Anvari and Amirkhani (2018) created a neural network based embedding approach called book2vec for creating book representations using Google’s word2vec model.
The fictional work chosen for this paper was ‘Thirty Strange Stories’ by H.G. Wells, a famous fiction writer of the late nineteenth century, well known for books such as The War of the Worlds, The Invisible Man, and The Time Machine. As the name suggests, ‘Thirty Strange Stories’ is a collection of 30 stories, each with the overarching theme of mystery.
Below is a list of the 30 stories covered in this book:
1. The Strange Orchid
2. Æpyornis Island
3. The Plattner Story
4. The Argonauts Of The Air
5. The Story Of The Late Mr. Elvesham
6. The Stolen Bacillus
7. The Red Room
8. A Moth (Genus Unknown)
9. In The Abyss
10. Under The Knife
11. The Reconciliation
12. A Slip Under The Microscope
13. In The Avu Observatory
14. The Triumphs Of A Taxidermist
15. A Deal In Ostriches
16. The Rajah’s Treasure
17. The Story Of Davidson’s Eyes
18. The Cone
19. The Purple Pileus
20. A Catastrophe
21. Le Mari Terrible
22. The Apple
23. The Sad Story Of A Dramatic Critic
24. The Jilting Of Jane
25. The Lost Inheritance
26. Pollock And The Porroh Man
27. The Sea Raiders
28. In The Modern Vein
29. The Lord Of The Dynamos
30. The Treasure In The Forest
Since this literary collection comprises thirty different mystery stories, it is tedious to try to understand each one separately. However, it might be interesting to see how these stories cluster together to form sub-themes within the broader mystery theme. Another interesting question is whether a mystery story always represents negative sentiment. Lastly, the study will try to understand these stories better through some exploratory data analysis.
Text data such as this requires significant cleaning, so data processing techniques will be used to transform the raw text into tidy datasets.
After data preparation, exploratory data analysis will be performed on the combined data from the thirty different stories. This will include looking at the most frequent words and collocates, the most strongly correlated pairs of words, and a network representation of the strongest pairs used in the collection of stories. We will then look at word cloud representations of each story to understand them a little better.
This will be followed by a sentiment analysis of each story to test the hypothesis that a mystery story always represents negative sentiment.
Finally, topic models will be created to understand the sub-themes within the mystery theme in the context of these stories.
This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is part of the Gutenberg corpus, and its text was sourced for analysis using the ‘gutenbergr’ package in R. The book includes thirty chapters, representing thirty different stories. The raw text also includes a copyright page and a table of contents at the beginning and some transcriber’s notes at the end.
For data preparation, the first few rows, the last few rows, and empty rows from the dataset were removed to focus on just the main text from each of the chapters. Additional variables for chapter number and line number were then created to prepare the main dataset for analysis. For secondary analysis, the words were unnested and another dataset was created.
library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)
library(Rling)
library(modeest)
library(scales)
library(widyr)
library(tokenizers)
# download the entire book
data <- gutenberg_download(59774)
# Remove the first and last unwanted rows
data <- data[96:12231,]
# check for UPPER case
data$check <- data$text == toupper(data$text)
# filter out empty rows
data <- data %>% filter(text != "" )
# create row number
data <- data %>% mutate(row_num = row_number())
# remove incorrectly detected chapter headings
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]
# Create Chapter
data$chapter <- cumsum(data$check)
# Create a separate dataset for chapter headings
chapters_headings <- filter(data, check == TRUE) %>% rename(chapter_name = text) %>%
select(chapter, chapter_name)
# Clean up and join with chapter headings
data <- data %>% mutate(title = "Thirty Strange Stories") %>% mutate(row_num = row_number()) %>%
select(title, text, row_num, chapter) %>% left_join(chapters_headings)
# Remove leading and trailing white spaces
data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE)
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)
# Words for initial analysis
words <- data %>% unnest_tokens(word, text)
# Remove leading and trailing white spaces from chapters_heading for later use
chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
Frequently occurring words
The most common words used in the entire book are listed below. It’s not surprising that ‘the’ has the highest count, since it is one of the most commonly used words in English. What stands out in the table below is that character names are not at the top. This makes sense, as the book contains different stories with different characters and plots.
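The chunk that produced this frequency table is not shown above; a minimal sketch of how such counts could be obtained from the unnested ‘words’ data frame (an assumption, not the author’s original chunk) is:
# Sketch: most frequent words across the whole book
words %>%
count(word, sort = TRUE) %>%
head(10)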
Collocates
Below are the most commonly occurring collocates. Interestingly, the top collocates have the pronoun ‘it’ rather than ‘he’ or ‘she’.
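The code behind the collocate table is likewise not included; one possible way to count two-word collocates (a sketch based on simple bigram counts over the prepared ‘data’ frame, which may differ from the author’s approach) is:
# Sketch: most frequent bigrams (collocates) in the full text
data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE) %>%
head(10)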
Strongest pairs
Below are the correlations of pairs of words in the book “Thirty Strange Stories”. Three pairs of words that have an interesting correlation are ‘before - from’, ‘before - they’ and ‘from - they’.
keyword_cors = words %>%
group_by(word) %>%
filter(n() >= 50) %>%
pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)
keyword_cors
The graph below shows a network plot consisting of the strongest pairs with correlations greater than 0.8.
library(ggplot2)
library(igraph)
library(ggraph)
keyword_cors %>%
filter(correlation > .8) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Aubrey seems to have the highest correlation with Vair, which is her last name in the chapter ‘In the Modern Vein’. She is the main protagonist, with the story-line revolving around her life, and it is interesting that her last name is always used with her first name.
The rest of the correlations are between general terms such as ‘thought’ and ‘came’, ‘thing’ and ‘me’, and ‘still’ and ‘face’. Since the word-pair analysis was done on the entire collection, which is composed of different stories, no specific pattern was identified.
This section provides a word cloud of the most frequently used words in each short story in the book.
The Strange Orchid
words %>% filter(chapter == 1) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Æpyornis Island
words %>% filter(chapter == 2) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Plattner Story
words %>% filter(chapter == 3) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Argonauts Of The Air
words %>% filter(chapter == 4) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of The Late Mr. Elvesham
words %>% filter(chapter == 5) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Stolen Bacillus
words %>% filter(chapter == 6) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Red Room
words %>% filter(chapter == 7) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Moth (Genus Unknown)
words %>% filter(chapter == 8) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Abyss
words %>% filter(chapter == 9) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Under The Knife
words %>% filter(chapter == 10) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Reconciliation
words %>% filter(chapter == 11) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Slip Under The Microscope
words %>% filter(chapter == 12) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Avu Observatory
words %>% filter(chapter == 13) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Triumphs Of A Taxidermist
words %>% filter(chapter == 14) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Deal In Ostriches
words %>% filter(chapter == 15) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Rajah’s Treasure
words %>% filter(chapter == 16) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of Davidson’s Eyes
words %>% filter(chapter == 17) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Cone
words %>% filter(chapter == 18) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Purple Pileus
words %>% filter(chapter == 19) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Catastrophe
words %>% filter(chapter == 20) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Le Mari Terrible
words %>% filter(chapter == 21) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Apple
words %>% filter(chapter == 22) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sad Story Of A Dramatic Critic
words %>% filter(chapter == 23) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Jilting Of Jane
words %>% filter(chapter == 24) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lost Inheritance
words %>% filter(chapter == 25) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Pollock And The Porroh Man
words %>% filter(chapter == 26) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sea Raiders
words %>% filter(chapter == 27) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Modern Vein
words %>% filter(chapter == 28) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lord Of The Dynamos
words %>% filter(chapter == 29) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Treasure In The Forest
words %>% filter(chapter == 30) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The most commonly used words from each story are presented below:
Looking at each of these word clouds and the most commonly used words in each story, it was seen that each chapter has a completely different set of words, most of which are character names. Some of the most commonly occurring words also come from the title of the story. Each story seems to have some profession or job associated with it, such as painter, hooker, or commissioner.
H.G. Wells is known for his science fiction works, which are filled with horror and dark mysteries. As such, the negative sentiment in these stories is expected to be high. However, the study intends to test the hypothesis that a mystery story always represents negative sentiment.
This section looks at the sentiments represented by each of the 30 stories based on the positive and negative scores of every 20 lines from each story, using the ‘bing’ sentiment lexicon.
sentiment_books <- words %>%
inner_join(get_sentiments("bing")) %>%
count(chapter_name, index = row_num %/% 20, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,])
sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,])
sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,])
sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,])
sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,])
ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
A vast majority of the stories overwhelmingly represent negative sentiment throughout, as expected. However, not all stories do. In fact, stories such as ‘The Triumphs of a Taxidermist’ and ‘Le Mari Terrible’ mostly represent positive sentiment. This disproves the hypothesis that a mystery story always represents negative sentiment.
The following observations can be made with respect to each story.
In this section, the 30 mystery stories were clustered into five different themes to understand some of the sub-themes within the mystery theme. To this end, various topic models, namely LDA Fit, LDA Fixed, LDA Gibbs, and CTM, were run and their alpha and entropy values were compared. Based on these values, the LDA Fit model was picked for further analysis, and the gamma values of the short stories and the beta values of the terms were explored visually.
A corpus was first created using all the short stories and the stories were considered as documents.
topics_data <- data %>% select(chapter_name, text, chapter) %>%
unite(document, chapter_name, chapter)
by_chapter = topics_data %>%
group_by(document) %>%
summarise(text=paste(text,collapse=' '))
import_corpus = Corpus(VectorSource(by_chapter$text))
The text was then cleaned up to remove numbers, punctuation, stop words, and words of length less than 4, in order to create a document-term matrix.
import_mat =
DocumentTermMatrix(import_corpus,
control = list(stemming = FALSE,
stopwords = TRUE, #remove stop words
minWordLength = 4, #cut out small words
removeNumbers = TRUE, #take out the numbers
removePunctuation = TRUE)) #take out the punctuation
The document-term matrix was then weighted to control for the sparsity of the matrix. This was done because not all words occur in every document and some words are very frequent. It is necessary to control for both ends of the spectrum, that is, the words with zero frequency as well as the very frequent words.
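Reading the chunk below (a restatement of the code, not text from the original), the weight of term $j$ is its mean relative frequency over the documents that contain it, scaled by an inverse-document-frequency factor:

$$ w_j = \left( \frac{1}{|D_j|} \sum_{d \in D_j} \frac{n_{d,j}}{n_d} \right) \log_2 \frac{N}{|D_j|} $$

where $D_j$ is the set of documents containing term $j$, $n_{d,j}$ is the count of term $j$ in document $d$, $n_d$ is the length of document $d$, and $N$ is the number of documents. Terms with $w_j < 0.015$ are then dropped, as are documents left with no remaining terms.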
#weight the space
import_weight = tapply(import_mat$v/row_sums(import_mat)[import_mat$i],
import_mat$j,
mean) *
log2(nDocs(import_mat)/col_sums(import_mat > 0))
#ignore the very frequent and 0 terms
import_mat = import_mat[ , import_weight >= 0.015]
import_mat = import_mat[row_sums(import_mat) > 0, ]
The following models were run: LDA Fit (VEM with alpha estimated), LDA Fixed (VEM with alpha held fixed), LDA Gibbs (Gibbs sampling), and CTM.
The number of topics for the analysis was set to 5 to see if these mystery stories can be clustered into five sub-themes.
#set the number of topics
k = 5
#set a random number for seed
SEED = 12345
LDA_fit = LDA(import_mat, k = k,
control = list(seed = SEED))
LDA_fixed = LDA(import_mat, k = k,
control = list(estimate.alpha = FALSE, seed = SEED))
LDA_gibbs = LDA(import_mat, k = k, method = "Gibbs",
control = list(seed = SEED, burnin = 1000,
thin = 100, iter = 1000))
CTM_fit = CTM(import_mat, k = k,
control = list(seed = SEED,
var = list(tol = 10^-4),
em = list(tol = 10^-3)))
Alpha Values
Alpha is the Dirichlet parameter that controls how topics are spread across documents; in effect it measures the predominance of topics. Low alpha values indicate that only a few topics are predominant per story, and high values indicate that more topics are predominant per story.
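The chunk that printed the three values below is not included; a minimal sketch, assuming the fitted objects from the previous chunk, is to read each LDA model’s alpha slot (CTM has no alpha):
# Sketch (not the original chunk): alpha values of the three LDA models
LDA_fit@alpha
LDA_fixed@alpha
LDA_gibbs@alpha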
## [1] 0.016422
## [1] 10
## [1] 10
The LDA Fit model has a very low alpha value, indicating that a single topic is predominant within each story and there is not much spread. The higher alpha values for the LDA Fixed and LDA Gibbs models indicate a higher spread across topics.
Entropy Values
Entropy is a measure of randomness. Low entropy values indicate low randomness, that is, fewer topics or more coherence within a document. High entropy values indicate high randomness, that is, the topics are spread all over the place.
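The chunk below computes, for each model, the average over stories of the Shannon entropy of the per-story topic distribution gamma (a restatement of the code that follows, not text from the original):

$$ H(d) = -\sum_{k=1}^{K} \gamma_{d,k} \log \gamma_{d,k} $$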
sapply(list(LDA_fit, LDA_fixed, LDA_gibbs, CTM_fit),
function (x)
mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## [1] 0.04223961 1.14582955 1.15065869 0.13283934
The LDA Fit model and CTM have low entropy values, indicating low randomness, or coherence, within each story. The LDA Fixed and LDA Gibbs models have very high entropy values, indicating that the topics are all over the place, that is, very little coherence within the stories.
Based on the alpha values and the entropy values, the LDA fit model was picked for further analysis.
In this section, the LDA Fit Model, which has the lowest entropy, is explored in detail.
First, the most frequent terms within each of the topics were examined. Most of these are names of characters or their professions, and hence it is difficult to interpret the themes of the topics. However, it seems that topic 1 is related to mystery in an indoor setting or in a boat.
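The chunk that produced the table below is not shown; one way to list the ten most probable terms per topic with the topicmodels package (a sketch, not necessarily the author’s exact call) is:
# Sketch: top 10 terms per topic from the LDA Fit model
terms(LDA_fit, 10)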
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "woodhouse" "plattner" "hapley" "aubrey"
## [2,] "monson" "golam" "pawkins" "vair"
## [3,] "boat" "sphere" "evans" "horrocks"
## [4,] "davidson" "deputycommissioner" "findlay" "raut"
## [5,] "fison" "azim" "hinchcliff" "azumazi"
## [6,] "canoe" "winslow" "temple" "jane"
## [7,] "telescope" "rajah" "hooker" "holroyd"
## [8,] "observatory" "elstead" "moth" "dynamo"
## [9,] "candles" "plattner’s" "findlay’s" "william"
## [10,] "housekeeper" "samud" "elvesham" "diamond"
## Topic 5
## [1,] "pollock"
## [2,] "wedderburn"
## [3,] "coombes"
## [4,] "porroh"
## [5,] "waterhouse"
## [6,] "jennie"
## [7,] "perera"
## [8,] "bacteriologist"
## [9,] "haysman"
## [10,] "clarence"
The stories that represent these topics the most and the least were then looked at.
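The generating chunk is again not included; the ranking matrix below could be reproduced with the topicmodels topics() helper, which returns the k most likely topics for each document (a sketch, assuming the fitted LDA_fit object):
# Sketch: rank all 5 topics for each of the 30 stories, most likely first
topics(LDA_fit, 5)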
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [1,] 2 4 3 5 1 2 1 4 1 5 3 1 4 4 4 1 2 5 2 3 1 1 1 5 1 3
## [2,] 1 1 1 1 5 1 3 1 2 3 1 2 1 1 1 5 1 4 1 1 2 2 2 2 2 1
## [3,] 3 3 2 4 2 3 2 2 3 1 2 3 2 2 2 2 4 1 3 2 3 3 3 4 3 2
## [4,] 4 5 4 2 4 4 4 3 4 2 4 4 3 3 3 3 5 2 5 4 4 4 4 1 4 4
## [5,] 5 2 5 3 3 5 5 5 5 4 5 5 5 5 5 4 3 3 4 5 5 5 5 3 5 5
## 27 28 29 30
## [1,] 1 3 5 2
## [2,] 5 1 1 1
## [3,] 2 2 2 3
## [4,] 3 4 3 4
## [5,] 4 5 4 5
Each of these five topics is represented the most by the following stories:
Based on the stories represented by topic 1, it was clear that topic 1 is related to mystery in an indoor setting or in a boat. The boat theme is probably coming from The Sea Raiders.
The beta values represent the weight of each word with respect to each topic. Let us plot the most frequent terms for each topic and visualize their beta values.
#use tidyverse to clean up the fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")
#create top terms
top_terms = LDA_fit_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
#clean up ggplot2 defaults
cleanup = theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.x = element_line(color = "black"),
axis.line.y = element_line(color = "black"),
legend.key = element_rect(fill = "white"),
text = element_text(size = 10))
#make the plot
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
cleanup +
coord_flip()
The terms with the highest beta values (> 0.1) and the topics they represent are listed below:
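The list referred to here is not included in the text; a sketch of how the high-beta terms could be pulled from the tidied model (using the LDA_fit_topics data frame created above) is:
# Sketch: terms with beta greater than 0.1 and the topics they belong to
LDA_fit_topics %>%
filter(beta > 0.1) %>%
arrange(topic, -beta)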
All these seem to be names of characters from each of these stories.
The gamma values represent the probability of each topic within each story. The gamma matrix from the LDA Fit Model was taken and their gamma values were visualized.
LDA_gamma = tidy(LDA_fit, matrix = "gamma") %>%
left_join(chapters_headings, by = c("document" = "chapter"))
LDA_gamma %>%
ggplot(aes(factor(topic), gamma)) +
geom_point() +
geom_text(aes(label=ifelse(gamma < 0.9 & gamma > 0.1, as.character(LDA_gamma$chapter_name),''),
hjust=0,
vjust=0)) +
cleanup
Many points with a gamma of 1 and many points with a gamma of 0 are seen. This is in tune with the low entropy score for this model: either a topic is highly probable for a story or it is highly improbable.
The only exception was the story ‘The Sea Raiders’, which seemed to have a 75% probability of representing Topic 1 and a 25% probability of representing Topic 5.
This was in tune with the earlier interpretation that topic 1 is related to mystery in an indoor setting or in a boat. The boat setting probably came from ‘The Sea Raiders’, which struggled to be completely represented by topic 1, a topic mostly related to an indoor setting.
Natural Language Processing is a rapidly evolving field, and its study is essential for anyone involved in data analysis. Motivated by prior research applying NLP to fictional literature, this study explored NLP themes such as sentiment analysis and topic modeling using text data from ‘Thirty Strange Stories’ by H.G. Wells.
Initial data exploration revealed some of the most commonly used words, collocates, and the relationships between pairs of words used in these stories. Sentiment analysis helped disprove the hypothesis that a mystery story always represents negative sentiment. Topic models were then built to explore the major sub-themes within these mystery stories. Although mystery in an indoor setting was interpreted as one of the main sub-themes, other topics couldn’t be explored further due to the high prevalence of names and designations of characters.
Further study of this fictional work could involve retesting the hypothesis that a mystery story always represents negative sentiment using other sentiment lexicons, either individually or combined in a weighted manner. Distinctive collexeme analysis could also be performed to help understand how interesting words in the book are used in positive and negative contexts. Using packages to remove non-dictionary words, the names of characters could be removed and the topic models rerun; this would help in better interpretation of the sub-themes within these mystery stories. Finally, additional analysis could include using the LIWC2015 Text Processing Module to extract the psychometric properties of the language used in this fictional work and to build models such as MDS, PCA, and EFA for deriving more insights about the style of writing.
Anvari, S., & Amirkhani, H. (2018). Book2Vec: Representing Books in Vector Space Without Using the Contents. 2018 8th International Conference on Computer and Knowledge Engineering (ICCKE), 176-182.
Ashok, V.G., Feng, S., & Choi, Y. (2013). Success with Style: Using Writing Style to Predict the Success of Novels. EMNLP.
Egbert, J. (2012). Style in nineteenth century fiction: A multi-dimensional analysis. Scientific Study of Literature, 2(2). doi:10.1075/ssol.2.2.01egb
Jautze, K.J. (2014). Measuring the style of chick lit and literature. DH.
Solorio, T., Montes-y-Gomez, M., Maharjan, S., Ovalle, J.E., & Gonzalez, F.A. (2017). A Multi-task Approach to Predict Likability of Books. EACL.