Objective

The research study analyzes the textual differences among the 30 chapters of the book ‘Thirty Strange Stories’ by H.G. Wells. Each of the 30 chapters is an individual story. The analysis covers word usage and colexemes, sentiment analysis, and topic modeling/clustering. Because all 30 stories are written by the same author, the study aims to understand the relationship between word usage and storyline.

Introduction

The author of Thirty Strange Stories is H.G. Wells, who is well known for books such as The War of the Worlds, The Invisible Man and The Time Machine. The text of the book was sourced from the Gutenberg corpus.

Below is a list of the 30 stories covered in this book:

  • The Strange Orchid
  • Æpyornis Island
  • The Plattner Story
  • The Argonauts Of The Air
  • The Story Of The Late Mr. Elvesham
  • The Stolen Bacillus
  • The Red Room
  • A Moth (Genus Unknown)
  • In The Abyss
  • Under The Knife
  • The Reconciliation
  • A Slip Under The Microscope
  • In The Avu Observatory
  • The Triumphs Of A Taxidermist
  • A Deal In Ostriches
  • The Rajah’s Treasure
  • The Story Of Davidson’s Eyes
  • The Cone
  • The Purple Pileus
  • A Catastrophe
  • Le Mari Terrible
  • The Apple
  • The Sad Story Of A Dramatic Critic
  • The Jilting Of Jane
  • The Lost Inheritance
  • Pollock And The Porroh Man
  • The Sea Raiders
  • In The Modern Vein
  • The Lord Of The Dynamos
  • The Treasure In The Forest

Relevant literature on natural language processing through textual analysis of various authors’ works will be reviewed.

Hypothesis / Problem Statement

Hypothesis

“In fictional writing, the plot of the story determines the author’s vocabulary usage”

Importance

Authors face the challenge of keeping the audience captivated across different genres of writing, such as adventure, fiction and romance. The study will attempt to understand whether an author overcomes this challenge through versatility of word usage or by adding variety to the plot itself.

Statistical Analysis Plan

The research study intends to cover the following analyses:

  • Sentiment Analysis
  • Basic exploratory data analysis: Dispersion and Location
  • Topic Models
  • Word Frequency
  • Colexeme Analysis
  • Attraction and Reliance
  • PMI and Odds Ratio
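
The report does not give formulas for the association measures listed above, so the following is only a minimal sketch under commonly used definitions. It assumes a 2x2 contingency table for a word and a slot (or co-occurring word): a = word in the slot, b = word elsewhere, c = other words in the slot, d = everything else. Attraction is taken as a / (a + c), reliance as a / (a + b), PMI as log2 of observed over expected co-occurrence, and the odds ratio as (a * d) / (b * c). The function name and the example counts are hypothetical.

```python
import math


def association_measures(a, b, c, d):
    """Association measures from a 2x2 table (assumed layout, see lead-in).

    a: word inside the slot       b: word outside the slot
    c: other words in the slot    d: other words outside the slot
    """
    if a + b == 0 or a + c == 0:
        raise ValueError("word and slot must each occur at least once")
    n = a + b + c + d

    # Share of the slot's occurrences that are filled by this word.
    attraction = a / (a + c)
    # Share of the word's occurrences that fall inside the slot.
    reliance = a / (a + b)
    # Pointwise mutual information: log2(P(word, slot) / (P(word) * P(slot))).
    pmi = math.log2((a * n) / ((a + b) * (a + c))) if a > 0 else -math.inf
    # Odds ratio; undefined when b * c is zero, reported here as infinity.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else math.inf

    return {
        "attraction": attraction,
        "reliance": reliance,
        "pmi": pmi,
        "odds_ratio": odds_ratio,
    }


# Invented counts, for illustration only.
print(association_measures(10, 30, 40, 920))
```

A PMI above zero or an odds ratio above one indicates that the word and the slot co-occur more often than independence would predict. In practice a smoothing constant is often added to each cell before computing the odds ratio, which this sketch omits.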

Method

Data

Source

This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is part of the Gutenberg corpus, and its text was retrieved for analysis using the ‘gutenbergr’ package in R.

Data Preparation

For data preparation, the first few rows, the last few rows, and all empty rows were removed from the dataset so that only the main text of each chapter remained. Additional variables for chapter number and line number were then created to form the main dataset. For secondary analysis, the text was unnested into individual words to create a second dataset.
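
The preparation steps described above can be sketched as follows. The study itself was carried out in R; this Python version is only an illustration of the same logic. How chapters were detected is not stated, so the sketch assumes that each story begins with a heading line matching its title, and that line numbers restart within each chapter. The function names, the `skip_head` and `skip_tail` parameters, and the sample lines are all hypothetical.

```python
import re


def prepare_lines(raw_lines, skip_head, skip_tail, titles):
    """Trim front/end matter, drop empty rows, add chapter and line numbers."""
    body = raw_lines[skip_head:len(raw_lines) - skip_tail]
    title_set = {t.upper() for t in titles}  # assumed: headings match titles
    rows, chapter, line_number = [], 0, 0
    for text in body:
        stripped = text.strip()
        if not stripped:
            continue  # remove empty rows
        if stripped.upper() in title_set:
            chapter += 1  # a heading starts the next story
            line_number = 0
            continue
        if chapter == 0:
            continue  # skip anything before the first heading
        line_number += 1
        rows.append({"chapter": chapter, "line": line_number, "text": stripped})
    return rows


def unnest_words(rows):
    """One record per word: lowercase letter runs, apostrophes kept inside."""
    pattern = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")
    return [
        {"chapter": row["chapter"], "line": row["line"], "word": word}
        for row in rows
        for word in pattern.findall(row["text"].lower())
    ]


# Invented sample lines, for illustration only.
sample = [
    "Front matter", "", "THE STRANGE ORCHID", "",
    "First line of the story.", "", "Second line.",
    "ÆPYORNIS ISLAND", "Another story begins.", "End matter",
]
rows = prepare_lines(sample, 1, 1, ["The Strange Orchid", "Æpyornis Island"])
words = unnest_words(rows)
print(len(rows), len(words))
```

The first function yields the main dataset (one record per line of text), and the second yields the word-level dataset used for the secondary analysis.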

Exploratory data analysis

Word Clouds

[Figures: one word cloud for each of the 30 stories, in the order listed in the Introduction, from The Strange Orchid to The Treasure In The Forest.]
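
A word cloud sizes each word by its frequency within a story, so the underlying computation is a per-chapter word count. The sketch below shows that step only. It assumes the input is a sequence of (chapter, word) pairs and that common function words are filtered out first. The report does not say whether or how stop words were removed, so the short `STOP_WORDS` list and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "it", "is"})


def word_frequencies(tokens, stop_words=STOP_WORDS):
    """Count words per chapter from (chapter, word) pairs."""
    counts = defaultdict(Counter)
    for chapter, word in tokens:
        if word not in stop_words:
            counts[chapter][word] += 1
    return counts


def top_words(counts, chapter, n=3):
    """The n most frequent words of one chapter, as (word, count) pairs."""
    return counts[chapter].most_common(n)


# Invented tokens, for illustration only (not actual counts from the book).
tokens = [
    (1, "the"), (1, "orchid"), (1, "orchid"), (1, "orchid"),
    (1, "plant"), (1, "plant"), (1, "said"),
    (2, "the"), (2, "island"), (2, "island"),
]
counts = word_frequencies(tokens)
print(top_words(counts, 1, 2))
```

The resulting per-chapter counts are what a word cloud renderer would consume, and comparing the top words across chapters gives a first impression of how vocabulary shifts with the storyline.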