This study analyzes the textual differences among the 30 chapters of the book ‘Thirty Strange Stories’ by H.G. Wells. Each chapter is an individual story. The analysis covers verbiage and colexemes, sentiment analysis, and topic modeling/clustering. Because all 30 stories were written by the same author, the study aims to understand the relationship between word usage and storyline.
H.G. Wells is well known for books such as The War of the Worlds, The Invisible Man and The Time Machine. The text of the book was sourced from the Gutenberg corpus.
Below is a list of the 30 stories covered in this book:
Relevant literature on natural language processing through textual analysis of various authors’ works will be reviewed.
“In fictional writing, the plot of the story determines the author’s vocabulary usage”
Authors face the challenge of keeping the audience captivated across genres of writing such as adventure, fiction and romance. The study will attempt to understand whether the author overcomes this challenge through versatility of word usage or by adding variety to the plot itself.
The research study intends to touch upon the following analysis
This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is a part of the Gutenberg corpus and the text from the book was sourced for analysis using the ‘gutenbergr’ package in R.
For data preparation, the first few rows, the last few rows, and all empty rows were removed from the dataset so that only the main text of each chapter remained. Additional variables for chapter number and line number were then created to prepare the main dataset. For secondary analysis, the text was unnested into individual words and stored in a separate dataset.
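The chapter-numbering idea can be illustrated on a toy example. The sketch below is not the study's code: it uses base R only and a made-up six-line vector (`toy_text`), and it assumes that a heading is any non-empty line that equals its own upper-case form.

```r
# Hypothetical toy input; the real study uses the downloaded book text
toy_text <- c("THE FIRST STORY", "It was a dark night.", "",
              "THE SECOND STORY", "The sun rose early.", "He walked home.")

toy <- data.frame(text = toy_text, stringsAsFactors = FALSE)

# Drop empty rows
toy <- toy[toy$text != "", , drop = FALSE]

# Flag a line as a heading when it equals its upper-case form
toy$check <- toy$text == toupper(toy$text)

# The running count of headings gives every line its chapter number
toy$chapter <- cumsum(toy$check)

# Line number after cleaning
toy$row_num <- seq_len(nrow(toy))

print(toy)
```

One caveat of this heuristic is that a line with no lower-case letters at all (for example a row of asterisks or a bare numeral) also equals its upper-case form, so falsely flagged rows have to be reviewed and removed by hand.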
library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)
# download the entire book
data <- gutenberg_download(59774)
# Remove the first and last unwanted rows
data <- data[96:12231,]
# check for UPPER case
data$check <- data$text == toupper(data$text)
# filter out empty rows
data <- data %>% filter(text != "" )
# create row number
data <- data %>% mutate(row_num = row_number())
# remove incorrectly detected chapter headings
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]
# Create Chapter
data$chapter <- cumsum(data$check)
# Create a separate dataset for chapter headings
chapters_headings <- filter(data, check == TRUE) %>% rename(chapter_name = text) %>%
select(chapter, chapter_name)
# Clean up and join with chapter headings
data <- data %>% mutate(title = "Thirty Strange Stories") %>% mutate(row_num = row_number()) %>%
select(title, text, row_num, chapter) %>% left_join(chapters_headings)
# Remove leading and trailing white spaces
data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE)
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)
# Words for initial analysis
words <- data %>% unnest_tokens(word, text)
# Remove leading and trailing white spaces from chapters_headings for later use
chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
The Strange Orchid
words %>% filter(chapter == 1) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Æpyornis Island
words %>% filter(chapter == 2) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Plattner Story
words %>% filter(chapter == 3) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Argonauts Of The Air
words %>% filter(chapter == 4) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of The Late Mr. Elvesham
words %>% filter(chapter == 5) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Stolen Bacillus
words %>% filter(chapter == 6) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Red Room
words %>% filter(chapter == 7) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Moth (Genus Unknown)
words %>% filter(chapter == 8) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Abyss
words %>% filter(chapter == 9) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Under The Knife
words %>% filter(chapter == 10) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Reconciliation
words %>% filter(chapter == 11) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Slip Under The Microscope
words %>% filter(chapter == 12) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Avu Observatory
words %>% filter(chapter == 13) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Triumphs Of A Taxidermist
words %>% filter(chapter == 14) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Deal In Ostriches
words %>% filter(chapter == 15) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Rajah’s Treasure
words %>% filter(chapter == 16) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Story Of Davidson’s Eyes
words %>% filter(chapter == 17) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Cone
words %>% filter(chapter == 18) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Purple Pileus
words %>% filter(chapter == 19) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
A Catastrophe
words %>% filter(chapter == 20) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Le Mari Terrible
words %>% filter(chapter == 21) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Apple
words %>% filter(chapter == 22) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sad Story Of A Dramatic Critic
words %>% filter(chapter == 23) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Jilting Of Jane
words %>% filter(chapter == 24) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lost Inheritance
words %>% filter(chapter == 25) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
Pollock And The Porroh Man
words %>% filter(chapter == 26) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Sea Raiders
words %>% filter(chapter == 27) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
In The Modern Vein
words %>% filter(chapter == 28) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Lord Of The Dynamos
words %>% filter(chapter == 29) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
The Treasure In The Forest
words %>% filter(chapter == 30) %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 50))
sentiment_books <- words %>%
inner_join(get_sentiments("bing")) %>%
count(chapter_name, index = row_num %/% 20, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,])
sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,])
sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,])
sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,])
sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,])
ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
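To make the scoring behind these plots concrete, the following is a minimal base-R sketch of the same idea. Words are matched against a sentiment lexicon, lines are grouped into blocks by integer division (`row_num %/% 20` in the study; `%/% 2` here so that the toy data yields more than one block), and each block's net sentiment is the number of positive matches minus the number of negative matches. The four-word lexicon (`toy_lexicon`) and the tokenised data (`toy_words`) are invented for illustration and are not the Bing lexicon or the study's data.

```r
# Hypothetical stand-in for the Bing lexicon
toy_lexicon <- data.frame(
  word      = c("happy", "bright", "dark", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical tokenised text: one row per word, with its source line number
toy_words <- data.frame(
  row_num = c(1, 1, 2, 3, 4, 4, 5),
  word    = c("happy", "the", "bright", "dark", "fear", "dark", "happy"),
  stringsAsFactors = FALSE
)

# Keep only words that appear in the lexicon (the role inner_join plays above)
scored <- merge(toy_words, toy_lexicon, by = "word")

# Integer division groups consecutive lines into blocks
scored$index <- scored$row_num %/% 2

# Count negative and positive words per block
counts <- table(
  index     = scored$index,
  sentiment = factor(scored$sentiment, levels = c("negative", "positive"))
)

# Net sentiment per block: positive matches minus negative matches
net <- counts[, "positive"] - counts[, "negative"]
print(net)
```

Blocks containing no lexicon words do not appear in the result at all, which is worth keeping in mind when reading gaps in the bar charts.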