ANLY540 - Analysis of Human Language

Objective

This research study analyzes the textual differences among the 30 chapters of the book ‘Thirty Strange Stories’ by H.G. Wells. Each of the 30 chapters is an individual story. The analysis covers word usage and co-lexemes, sentiment analysis, and topic modeling/clustering. Because all 30 stories are written by the same author, the study aims to understand the relationship between word usage and storyline.

Introduction

The author of Thirty Strange Stories is H.G. Wells, who is well known for books such as The War of the Worlds, The Invisible Man and The Time Machine. The text of the book was sourced from Project Gutenberg.

Below is a list of the 30 stories covered in this book:

The Strange Orchid
Æpyornis Island
The Plattner Story
The Argonauts Of The Air
The Story Of The Late Mr. Elvesham
The Stolen Bacillus
The Red Room
A Moth (Genus Unknown)
In The Abyss
Under The Knife
The Reconciliation
A Slip Under The Microscope
In The Avu Observatory
The Triumphs Of A Taxidermist
A Deal In Ostriches
The Rajah’s Treasure
The Story Of Davidson’s Eyes
The Cone
The Purple Pileus
A Catastrophe
Le Mari Terrible
The Apple
The Sad Story Of A Dramatic Critic
The Jilting Of Jane
The Lost Inheritance
Pollock And The Porroh Man
The Sea Raiders
In The Modern Vein
The Lord Of The Dynamos
The Treasure In The Forest

Relevant literature on natural language processing through textual analysis of various authors’ works will be reviewed.

Hypothesis / Problem Statement

Hypothesis

“In fictional writing, the plot of the story determines the author’s vocabulary usage”

Importance

Authors face the challenge of keeping the audience captivated across genres of writing such as adventure, fiction and romance. The study attempts to understand whether an author meets this challenge through versatility of word usage or by adding variety to the plot itself.

Statistical Analysis Plan

The research study intends to cover the following analyses:

Sentiment Analysis
Basic exploratory data analysis: Dispersion and Location
Topic Models
Word Frequency
Co-lexeme Analysis
Attraction and Reliance
PMI and Odds Ratio
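
As a hedged illustration of the last two items, the association measures can be computed from a 2x2 contingency table of a word against a slot (for example, a construction or a neighboring word). The counts below are made up, the cell names `n11` to `n22` are assumptions for illustration, and attraction and reliance follow the common percentage definitions, which may differ in detail from the variant used later in the study.

```r
# Hypothetical 2x2 counts (made-up numbers, not taken from the book)
#                    in slot    elsewhere
# target word        n11 = 30   n12 = 70
# all other words    n21 = 170  n22 = 9730
n11 <- 30; n12 <- 70; n21 <- 170; n22 <- 9730
n_total <- n11 + n12 + n21 + n22

# Attraction: percentage of the slot's tokens that are the target word
attraction <- 100 * n11 / (n11 + n21)

# Reliance: percentage of the target word's tokens that occur in the slot
reliance <- 100 * n11 / (n11 + n12)

# Pointwise mutual information: log2 of observed vs. expected joint probability
p_joint <- n11 / n_total
p_word  <- (n11 + n12) / n_total
p_slot  <- (n11 + n21) / n_total
pmi <- log2(p_joint / (p_word * p_slot))

# Odds ratio: odds of the slot for the target word vs. for all other words
odds_ratio <- (n11 * n22) / (n12 * n21)

round(c(attraction = attraction, reliance = reliance,
        pmi = pmi, odds_ratio = odds_ratio), 3)
```

A PMI above 0 and an odds ratio above 1 both indicate that the word and the slot co-occur more often than expected by chance.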

Method

Data

Source

This study uses text data from the book “Thirty Strange Stories” by H.G. Wells. The book is available from Project Gutenberg, and its text was retrieved for analysis with the ‘gutenbergr’ package in R.

Data Preparation

For data preparation, the first few rows, the last few rows, and all empty rows were removed from the dataset so that only the main text of each chapter remained. Additional variables for chapter number and line number were then created to prepare the main dataset. For secondary analysis, the text was unnested into individual words and stored in a separate dataset.

library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)

# download the entire book
data <- gutenberg_download(59774)

# Remove the first and last unwanted rows
data <- data[96:12231,]

# Flag lines written entirely in upper case (candidate chapter headings)
data$check <- data$text == toupper(data$text)

# filter out empty rows
data <- data %>% filter(text != "" )

# create row number
data <- data %>% mutate(row_num = row_number())

# remove incorrectly detected chapter headings
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]

# Create Chapter
data$chapter <- cumsum(data$check)

# Create a separate dataset of chapter headings
chapters_headings <- data %>% filter(check == TRUE) %>% rename(chapter_name = text) %>%
                      select(chapter, chapter_name)

# Clean up and join with chapter headings (explicit key avoids the join message)
data <- data %>% mutate(title = "Thirty Strange Stories", row_num = row_number()) %>%
        select(title, text, row_num, chapter) %>% left_join(chapters_headings, by = "chapter")

# Remove leading and trailing white spaces
data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE)
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)

# Words for initial analysis
words <- data %>% unnest_tokens(word, text)

# Remove leading and trailing white spaces from chapters_headings for later use
chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
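
The heading-detection step above can be tried on a small scale. The following is a minimal sketch on made-up lines (the `toy` data and its contents are hypothetical, not taken from the book): all-caps lines are flagged, and a running sum of the flags assigns a chapter number to every row.

```r
library(dplyr)

# Toy stand-in for the downloaded text (hypothetical lines)
toy <- tibble(text = c("THE FIRST STORY", "It began at dusk.", "",
                       "Nobody noticed.", "THE SECOND STORY", "It ended at dawn."))

toy <- toy %>%
  filter(text != "") %>%                  # drop empty rows
  mutate(check   = text == toupper(text), # TRUE for lines with no lower-case letters
         chapter = cumsum(check),         # running count of flagged headings
         row_num = row_number())

# Heading lookup table, then attach the heading to every row of its chapter
toy_headings <- toy %>%
  filter(check) %>%
  select(chapter, chapter_name = text)

toy <- toy %>% left_join(toy_headings, by = "chapter")
toy
```

Any line without lower-case letters is flagged, including lines such as "II." that consist only of numerals and punctuation, which is why false positives have to be removed before the cumulative sum is taken.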

Exploratory data analysis

Word Clouds

The Strange Orchid

words %>% filter(chapter == 1) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

Æpyornis Island

words %>% filter(chapter == 2) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Plattner Story

words %>% filter(chapter == 3) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Argonauts Of The Air

words %>% filter(chapter == 4) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Story Of The Late Mr. Elvesham

words %>% filter(chapter == 5) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Stolen Bacillus

words %>% filter(chapter == 6) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Red Room

words %>% filter(chapter == 7) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

A Moth (Genus Unknown)

words %>% filter(chapter == 8) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

In The Abyss

words %>% filter(chapter == 9) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

Under The Knife

words %>% filter(chapter == 10) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Reconciliation

words %>% filter(chapter == 11) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

A Slip Under The Microscope

words %>% filter(chapter == 12) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

In The Avu Observatory

words %>% filter(chapter == 13) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Triumphs Of A Taxidermist

words %>% filter(chapter == 14) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

A Deal In Ostriches

words %>% filter(chapter == 15) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Rajah’s Treasure

words %>% filter(chapter == 16) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Story Of Davidson’s Eyes

words %>% filter(chapter == 17) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Cone

words %>% filter(chapter == 18) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Purple Pileus

words %>% filter(chapter == 19) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

A Catastrophe

words %>% filter(chapter == 20) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

Le Mari Terrible

words %>% filter(chapter == 21) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Apple

words %>% filter(chapter == 22) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Sad Story Of A Dramatic Critic

words %>% filter(chapter == 23) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Jilting Of Jane

words %>% filter(chapter == 24) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Lost Inheritance

words %>% filter(chapter == 25) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

Pollock And The Porroh Man

words %>% filter(chapter == 26) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Sea Raiders

words %>% filter(chapter == 27) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

In The Modern Vein

words %>% filter(chapter == 28) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Lord Of The Dynamos

words %>% filter(chapter == 29) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

The Treasure In The Forest

words %>% filter(chapter == 30) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))
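
The thirty word clouds above all apply the same pipeline with a different chapter number. A minimal sketch of how the counting step could be factored into a helper follows; `top_words` and the toy token table are hypothetical names introduced for illustration, and the sketch returns the frequency table rather than drawing the cloud.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: most frequent non-stop-words in one chapter
top_words <- function(words, chapter_id, n_words = 50) {
  words %>%
    filter(chapter == chapter_id) %>%
    anti_join(stop_words, by = "word") %>%  # drop common function words
    count(word, sort = TRUE) %>%            # frequency per word, descending
    head(n_words)
}

# Toy token table (made-up tokens, not taken from the book)
toy_words <- tibble(
  chapter = c(1, 1, 1, 1, 2, 2),
  word    = c("orchid", "the", "orchid", "greenhouse", "island", "the")
)

chapter_1_counts <- top_words(toy_words, 1)
chapter_1_counts
```

With the real `words` dataset, the result for a chapter could be passed to `wordcloud(word, n, max.words = 50)` inside `with()`, and the call repeated over chapters 1 to 30 in a loop.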

Sentiment Analysis

# Net sentiment (positive minus negative word counts) per block of 20 lines
sentiment_books <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter_name, index = row_num %/% 20, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# Split the 30 stories into five groups of six for plotting
sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,], by = "chapter_name")
sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,], by = "chapter_name")
sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,], by = "chapter_name")
sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,], by = "chapter_name")
sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,], by = "chapter_name")

ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()
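
The net-sentiment calculation behind these plots can be traced by hand on a small example. The sketch below is illustrative only: `toy_lexicon` is a tiny made-up stand-in for the Bing lexicon and `toy_words` holds invented tokens, so the numbers say nothing about the book itself.

```r
library(dplyr)
library(tidyr)

# Tiny hypothetical lexicon standing in for get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("happy", "bright", "dark", "fear"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up tokens with the line number each one came from
toy_words <- tibble(
  row_num = c(1, 5, 12, 21, 25, 38),
  word    = c("happy", "dark", "bright", "fear", "dark", "happy")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%       # keep only lexicon words
  count(index = row_num %/% 20, sentiment) %>%   # lines 0-19 -> block 0, 20-39 -> block 1
  spread(sentiment, n, fill = 0) %>%             # one column per polarity
  mutate(sentiment = positive - negative)        # net sentiment per block

toy_sentiment
```

Block 0 contains two positive words and one negative word, giving a net score of 1; block 1 contains one positive and two negative words, giving -1. Each bar in the plots above is one such block.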

ANLY540 - Analysis of Human Language - Executive Session 3

Rabya Suleman, Sumana Samuk, and Suraj Kumaran

2019-07-28