Final Project

Introduction

Natural language processing (NLP) is one of the fastest-growing fields today, with applications across many domains. I work in the finance industry, and from my own research I found that NLP underlies advanced technologies such as virtual assistants (Alexa and Google Home). This paper applies tools learned in class, such as sentiment analysis, to a work of fiction. The work chosen is ‘Thirty Strange Stories’, a collection of 30 stories by H. G. Wells.

Hypothesis / Problem Statement

As noted above, this collection contains thirty different mystery stories. We use sentiment analysis to test whether a mystery story always represents negative sentiment.

Statistical Analysis Plan

Step one is to clean the data set using data processing techniques. Step two is exploratory data analysis. Step three is a sentiment analysis of each story to test the hypothesis that a mystery story always represents negative sentiment.

Method - Data - Variables

library(gutenbergr)
library(stringr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(tm)

## Loading required package: NLP

library(topicmodels)
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v readr   1.3.1
## v tibble  2.1.1     v purrr   0.3.2
## v ggplot2 3.2.1     v forcats 0.4.0

## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()

library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)

## Loading required package: RColorBrewer

library(Rling)
library(modeest)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(widyr)
library(tokenizers)

data <- gutenberg_download(59774)

## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org

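# Keep only the rows that hold the stories themselves.
data <- data[96:12231,]

# Flag lines that contain no lower-case letters; story titles are set in capitals.
data$check <- data$text == toupper(data$text)

# Drop empty lines.
data <- data %>% filter(text != "" )

data <- data %>% mutate(row_num = row_number())

# Drop rows by position; these were picked by inspection (likely flagged lines that are not story titles).
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]

# Number the stories: the running count goes up by one at each remaining title line.
data$chapter <- cumsum(data$check)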

chapters_headings <- filter(data, check == TRUE) %>% rename(chapter_name = text) %>% 
                      select(chapter, chapter_name) 

data <- data %>% mutate(title = "Thirty Strange Stories") %>% mutate(row_num = row_number()) %>% 
        select(title, text, row_num, chapter) %>% left_join(chapters_headings)

## Joining, by = "chapter"

data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE)
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)

words <- data %>% unnest_tokens(word, text)

chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
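
The story numbering above rests on two ideas: a title line is unchanged by toupper(), and cumsum() over that flag gives each line the number of the story it belongs to. The sketch below illustrates this on made-up lines; the toy data frame and its contents are assumptions for illustration and are not taken from the book.

```r
# Toy lines standing in for the book text (hypothetical, not from the corpus).
toy <- data.frame(
  text = c("THE FIRST TALE", "It was a dark night.", "Nothing stirred.",
           "THE SECOND TALE", "The sun rose at last."),
  stringsAsFactors = FALSE
)

# A line is flagged as a heading when upper-casing leaves it unchanged.
toy$check <- toy$text == toupper(toy$text)

# The running total of headings gives every line the number of its story.
toy$chapter <- cumsum(toy$check)

print(toy)
```

One caveat of this test is that any line without lower-case letters passes it, for example a line holding only a numeral, so flagged lines that are not titles have to be removed before the running total is taken.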

Exploratory data analysis

Frequently occurring words

words %>% 
  count(word, sort = TRUE)

## # A tibble: 10,893 x 2
##    word      n
##    <chr> <int>
##  1 the    7717
##  2 and    4558
##  3 of     3825
##  4 a      3177
##  5 to     2506
##  6 he     2350
##  7 in     1902
##  8 was    1894
##  9 i      1850
## 10 his    1752
## # ... with 10,883 more rows
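
The ten most frequent words are all function words, which say little about the stories themselves. One common remedy is to drop stop words before counting. The sketch below shows the idea in base R on made-up tokens with a tiny illustrative stop list; both are assumptions. In this analysis, the stop_words data set that ships with tidytext could likely serve the same purpose through anti_join().

```r
# Toy tokens (hypothetical); in the analysis these would come from words$word.
tokens <- c("the", "crystal", "egg", "and", "the", "moth", "of", "the", "egg")

# A tiny illustrative stop-word list; a real list is far longer.
stops <- c("the", "and", "of", "a", "to", "in")

# Keep content words only, then count them and sort by frequency.
content <- tokens[!tokens %in% stops]
freq <- sort(table(content), decreasing = TRUE)

print(freq)
```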

Collocates

word_pairs <- words %>% 
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)
word_pairs

## # A tibble: 13,584,579 x 3
##    item1 item2     n
##    <chr> <chr> <dbl>
##  1 the   of       30
##  2 the   in       30
##  3 of    in       30
##  4 the   it       30
##  5 of    it       30
##  6 in    it       30
##  7 the   a        30
##  8 of    a        30
##  9 in    a        30
## 10 it    a        30
## # ... with 13,584,569 more rows
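
Because pairwise_count() is keyed on chapter here, n is the number of stories in which both words appear, so very common words reach the maximum of 30. If collocates are meant in the narrower sense of words that stand next to each other, adjacent-word pairs (bigrams) are one way to look at them. The sketch below is a minimal base R illustration on made-up tokens, not the method used above; in tidytext, unnest_tokens() with token = "ngrams" and n = 2 would be the usual route.

```r
# Toy token sequence (hypothetical, not taken from the book).
tokens <- c("the", "crystal", "egg", "lay", "beside", "the", "crystal", "egg")

# Pair every token with its successor to form adjacent-word bigrams.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

# Count how often each bigram occurs, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

print(bigram_counts)
```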

Strongest pairs

keyword_cors <- words %>% 
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)
keyword_cors

## # A tibble: 33,930 x 3
##    item1     item2 correlation
##    <chr>     <chr>       <dbl>
##  1 aubrey    vair        1.   
##  2 thing     me          1    
##  3 still     face        1    
##  4 three     last        1    
##  5 something began       1    
##  6 again     are         1.000
##  7 before    from        1.000
##  8 before    they        1.000
##  9 from      they        1.000
## 10 thought   came        1.000
## # ... with 33,920 more rows

Network plot

library(ggplot2)
library(igraph)

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:Rling':
## 
##     normalize

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(ggraph)
keyword_cors %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

Sentiment Analysis

sentiment_books <- words %>%
  inner_join(get_sentiments("bing")) %>%
  count(chapter_name, index = row_num %/% 20, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining, by = "word"

sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,])

## Joining, by = "chapter_name"

sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,])

## Joining, by = "chapter_name"

sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,])

## Joining, by = "chapter_name"

sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,])

## Joining, by = "chapter_name"

sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,])

## Joining, by = "chapter_name"

ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
  theme_bw()

Statistical Analysis Results

Based on the above analysis, the majority of the 30 stories represent negative sentiment. However, not all of them do. Stories such as ‘The Triumphs of a Taxidermist’ and ‘Le Mari Terrible’ actually represent positive sentiment. This contradicts our hypothesis that a mystery story always represents negative sentiment.
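
The reading of the plots can be backed by a single number per story. The sketch below sums the bin scores within each story and reports the share of stories with a negative total. It uses a made-up table shaped like sentiment_books (one row per 20-line bin); the story names and scores are assumptions for illustration only, not results from the book.

```r
# Toy stand-in for sentiment_books: one row per 20-line bin (values are made up).
toy_sentiment <- data.frame(
  chapter_name = c("STORY A", "STORY A", "STORY B", "STORY B", "STORY C"),
  sentiment    = c(-3, -1, 2, 1, -4),
  stringsAsFactors = FALSE
)

# Sum the bin scores within each story to get one net score per story.
story_totals <- aggregate(sentiment ~ chapter_name, data = toy_sentiment, FUN = sum)

# Share of stories whose net score is below zero.
share_negative <- mean(story_totals$sentiment < 0)

print(story_totals)
print(share_negative)
```

Under this summary, the hypothesis that a mystery story is always negative would hold only if the share equals 1.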

Interpret and Discuss

We began with exploratory data analysis, which revealed the most commonly used words, the collocates, and the relationships between pairs of words in these stories. The sentiment analysis then contradicted the hypothesis that a mystery story always represents negative sentiment.
