NLP is one of the fastest-growing fields today, and its applications across domains are innumerable. I work in the finance industry, and my background research suggests that NLP is the foundational technology behind advanced products such as virtual assistants (Alexa and Google Home). This paper applies tools learned in class, such as sentiment analysis, to a work of fiction. The work chosen for this paper is ‘Thirty Strange Stories’, a collection of 30 stories by H. G. Wells.
As discussed, this literary collection includes thirty different mystery stories. We use sentiment analysis to test whether a mystery story always conveys negative sentiment.
Step one is to clean the data set with data-processing techniques. Step two is exploratory data analysis. Step three is sentiment analysis of each story, testing the hypothesis that a mystery story always conveys negative sentiment.
library(gutenbergr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tm)
## Loading required package: NLP
library(topicmodels)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v readr   1.3.1
## v tibble  2.1.1     v purrr   0.3.2
## v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(slam)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(Rling)
library(modeest)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(widyr)
library(tokenizers)
data <- gutenberg_download(59774) # download 'Thirty Strange Stories' from Project Gutenberg
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
data <- data[96:12231,]                       # drop Gutenberg front and back matter
data$check <- data$text == toupper(data$text) # all-caps lines mark story headings
data <- data %>% filter(text != "")           # remove blank lines
data <- data %>% mutate(row_num = row_number())
# Remove rows flagged by manual inspection (stray all-caps lines mis-detected as headings)
data <- data[-c(205,2823,3291,4194,4630,4631,5833,5923,5975,6064,6109,8864,8989,9137,9205),]
data$chapter <- cumsum(data$check)            # running story index
chapters_headings <- filter(data, check == TRUE) %>%
  rename(chapter_name = text) %>%
  select(chapter, chapter_name)
data <- data %>%
  mutate(title = "Thirty Strange Stories") %>%
  mutate(row_num = row_number()) %>%
  select(title, text, row_num, chapter) %>%
  left_join(chapters_headings)
## Joining, by = "chapter"
data <- data.frame(lapply(data, trimws), stringsAsFactors = FALSE) # trim stray whitespace in every column
data$row_num <- as.integer(data$row_num)
data$chapter <- as.integer(data$chapter)
words <- data %>% unnest_tokens(word, text) # tokenize: one word per row
chapters_headings <- data.frame(lapply(chapters_headings, trimws), stringsAsFactors = FALSE)
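As a quick sanity check on the cleaning step (a minimal sketch, not in the original code), the number of recovered headings should equal the thirty stories in the collection:
stopifnot(nrow(chapters_headings) == 30) # the collection contains exactly 30 stories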
Frequently occurring words
words %>%
count(word, sort = TRUE)
## # A tibble: 10,893 x 2
## word n
## <chr> <int>
## 1 the 7717
## 2 and 4558
## 3 of 3825
## 4 a 3177
## 5 to 2506
## 6 he 2350
## 7 in 1902
## 8 was 1894
## 9 i 1850
## 10 his 1752
## # ... with 10,883 more rows
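Unsurprisingly, the list is dominated by function words such as ‘the’ and ‘of’. To focus on content words instead, a common variant (sketched below, using tidytext’s built-in stop_words lexicon) removes stop words before counting:
words %>%
  anti_join(stop_words, by = "word") %>% # drop common function words
  count(word, sort = TRUE)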
Collocates
word_pairs <- words %>%
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)
word_pairs
## # A tibble: 13,584,579 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 the of 30
## 2 the in 30
## 3 of in 30
## 4 the it 30
## 5 of it 30
## 6 in it 30
## 7 the a 30
## 8 of a 30
## 9 in a 30
## 10 it a 30
## # ... with 13,584,569 more rows
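These top pairs carry little information: every stop word co-occurs with every other in all 30 chapters. The same stop-word filter, applied before counting pairs, would surface content-word collocates (a sketch, not part of the original analysis):
words %>%
  anti_join(stop_words, by = "word") %>%
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)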
Strongest pairs
keyword_cors <- words %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)
keyword_cors
## # A tibble: 33,930 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 aubrey vair 1.
## 2 thing me 1
## 3 still face 1
## 4 three last 1
## 5 something began 1
## 6 again are 1.000
## 7 before from 1.000
## 8 before they 1.000
## 9 from they 1.000
## 10 thought came 1.000
## # ... with 33,920 more rows
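A correlation of exactly 1 means two words occur in precisely the same set of chapters; ‘aubrey’ and ‘vair’, for instance, belong to the character Aubrey Vair, who appears in a single story. Dropping these degenerate pairs (a sketch) leaves the more informative associations:
keyword_cors %>%
  filter(correlation < 1) # keep only pairs whose chapter sets differ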
Network plot
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:Rling':
##
## normalize
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
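The "fr" (Fruchterman-Reingold) layout is stochastic, so fixing the random seed beforehand keeps the plot reproducible across knits (the seed value below is an arbitrary assumption):
set.seed(2016) # assumed value; any fixed seed makes the layout reproducible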
keyword_cors %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
  theme_void()
sentiment_books <- words %>%
  inner_join(get_sentiments("bing")) %>%                     # attach Bing positive/negative labels
  count(chapter_name, index = row_num %/% 20, sentiment) %>% # 20-line chunks within each story
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)                    # net sentiment per chunk
## Joining, by = "word"
# Split the thirty stories into five groups of six for readable facet plots
sentiment_books_1 <- sentiment_books %>% inner_join(chapters_headings[1:6,])
## Joining, by = "chapter_name"
sentiment_books_2 <- sentiment_books %>% inner_join(chapters_headings[7:12,])
## Joining, by = "chapter_name"
sentiment_books_3 <- sentiment_books %>% inner_join(chapters_headings[13:18,])
## Joining, by = "chapter_name"
sentiment_books_4 <- sentiment_books %>% inner_join(chapters_headings[19:24,])
## Joining, by = "chapter_name"
sentiment_books_5 <- sentiment_books %>% inner_join(chapters_headings[25:30,])
## Joining, by = "chapter_name"
ggplot(sentiment_books_1, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_2, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_3, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_4, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
ggplot(sentiment_books_5, aes(index, sentiment, fill = chapter_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~chapter_name, ncol = 2, scales = "free_x") +
theme_bw()
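To summarize across the 20-line chunks, a per-story net score (a small sketch built from the objects above) makes the hypothesis test explicit: any story with a positive total is a counterexample.
sentiment_books %>%
  group_by(chapter_name) %>%
  summarise(net_sentiment = sum(sentiment)) %>% # net Bing sentiment per story
  arrange(net_sentiment)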
Based on the above analysis, we find that the majority of the 30 stories convey negative sentiment. However, not all of them do: stories such as ‘The Triumphs of a Taxidermist’ and ‘Le Mari Terrible’ actually register positive sentiment. This disproves the hypothesis that a mystery story always conveys negative sentiment.
At the beginning of the research, exploratory data analysis revealed some of the most commonly used words, collocates, and the relationships between pairs of words in these stories. Sentiment analysis of each story then disproved that hypothesis.