Exploring the package syuzhet by Matt Jockers.
if(!require("syuzhet")) {
devtools::install_github("mjockers/syuzhet")
library("syuzhet")
}
library("magrittr")
library("rvest")
library("NLP")
library("rlist")
library("tidyr")
library("ggplot2")
library("stringr")
Load Damnation of Theron Ware, Moby Dick, Minister’s Wooing, Uncle Tom’s Cabin, and Norwood. Put them in a list and find the sentences. Cache them.
if(!file.exists("books.rda")) {
moby_dick <- "mobydick.txt" %>%
get_text_as_string()
theron_ware <- "theronware.txt" %>%
get_text_as_string()
norwood <- "https://archive.org/stream/norwood00beecgoog/norwood00beecgoog_djvu.txt" %>%
html() %>%
html_node("pre") %>%
html_text() %>%
as.String()
wooing <- "wooing.txt" %>%
get_text_as_string()
uncle_tom <- "uncletom.txt" %>%
get_text_as_string()
books <- list(moby_dick = moby_dick,
theron_ware = theron_ware,
norwood = norwood,
wooing = wooing,
uncle_tom = uncle_tom) %>%
lapply(get_sentences)
save(books, file = "books.rda")
} else {
load("books.rda")
}
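The compute-once-then-reload pattern above can be wrapped in a small helper. This is only a sketch: `cached()` and `slow_step()` are hypothetical names, not part of syuzhet, and it uses `saveRDS()`/`readRDS()` rather than the `save()`/`load()` pair above.

```r
# Hypothetical helper: compute a value once, then reuse the copy saved on disk.
cached <- function(path, compute) {
  if (file.exists(path)) {
    readRDS(path)            # reuse the cached result
  } else {
    value <- compute()       # run the expensive step only when needed
    saveRDS(value, path)
    value
  }
}

# Toy usage: a temporary file stands in for "books.rda".
cache_file <- tempfile(fileext = ".rds")
calls <- 0
slow_step <- function() {
  calls <<- calls + 1        # count how often the expensive step really runs
  c("First sentence.", "Second sentence.")
}

first  <- cached(cache_file, slow_step)  # computes and saves
second <- cached(cache_file, slow_step)  # reads the saved copy
identical(first, second)
calls
```

Because the second call reads from disk, `slow_step()` runs only once.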
Run the sentiment analysis using the bing, afinn, and nrc methods. I couldn’t get the stanford method to work, though I’ve never really tried to get the Stanford NLP tools working on this machine before.
multi_sentiment <- function(sentences) {
list(bing = get_sentiment(sentences, method = "bing"),
afinn = get_sentiment(sentences, method = "afinn"),
nrc = get_sentiment(sentences, method = "nrc")
# stanford = get_sentiment(sentences, method = "stanford",
# path_to_tagger = "/Applications/stanford-corenlp")
)
}
sentiment <- books %>%
lapply(multi_sentiment)
How do these novels compare to one another in summary terms?
sum_up_sentiment <- function(x) {
apply_sentiment <- function(vec) {
list(sum = sum(vec),
mean = mean(vec),
summary = summary(vec))
}
if(is.list(x))
lapply(x, apply_sentiment)
else
apply_sentiment(x)
}
sentiment %>%
lapply(sum_up_sentiment) %>%
list.unzip()
## $bing
## moby_dick theron_ware norwood wooing uncle_tom
## sum -888 474 2448 2376 120
## mean -0.09474021 0.07247706 0.2045284 0.5074754 0.01326847
## summary Numeric,6 Numeric,6 Numeric,6 Numeric,6 Numeric,6
##
## $afinn
## moby_dick theron_ware norwood wooing uncle_tom
## sum 1072 2357 6466 5430 2060
## mean 0.1143711 0.3603976 0.5402289 1.159761 0.2277753
## summary Numeric,6 Numeric,6 Numeric,6 Numeric,6 Numeric,6
##
## $nrc
## moby_dick theron_ware norwood wooing uncle_tom
## sum 2231 2565 5416 4942 2489
## mean 0.2380241 0.3922018 0.4525023 1.055532 0.2752101
## summary Numeric,6 Numeric,6 Numeric,6 Numeric,6 Numeric,6
It’s curious that Moby Dick has a negative mean and sum with bing and positive with afinn. In general, the afinn numbers are much higher than the bing numbers. I take this to mean that the numbers generated by the different methods are not meant to be compared. It’s also curious that afinn generates all positive numbers and bing only has one negative number. I’m not sure how to interpret that right now.
Nevertheless, the ranking by mean is consistent between bing and afinn: Moby Dick, Uncle Tom’s Cabin, Theron Ware, Norwood, and Wooing, from least to most positive. The nrc method gives different values, but on these numbers the ordering by mean is the same. It’s not surprising that Norwood and Wooing are on average more positive than the others.
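A quick way to check the ordering is to sort the means for each method. This sketch copies the means from the summary output above into a matrix by hand. The names `titles`, `means`, and `rankings` are only for illustration.

```r
# Mean sentiment per book, copied from the printed summary above.
titles <- c("moby_dick", "theron_ware", "norwood", "wooing", "uncle_tom")
means <- rbind(
  bing  = c(-0.09474021, 0.07247706, 0.2045284, 0.5074754, 0.01326847),
  afinn = c(0.1143711, 0.3603976, 0.5402289, 1.159761, 0.2277753),
  nrc   = c(0.2380241, 0.3922018, 0.4525023, 1.055532, 0.2752101)
)
colnames(means) <- titles

# For each method, list the books from lowest to highest mean sentiment.
rankings <- sapply(rownames(means), function(method) {
  names(sort(means[method, ]))
})
rankings
```

Each column of `rankings` is one method. On these numbers all three columns list the books in the same order: moby_dick, uncle_tom, theron_ware, norwood, wooing.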
Now let’s plot sentiment:
plot_sentiment <- function(x, title) {
plot(x,
type = "l",
main = title,
xlab = "Narrative time",
ylab = "Emotion",
# ylim = c(-1.5, 3.25) # roughly the min and the max
)
abline(h = 0, col = 3, lty = 2) # neutral sentiment
}
sentiment %>%
list.flatten() %>%
lapply(get_percentage_values) %>%
Map(plot_sentiment, ., names(.))
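The plots depend on reducing each book's sentence-level scores to a fixed number of points so that books of different lengths share an x-axis. The sketch below shows that general idea in base R. `bin_means()` is a hypothetical helper written for illustration and is not syuzhet's implementation of `get_percentage_values()`.

```r
# Hypothetical helper, not syuzhet's code: average a sentiment vector
# within `bins` roughly equal-sized consecutive chunks.
bin_means <- function(values, bins = 100) {
  chunk <- cut(seq_along(values), breaks = bins, labels = FALSE)
  as.numeric(tapply(values, chunk, mean))
}

# Toy data standing in for one book's sentence-level sentiment scores.
set.seed(1)
toy <- sample(-2:2, 1000, replace = TRUE)

binned <- bin_means(toy, bins = 10)
length(binned)
```

Averaging within chunks smooths out sentence-level noise, at the cost of hiding short, sharp swings inside a chunk.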
Not an expert on novels, but these results seem to make sense to me. A few observations:
Let’s try the NRC method where we get the data frame.
bind_pos <- function(df) {
# Add a position column so each sentence can be plotted in narrative order
pos <- data.frame(position = seq_len(nrow(df)))
cbind(df, pos)
}
plot_nrc <- function(df, title) {
ggplot(df,
aes(x = position, y = value, color = emotion)) +
geom_smooth(size = 2, se = FALSE) +
xlab("Narrative position") +
ylab("Prevalence") +
theme_classic() +
ggtitle(title)
}
Plot the different kinds of emotion:
books %>%
lapply(get_nrc_sentiment) %>%
lapply(bind_pos) %>%
lapply(gather, emotion, value, -position, -negative, -positive) %>%
Map(plot_nrc, ., names(.))
I think this kind of plot is potentially very useful. The prevalence of “trust” seems exaggerated. (Though these novels do have something to do with religion, so perhaps “faith” or the like creates this effect.)
Let’s try the sentiment analysis on the Tracts for the Times (list of titles). These theological texts aren’t novels, so maybe this is an interesting test to see whether this method has application to non-narrative texts. I’d really like to try this with the American Tract Society publications, which are often narrative, but I don’t have plain text (or even page images) for most publications.
files <- Sys.glob("~/dev/tracts-for-the-times/clean/*.txt")
tract_names <- files %>%
str_extract("tract\\d\\d")
tracts <- files %>%
lapply(get_text_as_string) %>%
lapply(get_sentences) %>%
lapply(get_sentiment, method = "bing")
## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string
## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string
## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string
names(tracts) <- tract_names
What kind of summary results do we get?
tracts %>%
lapply(sum_up_sentiment) %>%
list.unzip() %$%
sort(mean)
## tract59 tract55 tract83 tract03 tract02 tract39
## -0.62650602 -0.40853659 -0.33226837 -0.22485207 -0.16666667 -0.16000000
## tract53 tract23 tract62 tract36 tract38 tract57
## -0.15662651 -0.15000000 -0.12883436 -0.11764706 -0.11355311 -0.09090909
## tract19 tract79 tract47 tract51 tract48 tract46
## -0.08888889 -0.07513812 -0.06666667 -0.06194690 -0.05660377 -0.04285714
## tract71 tract15 tract06 tract33 tract45 tract82
## -0.04145078 -0.03623188 -0.03508772 0.01587302 0.02439024 0.03109656
## tract11 tract09 tract31 tract37 tract08 tract22
## 0.03952569 0.04166667 0.04166667 0.04166667 0.05454545 0.05649718
## tract50 tract01 tract20 tract41 tract32 tract07
## 0.10638298 0.11494253 0.12328767 0.12844037 0.13970588 0.15686275
## tract72 tract34 tract43 tract90 tract21 tract75
## 0.16310680 0.18803419 0.20670391 0.21106383 0.21359223 0.23626374
## tract61 tract40 tract49 tract89 tract13 tract28
## 0.23809524 0.24210526 0.24431818 0.25881470 0.26956522 0.27272727
## tract29 tract66 tract84 tract73 tract10 tract18
## 0.27840909 0.29090909 0.29230769 0.33246415 0.37662338 0.39252336
## tract63 tract52 tract74 tract12 tract44 tract35
## 0.40259740 0.41176471 0.41800643 0.41897233 0.42473118 0.43636364
## tract76 tract26 tract54 tract14 tract81 tract86
## 0.44034440 0.44047619 0.44444444 0.44680851 0.45293899 0.45620438
## tract25 tract27 tract56 tract60 tract80 tract87
## 0.45833333 0.47445255 0.54198473 0.55474453 0.56303419 0.59952607
## tract58 tract64 tract05 tract30 tract17 tract04
## 0.60000000 0.66666667 0.68309859 0.71153846 0.71794872 0.85858586
## tract16 tract24
## 0.94805195 1.04000000
Most negative 6 tracts:
Most positive tracts:
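The extremes can be read off the sorted means with `head()` and `tail()`. A minimal sketch follows. To keep it self-contained, `tract_means` holds only the twelve extreme values copied from the output above. With the real data, the full sorted vector would take its place.

```r
# The twelve extreme mean scores, copied from the sorted output above.
tract_means <- c(
  tract24 = 1.04000000,  tract59 = -0.62650602, tract16 = 0.94805195,
  tract55 = -0.40853659, tract04 = 0.85858586,  tract83 = -0.33226837,
  tract17 = 0.71794872,  tract03 = -0.22485207, tract30 = 0.71153846,
  tract02 = -0.16666667, tract05 = 0.68309859,  tract39 = -0.16000000
)

sorted <- sort(tract_means)

most_negative <- head(sorted, 6)        # lowest means first
most_positive <- rev(tail(sorted, 6))   # highest means first

names(most_negative)
names(most_positive)
```

On these values the most negative tracts are 59, 55, 83, 03, 02, and 39, and the most positive are 24, 16, 04, 17, 30, and 05.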
Plot the sentiments:
# plot_sentiment() returns NULL for each tract, so wrap the call in
# invisible() to keep the list of NULLs out of the output
invisible(
tracts %>%
lapply(get_percentage_values) %>%
Map(plot_sentiment, ., names(.))
)
There is a lot of noise in these texts, in part because they are so short, except for the very long Tract 90 and a few others which are more books than tracts. Some also have endnotes. Still, it’s at least suggestive of how this could be used with non-narrative texts.
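One possible way to reduce the noise from very short texts is to drop those below a minimum sentence count before plotting. This is only a sketch: the threshold of 100 is an arbitrary assumption, and `toy_tracts` is made-up data standing in for the real `tracts` list.

```r
# Hypothetical threshold; choose one that suits the corpus.
min_sentences <- 100

# Toy stand-in for `tracts`: one sentiment score per sentence.
toy_tracts <- list(
  short_tract = c(1, -1, 0),
  long_tract  = rep(c(1, 0, -1, 2), times = 50)
)

# Count sentences per text and keep only the texts that are long enough.
n_sentences <- sapply(toy_tracts, length)
long_enough <- toy_tracts[n_sentences >= min_sentences]

names(long_enough)
```

Filtering this way trades coverage for cleaner curves, so it is worth recording which texts were dropped.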
Finally, one last experiment. Let’s try the emotions in my dissertation just for the fun of it.
diss <- get_text_as_string("diss.txt") %>%
get_sentences()
diss %>%
get_sentiment(method = "bing") %>%
get_percentage_values() %>%
plot(type = "l",
xlab = "Narrative time",
ylab = "Emotion",
main = "Dissertation")
abline(h = 0, col = 3, lty = 2) # neutral sentiment
diss %>%
get_nrc_sentiment() %>%
bind_pos() %>%
gather(emotion, value, -position, -positive, -negative) %>%
plot_nrc("Dissertation NRC")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
I’m not surprised that chapters 3 and 4, on slavery and conversions away from Judaism, are emotionally negative, while the chapter on conversion to Catholicism (ch 5) is the most emotionally positive of the chapters.
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.1.2 (2014-10-31)
## system x86_64, darwin14.0.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat * 0.1 2013-12-06 CRAN (R 3.1.2)
## colorspace * 1.2-4 2013-09-30 CRAN (R 3.1.2)
## DBI * 0.3.1 2014-09-24 CRAN (R 3.1.2)
## devtools * 1.7.0 2015-01-17 CRAN (R 3.1.2)
## digest * 0.6.8 2014-12-31 CRAN (R 3.1.2)
## dplyr * 0.4.1 2015-01-14 CRAN (R 3.1.2)
## evaluate * 0.5.5 2014-04-29 CRAN (R 3.1.2)
## formatR * 1.0 2014-08-25 CRAN (R 3.1.2)
## ggplot2 1.0.0 2014-05-21 CRAN (R 3.1.2)
## gtable * 0.1.2 2012-12-05 CRAN (R 3.1.2)
## htmltools * 0.2.6 2014-09-08 CRAN (R 3.1.2)
## httr * 0.6.1 2015-01-01 CRAN (R 3.1.2)
## knitr * 1.9 2015-01-20 CRAN (R 3.1.2)
## labeling * 0.3 2014-08-23 CRAN (R 3.1.2)
## lattice * 0.20-29 2014-04-04 CRAN (R 3.1.2)
## lazyeval * 0.1.10 2015-01-02 CRAN (R 3.1.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.1.2)
## MASS * 7.3-37 2015-01-10 CRAN (R 3.1.2)
## Matrix * 1.1-5 2015-01-18 CRAN (R 3.1.2)
## mgcv 1.8-4 2014-11-27 CRAN (R 3.1.2)
## munsell * 0.4.2 2013-07-11 CRAN (R 3.1.2)
## nlme 3.1-119 2015-01-10 CRAN (R 3.1.2)
## NLP 0.1-6 2015-01-24 CRAN (R 3.1.2)
## openNLP * 0.2-3 2014-12-10 CRAN (R 3.1.2)
## openNLPdata * 1.5.3-1 2013-09-05 CRAN (R 3.1.2)
## plyr * 1.8.1 2014-02-26 CRAN (R 3.1.2)
## proto * 0.3-10 2012-12-22 CRAN (R 3.1.2)
## Rcpp * 0.11.4 2015-01-24 CRAN (R 3.1.2)
## reshape2 * 1.4.1 2014-12-06 CRAN (R 3.1.2)
## rJava * 0.9-6 2013-12-24 CRAN (R 3.1.2)
## rlist 0.4 2015-01-24 CRAN (R 3.1.2)
## rmarkdown * 0.5.1 2015-01-26 CRAN (R 3.1.2)
## rstudioapi * 0.2 2014-12-31 CRAN (R 3.1.2)
## rvest 0.2.0 2015-01-01 CRAN (R 3.1.2)
## scales * 0.2.4 2014-04-22 CRAN (R 3.1.2)
## stringi * 0.4-1 2014-12-14 CRAN (R 3.1.2)
## stringr 0.9.0.9000 2015-02-01 Github (hadley/stringr@a0f03f5)
## syuzhet 0.1.1 2015-01-31 Github (mjockers/syuzhet@69bc7da)
## tidyr 0.2.0 2014-12-05 CRAN (R 3.1.2)
## yaml * 2.1.13 2014-06-12 CRAN (R 3.1.2)