Exploring Syuzhet

Some experiments with the syuzhet package by Matthew Jockers.

if(!require("syuzhet")) {
  devtools::install_github("mjockers/syuzhet")
  library("syuzhet")
}
library("magrittr")
library("rvest")
library("NLP")
library("rlist")
library("tidyr")
library("ggplot2")
library("stringr")

Load The Damnation of Theron Ware, Moby Dick, The Minister’s Wooing, Uncle Tom’s Cabin, and Norwood. Put them in a list, split each into sentences, and cache the result.

if(!file.exists("books.rda")) {
  moby_dick <- "mobydick.txt" %>% 
    get_text_as_string()
  theron_ware <- "theronware.txt" %>% 
    get_text_as_string()
  norwood <- "https://archive.org/stream/norwood00beecgoog/norwood00beecgoog_djvu.txt" %>% 
    html() %>% 
    html_node("pre") %>% 
    html_text() %>% 
    as.String()
  wooing <- "wooing.txt" %>% 
    get_text_as_string()
  uncle_tom <- "uncletom.txt" %>%
    get_text_as_string()
  
  books <- list(moby_dick = moby_dick, 
                theron_ware = theron_ware, 
                norwood = norwood, 
                wooing = wooing, 
                uncle_tom = uncle_tom) %>% 
  lapply(get_sentences)

  save(books, file = "books.rda")
} else {
  load("books.rda")
}

Run the sentiment analysis using the bing, afinn, and nrc methods. I couldn’t get the stanford method to work, though I’ve never really tried to get the Stanford NLP tools working on this machine before.

multi_sentiment <- function(sentences) {
  list(bing  = get_sentiment(sentences, method = "bing"),
       afinn = get_sentiment(sentences, method = "afinn"),
       nrc   = get_sentiment(sentences, method = "nrc")
#        stanford = get_sentiment(sentences, method = "stanford", 
#                     path_to_tagger = "/Applications/stanford-corenlp")
       )
}

sentiment <- books %>% 
  lapply(multi_sentiment)
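
For intuition about what a dictionary method like these does, here is a minimal sketch in base R. It is not syuzhet’s implementation: the four-word lexicon and the `score_sentence()` helper are made up for illustration. The real lexicons are far larger, and afinn assigns weighted scores rather than just plus or minus one.

```r
# Hypothetical toy lexicon: word -> sentiment value (not a real lexicon)
toy_lexicon <- c(good = 1, happy = 1, bad = -1, terrible = -1)

# Score one sentence by summing the values of the words found in the lexicon
score_sentence <- function(sentence, lexicon) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  # words missing from the lexicon come back as NA and are dropped
  sum(lexicon[words], na.rm = TRUE)
}

toy_sentences <- c("A good and happy day.",
                   "A terrible storm.",
                   "The ship sailed.")

toy_scores <- vapply(toy_sentences, score_sentence, numeric(1),
                     lexicon = toy_lexicon, USE.NAMES = FALSE)
toy_scores
```

Each sentence gets one number, which is the same shape of result that `get_sentiment()` returns for a vector of sentences.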

How do these novels compare to one another in summary terms?

sum_up_sentiment <- function(x) {
  apply_sentiment <- function(vec) {
    list(sum = sum(vec),
         mean = mean(vec),
         summary = summary(vec))
  }
  
  if(is.list(x))
    lapply(x, apply_sentiment)
  else
    apply_sentiment(x)
}

sentiment %>% 
  lapply(sum_up_sentiment) %>% 
  list.unzip()

## $bing
##         moby_dick   theron_ware norwood   wooing    uncle_tom 
## sum     -888        474         2448      2376      120       
## mean    -0.09474021 0.07247706  0.2045284 0.5074754 0.01326847
## summary Numeric,6   Numeric,6   Numeric,6 Numeric,6 Numeric,6 
## 
## $afinn
##         moby_dick theron_ware norwood   wooing    uncle_tom
## sum     1072      2357        6466      5430      2060     
## mean    0.1143711 0.3603976   0.5402289 1.159761  0.2277753
## summary Numeric,6 Numeric,6   Numeric,6 Numeric,6 Numeric,6
## 
## $nrc
##         moby_dick theron_ware norwood   wooing    uncle_tom
## sum     2231      2565        5416      4942      2489     
## mean    0.2380241 0.3922018   0.4525023 1.055532  0.2752101
## summary Numeric,6 Numeric,6   Numeric,6 Numeric,6 Numeric,6

It’s curious that Moby Dick has a negative mean and sum with bing and positive with afinn. In general, the afinn numbers are much higher than the bing numbers. I take this to mean that the numbers generated by the different methods are not meant to be compared. It’s also curious that afinn generates all positive numbers and bing only has one negative number. I’m not sure how to interpret that right now.

Nevertheless, the ranking by mean is consistent across the methods, from least to most positive: Moby Dick, Uncle Tom’s Cabin, Theron Ware, Norwood, and Wooing. The nrc method gives different magnitudes but the same order. It’s not surprising that Norwood and Wooing are on average more positive than the others.
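
To compare the methods without comparing their raw scales, one option is to look only at rank order. A small sketch using the means printed above (the `means` matrix is typed in by hand from that output, and the variable names are mine):

```r
# Mean sentiment per book and method, copied from the summary output above
means <- matrix(
  c(-0.09474021, 0.07247706, 0.2045284, 0.5074754, 0.01326847,
     0.1143711,  0.3603976,  0.5402289, 1.159761,  0.2277753,
     0.2380241,  0.3922018,  0.4525023, 1.055532,  0.2752101),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("bing", "afinn", "nrc"),
                  c("moby_dick", "theron_ware", "norwood",
                    "wooing", "uncle_tom"))
)

# Rank the books within each method (1 = least positive)
ranks <- t(apply(means, 1, rank))
ranks

# TRUE when every method puts the books in the same order
same_order <- all(apply(ranks, 2, function(col) length(unique(col)) == 1))
same_order
```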

Now let’s plot sentiment:

plot_sentiment <- function(x, title) {
  plot(x,
       type = "l",
       main = title,
       xlab = "Narrative time",
       ylab = "Emotion",
       # ylim = c(-1.5, 3.25) # roughly the min and the max
       )
  abline(h = 0, col = 3, lty = 2) # neutral sentiment
}

sentiment %>% 
  list.flatten() %>% 
  lapply(get_percentage_values) %>% 
  Map(plot_sentiment, ., names(.))
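
As I understand it, `get_percentage_values()` splits the sentence-level scores into 100 chunks and averages each one, so that texts of different lengths share a common x axis. A rough base R sketch of that idea follows. The function name `percent_chunk_means()` is mine, and syuzhet’s own code may differ in its details.

```r
# Average a vector of scores within `bins` roughly equal-sized chunks
percent_chunk_means <- function(x, bins = 100) {
  # assign each position in the text to a chunk number from 1 to bins
  chunk <- cut(seq_along(x), breaks = bins, labels = FALSE)
  as.numeric(tapply(x, chunk, mean))
}

# Made-up sentence scores standing in for the output of get_sentiment()
set.seed(1)
fake_scores <- sample(c(-1, 0, 1), 2500, replace = TRUE)
chunked <- percent_chunk_means(fake_scores)
length(chunked)
```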

I’m not an expert on novels, but these results seem to make sense to me. A few observations:

All of these novels seem to display a large up-and-down variability. This could be displayed differently with a smoothing line, of course. But it does seem notable that sentiment does not proceed in a smooth arc.
Notice how sickeningly positive the Minister’s Wooing is. But it does seem to follow a similar pattern to the others, just in a relentless tone of positivity.
Norwood drones on for a few hundred pages with more or less the same cheerfulness, but then has to complicate things a little for the finish.
Damnation of Theron Ware is the only novel here I know well. I’m surprised by the results. The early spike makes sense: that’s the section where the husband and wife are happy in their new home and church. The early drop makes sense too: that’s when Ware becomes dissatisfied. I’m surprised though that the middle part of the novel appears mostly positive. I read it as rather dark. But perhaps it is a darkness presented ironically as positive? I could buy that interpretation, though I’d want to go back and look at the text more closely.
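
The smoothing line mentioned in the first observation could be approximated with a centered moving average. This is only a sketch in base R: `rolling_mean()` is a helper I made up, the window size is arbitrary, and the noisy vector is simulated rather than taken from any of the novels.

```r
# Centered moving average; the ends are NA where the window does not fit
rolling_mean <- function(x, window = 11) {
  stopifnot(window %% 2 == 1)  # an odd window keeps the average centered
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 2))
}

# Simulated "sentiment" curve: a slow arc plus sentence-level noise
set.seed(42)
noisy <- sin(seq(0, 2 * pi, length.out = 200)) + rnorm(200, sd = 0.5)
smoothed <- rolling_mean(noisy, window = 21)

# The smoothed series varies much less than the raw one
sd(noisy)
sd(smoothed, na.rm = TRUE)
```

The smoothed vector can be passed to `plot_sentiment()` in the same way as the percentage values.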

Let’s try the NRC method, which returns a data frame with a column for each emotion.

bind_pos <- function(df) {
  # add each sentence's position in the text as a column
  pos <- data.frame(position = seq_len(nrow(df)))
  cbind(df, pos)
}

plot_nrc <- function(df, title) {
  ggplot(df, aes(x = position, y = value, color = emotion)) +
    geom_smooth(size = 2, se = FALSE) +
    xlab("Narrative position") +
    ylab("Prevalence") +
    theme_classic() +
    ggtitle(title)
}

Plot the different kinds of emotion:

books %>% 
  lapply(get_nrc_sentiment) %>% 
  lapply(bind_pos) %>% 
  lapply(gather, emotion, value, -position, -negative, -positive) %>% 
  Map(plot_nrc, ., names(.))

I think this kind of plot is potentially very useful. The prevalence of “trust” seems exaggerated. (Though these novels do have something to do with religion, so perhaps “faith” or the like creates this effect.)
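
One way to put a number on the impression that “trust” dominates is to compute each emotion’s share of all emotion hits. The sketch below uses a small made-up data frame with the same column names as the NRC output. The counts are invented, so only the method carries over, not the result.

```r
emotions <- c("anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust")

# Invented counts for four sentences, shaped like the NRC data frame
toy_nrc <- data.frame(
  anger        = c(0, 1, 0, 0),
  anticipation = c(1, 0, 0, 1),
  disgust      = c(0, 0, 0, 0),
  fear         = c(0, 1, 0, 0),
  joy          = c(1, 0, 0, 1),
  sadness      = c(0, 1, 0, 0),
  surprise     = c(0, 0, 0, 1),
  trust        = c(2, 0, 1, 1),
  negative     = c(0, 2, 0, 0),
  positive     = c(2, 0, 1, 2)
)

# Share of all emotion hits that belong to each emotion
totals <- colSums(toy_nrc[, emotions])
shares <- totals / sum(totals)
round(sort(shares, decreasing = TRUE), 2)
```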

Let’s try the sentiment analysis on the Tracts for the Times (list of titles). These theological texts aren’t novels, so maybe this is an interesting test to see whether this method has application to non-narrative texts. I’d really like to try this with the American Tract Society publications, which are often narrative, but I don’t have plain text (or even page images) for most publications.

files <- Sys.glob("~/dev/tracts-for-the-times/clean/*.txt") 
tract_names <- files %>% 
  str_extract("tract\\d\\d")

tracts <- files %>% 
  lapply(get_text_as_string) %>% 
  lapply(get_sentences) %>% 
  lapply(get_sentiment, method = "bing")

## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string

names(tracts) <- tract_names
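
The `scan()` warnings above look like what happens when a file contains an unbalanced quotation mark, though I haven’t confirmed that this is the cause here. If it is, one workaround is to read the files with `readLines()`, which does not interpret quotes. This is a sketch only: `read_text_as_string()` is my own stand-in, not a syuzhet function, and the temporary file exists just to make the example runnable.

```r
# Read a whole file into a single string without interpreting quotes
read_text_as_string <- function(path) {
  paste(readLines(path, warn = FALSE, encoding = "UTF-8"), collapse = " ")
}

# A throwaway file containing an unbalanced double quote
tmp <- tempfile(fileext = ".txt")
writeLines(c("He said, \"Repent", "and believe."), tmp)

txt <- read_text_as_string(tmp)
txt
```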

What kind of summary results do we get?

tracts %>% 
  lapply(sum_up_sentiment) %>% 
  list.unzip() %$% 
  sort(mean)

##     tract59     tract55     tract83     tract03     tract02     tract39 
## -0.62650602 -0.40853659 -0.33226837 -0.22485207 -0.16666667 -0.16000000 
##     tract53     tract23     tract62     tract36     tract38     tract57 
## -0.15662651 -0.15000000 -0.12883436 -0.11764706 -0.11355311 -0.09090909 
##     tract19     tract79     tract47     tract51     tract48     tract46 
## -0.08888889 -0.07513812 -0.06666667 -0.06194690 -0.05660377 -0.04285714 
##     tract71     tract15     tract06     tract33     tract45     tract82 
## -0.04145078 -0.03623188 -0.03508772  0.01587302  0.02439024  0.03109656 
##     tract11     tract09     tract31     tract37     tract08     tract22 
##  0.03952569  0.04166667  0.04166667  0.04166667  0.05454545  0.05649718 
##     tract50     tract01     tract20     tract41     tract32     tract07 
##  0.10638298  0.11494253  0.12328767  0.12844037  0.13970588  0.15686275 
##     tract72     tract34     tract43     tract90     tract21     tract75 
##  0.16310680  0.18803419  0.20670391  0.21106383  0.21359223  0.23626374 
##     tract61     tract40     tract49     tract89     tract13     tract28 
##  0.23809524  0.24210526  0.24431818  0.25881470  0.26956522  0.27272727 
##     tract29     tract66     tract84     tract73     tract10     tract18 
##  0.27840909  0.29090909  0.29230769  0.33246415  0.37662338  0.39252336 
##     tract63     tract52     tract74     tract12     tract44     tract35 
##  0.40259740  0.41176471  0.41800643  0.41897233  0.42473118  0.43636364 
##     tract76     tract26     tract54     tract14     tract81     tract86 
##  0.44034440  0.44047619  0.44444444  0.44680851  0.45293899  0.45620438 
##     tract25     tract27     tract56     tract60     tract80     tract87 
##  0.45833333  0.47445255  0.54198473  0.55474453  0.56303419  0.59952607 
##     tract58     tract64     tract05     tract30     tract17     tract04 
##  0.60000000  0.66666667  0.68309859  0.71153846  0.71794872  0.85858586 
##     tract16     tract24 
##  0.94805195  1.04000000

The six most negative tracts:

59: “Church and State” (not happy with resistance from low church party)
55: “Bishop Wilson’s Meditations on His Sacred Office” (meditation on sin)
83: “Advent Sermons on Antichrist” (Advent is about repentance from sin)
3: “On Alterations in the Liturgy” (Not in favor of alterations)
2: “The Catholic Church” (opening belligerence)
39: “Bishop Wilson’s Form of Receiving Penitents” (again, penitence and sin)

Among the most positive tracts:

24: “The Scripture View of the Apostolical Commission” (definitely in favor of apostolicity, lots of Scripture quotations)
16: “Advent” (focus on the joyful part of Advent, though the judgment is there too)
4: “Adherence to the Apostolical Succession the safest Course” (still in favor of apostolicity)
17: “The Ministerial Commission, a Trust from Christ for the Benefit of His People” (in favor of high view of clerical duties)
64: “Bishop Bull on the Ancient Liturgies”
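
Lists like these can be pulled straight from a sorted vector of means with `head()` and `tail()`. A short sketch, using a hand-copied subset of the means printed above (the variable names are mine):

```r
# A few of the tract means from the output above, not the full set
tract_means <- c(tract59 = -0.62650602, tract55 = -0.40853659,
                 tract83 = -0.33226837, tract33 =  0.01587302,
                 tract04 =  0.85858586, tract16 =  0.94805195,
                 tract24 =  1.04000000)

sorted <- sort(tract_means)

most_negative <- head(sorted, 3)       # lowest means first
most_positive <- rev(tail(sorted, 3))  # highest means first

names(most_negative)
names(most_positive)
```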

Plot the sentiments:

tracts %>% 
  lapply(get_percentage_values) %>% 
  Map(plot_sentiment, ., names(.))

There is a lot of noise in these texts, in part because they are so short, except for the very long Tract 90 and a few others which are more books than tracts. Some also have endnotes. Still, it’s at least suggestive of how this could be used with non-narrative texts.

One last experiment: let’s try the emotions in my dissertation, just for the fun of it.

diss <- get_text_as_string("diss.txt") %>% 
  get_sentences()

diss %>%
  get_sentiment(method = "bing") %>%
  get_percentage_values() %>%
  plot(type = "l",
       xlab = "Narrative time",
       ylab = "Emotion",
       main = "Dissertation")
abline(h = 0, col = 3, lty = 2) # neutral sentiment

diss %>% 
  get_nrc_sentiment() %>%
  bind_pos() %>% 
  gather(emotion, value, -position, -positive, -negative) %>% 
  plot_nrc("Dissertation NRC")

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

I’m not surprised that chapters 3 and 4, on slavery and conversions away from Judaism, are emotionally negative, while the chapter on conversion to Catholicism (ch 5) is the most emotionally positive of the chapters.

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.1.2 (2014-10-31)
##  system   x86_64, darwin14.0.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York
## Packages ------------------------------------------------------------------
##  package     * version    date       source                           
##  assertthat  * 0.1        2013-12-06 CRAN (R 3.1.2)                   
##  colorspace  * 1.2-4      2013-09-30 CRAN (R 3.1.2)                   
##  DBI         * 0.3.1      2014-09-24 CRAN (R 3.1.2)                   
##  devtools    * 1.7.0      2015-01-17 CRAN (R 3.1.2)                   
##  digest      * 0.6.8      2014-12-31 CRAN (R 3.1.2)                   
##  dplyr       * 0.4.1      2015-01-14 CRAN (R 3.1.2)                   
##  evaluate    * 0.5.5      2014-04-29 CRAN (R 3.1.2)                   
##  formatR     * 1.0        2014-08-25 CRAN (R 3.1.2)                   
##  ggplot2       1.0.0      2014-05-21 CRAN (R 3.1.2)                   
##  gtable      * 0.1.2      2012-12-05 CRAN (R 3.1.2)                   
##  htmltools   * 0.2.6      2014-09-08 CRAN (R 3.1.2)                   
##  httr        * 0.6.1      2015-01-01 CRAN (R 3.1.2)                   
##  knitr       * 1.9        2015-01-20 CRAN (R 3.1.2)                   
##  labeling    * 0.3        2014-08-23 CRAN (R 3.1.2)                   
##  lattice     * 0.20-29    2014-04-04 CRAN (R 3.1.2)                   
##  lazyeval    * 0.1.10     2015-01-02 CRAN (R 3.1.2)                   
##  magrittr      1.5        2014-11-22 CRAN (R 3.1.2)                   
##  MASS        * 7.3-37     2015-01-10 CRAN (R 3.1.2)                   
##  Matrix      * 1.1-5      2015-01-18 CRAN (R 3.1.2)                   
##  mgcv          1.8-4      2014-11-27 CRAN (R 3.1.2)                   
##  munsell     * 0.4.2      2013-07-11 CRAN (R 3.1.2)                   
##  nlme          3.1-119    2015-01-10 CRAN (R 3.1.2)                   
##  NLP           0.1-6      2015-01-24 CRAN (R 3.1.2)                   
##  openNLP     * 0.2-3      2014-12-10 CRAN (R 3.1.2)                   
##  openNLPdata * 1.5.3-1    2013-09-05 CRAN (R 3.1.2)                   
##  plyr        * 1.8.1      2014-02-26 CRAN (R 3.1.2)                   
##  proto       * 0.3-10     2012-12-22 CRAN (R 3.1.2)                   
##  Rcpp        * 0.11.4     2015-01-24 CRAN (R 3.1.2)                   
##  reshape2    * 1.4.1      2014-12-06 CRAN (R 3.1.2)                   
##  rJava       * 0.9-6      2013-12-24 CRAN (R 3.1.2)                   
##  rlist         0.4        2015-01-24 CRAN (R 3.1.2)                   
##  rmarkdown   * 0.5.1      2015-01-26 CRAN (R 3.1.2)                   
##  rstudioapi  * 0.2        2014-12-31 CRAN (R 3.1.2)                   
##  rvest         0.2.0      2015-01-01 CRAN (R 3.1.2)                   
##  scales      * 0.2.4      2014-04-22 CRAN (R 3.1.2)                   
##  stringi     * 0.4-1      2014-12-14 CRAN (R 3.1.2)                   
##  stringr       0.9.0.9000 2015-02-01 Github (hadley/stringr@a0f03f5)  
##  syuzhet       0.1.1      2015-01-31 Github (mjockers/syuzhet@69bc7da)
##  tidyr         0.2.0      2014-12-05 CRAN (R 3.1.2)                   
##  yaml        * 2.1.13     2014-06-12 CRAN (R 3.1.2)

Lincoln Mullen

January 31, 2015