Text Mining
We start this assignment by producing a working version of the example used in chapter 2 of “Text Mining with R”.
Base Code
First, let’s simply reproduce the code so that everything is in the right place.
library(janeaustenr)  # austen_books()
library(dplyr)
library(stringr)
library(tidytext)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)  # one row per word
nrc_anger <- get_sentiments("nrc") %>%
  filter(sentiment == "anger")
DT::datatable(tidy_books %>%
                filter(book == "Emma") %>%
                inner_join(nrc_anger) %>%
                count(word, sort = TRUE),
              extensions = c("FixedColumns", "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE))
Here we brought in the Jane Austen corpus and the NRC lexicon from chapter 2 of “Text Mining with R”. The NRC license can be found here.
New Corpus - Moby Dick
Let’s move on to another famous NLP corpus, Herman Melville’s “Moby Dick”. For the uninitiated, a summary of the book is here.
We will be using the gutenbergr package to download the “Moby Dick” corpus. There are several options; id number 2701 has the full text and is in the public domain.
library(gutenbergr)  # gutenberg_metadata and gutenberg_download()

DT::datatable(gutenberg_metadata %>%
                filter(author == "Melville, Herman") %>%
                select(gutenberg_id, title, rights, has_text),
              extensions = c("FixedColumns", "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE))
Download the file.
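The download chunk itself isn’t shown in the original; a minimal sketch using gutenbergr’s gutenberg_download() with the id found above:

# download the full public-domain text of "Moby Dick" (Gutenberg id 2701)
moby_dick <- gutenberg_download(2701)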
Basic preprocessing to assign an index for the lines and remove the empty lines.
moby_dick <- moby_dick %>%
  filter(text != "") %>%             # drop blank lines
  rename(id = gutenberg_id) %>%      # re-use the id column...
  mutate(id = row_number())          # ...as a line index
Call Me Ishmael
The opening paragraph of Moby Dick is famous in its own right; if you have not read it, or have not read it in some time, you are encouraged to do so below.
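The chunk that printed the passage isn’t shown; a minimal sketch that locates the paragraph and prints it line by line (the 19-line span is an assumption read off the output below):

# hypothetical reconstruction: find the line where the paragraph starts
start <- which(str_detect(moby_dick$text, "^Call me Ishmael"))[1]
for (line in moby_dick$text[start:(start + 18)]) print(noquote(line))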
## [1] Call me Ishmael. Some years ago--never mind how long precisely--
## [1] having little or no money in my purse, and nothing particular
## [1] to interest me on shore, I thought I would sail about a little
## [1] and see the watery part of the world. It is a way I have
## [1] of driving off the spleen and regulating the circulation.
## [1] Whenever I find myself growing grim about the mouth;
## [1] whenever it is a damp, drizzly November in my soul; whenever I
## [1] find myself involuntarily pausing before coffin warehouses,
## [1] and bringing up the rear of every funeral I meet;
## [1] and especially whenever my hypos get such an upper hand of me,
## [1] that it requires a strong moral principle to prevent me from
## [1] deliberately stepping into the street, and methodically knocking
## [1] people's hats off--then, I account it high time to get to sea
## [1] as soon as I can. This is my substitute for pistol and ball.
## [1] With a philosophical flourish Cato throws himself upon his sword;
## [1] I quietly take to the ship. There is nothing surprising in this.
## [1] If they but knew it, almost all men in their degree, some time
## [1] or other, cherish very nearly the same feelings towards
## [1] the ocean with me.
Sentiment Analysis
First we will process this down by word, keeping the line and chapter numbers in case we want them later.
moby_dick_tokens <- moby_dick %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^CHAPTER [\\d]",
                                           ignore_case = TRUE)))) %>%
  mutate(chapter = chapter + 1) %>%  # start chapter numbering at 1 instead of 0
  unnest_tokens(word, text)          # one row per word
New Lexicon
From the “lexicon” package I am using the Augmented SentiWord Polarity Table; for reference, see page 29 in the documentation.
This package is not set up for the tidyverse the way bing, afinn, and nrc are, so we need to make some minor changes for it to work in a similar way. hash_sentiment_sentiword is the lexicon linked to above that I will be using, reshaped below.
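The reshaping chunk isn’t shown in the original; a minimal sketch, assuming the x/y key-column layout used by the lexicon package’s hash_sentiment_* tables:

library(lexicon)

# rename the x/y key columns to match tidytext's word/value convention
sentiword <- lexicon::hash_sentiment_sentiword %>%
  as_tibble() %>%
  rename(word = x, value = y)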
I’ll use a similar tactic to the one in “Text Mining with R” and aggregate the sentiment in 80-line chunks. I’ll plot the sentiment by chapter and hue the bars by 80-line chunk (there are 135 chapters in the book).
library(ggplot2)

senti_word <- moby_dick_tokens %>%
  inner_join(sentiword) %>%
  group_by(index = linenumber %/% 80, chapter) %>%  # 80-line chunks
  summarise(sentiment = sum(value)) %>%
  mutate(method = "SentiWord")

senti_word %>%
  ggplot(aes(chapter, sentiment, fill = index)) +
  geom_col() +
  theme(legend.position = "top")
SentiWord Lexicon Comparison
We can now compare the AFINN lexicon with SentiWord to see if the scores are similar in trend. The scale WILL be different, as SentiWord and AFINN use different scoring values; we are looking at the trend for comparison here, NOT the absolute values.
afinn_word <- moby_dick_tokens %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80, chapter) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "afinn")

twin_lex <- bind_rows(senti_word, afinn_word)

twin_lex %>%
  ggplot(aes(chapter, sentiment, fill = index)) +
  geom_col() +
  facet_wrap(~method, scales = "free_y", ncol = 1) +
  theme(legend.position = "top")
These are reasonably similar, but there are notable exceptions. The section between chapters 55 and 85 seems rather different. Let’s look at it more closely.
twin_lex %>%
  filter(chapter >= 55 & chapter <= 85) %>%
  ggplot(aes(chapter, sentiment, fill = index)) +
  geom_col() +
  facet_wrap(~method, scales = "free_y", ncol = 1) +
  theme(legend.position = "top")
So there is some confusion here between the two lexicons. Chapters 70-80 generally refer to the time after a whale has been captured: Ishmael describes the gruesome and often harrowing process of dismembering a whale and procuring its oil from fat and tissue. Chapter 71 talks at length about the various religious sects of Christianity popular at the time. It seems AFINN categorizes these lines as pretty negative whereas SentiWord is more neutral. We see a similar distinction in chapters 75-80, where there is further discussion of religion, the idea of karma, and the process of dismantling a whale; SentiWord scores these ideas as generally negative whereas AFINN sees them as positive. The two lexicons do agree quite well on chapter 81, in which a new boat and crew are routinely mocked for their lack of knowledge and naivete.
These differences are somewhat expected, considering Melville’s detail and nuance as well as his prose, sentence structure, and word choice.
Second New Lexicon
Let’s bring in one more lexicon that is very different: the SlangSD lexicon for more informal text. See the documentation for this lexicon here. As with SentiWord, it needs a little reshaping first (sketched below).
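Again, the reshaping chunk isn’t shown; a minimal sketch, assuming lexicon::hash_sentiment_slangsd follows the same x/y layout as the SentiWord table:

# rename the x/y key columns to match tidytext's word/value convention
slangword <- lexicon::hash_sentiment_slangsd %>%
  as_tibble() %>%
  rename(word = x, value = y)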
moby_dick_tokens %>%
  inner_join(slangword) %>%
  group_by(index = linenumber %/% 80, chapter) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "Slang SD") %>%
  ggplot(aes(chapter, sentiment, fill = index)) +
  geom_col() +
  facet_wrap(~method, scales = "free_y", ncol = 1) +
  theme(legend.position = "top")
Clearly the SlangSD lexicon is an example of a bad choice here, as it identifies nearly everything as negative; Melville and modern slang terms are very poorly aligned.
TF-IDF For More Important Words
This is the primary use case for TF-IDF: no one cares how frequent the words “the” and “and” are in a corpus. To understand more valuable words and their frequency, we need a systematic way to rank the words, and that is what TF-IDF does. Term frequency times inverse document frequency gives us a relative sense of how distinctively words are distributed over the documents (here, chapters).
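Concretely, for a term $t$ in chapter $d$ (treating each chapter as a document), tidytext’s bind_tf_idf() computes the standard form

$$\mathrm{tf\text{-}idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\!\left(\frac{N}{N_t}\right)$$

where $n_{t,d}$ is the count of $t$ in chapter $d$, $N$ is the number of chapters, and $N_t$ is the number of chapters in which $t$ appears.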
First we will rank the words without TF-IDF to demonstrate the issue.
# count of each word within each chapter
word_chapter_count <- moby_dick_tokens %>%
  count(chapter, word, sort = TRUE)

# total number of words in each chapter
chapter_totals <- word_chapter_count %>%
  group_by(chapter) %>%
  summarise(total = sum(n))

book_totals <- left_join(word_chapter_count,
                         chapter_totals,
                         by = "chapter") %>%
  rename(chapter_count = n,        # times the word appears in the chapter
         chapter_total = total)    # total words in the chapter
DT::datatable(book_totals,
              extensions = c("FixedColumns", "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE))
Not informative.
book_totals <- book_totals %>%
  bind_tf_idf(word, chapter, chapter_count) %>%
  arrange(desc(tf_idf))
DT::datatable(book_totals %>%
                select(chapter, word, chapter_total, tf_idf),
              extensions = c("FixedColumns", "FixedHeader"),
              options = list(scrollX = TRUE,
                             paging = TRUE,
                             fixedHeader = TRUE,
                             pageLength = 10))
Since there are 135 chapters, we will need to filter the results for higher values of tf_idf and word counts per chapter, because ranking word value this way isn’t a perfect science. We see above that “um”, “tum”, and “want” still rank pretty high, so we can filter out these words with some creative selection.
book_totals %>%
  filter(tf_idf > 0.01 & chapter_count > 10) %>%  # search criteria
  mutate(word = factor(word,                      # group words together
                       levels = rev(unique(word)))) %>%
  group_by(chapter) %>%
  top_n(20) %>%
  ungroup() %>%
  arrange(desc(tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = chapter)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "tf-idf") +
  theme(legend.position = "top") +
  coord_flip()
Ok, so we see clearly that Jonah, the biblical tale, ranks very high in several chapters of Moby Dick, as does “whiteness”, which is used to describe the whale. We also see many key characters: Peleg, Bildad, Queequeg, Pip, and Gabriel. Ahab is also on the list, but near the bottom, likely because his name is used so frequently throughout the book.