Samuel Langhorne Clemens, known worldwide by his pen name Mark Twain, was an American author regarded as one of the most influential writers in the English language. He authored more than 25 books. In this project I will be analysing four of his most famous works: The Adventures of Tom Sawyer, Adventures of Huckleberry Finn, Roughing It, and The Gilded Age.
I will download the novels from Project Gutenberg (a library of over 60,000 free eBooks) using the gutenbergr package.
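A minimal sketch of the download step, assuming the combined data frame is called twain_books (the name the later chunks use). ID 74 (Tom Sawyer) is confirmed by the output further down; 76 (Huckleberry Finn), 3177 (Roughing It), and 3178 (The Gilded Age) are my assumptions and worth double-checking against gutenberg_metadata.

library(gutenbergr)
library(dplyr)    # pipes and verbs used throughout
library(tidyr)    # separate()/unite()/spread() used later
library(ggplot2)  # all the plots below

# ID 74 is confirmed by the output below; the other three IDs are assumptions.
twain_books <- gutenberg_download(c(74, 76, 3177, 3178),
                                  meta_fields = "title")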
Now that we have the data, the next step is tokenization, where we split the text into individual words using unnest_tokens(). We then remove stop words with anti_join(), as sketched below.
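A minimal sketch of this step, assuming twain_books from above:

library(tidytext)

# Split each line of text into one word per row, then drop common stop words.
tidy_books <- twain_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")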
Let us have a quick look at the most frequently used words in Mark Twain’s novels.
tidy_books %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(10) %>%
  ggplot(aes(word, n, fill = -n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  ggtitle("The Most Common Words in Mark Twain's Novels")

Earlier we simply counted the occurrence of each word and listed the most frequent ones. But this doesn't provide much information on its own, as we are unaware of the context in which these words were used. We will therefore now work with tokens containing two words (bigrams) or three words (trigrams), which can help extract useful phrases and lead to additional insights.
## # A tibble: 516,058 x 3
##    gutenberg_id title                         bigram
##           <int> <chr>                         <chr>
##  1           74 The Adventures of Tom Sawyer the adventures
##  2           74 The Adventures of Tom Sawyer adventures of
##  3           74 The Adventures of Tom Sawyer of tom
##  4           74 The Adventures of Tom Sawyer tom sawyer
##  5           74 The Adventures of Tom Sawyer sawyer by
##  6           74 The Adventures of Tom Sawyer by mark
##  7           74 The Adventures of Tom Sawyer mark twain
##  8           74 The Adventures of Tom Sawyer twain samuel
##  9           74 The Adventures of Tom Sawyer samuel langhorne
## 10           74 The Adventures of Tom Sawyer langhorne clemens
## # ... with 516,048 more rows
In the steps below, we first split each bigram into its two component words, drop any bigram where either word is a stop word, and then unite the remaining pairs back together (the unite step is sketched just before the second tibble below).
bigrams_separated <- books_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigrams_filtered

## # A tibble: 47,257 x 4
##    gutenberg_id title                         word1     word2
##           <int> <chr>                         <chr>     <chr>
##  1           74 The Adventures of Tom Sawyer tom       sawyer
##  2           74 The Adventures of Tom Sawyer mark      twain
##  3           74 The Adventures of Tom Sawyer twain     samuel
##  4           74 The Adventures of Tom Sawyer samuel    langhorne
##  5           74 The Adventures of Tom Sawyer langhorne clemens
##  6           74 The Adventures of Tom Sawyer clemens   contents
##  7           74 The Adventures of Tom Sawyer contents  chapter
##  8           74 The Adventures of Tom Sawyer tom       aunt
##  9           74 The Adventures of Tom Sawyer aunt      polly
## 10           74 The Adventures of Tom Sawyer polly     decides
## # ... with 47,247 more rows
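The unite step mentioned above isn't shown in the original chunk; a one-line sketch that would produce the table below:

# Paste the two filtered words back into a single bigram column.
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united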
## # A tibble: 47,257 x 3
##    gutenberg_id title                         bigram
##           <int> <chr>                         <chr>
##  1           74 The Adventures of Tom Sawyer tom sawyer
##  2           74 The Adventures of Tom Sawyer mark twain
##  3           74 The Adventures of Tom Sawyer twain samuel
##  4           74 The Adventures of Tom Sawyer samuel langhorne
##  5           74 The Adventures of Tom Sawyer langhorne clemens
##  6           74 The Adventures of Tom Sawyer clemens contents
##  7           74 The Adventures of Tom Sawyer contents chapter
##  8           74 The Adventures of Tom Sawyer tom aunt
##  9           74 The Adventures of Tom Sawyer aunt polly
## 10           74 The Adventures of Tom Sawyer polly decides
## # ... with 47,247 more rows
After filtering, the next step is to find the bigrams most characteristic of each novel. We use tf-idf, which weights how often a bigram appears in one novel against how common it is across all four, so that bigrams distinctive to a single book score highest.
bigram_tf_idf <- bigrams_united %>%
  count(title, bigram) %>%
  bind_tf_idf(bigram, title, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(title) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(bigram = reorder(bigram, tf_idf)) %>%
  ggplot(aes(bigram, tf_idf, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ title, ncol = 2, scales = "free") +
  coord_flip() +
  labs(y = "tf-idf of bigram to novel",
       x = "")

We can see that the most characteristic bigrams in Mark Twain's novels are names such as “injun joe”, “mary jane”, “aunt sally”, “aunt polly”, and “miss watson”.
Finally, we will visualize the relationships among the words in these bigrams by plotting a network of bigrams with ggraph.
library(igraph)
library(ggraph)

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame()

set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  ggtitle("Network of Bigrams in Mark Twain's Novels")

In the word network we can see that quantitative words like “yards”, “miles”, “dollars”, and “feet” all share a common node, “hundred”. We can also see that salutations such as “aunt”, “miss”, and “col” are followed by names. Beyond those, many character names appear.
twain_books %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)

## # A tibble: 12,776 x 4
##    word1   word2    word3       n
##    <chr>   <chr>    <chr>   <int>
##  1 miss    mary     jane       15
##  2 salt    lake     city       15
##  3 genuine mexican  plug        8
##  4 ten     thousand dollars     8
##  5 hundred thousand dollars     7
##  6 twenty  thousand dollars     7
##  7 stephen dowling  bots        6
##  8 aye     aye      sir         5
##  9 de      river    en          5
## 10 gold    hill     news        5
## # ... with 12,766 more rows
The popular trigrams are “miss mary jane”, “salt lake city”, “genuine mexican plug”, and “ten/hundred/twenty thousand dollars”.
First, we will look at the words that contribute the most to positive and negative sentiment. We will use the Bing lexicon for this purpose.
library(reshape2)   # acast()
library(wordcloud)  # comparison.cloud()

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)

Here we first assign a sentiment to each word using the Bing lexicon and inner_join(). We then divide each novel into sections of 80 lines and define a column, index, to keep track of where we are in the narrative. Next we use spread() so that positive and negative sentiment sit in separate columns, and finally calculate a net sentiment (positive - negative). Now we can plot the net sentiment score against the index on the x-axis, which tracks narrative time in sections of text.
section_words <- twain_books %>%
  mutate(section = row_number()) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

mark_twain_sentiment <- section_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = section %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(mark_twain_sentiment, aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")

Next we will plot a stacked bar chart to find the proportion of positive and negative sentiment words in each of the four novels.
books_sent_count <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, sentiment, sort = TRUE) %>%
  ungroup()

book_pos <- books_sent_count %>%
  # Group by book
  group_by(title) %>%
  # Add a percent-positive column
  mutate(percent_positive = 100 * n / sum(n))

# Plot percent_positive vs. book, filled by sentiment
ggplot(book_pos, aes(title, percent_positive, fill = sentiment)) +
  geom_col() +
  ggtitle("Proportion of Positive/Negative Sentiment Scores")

All the novels have a greater proportion of negative sentiment; earlier, too, when we divided each novel into sections, most of the sections scored negative. Even so, I suspect many words are falsely labelled as positive, since unigram matching ignores the context a word appears in.
To verify this, we will reuse the bigrams collected earlier to find the most common words that are preceded by a negation term.
negation_words <- c("not", "no", "can't", "don't")

negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE)

negated_words %>%
  mutate(contribution = n * value,
         word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ggplot(aes(word2, contribution, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free", ncol = 2) +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words preceded by negation term") +
  ylab("Sentiment value * # of occurrences") +
  coord_flip() +
  ggtitle("Positive/Negative Words that Follow Negations")

Here we can see pairings like “can't help”, “don't like”, “no good”, and “not help”, which would otherwise be counted as positive.
We will use the NRC Word-Emotion Association Lexicon to add eight more sentiments (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) to our analysis.
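As a quick check, the lexicon can be inspected directly: it has ten categories in total, the eight emotions plus positive and negative (the latter two are filtered out in the chunks below).

get_sentiments("nrc") %>%
  distinct(sentiment)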
Before we find the major sentiments in the novels, let us first check how individual words contribute to the sentiments in the NRC lexicon.
tidy_books %>%
  count(word) %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(sentiment) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  facet_grid(~sentiment) +
  geom_col() + # one bar per word, scaled by its count
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank()) +
  xlab(NULL) + ylab(NULL) +
  ggtitle("Words From Mark Twain's Novels") +
  coord_flip() + # place the words on the y-axis
  theme(legend.position = "none")
books_sent_count2 <- tidy_books %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  count(sentiment, sort = TRUE) %>%
  ungroup()

books_sent_count2 %>%
  ggplot(aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle("Top Sentiments in Mark Twain's Novels using the NRC Lexicon")

Next we will look at how the sentiments are distributed within each of the four novels. Since "The Gilded Age" and "Roughing It" are relatively bigger novels in terms of word count, I have divided each word's contribution in a particular novel by the total contribution of all the words in that novel, so that all the novels are on a similar scale.
scores <- tidy_books %>%
  inner_join(get_sentiments("nrc"), by = c("word" = "word")) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  count(title, sentiment)

scores %>%
  group_by(title) %>%
  # Each sentiment's share of all sentiment words in that novel
  mutate(percent = 100 * n / sum(n)) %>%
  ggplot(aes(x = sentiment, y = percent, fill = sentiment)) +
  geom_col(width = 0.6) +
  coord_flip() +
  facet_wrap(~title, ncol = 4) +
  theme(legend.position = "none") +
  ggtitle("Comparison of Sentiments from Each of the Novels")

Interestingly, all the novels show a similar pattern for each sentiment. The only major difference is in "The Gilded Age", where the degree of joy is greater than the degrees of sadness and fear, which is not the trend followed by the other novels.
Earlier, too, we saw that "The Gilded Age" was relatively more positive than the other novels.
With that we come to the end of this project. Wow, that was so much fun; I learned a lot of new things along the way.
The source code for this project can be found here.