Text Mining

Sherlock Holmes



Introduction


Text mining is the process of deriving high-quality information from text.

Information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.

In our mining approach we will be using Sherlock Holmes books to mine useful information. The books can be downloaded using the gutenbergr package in R.

Initial Packages required

library(dplyr)
library(tm.plugin.webmining)
library(purrr)
library(tidytext)
library(gutenbergr)
library(ggplot2)
library(tidyr)
library(igraph)
library(ggraph)


Download Books for mining

Using gutenbergr::gutenberg_metadata, we can look up the gutenberg_id of books written by Arthur Conan Doyle. We take the first 5 books written by the author for our analysis, as sketched below.
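
A minimal sketch of such a lookup is shown below (illustrative only; the exact rows returned depend on the catalogue version bundled with gutenbergr):

#look up candidate book IDs for Arthur Conan Doyle (illustrative sketch)
gutenberg_metadata %>%
  filter(author == "Doyle, Arthur Conan", language == "en") %>%
  select(gutenberg_id, title) %>%
  head(10)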

We store these books in the doyle data frame. In this data frame each row contains the gutenberg_id of a book and a single line of text from that book, as seen in the output.

#download the first 5 books
doyle <- gutenberg_download(c(108,126,139,244,290))
head(doyle)
## # A tibble: 6 x 2
##   gutenberg_id text                             
##          <int> <chr>                            
## 1          108 THE RETURN OF SHERLOCK HOLMES,   
## 2          108 ""                               
## 3          108 A Collection of Holmes Adventures
## 4          108 ""                               
## 5          108 ""                               
## 6          108 by Sir Arthur Conan Doyle


Break text into words

Using the unnest_tokens(word, text) function we break the text into individual words. We use the anti_join(stop_words) function to remove stop words such as "a", "an", "the" and prepositions such as "to" and "from". The final data frame contains the gutenberg_id and each word found in that book.

#create tokens and remove stop words
clean_doyle <- doyle %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
head(clean_doyle)
## # A tibble: 6 x 2
##   gutenberg_id word      
##          <int> <chr>     
## 1          108 return    
## 2          108 sherlock  
## 3          108 holmes    
## 4          108 collection
## 5          108 holmes    
## 6          108 adventures


Plot the most frequent words

We count the occurrences of each word using the count() function and sort the resulting data frame in descending order of count. We then use the ggplot() function from the ggplot2 package to plot a bar graph of the most frequent words.

#plot most frequent words

clean_doyle %>%
  count(word, sort = TRUE) %>%
  filter(n > 250) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()


Plot the words of Joy

Using the "nrc" lexicon (obtained via get_sentiments()) we retrieve all words labelled with the sentiment "joy". We store these in the data frame nrc_joy.

# Words of Joy

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

head(nrc_joy)
## # A tibble: 6 x 2
##   word          sentiment
##   <chr>         <chr>    
## 1 absolution    joy      
## 2 abundance     joy      
## 3 abundant      joy      
## 4 accolade      joy      
## 5 accompaniment joy      
## 6 accomplish    joy


We join these words of joy with the words in our data frame. We then use the count() function to obtain the most "joyous" words and plot them using the ggplot() function.

clean_doyle %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)  %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()


Overall Sentiment

We now join the words in our data frame with the words in the "bing" lexicon. The bing lexicon contains a list of words with a positive or negative label for each word.

# overall sentiment for each book
overall_sentiment <- clean_doyle %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, gutenberg_id)


We then plot the total count of positive and negative words for each book to obtain its overall sentiment.

plot <- ggplot(overall_sentiment, aes(as.character(gutenberg_id), n, fill=sentiment))
plot <- plot + geom_bar(stat = "identity", position = 'dodge')+
  ggtitle("Negative and positive sentiments for each book") +
  xlab("Book ID ") + ylab("No. of words")

plot

As seen above, the overall sentiment for each book is negative. This is expected, as the books belong to the crime and mystery genre.
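
To make this concrete, we can compute the net sentiment (positive minus negative counts) per book. This is only a sketch using tidyr::spread(), which is already loaded; in newer tidyr versions pivot_wider() could be used instead:

#net sentiment per book: positive minus negative word counts (sketch)
overall_sentiment %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)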


Common positive / negative words

We have seen that the overall sentiment for each book is negative. We now try to obtain the most commonly used positive and negative words using the "bing" lexicon.

#Most common Positive and negative words

sentiment_word_count <- clean_doyle %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment) %>%
  ungroup()


sentiment_word_count %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()


N-grams

So far we have considered individual words and their relationships to sentiments. However, many interesting text analyses are based on the relationships between words: which words tend to follow others immediately, or which tend to co-occur within the same documents.

Unnest into bigrams

We can also use the unnest_tokens() function to tokenize text into consecutive sequences of words, called n-grams. We do this by adding the token = "ngrams" option to unnest_tokens() and setting n to the number of words we wish to capture in each n-gram, which in our case is 2 (a bigram).

doyle_bigrams <- doyle %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

Sort the Bi-grams in descending order

doyle_bigrams  %>%
  count(bigram, sort = TRUE)
## # A tibble: 140,475 x 2
##    bigram      n
##    <chr>   <int>
##  1 of the   2330
##  2 in the   1564
##  3 it was   1031
##  4 to the    975
##  5 it is     749
##  6 that i    724
##  7 i have    688
##  8 at the    663
##  9 and the   609
## 10 on the    604
## # ... with 140,465 more rows
#most of the bigrams are uninteresting common stop words, which we need to eliminate
#separate into word1 and word2
bigrams_split <- doyle_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

Filter out all the stop words in the bi-grams

#filter stop words
bigrams_filtered <- bigrams_split %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

We now sort the bi-grams and then filter for those with more than 20 occurrences.

#sort and display
most_freq_bigrams <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE) 

## filter for occurrences > 20
bigram_graph <-most_freq_bigrams%>%
  filter(n > 20) %>%
  graph_from_data_frame()
bigram_graph
## IGRAPH b104ff9 DN-- 28 16 -- 
## + attr: name (v/c), n (e/n)
## + edges from b104ff9 (vertex names):
##  [1] lord     ->john       sherlock ->holmes     professor->challenger
##  [4] professor->summerlee  dr       ->munro      jefferson->hope      
##  [7] dear     ->watson     john     ->roxton     baker    ->street    
## [10] south    ->america    stanley  ->hopkins    john     ->ferrier   
## [13] munro    ->sir        maple    ->white      human    ->race      
## [16] la       ->force

Visualizing a network of bigrams

Using the ggraph() function, we plot the most frequent bi-grams as a network, or "graph".

A graph can be constructed from a tidy data frame since it has three variables:

  • from: the node an edge is coming from
  • to: the node an edge is going towards
  • weight: a numeric value associated with each edge (in our case the bigram count n)
#plot graph

set.seed(123)

a <- grid::arrow(type = "closed", length = unit(.12, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "#FFA07A", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)+
  theme_minimal()

As seen above, "sherlock holmes" and "lord john" are the most frequent bi-grams, as they have the darkest edges.
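
As a quick sanity check of this reading, we can inspect the top rows of the sorted bigram counts, which should correspond to the darkest edges in the network:

#top filtered bigrams by count; these should match the darkest edges above
head(most_freq_bigrams)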

Conclusion

In conclusion, we can see how text mining has helped us gain valuable information from the data. Scanning through thousands of book pages to find meaning in them would have taken months, if not years. But with the help of R packages and computing power, we were able to gather these insights within seconds!