Text mining is the process of deriving high-quality information from text.
Information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.
In our mining approach we use the Sherlock Holmes books of Sir Arthur Conan Doyle to extract useful information. The books can be downloaded using the gutenbergr package in R.
library(dplyr)
library(tm.plugin.webmining)
library(purrr)
library(tidytext)
library(gutenbergr)
library(ggplot2)
library(tidyr)
library(igraph)
library(ggraph)
Using gutenbergr::gutenberg_metadata, we can look up the gutenberg_id of books written by Arthur Conan Doyle. We take the first five of these books for our analysis, as sketched below.
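A minimal lookup sketch: the author string "Doyle, Arthur Conan" and the language/has_text filters are assumptions about the metadata fields, not part of the original analysis.
#find Doyle's books and their gutenberg_ids in the Project Gutenberg metadata
gutenberg_metadata %>%
  filter(author == "Doyle, Arthur Conan", language == "en", has_text) %>%
  select(gutenberg_id, title)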
We store these books in the doyle data frame. Each row contains the gutenberg_id of a book and a single line of text from that book, as seen in the output below.
#download first five books
doyle <- gutenberg_download(c(108,126,139,244,290))
head(doyle)
## # A tibble: 6 x 2
## gutenberg_id text
## <int> <chr>
## 1 108 THE RETURN OF SHERLOCK HOLMES,
## 2 108 ""
## 3 108 A Collection of Holmes Adventures
## 4 108 ""
## 5 108 ""
## 6 108 by Sir Arthur Conan Doyle
Using the unnest_tokens(word, text) function we split the text into individual words (tokens). We then use anti_join(stop_words) to remove stop words such as "a", "an", "the" and prepositions like "to" and "from". The final data frame contains the gutenberg_id and each word found in that book.
#create tokens and remove stop words
clean_doyle <- doyle %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
head(clean_doyle)
## # A tibble: 6 x 2
## gutenberg_id word
## <int> <chr>
## 1 108 return
## 2 108 sherlock
## 3 108 holmes
## 4 108 collection
## 5 108 holmes
## 6 108 adventures
We count the occurrences of each word with count(word, sort = TRUE), which also sorts the resulting data frame in descending order of frequency. We then use ggplot() from the ggplot2 package to plot the most frequent words as a bar graph.
#plot most frequent words
clean_doyle %>%
count(word, sort = TRUE) %>%
filter(n > 250) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Using the NRC sentiment lexicon, accessed with get_sentiments("nrc"), we obtain all words labelled with the sentiment "joy" and store them in the data frame nrc_joy.
# Words of Joy
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
head(nrc_joy)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
We join these joy words with the words in our data frame, use count() to find the most frequent "joyous" words, and then plot them with ggplot().
clean_doyle %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
We now join the words in our data frame with the Bing sentiment lexicon, which contains a list of words with a positive or negative label for each word.
# overall sentiment for each book
overall_sentiment <- clean_doyle %>%
inner_join(get_sentiments("bing")) %>%
count(sentiment, gutenberg_id)
We then plot the count of total positive and negative words for each book to obtain the overall sentiment for each book.
plot <- ggplot(overall_sentiment, aes(as.character(gutenberg_id), n, fill=sentiment))
plot <- plot + geom_bar(stat = "identity", position = 'dodge')+
ggtitle("Negative and positive sentiments for each book") +
xlab("Book ID ") + ylab("No. of words")
plot
As seen above, the overall sentiment for each book is negative. This is expected, as the books belong to the crime and mystery genre.
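One way to make this explicit numerically is to compute a net score per book. This is a small sketch building on the overall_sentiment data frame above; the column name net_sentiment is ours, not from the original analysis.
#net sentiment per book: positive word count minus negative word count
overall_sentiment %>%
  group_by(gutenberg_id) %>%
  summarise(net_sentiment = sum(ifelse(sentiment == "positive", n, -n))) %>%
  arrange(net_sentiment)
If the bar chart is right, every book should come out with a negative net_sentiment.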
Having seen that the overall sentiment for each book is negative, we now obtain the most commonly used positive and negative words using the Bing lexicon.
#Most common Positive and negative words
sentiment_word_count <- clean_doyle %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment) %>%
ungroup()
sentiment_word_count %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
So far we have considered individual words and their relationships to sentiments. However, many interesting text analyses are based on the relationships between words: which words tend to follow others immediately, or which tend to co-occur within the same documents.
We can also use the unnest_tokens() function to tokenize the text into consecutive sequences of words, called n-grams, by adding the token = "ngrams" option and setting n to the number of words in each n-gram. In our case n = 2, giving bigrams.
doyle_bigrams <- doyle %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
We count the bigrams and sort them in descending order of frequency.
doyle_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 140,475 x 2
## bigram n
## <chr> <int>
## 1 of the 2330
## 2 in the 1564
## 3 it was 1031
## 4 to the 975
## 5 it is 749
## 6 that i 724
## 7 i have 688
## 8 at the 663
## 9 and the 609
## 10 on the 604
## # ... with 140,465 more rows
#most of the bigrams are uninteresting common stop words, which we need to eliminate
#separate into word1 and word2
bigrams_split <- doyle_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
We filter out bigrams in which either word is a stop word.
#filter stop words
bigrams_filtered <- bigrams_split %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
We now count the filtered bigrams, sort them, and keep only those with more than 20 occurrences.
#sort and display
most_freq_bigrams <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
#filter for occurrences > 20
bigram_graph <- most_freq_bigrams %>%
filter(n > 20) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH b104ff9 DN-- 28 16 --
## + attr: name (v/c), n (e/n)
## + edges from b104ff9 (vertex names):
## [1] lord ->john sherlock ->holmes professor->challenger
## [4] professor->summerlee dr ->munro jefferson->hope
## [7] dear ->watson john ->roxton baker ->street
## [10] south ->america stanley ->hopkins john ->ferrier
## [13] munro ->sir maple ->white human ->race
## [16] la ->force
Using the ggraph() function, we plot the most frequent bigrams as a network, or "graph".
A graph can be constructed from a tidy object since it has three variables: from (word1), to (word2), and a weight (n, the number of occurrences).
#plot graph
set.seed(123)
a <- grid::arrow(type = "closed", length = unit(.12, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "#FFA07A", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)+
theme_minimal()
As seen above, "sherlock holmes" and "lord john" are among the most frequent bigrams, as indicated by their darker edges.
In conclusion, we can see how text mining has helped us gain valuable information from the data. Scanning through thousands of book pages to find meaning in them would have taken months, if not years. But with the help of R packages and computing power, we were able to gather these insights within seconds!