Text Mining and N-Grams

This analysis was originally prepared on March 25, 2015 for my Milestone Report in the Data Science Specialization Capstone on Coursera.

Synopsis

The capstone project will demonstrate this data scientist’s ability to process and analyze large volumes of unstructured text. As a final deliverable, the data scientist will develop an algorithm that predicts the next word in a provided text, similar to the predictive text features found on today’s smartphones.

This report demonstrates the data scientist’s ability to successfully import the text data into R, provide basic summary statistics, and explain the planned steps for producing an algorithm for text prediction.

Getting and Cleaning Data

Three text files have been provided for machine learning.
- a collection of tweets
- a collection of blog entries
- a collection of news items

Each is loaded into R. Details for the R session and sourced functions are listed in the Appendix.

# file paths for each file are hidden
blog <- readLines(file.blogs, skipNul = TRUE)
twitter <- readLines(file.twitter, skipNul = TRUE)
news <- readLines(file.news, skipNul = TRUE)

Data Summary

A summary of the full files is provided prior to random sampling.

Calculations

The text sources are put into a list and traversed to calculate length and word count.

# helper function to count number of words in a list element
f.word.count <- function(my.list) { sum(stringr::str_count(my.list, "\\S+")) }
# data frame to store counts
df <- data.frame(text.source = c("blog", "twitter", "news"), line.count = NA, word.count = NA)
# put corpora (not yet of class Corpus) into a list
my.list <- list(blog = blog, twitter = twitter, news = news)
# get line count and word count for each corpus
df$line.count <- sapply(my.list, length)
df$word.count <- sapply(my.list, f.word.count)

Line Count and Word Count for Each Corpus (Text Collection)

# plot prep
g.line.count <- ggplot(df, aes(x = factor(text.source), y = line.count/1e+06))
g.line.count <- g.line.count + geom_bar(stat = "identity") +
  labs(y = "# of lines/million", x = "text source", title = "Count of lines per Corpus") 
# g.line.count
g.word.count <- ggplot(df, aes(x = factor(text.source), y = word.count/1e+06))
g.word.count <- g.word.count + geom_bar(stat = "identity") + 
  labs(y = "# of words/million", x = "text source", title = "Count of words per Corpus")

These plots show the number of entries (lines) and the number of words per corpus (text source). Each corpus has at least 800,000 lines of text (blog entries, tweets, or news items) and at least 30 million words.
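
The two plots can be arranged side by side for display; a minimal sketch using gridExtra::grid.arrange (gridExtra is attached in the session info), though not necessarily how the original report rendered them:

# display the line-count and word-count plots side by side (sketch)
gridExtra::grid.arrange(g.line.count, g.word.count, ncol = 2)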

Word Frequency

This section shows the steps taken to find the most frequent words in each corpus. The blog, news, and twitter corpora are prepared and explored individually.

Random Sampling

Analyzing each corpus in its entirety is not necessary when valid results can be obtained through random sampling. Thus, prior to exploring word frequencies, a random sample is taken from each corpus.

# create a data frame for samples
sample.df <- data.frame(text.source = c("blog", "twitter", "news"),
                 line.count = NA, word.count = NA)
# create a list of random 0/1 indicators, one per line of each corpus
set.seed(324)
percent <- 0.05
randoms <- lapply(my.list, function(x) rbinom(length(x), 1, percent))
# create a new, empty list to store random selections
sample.list <- list(blog = NA, twitter = NA, news = NA)
# traverse each element of the original list, keeping the ~5% of lines
# flagged by rbinom
for (i in 1:length(my.list)) {
  sample.list[[i]] <- my.list[[i]][randoms[[i]] == 1]
}
# get counts of sample.list
sample.df$line.count <- sapply(sample.list, length)
sample.df$word.count <- sapply(sample.list, f.word.count)

Random Sample Counts

Here are the counts for the sample sets. Each sample contains about 5% of the lines in its original corpus.

##   text.source line.count word.count
## 1        blog      45238    1881800
## 2     twitter     117859    1516498
## 3        news      50515    1726275
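
As a quick sanity check on the ~5% sampling rate, the sampled line counts can be compared to the full counts; a minimal sketch, assuming df and sample.df keep the same row order:

# sampled lines as a percentage of each full corpus (sketch)
round(100 * sample.df$line.count / df$line.count, 1)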

Preparing Corpora for Word Analysis

At this stage of preliminary analysis, the sampled text collections are converted to a single tm Corpus object and the following transformations are performed.
- For tweets only:
  - hashtags (the # sign and the accompanying word) and Twitter handles (the @ sign and the accompanying word) are removed from the tweet corpus
- For all corpora:
  - text is converted to lower case
  - URLs are removed
  - curse words are removed
  - numbers are removed
  - high-frequency words such as “the”, “is”, “at”, etc. (collectively known as stop words) are removed
  - remaining punctuation is removed

These data cleansing steps are appropriate at this stage of preliminary analysis, but not all these steps will be used in the final preparation for use in natural language prediction. For example, stop words will be retained in the prediction algorithm, as the goal of the final deliverable is to mimic natural language as closely as possible.

### helper functions
removeURL <- function(x) gsub("http:[[:alnum:]]*", "", x) # note: strips "http:" plus adjacent alphanumerics only
removeHashTags <- function(x) gsub("#\\S+", "", x)
removeTwitterHandles <- function(x) gsub("@\\S+", "", x)
### create Corpus object
text.corpus <- tm::Corpus(VectorSource(sample.list))
rm(sample.list)
# remove twitter handles and hashtags
text.corpus["twitter"] <- tm::tm_map(text.corpus["twitter"], 
                          content_transformer(removeHashTags))
text.corpus["twitter"] <- tm::tm_map(text.corpus["twitter"], 
                          content_transformer(removeTwitterHandles))
# other transformations
text.corpus <- tm::tm_map(text.corpus, content_transformer(tolower))
text.corpus <- tm::tm_map(text.corpus, removeNumbers)
# cursewords file loaded locally
text.corpus <- tm::tm_map(text.corpus, removeWords, cursewords)
text.corpus <- tm::tm_map(text.corpus, content_transformer(removeURL))
text.corpus <- tm::tm_map(text.corpus, removePunctuation)
text.corpus <- tm::tm_map(text.corpus, removeWords, stopwords("english"))

Next, each corpus is put into its own term-document matrix, ready for further analysis. Words shorter than three characters are omitted.

## unigram (single-word) term-document matrices
twitterTdm <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(wordLengths = c(3,Inf)))
blogTdm <- tm::TermDocumentMatrix(text.corpus["blog"], control = list(wordLengths = c(3,Inf)))
newsTdm <- tm::TermDocumentMatrix(text.corpus["news"], control = list(wordLengths = c(3,Inf)))

Word Analysis

The corpora are now ready to be explored for distinct word counts and most frequent words.

Distinct Words per Corpus

# put word count from term-document matrices into data frames
freq.news <- data.frame(word = newsTdm$dimnames$Terms, frequency = newsTdm$v)
freq.blog <- data.frame(word = blogTdm$dimnames$Terms, frequency = blogTdm$v)
freq.twitter <- data.frame(word = twitterTdm$dimnames$Terms, frequency = twitterTdm$v)
# reorder by decreasing frequency
freq.news <- plyr::arrange(freq.news, -frequency)
freq.blog <- plyr::arrange(freq.blog, -frequency)
freq.twitter <- plyr::arrange(freq.twitter, -frequency)

In the blog random sample (about 5%), there are 71,041 distinct words and 34,185 distinct words occurring two or more times.
In the news random sample (about 5%), there are 70,871 distinct words and 36,317 distinct words occurring two or more times.
In the twitter random sample (about 5%), there are 68,740 distinct words and 26,428 distinct words occurring two or more times.
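
These counts can be read off the frequency data frames built above; a minimal sketch of one way to compute them (not necessarily the original calculation):

# distinct words, and words occurring two or more times, per corpus (sketch)
sapply(list(blog = freq.blog, news = freq.news, twitter = freq.twitter),
       function(x) c(distinct = nrow(x), two.or.more = sum(x$frequency >= 2)))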

Most Frequent Terms

n <- 25L # variable to set top n words
# isolate top n words by decreasing frequency
blog.top <- freq.blog[1:n, ]
news.top <- freq.news[1:n, ]
twitter.top <- freq.twitter[1:n, ]
# reorder levels so charts plot in order of frequency
blog.top$word <- reorder(blog.top$word, blog.top$frequency)
news.top$word <- reorder(news.top$word, news.top$frequency)
twitter.top$word <- reorder(twitter.top$word, twitter.top$frequency)
# plots
g.blog.top <- ggplot(blog.top, aes(x = word, y = frequency))
g.blog.top <- g.blog.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent: Blog")

g.news.top <- ggplot(news.top, aes(x = word, y = frequency))
g.news.top <- g.news.top + geom_bar(stat = "identity") + coord_flip() +
  labs(title = "Most Frequent: News")

g.twitter.top <- ggplot(twitter.top, aes(x = word, y = frequency))
g.twitter.top <- g.twitter.top + geom_bar(stat = "identity") + coord_flip() + 
  labs(title = "Most Frequent: Twitter")

These plots display the 25 most frequent terms in each corpus.

df.intersect <- data.frame(word = Reduce(intersect, list(blog.top$word, news.top$word, twitter.top$word)))
df.intersect <- plyr::arrange(df.intersect, word)

These 11 words, listed alphabetically, are found in all three top-25 lists.

##      word
## 1    back
## 2     can
## 3     get
## 4    just
## 5    like
## 6     new
## 7     now
## 8     one
## 9  people
## 10   time
## 11   will

Prediction Algorithm Plans

Moving forward, the project goal is to develop a natural language prediction algorithm and app. For example, if a user were to type, “I want to go to the …”, the app would suggest the three most likely words that would replace “…”.

N-gram Dictionary

While the word analysis performed in this document is helpful for initial exploration, the data scientist will need to construct a dictionary of bigrams, trigrams, and four-grams, collectively called n-grams. Bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases. Here is an example of trigrams from the randomly sampled twitter corpus. Recall that stop words have been removed, so the phrases may look choppy. In the final dictionary, stop words and words of any length will be retained.
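
The TrigramTokenizer used below is not defined in the main text; a minimal sketch of one possible definition, built from the ngram_tokenizer function sourced in the Appendix (RWeka-based versions appear there as well, commented out):

# one possible trigram tokenizer (sketch), built on the Appendix's ngram_tokenizer
TrigramTokenizer <- ngram_tokenizer(3L)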

# tokenize into tri-grams
trigram.twitterTdm <- tm::TermDocumentMatrix(text.corpus["twitter"], control = list(tokenize = TrigramTokenizer))
# put into data frame
freq.trigram.twitter <- data.frame(word = trigram.twitterTdm$dimnames$Terms, frequency = trigram.twitterTdm$v)
# reorder by descending frequency
freq.trigram.twitter <- plyr::arrange(freq.trigram.twitter, -frequency)
##                 word frequency
## 1  happy mothers day       183
## 2      cant wait see       145
## 3        let us know        94
## 4     happy new year        91
## 5           ha ha ha        55
## 6      cinco de mayo        53
## 7     im pretty sure        53
## 8     dont even know        47
## 9     cant wait till        45
## 10    love love love        39

Predicting from N-grams

Each n-gram will be split, separating the last word from the preceding words in the n-gram (see the sketch after this list).
- bigrams will become unigram/unigram pairs
- trigrams will become bigram/unigram pairs
- four-grams will become trigram/unigram pairs
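
A minimal sketch of this splitting step for a single n-gram (the helper below is hypothetical, not from the original report):

# hypothetical helper: split an n-gram into its leading (n-1)-gram and final unigram
split.ngram <- function(ngram) {
  tokens <- unlist(strsplit(ngram, " "))
  len <- length(tokens)
  c(stem = paste(tokens[-len], collapse = " "), last.word = tokens[len])
}
split.ngram("cant wait see")  # returns stem "cant wait" and last.word "see"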

For each leading n-gram, the three most frequent final unigrams will be stored in the dictionary. Here are the three most frequent trigrams beginning with the bigram “cant wait” in the randomly sampled twitter corpus. These three trigrams would be split into bigram/unigram pairs and stored in the twitter dictionary. Dictionaries will be built for tweets, blogs, and news items.

##              word frequency
## 2   cant wait see       145
## 9  cant wait till        45
## 11 cant wait hear        36

Application Program Flow

After the dictionaries have been established, an app will be developed that allows the user to enter text. After entering the text, the user will declare whether it is meant for a tweet, a blog, or a news item. The app will then suggest the three most likely next words for that text type, based on these rules.

  1. If the supplied text is greater than 2 words, take the last three words of the text and search the trigram/unigram pairs.
  2. If the supplied text is 2 words, take the two words and search the bigram/unigram pairs.
  3. If the supplied text is 1 word, search for that word in the unigram/unigram pairs.

In each case, the app will suggest the three most frequent unigrams paired with the matching n-gram from rule 1, 2, or 3 above.
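
A minimal sketch of this lookup logic, assuming (hypothetically) that each dictionary is stored as a data frame with ngram, next.word, and frequency columns; this is not the final implementation:

# hypothetical lookup: suggest up to three next words for the supplied text
suggest.next <- function(text, dictionary, n.suggest = 3L) {
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  # rules 1-3: use the last three, two, or one word(s) of the text as the key
  key <- paste(tail(tokens, min(length(tokens), 3L)), collapse = " ")
  matches <- dictionary[dictionary$ngram == key, ]
  matches <- matches[order(-matches$frequency), ]
  head(matches$next.word, n.suggest)
}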

Appendix

R Session Info

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] RWeka_0.4-24    plyr_1.8.1      gridExtra_0.9.1 ggplot2_1.0.1  
## [5] stringr_0.6.2   tm_0.6-1        NLP_0.1-7      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4   digest_0.6.4       evaluate_0.5.5    
##  [4] formatR_1.0        gtable_0.1.2       htmltools_0.2.6   
##  [7] knitr_1.9          labeling_0.3       MASS_7.3-33       
## [10] munsell_0.4.2      parallel_3.1.1     proto_0.3-10      
## [13] Rcpp_0.11.5        reshape2_1.4.1     rJava_0.9-6       
## [16] rmarkdown_0.7      RWekajars_3.7.12-1 scales_0.2.4      
## [19] slam_0.1-32        tools_3.1.1

Sourced Functions

Some of these functions may not have been used.
#### [ngramTokenizer] functions

# BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# FourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

ngram_tokenizer

#' Ngrams tokenizer
#' @param n integer; size of the n-grams to produce
#' @param skip_word_none logical; if TRUE, skip tokens that are not words (e.g. punctuation and whitespace)
#' @return n-gram tokenizer function
ngram_tokenizer <- function(n = 1L, skip_word_none = TRUE) {
  stopifnot(is.numeric(n), is.finite(n), n > 0)
  options <- stringi::stri_opts_brkiter(type="word", skip_word_none = skip_word_none)
  
  function(x) {
    stopifnot(is.character(x))
    
    # Split into word tokens
    tokens <- unlist(stringi::stri_split_boundaries(x, opts_brkiter=options))
    len <- length(tokens)
    
    if(all(is.na(tokens)) || len < n) {
      # If we didn't detect any words or number of tokens is less than n return empty vector
      character(0)
    } else {
      sapply(
        1:max(1, len - n + 1),
        function(i) stringi::stri_join(tokens[i:min(len, i + n - 1)], collapse = " ")
      )
    }
  }
}

#### ngram_tokenizer example
x <- ngram_tokenizer(4)(sample.list$blog)