Predictive text analytics, such as Google’s predictive search suggestions and SwiftKey’s predictive keyboard, is becoming mainstream in product offerings. As part of a data science Capstone assignment, Johns Hopkins, in concert with Coursera and SwiftKey, has defined a project aimed at analyzing a large set of text documents (a corpus) and building a predictive text application. By analyzing text, a predictive algorithm can suggest words that may come next in a sentence fragment. This document outlines the steps completed to date, along with a basic summary of the data used as the basis for algorithm development.

Data Summary

The large project corpus was downloaded from the Coursera website (Project Dataset). While the corpus contains sample data in multiple languages, only the English-language subset, consisting of blogs, news, and tweets originally derived from the HC Corpus, was used to derive the model.

Three English files were processed from the final/en_US directory. In total, the original files contained over 550 MB of data to be used in developing the predictive algorithm. The number of lines, words, and bytes in each original file is summarized below:

Source File	Line Count	Word Count	Bytes
en_US.blogs.txt	899,288	37,334,131	210,160,014
en_US.news.txt	1,010,242	34,372,530	205,811,889
en_US.twitter.txt	2,360,148	30,373,583	167,105,338
Total	4,269,678	102,080,244	583,077,241
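
A summary like the one above can be sketched in base R. This is a minimal illustration only: `summarise_file` is a hypothetical helper, not the code used for this brief, and it assumes words are whitespace-delimited, so its word counts may differ slightly from those of other tools.

```r
# Hypothetical helper: line, word, and byte counts for one text file.
summarise_file <- function(path) {
    # skipNul drops embedded nulls, which appear in the raw files
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
    # assumption: a "word" is any run of non-whitespace characters
    words <- sum(lengths(strsplit(trimws(lines), "[[:space:]]+")))
    data.frame(
        file  = basename(path),
        lines = length(lines),
        words = words,
        bytes = file.size(path)
    )
}

# Usage on a small temporary file standing in for one of the corpus files
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello big world", "two words"), tmp)
demo <- summarise_file(tmp)
print(demo)
```

For the real corpus, the same helper could be applied to each of the three files and the results combined with `rbind`.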

Review and Data Cleansing

In exploring the data, it was evident that extensive cleaning would be required. Additionally, dealing with such a large dataset with limited computing resources using the R programming language proved challenging. Several packages were explored, including tm, tau, qdap, and RWeka, but none could handle the desired volume of data unless relatively restrictive sample sizes or extensive incremental processing were used.

The data was therefore cleaned using base R utilities. Cleaning included removing problem characters (such as nulls and end-of-file markers in the middle of the data), special characters, punctuation, characters not used in English text (such as smiley faces), and numeric values. The complete data from the news, blogs, and Twitter input files was processed, converted to lower case, and normalized so that words are separated by a single space. Care was taken to avoid removing contractions and hyphenated words, since these are legitimate terms needed by the predictive algorithm. The cleaning function is listed here for the technical implementer; understanding it is not required to follow the rest of this brief.

# This function cleans unwanted characters from the stream.
# It replaces punctuation with spaces, except "-" and ' inside words,
# which are still valid for the analysis we want to perform.
cleanStrings <- function(s) {
    # first fix the unicode ' characters to be all consistent
    s <- gsub("\xe2\x80\x99", "'", s, perl=TRUE)
    s <- gsub("\u0091|\u0092|\u0093|\u0094|\u0060|\u0027|\u2019|\u000A", "'", s, perl=TRUE)
    
    # Strip unwanted UTF-8 characters
    s <- iconv(s, "UTF-8", "ASCII", "?")
    # strip unused characters but leave ' and -
    s <- gsub("[^[:alpha:][:space:]'-]", " ", s)
    
    # Remove hyphens and apostrophes that are not inside a word (for example,
    # quote marks around a phrase). A "-" or ' is kept only when it has a word
    # character on both sides, so contractions like don't and I'm survive.
    s <- gsub("(?<!\\w)[-']|[-'](?!\\w)", " ", s, perl=TRUE)
    
    s <- gsub("[[:space:]]+", " ", s, perl=TRUE) # consolidate spaces
    s <- gsub("^[[:space:]]+|[[:space:]]+$", "", s, perl=TRUE)  # strip leading and trailing spaces
    s <- tolower(s)
    
    return(s)
}

Data from each of the 3 source files was split into training (60%), cross-validation (20%), and testing (20%) data sets. Only the training data was used to develop the algorithms. Sample cases from the cross-validation sets were used to validate the algorithm. Once development is complete, final testing will be performed on the testing data. The training data consisted of the following input:

Training File	Line Count	Word Count	Bytes
en_US.blogs.txt	539,572	22,198,514	121,539,282
en_US.news.txt	606,145	20,152,366	118,262,005
en_US.twitter.txt	1,416,088	17,792,581	95,810,277
Total	2,561,805	60,143,461	335,611,564
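
One way to produce a 60/20/20 split is to shuffle the line indices and cut them by cumulative proportion. The sketch below is illustrative only: `split_lines`, its `probs` argument, and the fixed seed are assumptions, and the brief does not state whether the actual split was random or sequential.

```r
# Hypothetical helper: randomly partition lines into train / cv / test.
split_lines <- function(lines, probs = c(0.6, 0.2, 0.2), seed = 42) {
    set.seed(seed)                       # fixed seed so the split is reproducible
    n <- length(lines)
    idx <- sample.int(n)                 # shuffled line indices
    cuts <- unname(round(cumsum(probs) * n))
    list(
        train = lines[idx[seq_len(cuts[1])]],
        cv    = lines[idx[seq_len(cuts[2] - cuts[1]) + cuts[1]]],
        test  = lines[idx[seq_len(n - cuts[2]) + cuts[2]]]
    )
}

# Usage on ten toy lines
parts <- split_lines(sprintf("line %d", 1:10))
print(vapply(parts, length, integer(1)))
```

Every line lands in exactly one partition, so the cross-validation and testing sets never overlap with the training data.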

Tokenization and Profanity Removal

Each of the 3 training sets was further “tokenized” into individual words and combined into a single data table. The frequency of individual words in the resulting dataset was analyzed:

Figure 1: distribution of most frequent single words

In total there were 570,497 unique words across all 3 datasets after cleaning. The frequency of single words declines sharply: 2,435 of the 570,497 words account for over 80% of all word occurrences, so we expect to be able to shorten the list greatly. For example, the word “the” accounts for over 4.7% of all words in the datasets, and the word “to” for over 2.7%. While these figures differ from published English frequencies, they are consistent with the English language once verb tense, plurals, stemming, and the informal nature of the Twitter source data are taken into account.

##     Word Percent
##  1:  the   4.762
##  2:   to   2.749
##  3:  and   2.414
##  4:    a   2.403
##  5:   of   2.004
##  6:    i   1.655
##  7:   in   1.641
##  8:  for   1.096
##  9:   is   1.076
## 10: that   1.039

Figure 2: Total distribution of all words in the corpus. The top 10 word frequencies are listed, along with a line chart showing the sharp decrease in frequency across the 100 most frequently found words.
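
The word percentages and the "words needed for 80% coverage" figure can be derived from a simple frequency table. The following is a minimal base R sketch on toy data; the variable names are illustrative, and the brief's own analysis used a data table rather than this exact code.

```r
# Toy input standing in for the cleaned training text
cleaned <- c("the cat sat on the mat", "the dog sat")

# Tokenize on single spaces (the cleaning step guarantees single spacing)
tokens <- unlist(strsplit(cleaned, " ", fixed = TRUE))
tokens <- tokens[nzchar(tokens)]

# Frequency of each word, most frequent first, as a percentage of all tokens
freq <- sort(table(tokens), decreasing = TRUE)
percent <- 100 * as.numeric(freq) / sum(freq)
unigrams <- data.frame(Word = names(freq), Percent = round(percent, 3),
                       stringsAsFactors = FALSE)

# Smallest number of top-ranked words that covers at least 80% of all tokens
n_for_80 <- which(cumsum(percent) >= 80)[1]
print(head(unigrams, 3))
print(n_for_80)
```

On the full training set, the same cumulative-sum calculation is what identifies how small a vocabulary covers most of the text.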

Bigram and trigram datasets (two and three consecutive words) were then created from the tokenized data. There were 10,846,696 bigrams and 33,275,261 trigrams in the resulting datasets. The most frequent word pairs and three-word sets were logical, everyday phrases. For example, the most frequently found trigrams were:

##      Word1 Word2 Word3 Percent
##  1:    one    of   the 0.03428
##  2:      a   lot    of 0.03003
##  3: thanks   for   the 0.02363
##  4:     to    be     a 0.01817
##  5:  going    to    be 0.01737
##  6:    the   end    of 0.01505
##  7:      i  want    to 0.01489
##  8:    out    of   the 0.01485
##  9:     it   was     a 0.01430
## 10:     as  well    as 0.01392

Figure 3: Distribution of most frequent trigrams. The top 10 trigram frequencies are listed, along with a bar chart of the 25 most frequent three-word combinations.

A similar drop-off in frequency was found in the trigrams, but it was not as severe as for single words.
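
As an illustration of how bigrams and trigrams can be built from tokenized text, the sketch below slides a window of n words across a single line. `make_ngrams` is a hypothetical helper, and treating each line independently (so that n-grams never span two lines) is an assumption rather than something the brief specifies.

```r
# Hypothetical helper: all n-grams of consecutive words within one line.
make_ngrams <- function(line, n) {
    words <- strsplit(line, " ", fixed = TRUE)[[1]]
    words <- words[nzchar(words)]
    # a line shorter than n words yields no n-grams
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
}

# Usage
bigrams  <- make_ngrams("thanks for the follow", 2)
trigrams <- make_ngrams("thanks for the follow", 3)
print(bigrams)
print(trigrams)
```

Counting the resulting strings with `table` gives n-gram frequencies in the same way as for single words.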

A list of 450 profanity words was downloaded from reverse engineered Google applications. Individual words, bigrams, and trigrams that contained words from the profanity list were stripped from a shortened dataset, which will be used in the prediction algorithm.
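
Filtering of this kind can be sketched as a whole-word match against the banned list. The list and n-grams below are mild placeholders, not the actual 450-word list, and matching on whole words only (so that "heckle" is not removed because of "heck") is an assumption about the intended behavior.

```r
# Placeholder banned list and a few toy n-grams
banned <- c("darn", "heck")
ngrams <- c("what the heck", "one of the", "darn it", "heckle the speaker")

# Split each n-gram into words and flag those containing any banned word
has_banned <- vapply(strsplit(ngrams, " ", fixed = TRUE),
                     function(w) any(w %in% banned),
                     logical(1))
clean_ngrams <- ngrams[!has_banned]
print(clean_ngrams)
```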

Conclusion and Next Steps

The next step is to use the trigrams to predict words. Initial analysis of selected phrases from the cross-validation dataset and the project quiz indicates that the trigram and bigram sets produce high-probability results. In Quiz 2, the correct answer was the highest-probability match among the trigram or bigram phrases. A backoff strategy will be developed to use bigrams and then unigrams if no trigram match is found. The key will be to make the dataset small and fast enough to predict a set of the most probable next words while maintaining prediction accuracy.
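
The backoff idea can be sketched as follows. This is a deliberately simple version under stated assumptions: the count tables are toy data, `predict_next` is a hypothetical function, and candidates are ranked by raw count at the first level that matches, with no discounting or smoothing. The final algorithm has not been developed yet and may differ.

```r
# Toy n-gram counts: names are the n-grams, values are their counts
tri <- c("one of the" = 5, "one of my" = 2, "a lot of" = 4)
bi  <- c("of the" = 9, "lot of" = 4, "the end" = 3)
uni <- c("the" = 20, "to" = 12, "and" = 10)

# Hypothetical predictor: try trigrams, then bigrams, then unigrams
predict_next <- function(phrase, k = 3) {
    words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
    n <- length(words)

    # final words of the n-grams that start with `prefix`, most frequent first
    best_after <- function(counts, prefix) {
        hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
        hits <- sort(hits, decreasing = TRUE)
        sub(".* ", "", names(hits))
    }

    if (n >= 2) {
        out <- best_after(tri, paste(words[n - 1], words[n]))
        if (length(out) > 0) return(head(out, k))
    }
    if (n >= 1) {
        out <- best_after(bi, words[n])
        if (length(out) > 0) return(head(out, k))
    }
    # nothing matched: fall back to the most frequent single words
    head(names(sort(uni, decreasing = TRUE)), k)
}

# Usage
print(predict_next("one of"))      # matched at the trigram level
print(predict_next("the end of"))  # backs off to bigrams
print(predict_next("zzz"))         # backs off to unigrams
```

Scanning named vectors is adequate for a sketch; at the scale of the real n-gram tables, a keyed lookup structure would be needed to keep predictions fast.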

In general:

R's text processing and language support is not a natural fit for this problem. Because R data structures reside in memory, the size of the data is a limiting factor.
Existing libraries such as tm, tau, and RWeka, while powerful, are not suitable for data analysis at this scale.
Reduced sample sizes may produce results, but bias will be reduced by using a large corpus.
Other solutions, like Hadoop, are natural fits for analyzing and building datasets of probabilities of word sequences.
Processing time to develop the base dataset for the predictive algorithm is manageable but lengthy on standard PC hardware.

This project brief outlines basic and sample statistics of the corpus used as the basis for the predictive text algorithm. The data has been downloaded, cleaned, and tokenized, and the resulting trigram, bigram, and unigram datasets appear to be a solid basis for a predictive algorithm.

Natural Language Processing Capstone Milestone Brief

November 13, 2014
