Predictive text analytics, such as Google’s predictive search suggestions and SwiftKey’s predictive keyboard, are becoming mainstream in product offerings. As part of a data science Capstone assignment, Johns Hopkins, in concert with Coursera and SwiftKey, has defined a project aimed at analyzing a large set of text documents (a corpus) and building a predictive text application. By analyzing text, a predictive algorithm can suggest the words most likely to come next after a sentence fragment. This document outlines the steps completed to date, along with a basic summary of the data used as the basis for algorithm development.
The large project corpus was downloaded from the Coursera website (Project Dataset). Although it contains sample data in multiple languages, only the English-language subset, consisting of blogs, news, and tweets originally derived from the HC Corpus, was used to derive the model.
Three English files were processed from the final/en_US directory. In total, the original files contained over 550 MB of data to be used in developing the predictive algorithm. The number of lines, words, and bytes in each of these original files is summarized in the following table:
| Source File | Line Count | Word Count | Bytes |
|---|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,131 | 210,160,014 |
| en_US.news.txt | 1,010,242 | 34,372,530 | 205,811,889 |
| en_US.twitter.txt | 2,360,148 | 30,373,583 | 167,105,338 |
| Total | 4,269,678 | 102,080,244 | 583,077,241 |
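Counts like those in the table above can be reproduced with base R. The following is a minimal sketch rather than the code actually used for this report: the helper name `summariseFile` is hypothetical, and it assumes a word is any run of non-whitespace characters, so its word counts may differ slightly from those of other tools.

```r
# Hypothetical helper: count lines, whitespace-separated words, and bytes in one file.
summariseFile <- function(path) {
  # skipNul avoids warnings from embedded nul characters in the raw data
  lines <- readLines(path, warn = FALSE, skipNul = TRUE)
  # Split on runs of ASCII whitespace; useBytes avoids errors on badly encoded text
  tokens <- strsplit(lines, "[[:space:]]+", useBytes = TRUE)
  # Ignore the empty token produced by leading whitespace
  words <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  data.frame(File = basename(path),
             Lines = length(lines),
             Words = words,
             Bytes = file.size(path))
}

# Small demonstration on a temporary file (not the project data)
demoFile <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "  a b  c "), demoFile)
demoSummary <- summariseFile(demoFile)
print(demoSummary)
```

For the project data, the same function could be applied to each file in final/en_US and the results combined with `rbind`.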
In exploring the data, it was evident that extensive cleaning would be required. Additionally, dealing with such a large dataset with limited computing resources using the R programming language proved challenging. Several packages were explored, including tm, tau, qdap, and RWeka, but none could handle the volume of data to be processed unless relatively restrictive sample sizes or extensive incremental processing were used.
The data was therefore cleaned using the base utilities in R. Cleaning included removing problem characters (such as nulls and end-of-file markers in the middle of the data), removing special characters, stripping punctuation, stripping characters not used in English text (such as smiley faces), and removing numeric values. The complete data from the news, blogs, and Twitter input files was processed, converted to lower case, and reduced to words separated by a single space. Care was taken not to remove contractions and hyphenated words, since these are legitimate terms needed by the predictive algorithm. The cleaning function is shown below for readers interested in the technical implementation; understanding it is not required to follow the rest of this report.
# Clean unwanted characters from a character vector of text lines.
# Punctuation is replaced with spaces, except "-" and "'" inside words,
# which are kept because hyphenated words and contractions are valid tokens.
cleanStrings <- function(s) {
  # First make the various Unicode quote characters consistent.
  # (Newlines are not mapped to "'"; they are handled as whitespace below.)
  s <- gsub("\xe2\x80\x99", "'", s, perl=TRUE)
  s <- gsub("\u0091|\u0092|\u0093|\u0094|\u0060|\u0027|\u2019", "'", s, perl=TRUE)
  # Replace any remaining non-ASCII characters with "?" (removed in the next step)
  s <- iconv(s, "UTF-8", "ASCII", "?")
  # Replace everything except letters, whitespace, ' and - with a space
  s <- gsub("[^[:alpha:][:space:]'-]", " ", s)
  # Drop ' and - that are not in the middle of a word (quoted strings, dashes);
  # this leaves contractions like don't, I'm, etc.
  s <- gsub("(?<!\\w)[-']+|[-']+(?!\\w)", " ", s, perl=TRUE)
  s <- gsub("[[:space:]]+", " ", s, perl=TRUE) # consolidate spaces
  s <- gsub("^[[:space:]]+|[[:space:]]+$", "", s, perl=TRUE) # strip leading and trailing spaces
  s <- tolower(s)
  return(s)
}
Data from each of the 3 source files was split into training (60%), cross-validation (20%), and testing (20%) data sets. Only the training data was used to develop the algorithms. Sample cases from the cross-validation sets were used to validate the algorithm. Once development is complete, final testing will be performed on the testing data. The resulting training data consisted of the following input:
| Training File | Line Count | Word Count | Bytes |
|---|---|---|---|
| en_US.blogs.txt | 539,572 | 22,198,514 | 121,539,282 |
| en_US.news.txt | 606,145 | 20,152,366 | 118,262,005 |
| en_US.twitter.txt | 1,416,088 | 17,792,581 | 95,810,277 |
| Total | 2,561,805 | 60,143,461 | 335,611,564 |
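A 60/20/20 split of this kind can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for this report: the function name `splitLines`, the fixed seed, and the choice to split by randomly permuting whole lines are all hypothetical.

```r
# Hypothetical helper: randomly split lines into training, cross-validation, and test sets.
splitLines <- function(lines, pTrain = 0.6, pCv = 0.2, seed = 1234) {
  set.seed(seed)                  # fixed seed so the split is reproducible
  n <- length(lines)
  idx <- sample.int(n)            # random permutation of the line indices
  nTrain <- floor(pTrain * n)
  nCv <- floor(pCv * n)
  list(train = lines[idx[seq_len(nTrain)]],
       cv    = lines[idx[nTrain + seq_len(nCv)]],
       # whatever remains after training and cross-validation goes to testing
       test  = lines[tail(idx, n - nTrain - nCv)])
}

# Demonstration on 100 dummy lines
demoLines <- paste("line", 1:100)
parts <- splitLines(demoLines)
sapply(parts, length)
```

Splitting by line keeps every sentence intact within a single set, so no n-gram is shared between training and testing merely because a line was cut in two.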
Each of the 3 training sets was further “tokenized” into individual words and combined into a single data table. The frequency of individual words in the resulting dataset was then analyzed:
Figure 1: distribution of most frequent single words
In total there were 570,497 unique words across all 3 datasets after cleaning. The frequency of single words declines sharply: 2,435 of the 570,497 words made up over 80% of all word occurrences, so we expect to be able to greatly shorten the list. For example, the word “the” accounted for over 4.7% of all words in the datasets, and the word “to” made up over 2.7%. Although these figures differ from published English word frequencies, they are consistent with the English language once verb tense, plurals, stemming, and the informal nature of the Twitter data are taken into account.
## Word Percent
## 1: the 4.762
## 2: to 2.749
## 3: and 2.414
## 4: a 2.403
## 5: of 2.004
## 6: i 1.655
## 7: in 1.641
## 8: for 1.096
## 9: is 1.076
## 10: that 1.039
Figure 2: Total distribution of all words in the corpus. The top 10 word frequencies are listed, along with a line chart showing the sharp decrease in frequency among the 100 most frequently found words.
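The word frequency table and the "how many words cover 80% of the text" figure can be computed along the following lines. This is a minimal base R sketch that assumes the input has already been cleaned to lower-case words separated by single spaces; the function names `wordFrequencies` and `wordsForCoverage` are hypothetical, and the report itself may have used a different approach for the full dataset.

```r
# Hypothetical helper: build a frequency table of single words, most frequent first.
wordFrequencies <- function(lines) {
  tokens <- unlist(strsplit(lines, " ", fixed = TRUE))
  tokens <- tokens[nzchar(tokens)]               # drop empty tokens
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(Word = names(counts),
             Count = as.integer(counts),
             Percent = 100 * as.integer(counts) / length(tokens),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: number of top-ranked words needed to reach a coverage target (in percent).
wordsForCoverage <- function(freq, target = 80) {
  which(cumsum(freq$Count) >= target / 100 * sum(freq$Count))[1]
}

# Demonstration on two toy sentences
freq <- wordFrequencies(c("the cat sat on the mat", "the dog sat"))
print(head(freq, 3))
wordsForCoverage(freq, target = 50)
```

In the toy example, "the" and "sat" together account for 5 of the 9 tokens, so two words are enough to cover 50% of the text.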
Bigram and trigram datasets (sequences of two and three consecutive words) were then created from the tokenized data. There were 10,846,696 bigrams and 33,275,261 trigrams in the resulting datasets. The most frequent word pairs and three-word sets were plausible, common English phrases. For example, the most frequently found trigrams were:
## Word1 Word2 Word3 Percent
## 1: one of the 0.03428
## 2: a lot of 0.03003
## 3: thanks for the 0.02363
## 4: to be a 0.01817
## 5: going to be 0.01737
## 6: the end of 0.01505
## 7: i want to 0.01489
## 8: out of the 0.01485
## 9: it was a 0.01430
## 10: as well as 0.01392
Figure 3: Distribution of the most frequent trigrams. The top 10 trigram frequencies are listed, along with a bar chart of 25 of the most frequent three-word combinations.
A similar drop-off in frequency was found among the trigrams, although it was less severe than for single words.
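Generating n-grams from a cleaned line amounts to sliding a window of n words across its tokens. The sketch below shows the idea in base R; the function name `makeNgrams` is hypothetical, and it assumes that n-grams do not span line boundaries, which may or may not match how the datasets above were built.

```r
# Hypothetical helper: return all n-grams (as space-separated strings) from one cleaned line.
makeNgrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))   # line too short for any n-gram
  starts <- seq_len(length(tokens) - n + 1)      # start position of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Demonstration
trigramsDemo <- makeNgrams("one of the best", 3)
bigramsDemo <- makeNgrams("a b c", 2)
print(trigramsDemo)

# Counting n-grams over several lines
linesDemo <- c("one of the best", "one of the few")
trigramCounts <- table(unlist(lapply(linesDemo, makeNgrams, n = 3)))
```

For tens of millions of words, an approach like this would likely need to be applied in chunks or replaced with a more memory-efficient implementation.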
A list of 450 profanity words was downloaded from reverse-engineered Google applications. Individual words, bigrams, and trigrams containing any word from the profanity list were stripped from a shortened dataset, which will be used in the prediction algorithm.
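Removing n-grams that contain a listed word can be done by splitting each n-gram into its words and checking them against the list. The sketch below illustrates this with harmless placeholder words; the function name `dropListedNgrams` is hypothetical and the actual 450-word list is not reproduced here.

```r
# Hypothetical helper: drop every n-gram that contains at least one word from badWords.
dropListedNgrams <- function(ngrams, badWords) {
  hasBadWord <- vapply(strsplit(ngrams, " ", fixed = TRUE),
                       function(words) any(words %in% badWords),
                       logical(1))
  ngrams[!hasBadWord]
}

# Demonstration with placeholder words standing in for the real list
placeholderList <- c("darn", "heck")
kept <- dropListedNgrams(c("what the heck", "one of the", "darn it"), placeholderList)
print(kept)
```

Matching whole words, rather than substrings, avoids discarding legitimate words that merely contain a listed word.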
The next step is to use the trigrams to predict words. Initial analysis of selected phrases from the cross-validation dataset and the project quiz indicates that the trigram and bigram sets produce high-probability results. In Quiz 2, the correct answer was the highest-probability match among the trigram or bigram phrases. A backoff strategy will be developed that falls back to bigrams, and then to unigrams, when no matching trigram is found. The key will be to make the dataset small and fast enough to predict a set of the most probable next words while maintaining the accuracy of the prediction.
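Because the backoff strategy has not yet been developed, the following is only a sketch of one simple way it might look. It assumes count tables with the hypothetical columns shown (Word1, Word2, Word3, Count), uses a plain fallback from trigrams to bigrams to unigrams, and does not apply any discounting or smoothing such as Katz backoff.

```r
# Hypothetical helper: suggest up to k next words using a simple trigram -> bigram -> unigram fallback.
predictNext <- function(phrase, trigrams, bigrams, unigrams, k = 3) {
  words <- strsplit(tolower(phrase), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n >= 2) {   # try the last two words against the trigram table
    hit <- trigrams[trigrams$Word1 == words[n - 1] & trigrams$Word2 == words[n], ]
    if (nrow(hit) > 0) return(head(hit$Word3[order(-hit$Count)], k))
  }
  if (n >= 1) {   # back off to the last word against the bigram table
    hit <- bigrams[bigrams$Word1 == words[n], ]
    if (nrow(hit) > 0) return(head(hit$Word2[order(-hit$Count)], k))
  }
  # final fallback: the most frequent single words
  head(unigrams$Word[order(-unigrams$Count)], k)
}

# Toy count tables (invented counts, for illustration only)
triTbl <- data.frame(Word1 = c("one", "one", "a"), Word2 = c("of", "of", "lot"),
                     Word3 = c("the", "my", "of"), Count = c(10, 3, 8),
                     stringsAsFactors = FALSE)
biTbl <- data.frame(Word1 = c("of", "of", "thanks"), Word2 = c("the", "a", "for"),
                    Count = c(50, 20, 9), stringsAsFactors = FALSE)
uniTbl <- data.frame(Word = c("the", "to", "and"), Count = c(100, 60, 50),
                     stringsAsFactors = FALSE)

predictNext("He is one of", triTbl, biTbl, uniTbl)   # matched in the trigram table
predictNext("made of", triTbl, biTbl, uniTbl)        # falls back to bigrams
predictNext("zebra", triTbl, biTbl, uniTbl)          # falls back to unigrams
```

Filtering data frames on every call would be slow at full scale; a keyed lookup structure would be a natural refinement once the dataset has been reduced.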
In general:
This project brief outlines the basic summary statistics of the corpus used as the basis for the predictive text algorithm. The data has been downloaded, cleaned, and tokenized, and the resulting trigram, bigram, and unigram datasets appear to be a solid basis for a predictive algorithm.