This document is the initial interim summary of progress on the Word Prediction project. The outcome of the project is to be an interactive web based Shiny app that will attempt to predict the next word given an phrase input by the user of the web app. The prediction will be made using statistical language processing techniques applied to the corpus (training set) provided. On March 7, 2016, the assignment was provided along with three data files intended as the text corpus, and initial summary statistics were extracted, as follows:
##
## --------------------------------------------------------------------------------------
## name lncnt avglnlen mxlnlen wdcnt avgwdlen mxwdlen wdperln
## ----------------- ------- ---------- --------- -------- ---------- --------- ---------
## en_US.blogs.txt 899288 232 40836 37334131 4 1146 41
##
## en_US.news.txt 77259 204 5761 2643969 4 123 34
##
## en_US.twitter.txt 2360148 69 214 30373583 4 188 12
## --------------------------------------------------------------------------------------
These initial parameters were assembled using the simplest tokenization (breaking on whitespace). As we will see in the case of average word length, some of these parameters will change as we refine our approach to tokenization, token cleanup, and token abandonment.
The tm package was used as a corpus framework initially, with the first attempt based on a memory based corpus (i.e., tm::VCorpus). This proved unworkable given the volume of text, and was revised to a database approach (tm::PCorpus). Transformations were applied using the tm package to convert all leters to lowercase, remove punctuation and numbers, and eliminate any tokens with length greater than 20 (the raw input includes some longer tokens including URLs). The transformations were applied serially. In order to limit the runtime, only words with a minimum frequency of 100 or more were retained. A document-term-matrix was generated (runtime of 17 hours). While the tm framework made available a number of preconstructed tools well-suited to the application, the runtimes and memory limitations (some attempts at document-term-matrices immediately exceeded the constraints of RAM memory available to me) led me to examine simpler scripting approaches using the corpus as an input stream.
“Two key aspects” are mentioned in the assignment, as follows:
Size: the amount of memory (physical RAM) required to run the model in R
Runtime: The amount of time the algorithm takes to make a prediction given the acceptable input
These will significantly influence some of the early decisions regarding methodology. In addition, this project will be conducted entirely using personal laptop computers for which the corpus represents a challenging volume of data.
First of all, we look at the 1-grams, or unigrams. The number of occurences will vary some as we adjust the preprocessing parsing. At the time of writing the words data frame has 32,653 distinct words representing total occurrences of over 66 million times in the corpus. Each of the words included appeared at least 50 times in the corpus.
Figure 1.1 shows the most frequently appearing words. The most commonly appearing word is “the”, appearing 2.9 million times, or 4.4% of the total word count in our words data frame. The total upon which this percentage is based excludes the tokens that were abandoned (e.g., too long, not containing text, occuring less than 50 times). This means if we predict the word “the” every time, we should be right about 4.4% of the time. The unigram counts give us an indication of the probability of each word occuring absent any other information. If we have no other information at all, we can do no better than guessing “the” to be the next word, but if we do have bigram/trigram information we can use both together. We will treat the relative frequency (unigram frequency/total unigram count) as our Maximum Likelihood Estimate (MLE) of the probability for that word to occur. Note that relative freqencies for words were smoothed using LaPlace smoothing for each word count: (word count + 1 / total word count + number of words).
Figure 1.2 compares the word length to word frequency. Shorter words dominate, but the average word length is 7 characters. Given the Objectives and Constraints guidance, we may want to consider keeping as much of the work as possible in the numeric domain rather than string (text) data which can be more demanding in terms of storage and computation. If we were to represent an individual word in our internal representations as an integer we can greatly reduce the size of bigram and trigram lists. We must have at least one internal represntation of the text, and that will be maintained in our words data frame. To give an idea of the amount of memory reduction from using integer rather than string representation, I constructed a data frame consisting of only the string representation (just over 2M bytes), and another consisting of only the integer representation (just over 131,000 bytes), making the storage of bigrams and trigrams approximately 16 times more memory-efficient than if we used the text itself. This savings is not without cost, as discussed in Initial Model Explorations.
In the preprocessing thus far, we have not verified each token against any external reference (e.g., the wordnet “dictionary”), nor has any profanity filter been applied. There could be a benefit to token verfication, but contractions (e.g., “wouldn’t”, “should’ve”) are a concern. This would also address the “foreign words” question, which has not been dealt with at all (see PGF.4 and PGF.5). Preliminary investigation indicates that resolving the issues of contractions, as well as external dictionary verification, may be time consuming for the volume of data involved. (Draft revisions to our scripting preprocessing increased the estimated run time to process the entire corpus from about 17 hours to months).
Figure 2.1 shows the top 30 bigrams. The bigram “of the” occurs 257,651 times. That’s 0.5% of the bigram count (after thresholding), so this is not a true probability of this bigram occuring, since we have both abandoned some bigrams that were observed but which did not meet the minimum frequency threshold, as well as ignoring possible bigrams that were not observed. The full space of possibilities is the size of the dictionary - 32,653 words - squared - over a billion, of which the 257,651 occurrences of “of the” would be just 0.02%. However, the relative ranking of the bigram “of the” to other bigrams should not be effected by the true absolute value of the probability. Ratios of the frequency of a given bigram over the total frequency of all bigrams starting with the “current word” (the last word of the phrase for which we are trying to predict the follow-on word) might serve as a proxy for probability, at least for the purposes of ranking bigrams against each other (any bias in the denominator will effect all bigrams equally and will therefore not effect their relative rankings). We will treat the relative frequency of the bigram (bigram frequency/total bigram frequency beginning with current word) as our MLE of the probability for that bigram to occur.
In dealing with bigrams, Foundations of Statistical Natural Language Processing (FSNLP) noted that “Predictably, just selecting the most frequently occurring bigrams is not very interesting… The table shows the bigrams (sequences of two adjacent words) that are most frequent in the corpus and their frequency. Except for [one exception appearing in their table], all the bigrams are pairs of function words.” They suggest a “simple heuristic” using part-of-speech (e.g., noun, adjective) to let “through those patterns that are likely to be ‘phrases’”. Part-of-speech tagging requires NLP tools involving non-trivial overhead. They note that “A simple quantitative technique (the frequency filter in this case) combined with a small amount of linguistic knowledge (the importance of parts of speech) goes a long way. …we will use a stop list” of function words (see PGF.6).
Also of interest regarding bigrams is the concept of “bigrams at a distance”, discussed in FSNLP “…it is easy to see how to extend techniques applicable to bigrams to bigrams at a distance. We define a collocational window (usually a window of 3 to 4 words on each side of a word), and we enter every word pair in there as a collocational bigram”. The idea is to identify significant word pairs that extend beyond immediate collocational. They discuss both a simple three word sliding window for capturing bigrams, as well as computing mean and variance of the bigram offset (see PGF.8).
Figure 2.2 shows the top 30 trigrams. The trigram “thanks for the” occurs 23,671 times, which is 0.1% of the trigram count (after thresholding). Again, computing the ratio that way omits both trigrams that were below the threshold as well as the many trigrams not observed - this is a very sparse matrix. The MLE will be the tirgram relative frequency (trigram frequency/total trigram frequency beginning with the previous word).
In thinking about how to combine the MLE information for the uni, bi, and trigrams, we can compute the probability of the components of a bigram or trigram occuring together as the product of their unigram MLEs. The bigram/trigram MLE can be divided by this product to produce a ranking measure of how likely these combinations would be expected based solely on unigram distributions. Higher scores indicate combinations that occur more frequently than would be expected (See PGF.2). After some experimentation with these approaches, the current implementation of the model uses a simple backoff model (using trigram information if it exists, backing off to bigram if it exists, and finally defaulting to unigram if necessary).
The n-gram collecting results in a sparse matrix correlating individual word entities. The potential size of the matrix grows exponentially as we increase the n in the n-grams, so it is apparent that there are limits. In addition, my reading included the following from FSNLP when discussing a small introductory model. “The four-gram model is entirely useless. In general, four-gram models do not become usable until one is training on several tens of millions of words of data.” Although we do indeed have several tens of millions of words of data, we may be facing limiting constraints with the trigram model already. This highlights the concern I have regarding using as much of the corpus data as possible, given that the “signal” from the frequency of n-grams is difficult to pull out from the “noise” included in the corpus. My approach thus far has been to scour the maximum amount of text, then arrive at a reasonable level of thresholding (minimum frequency) for the unigrams, bigrams, and trigrams observed. This may need to be addressed (See PGF.3).
Initial explorations with the tm package were time-consuming. I constructed a rough preprocessor in Python, which has several characteristics to recommend it:
String handling strength
Multilevel data structures (Python "dictionaries") for ease of aggregating unigrams, bigrams and trigrams
Scripting language processing files serially, giving more control over memory limitations
Natural Language Toolkit availability (including WordNet integration)
The approach presents several “cons” (see PGF.7):
Not an integrated end-to-end solution in one application
Preprocessing of tokens becomes integrally involved in overall refinement
The preprocessing involves a first pass to break the text into tokens, clean each token, and create a dictionary of possible tokens. A threshold is applied to the dictionary (a minimum frequency), then a CSV file that will become the words data frame is written out (including the text of the word, a ‘word-id’ number that will be used to represent the word internally, and the frequency of the word in the corpus). A second pass reads back through the text, ignoring any words not in our dictionary, and generating the bigram and trigram tables. Those tables contain the ‘word-id’ number for each of the unigrams involved, as well as a frequency count. These tables are written to CSV files that will become the bigrams and trigrams data frames in R.
Overall, I believe the decision to use the Python preprocessor was justified. I was able to produce a workable preprocessed set of n-gram tables quickly, to move to the work on the model itself. All Python code created for the preprocessor is in my sole possession and can be posted to github.
Possible items to explore (mentioned previously in text), roughly in order from highest priority to lowest priority at this time:
1. Develop "test harness" for evaluation against volumes of test text
2. Refine algorithm for combination of *n*-gram frequency information (Markov chain)
3. Train/Test set - currently using all text provided as corpora
4. Validation of tokens via external dictionary reference
5. Contraction handling (e.g., "don't", "I'm", "should've")
6. Heuristic for keeping only "phrases" with more than function words
7. Rewrite of preprocessing in R rather than Python
8. Bigrams "at a distance"
I. Feinerer. Introduction to the tm package, Text Mining in R. Comprehensive R Archive Network (CRAN), July 3, 2015. URL https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
I. Feinerer, K. Hornik, D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5), March 2008 ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05
C. Manning, H. Schutze. Foundations of Statistical Natural Language Processing. The MIT Press. May 28, 1999. ISBN 0-262-13360-1.
D. Jurafsky, J. Martin. Speech and Language Processing. Prentice Hall; 2nd edition (May 16, 2008). ISBN-13: 978-0131873216.
Princeton University “About WordNet.” WordNet. Princeton University. 2010. URL http://wordnet.princeton.edu
CSV: Comma Separated Values
FSNLP: Foundations of Statistical Natural Language Processing (References)
MLE: Mean Likelihood Estimate
PGF: Plan Going Forward (section of paper)