=======================================================================================

1 Natural Language Processing – a simple example

The Shannon Game, a demonstration of entropy in human language, asks a player to guess the next letter or the next word in a series. In this project, we use techniques of natural language processing to implement a word prediction algorithm based on probabilities extracted from large samples of written text. The computer should be able to predict the last word in a one-, two-, three-, or four-word phrase.

Our approach began with a large collection of writing samples from which the machine learned the most common patterns. Such a body of text is called a corpus. A corpus is typically described by two metrics: the total number of words or tokens (N) and the number of distinct words or tokens (V). To put this in perspective, here are two corpora used by researchers:

Corpus           N – Total number of words        V – Size of Vocabulary
Shakespeare      884,000                          31,000
Google n-grams   1,000,000,000,000 (a trillion)   13,000,000

We started with three corpora of different origins. After unpacking the data, we discovered our library included the following (including, as you might imagine, many swear words and nonsensical strings from Twitter).

Corpus           Number of documents   N – Total number of words   V – Size of Vocabulary
Twitter          2,360,148             still calculating
News Clippings   1,010,242             still calculating
Blogs            1,138,646             still calculating
Combined         4,508,936             still calculating           227,335

We processed the data along two paths; the results of both processes are used in our final product.

The first is a tokenized data set. From it we removed all nulls, extra spaces, numbers, and punctuation (apostrophes excepted), corrected spelling where the writer’s intent was completely clear, and removed nonsense words, profanity, and capitalization. Eventually, we converted this data set from character-based to integer-only, with each number referring to one word in the vocabulary.
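
As an illustration, a minimal cleaning pass of this kind might look like the sketch below (base R only; the spelling correction and profanity filtering steps are omitted, and the `raw_lines` input is an assumed placeholder).

```r
# Minimal sketch of the tokenization pass. `raw_lines` is assumed to be a
# character vector of documents; spelling and profanity handling are omitted.
clean_tokens <- function(raw_lines) {
  x <- tolower(raw_lines)                 # drop capitalization
  x <- gsub("[0-9]+", " ", x)             # remove numbers
  x <- gsub("[^a-z' ]", " ", x)           # remove punctuation except apostrophes
  x <- gsub("\\s+", " ", trimws(x))       # collapse extra spaces
  x <- x[nzchar(x)]                       # drop null/empty lines
  strsplit(x, " ", fixed = TRUE)          # one token vector per document
}

tokens <- unlist(clean_tokens("A sample line, with 2 numbers & punctuation!"))
```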

The second is a bag of words. This data set is the vocabulary we extracted from the tokenized training data set. Each word is stored as its own data element along with its frequency of occurrence and its index, to make look-up quick. We established a criterion to eliminate rare words from the vocabulary; this was required because our n-gram object grows at a rate of \(3V^{2}\). The bag of words serves as the reference for the entire corpus: it holds the concordance of word to index and records whether we track a word at all. Rare words (appearing fewer than 2 times) are tracked together as a single new word.
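
A sketch of how such a bag of words might be assembled in base R, assuming `tokens` is the cleaned token vector from the sketch above; the `<rare>` label for the collapsed rare words is our own placeholder name.

```r
# Build the bag of words: one row per tracked word, with frequency and index.
# Words appearing fewer than 2 times are collapsed into a single "<rare>" entry.
freq  <- sort(table(tokens), decreasing = TRUE)
keep  <- freq >= 2
vocab <- data.frame(
  word  = c(names(freq)[keep], "<rare>"),
  count = c(as.integer(freq[keep]), sum(freq[!keep])),
  stringsAsFactors = FALSE
)
vocab$index <- seq_len(nrow(vocab))   # integer index used for fast look-up
```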

[Figure: Relative Occurrences of Words in Sample]

We then turned the entire corpus into a sequence of integers. It is interesting to note that in R an integer requires four bytes, while a word requires one byte per character. The average word we chose to track had a length of [to be determined] characters.
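
A minimal sketch of that conversion, assuming the `tokens` and `vocab` objects from the earlier sketches:

```r
# Replace each word with its vocabulary index; words dropped as rare map to
# the shared "<rare>" entry.
idx <- match(tokens, vocab$word)
idx[is.na(idx)] <- vocab$index[vocab$word == "<rare>"]
corpus_int <- as.integer(idx)
```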

We combined the three corpora to see the worst-case memory and computational requirements for our data set. It is our intent to split the data into two portions, one for training and one for testing.

Combined Corpus Metric               Value                Impact
Occurrences of most frequent token   100,000              Integer bytes needed
All tokens appearing                 50,000               Size of run-time
All bigrams given unigram            not yet determined   Size at run-time
All trigrams given unigram           not yet determined   Size at run-time
All unique quadrigrams               not yet determined   Size at process time

R uses four bytes per integer on my machine, which can record a number as high as 2,147,483,647 (\(2^{4*8-1}-1\)), with a corresponding range of negative numbers. Much of the space required by this application would be served just as well by an integer that goes from zero to 65,535 (\(2^{2*8}-1\)). If space turns out to be a big deal, we can switch from the standard four-byte integer class to a two-byte unsigned short representation. This would dramatically reduce computer memory requirements, but would likely raise other unforeseen problems.
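
These sizes are easy to confirm on any given machine; the figures in the comments below assume a 64-bit build of R.

```r
# Per-element storage cost of the standard integer type, and its upper limit.
object.size(integer(1e6))     # roughly 4 MB: four bytes per integer
object.size(character(1e6))   # roughly 8 MB of pointers alone on 64-bit R
.Machine$integer.max          # 2147483647, i.e. 2^31 - 1
```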

Processing Limit          Value     Impact
Shiny single file size    60 MB     Ability to run
Shiny available RAM       1.0 GB    Size of run-time
Tolerable response time   5.0 sec   User experience

Our original strategy was to build a unigram occurrence list, from that a bigram occurrence matrix, from that a trigram occurrence matrix, and so on. These matrices were huge, sparse, and computationally overwhelming. Eventually, we hit upon a vector-based approach that turned out to be very fast. We built an object that included every quadrigram in the corpus. Quadrigrams contain every trigram, bigram, and unigram, so only this one object was needed. We repeatedly processed this object to develop probabilities for every n-gram (1-4) in the corpus. It is interesting to note that quadrigrams were almost all unique.
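
A sketch of the vector-based idea in base R, assuming `corpus_int` is the integer-encoded corpus from the earlier sketches (our production version is organized somewhat differently, but the alignment trick is the same):

```r
# Align four shifted copies of the integer corpus so each row is one quadrigram.
n    <- length(corpus_int)
quad <- data.frame(
  w1 = corpus_int[1:(n - 3)],
  w2 = corpus_int[2:(n - 2)],
  w3 = corpus_int[3:(n - 1)],
  w4 = corpus_int[4:n]
)

# Count identical quadrigrams; trigram, bigram, and unigram counts fall out of
# the same object by aggregating over fewer columns.
quad_counts <- aggregate(cnt ~ w1 + w2 + w3 + w4,
                         data = transform(quad, cnt = 1L), FUN = sum)
```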

We found only [not yet determined] quadrigrams occurring more than once in a corpus of yy words. As a final step in this stage, we intend to summarize the counts of each n-gram and compute the log probability of an n-gram given the (n-1)-gram that precedes it. We will use a modified Kneser-Ney approach to assign probabilities to previously unseen or disregarded n-grams.
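
For reference, the standard interpolated Kneser-Ney form we are starting from (shown here for bigrams, with absolute discount d; the exact modification we will use is still to be settled) is:

\[P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1}w_i)-d,\,0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{continuation}}(w_i), \qquad \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{w : c(w_{i-1}w) > 0\}\big|\]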

Armed with a set of probabilities for each n-gram given a preceding (n-1)-gram, we will develop a prediction model for the unknown final word. We decided to reserve 40% of the data as our test set. We will use our experience from a previous machine learning exercise to build this model. The dependent variable y is the word that completes a phrase. The independent variables are the lambdas (\(\lambda_{4},\ \lambda_{3},\ \lambda_{2},\ \lambda_{1}\)) that describe the relative weights applied to the preceding trigram, bigram, and unigram, and to the probability of a word given no precedent. In the last case, we have already observed that “the” is the most likely word, given no other information.

\[P(y)=\lambda_{1}P(y\mid\emptyset)+\lambda_{2}P(y\mid\text{unigram})+\lambda_{3}P(y\mid\text{bigram})+\lambda_{4}P(y\mid\text{trigram})\]
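
A sketch of how this interpolation might be evaluated for a candidate word, assuming hypothetical helper functions `p1` through `p4` that look up the corresponding conditional probabilities in our n-gram tables:

```r
# Interpolated score for candidate word y following the context w1 w2 w3.
# p1..p4 are assumed look-up helpers over the unigram..quadrigram tables.
score <- function(y, w1, w2, w3, lambda) {
  lambda[1] * p1(y) +              # word given no precedent
  lambda[2] * p2(y, w3) +          # word given the preceding unigram
  lambda[3] * p3(y, w2, w3) +      # word given the preceding bigram
  lambda[4] * p4(y, w1, w2, w3)    # word given the preceding trigram
}
```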

We intend to use a random forest model to develop the four values of \(\lambda\) that produce the best accuracy for the completion word (y) in the training set. We will apply this model to the reserved test set and observe accuracy. The acid test will be a publicly available Shiny application that invites users to enter their own precedent n-grams.

The Shiny execution environment presents a number of coding challenges to meet its limitations in terms of space (available working memory) and time (CPU cycles required to remain responsive to user input). There are two possible design routes, and we do not yet know which will work better. In either case, there will be a word look-up object derived from the bag of words; it will include the translation of each character string to its integer value and the probability of a word given no precedent (which we know to be “the” in English). The first design option uses three probability objects, one per n-gram order, each optimized for speed. The second (and more likely) option will use a single probability object to minimize space, at the cost of giving up third normal form and some speed.
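
As a sketch of the second option, a single keyed table could serve all of the look-ups; here we illustrate the idea with the data.table package and toy values, neither of which is a settled choice.

```r
library(data.table)

# One table for all n-gram orders: `context` holds the integer-encoded
# precedent words ("" for the no-precedent case), `word` the candidate word
# index, and `logp` its log probability. Values below are toy examples.
probs <- data.table(
  context = c("", "12", "12 7", "12 7 3"),
  word    = c(5L, 5L, 9L, 9L),
  logp    = c(-6.1, -2.3, -1.2, -0.4)
)
setkey(probs, context)   # keyed binary search keeps look-ups fast

probs["12 7"]            # candidate completions for that two-word context
```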

The plan for now is to deliver two data objects to Shiny: the word look-up object and the n-gram probability object described above.

2 Testing the algorithm

The acid test of the algorithm is prediction of a word that completes a phrase chosen by the user from either Twitter or a news article. To pass the test, the system must return the correct word in at least one out of five cases. We cannot predict the peculiarity of the phrases our users will choose. I follow an epidemiologist, Ian MacKay, who posted this: “A jellied #Christmas pud takes shape”. Correctly predicting the second, third, or fourth word of this tweet would be a challenge.

Nonetheless, we will test the application against the portion of the data set we reserve for testing. Our target is to exceed the class requirement of 20% accuracy.

3 Optimizing the algorithm

For a first go at it, we will use simple linear regression to estimate four coefficients that describe the influence of the one, two, or three words preceding the unknown word, as well as the likelihood that it is simply the most common word in the language, “the”. If simple linear regression produces satisfactory results, we will stop. If not, we will use a random forest model or a support vector machine, more advanced methods that have worked for us on other problems.
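
One way that first pass could be set up, assuming a hypothetical training frame `train` that holds, for each phrase and candidate completion, the four component probabilities (`p_uni`, `p_bi`, `p_tri`, `p_quad`) and an indicator `is_answer` of whether the candidate is the true completion; the fitted coefficients then play the role of the weights described above.

```r
# Simple linear regression as a first cut at the four weights (sketch only;
# `train` and its column names are assumed placeholders).
fit        <- lm(is_answer ~ p_uni + p_bi + p_tri + p_quad, data = train)
lambda_hat <- coef(fit)[-1]   # drop the intercept; the rest act as the weights
```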