Executive Summary

The goal of this app is to predict the next word based on text that has already been entered. For example, ‘the cute’ might predict ‘puppies’ as the next word. This will initially be accomplished by analyzing and training a model on three provided corpora of news, blog, and Twitter text.

Initial Prediction Algorithm

Initially, the prediction algorithm will be based on an n-gram model. To illustrate this model, we can extend the previous example to ‘the cute puppies’. This is a 3-gram, where 3 is the number n in n-gram. I will examine 1-grams, 2-grams, 3-grams, and 4-grams from the three corpora.
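As a minimal illustration (a base R sketch using simple whitespace tokenisation, not the app’s final implementation), n-grams of any order can be built by sliding a window of n tokens across a sentence:

```r
# Build all n-grams of a given order from one cleaned sentence (base R sketch).
make_ngrams <- function(sentence, n) {
  tokens <- unlist(strsplit(sentence, "\\s+"))   # simple whitespace tokenisation
  if (length(tokens) < n) return(character(0))   # sentence too short for this order
  sapply(seq_len(length(tokens) - n + 1), function(i) {
    paste(tokens[i:(i + n - 1)], collapse = " ")
  })
}

make_ngrams("the cute puppies barked loudly", 3)
# "the cute puppies" "cute puppies barked" "puppies barked loudly"
```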

Since 4-grams are the highest order used, the model will look up the three previous words typed and return the word that follows them with the highest frequency/probability. Using the previous example again, ‘the cute puppies’ might return ‘barked’.

In the case that there is no matching 4-gram, the algorithm will ‘back off’ and try to find the most frequent 3-gram formed from the previous two words entered. For example, if no one has said ‘the cute puppies (any word)’, the model will use ‘cute puppies’ and might predict ‘and’ as the next word if a sentence such as ‘cute puppies and kittens play’ existed in the data but none beginning with ‘the’ did. In this sense, it backs off from the longer 4-gram to a more likely 3-gram phrase.

If still nothing is found, the model will keep backing off until it reaches the 1-gram model, where it will simply return the highest-frequency word in the corpus, which happens to be ‘the’.
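A minimal sketch of this back-off lookup is shown below. It assumes the n-gram counts have already been summarised into lookup tables (here a list `ngram_tables`, indexed by n-gram order, with `prefix`, `next_word`, and `count` columns); that layout is an illustrative assumption, not the final design.

```r
# Back-off prediction sketch. ngram_tables is assumed to be a list indexed by
# n-gram order (2:4); each element is a data frame with columns
# prefix (the preceding words), next_word, and count.
predict_next <- function(words, ngram_tables, top_unigram = "the") {
  for (n in 4:2) {
    k <- n - 1                              # history length for this order
    if (length(words) < k) next             # not enough context; back off further
    prefix  <- paste(tail(words, k), collapse = " ")
    matches <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(matches) > 0) {
      return(matches$next_word[which.max(matches$count)])  # most frequent continuation
    }
  }
  top_unigram   # nothing matched: fall back to the most frequent word overall
}

# e.g. predict_next(c("the", "cute", "puppies"), ngram_tables) might return "barked"
```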

This type of model works best when trained on very large amounts of data, since its predictions are based entirely on the word sequences it has seen before.

Data Overview

The three corpora used for training consist of US news, US blog, and US Twitter text files weighing in at 196, 200, and 159 megabytes (MB) respectively.

Preparation

The raw lines were first read in. Examining the lines showed that each line was a single blog post, tweet, or news article and could contain multiple sentences, so each line was tagged and split into sentences using the Apache OpenNLP Maxent sentence detector before punctuation was removed.

This was to ensure that the n-grams were built from individual sentences. OpenNLP’s Maxent detector was chosen because it can take mid-sentence punctuation such as ‘Dr. Doogie M.D.’ into account rather than splitting it into several tiny sentences.
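For reference, one way to call the OpenNLP Maxent sentence detector from R is through the openNLP and NLP packages; the snippet below is a sketch of that approach rather than the exact code used.

```r
library(NLP)       # as.String(), annotate()
library(openNLP)   # R wrapper around the Apache OpenNLP Maxent models

split_into_sentences <- function(text) {
  s <- as.String(text)
  sent_annotator <- Maxent_Sent_Token_Annotator()
  spans <- annotate(s, sent_annotator)   # sentence-boundary annotations
  as.character(s[spans])                 # pull out each sentence as a string
}

# Should keep 'Dr.' and 'M.D.' inside one sentence rather than splitting on them.
split_into_sentences("Dr. Doogie M.D. saw the cute puppies. They barked.")
```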

Contractions were then shortened (e.g. can’t became cant), and the sentences were stripped of punctuation, numbers, and Unicode characters, with extra whitespace removed. From here, all n-grams could be created from the sentences.
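A base R sketch of this cleaning step is shown below; the exact rules used in the app may differ slightly.

```r
# Clean one sentence: shorten contractions, then strip punctuation, numbers,
# and non-ASCII (Unicode) characters, and collapse extra whitespace.
clean_sentence <- function(x) {
  x <- tolower(x)
  x <- gsub("'", "", x)                                  # can't -> cant
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop Unicode characters
  x <- gsub("[[:punct:]]", " ", x)                       # remove remaining punctuation
  x <- gsub("[0-9]+", " ", x)                            # remove numbers
  x <- gsub("\\s+", " ", x)                              # collapse whitespace
  trimws(x)
}

clean_sentence("Can't wait -- the 2 cute puppies barked!!")
# "cant wait the cute puppies barked"
```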

Summary Statistics

Summary statistics for the data are provided below. An interesting observation is that the Twitter data contain more ‘sentences’ but roughly half the average words per sentence. The Twitter data also have the largest number of unique words, despite having the lowest total word count.


Source       Raw file size   Sentences   Total words   Unique words   Avg. words per sentence
US Twitter   159 MB          3,616,732   29,849,515    424,997        8.253
US Blogs     200 MB          2,310,271   36,815,304    396,954        15.935
US News      196 MB          1,910,932   33,465,840    309,293        17.513

Word distributions

The word distributions also differ, with ‘the’ being much more prominent in the non-Twitter data. The Twitter data also contain more references to the self and more affective language, even after removing the top 25 English stop words.

[Figures: Top 50 words - Blog, News, and Twitter]

The distributions of words with the top 25 English stop words removed are shown below to highlight more substantive differences between the three corpora.

[Figures: Top 50 words with stop words removed - Blog, News, and Twitter]

Note: n-gram distributions are not yet included, as the full corpora are still being divided, processed, and recombined.

Below are the n-gram counts from a subsample of 10,000 sentences per corpus. With almost 8 million sentences in total, the raw n-gram data are very large before being compressed into summary databases.

Sample        1-grams   2-grams   3-grams   4-grams
Twitter 10k   11,992    46,552    57,448    52,528
Blogs 10k     18,401    87,314    125,143   125,927
News 10k      21,541    104,875   143,419   143,118

Future Goals

Ideally, the Shiny app might also have each of the 1-, 2-, 3-, and 4-gram models vote on the next word, picking the candidate with the most confident average. Word-frequency associations could also be used to give prediction preference to words that frequently co-occur in the same sentences.
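One hypothetical form of that voting (a sketch only, with illustrative placeholder weights, not a committed design) would be a weighted average of each n-gram model’s relative frequencies for its candidate words:

```r
# Hypothetical voting sketch: candidate_scores is assumed to be a list of four
# named numeric vectors (one per n-gram order) mapping candidate next words to
# their relative frequencies. The weights are illustrative placeholders.
vote_next <- function(candidate_scores, weights = c(0.1, 0.2, 0.3, 0.4)) {
  words <- unique(unlist(lapply(candidate_scores, names)))
  if (length(words) == 0) return(NA_character_)       # no model had a candidate
  totals <- sapply(words, function(w) {
    sum(sapply(seq_along(candidate_scores), function(n) {
      s <- candidate_scores[[n]][w]                   # NA if this order has no score for w
      weights[n] * ifelse(is.na(s), 0, s)
    }))
  })
  names(which.max(totals))                            # word with the highest weighted score
}
```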

I have also not yet decided whether to combine these corpora into one large corpus, since the distributions are fairly different and the search space would be much larger. Also, limiting words and n-grams to only those that occur more than once would cut the data size tremendously at the cost of some accuracy; the exact cut-off is still being tested.

These options are being explored to see how feasible they are for a Shiny web app, where size, responsiveness, and accuracy are all considerations.