Executive Summary

The goal of this app is to predict the next word based on text that has already been entered. For example, ‘the cute’ might predict ‘puppies’ as the next word. This will initially be accomplished by analyzing and training a model on three provided corpora of news, blog, and Twitter text.

Initial Prediction Algorithm

Initially, the prediction algorithm will be based on an n-gram model. To illustrate this model, we can extend the previous example to ‘the cute puppies’. This is a 3-gram, where 3 is the number n in n-gram. I will examine 1-grams, 2-grams, 3-grams, and 4-grams from the three corpora.
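As a minimal illustration (a base R sketch using simple whitespace tokenisation, not the app’s final implementation), n-grams of any order can be built by sliding a window of n tokens across a sentence:

```r
# Build all n-grams of a given order from one cleaned sentence (base R sketch).
make_ngrams <- function(sentence, n) {
  tokens <- unlist(strsplit(sentence, "\\s+"))   # simple whitespace tokenisation
  if (length(tokens) < n) return(character(0))   # sentence too short for this order
  sapply(seq_len(length(tokens) - n + 1), function(i) {
    paste(tokens[i:(i + n - 1)], collapse = " ")
  })
}

make_ngrams("the cute puppies barked loudly", 3)
# "the cute puppies" "cute puppies barked" "puppies barked loudly"
```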

Since 4-grams are the highest order used, the model will look up the three previous words typed and return the word that follows them with the highest frequency/probability. Using the previous example again, ‘the cute puppies’ might return ‘barked’.

In the case that there is no matching 4-gram, the algorithm will ‘back off’ and try to find the most frequent 3-gram formed from the previous two words entered. For example, if no one has said ‘the cute puppies (any word)’, the model will use ‘cute puppies’ and might predict ‘and’ as the next word if a sentence such as ‘cute puppies and kittens play’ existed in the data but none beginning with ‘the’ did. In this sense, it backs off from the longer 4-gram to a more likely 3-gram phrase.

If still nothing is found, the model will keep backing off until it reaches the 1-gram model, where it will simply return the highest-frequency word in the corpus, which happens to be ‘the’.
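A minimal sketch of this back-off lookup is shown below. It assumes the n-gram counts have already been summarised into lookup tables (here a list `ngram_tables`, indexed by n-gram order, with `prefix`, `next_word`, and `count` columns); that layout is an illustrative assumption, not the final design.

```r
# Back-off prediction sketch. ngram_tables is assumed to be a list indexed by
# n-gram order (2:4); each element is a data frame with columns
# prefix (the preceding words), next_word, and count.
predict_next <- function(words, ngram_tables, top_unigram = "the") {
  for (n in 4:2) {
    k <- n - 1                              # history length for this order
    if (length(words) < k) next             # not enough context; back off further
    prefix  <- paste(tail(words, k), collapse = " ")
    matches <- ngram_tables[[n]][ngram_tables[[n]]$prefix == prefix, ]
    if (nrow(matches) > 0) {
      return(matches$next_word[which.max(matches$count)])  # most frequent continuation
    }
  }
  top_unigram   # nothing matched: fall back to the most frequent word overall
}

# e.g. predict_next(c("the", "cute", "puppies"), ngram_tables) might return "barked"
```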

This type of model works best when trained on very large amounts of data, since its predictions are based entirely on the word sequences it has seen before.

Data Overview

The three corpora used for training consist of US news, US blog, and US Twitter text files weighing in at 196, 200, and 159 megabytes (MB) respectively.

Preparation

The raw lines were first read in. Examining the lines showed that each line was a single blog post, tweet, or news article and could contain multiple sentences, so each line was tagged and split into sentences using the Apache OpenNLP Maxent sentence detector before punctuation was removed.

This was to ensure that the n-grams were built from individual sentences. OpenNLP’s Maxent detector was chosen because it can take mid-sentence punctuation such as ‘Dr. Doogie M.D.’ into account rather than splitting it into several tiny sentences.
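For reference, one way to call the OpenNLP Maxent sentence detector from R is through the openNLP and NLP packages; the snippet below is a sketch of that approach rather than the exact code used.

```r
library(NLP)       # as.String(), annotate()
library(openNLP)   # R wrapper around the Apache OpenNLP Maxent models

split_into_sentences <- function(text) {
  s <- as.String(text)
  sent_annotator <- Maxent_Sent_Token_Annotator()
  spans <- annotate(s, sent_annotator)   # sentence-boundary annotations
  as.character(s[spans])                 # pull out each sentence as a string
}

# Should keep 'Dr.' and 'M.D.' inside one sentence rather than splitting on them.
split_into_sentences("Dr. Doogie M.D. saw the cute puppies. They barked.")
```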

Contractions were then shortened (e.g. can’t became cant), and the sentences were stripped of punctuation, numbers, and Unicode characters, with extra whitespace removed. From here, all n-grams could be created from the sentences.
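A base R sketch of this cleaning step is shown below; the exact rules used in the app may differ slightly.

```r
# Clean one sentence: shorten contractions, then strip punctuation, numbers,
# and non-ASCII (Unicode) characters, and collapse extra whitespace.
clean_sentence <- function(x) {
  x <- tolower(x)
  x <- gsub("'", "", x)                                  # can't -> cant
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop Unicode characters
  x <- gsub("[[:punct:]]", " ", x)                       # remove remaining punctuation
  x <- gsub("[0-9]+", " ", x)                            # remove numbers
  x <- gsub("\\s+", " ", x)                              # collapse whitespace
  trimws(x)
}

clean_sentence("Can't wait -- the 2 cute puppies barked!!")
# "cant wait the cute puppies barked"
```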

Summary Statistics

Summary statistics for the data are provided below. An interesting observation is that the Twitter data contain more ‘sentences’ but roughly half the average words per sentence. The Twitter data also have the largest number of unique words, despite having the lowest total word count.


Source       Raw file size   Sentences   Total words   Unique words   Avg. words per sentence
US Twitter   159 MB          3,616,732   29,849,515    424,997        8.253
US Blogs     200 MB          2,310,271   36,815,304    396,954        15.935
US News      196 MB          1,910,932   33,465,840    309,293        17.513

Word distributions

The word distributions also differ, with ‘the’ being much more prominent in the non-Twitter data. The Twitter data also contain more references to the self and more affective language, even after removing the top 25 English stop words.

[Figures: Top 50 words - Blog, News, and Twitter]

The distributions of words with the top 25 English stop words removed are shown below to highlight more substantive differences between the three corpora.

[Figures: Top 50 words with stop words removed - Blog, News, and Twitter]

Note: n-gram distributions are not yet included, as the full corpora are still being divided, processed, and recombined.

Below are the n-gram counts from a subsample of 10,000 sentences per corpus. With almost 8 million sentences in total, the raw n-gram data are very large before being compressed into summary databases.

Sample        1-grams   2-grams   3-grams   4-grams
Twitter 10k   11,992    46,552    57,448    52,528
Blogs 10k     18,401    87,314    125,143   125,927
News 10k      21,541    104,875   143,419   143,118

Future Goals

Ideally, the Shiny app might also have each of the 1-, 2-, 3-, and 4-gram models vote on the next word, picking the candidate with the most confident average. Word-frequency associations could also be used to give prediction preference to words that frequently co-occur in the same sentences.
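One hypothetical form of that voting (a sketch only, with illustrative placeholder weights, not a committed design) would be a weighted average of each n-gram model’s relative frequencies for its candidate words:

```r
# Hypothetical voting sketch: candidate_scores is assumed to be a list of four
# named numeric vectors (one per n-gram order) mapping candidate next words to
# their relative frequencies. The weights are illustrative placeholders.
vote_next <- function(candidate_scores, weights = c(0.1, 0.2, 0.3, 0.4)) {
  words <- unique(unlist(lapply(candidate_scores, names)))
  if (length(words) == 0) return(NA_character_)       # no model had a candidate
  totals <- sapply(words, function(w) {
    sum(sapply(seq_along(candidate_scores), function(n) {
      s <- candidate_scores[[n]][w]                   # NA if this order has no score for w
      weights[n] * ifelse(is.na(s), 0, s)
    }))
  })
  names(which.max(totals))                            # word with the highest weighted score
}
```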

I have also not yet decided whether to combine these corpora into one large corpus, since the distributions are fairly different and the search space would be much larger. Also, limiting words and n-grams to only those that occur more than once would cut the data size tremendously at the cost of some accuracy; the exact cut-off is still being tested.

These options are being explored to see how feasible they are for a Shiny web app, where size, responsiveness, and accuracy are all considerations.