N-Gram Fueled Word Predictor

pasacmorg
April 23rd, 2015



Motivation

  • Today's fast-paced lifestyle requires technology that can not only keep up, but can actually make you more productive.
  • At pasacmorg we've built predictive software that watches what you type and predicts the next word.
  • Over 4 million phrases from blogs, news feeds, and tweets were processed to build an accurate and robust prediction engine, delivering lightning-fast text completion that helps you spend less time typing.

Description of the Algorithm

  • The prediction algorithm uses a database of n-grams where n ranges from 2 to 5.
  • The last 1 to 4 words of the supplied phrase are extracted and used to find all n-grams whose first n-1 words match; the final word of each matching n-gram is a candidate prediction.
  • Candidate n-grams are then sorted in descending order, first by n and then by frequency. The highest-order n-gram with the highest frequency is chosen as the prediction.
  • Frequencies from all matching lower-order n-grams are used to break ties. If the tie persists, the first n-gram in the list is chosen.
  • Should no n-grams match, the most frequent unigram ('the') is chosen as the prediction. A sketch of this backoff logic appears below.
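For concreteness, here is a minimal Python sketch of the backoff logic described above. The NGRAM_COUNTS table and the predict() function are hypothetical stand-ins for the app's actual n-gram database and prediction routine, and lower-order tie-breaking is omitted for brevity.

    # Minimal sketch of the backoff prediction. NGRAM_COUNTS is a
    # hypothetical table mapping a context tuple (the first n-1 words of
    # an n-gram) to a dict of {next_word: frequency}.
    NGRAM_COUNTS = {
        ("look", "at"): {"the": 30, "me": 12},
        ("at",): {"the": 120, "least": 45},
    }

    MOST_FREQUENT_UNIGRAM = "the"  # fallback when nothing matches

    def predict(phrase):
        words = phrase.lower().split()
        # Try the longest available context first (up to 4 words), backing
        # off to shorter contexts; this mirrors sorting descending by n.
        for n in range(min(4, len(words)), 0, -1):
            candidates = NGRAM_COUNTS.get(tuple(words[-n:]))
            if candidates:
                # Highest-order match found: return its most frequent completion.
                return max(candidates, key=candidates.get)
        return MOST_FREQUENT_UNIGRAM

    print(predict("look at"))  # -> 'the'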

App Instructions and Functional Description

  • Enter a phrase into the text box and press the Submit button.
  • The algorithm is called to return a single-word prediction.
  • Candidate n-gram statistics are displayed in a table, and a word cloud is displayed for the 16 most frequently predicted words.
  • Words in the word cloud are good candidates to add to the existing phrase in the text box, enabling rudimentary n-gram 'babbling' (sketched after this list).
  • Babbling can be amusing. I find the generated sentence using predicted words starting with 'At' and ending with 'me' quite humorous. Enjoy!
  • https://pasacmorg.shinyapps.io/capapp
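As a hypothetical illustration of babbling, the loop below reuses the predict() sketch from the algorithm section and repeatedly appends the predicted word to the phrase. The babble() name is introduced here purely for illustration; with the toy table above the output quickly degenerates, whereas the full n-gram database produces more varied text.

    def babble(seed, steps=8):
        # Repeatedly feed the growing phrase back into the predictor and
        # append the predicted next word.
        words = seed.split()
        for _ in range(steps):
            words.append(predict(" ".join(words)))
        return " ".join(words)

    print(babble("At"))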

Appendix

  • Repeated testing on holdout data yields an average accuracy of 13.5%.
  • In an attempt to employ prediction algorithms not driven by Markov chains, a number of additional features were constructed for each n-gram, including size of training set (categorical), normalized frequency (percent), and cumulative normalized frequency (percent).
  • Both logistic regression and random forests were trained on a binomial outcome: whether the candidate correctly predicted the next word (a sketch of this setup follows the list).
  • Neither method showed any lift over random selection.
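A minimal sketch of this experiment, assuming scikit-learn. The feature matrix here is filled with randomly generated placeholder values solely to show the shape of the setup; the real features were derived from the n-gram database, and none of the numbers below are the project's results.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data: one row per candidate n-gram, with the three
    # engineered features named above and a 0/1 label marking whether the
    # candidate was the true next word.
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.integers(0, 3, 1000),    # training-set size (categorical, coded 0-2)
        rng.uniform(0, 100, 1000),   # normalized frequency (percent)
        rng.uniform(0, 100, 1000),   # cumulative normalized frequency (percent)
    ])
    y = rng.integers(0, 2, 1000)     # binomial outcome

    # Fit both models and report cross-validated accuracy.
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))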