N-Gram Fueled Word Predictor
pasacmorg
April 23rd, 2015
Motivation
- Today's fast-paced lifestyle requires technology that not only keeps up, but actually makes you more productive.
- At pasacmorg we've built predictive software that watches what you type and predicts the next word.
- Over 4 million phrases from blogs, news feeds, and tweets were processed to build an accurate and robust prediction engine, delivering lightning-fast text completion that helps you spend less time typing.
Description of the Algorithm
- The prediction algorithm uses a database of n-grams where n ranges from 2 to 5.
- The last 1 to 4 words of the supplied phrase are extracted and used to find all n-grams whose initial words match; each matching n-gram's final word is a candidate prediction.
- Candidate n-grams are then sorted in descending order, first by n and then by frequency. The highest-order n-gram with the highest frequency is chosen as the prediction.
- Frequencies from all matching lower-order n-grams are used to break ties. If the tie persists, the first n-gram in the list is chosen.
- Should no n-grams match, the most frequent unigram ('the') is chosen as the prediction. (A minimal sketch of this backoff logic follows this list.)
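
A minimal sketch of the lookup-and-backoff logic in R. The `ngrams` data frame layout (columns n, prefix, word, freq) and the function name are assumptions for illustration, not the app's actual implementation, and the lower-order tie-break is omitted for brevity:

```r
# Sketch of the backoff lookup described above; table layout is assumed,
# and the lower-order tie-break step is omitted.
predict_next_word <- function(phrase, ngrams) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) == 0) return("the")
  # Try the longest usable prefix first: 4 trailing words query the 5-grams,
  # 3 the 4-grams, and so on down to the bigrams.
  for (k in min(4, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- ngrams[ngrams$n == k + 1 & ngrams$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]  # most frequent candidate wins
      return(hits$word[1])
    }
  }
  "the"  # no n-gram matched: back off to the most frequent unigram
}
```

With a populated table, a call such as `predict_next_word("at the end of the", ngrams)` would return the word most often observed after 'end of the' in the corpus.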
App Instructions and Functional Description
- Enter a phrase into the text box and press the Submit button.
- The algorithm is called to return a single word prediction.
- Candidate n-gram statistics are displayed in a table and a word cloud is displayed for the 16 most frequently predicted words.
- Words in the word cloud are good candidates to add to the existing phrase in the text box for rudimentary n-gram 'babbling' functionality.
- Babbling can be amusing. I find the sentence generated from predicted words, starting with 'At' and ending with 'me', quite humorous. Enjoy!
- https://pasacmorg.shinyapps.io/capapp
Appendix
- Repeated testing on holdout data yields an average accuracy of 13.5%.
- In an attempt to move beyond Markov-chain-driven prediction, a number of additional features were constructed for each n-gram, including training-set size (categorical), normalized frequency (percent), and cumulative normalized frequency (percent).
- Both logistic regression and Random Forests were trained on a binomial outcome: whether the candidate correctly predicted the next word (sketched below).
- There was no lift over random selection for either of these methods.
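
A hedged sketch of the shape of that experiment in R. The data frame, column names, and simulated values below are illustrative assumptions, not the project's actual data or code:

```r
library(randomForest)

# Sketch of the appendix experiment; all names and values here are
# illustrative assumptions, and the labels are purely simulated.
set.seed(42)
n <- 1000
candidates <- data.frame(
  train_size    = factor(sample(c("small", "medium", "large"), n, replace = TRUE)),
  norm_freq     = runif(n, 0, 100),   # normalized frequency (percent)
  cum_norm_freq = runif(n, 0, 100),   # cumulative normalized frequency (percent)
  correct       = factor(rbinom(n, 1, 0.15))  # placeholder binomial labels
)

# Logistic regression on the binomial outcome.
logit_fit <- glm(correct ~ train_size + norm_freq + cum_norm_freq,
                 data = candidates, family = binomial)

# Random forest classifier on the same engineered features.
rf_fit <- randomForest(correct ~ train_size + norm_freq + cum_norm_freq,
                       data = candidates, ntree = 200)

summary(logit_fit)
print(rf_fit)  # out-of-bag error rate gauges any lift over random selection
```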