N-Gram Fueled Word Predictor

pasacmorg
April 23rd, 2015



Motivation

  • Today's fast-paced lifestyle requires technology that can not only keep up, but can actually make you more productive.
  • At pasacmorg we've built predictive software that watches what you type and predicts the next word.
  • Over 4 million phrases from blogs, news feeds, and tweets were processed to build an accurate and robust prediction engine, delivering lightning-fast text completion that helps you spend less time typing.

Description of the Algorithm

  • The prediction algorithm uses a database of n-grams where n ranges from 2 to 5.
  • The last 1 to 4 words of the supplied phrase are extracted and used to find all n-grams whose first n-1 words match; the final word of each matching n-gram is a candidate prediction.
  • Candidate n-grams are then sorted in descending order, first by n and then by frequency. The highest-order n-gram with the highest frequency is chosen as the prediction.
  • Frequencies from all matching lower-order n-grams are used to break ties. If the tie persists, the first n-gram in the list is chosen.
  • Should no n-grams match, the most frequent unigram ('the') is chosen as the prediction. A sketch of this backoff logic appears below.
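For concreteness, here is a minimal Python sketch of the backoff logic described above. The NGRAM_COUNTS table and the predict() function are hypothetical stand-ins for the app's actual n-gram database and prediction routine, and lower-order tie-breaking is omitted for brevity.

    # Minimal sketch of the backoff prediction. NGRAM_COUNTS is a
    # hypothetical table mapping a context tuple (the first n-1 words of
    # an n-gram) to a dict of {next_word: frequency}.
    NGRAM_COUNTS = {
        ("look", "at"): {"the": 30, "me": 12},
        ("at",): {"the": 120, "least": 45},
    }

    MOST_FREQUENT_UNIGRAM = "the"  # fallback when nothing matches

    def predict(phrase):
        words = phrase.lower().split()
        # Try the longest available context first (up to 4 words), backing
        # off to shorter contexts; this mirrors sorting descending by n.
        for n in range(min(4, len(words)), 0, -1):
            candidates = NGRAM_COUNTS.get(tuple(words[-n:]))
            if candidates:
                # Highest-order match found: return its most frequent completion.
                return max(candidates, key=candidates.get)
        return MOST_FREQUENT_UNIGRAM

    print(predict("look at"))  # -> 'the'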

App Instructions and Functional Description

  • Enter a phrase into the text box and press the Submit button.
  • The algorithm is called to return a single-word prediction.
  • Candidate n-gram statistics are displayed in a table, and a word cloud is displayed for the 16 most frequently predicted words.
  • Words in the word cloud are good candidates to add to the existing phrase in the text box, enabling rudimentary n-gram 'babbling' (sketched after this list).
  • Babbling can be amusing. I find the generated sentence using predicted words starting with 'At' and ending with 'me' quite humorous. Enjoy!
  • https://pasacmorg.shinyapps.io/capapp
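As a hypothetical illustration of babbling, the loop below reuses the predict() sketch from the algorithm section and repeatedly appends the predicted word to the phrase. The babble() name is introduced here purely for illustration; with the toy table above the output quickly degenerates, whereas the full n-gram database produces more varied text.

    def babble(seed, steps=8):
        # Repeatedly feed the growing phrase back into the predictor and
        # append the predicted next word.
        words = seed.split()
        for _ in range(steps):
            words.append(predict(" ".join(words)))
        return " ".join(words)

    print(babble("At"))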

Appendix

  • Repeated testing on holdout data yields an average accuracy of 13.5%.
  • In an attempt to employ prediction algorithms not driven by Markov chains, a number of additional features were constructed for each n-gram, including size of training set (categorical), normalized frequency (percent), and cumulative normalized frequency (percent).
  • Both logistic regression and random forests were trained on a binomial outcome: whether the candidate correctly predicted the next word (a sketch of this setup follows the list).
  • Neither method showed any lift over random selection.
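A minimal sketch of this experiment, assuming scikit-learn. The feature matrix here is filled with randomly generated placeholder values solely to show the shape of the setup; the real features were derived from the n-gram database, and none of the numbers below are the project's results.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data: one row per candidate n-gram, with the three
    # engineered features named above and a 0/1 label marking whether the
    # candidate was the true next word.
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.integers(0, 3, 1000),    # training-set size (categorical, coded 0-2)
        rng.uniform(0, 100, 1000),   # normalized frequency (percent)
        rng.uniform(0, 100, 1000),   # cumulative normalized frequency (percent)
    ])
    y = rng.integers(0, 2, 1000)     # binomial outcome

    # Fit both models and report cross-validated accuracy.
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))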