predNext

Chad Banicki
October 2016

Next Word Prediction

This product provides a prediction for the likely next word in a typed phrase, as well as words found in a similar context. Some of the benefits of this product include:

  1. Near real time prediction on user-typed words and phrases.
  2. Graphical representations of likely next words based upon user input.
  3. Predictions that take into account all of the words within a typed phrase, not just the ones at the end.


Data Considerations:

* Overly common terms were trimmed to improve performance.

* Stopwords were not removed as they sometimes helped predictions.

* Bad words (profanity) were not removed since this is not a commercial application.

Prediction Algorithm

A large set of data were taken from blogs, Twitter, and news articles. A 35% sample of the original data were processed to remove unwanted characters, and then tokenized into ngrams from 2 to 5 words long.
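To make the cleaning and tokenizing step concrete, here is a minimal Python sketch. It is an illustration only, not the code behind this product: the function names `clean` and `build_ngram_counts`, the character filter, and the two sample sentences are all assumptions.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, replace unwanted characters with spaces, and split into words.

    Assumption: "unwanted characters" means anything other than letters,
    apostrophes, and spaces. The real filter may differ.
    """
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def build_ngram_counts(lines, min_n=2, max_n=5):
    """Count every ngram of length min_n..max_n, one Counter per length."""
    counts = {n: Counter() for n in range(min_n, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical two-line corpus, used only to show the shape of the output.
sample = ["You have travelled a long way.", "It is a long way home!"]
counts = build_ngram_counts(sample)
```

Here `counts[2]` holds the 2-word ngrams, `counts[3]` the 3-word ngrams, and so on up to 5, each keyed by a tuple of words.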

  1. Ngrams
    • Longer ngrams were given more consideration.
  2. Markov Chain
    • Phrases closer to the end of the input were given greater consideration.
  3. ‘Stupid Backoff Model’
    • Match rates were backed off to shorter ngrams.
  4. Trying something new
    • Entered terms were also tokenized into ngrams, giving consideration to all words in a phrase.
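The backoff idea in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the product's actual implementation: the names `train` and `predict` are hypothetical, the backoff factor of 0.4 is the value suggested in the original Stupid Backoff paper rather than one confirmed for this model, and the sketch stops at 1-word contexts because the ngrams here start at 2 words. It does not cover the fourth step (tokenizing the entered phrase into ngrams).

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # assumed backoff factor; not confirmed for this product

def train(sentences, max_n=5):
    """Map each context (1 to max_n-1 words) to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                model[tuple(context)][nxt] += 1
    return model

def predict(model, phrase, max_n=5):
    """Score candidate next words, trying the longest context first."""
    tokens = phrase.lower().split()
    scores = {}
    penalty = 1.0
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        followers = model.get(context)
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep only the score from the longest context that matched.
                scores.setdefault(word, penalty * count / total)
        # Each step down to a shorter context is discounted by ALPHA.
        penalty *= ALPHA
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy corpus for demonstration only.
model = train([
    "you have travelled a long way",
    "a long time ago",
    "a long way home",
])
result = predict(model, "You have travelled a long")
```

With this toy corpus, "way" matches the full 4-word context and ranks first, while "time" is only found after backing off to the 2-word context "a long" and is discounted accordingly.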

Product Usage and Design:

Any NLP prediction model will sometimes miss the intended word. For those cases, this product also shows the user a wordcloud of alternative suggestions, weighted by each word's relative score in the model.
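One way to turn model scores into wordcloud weights is to keep the top candidates and rescale their scores to sum to 1. The Python sketch below is an assumption about how this could work, not a description of the product's plotting code; `wordcloud_weights`, `top_n`, and the example scores are hypothetical.

```python
def wordcloud_weights(scores, top_n=10):
    """Return relative weights (summing to 1) for the top_n highest-scoring words."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(score for _, score in top)
    if total == 0:
        return {}
    return {word: score / total for word, score in top}

# Made-up scores, used only to illustrate the rescaling.
weights = wordcloud_weights({"way": 3.0, "time": 1.0, "day": 1.0})
```

The resulting weights could then be passed to any wordcloud plotting tool as the word sizes.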

What's Next?

[1] "Entered Phrase: You have travelled a long"
[1] "Predicted Word:"
[1] "way"

Predicting longer phrases can still be a challenge for any model, including this one. Possible next steps include increasing the size of the corpus, increasing the size of the ngrams, and improving the smoothing used in this model. Revisiting questions that were particularly difficult to resolve, such as whether to remove stopwords and how to implement a good backoff, might also help improve this model. An effort to tokenize the entered phrase, in addition to the dictionary data, seemed to have some benefits but still didn’t resolve some issues related to predicting longer phrases.