Chad Banicki
October 2016
This product predicts the most likely next word for a typed phrase, along with words found in similar contexts.
Data Considerations:
* Overly common terms were trimmed to improve performance.
* Stopwords were not removed as they sometimes helped predictions.
* Profanity ("badwords") was not removed since this is not a commercial application.
A large set of data was collected from blogs, Twitter, and news articles. A 35% sample of the original data was processed to remove unwanted characters and then tokenized into n-grams of 2 to 5 words.
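As a rough illustration, that pipeline could be reproduced with the quanteda package; the package choice and file names here are assumptions for the sketch, not the app's actual code:

```r
library(quanteda)

# A minimal sketch of the preprocessing step. The file names are
# hypothetical; the actual corpus combines blog, Twitter, and news text.
raw <- c(readLines("blogs.txt"), readLines("twitter.txt"), readLines("news.txt"))

# Take the 35% sample described above
set.seed(42)
sampled <- sample(raw, round(0.35 * length(raw)))

# Strip unwanted characters, then build n-grams from 2 to 5 words long
toks <- tokens(sampled, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)   # stopwords deliberately kept (see above)
ngrams <- tokens_ngrams(toks, n = 2:5, concatenator = " ")

# Frequency counts like these back the prediction tables
topfeatures(dfm(ngrams), 10)
```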
Any NLP prediction model will sometimes miss. In those cases, a helpful feature of this product is a word cloud of alternative suggestions, sized by each candidate's relative model score (a sketch follows the example below).
[1] "Entered Phrase: You have travelled a long"
[1] "Predicted Word:"
[1] "way"
Predicting longer phrases remains a challenge for any model, including this one. Increasing the size of the corpus and of the n-grams, and improving the smoothing used in this model, are natural next steps. Resolving the questions that proved especially difficult, such as whether to remove stopwords and how to implement a good backoff, might also improve the model. Tokenizing the entered phrase the same way as the dictionary data seemed to help, but did not fully resolve the issues with predicting longer phrases.
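One common backoff choice is "stupid backoff" (Brants et al., 2007). The sketch below is an illustration rather than this app's implementation; it assumes the frequency tables are a list of data frames, one per n-gram order, each with prefix, word, and count columns:

```r
# A sketch of stupid backoff. `tables` is assumed to be a list of data
# frames (tables[[1]] = bigrams, ..., tables[[4]] = 5-grams), each with
# columns: prefix, word, count (a hypothetical structure).
predict_backoff <- function(phrase, tables, alpha = 0.4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  if (length(words) == 0) return(NULL)
  penalty <- 1
  for (n in min(length(tables) + 1, length(words) + 1):2) {
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits <- tables[[n - 1]][tables[[n - 1]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # Score = backoff penalty times the relative frequency of each word
      hits$score <- penalty * hits$count / sum(hits$count)
      return(hits[order(-hits$score), c("word", "score")])
    }
    penalty <- penalty * alpha   # back off to a shorter context
  }
  NULL   # no context matched at any order
}
```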