23/02/2021

Executive Summary

Our goal was to create a simple and elegant app that predicts the next word in a sentence, given a meaningful sequence of input words.

To provide this functionality, we built an N-gram probabilistic language model and used the Stupid Backoff algorithm to rank next-word candidates.

Our application, Your Next Word, achieves the following accuracy on an independent test set. It generates predictions in less than 0.348 s and uses 69.5 MB of disk space.

   Top_1_Accuracy  Top_3_Accuracy  Top_5_Accuracy
5          14.83%          22.85%          27.58%

How our model works

We built our app using an N-gram probabilistic language model. Using our chosen corpus of (1) 899,288 blog posts, (2) 77,259 news articles, and (3) 2,360,148 tweets, we assigned probabilities to sequences of N words. We then estimated the probability of the last word of an N-gram given the previous N-1 words.
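As a rough illustration of that estimate, here is a minimal Python sketch. It assumes the usual relative-frequency estimate, count(context + word) / count(context), and a whitespace-tokenized toy corpus. The function names and the toy sentences are hypothetical and are not taken from the app.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    full_counts = Counter(ngrams(tokens, n))         # N-gram counts
    context_counts = Counter(ngrams(tokens, n - 1))  # (N-1)-gram counts
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate available
    return full_counts[context + (word,)] / context_counts[context]


# Toy corpus (hypothetical): "i want" occurs 3 times, "i want to" twice.
toy_tokens = "i want to eat . i want to sleep . i want a nap".split()
print(conditional_probability(toy_tokens, ["i", "want"], "to"))
```

On the toy corpus, "to" follows "i want" in two of the three occurrences of that context, so the estimate is 2/3.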

To rank next-word candidates, we applied a mechanism called Stupid Backoff. The scoring function is defined by the following formula:
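In its standard form, as published by Brants et al. (2007), the score of a candidate word w_i given the preceding k-1 words is:

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

Here f(·) is the number of times a word sequence occurs in the corpus, N is the total number of words in the corpus (in this formula only, not the N-gram order), and α is a fixed backoff factor. Brants et al. suggest α = 0.4; the value used in the app is an assumption on our part if read from this formula, since it is not stated here. The scores are relative frequencies rather than normalized probabilities, which is sufficient for ranking candidates.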

How we chose the parameters

N-gram to use

  • We chose quadgrams (N = 4) for our app because moving from quadgrams to fivegrams (N = 5) improves prediction accuracy only minimally.

How to use the app