Purpose and Data Used

The purpose of this application is to provide a user-friendly text prediction model that takes a user's input (one or more words) and predicts the next word in the phrase. To build it, we used English-language data from the following sources in the United States:

  • Twitter
  • News Articles
  • Blog Posts

The Model: nGrams

We tokenized our data into bigrams, trigrams, and quadgrams (that is, two-, three-, and four-word phrases) and counted the frequency of each.

Given the number of words the user provides, the model looks up the matching nGrams and returns the most frequent next word. For example, the most common bigrams are:

     Two_Gram Frequency
1   right now      2352
2    new york      1954
3   last year      1916
4  last night      1540
5 high school      1420
6   years ago      1339
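
As an illustration of how such a frequency table can be built, here is a minimal Python sketch. The function names (`tokenize`, `count_ngrams`) and the sample sentences are hypothetical, and the whitespace tokenizer is a simplifying assumption; it is not necessarily how the application itself processes the data.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    # A real pipeline would likely also handle punctuation and profanity.
    return text.lower().split()

def count_ngrams(sentences, n):
    # Count every run of n consecutive words across all sentences.
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Made-up sample text, for illustration only.
sample = [
    "I am busy right now",
    "right now I live in New York",
    "she moved to New York last year",
]

bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)

# Print the bigrams from most to least frequent.
for phrase, frequency in bigrams.most_common(3):
    print(" ".join(phrase), frequency)
```

Calling `count_ngrams` with `n` set to 2, 3, or 4 gives the bigram, trigram, and quadgram tables respectively.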

The Model: Backoff Model

To improve accuracy, if the model cannot find a prediction for the input text at a given nGram size, it backs off to the next shorter nGram and tries again.

For example, if the user enters three words and the model finds no match in the quadgrams, it drops the first word and looks for the best match for the remaining two words in the trigrams.
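
The backoff lookup described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names `build_tables` and `predict_next` are hypothetical, the corpus is made up, and the sketch always picks the single most frequent continuation without any smoothing or weighting, which may differ from the application's actual implementation.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    # tables[n] maps an (n-1)-word context to a Counter of the words that follow it.
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[n][context][next_word] += 1
    return tables

def predict_next(text, tables, max_n=4):
    tokens = text.lower().split()
    # Start with the longest nGram the input allows, then back off one size at a time.
    for n in range(min(max_n, len(tokens) + 1), 1, -1):
        context = tuple(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # no nGram of any size matched

# Made-up corpus, for illustration only.
corpus = [
    "see you last night",
    "it happened last night",
    "we met last year",
]
tables = build_tables(corpus)

found_in_quadgrams = predict_next("it happened last", tables)  # "night"
found_by_backoff = predict_next("they left last", tables)      # "night", via the bigrams
no_match = predict_next("zebra", tables)                       # None
```

In the second call, neither the quadgram context "they left last" nor the trigram context "left last" exists in the corpus, so the lookup falls through to the bigram context "last".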

The Application

Application Screenshot