11/29/2019

Model Description

  • The model uses the stupid backoff method to determine its prediction.
  • It starts by looking for matches of the input phrase at the 7-gram level down to the 1-gram level. The predictions are returned by the highest number of occurences at the largest n-gram size.
  • If there are no matches of the phrase amongst the data set, the model samples 3 of the top 10 occuring words in the dataset, based on the number of occurences (this is to avoid the same three words being produced in the same order for unkown phrases).

Data Used for the Model

  • The model uses data from 75% of the blogs, news, and twitter data.
  • Due to processing times, 1/3 of the 75% of the data was used for 7-grams, 6-grams, and 5-grams.
  • 4-grams, 3-grams, and 2-grams each used a separate 20% of the data while 1-grams used the remaining 15%.
  • In laymen’s terms, the 75% of the total data was divided twice among the 5-7 grams and again among the 1-4 grams to achieve greater coverage of the data while reducing processing times.

How It Works

  • Type a phrase in the text box ommitting the word you want predicted and hit the submit button.
  • You will see the application’s first, second, and third choice of predicted words.
  • Compare it to the word you were thinking of using.

It’s that simple!

Here’s the link to the application!

Example

  • Phrase: “Let’s go to the”
  • Our next word: “mall”
  • Predictions:
[1] "Prediction1:  beach"
[1] "Prediction2:  mall"
[1] "Prediction3:  bathroom"

Our next word was the second prediction of the application!