Forecasting the Next Word Using Linear Interpolation.
Luis Alberto Alaniz Castillo
October 5, 2018
Instructions for the predictive model.
- Enter a sentence with at least two words; the model predicts the next word, showing the three most likely choices.
The Algorithm.
- Takes the last two words of a sentence and calculates the probability of the next word conditional on the previous words.
- Chooses the three words with the highest conditional probabilities and shows them to the user.
- The steps of the algorithm are:
- Using the “tm” package, convert the corpora provided by SwiftKey into a Volatile Corpus.
- Use the first 90% of the corpus as a training set, the next 5% as a validation set, and the final 5% as a test set.
- Load the “tm” Volatile Corpus as a “Quanteda” corpus.
- Tokenize, removing numbers, punctuation, Twitter symbols, URLs, and other symbols (see the sketch below).
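A minimal sketch of these preprocessing steps in R, assuming one of the SwiftKey files ("en_US.blogs.txt") as the input; the file name and the line-based split are illustrative assumptions, not the original code.

```r
library(tm)
library(quanteda)

# Read one SwiftKey file (illustrative file name)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# 90% training / 5% validation / 5% test split by line position
n     <- length(lines)
train <- lines[1:floor(0.90 * n)]
valid <- lines[(floor(0.90 * n) + 1):floor(0.95 * n)]
test  <- lines[(floor(0.95 * n) + 1):n]

# tm Volatile Corpus, then loaded as a quanteda corpus
vcorp      <- VCorpus(VectorSource(train))
train_corp <- corpus(vcorp)

# Tokenize, dropping numbers, punctuation, symbols, and URLs
train_toks <- tokens(train_corp,
                     remove_numbers = TRUE,
                     remove_punct   = TRUE,
                     remove_symbols = TRUE,
                     remove_url     = TRUE)
```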
The Algorithm (continued).
- The steps of the algorithm are:
- Create unigram, bigram and trigram tokens from the Corpus.
- Create a Document-Feature Matrix (DFM) for each set of tokens in the previous bullet.
- Get text statistics (frequencies) from each DFM and save them as “dplyr” tbl_df objects.
- Get maximum likelihood conditional probabilities for unigrams, bigrams and trigrams.
- Get the weights for linear interpolation by maximizing the likelihood over the validation set, using training-set n-gram probabilities that depend on the weights (a sketch follows this list).
- Calculate the perplexity of the model in the test set.
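A minimal sketch of the n-gram counting and interpolation steps, reusing the `train_toks` tokens from the previous sketch; the variable names and the `interp_prob` helper are illustrative assumptions, not the original code.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3.0
library(dplyr)

# Unigram, bigram, and trigram tokens and their document-feature matrices
toks1 <- train_toks
toks2 <- tokens_ngrams(train_toks, n = 2, concatenator = " ")
toks3 <- tokens_ngrams(train_toks, n = 3, concatenator = " ")
dfm1  <- dfm(toks1)
dfm2  <- dfm(toks2)
dfm3  <- dfm(toks3)

# Frequencies saved as dplyr tables
freq1 <- as_tibble(textstat_frequency(dfm1))
freq2 <- as_tibble(textstat_frequency(dfm2))
freq3 <- as_tibble(textstat_frequency(dfm3))

# Maximum-likelihood conditional probability of a trigram:
#   P_ML(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
# Interpolated probability used by the model:
#   P(w3 | w1 w2) = l3 * P_ML(w3 | w1 w2) + l2 * P_ML(w3 | w2) + l1 * P_ML(w3)
# with l1 + l2 + l3 = 1 chosen to maximize the likelihood of the validation set.
interp_prob <- function(p_uni, p_bi, p_tri, lambda) {
  lambda[1] * p_uni + lambda[2] * p_bi + lambda[3] * p_tri
}
```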
Evaluation of the model on the test set.
- The resulting perplexity of the model on the test set, excluding start-of-sentence and end-of-sentence tokens, is 17.3. Perplexity measures how well a probability model predicts a sample; a lower perplexity means the model makes better predictions (a sketch of the calculation follows).
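A minimal sketch of the perplexity calculation, assuming a vector `log_probs` holding the interpolated log-probabilities of the test-set words (the name is illustrative, not the original code).

```r
# Perplexity over N words:
#   PP = exp( -(1/N) * sum(log P(w_i | w_{i-2} w_{i-1})) )
perplexity <- function(log_probs) {
  exp(-mean(log_probs))
}
```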
Final remarks.
- The model was conceived so that most of the calculations are done in R beforehand, producing a database of 62.9 MB.
- In practice, the Shiny application only has to search that database for the words entered by the user, sort the results, and return the most likely words (sketched below).
- The idea was to keep the calculations done in Shiny simple while loading a database of reasonable size.
- n-grams with fewer than 7 occurrences were pruned, which can produce poor results when the entered words are rare.
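A minimal sketch of the lookup the Shiny application performs, assuming a precomputed table `ngram_db` with columns `w1`, `w2`, `prediction`, and `prob`; these names are illustrative, not the actual schema of the 62.9 MB database.

```r
library(dplyr)

# Return the three most likely next words for a sentence
predict_next <- function(sentence, ngram_db) {
  words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
  last2 <- tail(words, 2)
  ngram_db %>%
    filter(w1 == last2[1], w2 == last2[2]) %>%
    arrange(desc(prob)) %>%
    head(3) %>%
    pull(prediction)
}
```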