Forecasting the Next Word Using Linear Interpolation.

Luis Alberto Alaniz Castillo
5th of October, 2018

Instructions for the predictive model.

  • Enter a sentence with at least two words, and the model will predict the next word, showing the three most likely choices.

The Algorithm.

  • Takes the last two words of a sentence and calculates the probability of the next word conditional on the previous words.
  • Chooses the three words with the highest conditional probabilities and shows them to the user.
  • The steps of the algorithm are:
    • Using the “tm” package, convert the corpora provided by SwiftKey into a Volatile Corpus.
    • Use the first 90% of the corpus as a training set, the next 5% as a validation set, and the last 5% as a test set.
    • Load the “tm” Volatile Corpus as a “quanteda” Corpus.
    • Tokenize, removing numbers, punctuation, Twitter symbols, URLs, and other symbols (a sketch of these steps follows this list).
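
A minimal R sketch of these preparation steps, assuming one of the SwiftKey files is read from disk (the file path and the glob pattern used to drop Twitter handles and hashtags are illustrative, not the author's exact code):

```r
library(tm)
library(quanteda)

# Read one of the raw SwiftKey files and wrap it in a tm Volatile Corpus
# (the path is illustrative; the project uses the blogs, news, and twitter files).
lines <- readLines("en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
vc    <- VCorpus(VectorSource(lines))

# 90% / 5% / 5% split into training, validation, and test sets.
n         <- length(lines)
train_txt <- lines[1:floor(0.90 * n)]
valid_txt <- lines[(floor(0.90 * n) + 1):floor(0.95 * n)]
test_txt  <- lines[(floor(0.95 * n) + 1):n]

# quanteda::corpus() also accepts a tm VCorpus; here the training text is
# passed directly as a character vector.
train_corp <- corpus(train_txt)

# Tokenize, dropping numbers, punctuation, symbols, and URLs; removing tokens
# that start with "@" or "#" is a rough stand-in for the "Twitter symbols" step.
toks <- tokens(train_corp,
               remove_numbers = TRUE,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_remove(toks, pattern = c("@*", "#*"))
```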

The Algorithm (continued).

  • The steps of the algorithm are:
    • Create unigram, bigram, and trigram tokens from the corpus.
    • Create Document-Feature Matrices (DFM) from each set of tokens in the previous bullet.
    • Get the text statistics (frequencies) from each DFM and save them as “dplyr” tbl_df objects.
    • Get maximum likelihood conditional probabilities for unigrams, bigrams, and trigrams (the first sketch after this list illustrates these steps).
    • Get the linear interpolation weights by maximizing the likelihood of the validation set, computed from training-set n-gram probabilities that depend on the weights (see the second sketch after this list).
    • Calculate the perplexity of the model on the test set.
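
A sketch of how the counts and maximum likelihood probabilities might be assembled from the training tokens ("toks" in the previous sketch). The table and column names (p1, p2, p3, w1, w2, w3, p) are assumptions used for illustration; in quanteda 3 and later, textstat_frequency() lives in the quanteda.textstats package:

```r
library(quanteda)
library(quanteda.textstats)   # textstat_frequency() for quanteda >= 3
library(dplyr)
library(tidyr)

# Unigram, bigram, and trigram tokens; tokens_ngrams() joins words with "_".
toks2 <- tokens_ngrams(toks, n = 2)
toks3 <- tokens_ngrams(toks, n = 3)

# Document-Feature Matrices and their frequency tables as tibbles.
freq1 <- as_tibble(textstat_frequency(dfm(toks)))
freq2 <- as_tibble(textstat_frequency(dfm(toks2)))
freq3 <- as_tibble(textstat_frequency(dfm(toks3)))

# Maximum likelihood estimates:
#   P(w)           = count(w)        / total tokens
#   P(w2 | w1)     = count(w1 w2)    / count(w1)
#   P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
p1 <- freq1 %>%
  transmute(w = feature, p = frequency / sum(frequency))

p2 <- freq2 %>%
  separate(feature, into = c("w1", "w2"), sep = "_") %>%
  left_join(freq1 %>% select(w1 = feature, c1 = frequency), by = "w1") %>%
  transmute(w1, w2, p = frequency / c1)

p3 <- freq3 %>%
  separate(feature, into = c("w1", "w2", "w3"), sep = "_") %>%
  left_join(freq2 %>%
              separate(feature, into = c("w1", "w2"), sep = "_") %>%
              select(w1, w2, c12 = frequency),
            by = c("w1", "w2")) %>%
  transmute(w1, w2, w3, p = frequency / c12)
```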
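One way to estimate the interpolation weights is to maximize the validation-set log-likelihood numerically; the sketch below uses optim() with a softmax re-parameterization. The valid_p1/valid_p2/valid_p3 vectors and the handling of unseen n-grams are assumptions, not necessarily the author's exact procedure:

```r
# Interpolated trigram probability with weights l1 + l2 + l3 = 1:
#   P(w3 | w1, w2) = l3 * P_ML(w3 | w1, w2) + l2 * P_ML(w3 | w2) + l1 * P_ML(w3)
# valid_p3, valid_p2, valid_p1 are assumed to hold, for each validation token,
# its trigram, bigram, and unigram ML probability from the training set
# (0 when unseen; out-of-vocabulary words are assumed mapped to an <unk> token
# so the unigram probability is never 0).

neg_loglik <- function(par, p3, p2, p1) {
  l <- exp(par) / sum(exp(par))   # softmax: weights stay positive and sum to 1
  -sum(log(l[3] * p3 + l[2] * p2 + l[1] * p1))
}

fit    <- optim(par = c(0, 0, 0), fn = neg_loglik,
                p3 = valid_p3, p2 = valid_p2, p1 = valid_p1)
lambda <- exp(fit$par) / sum(exp(fit$par))   # fitted weights (l1, l2, l3)
```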

Evaluation of the model on the test set.

  • Perplexity is a measure of how well a probability model predicts a sample; a lower perplexity means better predictions. On the test set, the model's perplexity (excluding start-of-sentence and end-of-sentence tokens) is 17.3.
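
For reference, perplexity is the exponential of the average negative log-probability the model assigns to the test tokens; a minimal sketch (test_p is an assumed vector holding the interpolated probability of each test token):

```r
# Perplexity over the N test-set tokens (start/end-of-sentence markers excluded):
#   PP = exp( -(1/N) * sum_i log P(w_i | w_{i-2}, w_{i-1}) )
perplexity <- exp(-mean(log(test_p)))
```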

Final remarks.

  • The model was conceived so that most of the calculations are done in advance in R, producing a database of 62.9 MB.
  • The Shiny application practically only has to search the database for the words entered by the user and sort the results to find the most likely next words (see the sketch after this list).
  • The idea was to keep the calculations in Shiny simple while loading a database of reasonable size.
  • n-grams with fewer than 7 occurrences were pruned, which can produce poor results when the entered words are rare.
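
A minimal sketch of the lookup performed by the Shiny application, reusing the tables from the earlier sketches (predict_next, the table and column names, and the simple back-off to bigrams are illustrative assumptions; the actual application also ranks candidates with the interpolation weights):

```r
library(dplyr)

# Take the last two words typed by the user, search the precomputed trigram
# table (p3: columns w1, w2, w3, p), back off to the bigram table (p2: columns
# w1, w2, p) when nothing matches, and return the three most probable words.
predict_next <- function(text, p3, p2) {
  words <- tolower(strsplit(trimws(text), "\\s+")[[1]])
  last1 <- words[length(words) - 1]
  last2 <- words[length(words)]

  hits <- p3 %>% filter(w1 == last1, w2 == last2) %>% arrange(desc(p))
  if (nrow(hits) > 0) return(head(hits$w3, 3))

  # Back off to bigrams conditioned on the last word only.
  hits <- p2 %>% filter(w1 == last2) %>% arrange(desc(p))
  head(hits$w2, 3)
}

# Example call: predict_next("I would like to", p3, p2)
```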