Forecasting the Next Word Using Linear Interpolation.
Luis Alberto Alaniz Castillo
October 5, 2018
Instructions for the predictive model.
- Enter a sentence with at least two words; the model predicts the next word, showing the three most likely choices.
The Algorithm.
- Takes the last two words of a sentence and calculates the probability of the next word conditional on the previous words.
- Chooses the three words with the highest conditional probabilities and shows them to the user.
- The steps of the algorithm are:
- Using the “tm” package, convert the corpora provided by SwiftKey into a Volatile Corpus.
- Use the first 90% of the corpus as a training set, the next 5% as a validation set, and the final 5% as a test set.
- Load the “tm” Volatile Corpus as a “Quanteda” corpus.
- Tokenize, removing numbers, punctuation, Twitter symbols, URLs, and other symbols (see the sketch below).
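A minimal sketch of these preprocessing steps in R, assuming one of the SwiftKey files ("en_US.blogs.txt") as the input; the file name and the line-based split are illustrative assumptions, not the original code.

```r
library(tm)
library(quanteda)

# Read one SwiftKey file (illustrative file name)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# 90% training / 5% validation / 5% test split by line position
n     <- length(lines)
train <- lines[1:floor(0.90 * n)]
valid <- lines[(floor(0.90 * n) + 1):floor(0.95 * n)]
test  <- lines[(floor(0.95 * n) + 1):n]

# tm Volatile Corpus, then loaded as a quanteda corpus
vcorp      <- VCorpus(VectorSource(train))
train_corp <- corpus(vcorp)

# Tokenize, dropping numbers, punctuation, symbols, and URLs
train_toks <- tokens(train_corp,
                     remove_numbers = TRUE,
                     remove_punct   = TRUE,
                     remove_symbols = TRUE,
                     remove_url     = TRUE)
```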
The Algorithm (continued).
- The steps of the algorithm are:
- Create unigram, bigram and trigram tokens from the Corpus.
- Create a Document-Feature Matrix (DFM) for each set of tokens in the previous bullet.
- Get text statistics (frequencies) from each DFM and save them as “dplyr” tbl_df objects.
- Get maximum likelihood conditional probabilities for unigrams, bigrams and trigrams.
- Get the weights for linear interpolation by maximizing the likelihood over the validation set, using training-set n-gram probabilities that depend on the weights (a sketch follows this list).
- Calculate the perplexity of the model in the test set.
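A minimal sketch of the n-gram counting and interpolation steps, reusing the `train_toks` tokens from the previous sketch; the variable names and the `interp_prob` helper are illustrative assumptions, not the original code.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3.0
library(dplyr)

# Unigram, bigram, and trigram tokens and their document-feature matrices
toks1 <- train_toks
toks2 <- tokens_ngrams(train_toks, n = 2, concatenator = " ")
toks3 <- tokens_ngrams(train_toks, n = 3, concatenator = " ")
dfm1  <- dfm(toks1)
dfm2  <- dfm(toks2)
dfm3  <- dfm(toks3)

# Frequencies saved as dplyr tables
freq1 <- as_tibble(textstat_frequency(dfm1))
freq2 <- as_tibble(textstat_frequency(dfm2))
freq3 <- as_tibble(textstat_frequency(dfm3))

# Maximum-likelihood conditional probability of a trigram:
#   P_ML(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
# Interpolated probability used by the model:
#   P(w3 | w1 w2) = l3 * P_ML(w3 | w1 w2) + l2 * P_ML(w3 | w2) + l1 * P_ML(w3)
# with l1 + l2 + l3 = 1 chosen to maximize the likelihood of the validation set.
interp_prob <- function(p_uni, p_bi, p_tri, lambda) {
  lambda[1] * p_uni + lambda[2] * p_bi + lambda[3] * p_tri
}
```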
Evaluation of the model on the test set.
- The resulting perplexity of the model on the test set, excluding start-of-sentence and end-of-sentence tokens, is 17.3. Perplexity measures how well a probability model predicts a sample; a lower perplexity means the model makes better predictions (a sketch of the calculation follows).
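A minimal sketch of the perplexity calculation, assuming a vector `log_probs` holding the interpolated log-probabilities of the test-set words (the name is illustrative, not the original code).

```r
# Perplexity over N words:
#   PP = exp( -(1/N) * sum(log P(w_i | w_{i-2} w_{i-1})) )
perplexity <- function(log_probs) {
  exp(-mean(log_probs))
}
```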
Final remarks.
- The model was conceived so that most of the calculations are done in R beforehand, producing a database of 62.9 MB.
- In practice, the Shiny application only has to search that database for the words entered by the user, sort the results, and return the most likely words (sketched below).
- The idea was to keep the calculations done in Shiny simple while loading a database of reasonable size.
- n-grams with fewer than 7 occurrences were pruned, which can produce poor results when the entered words are rare.
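A minimal sketch of the lookup the Shiny application performs, assuming a precomputed table `ngram_db` with columns `w1`, `w2`, `prediction`, and `prob`; these names are illustrative, not the actual schema of the 62.9 MB database.

```r
library(dplyr)

# Return the three most likely next words for a sentence
predict_next <- function(sentence, ngram_db) {
  words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
  last2 <- tail(words, 2)
  ngram_db %>%
    filter(w1 == last2[1], w2 == last2[2]) %>%
    arrange(desc(prob)) %>%
    head(3) %>%
    pull(prediction)
}
```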