Enock Lukhetfo Dube
22 APRIL 2015
Given an incomplete sentence, this Shiny application uses an N-gram language model of United States (US) English to predict the next word in the sentence.
The language model consists of 1-gram, 2-gram, and 3-gram English tokens built from sample text taken from news, blog, and Twitter sources.
The algorithm is based on the Markov assumption (due to Andrey Markov) and, where necessary, applies a simple Katz back-off algorithm.
Generally, the Markov assumption states that a future event (the next word) can be predicted from a relatively short history (context), such as the 1 or 2 previous words.
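For the 3-gram model used here, this assumption amounts to the standard approximation below, scored with the maximum likelihood estimate; the notation is added for illustration and is not taken from the source:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2}\, w_{i-1}\, w_i)}{\mathrm{count}(w_{i-2}\, w_{i-1})}$$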
Given an incomplete sentence, the algorithm proceeds as follows:

1. Extract the last 2 words of the incomplete sentence, hereafter referred to as the context.
2. Find the 3-gram token matching the context that has the highest maximum likelihood probability (MLEprob), and return the last word of that token as the predicted next word.
3. If no 3-gram match is found, back off and use only the last word of the context to find the matching 2-gram token with the highest MLEprob, returning its last word as the predicted next word.
4. If no 2-gram match is found either, the algorithm returns the keyword UNKNOWN (see the R sketch after this list).
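Below is a minimal R sketch of these steps, assuming the n-gram matrices have been loaded as data frames with columns token, count, and MLEprob. The function name predict_next_word and the sample data are illustrative assumptions, not the app's actual code.

```r
# Sketch of the back-off prediction described above (illustrative, not the app's code)
predict_next_word <- function(sentence, trigrams, bigrams) {
  # Tokenize and take the last two words as the context
  words   <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
  context <- tail(words, 2)

  # Look for 3-grams whose first two words match the context
  prefix  <- paste(context, collapse = " ")
  matches <- trigrams[startsWith(trigrams$token, paste0(prefix, " ")), ]

  # Back off: look for 2-grams whose first word matches the last context word
  if (nrow(matches) == 0) {
    matches <- bigrams[startsWith(bigrams$token, paste0(tail(context, 1), " ")), ]
  }

  # No match at either level: return the keyword UNKNOWN
  if (nrow(matches) == 0) return("UNKNOWN")

  # Return the last word of the matching token with the highest MLE probability
  best <- matches$token[which.max(matches$MLEprob)]
  tail(unlist(strsplit(best, " ")), 1)
}

# Example usage with the 2-gram sample shown later in this document
bigrams <- data.frame(
  token   = c("science adviser", "science and", "science association"),
  count   = c(1, 25, 3),
  MLEprob = c(0.004219409, 0.105485232, 0.012658228),
  stringsAsFactors = FALSE
)
trigrams <- data.frame(token = character(), count = numeric(), MLEprob = numeric())
predict_next_word("I love science", trigrams, bigrams)  # returns "and"
```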
You can run the application at http://eldube.shinyapps.io/JH_Capstone_PredictNextWordApp.
After cleaning the input text, N-grams were extracted and stored in matrix .csv files that record each token's count and its maximum likelihood probability.
The table below shows a sample of five 2-grams stored in the bigramMatrix.csv file.
   token               count word_i_1 MLEprob
1: science adviser         1 science  0.004219409
2: science and            25 science  0.105485232
3: science are             1 science  0.004219409
4: science articles       1 science  0.004219409
5: science association     3 science  0.012658228
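Each MLEprob value is evidently the token count divided by the count of the first word (word_i_1). The short check below infers the unigram count of "science" from the table itself (1 / 0.004219409 is roughly 237); that count is not stated explicitly in the source.

```r
# MLE probability of a 2-gram: count(token) / count(word_i_1)
count_science     <- 237  # inferred from 1 / 0.004219409 ~= 237 (assumption, not in the source)
count_science_and <- 25   # from the table row for "science and"
count_science_and / count_science  # 0.1054852, matching the MLEprob column
```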