Word Prediction with an N-Gram Language Model

Steven Becker
8/10/2015

Language models assign probabilities to phrases or sequences of words, and n-gram models are the most widely used kind. This project used a four-gram model.
The Corpus, or collection of texts, is the body of data used to train and test the model. The Corpus used was a combination of blog, news and Twitter texts.
The Corpus was tokenized and the resulting n-gram counts were used to compute maximum likelihood estimates, as a starting point for modelling the probability distribution of each n-gram combination.
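As an illustration of the step above, here is a minimal Python sketch of tokenizing text, counting n-grams and computing a maximum likelihood estimate. The function names, the regular expression used for cleaning and the toy sentence are assumptions made for this example; they are not taken from the project code.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic, assumed cleaning step)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(word, context, counts_n, counts_context):
    """Maximum likelihood estimate: count(context + word) / count(context)."""
    context = tuple(context)
    denominator = counts_context[context]
    if denominator == 0:
        return 0.0  # unseen context: the MLE is undefined, so report zero here
    return counts_n[context + (word,)] / denominator

# Toy example (made-up sentence, not project data)
tokens = tokenize("the cat sat on the mat and the cat slept")
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
p = mle_probability("cat", ["the"], bigrams, unigrams)  # "the cat" occurs 2 times, "the" 3 times
```

The same functions work for the four-gram case by passing `ngram_counts(tokens, 4)` and `ngram_counts(tokens, 3)` with a three-token context.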
The main problem with higher-order n-grams is sparsity. To remedy this, a backoff solution was implemented, allowing the model to move down to lower-order n-grams whose probability estimates carry higher confidence.
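The backoff idea can be sketched in Python as follows. This is a deliberately simple version that ranks candidates by raw counts and applies no discounting, so it is not the smoothing method used in the project; `build_tables` and `predict_with_backoff` are hypothetical names introduced for this example.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_order=4):
    """Map each context tuple to a Counter of the words that follow it, for orders 1..max_order."""
    tables = defaultdict(Counter)
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])  # empty tuple for unigrams
            tables[context][tokens[i + n - 1]] += 1
    return tables

def predict_with_backoff(phrase_tokens, tables, max_order=4, k=3):
    """Return up to k candidate words from the longest context that was seen in training."""
    longest = min(max_order - 1, len(phrase_tokens))
    for context_len in range(longest, -1, -1):
        # Take the last context_len tokens of the phrase as the context
        context = tuple(phrase_tokens[len(phrase_tokens) - context_len:])
        followers = tables.get(context)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    return []

# Toy example (made-up text, not project data)
tokens = "i want to eat i want to sleep i want to eat".split()
tables = build_tables(tokens)
seen = predict_with_backoff(["i", "want", "to"], tables)       # full three-word context was seen
unseen = predict_with_backoff(["you", "need", "to"], tables)   # backs off to the context ("to",)
```

In the second call neither the three-word nor the two-word context occurs in the training text, so the sketch falls back to the single-word context.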

A further remedy applied to the sparsity issue is the interpolation of probabilities between higher and lower order n-grams.
Dealing with sparsity is known as smoothing, for which various methods are available. The one selected for this project was interpolated Kneser-Ney smoothing [Chen and Goodman], which includes the interpolation and backoff mentioned previously.
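For reference, interpolated Kneser-Ney is commonly written in the recursive form below. This is the standard textbook formulation; the exact variant and discount value used in the project are not stated here, so treat the notation as an assumption. Here c(.) is a count, D is the discount with 0 <= D <= 1, and N_{1+}(context, .) is the number of distinct words observed after the context.

```latex
\[
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{\max\bigl(c(w_{i-n+1}^{i}) - D,\; 0\bigr)}{\sum_{w'} c(w_{i-n+1}^{i-1} w')}
  + \lambda(w_{i-n+1}^{i-1}) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})
\]
\[
\lambda(w_{i-n+1}^{i-1}) =
  \frac{D}{\sum_{w'} c(w_{i-n+1}^{i-1} w')} \; N_{1+}(w_{i-n+1}^{i-1}\, \bullet)
\]
```

The first term is the discounted higher-order estimate, and the weight lambda redistributes the discounted mass to the lower-order distribution. In the usual formulation, the lower orders use continuation counts (how many distinct contexts a word completes) in place of raw counts.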
The consequence of this modelling is the ability to calculate the probability of a given phrase. Implicit in this is the concept of the Markov model. The exact conditional probability of a phrase requires each token to be conditioned on all the tokens that precede it. This is intractable; a more tractable approach is to condition on only the most recent observations. In this project the four-gram model means that at most the last three tokens are used as the condition.
Thus the four-gram model predicts the next word using interpolated Kneser-Ney smoothing; a Markov-based model that includes backoff and interpolation.
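The Markov approximation described above can be written as follows, where w_1, ..., w_m are the tokens of a phrase. The left-hand product is the exact chain rule; the right-hand product is the four-gram approximation, with shorter contexts used for the first three tokens.

```latex
\[
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
\]
```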

The model was evaluated using an intrinsic metric, namely perplexity, to assess how well it performed and to set the discounting parameter for the model.
The evaluation was performed on a development set, an out-of-sample portion of the Corpus used to tune the parameters and assess performance before applying the model to a test set.
Applying the model to the test set produced two basic metrics. The first was how often the first (best) predicted word was correct. The second was how often the correct word was among the top three predictions.
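Perplexity and its use for choosing the discount can be sketched in Python as below. This is a minimal illustration, not the project's code: `pick_discount` and its `dev_probabilities_for` argument are hypothetical, standing in for whatever routine scores the development set under a given discount.

```python
import math

def perplexity(probabilities):
    """Perplexity from the model probability assigned to each token of a held-out text.

    Computed as exp(-(1/N) * sum(log p_i)). Every probability must be greater
    than zero, which a smoothed model is expected to guarantee.
    """
    if not probabilities:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

def pick_discount(candidates, dev_probabilities_for):
    """Return the candidate discount giving the lowest development-set perplexity.

    dev_probabilities_for is an assumed callable: it takes a discount value and
    returns the per-token probabilities the model assigns to the development set.
    """
    return min(candidates, key=lambda d: perplexity(dev_probabilities_for(d)))

# A model that spreads probability evenly over four words has perplexity of about 4
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model is less surprised by the held-out text, which is why the discount with the minimum value is selected.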
The percentage accuracy for the first or best word was: 11%
The percentage accuracy for the top three words was: 14%
The results were not impressive, but considering the depth of the subject matter I believe this is a relatively good result, especially given the short time and steep learning curve of this project.
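The two accuracy figures above are both instances of top-k accuracy, sketched below in Python. The function name and the sample predictions are made up for illustration and do not reproduce the project's test data or results.

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of cases where the true next word is among the first k ranked predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(predictions, targets) if target in ranked[:k])
    return hits / len(targets)

# Made-up ranked predictions (best first) and the words that actually followed
predictions = [
    ["the", "a", "to"],
    ["of", "in", "on"],
    ["is", "was", "be"],
    ["you", "me", "it"],
]
targets = ["the", "on", "are", "me"]

top1 = top_k_accuracy(predictions, targets, k=1)  # only the first case is a hit
top3 = top_k_accuracy(predictions, targets, k=3)  # the first, second and fourth cases are hits
```

Top-three accuracy can never be lower than top-one accuracy, since every top-one hit is also a top-three hit.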

Open the application data product and note the text box labelled 'Input Phrase', with the 'SUBMIT' button below it.

In the text box, enter the phrase for which you wish to predict the next word.
Press the submit button under the text box.
The application will then run the model and display a graph of the top words considered for this phrase, together with their associated conditional probabilities.
The cleaned version of the phrase used by the model will appear next to the input text box, under the 'Cleaned Phrase' label. On the right side you will see the top three words selected for the prediction.
Finally, good on ya Roger Peng, Brian Caffo and Jeff Leek for making such a rich and broad subject matter so tractable and accommodating.