Li Jiming
22/01/2016
In this report, I discuss my creation of a predictive text app that was trained on a large corpus of text from news-related sources. I describe how the data was used to create n-gram models, which were subsequently used to build my prediction model using the Kneser-Ney smoothing algorithm. The prediction model is demonstrated in my Shiny app: https://lijiming.shinyapps.io/ShinyApp/
The purpose of this project was to create a predictive text application that predicts the next word that follows a sentence input.
This report is broken down into the two major steps:
Kneser-Ney smoothing is an algorithm that adjusts n-gram weights (through discounting) by using the continuation counts of lower-order n-grams.
Consider a sentence that should end in “glasses”. A model based on raw word frequency may instead present “Francisco” as the suggested ending, because it appears more often than “glasses” in some text. However, “Francisco” rarely occurs outside of the context of “San Francisco”. Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how often a word completes a bigram type (e.g., “prescription glasses”, “reading glasses”, “small glasses” vs. “San Francisco”).
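The difference between raw frequency and continuation count can be illustrated with a minimal Python sketch. The toy phrase list and the helper name `continuation_count` are invented for illustration; they are not taken from the corpus or code used in this project.

```python
from collections import Counter

# Toy data (an assumption for illustration, not the report's news corpus):
# "san francisco" is frequent, while "glasses" follows several different words.
phrases = ["san francisco"] * 4 + [
    "prescription glasses",
    "reading glasses",
    "small glasses",
]

unigram_counts = Counter()
bigram_types = set()
for phrase in phrases:
    tokens = phrase.split()
    unigram_counts.update(tokens)
    # Store each distinct (previous word, word) pair once.
    bigram_types.update(zip(tokens, tokens[1:]))


def continuation_count(word):
    """Number of distinct words that precede `word` (bigram types it completes)."""
    return sum(1 for (_, second) in bigram_types if second == word)


for w in ("francisco", "glasses"):
    print(w, "frequency:", unigram_counts[w], "continuation count:", continuation_count(w))
```

In this toy data “francisco” is the more frequent word, but it completes only one bigram type, whereas “glasses” completes three, so a continuation-based estimate favours “glasses” as a novel continuation.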
I believe that typically, the smoothing algorithm is performed on all of the n-grams (unigram models, bigram models, etc.) prior to attempting any predictions.
I was able to write a completely recursive Kneser-Ney algorithm for n-gram models of any n. However, in practice, I limited the number of candidate words, and thus the resulting term is often very inaccurate.
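For reference, the recursion that such an algorithm follows can be written in the standard textbook form of interpolated Kneser-Ney. This is the general formulation and not necessarily the exact variant implemented in the app. The discount d is a free parameter (0.75 is a common choice), and c_KN denotes the raw count at the highest order and the continuation count at lower orders.

```latex
\begin{align*}
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) &=
  \frac{\max\bigl(c_{KN}(w_{i-n+1}^{i}) - d,\; 0\bigr)}
       {\sum_{v} c_{KN}(w_{i-n+1}^{i-1}\, v)}
  + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1}) \\[4pt]
\lambda(w_{i-n+1}^{i-1}) &=
  \frac{d}{\sum_{v} c_{KN}(w_{i-n+1}^{i-1}\, v)}\,
  \bigl|\{\, v : c_{KN}(w_{i-n+1}^{i-1}\, v) > 0 \,\}\bigr| \\[4pt]
P_{KN}(w_i) &=
  \frac{\bigl|\{\, v : c(v\, w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (v, w') : c(v\, w') > 0 \,\}\bigr|}
\end{align*}
```

The first line discounts the observed count and passes the freed probability mass, weighted by λ, to the next lower order. The last line is the base case of the recursion: the unigram continuation probability.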
To run this in real time, I first select the candidates (the words that could come next in a sentence) to be used for smoothing.
The candidates for the next word are chosen as the top-ranking words that follow w_i in the bigram model, where w_i, the first word of the bigram, is the final word of the input sentence.
The Kneser-Ney probabilities are therefore computed only for these candidates (the possible bigram continuations of the final word in the sentence).
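The candidate-then-score approach described above can be sketched in Python as follows. This is a simplified bigram-only illustration, not the app's actual code: the function names (`build_bigram_counts`, `top_candidates`, `kn_bigram_prob`, `predict`), the toy sentences, the cutoff `k`, and the discount `d = 0.75` are all assumptions made for the example.

```python
from collections import Counter

# Toy training sentences (hypothetical; not the report's news corpus).
sentences = [
    "i live in san francisco",
    "she moved to san francisco",
    "he left san diego",
]


def build_bigram_counts(sents):
    """Count every (previous word, word) pair in the training sentences."""
    counts = Counter()
    for s in sents:
        tokens = s.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def top_candidates(bigram_counts, last_word, k=3):
    """The k most frequent words seen after `last_word`."""
    followers = Counter(
        {w2: c for (w1, w2), c in bigram_counts.items() if w1 == last_word}
    )
    return [w for w, _ in followers.most_common(k)]


def kn_bigram_prob(bigram_counts, last_word, word, d=0.75):
    """Interpolated Kneser-Ney estimate of P(word | last_word), bigram order only."""
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == last_word)
    if context_total == 0:
        return 0.0  # unseen context: this sketch makes no prediction
    follower_types = sum(1 for (w1, _) in bigram_counts if w1 == last_word)
    preceder_types = sum(1 for (_, w2) in bigram_counts if w2 == word)
    p_continuation = preceder_types / len(bigram_counts)
    discounted = max(bigram_counts[(last_word, word)] - d, 0) / context_total
    backoff_weight = d * follower_types / context_total
    return discounted + backoff_weight * p_continuation


def predict(bigram_counts, sentence, k=3):
    """Score only the top-k bigram candidates and return them best first."""
    last_word = sentence.lower().split()[-1]
    candidates = top_candidates(bigram_counts, last_word, k)
    scored = [(w, kn_bigram_prob(bigram_counts, last_word, w)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


counts = build_bigram_counts(sentences)
print(predict(counts, "we flew to san"))
```

With these toy sentences, “francisco” is ranked ahead of “diego” after “san”. Restricting the scoring to a few candidates keeps the lookup fast, at the cost of never proposing a word outside the candidate list.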