Word Prediction in Shiny

Fernando Hernandez

Overview - The Algorithm

  • N-gram modeling based on Markov assumptions. [1]

  • Uses Maximum Likelihood Estimates at its core.

  • e.g. 5-gram: \(P\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right)\); 2-gram: \(P\left( w_{2} | w_{1}\right)\)

  • Counts become very sparse at the 3-gram level and higher.

  • Uses linear interpolation across n-gram orders to smooth out predictions.

  • e.g. 5-gram: \(\hat{P}\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right) = \lambda_{1}P\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right) + \lambda_{2}P\left( w_{5} | w_{2}w_{3}w_{4}\right) + \lambda_{3}P\left( w_{5} | w_{3}w_{4}\right) + \lambda_{4}P\left( w_{5} | w_{4}\right)\)

  • This interpolation is applied as 2-5 words are entered; once 5+ words have been entered, 5-gram modeling is applied to the most recent 5 words entered (a sketch of this step follows this list).
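
A minimal sketch of the interpolation step in R is below. The helper ngram_prob(), the lambda values, and the toy lookup are illustrative assumptions, not the app's actual code; ngram_prob() stands in for a maximum-likelihood lookup \(P\left( w | \text{context}\right)\) against pre-built n-gram tables.

```r
# Interpolated 5-gram estimate: weighted sum of MLE probabilities from the
# 5-, 4-, 3-, and 2-gram tables (ngram_prob is a hypothetical lookup helper).
interpolate_5gram <- function(context, candidate, lambdas, ngram_prob) {
  stopifnot(length(context) == 4, length(lambdas) == 4)
  p5 <- ngram_prob(context,      candidate)  # P(w5 | w1 w2 w3 w4)
  p4 <- ngram_prob(context[2:4], candidate)  # P(w5 | w2 w3 w4)
  p3 <- ngram_prob(context[3:4], candidate)  # P(w5 | w3 w4)
  p2 <- ngram_prob(context[4],   candidate)  # P(w5 | w4)
  lambdas[1] * p5 + lambdas[2] * p4 + lambdas[3] * p3 + lambdas[4] * p2
}

# Toy usage with a uniform stand-in lookup and made-up lambda weights:
toy_prob <- function(context, candidate) 1 / 1000
interpolate_5gram(c("to", "be", "or", "not"), "to",
                  lambdas = c(0.4, 0.3, 0.2, 0.1), ngram_prob = toy_prob)
```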

Linear Interpolation and lambda values

  • \(\lambda\) values were based on the average of perplexity calculations from two separate studies with corpora from the English NANT and Wall Street Journal. [2]

  • "In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample." [3]

  • \(\lambda\) values are set approximately equal to the inverse of each n-gram model's share of the total perplexity (lower perplexity is better), e.g. \(2gram_{score} = \left(\dfrac{2gram}{2gram+3gram+4gram+5gram}\right)^{-1}\)

  • Scores are then normalized to sum to 1 so they can be used as \(\lambda\) weights, e.g. \(\lambda_{(2gram)} = \left(\dfrac{2gram_{score}}{2gram_{score}+3gram_{score}+4gram_{score}+5gram_{score}}\right)\)

  • This gives proportionally higher weights (\(\lambda\) values) to the n-gram models with lower perplexity (a worked sketch follows this list).
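
A small sketch of the scoring and normalization in R, using placeholder perplexity values (not the figures from [2]):

```r
# Placeholder perplexities for the 2- through 5-gram models (illustrative only).
perplexity <- c(`2gram` = 170, `3gram` = 120, `4gram` = 105, `5gram` = 100)

# Score = inverse of each model's share of the total perplexity,
# so lower perplexity yields a larger score.
scores <- (perplexity / sum(perplexity))^-1

# Normalize the scores so the lambda weights sum to 1.
lambdas <- scores / sum(scores)

round(lambdas, 3)  # the lower-perplexity (higher-order) models get the larger weights
```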

Unknowns - When linear interpolation fails

  • If linear interpolation fails, try "pseudo"-inserting unknowns and predicting at longer range.
  • First try reaching back with only 1 unknown word (\(w_{unk}\)), as this retains the most information to work with.
  • \(P\left( w_{5} | w_{1}w_{2}w_{3}w_{unk}\right)\) + \(P\left( w_{5} | w_{1}w_{2}w_{unk}w_{4}\right)\) + \(P\left( w_{5} | w_{1}w_{unk}w_{3}w_{4}\right)\) + \(P\left( w_{5} | w_{unk}w_{2}w_{3}w_{4}\right)\)
  • If this fails, try reaching back with 2 unknown words.
  • \(P\left( w_{5} | w_{1}w_{2}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{1}w_{unk}w_{3}w_{unk}\right)\) + \(...\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{3}w_{4}\right)\)
  • Then 3 unknown words: \(P\left( w_{5} | w_{1}w_{unk}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{2}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{3}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{unk}w_{4}\right)\)
  • This is done at the 3-gram, 4-gram, or 5-gram level depending on how many words have been entered (a sketch of this unknown-slot search, at the 5-gram level, follows this list).
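
Below is a sketch of the unknown-slot search at the 5-gram level. The helper lookup_prediction() and the <unk> marker are hypothetical stand-ins for the app's n-gram table lookup, and for simplicity the sketch returns the first masked context that yields a prediction rather than combining the matching distributions.

```r
# Replace k of the 4 context positions with an unknown marker (k = 1, then 2,
# then 3) and query each masked context in turn.
predict_with_unknowns <- function(context, lookup_prediction, unk = "<unk>") {
  stopifnot(length(context) == 4)
  for (k in 1:3) {
    slot_sets <- combn(seq_along(context), k, simplify = FALSE)
    for (slots in slot_sets) {
      masked <- context
      masked[slots] <- unk               # e.g. w1 w2 w3 <unk>
      hit <- lookup_prediction(masked)
      if (!is.null(hit)) return(hit)     # first successful pattern wins
    }
  }
  NULL                                   # nothing found: fall through to the last-ditch step
}
```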

Last Ditch Effort for Prediction

  • If there is still nothing, we can move words to different indices.
  • e.g. for 2 unknown words: \(P\left( w_{5} | w_{unk}w_{1}w_{2}w_{unk}\right)\) + \(...\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{1}w_{2}\right)\)
  • This is a last-ditch effort to pull some context out of the n-grams and predict a related word.
  • No \(\lambda\) weights are used for these \(w_{unk}\) models, since their distributions are checked sequentially for any prediction as more information is lost and more \(w_{unk}\) tokens are added.
  • If there is still nothing predicted, "the" is returned with "Unknown" confidence (see the sketch after this list).
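
A sketch of this last-ditch step at the 5-gram level; lookup_prediction() is the same hypothetical helper as above, and the exact placements tried are an assumption based on the examples shown.

```r
# Slide the two known words (kept in order) to later positions among <unk>
# slots, then fall back to "the" with "Unknown" confidence.
last_ditch_predict <- function(known_pair, lookup_prediction, unk = "<unk>") {
  stopifnot(length(known_pair) == 2)
  for (start in 2:3) {                        # positions 1-2 were already tried above
    masked <- rep(unk, 4)
    masked[start:(start + 1)] <- known_pair   # <unk> w1 w2 <unk>, then <unk> <unk> w1 w2
    hit <- lookup_prediction(masked)
    if (!is.null(hit)) return(hit)
  }
  list(word = "the", confidence = "Unknown")  # final fallback
}
```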

Partial Word Prediction and Use