Word Prediction in Shiny

Fernando Hernandez

Overview - The Algorithm

  • N-gram modeling based on Markov assumptions. [1]

  • Uses Maximum Likelihood Estimates at its core.

  • e.g. 5-gram: \(P\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right)\); 2-gram: \(P\left( w_{2} | w_{1}\right)\)

  • Counts become very sparse at the 3-gram level and higher.

  • Uses linear interpolation across n-gram orders to smooth out predictions.

  • e.g. 5-gram: \(\hat{P}\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right) = \lambda_{1}P\left( w_{5} | w_{1}w_{2}w_{3}w_{4}\right) + \lambda_{2}P\left( w_{5} | w_{2}w_{3}w_{4}\right) + \lambda_{3}P\left( w_{5} | w_{3}w_{4}\right) + \lambda_{4}P\left( w_{5} | w_{4}\right)\)

  • This interpolation is applied as 2-5 words are entered; once 5+ words have been entered, 5-gram modeling is applied to the most recent 5 words entered (a sketch of this step follows this list).
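
A minimal sketch of the interpolation step in R is below. The helper ngram_prob(), the lambda values, and the toy lookup are illustrative assumptions, not the app's actual code; ngram_prob() stands in for a maximum-likelihood lookup \(P\left( w | \text{context}\right)\) against pre-built n-gram tables.

```r
# Interpolated 5-gram estimate: weighted sum of MLE probabilities from the
# 5-, 4-, 3-, and 2-gram tables (ngram_prob is a hypothetical lookup helper).
interpolate_5gram <- function(context, candidate, lambdas, ngram_prob) {
  stopifnot(length(context) == 4, length(lambdas) == 4)
  p5 <- ngram_prob(context,      candidate)  # P(w5 | w1 w2 w3 w4)
  p4 <- ngram_prob(context[2:4], candidate)  # P(w5 | w2 w3 w4)
  p3 <- ngram_prob(context[3:4], candidate)  # P(w5 | w3 w4)
  p2 <- ngram_prob(context[4],   candidate)  # P(w5 | w4)
  lambdas[1] * p5 + lambdas[2] * p4 + lambdas[3] * p3 + lambdas[4] * p2
}

# Toy usage with a uniform stand-in lookup and made-up lambda weights:
toy_prob <- function(context, candidate) 1 / 1000
interpolate_5gram(c("to", "be", "or", "not"), "to",
                  lambdas = c(0.4, 0.3, 0.2, 0.1), ngram_prob = toy_prob)
```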

Linear Interpolation and lambda values

  • \(\lambda\) values were based on the average of perplexity calculations from two separate studies with corpora from the English NANT and Wall Street Journal. [2]

  • "In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample." [3]

  • \(\lambda\) values are set approximately equal to the inverse of each n-gram model's share of the total perplexity (lower perplexity is better), e.g. \(2gram_{score} = \left(\dfrac{2gram}{2gram+3gram+4gram+5gram}\right)^{-1}\)

  • Scores are then normalized to sum to 1 so they can be used as \(\lambda\) weights, e.g. \(\lambda_{(2gram)} = \left(\dfrac{2gram_{score}}{2gram_{score}+3gram_{score}+4gram_{score}+5gram_{score}}\right)\)

  • This gives proportionally higher weights (\(\lambda\) values) to the n-gram models with lower perplexity (a worked sketch follows this list).
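
A small sketch of the scoring and normalization in R, using placeholder perplexity values (not the figures from [2]):

```r
# Placeholder perplexities for the 2- through 5-gram models (illustrative only).
perplexity <- c(`2gram` = 170, `3gram` = 120, `4gram` = 105, `5gram` = 100)

# Score = inverse of each model's share of the total perplexity,
# so lower perplexity yields a larger score.
scores <- (perplexity / sum(perplexity))^-1

# Normalize the scores so the lambda weights sum to 1.
lambdas <- scores / sum(scores)

round(lambdas, 3)  # the lower-perplexity (higher-order) models get the larger weights
```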

Unknowns - When linear interpolation fails

  • If linear interpolation fails, try "pseudo"-inserting unknowns and predicting at longer range.
  • First try reaching back with only 1 unknown word (\(w_{unk}\)), as this retains the most information to work with.
  • \(P\left( w_{5} | w_{1}w_{2}w_{3}w_{unk}\right)\) + \(P\left( w_{5} | w_{1}w_{2}w_{unk}w_{4}\right)\) + \(P\left( w_{5} | w_{1}w_{unk}w_{3}w_{4}\right)\) + \(P\left( w_{5} | w_{unk}w_{2}w_{3}w_{4}\right)\)
  • If this fails, try reaching back with 2 unknown words.
  • \(P\left( w_{5} | w_{1}w_{2}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{1}w_{unk}w_{3}w_{unk}\right)\) + \(...\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{3}w_{4}\right)\)
  • Then 3 unknown words: \(P\left( w_{5} | w_{1}w_{unk}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{2}w_{unk}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{3}w_{unk}\right)\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{unk}w_{4}\right)\)
  • This is done at the 3-gram, 4-gram, or 5-gram level depending on how many words have been entered (a sketch of this unknown-slot search, at the 5-gram level, follows this list).
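
Below is a sketch of the unknown-slot search at the 5-gram level. The helper lookup_prediction() and the <unk> marker are hypothetical stand-ins for the app's n-gram table lookup, and for simplicity the sketch returns the first masked context that yields a prediction rather than combining the matching distributions.

```r
# Replace k of the 4 context positions with an unknown marker (k = 1, then 2,
# then 3) and query each masked context in turn.
predict_with_unknowns <- function(context, lookup_prediction, unk = "<unk>") {
  stopifnot(length(context) == 4)
  for (k in 1:3) {
    slot_sets <- combn(seq_along(context), k, simplify = FALSE)
    for (slots in slot_sets) {
      masked <- context
      masked[slots] <- unk               # e.g. w1 w2 w3 <unk>
      hit <- lookup_prediction(masked)
      if (!is.null(hit)) return(hit)     # first successful pattern wins
    }
  }
  NULL                                   # nothing found: fall through to the last-ditch step
}
```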

Last Ditch Effort for Prediction

  • If there is still nothing, we can move words to different indices.
  • e.g. for 2 unknown words: \(P\left( w_{5} | w_{unk}w_{1}w_{2}w_{unk}\right)\) + \(...\) + \(P\left( w_{5} | w_{unk}w_{unk}w_{1}w_{2}\right)\)
  • This is a last-ditch effort to pull some context out of the n-grams and predict a related word.
  • No \(\lambda\) weights are used for these \(w_{unk}\) models, since their distributions are checked sequentially for any prediction as more information is lost and more \(w_{unk}\) tokens are added.
  • If there is still nothing predicted, "the" is returned with "Unknown" confidence (see the sketch after this list).
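
A sketch of this last-ditch step at the 5-gram level; lookup_prediction() is the same hypothetical helper as above, and the exact placements tried are an assumption based on the examples shown.

```r
# Slide the two known words (kept in order) to later positions among <unk>
# slots, then fall back to "the" with "Unknown" confidence.
last_ditch_predict <- function(known_pair, lookup_prediction, unk = "<unk>") {
  stopifnot(length(known_pair) == 2)
  for (start in 2:3) {                        # positions 1-2 were already tried above
    masked <- rep(unk, 4)
    masked[start:(start + 1)] <- known_pair   # <unk> w1 w2 <unk>, then <unk> <unk> w1 w2
    hit <- lookup_prediction(masked)
    if (!is.null(hit)) return(hit)
  }
  list(word = "the", confidence = "Unknown")  # final fallback
}
```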

Partial Word Prediction and Use