Next Word Prediction

Murtuza Attarwala
November 28, 2014

Model Description

  • Next word prediction is the task of suggesting the most probable word a user will type next.
  • This model uses an n-gram probability distribution to predict the next word.
  • The corpus is divided into training (70%) and test (30%) sets; a 10% random sample of the training corpus is used to build the model.
  • The training corpus is first cleaned and then tokenized into n-grams, with the largest n used by the model being 4 (a minimal tokenization sketch follows this list).
  • Kneser-Ney smoothing is used to reduce data sparsity by adjusting the probability distribution to cover unseen sequences.
  • The idea of Kneser-Ney smoothing is to optimize the calculation of lower-order n-gram probabilities in cases where the higher-order n-gram was not seen in the corpus; it is therefore a backoff smoothing algorithm.
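A minimal sketch of the cleaning and tokenization step, using base R only (the function name and cleaning rules below are illustrative assumptions, not the author's actual code):

```r
# Clean a line of text and tokenize it into n-grams of order 1 through 4.
make_ngrams <- function(line, max_n = 4) {
  clean  <- gsub("[^a-z0-9 ]", "", tolower(line))   # lower-case, drop punctuation
  tokens <- strsplit(trimws(clean), "\\s+")[[1]]
  ngrams <- list()
  for (n in 1:max_n) {
    if (length(tokens) < n) next
    starts      <- 1:(length(tokens) - n + 1)
    ngrams[[n]] <- sapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }
  ngrams
}

make_ngrams("The train corpus is first cleaned and then tokenized")
```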

Backoff Smoothing

  • The Good-Turing estimate is based on the assumption that the probability of unseen sequences can be estimated from $N_1$, the number of n-grams that occurred exactly once in the corpus.
  • Since the probability mass of unseen n-grams is now greater than 0, the probability mass of seen n-grams needs to be reduced so that the total probability sums to 1. This is achieved with

$$p_r = \frac{r^*}{N} \quad \text{where} \quad r^* = (r+1)\,\frac{N_{r+1}}{N_r}$$

  • Here $N_r$ is the number of n-grams that were seen exactly $r$ times in the corpus and $r^*$ is the adjusted frequency.
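As a quick illustration of the adjusted count $r^* = (r+1)\,N_{r+1}/N_r$, here is a short sketch with made-up frequency-of-frequency counts (the numbers are hypothetical, not from the project corpus):

```r
# N_r = number of distinct n-grams observed exactly r times (hypothetical values).
Nr <- c(`1` = 5000, `2` = 1800, `3` = 900, `4` = 500, `5` = 300)
r  <- 1:4
r_star <- (r + 1) * Nr[as.character(r + 1)] / Nr[as.character(r)]
r_star[1]   # 2 * 1800 / 5000 = 0.72, the discounted count for n-grams seen once
```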

Backoff Smoothing Contd.

  • Church et al. empirically show that the discount $(r - r^*)$ applied by the Good-Turing estimate is roughly constant for $r \ge 3$.
  • Ney et al. estimate the discount value $D$ from the total number of n-grams occurring exactly once ($n_1$) and twice ($n_2$): $D = \frac{n_1}{n_1 + 2 n_2}$
  • Based on this, $P_{\text{AbsoluteDiscount}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w)$
  • Kneser-Ney smoothing is a modified version of absolute discounting, where instead of the unigram probability $P(w)$ we use the continuation probability $P_{\text{continuation}}(w)$.
  • $P_{\text{continuation}}(w)$ answers the question "How likely is $w$ to appear as a novel continuation?", and $\lambda$ is a normalizing constant that redistributes the probability mass we have discounted.
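The slide does not spell these out, but the standard bigram Kneser-Ney definitions (as in Chen & Goodman) of the continuation probability and the normalizing constant are:

$$P_{\text{continuation}}(w) = \frac{\left|\{\, w' : c(w', w) > 0 \,\}\right|}{\left|\{\, (w', w'') : c(w', w'') > 0 \,\}\right|}, \qquad \lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\,\left|\{\, w : c(w_{i-1}, w) > 0 \,\}\right|$$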

Prediction Accuracy

  • The test corpus (30% of the corpus) is used to verify the prediction accuracy of the model.

  • The plot below shows the accuracy of the algorithm for different numbers of suggestions provided (a minimal evaluation sketch follows at the end of this slide).

(Plot: prediction accuracy vs. number of suggestions provided)

  • For a single suggestion the accuracy is about 13%.
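The accuracy measured here is top-k accuracy over held-out (prefix, next word) pairs. A minimal sketch of such an evaluation loop (an assumed helper, not the author's code) for any prediction function that returns k suggestions:

```r
# Fraction of test cases where the true next word appears among the
# k suggestions returned by predict_fn(prefix, k).
top_k_accuracy <- function(prefixes, next_words, predict_fn, k = 1) {
  hit <- mapply(function(p, w) w %in% predict_fn(p, k), prefixes, next_words)
  mean(hit)
}
```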

How to use the App

  • The app loads n-gram (n = 1-4) probability tables at startup, where each probability table maps an n-gram to its corresponding Kneser-Ney probability.
  • The algorithm accepts a string and uses its last words to look up the highest-order n-gram and predict the next word; the word with the highest KN probability is returned. When no exact match is found, the algorithm backs off to the (n-1)-gram table (a lookup sketch follows this list).
  • The user can also specify the maximum number of predictions wanted from the algorithm; these are returned sorted by their KN probability.
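A minimal sketch of this backoff lookup (the table layout, column names, and function name below are assumptions for illustration, not the app's actual code):

```r
# Each ngram_tab[[n]] is assumed to be a data.frame with columns:
#   prefix  - the first n-1 words of the n-gram
#   word    - the candidate next word
#   kn_prob - its Kneser-Ney probability
predict_next <- function(input, ngram_tab, max_pred = 3) {
  tokens <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  for (n in 4:2) {                              # try the 4-gram table first
    if (length(tokens) < n - 1) next
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits   <- ngram_tab[[n]][ngram_tab[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$kn_prob), ]      # sort by KN probability
      return(head(hits$word, max_pred))
    }
  }
  # no prefix matched: fall back to the most probable unigrams
  uni <- ngram_tab[[1]][order(-ngram_tab[[1]]$kn_prob), ]
  head(uni$word, max_pred)
}
```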