Coursera Capstone Project - Next Word Prediction

Donatella Biancone
16/07/2016

Executive Summary

  • The goal of the Capstone project was to develop an algorithm that solves the “next word prediction” problem and to implement it as a Shiny web application
  • The problem was solved using a Natural Language Processing N-gram model
  • A 4-gram Katz backoff model was used in the final implementation
  • The link to the final web app

Algorithm (1/2)

  • An N-gram language model uses the previous N-1 words to predict the next one. In our case N = 4: we created quadrigrams, trigrams, bigrams, and unigrams from the data set provided for the project, and then estimated the probability of a particular word sequence using counts from the N-grams. This method is called Maximum Likelihood Estimation (MLE); a minimal sketch follows this list
  • The problem with MLE is that it assigns zero probability to any N-gram that does not appear in the corpus. To avoid this, a small but non-zero probability is assigned to these “zero-probability” N-grams
  • Backoff, for example Katz's backoff method, is another way of dealing with unseen N-grams
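
As a concrete illustration of the MLE estimate, here is a minimal R sketch. The tiny counts are made up (not from the project corpus), and the data-frame layout with `ngram` and `count` columns is an assumption for the example, not the project's actual data structure.

```r
# MLE for P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2).
# Illustrative counts only -- not taken from the project corpus.
trigrams <- data.frame(
  ngram = c("i want to", "i want a"),
  count = c(3, 1),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  ngram = c("i want", "want to"),
  count = c(4, 3),
  stringsAsFactors = FALSE
)

mle_prob <- function(trigram) {
  words   <- strsplit(trigram, " ")[[1]]
  history <- paste(head(words, -1), collapse = " ")
  num <- trigrams$count[trigrams$ngram == trigram]
  den <- bigrams$count[bigrams$ngram == history]
  if (length(num) == 0 || length(den) == 0) return(0)  # unseen n-gram: zero under MLE
  num / den
}

mle_prob("i want to")    # 0.75
mle_prob("i want some")  # 0 -- the zero-probability problem described above
```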

Algorithm (2/2)

  • Katz's backoff model and the Coursera SwiftKey dataset were used to create the Shiny application. The estimate for an N-gram is allowed to back off through shorter histories: if the N-gram has appeared more than k times (here k is set to 0), its estimate is used; if it has not appeared, an estimate from a shorter N-gram is used instead. The recursion can continue downward, so we can start with the quadrigram model and end up estimating the next word from unigram frequencies (a simplified sketch follows this list).
  • This model was selected because such models are simple and work well in practice.
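
The backoff chain described above can be sketched in R as follows. This is a simplified backoff that skips the discounting and normalising weights real Katz backoff applies to lower-order estimates; the table layout (`history`, `word`, `count` columns) and the tiny counts are illustrative assumptions, not the project's actual data structures.

```r
# Simplified backoff: try the longest available history first, then
# fall back to shorter ones. Real Katz backoff additionally discounts
# counts and weights the lower-order estimate; omitted here for clarity.
unigrams <- data.frame(history = "", word = c("the", "to"),
                       count = c(10, 7), stringsAsFactors = FALSE)
bigrams  <- data.frame(history = c("want", "want"), word = c("to", "a"),
                       count = c(3, 1), stringsAsFactors = FALSE)
ngram_tables <- list(unigrams, bigrams)  # list index = N-gram order

predict_next <- function(history_words, ngram_tables) {
  for (order in rev(seq_along(ngram_tables))) {   # longest history first
    n_hist <- order - 1
    if (length(history_words) < n_hist) next
    key  <- paste(tail(history_words, n_hist), collapse = " ")
    tab  <- ngram_tables[[order]]
    hits <- tab[tab$history == key, ]
    if (nrow(hits) > 0) {                         # N-gram seen: use its counts
      return(hits$word[which.max(hits$count)])
    }                                             # unseen: back off one order
  }
  NA_character_
}

predict_next(c("i", "want"), ngram_tables)  # "to"  (from the bigram table)
predict_next(c("zzz"), ngram_tables)        # "the" (backs off to unigrams)
```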

Shiny Application

  • Enter a partial sentence in the text field, then click the “Next Word” button or hit the “Enter” key.
  • The predicted word will appear in the box on the right (a minimal interface sketch follows).
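
Below is a minimal sketch of what such an interface could look like in Shiny, reusing the `predict_next()` function and `ngram_tables` from the backoff sketch above. The widget names are illustrative, and the Enter-key shortcut is omitted from the sketch.

```r
library(shiny)

# Minimal interface sketch; assumes predict_next() and ngram_tables
# from the backoff sketch above are defined.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a partial sentence:"),
      actionButton("go", "Next Word")
    ),
    mainPanel(
      h4("Predicted next word:"),
      verbatimTextOutput("prediction")
    )
  )
)

server <- function(input, output) {
  output$prediction <- renderText({
    input$go                                     # re-run when the button is clicked
    words <- strsplit(isolate(input$phrase), " ")[[1]]
    predict_next(words, ngram_tables)
  })
}

shinyApp(ui = ui, server = server)
```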

ShinyApp