Coursera Capstone Project - Next Word Prediction

Donatella Biancone
16/07/2016

Executive Summary

  • The goal of the Capstone project was to develop an algorithm that solves the “next word prediction” problem and to implement it as a Shiny web application
  • The problem was solved using a Natural Language Processing N-gram model
  • A 4-gram Katz backoff model was used in the final implementation
  • The link to the final web app

Algorithm (1/2)

  • An N-gram language model uses the previous N-1 words to predict the next one. In our case N = 4: we created quadrigrams, trigrams, bigrams, and unigrams from the data set provided for the project, and then estimated the probability of a particular word sequence using counts from the N-grams. This method is called Maximum Likelihood Estimation (MLE); a minimal sketch follows this list
  • The problem with MLE is that it assigns zero probability to any N-gram that does not appear in the corpus. To avoid this, a small but non-zero probability is assigned to these “zero-probability” N-grams
  • Backoff, for example Katz's backoff method, is another way of dealing with unseen N-grams
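
As a concrete illustration of the MLE estimate, here is a minimal R sketch. The tiny counts are made up (not from the project corpus), and the data-frame layout with `ngram` and `count` columns is an assumption for the example, not the project's actual data structure.

```r
# MLE for P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2).
# Illustrative counts only -- not taken from the project corpus.
trigrams <- data.frame(
  ngram = c("i want to", "i want a"),
  count = c(3, 1),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  ngram = c("i want", "want to"),
  count = c(4, 3),
  stringsAsFactors = FALSE
)

mle_prob <- function(trigram) {
  words   <- strsplit(trigram, " ")[[1]]
  history <- paste(head(words, -1), collapse = " ")
  num <- trigrams$count[trigrams$ngram == trigram]
  den <- bigrams$count[bigrams$ngram == history]
  if (length(num) == 0 || length(den) == 0) return(0)  # unseen n-gram: zero under MLE
  num / den
}

mle_prob("i want to")    # 0.75
mle_prob("i want some")  # 0 -- the zero-probability problem described above
```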

Algorithm (2/2)

  • Katz's backoff model and the Coursera SwiftKey dataset were used to create the Shiny application. The estimate for an N-gram is allowed to back off through shorter histories: if the N-gram has appeared more than k times (here k is set to 0), its estimate is used; if it has not appeared, an estimate from a shorter N-gram is used instead. The recursion can continue downward, so we can start with the quadrigram model and end up estimating the next word from unigram frequencies (a simplified sketch follows this list).
  • This model was selected because such models are simple and work well in practice.
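
The backoff chain described above can be sketched in R as follows. This is a simplified backoff that skips the discounting and normalising weights real Katz backoff applies to lower-order estimates; the table layout (`history`, `word`, `count` columns) and the tiny counts are illustrative assumptions, not the project's actual data structures.

```r
# Simplified backoff: try the longest available history first, then
# fall back to shorter ones. Real Katz backoff additionally discounts
# counts and weights the lower-order estimate; omitted here for clarity.
unigrams <- data.frame(history = "", word = c("the", "to"),
                       count = c(10, 7), stringsAsFactors = FALSE)
bigrams  <- data.frame(history = c("want", "want"), word = c("to", "a"),
                       count = c(3, 1), stringsAsFactors = FALSE)
ngram_tables <- list(unigrams, bigrams)  # list index = N-gram order

predict_next <- function(history_words, ngram_tables) {
  for (order in rev(seq_along(ngram_tables))) {   # longest history first
    n_hist <- order - 1
    if (length(history_words) < n_hist) next
    key  <- paste(tail(history_words, n_hist), collapse = " ")
    tab  <- ngram_tables[[order]]
    hits <- tab[tab$history == key, ]
    if (nrow(hits) > 0) {                         # N-gram seen: use its counts
      return(hits$word[which.max(hits$count)])
    }                                             # unseen: back off one order
  }
  NA_character_
}

predict_next(c("i", "want"), ngram_tables)  # "to"  (from the bigram table)
predict_next(c("zzz"), ngram_tables)        # "the" (backs off to unigrams)
```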

Shiny Application

  • Enter a partial sentence in the text field, then click the “Next Word” button or hit the “Enter” key.
  • The predicted word will appear in the box on the right (a minimal interface sketch follows).
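
Below is a minimal sketch of what such an interface could look like in Shiny, reusing the `predict_next()` function and `ngram_tables` from the backoff sketch above. The widget names are illustrative, and the Enter-key shortcut is omitted from the sketch.

```r
library(shiny)

# Minimal interface sketch; assumes predict_next() and ngram_tables
# from the backoff sketch above are defined.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a partial sentence:"),
      actionButton("go", "Next Word")
    ),
    mainPanel(
      h4("Predicted next word:"),
      verbatimTextOutput("prediction")
    )
  )
)

server <- function(input, output) {
  output$prediction <- renderText({
    input$go                                     # re-run when the button is clicked
    words <- strsplit(isolate(input$phrase), " ")[[1]]
    predict_next(words, ngram_tables)
  })
}

shinyApp(ui = ui, server = server)
```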

ShinyApp