Word Prediction App

Nathan Smith
August 2015

Capstone Project
JHU Coursera Data Science Specialization

App Description and Instructions

The goal of the capstone project was to create a word prediction algorithm and deploy it in a Shiny application. We were instructed to use a collection of newspaper articles, blog posts, and Twitter feeds to train our model. An embedded live version of the app appears in the next slide.

INSTRUCTIONS:

Enter your sentence in the input field and hit “Submit”.
A primary suggestion for the next word will show up in the table to the right. There will be another table with supplementary suggestions beneath it.
Read the DETAIL tab in the embedded app (see its Data/Sampling tab) to learn more about the text-mining process.
Check out the EXPLORE tab in the embedded app to see the most frequent n-grams in the sampled Corpus.

Try the app for yourself; this is an embedded live version.

Algorithm

The algorithm works as follows:

Clean the input sentence.
Determine the length (n) of the cleaned sentence.
If $n \geq 3$, search the 4-gram table for entries whose first three words match words $n-2$, $n-1$, and $n$ of the sentence.
If there are no matches, back off to the 3-gram table (matching on words $n-1$ and $n$), and so on down to lower-order n-grams.
Return the top words in descending order of likelihood.
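
The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual code: the function names (`clean`, `build_tables`, `predict`), the cleaning rule, and the use of raw counts as the likelihood ranking are all assumptions made for the example. It also assumes that a sentence shorter than 3 words starts from the longest context available.

```python
import re
from collections import Counter, defaultdict


def clean(sentence):
    """Lowercase, replace anything but letters, apostrophes and spaces, split into words.

    The exact cleaning rules of the app are not given; this is an assumed rule.
    """
    return re.sub(r"[^a-z' ]+", " ", sentence.lower()).split()


def build_tables(corpus, max_order=4):
    """Map each context (a tuple of 0 to max_order-1 words) to counts of the next word."""
    tables = defaultdict(Counter)
    for line in corpus:
        words = clean(line)
        for i, word in enumerate(words):
            for k in range(max_order):  # k = number of context words
                if i - k >= 0:
                    tables[tuple(words[i - k:i])][word] += 1
    return tables


def predict(sentence, tables, max_order=4, top=3):
    """Try the longest context first, then back off one word at a time."""
    words = clean(sentence)
    longest = min(len(words), max_order - 1)  # at most the last 3 words
    for k in range(longest, -1, -1):
        context = tuple(words[len(words) - k:])
        if context in tables:
            # Within one context, ordering by count is ordering by relative frequency.
            return [w for w, _ in tables[context].most_common(top)]
    return []
```

With a toy corpus such as `["the cat sat on the mat", "the cat sat on the sofa", "the cat sat on the mat", "a dog sat on a log"]`, `predict("The cat sat on the", tables)` finds a 4-gram match on "sat on the", while `predict("my bird sat on", tables)` finds no 4-gram match and backs off to the 3-gram context "sat on".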

Future Work

The algorithm currently relies entirely on, at most, the last 3 words of the sentence. A sentence carries long-range context, however, and the last 3 words may say little about its broader intent. A bag-of-words method should be explored to collect the words used earlier in the sentence and give the prediction more context.
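
As a rough illustration of the idea, the sketch below collects the words that fall outside the 3-word window into an unordered bag of counts. The function name `context_bag` and the `window` parameter are hypothetical, and how such a bag would be combined with the n-gram suggestions (for example, by re-ranking candidates) is left open here.

```python
import re
from collections import Counter


def context_bag(sentence, window=3):
    """Count the words that come before the last `window` words of the sentence.

    Word order is discarded: only which words occurred, and how often, is kept.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    earlier = words[:-window] if len(words) > window else []
    return Counter(earlier)
```

For the sentence "The coach said the team played well last night", the bag holds "the" (twice), "coach", "said", "team", and "played", which are exactly the words the current algorithm never sees.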