Predictive Next Word App

Kenneth D. Graves
April 26, 2015

Introduction

This Shiny application predicts the next word of a phrase using a basic algorithm written in the R programming language. It was produced for the Capstone course of the Johns Hopkins Data Science Specialization on Coursera.

The goal of the Capstone was to study, design and implement a predictive application based on the following criteria:

  • An algorithm that uses common word patterns to predict the next word
  • An easy-to-use interface that displays clear results
  • A footprint small enough to deploy on ShinyIO and other platforms

The App

The application has a very simple interface. To use it, type or paste a short phrase into the text window and click Predict. The app then predicts the most likely next word. Only the last four words of the phrase are used for prediction. The app also shows which stage of the algorithm produced the guess; this is labeled the Type.
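As a minimal sketch of the four-word limit described above, the helper below trims an input phrase to its last few words. The function name `last_words` and its splitting rule are assumptions made for illustration; the app's own input handling is not shown in this write-up.

```r
# Hypothetical helper: keep only the last n words of the input phrase.
last_words <- function(phrase, n = 4) {
  words <- strsplit(trimws(phrase), "\\s+")[[1]]  # split on runs of whitespace
  words <- words[nzchar(words)]                   # drop any empty pieces
  tail(words, n)
}

last_words("I would like to thank you for the")
# "thank" "you" "for" "the"
```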

[Image: UI]

The Algorithm

The application uses a cascading algorithm to perform next word prediction:

  1. First stage: Uses a truncated Katz backoff model that looks up the most frequent 4-gram, and then 3-gram, phrases in a highly efficient hash table.
  2. Second stage: If both hash table lookups fail, the app uses a Naive Bayes classifier model to perform a fallback prediction.
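The first stage can be sketched roughly as follows, using R environments as hash tables. This is an illustration under stated assumptions, not the app's actual code: the table names, the toy entries, the choice to key each table by the preceding words, and the Type labels are all hypothetical. It also stores a single continuation per key rather than discounted probabilities, which is one possible reading of "truncated".

```r
# Toy hash tables keyed by the preceding words (illustrative entries only).
quad_table <- new.env(hash = TRUE)  # key: 3 preceding words -> next word
tri_table  <- new.env(hash = TRUE)  # key: 2 preceding words -> next word
assign("thanks for the", "follow", envir = quad_table)
assign("for the", "first", envir = tri_table)

# Try the 4-gram table first, then back off to the 3-gram table.
backoff_lookup <- function(words) {
  stages <- list(list(n = 3, table = quad_table, type = "4-gram"),
                 list(n = 2, table = tri_table,  type = "3-gram"))
  for (stage in stages) {
    if (length(words) < stage$n) next  # not enough context for this table
    key <- paste(tail(words, stage$n), collapse = " ")
    if (exists(key, envir = stage$table, inherits = FALSE)) {
      return(list(word = get(key, envir = stage$table), type = stage$type))
    }
  }
  NULL  # both lookups failed; a caller would move on to the second stage
}

backoff_lookup(c("many", "thanks", "for", "the"))$word  # "follow"
backoff_lookup(c("waiting", "for", "the"))$type         # "3-gram"
```

A `NULL` result is the signal that neither table matched, which is where a fallback classifier would take over.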

Both the 4-gram and 3-gram hash tables and the Naive Bayes classifier model's priors were built from data supplied by SwiftKey as part of this project. The two-stage approach combines the hash lookup's high efficiency with the Naive Bayes classifier's better use of priors for unseen phrases.
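To illustrate how a Naive Bayes fallback can still produce a guess for an unseen phrase, here is a small base R sketch. It assumes each candidate next word is treated as a class and the context words as features, with Laplace smoothing; the training pairs, the names `train` and `nb_predict`, and this framing are assumptions, since the app's actual model setup is not described here.

```r
# Toy training pairs: context words and the word that followed (made up).
train <- list(
  list(context = c("see", "you"),   nxt = "soon"),
  list(context = c("thank", "you"), nxt = "very"),
  list(context = c("see", "them"),  nxt = "soon"),
  list(context = c("love", "you"),  nxt = "too")
)
vocab   <- unique(unlist(lapply(train, function(r) r$context)))
classes <- unique(vapply(train, function(r) r$nxt, character(1)))

nb_predict <- function(words, alpha = 1) {
  words <- intersect(words, vocab)  # ignore words never seen in training
  scores <- vapply(classes, function(cl) {
    rows <- Filter(function(r) r$nxt == cl, train)
    ctx  <- unlist(lapply(rows, function(r) r$context))
    log_prior <- log(length(rows) / length(train))
    # Laplace-smoothed log-likelihood of each context word given the class
    log_lik <- vapply(words, function(w) {
      log((sum(ctx == w) + alpha) / (length(ctx) + alpha * length(vocab)))
    }, numeric(1))
    log_prior + sum(log_lik)
  }, numeric(1))
  names(scores)[which.max(scores)]
}

nb_predict(c("thank", "you"))  # "very"
nb_predict(c("zebra"))         # "soon": no known words, so the prior decides
```

The second call shows the property the text relies on: when nothing in the phrase has been seen before, the class priors alone still yield a prediction.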

The Data

Both algorithms used by this application were based on three sampled collections of text data from news, blogs and Twitter feeds. Profanity was removed from the sampled texts, and common contractions were replaced with their non-contracted forms. Further processing included the removal of punctuation, numbers and extra whitespace.
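The cleaning steps above might look something like the following in base R. This is only a sketch: the contraction map, the `banned` word list argument, and the lowercasing step are assumptions added for illustration, and the function handles one string at a time.

```r
# Illustrative contraction map; the app's actual list is not given in the text.
contractions <- c("can't" = "cannot", "won't" = "will not", "i'm" = "i am",
                  "n't" = " not", "'re" = " are")

clean_text <- function(x, banned = character(0)) {
  x <- tolower(x)  # assumption: lowercase so the contraction map matches
  for (pat in names(contractions)) {
    x <- gsub(pat, contractions[[pat]], x, fixed = TRUE)
  }
  x <- gsub("[[:punct:]]+", " ", x)     # drop punctuation
  x <- gsub("[[:digit:]]+", " ", x)     # drop numbers
  words <- strsplit(trimws(x), "\\s+")[[1]]
  words <- words[!(words %in% banned)]  # drop words on the profanity list
  paste(words, collapse = " ")          # rejoin with single spaces
}

clean_text("I can't wait, they're here in 2015!!")
# "i cannot wait they are here in"
```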

The Naive Bayes classifier model uses only texts from news and blogs, while the 4-gram and 3-gram hash tables were built from all three sources. This selective approach showed better results in cross-validation and testing.
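One way such a lookup table could be built from cleaned text is sketched below: count every n-gram, then keep the most frequent final word for each prefix. The function name `build_table`, the toy sentences, and the keep-only-the-top-continuation rule are assumptions for illustration; ties are resolved by whichever n-gram sorts first.

```r
# Build a prefix -> most frequent next word table from cleaned sentences.
build_table <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    w <- strsplit(s, " ", fixed = TRUE)[[1]]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  tab <- new.env(hash = TRUE)
  for (g in names(counts)) {
    w <- strsplit(g, " ", fixed = TRUE)[[1]]
    key <- paste(w[-n], collapse = " ")  # the first n - 1 words
    # Counts are sorted, so the first n-gram seen for a key is its most frequent.
    if (!exists(key, envir = tab, inherits = FALSE)) assign(key, w[n], envir = tab)
  }
  tab
}

sentences <- c("thanks for the follow", "thanks for the follow",
               "thanks for the help", "for the first time")
tri <- build_table(sentences, 3)
get("for the", envir = tri)  # "follow"
```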

For further information, you may contact the author here: kgraves@yahoo.com