predictR!

A simple next-word prediction RShiny Application
Sean Dobbs

Purpose

As the Capstone Project for the Data Science Specialization from Johns Hopkins University, offered in partnership with SwiftKey, a simple RShiny application was created to offer predictions/suggestions for the next word in a sequence, given input from a user.

This presentation aims to briefly explain: A) the data that was used, B) the underlying algorithm and methodology, C) limitations of the model and ways to improve it, and D) an overview of the online User Interface.

To skip all of the boring stuff and just play with the app, you can find it here:

https://deebo415.shinyapps.io/predictR

If you want an excess of the boring stuff, you can find the code at:

https://github.com/oraclejavanet/coursera-data-science-capstone/

Text Prediction Model (Stupid Back-off)

The data for the model was a huge collection of text input from Twitter users, bloggers, and news articles. News is very structured; Twitter, blogs, and other social media are not. Ideally, we'd like to use all of the data from everywhere, but this application needs to be fast while still having reasonable accuracy, so a subset of the data was used. More emphasis was placed upon Twitter data and blogs, and less on news articles, since tweets and blogs are more likely to resemble what an everyday human might type.
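
To illustrate, a weighted subset along these lines could be drawn in R. The file names follow the SwiftKey corpus convention, but the sampling fractions shown are assumptions for illustration, not the exact values used:

    # Hypothetical sketch: sample the three corpora with different weights.
    # Fractions below are illustrative assumptions, not the app's values.
    set.seed(415)
    twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
    blogs   <- readLines("en_US.blogs.txt",   skipNul = TRUE)
    news    <- readLines("en_US.news.txt",    skipNul = TRUE)

    # Heavier weight on tweets and blogs, lighter on news
    sampled <- c(sample(twitter, floor(length(twitter) * 0.15)),
                 sample(blogs,   floor(length(blogs)   * 0.15)),
                 sample(news,    floor(length(news)    * 0.05)))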

  • The data was “cleaned” by normalizing it (making it more uniform, and thus a better basis for prediction). Numbers, dates, times, ordinals, and money were removed; everything was made lowercase; and common misspellings were fixed (a sketch of this step appears after this list).
  • Then, the model was created by extracting “n”-grams (series of words of “n” length that appear in sequence) from the data corpus and applying the Stupid Back-off (SBO) methodology to offer a next-word prediction, given user text input (see the scoring sketch below). Other, more complex methods (namely Katz and Modified Kneser-Ney Back-off) were also considered. SBO, while very simple, was chosen because, on massive data sets, it generally performs nearly as well as the more sophisticated n-gram back-off methods.
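
A minimal sketch of the normalization step, assuming simple regex-based substitutions (the actual patterns and misspelling list in the app may differ):

    # Minimal sketch of the normalization step (patterns are illustrative)
    clean_text <- function(x) {
      x <- tolower(x)                               # lowercase everything
      x <- gsub("\\$[0-9,.]+", " ", x)              # strip money amounts
      x <- gsub("[0-9]+(st|nd|rd|th)\\b", " ", x)   # strip ordinals
      x <- gsub("[0-9]+", " ", x)                   # strip numbers/dates/times
      x <- gsub("\\bteh\\b", "the", x)              # fix a common misspelling
      x <- gsub("\\s+", " ", x)                     # collapse whitespace
      trimws(x)
    }

    clean_text("Teh meeting is at 3:00pm on the 21st, budget $1,500")
    #> [1] "the meeting is at : pm on the , budget"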
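
The back-off logic itself is compact: a candidate's score is its relative frequency at the longest observed n-gram order, and unseen contexts "back off" to a shorter context with a fixed 0.4 penalty (the value from Brants et al.'s original formulation). A rough sketch follows, assuming the counts live in one named vector per n-gram order (an illustrative structure, not the app's actual one):

    # Rough sketch of Stupid Back-off scoring.
    # `tables` is assumed to be a list of named count vectors, one per
    # n-gram order, e.g. tables[[2]]["to be"] = count of the bigram "to be".
    sbo_score <- function(word, context, tables, lambda = 0.4) {
      # Only the last (max order - 1) context words can ever match
      context <- tail(context, length(tables) - 1)
      if (length(context) == 0) {
        n <- tables[[1]][word]               # base case: unigram frequency
        return(if (is.na(n)) 0 else unname(n) / sum(tables[[1]]))
      }
      k   <- length(context) + 1
      num <- tables[[k]][paste(c(context, word), collapse = " ")]
      den <- tables[[k - 1]][paste(context, collapse = " ")]
      if (!is.na(num) && !is.na(den)) {
        unname(num / den)                    # observed at the longest order
      } else {
        # Unseen at this order: back off one level with the fixed penalty
        lambda * sbo_score(word, context[-1], tables, lambda)
      }
    }

    tables <- list(c("to" = 50, "be" = 30, "or" = 20, "not" = 10),
                   c("to be" = 12, "be or" = 5, "or not" = 4),
                   c("to be or" = 3, "be or not" = 2))
    sbo_score("or", c("to", "be"), tables)   # 3 / 12 = 0.25

Scores produced this way are not true probabilities (they don't sum to one across back-off levels), but they rank candidates well enough for a suggestion list.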

Limitations; Ways to Improve

  • N-gram methodology is only good for predicting human output given an extremely short human input (on or around the length of the maximum n-gram that the model considers). We could build a 20-, 30-, or 100-gram model; while that may appear to be more accurate, large n-gram models like these:

    • Tend to overfit the data (ironically making them less accurate on new inputs)
    • Take up an exorbitant amount of space
    • Run extraordinarily slowly
  • With a continued investment in predictR!, future versions of the application could be based upon multi-layer neural networks or other deep learning methodologies. Users will love the simplicity and fun of predictR! Version 0.1. They will love the incredible accuracy and capabilities of Versions 2.0, 3.0, and beyond!

Overview of the User Interface

predictR! UI

  1. The user enters a word or phrase in the grey box.
  2. In near-real time, the most likely next-word candidate is shown, along with its probability.
  3. If the user is looking for suggestions for next words rather than a straight prediction, a short list of the next most likely words appears.
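
For readers curious how an interface like this wires together, the following is a minimal Shiny sketch of the three steps above. The widget names and the predict_next() helper are hypothetical stand-ins, not the app's actual code:

    # Minimal Shiny sketch of the UI described above.
    # predict_next() is a hypothetical stand-in for the SBO model; it is
    # assumed to return a data frame of candidate words and their scores.
    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a word or phrase:"),  # the grey box (step 1)
      h4("Most likely next word:"),
      textOutput("top_word"),                          # step 2
      h4("Other suggestions:"),
      tableOutput("suggestions")                       # step 3
    )

    server <- function(input, output) {
      preds <- reactive(predict_next(input$phrase))
      output$top_word    <- renderText(preds()$word[1])
      output$suggestions <- renderTable(head(preds(), 5))
    }

    shinyApp(ui, server)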

This app is easy, fast, fun, reliable, and accurate.