WordPredictR

Phil Ferriere's Data Science Coursera Capstone Project
April, 2016

tl;dr: WordPredictR is a Shiny web app that predicts the user's next word as they type, using a 5-gram language model (with either Stupid Backoff or Backoff with Kneser-Ney Interpolation). It achieves 14.1%/22.7% top-1/top-3 precision on an independent benchmark. The application generates predictions in less than 20ms, and its language model uses 62MB of disk space. The app can be found at https://philferriere.shinyapps.io/WordPredictR

Project Motivation

WordPredictR is a Shiny application that takes a meaningful sequence of words as input and predicts the word most likely to follow or complete that sequence.

This functionality can be used to speed up typing by suggesting the next word to the user, letting them select it rather than type it themselves.

Another potential application is autocomplete: as a user starts entering a query in a search bar, WordPredictR's algorithm could be used to guess which words are most likely to complete the user's search query.

Project Implementation

WordPredictR uses a 5-gram language model and Stupid Backoff in a two-step process[1]:

Candidate Search: The algorithm first uses the last 4 words typed in to find 5-grams that complete those 4 words. If fewer than 5 candidates are found, the app uses the last 3 words typed in and searches for 4-grams that match those 3 words, and so on (a sketch combining both steps follows the scoring rules below).

Candidate Ranking: To rank next-word candidates, we use a mechanism known as Stupid Backoff:

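# score = (count of the matched n-gram) / (count of the typed (n-1)-gram context);
# each backoff to a shorter n-gram multiplies the score by an extra 0.4 penalty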
if (candidateIs5gram) {
  score = matched5gramCount / input4gramCount
} else if (candidateIs4gram) {
  score = 0.4 * matched4gramCount / input3gramCount
} else if (candidateIs3gram) {
  score = 0.4 * 0.4 * matched3gramCount / input2gramCount
} else if (candidateIs2gram) {
  score = 0.4 * 0.4 * 0.4 * matched2gramCount / input1gramCount
}
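
Putting the two steps together, here is a minimal, self-contained R sketch of the search-then-score procedure. The per-order tables keyed by prefix, the function and column names, and the toy counts below are illustrative assumptions rather than the app's actual implementation (see the Documentation tab for that).

library(data.table)

# Illustrative n-gram count tables, one per n-gram order: `prefix` holds the
# first n-1 words, `word` the candidate completion, `count` the corpus count.
ngrams <- list(
  `2` = data.table(prefix = c("new", "new", "of"),
                   word   = c("york", "year", "the"),
                   count  = c(30, 12, 250), key = "prefix"),
  `3` = data.table(prefix = c("happy new", "happy new"),
                   word   = c("year", "years"),
                   count  = c(25, 3), key = "prefix")
)

# Counts of the (n-1)-gram contexts, used as denominators in the score.
prefix_counts <- list(
  `2` = data.table(prefix = c("new", "of"), count = c(60, 400), key = "prefix"),
  `3` = data.table(prefix = "happy new", count = 30, key = "prefix")
)

# Candidate search + Stupid Backoff scoring: start at the highest usable
# order and back off to shorter contexts, multiplying by 0.4 at each step.
predict_next <- function(input_words, max_order = 5, top_n = 5) {
  stopifnot(length(input_words) >= 1)
  candidates <- data.table(word = character(), score = numeric())
  alpha <- 1
  for (n in seq(min(max_order, length(input_words) + 1), 2)) {
    context <- paste(tail(input_words, n - 1), collapse = " ")
    tbl <- ngrams[[as.character(n)]]
    ctx <- prefix_counts[[as.character(n)]]
    if (!is.null(tbl) && context %in% tbl$prefix) {
      matches <- tbl[.(context)]
      matches[, score := alpha * count / ctx[.(context), count]]
      candidates <- rbind(candidates, matches[, .(word, score)])
    }
    if (nrow(unique(candidates, by = "word")) >= top_n) break
    alpha <- alpha * 0.4   # backoff penalty for the next, shorter context
  }
  head(unique(candidates, by = "word")[order(-score)], top_n)
}

predict_next(c("happy", "new"))   # returns year, york, years with their scores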

[1] See WordPredictR's Documentation tab for details.

Using WordPredictR

To start WordPredictR, go to https://philferriere.shinyapps.io/WordPredictR. This application can operate in two modes (use the left sidebar to switch between them):

  • In Live Predictor mode, the application ingests user-typed text and predicts the next word.
  • In “Canned Sentences” mode, the user can watch the predictor operate on a randomly selected list of sentences.

With Live Predictor, type a few words at the top of the main body of the page, and watch the algorithm perform its magic.

The page displayed in Canned Sentences mode isn't static; the algorithms are actually run on the server.

In either mode, the displayed information includes the top-5 most likely word predictions and their scores.

WordPredictR's Performance

We benchmarked WordPredictR on 599 blog sentences + 793 tweets (28,445 predictions) with a tool called benchmark.R[1]:

# Overall top-3 score:     18.70 %
# Overall top-1 precision: 14.05 %
# Overall top-3 precision: 22.73 %
# Average runtime:         13.80 msec
# Number of predictions:   28445
# Total memory used:       209.91 MB

By using lookup tables and smart pruning, WordPredictR achieves both speed (less than 20ms per prediction on a 5-year-old workstation) and low memory usage – the final LUTs require 62.8MB of disk storage and use only 210MB of RAM[2].
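
For illustration, here is a small R sketch of the kind of pruning that keeps such lookup tables compact: drop rare n-grams, keep only the top few completions per prefix, and key the table so lookups are fast binary searches. The thresholds, column names, and rules here are assumptions for this sketch, not the app's exact pruning strategy (see the Documentation tab for that).

library(data.table)

# Hypothetical pruning pass over an n-gram lookup table (LUT).
prune_lut <- function(lut, min_count = 2, keep_per_prefix = 5) {
  lut <- lut[count >= min_count]                                  # discard rare n-grams
  lut <- lut[order(-count), head(.SD, keep_per_prefix), by = prefix]
  setkey(lut, prefix)                                             # keyed => fast lookups
  lut[]
}

# Toy 3-gram LUT: "yr" (count 1) is dropped; the rest survive pruning.
lut3 <- data.table(prefix = c("happy new", "happy new", "happy new", "in the"),
                   word   = c("year", "years", "yr", "morning"),
                   count  = c(25, 3, 1, 7))
prune_lut(lut3)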

[1][2] See WordPredictR's Documentation tab for details.