n-gram prediction app


A text prediction application to ease input on web pages, prepared as the Capstone project of the Johns Hopkins Data Science Specialization offered on Coursera.

Oscar de Leon
2015-04-25

Description of the application

The application, n-gram predictor, takes plain text input in English and uses simple n-gram models to predict the next word.

It shows the prediction on a red button so the user can pick it, and also provides two additional suggestions (grey buttons). Some details on the likelihood of each word are also displayed.

The following image shows the general appearance of the application:

screenshot

Algorithm used to make the prediction

To make a prediction, the application takes the last recognized words of the text input (up to 3) and searches for them in an n-gram table. For each n-gram order, this lookup returns the absolute frequency of every n-gram that starts with the last n-1 words and ends with a prediction option.

The retrieved frequencies are used to compute the likelihood of each prediction option given the number of times its “root” (the previous n-1 words) was observed, and a linear interpolation is performed to select the best prediction across all the n-gram orders. The model always performs back-off, so that information from all the n-gram orders is used.

The n-gram table contains information on 2-, 3- and 4-grams derived from the datasets provided by the course instructors.

Instructions to use the application

You can access the application through the link provided on the evaluation page of the course site. Some panels with instructions and additional information are provided under the “Information” tab.

To use the application, write some English text in the text box found under the “App” tab. To request a prediction you can either:

type a space at the end of the text input,
click/tap on the Predict! button, or
click/tap one of the buttons showing words above the text box, labeled 1 through 3. This adds the selected word to the text input before requesting the prediction.

Description of the application's workings

The application is built on a Shiny backend, as provided by RStudio. The text box and the buttons are observed for user interaction in a reactive environment, and the application performs actions based on user input.

The application gets its accuracy from a large lookup table, and its speed from the following design choices:

The n-grams are encoded as short integers to reduce storage size
The table is stored using the ff package, so it is read directly from disk instead of being loaded into memory
Indexing is used to convert the input words into integers and the prediction into words shown to the user
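
The integer encoding and indexing can be illustrated with a short Python sketch. This is only an analogy to the R implementation: the vocabulary, the use of id 0 for unknown words, and the names encode and decode are assumptions for illustration. The disk-backed storage provided by the ff package is not reproduced here.

```python
from array import array

# Hypothetical vocabulary; a word's position in the list is its integer id.
# Id 0 is reserved here for words not in the vocabulary (an assumption).
VOCAB = ["<unk>", "a", "best", "of", "one", "the"]
WORD_TO_ID = {word: index for index, word in enumerate(VOCAB)}


def encode(words):
    """Convert words to unsigned short integers (type code 'H')."""
    return array("H", (WORD_TO_ID.get(word, 0) for word in words))


def decode(ids):
    """Convert integer ids back into the words shown to the user."""
    return [VOCAB[i] for i in ids]


encoded = encode(["one", "of", "the"])
print(list(encoded), "uses", encoded.itemsize * len(encoded), "bytes")
print(decode(encoded))
```

Storing each word as a small fixed-width integer rather than a string keeps every n-gram row the same size, which is what makes a compact, directly indexable table possible.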