Alexander Carlton
May 31st, 2016
An exercise in text prediction, and how context affects predictions.
Many apps can predict the next word; PredictAbility shows how the choice of language model affects those predictions. There are five different language models built into this app – three large models built from corpora of more than 30 million words each, plus two small models built from very different, specialized sources. Clicking on the name of a model causes the current text to be re-analyzed under the new selection, and all of the probabilities and predictions are updated immediately.
Responsive behavior is a major part of PredictAbility. The app can process hundreds of N-gram probabilities per second, enabling near-instantaneous updates as the user clicks through the different language models.
Active: Completes the current phrase as it is typed and includes keyboard shortcuts for accepting one of the current predictions in place of the word being typed.
Batch: Analyzes blocks of text, from short phrases up to thousands of words, and displays the current model's probability estimate for each word of the provided text.
Legend: Describes the color coding used for all prediction output (prediction confidence is shown in shades of green; shades of red indicate the relative rarity of words).
Model: Information about each language model and the corpus used to train that model.
Notes: Helpful information and suggestions, along with implementation notes and links to source materials.
In the “Active” interface, it can be illustrative to start a sentence and then switch between the available models to see how the predictions differ.
The “Batch” interface is useful for comparing probability estimates. Each model has certain kinds of text where it does well and others where it struggles. To see the extremes, try the Shakespeare-based model and the model built from the class forum.
The predictions are based on an implementation of the Kneser-Ney smoothing algorithm, which precalculates the expected probabilities for all known N-grams. Because the models have widely varying degrees of pruning, this app uses a single fixed discount value, as in the original algorithm, rather than the multiple discount values of the 'modified' form of Kneser-Ney.
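As a point of reference, the single-discount (interpolated) form of Kneser-Ney for a bigram model can be sketched roughly as below. This is a minimal, generic illustration and not the app's actual code; the function name, toy corpus, and discount value of 0.75 are all assumptions.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney for bigrams with one fixed discount.
    A generic textbook-style sketch, not PredictAbility's own code."""
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    history_counts = Counter(w1 for w1, _ in bigrams)

    followers = defaultdict(set)      # distinct words seen after each history
    predecessors = defaultdict(set)   # distinct words seen before each word
    for w1, w2 in bigram_counts:
        followers[w1].add(w2)
        predecessors[w2].add(w1)
    total_bigram_types = len(bigram_counts)

    def prob(w1, w2):
        # Continuation probability: how many distinct contexts w2 completes.
        p_cont = len(predecessors[w2]) / total_bigram_types
        if history_counts[w1] == 0:
            return p_cont  # unseen history: fall back to the continuation term
        # Discounted relative frequency of the observed bigram.
        discounted = max(bigram_counts[(w1, w2)] - discount, 0) / history_counts[w1]
        # Back-off weight: the probability mass freed up by discounting.
        lam = discount * len(followers[w1]) / history_counts[w1]
        return discounted + lam * p_cont

    return prob

# Example on a toy corpus: an observed bigram vs. an unseen one.
p = kneser_ney_bigram("the cat sat on the mat and the cat ran".split())
print(p("the", "cat"), p("the", "dog"))
```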
Runtime speed is achieved by pre-processing the language models. When the app is built, a Python script reads and parses each corpus, builds hashed dictionaries of millions of N-grams, calculates the smoothed probabilities, and stores the results in SQL tables indexed by N-gram and partial N-gram. The script may churn for several minutes and use gigabytes of memory, but the final tables fit within the tight resource limits of Shinyapps.io and enable the app's responsive behavior.
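The shape of that build step might look something like the following sketch, which stores each N-gram under its prefix in SQLite and answers predictions with a single indexed query. The table layout, column names, and the plain relative-frequency probability used here are simplifying assumptions, not the app's actual schema or smoothing.

```python
import sqlite3
from collections import Counter

def build_ngram_table(tokens, db_path="model.sqlite", n=3):
    """Count N-grams, attach a probability to each, and store them indexed by
    their (n-1)-word prefix. A simplified sketch of the build-time step."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefix_totals = Counter()
    for gram, count in ngrams.items():
        prefix_totals[gram[:-1]] += count

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS ngrams (prefix TEXT, word TEXT, prob REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_prefix ON ngrams(prefix)")

    rows = []
    for gram, count in ngrams.items():
        # Placeholder: plain relative frequency stands in for the
        # Kneser-Ney smoothed probability computed in the real script.
        prob = count / prefix_totals[gram[:-1]]
        rows.append((" ".join(gram[:-1]), gram[-1], prob))
    conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def predict(conn, prefix_words, k=3):
    """Return the k most probable next words for a prefix (runtime lookup)."""
    cur = conn.execute(
        "SELECT word, prob FROM ngrams WHERE prefix = ? ORDER BY prob DESC LIMIT ?",
        (" ".join(prefix_words), k))
    return cur.fetchall()
```

At runtime only an indexed lookup along the lines of `predict` above needs to run, which is what allows the interface to stay responsive while switching between models.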