A Word Prediction Algorithm based on Stupid Backoff n-gram Model

Renato P. dos Santos
1st March 2017

Built to fulfill the Johns Hopkins Data Science Specialization Capstone Project requirements.

Introduction

As people around the world are spending an increasing amount of time on their mobile and accessibility devices, predictive text becomes a most-needed input technology. However, old systems, such as T9 and WordWise, sometimes produced hilarious “damn you autocorrect” results.

This more advanced, probabilistic-language-modeling-based approach 'knows' how certain words tend to be combined together in our language and tends to exhibit a far greater accuracy.

It was developed within a partnership of Johns Hopkins with SwiftKey, the leading company on predictive text input for Android and iOS keyboards.

Highlights

Speed: The probabilities were all previously computed and are loaded before execution. The app searches through millions of words down the tables to instantly recover the most likely next word.
Versatility: The algorithm is handles many contractions used in Internet language: e.g. “2 b or not 2” will be translated as “to be or not to” and “be” will be suggested as the next word.
Safety: Profanities and bleeped words (e.g. 'f***' and 'f#@%') are removed from user input as were also previously removed from the tables.
Naturalness: Stopwords, on the other hand, were left in, as they are present in normal language and could be the expected next input from a user.

The Algorithm

It is based on the widely used 4-gram language model and on the “Stupid Backoff” approach. More details can be found here.

Step 1: If the user enters e.g. “to be or not to”, it is chopped to the last 3 words (“or not to”).
Step 2: A search is done for a match to the chopped input.
Step 3: If a match is found, the algorith skips to Step 4.
- Otherwise, the user input is shortened again (“not to”, etc.), and the algorithm go back to Step 2.
Step 4: If a match was found, it is returned to the user interface.
- Otherwise, the most frequent word in the corpora is returned.

The Shiny App

The goal of this project was to create a product to highlight the prediction algorithm built and to provide an interface that can be easily accessed by others.
The working app is available here
It is simple and intuitive to use. Just type in the first few words of a sentence and the suggested next word will immediatly show up on the right, as shown below.