Predictive Text Modeling Using N-grams

Richard A. Lent
2 April 2019

Coursera Data Science Capstone Project

Introduction

N-grams are contiguous sequences of words in a body of text, or corpus. The “n” in n-gram is the number of words in the sequence: a 2-gram is a sequence of two words, a 3-gram is a sequence of three words, and so on. The n-gram predictive text model reduces the complexity of natural language to the few words a user has typed most recently, and uses those words to predict the next word in the sequence (see n-gram).
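As a minimal illustration of the definition above, the following Python sketch lists the n-grams in a short string. It is not the app's R code; the function name ngrams and the choice to lowercase the text and split on whitespace are assumptions made for this example.

```python
def ngrams(text, n):
    """Return every contiguous n-word sequence in text as a tuple of words.

    Tokenization here is an assumption for illustration: the text is
    lowercased and split on whitespace, with no punctuation handling.
    """
    words = text.lower().split()
    # A text with fewer than n words yields an empty list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("The quick brown fox", 2)
trigrams = ngrams("The quick brown fox", 3)
```

Here bigrams holds three 2-grams (the quick, quick brown, brown fox) and trigrams holds two 3-grams (the quick brown, quick brown fox).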

predictWord: A Shiny web app

predictWord accepts typed user input and predicts the next word, using an n-gram model and a simple backoff algorithm.
To access the web app, click here. Allow about 15 seconds for the app to load.
Type some words into the text box, then press the Predict button.
The predicted word will be displayed along with the history of the backoff algorithm.

How the Prediction Algorithm Works

After obtaining input from the user, the app searches an internal data table for a matching n-gram. The data table contains over five million n-grams produced from a text corpus of blog postings, Twitter posts, and news feeds. If a match is found, the corresponding predicted word is returned and the backoff history is displayed. If no match is found, the first word is removed from the n-gram and the search is repeated (the backoff model). This continues until either a match is found or the n-gram has been reduced to a single word that still has no match. In that case the most common word in the English language, “the,” is returned as the “best guess.”
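The backoff search described above can be sketched in a few lines of Python. This is a hedged illustration, not the app's R implementation: the function name predict_word, the max_context parameter (how many trailing words are kept as the starting n-gram), and the tiny toy_table are all hypothetical, and the real data table holds over five million n-grams.

```python
def predict_word(text, table, max_context=3, default="the"):
    """Predict the next word with a simple backoff search.

    table maps a space-joined word sequence to its predicted next word.
    max_context is an assumed limit on the starting n-gram length.
    Returns the prediction and the list of n-grams that were tried.
    """
    words = text.lower().split()[-max_context:]
    history = []
    while words:
        key = " ".join(words)
        history.append(key)
        if key in table:
            return table[key], history
        words = words[1:]  # back off: drop the first word and retry
    # Nothing matched, even as a single word: fall back to the default.
    return default, history


# Made-up entries for illustration only.
toy_table = {"one of the": "most", "of the": "year", "the": "first"}

direct = predict_word("this is one of the", toy_table)
backed_off = predict_word("at the end of the", toy_table)
fallback = predict_word("purple zebra", toy_table)
```

With this toy table, the first call matches "one of the" immediately, the second backs off from "end of the" to "of the", and the third exhausts its words and returns the default “the.” The returned history plays the role of the backoff history that the app displays.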

Code and Data

For more information on the development of this web app, including R code for data processing and implementation of the word prediction model, see Milestone Report: Predictive Text Modeling Using N-grams.