Next Word Predictor

Jeff Register
6 June 2021

Johns Hopkins University Data Science Specialization

Capstone Project

What is Next Word Predictor?

The objective of the Johns Hopkins University Data Science capstone project is to create a “next word” predictor, similar (but not identical) to how a mobile phone or Google search “suggests” the next word when you are typing.

The Next Word Predictor (link) is a web-based application which an English-speaking user can interact with the prediction model.

The Challenge

The application has 2 key requirements for the Next Word Predictor to be considered a success:

reasonably accurate (i.e., the suggested next word are rationally expected for an English-speaking person)
responds quickly so that the user is not waiting for a response from the application

How It Works (User View)

Using the Next Word Predictor is easy.

Simply start typing in the text box
App suggests up to 3 words which to continue the user's thought.

Suggested words appear below the text box and are in bold blue. The phrase which the user has typed appears before the suggested word

In addition, a frequency percentage appears next to each suggestion which shows percentage of time that the suggested next word appears in this context.

Approach to Build Predictor

To build the predicting model, extracts of three (3) English text sources provided by JHU and Swiftkey are used as the representative “normal” English language/speech.

These raw text sources are input into the R natural language processing (NLP) quanteda package to “normalize” the text for analysis.
N-grams represent sequences of tokens (words) which occur within text extracts. 2-gram, 3-gram and 4-gram sequences have been generated.

Predictive model is based on the concept of Markov chains. Based on the frequency that an n-gram occurs, a probability is assigned to the next word.

How It Works (Data Scientist's View)

A simple Markov model is the prediction engine (see figure).

N-grams are decomposed into search terms (the first n-1 words) and the next word (nth word in the n-gram) and stored in a table.
For each n-gram and its “next word”, two (2) probabilities are calculated: Maximum Likelihood Estimate and Modified interpolated Kneser-Ney smoothing estimate

Markov Table

Searching the Chain using Backoff

Illustrative Markov Model The last n-1 words are taken from the user input string and used to match with the search terms.

1. If a match is found, the highest Kneser-Ney estimates are selected and returned.

2. If no match, then the number of words from user input string is reduced by 1 and the search is repeated.