Predicting the Next Word: An Interactive Tool

Lori Ziegelmeier
April 26, 2015

A project for the Coursera Data Science Specialization

Overview

Goal of This Project: Develop a predictive text application that predicts the next word in a phrase.

Why?: Millions of smartphone and tablet users around the world input text on small devices. With 'fat fingers', entering text can be cumbersome, so methods that speed up typing, such as predictive text applications, are warranted.

In Fact: Entire companies, such as our corporate partner SwiftKey, have been formed with this purpose in mind.

Nuts and Bolts of the Application

The foundation for this application is a corpus of English text drawn from 3 sources: news feeds, blog posts, and Twitter messages.

A random 15% sample of the entries from each source is combined into a subcorpus of over 640,000 lines of text, which is then cleaned and tokenized.
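The sampling and cleaning step might look roughly like the sketch below. The function names, the fixed seed, and the specific cleaning rules (lowercasing, keeping only letters and apostrophes) are assumptions made for illustration; the slides do not spell out the exact rules used.

```python
import random
import re

def sample_lines(lines, fraction=0.15, seed=0):
    """Keep each line independently with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_and_tokenize(line):
    """Lowercase, replace anything that is not a letter, apostrophe, or space, then split."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)
    return line.split()

# Example: tokens = [clean_and_tokenize(l) for l in sample_lines(all_lines)]
```

Sampling each line independently yields roughly, not exactly, 15% of the entries; an exact-size sample could be drawn with `random.sample` instead.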

A table of \( n \)-grams (phrases consisting of \( n \) consecutive words appearing in our corpus) is constructed for \( n=1,\ldots,5 \).

A probability based on frequency counts is recorded with each \( n \)-gram, and each table is sorted by decreasing probability.
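One way such a table could be built is sketched below. It assumes the probability of an \( n \)-gram is the relative frequency of its last word given the preceding \( n-1 \) words; the slides say only that probabilities are "based on frequency counts", so this estimator and the name `build_ngram_table` are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_ngram_table(token_lines, n):
    """Map each (n-1)-word prefix to [(next_word, probability), ...],
    sorted by decreasing probability."""
    counts = defaultdict(Counter)
    for tokens in token_lines:
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            counts[prefix][tokens[i + n - 1]] += 1
    table = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        # most_common() already orders candidates by decreasing count
        table[prefix] = [(word, c / total) for word, c in followers.most_common()]
    return table

# Example: tables = {n: build_ngram_table(token_lines, n) for n in range(1, 6)}
```

For \( n=1 \) the prefix is the empty tuple, so the table reduces to plain word frequencies.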

Together, the \( n \)-gram tables form a database that is loaded into the app. The app itself only performs look-ups against this precomputed database, which keeps prediction fast.

Using the Application

The user inputs a phrase. The app cleans and tokenizes the input phrase.

Only the last four words in the phrase are used to predict the next word.

The phrase is matched against existing phrases in the appropriate \( n \)-gram table (for example, the 5-gram table for a four-word input). If at least one match exists, the predicted word is the continuation with the highest probability.

If no match exists, a backoff model is employed, searching through the \( (n-1) \)-gram table, the \( (n-2) \)-gram table, and so on, until a match is found. If no match is found at all, the application predicts the most common word in the English language: "the".
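The look-up with backoff described above can be sketched as follows. The layout of `tables` (a dict from \( n \) to a mapping of \( (n-1) \)-word tuples onto candidates sorted by decreasing probability) and the function name `predict_next` are assumptions for illustration, not the app's actual code. This sketch backs off down to the 2-gram table and then falls back to "the".

```python
def predict_next(tokens, tables, max_n=5, top_k=3):
    """Return up to `top_k` predicted next words for a cleaned, tokenized phrase.

    tables[n] maps an (n-1)-word tuple to [(word, probability), ...],
    already sorted by decreasing probability (assumed layout).
    """
    # Keep only the last max_n - 1 words (the last four when max_n = 5).
    context = list(tokens[-(max_n - 1):]) if max_n > 1 else []
    while context:
        candidates = tables.get(len(context) + 1, {}).get(tuple(context))
        if candidates:
            return [word for word, _ in candidates[:top_k]]
        context = context[1:]  # back off: drop the earliest word
    return ["the"]  # no match in any table

# Example with tiny hand-made tables:
# tables = {3: {("thank", "you"): [("for", 0.9)]}}
# predict_next(["i", "want", "to", "thank", "you"], tables)
```

Because each table is pre-sorted, returning the top three candidates is just a slice of the matched entry.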

Sample Application Performance

Consider the two examples at right:

a match in the 5-grams
a match in the 3-grams

In each case, the app outputs a reasonable prediction and also displays the top three predicted words.

Now, you can try it!

Just go here to enjoy predicting the next word.
