Text Prediction Application

Data Science Specialization: Capstone Project

By: Kevin Markham

Introduction

The goal of this project was to build an application that takes a phrase typed by the user and predicts the next word they “most likely” want to type.

The primary use case for this application is text messaging on mobile phones, where successfully predicting the next word saves the user from typing it and so increases their overall typing speed.

The data available for training the predictive model consisted of millions of tweets, blog posts, and news articles in English. (Files in other languages were available but were not used.)

Model Training

The first step in model training was learning all of the 2-grams (word pairs), 3-grams (word triplets), and 4-grams (word quadruplets) in about half of the training data, as well as their frequencies.

Each 4-gram was then broken into a 3-gram (its first 3 words) and the final word. For each of the resulting 3-grams, the most common final word was calculated.

This process was repeated for the original set of 3-grams, producing a set of 2-grams and the most common next word for each 2-gram.

The same process was repeated once more for the original set of 2-grams, producing a set of single words and the most common next word for each word.
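The training steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name build_lookup, the lowercase whitespace tokenization, and the toy corpus are all assumptions made for the example. It counts each prefix and next-word pair directly, which gives the same result as counting whole n-grams first and then splitting off the final word.

```python
from collections import Counter, defaultdict

def build_lookup(sentences, n):
    """Map each (n-1)-word prefix to its most common next word.

    Tokenization here (lowercase, split on whitespace) is an assumption
    for illustration; the original project may have cleaned text differently.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Slide a window of n words over the sentence.
        for i in range(len(words) - n + 1):
            prefix = tuple(words[i:i + n - 1])   # first n-1 words
            next_word = words[i + n - 1]         # final word of the n-gram
            counts[prefix][next_word] += 1
    # Keep only the single most frequent next word for each prefix.
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}

# Toy corpus standing in for the tweets, blog posts, and news articles.
corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
    "for the first time",
]

# One lookup table per n-gram size: 4-grams, 3-grams, and 2-grams.
tables = {n: build_lookup(corpus, n) for n in (2, 3, 4)}
```

In this toy corpus, the prefix ("thanks", "for", "the") is followed by "follow" twice and "help" once, so the 4-gram table stores "follow" for it.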

Using the Application and Making Predictions

When a user types a phrase into the application, the application quickly makes a single prediction for the next word. The prediction algorithm is simple:

Examine the final three words typed by the user. If that 3-gram was present in the training data, predict the most common next word. If not, continue:
Examine the final two words typed. If that 2-gram was present in the training data, predict the most common next word. If not, continue:
Examine the last word typed. If that word was present in the training data, predict the most common next word. If not, predict the word “the”.

Strengths and Possible Enhancements

Because all predictions are “pre-calculated” and stored in a lookup table, the application can make predictions very quickly: each prediction requires at most three lookups, one each for the previous 3, 2, and 1 words.

There are many possible enhancements to the application:

Take into account earlier words in the sentence (rather than just the final 3 words).
“Weight” the 4-grams, 3-grams, and 2-grams to produce an overall “score”, such that (for example) a frequent 3-gram is weighted higher than an infrequent 4-gram.
Offer the user multiple suggested words, rather than just one.
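As a rough sketch of how the weighting and multiple-suggestion enhancements might fit together, the example below keeps full next-word counts instead of a single best word and combines them into a score. This is one possible approach, not something the project implemented: the function name suggest, the scoring rule (weight times raw count), the weight values, and the sample counts are all hypothetical.

```python
from collections import Counter

def suggest(phrase, counts, weights, top_k=3):
    """Return up to top_k candidate next words, ranked by a weighted score.

    `counts` maps prefix length k to {k-word tuple: Counter of next words}.
    `weights` maps prefix length k to a multiplier. Both are assumptions.
    """
    words = phrase.lower().split()
    scores = Counter()
    for k, weight in weights.items():
        if len(words) < k:
            continue
        followers = counts.get(k, {}).get(tuple(words[-k:]))
        if not followers:
            continue
        # Score grows with both the context length (via weight) and the
        # raw frequency, so a frequent shorter n-gram can outrank a rare
        # longer one.
        for word, n in followers.items():
            scores[word] += weight * n
    return [word for word, _ in scores.most_common(top_k)]

# Hypothetical counts: the 3-word context was seen only once.
counts = {
    3: {("thanks", "for", "the"): Counter({"rt": 1})},
    2: {("for", "the"): Counter({"first": 50, "rt": 2})},
    1: {("the",): Counter({"same": 100, "first": 80})},
}
weights = {3: 4.0, 2: 2.0, 1: 0.5}  # illustrative values only

print(suggest("thanks for the", counts, weights))
```

Here "first" scores 2.0 * 50 + 0.5 * 80 = 140, "same" scores 0.5 * 100 = 50, and "rt" scores 4.0 * 1 + 2.0 * 2 = 8, so the frequent shorter contexts outrank the single 4-gram observation.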