Next Word Prediction Proof-of-Concept

David W. Williams, PMP

Introduction

This is a demonstration of a text prediction proof-of-concept I produced for my Coursera Data Science Capstone Project. The Capstone Project concludes a sequence of 10 classes covering the fundamentals of Data Science.

Text prediction is an area of study within Natural Language Processing that aims to provide systems that guesses the next most likely word given the words found in the prior context. The most widely understood use case for this is the predictive text functions found on modern phones that allows users to select the next most likely word instead of type it themselves. This is illustrated in the following screen capture where the user can select "courses", "delivered", or "is" with a single touch on their phone.

This presentation will allow you to try the text prediction app that I created for the Capstone and explain how I created it.

Try the Text Prediction App Right Here!

The text prediction app can be tested on this slide or by visiting the app's home page on shinyapps.io. After a short initialization period that is usually about 1 second on the shinyapps server, a demo phrase and the next most likely word is predicted. You can then erase the demo phrase and interactively type any phrase or copy and paste a phrase in order to see the next word.

One feature of this app is that the prediction is so rapid that the next word is looked up as soon as you type something. No "predict" button is required to make the algorithm function.

Engineering

This app was created by analyzing a large corpus of text taken from US English language news, blogs, and twitter feeds. This data amounted to about 560MB of raw text. The libraries used to parse and analyze these data often run into memory limitations so a sample of the data is often used ranging from 1 to 30%. My applications uses all the data found in the capstone in order to create the predictive text data structures. Furthermore, this analysis was performed on all observed single words, 2 word combinations (bigrams), 3 word combinations (trigrams) and up to all 5 word combinations. This was accomplished by renting an Amazon EC2 memory instance with 244GB of ram.

The data resulting from the use of standard text processing libraries (I used Quanteda) is not well suited for quick lookup of a word given a prior context. I quickly realized that a hash table is the optimal data structure for this purpose. I tried two in-memory hash table implementations R environments and the Hash package. While both offered excellent lookup speeds, they suffered from disk loading and memory size limitations. I decided to try a key value disk storage solution and selected Rodger Peng's FileHash implementation. It also was too limited from a disk loading perspective. Finally, I used an SQLLite database loading all the ngrams into SQL structures:

Language Model

A simplified version of Katz's Back-off Model is used for text prediction. The algorithm is quite simple to understand and works with reasonable accuracy. It works as follows:

Try to find the previous 4 words in the map of 5 grams, and, if found, return the last word of the most frequent matching 5 gram.
Try to find the previous 3 words in the map of 4 grams, and, if found, return the last word of the most frequent matching 4 gram.
Try to find the previous 2 words in the map of 3 grams, and, if found, return the last word of the most frequent matching 3 gram.
Try to find the previous 1 words in the map of 2 grams, and, if found, return the last word of the most frequent matching 2 gram.

Using this model, I was able to obtain a prediction accuracy of 13.1% using the course benchmark. The best published accuracies within my class was around 19%.

Benefits

The benefits of my solution approach are:

Instantaneous performance: The next word is predicted as quickly as the user can type.
Low memory footprint: No large data structure is loaded into this app's memory. All data is stored in a disk based SQLlite database.
Low disk space footprint: The on-disk persistent storage is a modest 30 MB. This is well within the typical disk storage limits of most phones.
Standard mobile technologies: SQLite datbases are a very standard development development technology that can be accessed by just about any development platform including mobile platforms.
Ease of implementation: The solution is very easy to implement and understand.

Next Steps

The most obvious improvement to be made is in the prediction accuracy. Using the Good Turing method can work with the engineering solution (pre-processed SQLlite database) of this app.