David Williams
2016-04-24
This is a demonstration of a text prediction proof-of-concept I produced for my Coursera Data Science Capstone Project. The Capstone Project concludes a sequence of 10 classes covering the fundamentals of Data Science.
Text prediction is an area of study within Natural Language Processing that aims to build systems that guess the next most likely word given the words found in the prior context. The most widely understood use case for this is the predictive text function found on modern phones, which allows users to select the next most likely word instead of typing it themselves. This is illustrated in the following screen capture, where the user can select “courses”, “delivered”, or “is” with a single touch on their phone. This proof of concept demonstrates the following characteristics:
1. Instantaneous performance: The next word is predicted as quickly as the user can type.
2. Low memory footprint: No large data structure is loaded into the app’s memory. All data is stored in a disk-based SQLite database (see the sketch after this list).
3. Low disk space footprint: The on-disk persistent storage is a modest 30 MB. This is well within the typical disk storage limits of most phones.
4. Standard mobile technologies: SQLite databases are a very standard development technology that can be accessed by just about any development platform, including mobile platforms.
5. Ease of implementation: The solution is very easy to implement and understand.
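To illustrate how little the app holds in memory, here is a minimal sketch of a disk-based lookup using the DBI and RSQLite packages. The table and column names (`ngrams`, `prefix`, `word`, `count`) and the database file name are assumptions for illustration; the app's actual schema may differ.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "ngrams.sqlite")

# Nothing is loaded into R's memory up front; each lookup is a
# single indexed query against the ~30 MB database file.
dbGetQuery(
  con,
  "SELECT word, count FROM ngrams WHERE prefix = ? ORDER BY count DESC LIMIT 3",
  params = list("one of the")
)

dbDisconnect(con)
```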
The text prediction app can be tried by visiting the app's home page on shinyapps.io. After a short initialization period, usually about 1 second on the shinyapps server, a demo phrase is displayed and the next most likely word is predicted. You can then erase the demo phrase and interactively type any phrase, or copy and paste one, in order to see the next word.
One feature of this app is that the prediction is rapid enough that the next word is looked up as soon as you type something; no “predict” button is required to make the algorithm function.
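This button-free behavior falls out of Shiny's reactive model. Here is a sketch of the pattern, assuming a `predict_next_word()` helper like the back-off sketch later in this post; it is illustrative, not the app's exact code.

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # renderText re-runs automatically whenever input$phrase changes,
  # so the lookup fires on every keystroke with no "predict" button.
  output$prediction <- renderText({
    req(nzchar(trimws(input$phrase)))
    predict_next_word(input$phrase)  # hypothetical helper; sketched below
  })
}

shinyApp(ui, server)
```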
This app was created by analyzing a large corpus of text taken from US English language news, blogs, and Twitter feeds. The data amounted to about 560 MB of raw text. The libraries used to parse and analyze these data often run into memory limitations, so a sample of the data ranging from 1% to 30% is commonly used instead. My application uses **all** the data found in the capstone in order to create the predictive text data structures. Furthermore, this analysis was performed on all observed single words, 2-word combinations (bigrams), 3-word combinations (trigrams), and so on up to all 5-word combinations. This was accomplished by renting an Amazon EC2 memory-optimized instance with 244 GB of RAM.
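The ngram counting step might look roughly like the following sketch using the quanteda package. The file name and tokenization options are illustrative assumptions; the exact settings used for the app are not shown in this post.

```r
library(quanteda)

# Read one of the raw corpus files (file name is illustrative).
raw_text <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

toks <- tokens(
  raw_text,
  remove_punct   = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE
)

# Count every observed trigram; the same pattern applies for
# n = 1 through n = 5.
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
counts   <- colSums(dfm(trigrams))

head(sort(counts, decreasing = TRUE))
```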
The data resulting from the use of standard text processing libraries (I used quanteda) is not well suited for quick lookup of a word given a prior context. I quickly realized that a hash table is the optimal data structure for this purpose. I tried two in-memory hash table implementations (R environments and the hash package). While both offered excellent lookup speeds, they suffered from disk-loading and memory-size limitations. I then tried a key-value disk storage solution, Roger Peng’s filehash package, but it was also too limited from a disk-loading perspective. Finally, I used an SQLite database, loading all the ngrams into SQL tables.
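A sketch of that loading step, assuming the ngram counts have been reduced to rows of (prefix, word, count); the table name, index, and example rows are assumptions, not the app's exact schema.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "ngrams.sqlite")

# Example rows: the context words joined with spaces, the observed
# next word, and the number of times the full ngram was seen.
ngrams <- data.frame(
  prefix = c("one of the", "one of the", "of the"),
  word   = c("most", "best", "day"),
  count  = c(104L, 62L, 31L),
  stringsAsFactors = FALSE
)

dbWriteTable(con, "ngrams", ngrams, overwrite = TRUE)

# An index on the prefix column is what makes lookups effectively
# instantaneous, even with the table living on disk.
dbExecute(con, "CREATE INDEX idx_prefix ON ngrams (prefix)")

dbDisconnect(con)
```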
A simplified version of Katz's Back-off Model along with ngrams is used for text prediction. The algorithm is quite simple to understand and works with reasonable accuracy. It works along these lines:

1. Take the last 4 words of the user's phrase and look them up as a prefix among the 5-grams. If a match is found, return the most frequent next word.
2. If no match is found, back off: look up the last 3 words among the 4-grams, then the last 2 words among the trigrams, then the last word among the bigrams.
3. If no ngram matches at all, fall back to the most frequent single word observed in the corpus.
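A minimal sketch of that back-off lookup against the SQLite store described above; the function name, fallback word, and schema are illustrative, so this shows the shape of the algorithm rather than the app's exact code.

```r
library(DBI)
library(RSQLite)

db_con <- dbConnect(RSQLite::SQLite(), "ngrams.sqlite")

predict_next_word <- function(phrase, con = db_con) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  if (length(words) == 0) return("the")
  # Try the longest available context first (up to 4 words), then
  # back off one word at a time.
  for (n in seq(min(4, length(words)), 1)) {
    prefix <- paste(tail(words, n), collapse = " ")
    hit <- dbGetQuery(
      con,
      "SELECT word FROM ngrams WHERE prefix = ? ORDER BY count DESC LIMIT 1",
      params = list(prefix)
    )
    if (nrow(hit) > 0) return(hit$word[1])
  }
  "the"  # last-resort fallback: a very frequent English word
}

predict_next_word("one of the")
```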
Using this model, I obtained a prediction accuracy of 13.9% on the course benchmark.