Capstone Project: Next word predictor.

rmclaren
12/14/2014

Description

The goal of this project is to predict the next word in a phrase in a manner simular to how the SwiftKey smartphone keyboard does.

In order to accomplish this we where given a data set containing text files in several different languages. Since Im most familier with the english language I elected to focus exclusively on that set of data. The following table is a breakdown of the english data files.

File Name Size (MB) No. Lines
en_US.blogs.txt 210.2 899288
en_US.news.txt 205.8 1010242
en_US.twitter.txt 167.1 2360148

Model Creation

The model in my prediction algorithm consisted of 3 hash tables containing ngrams of various lengths as the keys and what the most likely next word is represented as the value. The first table would hold ngrams with a length of 3 words as keys, the next 2, and the last 1. (for this I used the R “hash” package)

The hash tables where created by using the R tm package and RWeka (for its ngram tokenizer) to parse through the data files and return the ngrams starting at ngrams of length 4 and going down to length 2 ngrams. Each NGram was then split into a prediction gram (first n - 1 words) and the result (last word in ngram). The number of occurrences of each unique ngram was summed and ordered according to the prediction gram, and only the most common result for each prediction gram was stored in the resulting prediction hash. NGrams had to occur at least twice to be counted.

Result Prediction

Now that we have the 3 hash table's predicting the result is easy. Basically the shiny server is given the phrase the user enters on the website. It proceeds by going through its list of hash tables starting with the hash table with the greatest number of ngram words. It looks at the last n words of the phrase given (where n coresponds to the number of words in the prediction gram stored in the key of the hash table), and looks the subphrase up in the hash table. If it finds the sub-phrase it uses the corresponding value in the hash table as the answer. If not it goes to the next hash table and tries again, going through each hash table in this manner until it finds a result. If no result can be found the word “the” is returned as a guess (the most common word in the language).

Usage

Using the shiny app from the users perspective is fairly trivial. Basically the app consists of on text input box, and a text output box. To use the app the user simply types or copies and pastes their phrase of interest in the text box and a little while later a prediction will be displayed.

Hash table lookups are really fast so the user sees the result very quickly (assuming their network connection is ok).

One caviate to note is that when the user first starts using the app, the server must load these fairly large hash tables into memory.. so the app may appear unresponsive until this happens (usually only a few seconds).

Improvements that could be made

Hash tables where used because they provide very fast lookups and they are also quick when writing data into them. Unfortunatly in order to use them you must load the entire data set into memory. This limits the number of potential ngrams that can be used to do predictions. Instead it would be an improvement to use a SQLite (or other kind of) database instead.