Capstone Project

Ljiljana
Apr 23, 2015

In this Capstone project we worked on understanding and building predictive text models like those used by SwiftKey.
For example, when a user types:
I went to the
the application presents three options for what the next word might be.

Project Description

The text data comes from HC Corpora. The training data containing US blogs, news and tweets can be downloaded here.

The project can be roughly divided into the following steps:

Cleaning and preparing the data
Getting the uni/bi/tri-gram counts
Creating the bi/tri-gram model
Sampling from calculated n-gram models and presenting the suggestions to the user

Cleaning and preparing the data

To save memory and speed up the computation, we subsampled 50,000 lines from the full data set. We then performed various transformations on the raw text, including:

transforming all letters to lower case,
removing profanity,
tagging the numbers, URLs and end-of-sentence, and
removing punctuation.
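
The subsampling and the transformations listed above can be sketched roughly as follows. This is a minimal illustration in base R, not the code used in the app (which relied on stringi): the function names `subsample_lines` and `clean_text`, the tag tokens `URL`, `NUM` and `EOS`, and the regular expressions are all assumptions made for this example.

```r
# Hypothetical helper: draw at most n lines at random, without replacement.
subsample_lines <- function(lines, n = 50000) {
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Hypothetical helper: apply the cleaning steps to a character vector.
# Tags are upper case so they cannot collide with the lower-cased text.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)
  # Tag URLs first, before their dots and slashes are treated as punctuation
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " URL ", x)
  # Tag numbers, allowing '.' or ',' between digit groups
  x <- gsub("[0-9]+([.,][0-9]+)*", " NUM ", x)
  # Tag end-of-sentence markers
  x <- gsub("[.!?]+", " EOS ", x)
  # Remove whole-word matches from the (assumed) profanity list
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Remove the remaining punctuation and normalise whitespace
  x <- gsub("[[:punct:]]", " ", x)
  trimws(gsub("[[:space:]]+", " ", x))
}

clean_text("I paid 3.50 at http://example.com today! Darn it.",
           profanity = "darn")
# "i paid NUM at URL today EOS it EOS"
```

The order of the steps matters in this sketch: URLs and numbers are tagged before sentence boundaries, otherwise the dot in `3.50` or `example.com` would be read as an end of sentence.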

For this prototype application we decided to use only bi-gram and tri-gram models when making suggestions to the user.

Application details

Uni-, bi- and tri-grams were computed manually (without the use of text mining packages) and stored in a hash data structure.
Bi- and tri-gram probabilities/models were computed based on these uni-, bi- and tri-gram counts.
The entire application was written in R and is deployed on Shiny; it can be found here: text prediction app
As the user types a word or a phrase, the app looks up the last two words and checks whether they exist in our tri-gram model. If they do, we extract the 3 predictions with the highest tri-gram probability. If they do not, we back off to the bi-gram model, using only the last word. If the bi-gram model does not contain the input term either, we suggest one of the top 10 unigrams.
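
The counting, model building and back-off lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the app's actual code: it uses base R environments as the hash structure instead of the hash package, it estimates probabilities as simple count ratios (count of the n-gram divided by count of its context), and the function names `count_ngrams`, `build_model` and `predict_next` are made up for this example. For the final fallback it simply returns the first few of the top unigrams.

```r
# Count n-grams of order n in a token vector; an environment acts as the hash.
count_ngrams <- function(tokens, n) {
  counts <- new.env(hash = TRUE)
  if (length(tokens) < n) return(counts)
  for (i in seq_len(length(tokens) - n + 1)) {
    key <- paste(tokens[i:(i + n - 1)], collapse = " ")
    prev <- counts[[key]]                     # NULL if the key is new
    counts[[key]] <- if (is.null(prev)) 1 else prev + 1
  }
  counts
}

# Map each context (first n-1 words) to its next words, sorted by
# P(next | context) = count(context next) / count(context).
build_model <- function(ngram_counts, context_counts) {
  model <- new.env(hash = TRUE)
  for (key in ls(ngram_counts, all.names = TRUE)) {
    words <- strsplit(key, " ", fixed = TRUE)[[1]]
    context <- paste(head(words, -1), collapse = " ")
    p <- ngram_counts[[key]] / context_counts[[context]]
    model[[context]] <- c(model[[context]], setNames(p, tail(words, 1)))
  }
  for (context in ls(model, all.names = TRUE)) {
    model[[context]] <- sort(model[[context]], decreasing = TRUE)
  }
  model
}

# Tri-gram lookup, then bi-gram back-off, then top unigrams.
predict_next <- function(input, tri, bi, top_unigrams, k = 3) {
  words <- strsplit(tolower(trimws(input)), "[[:space:]]+")[[1]]
  n <- length(words)
  if (n >= 2) {
    hit <- tri[[paste(words[(n - 1):n], collapse = " ")]]
    if (!is.null(hit)) return(head(names(hit), k))
  }
  if (n >= 1) {
    hit <- bi[[words[n]]]
    if (!is.null(hit)) return(head(names(hit), k))
  }
  head(top_unigrams, k)
}

# Toy corpus, already cleaned and tokenised
tokens <- strsplit(
  "i went to the store and i went to the park and i went to the store",
  " ")[[1]]
uni <- count_ngrams(tokens, 1)
bi_counts <- count_ngrams(tokens, 2)
tri_counts <- count_ngrams(tokens, 3)
bi <- build_model(bi_counts, uni)
tri <- build_model(tri_counts, bi_counts)
uni_vec <- unlist(mget(ls(uni), envir = uni))
top_unigrams <- head(names(sort(uni_vec, decreasing = TRUE)), 10)

predict_next("I went to the", tri, bi, top_unigrams)  # "store" "park"
```

In the toy corpus the context "to the" is followed by "store" twice and "park" once, so "store" is ranked first. An input such as "over the" misses the tri-gram model and is answered from the bi-gram entry for "the".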

Concluding remarks

various R text mining packages proved to be too slow for our application,
the only packages used here were stringi and hash,
n-grams were created from scratch,
due to Shiny memory limitations and the time it took to process the entire corpus, we had to subsample the data,
the overall app performance is acceptable,
this was a fun project!