Word Prediction App

Vadim Bondarenko
July 25, 2016

Introduction

Data scientists are often required to handle large amounts of messy, unstructured data. One example of such data is a large collection of natural human language stored in various text files. This area of data science is commonly called Natural Language Processing (NLP).

For this project I had the opportunity to:

  1. Analyze a large corpus of unstructured text files
  2. Research various NLP models and software tools
  3. Clean and transform the text into a model-ready form
  4. Implement several word prediction language models and evaluate their accuracy
  5. Balance between model accuracy and limited computing resources
  6. Develop an interactive web-based app

The Data

Data Source: I used text files from three different sources:

  1. News articles
  2. Blog posts
  3. Twitter

Data Preprocessing: I took the following steps to clean the text files (a code sketch follows the list):

  • Take random samples from each of the three text sources
  • Remove non-English characters
  • Split paragraphs into sentences
  • Convert all to lower case
  • Remove punctuation and extra white space
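
In R (the language behind the Shiny app linked at the end), these cleaning steps might look roughly like the sketch below. This is a minimal illustration in base R only; the function name and the sampling fraction are my assumptions, not the exact code used.

  # Minimal sketch of the cleaning pipeline, assuming `lines` is a
  # character vector with one paragraph per element
  clean_text <- function(lines, sample_frac = 0.1) {
    set.seed(123)
    # take a random sample of the raw lines
    lines <- sample(lines, size = floor(length(lines) * sample_frac))
    # drop non-English (non-ASCII) characters
    lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
    # split paragraphs into sentences on terminal punctuation
    sentences <- unlist(strsplit(lines, "[.!?]+\\s*"))
    # convert everything to lower case
    sentences <- tolower(sentences)
    # remove punctuation and collapse extra white space
    sentences <- gsub("[[:punct:]]+", "", sentences)
    sentences <- trimws(gsub("\\s+", " ", sentences))
    sentences[nzchar(sentences)]
  }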

N-Grams

Tokenization: I split the text into N-grams: contiguous sequences of N words observed in the training data. The number of unique tokens in my training set was:

  • 74K uni-grams (e.g. the common word “the”)
  • 635K bi-grams (e.g. “of the”)
  • 1.2M tri-grams (e.g. “one of the”)
  • 1.3M quad-grams (e.g. “the end of the”)
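
Tokenization itself can be sketched in a few lines of base R. The snippet below assumes `sentences` is the cleaned character vector from the preprocessing step; the function name is illustrative.

  # Build N-grams from cleaned sentences (sketch)
  make_ngrams <- function(sentences, n) {
    unlist(lapply(strsplit(sentences, " "), function(words) {
      if (length(words) < n) return(character(0))
      # slide a window of n words across the sentence
      sapply(seq_len(length(words) - n + 1), function(i) {
        paste(words[i:(i + n - 1)], collapse = " ")
      })
    }))
  }
  bigrams <- make_ngrams(sentences, 2)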

Relative Frequencies: Once the corpus is tokenized into N-grams, it is straightforward to count how often each one occurs in the training set and treat those relative frequencies as estimates of the probability of each N-gram in natural language.
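
For example, using the bi-grams from the sketch above, the counts and relative frequencies reduce to a couple of lines in base R (object names are mine):

  # Count bi-gram occurrences and normalize to relative frequencies
  bigram_freq <- sort(table(bigrams), decreasing = TRUE)
  bigram_prob <- bigram_freq / sum(bigram_freq)
  head(bigram_prob)  # the most common bi-grams, e.g. “of the”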

Prediction Algorithm

The algorithm predicts the most likely word to follow the previous 1, 2, or 3 words provided as input. Given the last N words, the model finds the most frequent (N+1)-gram that begins with those N words and returns its final word:

  • “Obama” -> “Obama administration” (bi-gram)
  • “hall of” -> “hall of fame” (tri-gram)
  • “much ado about” -> “much ado about nothing” (quad-gram)

Katz's Backoff Model: Often a prediction is needed for a combination of words that was not observed in the training corpus. For those cases, I implemented a form of Katz's Backoff algorithm. When a given N-gram is not found, the model “backs off” to the next level of (N-1)-grams (a code sketch follows the examples below).

  • If no quad-gram like “keep calm and …” exists, back off to tri-grams “calm and …”
  • Return the most frequent tri-gram if one exists, else back off to bi-grams “and …”
  • Return the most frequent bi-gram if one exists, else fall back to the most frequent uni-gram
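
A minimal sketch of this backoff lookup is below. It assumes the N-gram counts have been reshaped into a list of data frames, one per N, each with columns prefix (the first N-1 words), last_word, and count; those names and that structure are my assumptions for illustration.

  # Back off from quad-grams down to the most frequent uni-gram
  predict_word <- function(input_words, ngram_tables, max_n = 4) {
    n <- min(length(input_words), max_n - 1)
    while (n >= 1) {
      prefix <- paste(tail(input_words, n), collapse = " ")
      tbl <- ngram_tables[[n + 1]]
      hits <- tbl[tbl$prefix == prefix, ]
      if (nrow(hits) > 0) {
        # return the last word of the most frequent matching (n+1)-gram
        return(hits$last_word[which.max(hits$count)])
      }
      n <- n - 1  # N-gram not found: back off one level
    }
    # nothing matched at any level: most frequent uni-gram overall
    ngram_tables[[1]]$last_word[which.max(ngram_tables[[1]]$count)]
  }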

The App

Finally, after testing prediction accuracy, I turned the model into a basic interactive application and deployed it so that anyone can access it over the internet. My main requirements for the app were:

  • Given the user's text input, return the next most likely word, using the back-off N-gram algorithm described above
  • Lightweight implementation
  • Adequate accuracy
  • Run on mobile devices
  • Near-real-time predictions from user input
  • Handle words not found in the training set
  • Intuitive user interface (UI)
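
A stripped-down Shiny skeleton for such an app might look like the sketch below, assuming the predict_word function and ngram_tables object from the previous sketch are already loaded; the real app adds the UI polish described above.

  library(shiny)

  ui <- fluidPage(
    textInput("phrase", "Type a phrase:"),
    textOutput("prediction")
  )

  server <- function(input, output) {
    output$prediction <- renderText({
      # tokenize the user input the same way as the training text
      words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
      words <- words[nzchar(words)]
      if (length(words) == 0) return("")
      predict_word(words, ngram_tables)
    })
  }

  shinyApp(ui, server)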

As mentioned before, I had to balance higher prediction accuracy (a larger training sample) against limited computing resources (a smaller sample).

The App can be accessed at https://vadimus202.shinyapps.io/word_predict/. Please enjoy!