Word Prediction App

Vadim Bondarenko
July 25, 2016

Introduction

Data scientists are often required to handle large amounts of messy, unstructured data. One example of such data is a large collection of natural human language stored in various text files. This area of data science is commonly called Natural Language Processing (NLP).

For this project I had the opportunity to:

  1. Analyze a large corpus of unstructured text files
  2. Research various NLP models and software tools
  3. Clean and transform the text into a model-ready form
  4. Implement several word prediction language models and evaluate their accuracy
  5. Balance between model accuracy and limited computing resources
  6. Develop an interactive web-based app

The Data

Data Source: I used text files from three different sources:

  1. News articles
  2. Blog posts
  3. Twitter

Data Preprocessing: I took the following steps to clean the text files (a code sketch follows the list):

  • Take random samples from each of the three text sources
  • Remove non-English characters
  • Split paragraphs into sentences
  • Convert all to lower case
  • Remove punctuation and extra white space
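
In R (the language behind the Shiny app linked at the end), these cleaning steps might look roughly like the sketch below. This is a minimal illustration in base R only; the function name and the sampling fraction are my assumptions, not the exact code used.

  # Minimal sketch of the cleaning pipeline, assuming `lines` is a
  # character vector with one paragraph per element
  clean_text <- function(lines, sample_frac = 0.1) {
    set.seed(123)
    # take a random sample of the raw lines
    lines <- sample(lines, size = floor(length(lines) * sample_frac))
    # drop non-English (non-ASCII) characters
    lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
    # split paragraphs into sentences on terminal punctuation
    sentences <- unlist(strsplit(lines, "[.!?]+\\s*"))
    # convert everything to lower case
    sentences <- tolower(sentences)
    # remove punctuation and collapse extra white space
    sentences <- gsub("[[:punct:]]+", "", sentences)
    sentences <- trimws(gsub("\\s+", " ", sentences))
    sentences[nzchar(sentences)]
  }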

N-Grams

Tokenization: I split the text into N-grams: contiguous sequences of N words observed in the training data. The number of unique tokens in my training set was:

  • 74K uni-grams (e.g. the common word “the”)
  • 635K bi-grams (e.g. “of the”)
  • 1.2M tri-grams (e.g. “one of the”)
  • 1.3M quad-grams (e.g. “the end of the”)
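
Tokenization itself can be sketched in a few lines of base R. The snippet below assumes `sentences` is the cleaned character vector from the preprocessing step; the function name is illustrative.

  # Build N-grams from cleaned sentences (sketch)
  make_ngrams <- function(sentences, n) {
    unlist(lapply(strsplit(sentences, " "), function(words) {
      if (length(words) < n) return(character(0))
      # slide a window of n words across the sentence
      sapply(seq_len(length(words) - n + 1), function(i) {
        paste(words[i:(i + n - 1)], collapse = " ")
      })
    }))
  }
  bigrams <- make_ngrams(sentences, 2)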

Relative Frequencies: Once the corpus is tokenized into N-grams, it is straightforward to count how often each one occurs in the training set and treat those relative frequencies as estimates of the probability of each N-gram in natural language.
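
For example, using the bi-grams from the sketch above, the counts and relative frequencies reduce to a couple of lines in base R (object names are mine):

  # Count bi-gram occurrences and normalize to relative frequencies
  bigram_freq <- sort(table(bigrams), decreasing = TRUE)
  bigram_prob <- bigram_freq / sum(bigram_freq)
  head(bigram_prob)  # the most common bi-grams, e.g. “of the”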

Prediction Algorithm

The algorithm predicts the most likely word to follow the previous 1, 2, or 3 words provided as input. Given the last N words, the model finds the most frequent (N+1)-gram that begins with those N words and returns its final word:

  • “Obama” -> “Obama administration” (bi-gram)
  • “hall of” -> “hall of fame” (tri-gram)
  • “much ado about” -> “much ado about nothing” (quad-gram)

Katz's Backoff Model: Often a prediction is needed for a combination of words that was not observed in the training corpus. For those cases, I implemented a form of Katz's Backoff algorithm. When a given N-gram is not found, the model “backs off” to the next level of (N-1)-grams (a code sketch follows the examples below).

  • If no quad-gram like “keep calm and …” exists, back off to tri-grams “calm and …”
  • Return the most frequent tri-gram if one exists, else back off to bi-grams “and …”
  • Return the most frequent bi-gram if one exists, else fall back to the most frequent uni-gram
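
A minimal sketch of this backoff lookup is below. It assumes the N-gram counts have been reshaped into a list of data frames, one per N, each with columns prefix (the first N-1 words), last_word, and count; those names and that structure are my assumptions for illustration.

  # Back off from quad-grams down to the most frequent uni-gram
  predict_word <- function(input_words, ngram_tables, max_n = 4) {
    n <- min(length(input_words), max_n - 1)
    while (n >= 1) {
      prefix <- paste(tail(input_words, n), collapse = " ")
      tbl <- ngram_tables[[n + 1]]
      hits <- tbl[tbl$prefix == prefix, ]
      if (nrow(hits) > 0) {
        # return the last word of the most frequent matching (n+1)-gram
        return(hits$last_word[which.max(hits$count)])
      }
      n <- n - 1  # N-gram not found: back off one level
    }
    # nothing matched at any level: most frequent uni-gram overall
    ngram_tables[[1]]$last_word[which.max(ngram_tables[[1]]$count)]
  }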

The App

Finally, after testing prediction accuracy, I turned the model into a basic interactive application and deployed it so that anyone can access it over the internet. My main requirements for the app were:

  • Given the user's text input, return the next most likely word, using the back-off N-gram algorithm described above
  • Lightweight implementation
  • Adequate accuracy
  • Run on mobile devices
  • Near-real-time predictions from user input
  • Handle words not found in the training set
  • Intuitive user interface (UI)
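
A stripped-down Shiny skeleton for such an app might look like the sketch below, assuming the predict_word function and ngram_tables object from the previous sketch are already loaded; the real app adds the UI polish described above.

  library(shiny)

  ui <- fluidPage(
    textInput("phrase", "Type a phrase:"),
    textOutput("prediction")
  )

  server <- function(input, output) {
    output$prediction <- renderText({
      # tokenize the user input the same way as the training text
      words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
      words <- words[nzchar(words)]
      if (length(words) == 0) return("")
      predict_word(words, ngram_tables)
    })
  }

  shinyApp(ui, server)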

As mentioned before, I had to balance higher prediction accuracy (a larger training sample) against limited computing resources (a smaller sample).

The App can be accessed at https://vadimus202.shinyapps.io/word_predict/. Please enjoy!