Word Prediction Application

M. Dorovkov
August 23, 2015

Capstone project for JHU Data Science Specialization, Coursera.

NLP and App Development for the Web.

Introduction

The goal of the project is to develop a word prediction algorithm and to deploy it as a web App that suggests the next word as text is typed or pasted in.

The performance goals for the App:

  • High accuracy of predictions.

  • Fast response to user input.

  • A small memory footprint, so that the App can run within the limited resources of a Shiny app host or a mobile device.

The corpora used for the project consisted of Twitter, blog, and news files with a combined size of about 700 MB and roughly 100 million words. Natural Language Processing (NLP) was used for corpora processing and model building. Because of the large size of the corpora, I decided to use Python for text processing, for performance reasons.

Data Processing

In addition to standard text processing (tokenization), the following steps were taken, keeping in mind the accuracy and performance goals.

  • To achieve higher accuracy, the whole corpora were processed.

  • Start-Of-The-Sentence (SOTS) tokens were introduced to improve accuracy at the beginning of sentences.

  • Low-frequency tokens (fewer than 25 occurrences in the corpora), which are predominantly misspellings or nonsense words, were replaced with the token 'UNK' to improve prediction accuracy for words absent from the training corpora.

  • During text clean-up, contractions (such as “I'm”) and words with dashes (such as “e-mail”) were carefully preserved using regular expressions.

  • The corpora were split at punctuation into coherent, logically connected text blocks, so that n-grams do not cross sentence boundaries; this improves their predictive power. The n-grams were then pruned to achieve a smaller memory footprint without a significant reduction in model accuracy. (A sketch of these preprocessing steps follows the list.)
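
To make the preprocessing steps concrete, here is a minimal Python sketch. It illustrates the ideas above and is not the project's actual pipeline: the token spellings ("<s>", "UNK"), the regular expressions, and the function names are assumptions.

    import re
    from collections import Counter

    SOS, UNK = "<s>", "UNK"  # assumed spellings for the SOTS and 'UNK' tokens

    # Keep contractions ("i'm") and hyphenated words ("e-mail") as single tokens.
    TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

    def sentence_blocks(text):
        """Split raw text at sentence-ending punctuation into coherent blocks."""
        return [s for s in re.split(r"[.!?;:]+", text.lower()) if s.strip()]

    def tokenize(block, n=4):
        """Tokenize one block, padding it with n-1 start-of-sentence tokens."""
        return [SOS] * (n - 1) + TOKEN_RE.findall(block)

    def replace_rare(token_lists, min_count=25):
        """Replace tokens seen fewer than min_count times with UNK."""
        counts = Counter(tok for toks in token_lists for tok in toks)
        return [[tok if counts[tok] >= min_count else UNK for tok in toks]
                for toks in token_lists]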

Algorithm and Performance

The approach is based on an n-gram model. The Stupid Backoff algorithm was used due to its fast computational performance: on big corpora it achieves accuracy similar to more sophisticated but computationally intensive algorithms (Brants et al., “Large Language Models in Machine Translation”, EMNLP 2007).

The algorithm scores a candidate word by the frequency of the n-gram relative to the frequency of its preceding words. If the n-gram is not present in the model, a lower-order n-gram is considered instead, with the score multiplied by a discount coefficient lambda, which is usually 0.4 for Stupid Backoff.
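
To illustrate, here is a minimal Python sketch of Stupid Backoff scoring. The data structures are assumptions made for the sketch (counts keyed by word tuples, with the empty tuple holding the total token count), not the App's actual implementation.

    LAMBDA = 0.4  # back-off discount coefficient typical for Stupid Backoff

    def score(word, context, counts):
        """Stupid Backoff score of `word` following the word tuple `context`.

        `counts` maps n-gram tuples to corpus counts; counts[()] is assumed
        to hold the total number of tokens.
        """
        if not context:
            # unigram base case: relative frequency in the whole corpus
            return counts.get((word,), 0) / counts[()]
        ngram = context + (word,)
        if counts.get(ngram, 0) > 0:
            # relative frequency of the n-gram vs. its (n-1)-word context
            return counts[ngram] / counts[context]
        # back off to a shorter context, discounting the score by lambda
        return LAMBDA * score(word, context[1:], counts)

    def predict(context, vocab, counts, k=3):
        """Return the k highest-scoring candidates for the next word."""
        return sorted(vocab, key=lambda w: score(w, context, counts), reverse=True)[:k]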

  • 1-, 2-, 3-, and 4-grams were used; about 1,000,000 n-grams went into building the language model. The final vocabulary in the App consists of 30,000 words, and the total memory footprint is only 10 MB!

  • To achieve speedy performance, I decided to use look-up tables. The algorithm was run over the n-grams, and look-up tables were created for every one-, two-, and three-word input. Now, instead of running the algorithm on every user input, the predicted words can be retrieved very quickly from the look-up tables (see the sketch after this list).
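
As an illustration of this precomputation step, the sketch below builds a table of top predictions for every observed context, reusing predict (and score) from the previous sketch; the table layout and the fallback word are assumptions.

    def build_lookup(counts, vocab, k=3):
        """Precompute top-k predictions for every observed 1- to 3-word context."""
        contexts = {g[:-1] for g in counts if 2 <= len(g) <= 4}
        return {ctx: predict(ctx, vocab, counts, k) for ctx in contexts}

    def lookup(table, words, fallback=("the",)):
        """Serve a prediction with a single dictionary lookup instead of rescoring."""
        for n in (3, 2, 1):  # try the longest available context first
            ctx = tuple(words[-n:])
            if ctx in table:
                return table[ctx]
        return list(fallback)  # assumed default for completely unseen input

Precomputing in this way trades build time and storage for constant-time response, which matches the App's speed goal.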

The App

The App was developed in R using the “shiny” package, deployed at www.shinyapps.io, and is available on the web for everyone free of charge. The App allows the user to type or paste text into the “Enter text” field; the predicted word is displayed on the right after clicking the “submit” button. To try the App out, visit this link.