WordPredictR: Data Science Capstone Project

Rudy Martin @realrudymartin
12/28/2016

WordPredictR: Natural Language Processing Application

The purpose of this presentation is to:

Review key NLP concepts used in this project
Illustrate a live model using the Shiny application server
Publish slides online via the RPubs service
Complete Data Science Capstone Course Requirements

Language Modeling

The goal of language modeling is to compute the probability of a sentence or sequence of words: p(W)=p(w1,w2,…,wn)

A related task, word prediction, involves determining the probability of an upcoming word. For example, given a trigram (a sequence of three words), predict the fourth word: p(w4|w1,w2,w3)
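
The two tasks are linked by the chain rule of probability. The block below is a sketch of that link, assuming histories are truncated to the three preceding words (as in a 4-gram model) and writing C(·) for a corpus count; the C notation is introduced here for illustration and does not appear in the original slides.

```latex
\[
p(w_1, w_2, \ldots, w_n)
  = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})
  \approx \prod_{i=1}^{n} p(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
\]
\[
p(w_4 \mid w_1, w_2, w_3) \approx \frac{C(w_1 w_2 w_3 w_4)}{C(w_1 w_2 w_3)}
\]
```

The second line is the maximum likelihood estimate: the share of occurrences of the three-word history that were followed by w4.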

In this example we use a back-off model to illustrate word prediction.

Model Development: Data Preparation

The model data is taken from the HC Corpora, which consists of blog, news, and Twitter items. The full set contains over 100 million English-language words; a random 10% sample was used to train the model.

A corpus was created from these words after removing non-ASCII characters, numbers, and extra whitespace, and converting the text to lower case. Pre-processing also included replacing punctuation with indicator tokens to preserve sentence structure.
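
The cleaning steps above might look roughly like the following. This is a minimal Python sketch for illustration only, not the project's R code; the function name `clean_text` and the `<eos>` marker are hypothetical choices, since the slides do not say which indicator tokens were used.

```python
import re

# Hypothetical sentence-boundary indicator; the original project's token is not documented.
EOS = " <eos> "

def clean_text(text: str) -> str:
    """Normalize raw text along the lines described in the slide."""
    # Drop any character that cannot be represented in ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.lower()
    # Remove digits and all punctuation except apostrophes and sentence enders.
    text = re.sub(r"[^a-z'.!?\s]+", " ", text)
    # Replace sentence-ending punctuation with the indicator token.
    text = re.sub(r"[.!?]+", EOS, text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Keeping an explicit boundary token lets later n-gram counting distinguish words that end one sentence from words that begin the next.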

From this, 1- to 4-gram term-document matrices were created for summing counts and other statistics. The matrices were filtered to cover 99% of the vocabulary and included only words and phrases that existed in lower-order n-gram histories, reinforcing the value of a word-specific approach.
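
As a rough illustration of the counting and filtering steps, the Python sketch below tallies 1- to 4-grams and selects a high-coverage vocabulary. It makes two assumptions that the slides do not confirm: plain counters stand in for term-document matrices, and "cover 99%" is read as the most frequent words that together account for 99% of all tokens. The function names are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def coverage_vocab(unigrams, coverage=0.99):
    """Most frequent words whose combined count reaches `coverage` of all tokens."""
    total = sum(unigrams.values())
    kept, running = set(), 0
    for (word,), count in unigrams.most_common():
        if running >= coverage * total:
            break
        kept.add(word)
        running += count
    return kept
```

Higher-order n-grams can then be restricted to those whose words all fall inside the kept vocabulary.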

Model Development: Language Model Creation

The model backs off to smaller histories when larger histories are not available, and orders results based on the maximum likelihood estimate of candidate ngrams.
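
The back-off idea can be sketched in a few lines of Python. This is a simplified illustration, not the application's R implementation: it uses plain back-off with no discounting or back-off weights, and the names `build_model` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """Map each history (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *history, word = tokens[i:i + n]
            model[tuple(history)][word] += 1
    return model

def predict(model, phrase, max_n=4, top_k=3):
    """Return up to top_k (word, MLE probability) pairs for the longest known history."""
    words = phrase.lower().split()
    # Start with the longest usable history and back off one word at a time.
    for h in range(min(max_n - 1, len(words)), -1, -1):
        history = tuple(words[len(words) - h:])
        followers = model.get(history)
        if followers:
            total = sum(followers.values())
            # Maximum likelihood estimate: count(history + word) / count(history).
            return [(w, c / total) for w, c in followers.most_common(top_k)]
    return []
```

For example, after `model = build_model("the cat sat on the mat the cat ran".split())`, the call `predict(model, "dog on the")` finds no match for the three-word history, backs off to "on the", and returns "mat" as the only candidate.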

In our application, we created an index dataset that focuses on the probability of a specific word following a preceding phrase, relative to all other n-grams in which the word can occur.

Given the limitations of Shiny, the initial data used for the model is a small, very fast-loading set. After the initial load, users are encouraged to explore with another, larger dataset. This swap feature can be extended to include other sources while using the same model creation code.

My Figure

WordPredictR: a Shiny Application

This application uses a predictive text model based on word frequency and context to reduce the number of keystrokes needed to enter the next word.

The app is available at:

https://rudymartin.shinyapps.io/wordPredictR
Enter text in the box below the 'Type the text here' label. Predicted words appear below the text box.
Users are encouraged to select a larger phrase set for better results.

Additional Information

Source code for ui.R, server.R, and other files is available on GitHub:

https://github.com/RudyMartin/DataScience
For additional questions or comments, contact me at:

realrudymartin@gmail.com
https://linkedin.com/in/rudymartin