Rudy Martin @realrudymartin
12/28/2016
The purpose of this presentation is to:
The goal of language modeling is to compute the probability of a sentence or sequence of words: p(W)=p(w1,w2,…,wn)
A related task, word prediction, involves determining the probability of an upcoming word. E.g., given a trigram (a sequence of 3 words), predict the 4th word: p(w4|w1,w2,w3)
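The two tasks are connected by the chain rule of probability, which writes the probability of a whole sentence as a product of next-word probabilities. This is a standard identity, shown here for reference using the same symbols as above:

```latex
p(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_1, \dots, w_{i-1})
```

An n-gram model approximates each factor by conditioning only on the last few words, e.g. the preceding three words in the 4-gram case.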
In this example we use a back-off model to illustrate prediction.
The model data is taken from the HC Corpora, which consists of blogs, news and Twitter items. This set initially contained over 100 million English-language words, of which a random 10% sample was used to train the model.
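A minimal sketch of the sampling step, written in Python for illustration rather than taken from the app's R code. It assumes sampling is done per line of text (the slide only says 10% was randomly sampled); the function name `sample_lines` and the seed value are hypothetical.

```python
import random

def sample_lines(lines, fraction=0.10, seed=2016):
    """Keep each line independently with probability `fraction`.

    Assumption: sampling is per line; the source does not say which
    unit (line, document, word) was actually sampled.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Fixing the seed makes the training sample repeatable, which helps when comparing models built from the same corpus.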
A corpus was created from these words after removing non-ASCII characters, numbers and extra whitespace, and converting text to lower case. Pre-processing also included substituting punctuation with
From this, 1- to 4-gram term-document matrices were created for summing counts and other statistics. The matrices were filtered to cover 99% of the vocabulary and included only words and phrases that existed in lower order ngram histories, reinforcing the value of a word-specific approach.
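The cleaning and counting steps above can be sketched roughly as follows. This is an illustrative Python version, not the app's R code: the names `clean` and `ngram_counts` are hypothetical, punctuation is simply replaced by a space as a stand-in (the slide does not say what was substituted), and the 99% vocabulary filter is not shown.

```python
import re
from collections import Counter

def clean(text):
    """Lower-case, drop non-ASCII characters and digits, collapse whitespace."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[0-9]+", " ", text.lower())
    # Assumption: punctuation (except apostrophes) becomes a space here;
    # the source does not state what the app substitutes for it.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(docs, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for doc in docs:
        tokens = clean(doc).split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

For example, `ngram_counts(["the cat sat", "the cat ran"])[2]` counts the bigram `("the", "cat")` twice. Keeping one counter per order makes it easy to check that every higher-order ngram has its shorter history present in the lower-order table.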
The model backs off to smaller histories when larger histories are not available, and orders results based on the maximum likelihood estimate of candidate ngrams.
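A minimal sketch of this back-off idea, again in Python and under stated assumptions: it is a plain back-off with no discounting or smoothing, it uses whitespace tokenization, and the names `build_model` and `predict` are hypothetical rather than the app's own functions.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each history tuple (0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for h in range(max_n):
                if i - h < 0:
                    break
                model[tuple(tokens[i - h:i])][word] += 1
    return model

def predict(model, phrase, k=3, max_n=4):
    """Try the longest history first, then back off to shorter ones."""
    tokens = phrase.lower().split()
    for h in range(min(max_n - 1, len(tokens)), -1, -1):
        history = tuple(tokens[len(tokens) - h:])
        followers = model.get(history)
        if followers:
            total = sum(followers.values())
            # Maximum likelihood estimate: count(history + word) / count(history + any word)
            return [(w, c / total) for w, c in followers.most_common(k)]
    return []
```

With a model built from "i want to eat", "i want to sleep" and "i want to eat", `predict(model, "i want to")` ranks "eat" first with an estimate of 2/3, while `predict(model, "they need to")` finds no match for the two longer histories and backs off to the single-word history "to".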
In our application, we created an index dataset which focused on the probability of a specific word following a preceding phrase relative to all other ngrams the word can occur with.
Given the limitations of Shiny, the initial data used for the model is a very fast-loading set. After the initial load, users are encouraged to explore with another, larger dataset. This swap feature can be extended to include other sources while using the same model creation code engine.
This application utilizes a predictive text model based on word frequency and context that reduces the number of required keystrokes for next word entry.
The app developed is available at:
Enter text in the box below the 'Type the text here' label. Candidate next words appear below the text box.
Users are encouraged to select a larger phrase set for better results.
Source code for ui.R, server.R and other files is available on GitHub:
For additional questions or comments contact me at: