Word RippeR

Winston A. Saunders
April 19, 2015

R version 3.2.0 (2015-04-16)

Word RippeR: Use Instructions

Toolkit:

Corpus-RippeR: creates computable corpus-based text samples.
n-gram-creatoR: builds n-gram frequency tables
n-gram-reduceR: combines & reduces multiple n-gram tables.
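
As an illustration of what an n-gram frequency table contains, here is a minimal base-R sketch. It is not the n-gram-creatoR code itself: the helper name `build_ngram_table`, the tokenization rule (lower-case, apostrophes dropped, split on non-letters), and the two sample sentences are assumptions made for this example.

```r
# Hypothetical helper (not part of the toolkit): count n-grams of order n
# in a character vector of sentences.
build_ngram_table <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    # Assumed tokenization: lower-case, drop apostrophes, split on non-letters.
    words <- strsplit(gsub("'", "", tolower(s)), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  data.frame(n_gram = names(counts),
             word_count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample text, for illustration only.
sample_text <- c("I don't want to go", "I don't want it")
tri <- build_ngram_table(sample_text, 3)
print(tri)
```

The `word_count` column is the raw count that would then be converted to the integer log scale described in the n-gram tables section.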

Web Interface:

Word-RippeR: Web-based interface providing context-based or nearest-word-based predictions.

Agile, Flexible, and Compact Natural Language Prediction

Word RippeR: Algorithm

The Word-Match Algorithm has the following steps:

1. Extract the last three words from the text string.
2. Count matches to four-gram “stems”.
3. Repeat for three- and two-grams.
4. Calculate conditional probabilities and sort results.
5. Select the highest-probability result from the highest-order matched n-gram as the best match.
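
The steps above can be sketched in base R as a simple back-off lookup. This is an assumption-laden illustration, not the Word RippeR source: the function names `make_table` and `predict_next` are hypothetical, the toy table values are invented, and the score uses the `frequency - root_freq` definition given in the n-gram tables section.

```r
# Hypothetical toy tables with invented values; column names follow the
# n-gram table layout shown in this deck.
make_table <- function(stem, root, frequency, root_freq) {
  data.frame(stem = stem, root = root, frequency = frequency,
             root_freq = root_freq, stringsAsFactors = FALSE)
}
tables <- list(
  "4" = make_table("front of the", "house", 2000L, 4000L),
  "3" = make_table(c("gold and", "gold and"), c("silver", "the"),
                   c(2400L, 2600L), c(3000L, 5855L)),
  "2" = make_table("and", "the", 5000L, 5855L)
)

predict_next <- function(text, tables) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in c(4, 3, 2)) {               # try the highest-order n-gram first
    k <- n - 1                          # a stem has n - 1 words
    if (length(words) < k) next
    stem <- paste(tail(words, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[tab$stem == stem, ]
    if (nrow(hits) > 0) {
      # Assumed score: integer log conditional probability as defined in the deck.
      score <- hits$frequency - hits$root_freq
      return(hits$root[which.max(score)])
    }
  }
  NA_character_                         # no stem matched at any order
}

print(predict_next("all that glitters is gold and", tables))
```

With these toy tables the four-gram stem "is gold and" has no match, so the lookup backs off to the three-gram stem "gold and" and picks the root with the highest score.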

The Context-Match Algorithm works similarly, except that stop words are removed from the n-grams.
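
A minimal sketch of the stop-word removal step, assuming a deliberately tiny stop-word list (the deck does not say which list was used) and a hypothetical helper name `remove_stop_words`:

```r
# Hypothetical stop-word list, for illustration only.
stop_words <- c("a", "all", "and", "is", "of", "that", "the", "who")

remove_stop_words <- function(text, stop_words) {
  # Assumed tokenization: lower-case, split on anything that is not a letter
  # or an apostrophe.
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  paste(words[!words %in% stop_words], collapse = " ")
}

reduced <- remove_stop_words("all that glitters is gold and", stop_words)
print(reduced)
```

The reduced text would then be matched against n-gram tables built from text cleaned the same way.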

Algorithm Adaptability for Different Use-Cases

Word RippeR: n-gram tables

Computed n-gram data tables stored as integers

freq <- as.integer(2000*log10(word_count))

giving faster lookup-based probability analysis

n_gram	frequency	stem	root	root_freq
i dont want	2868	i dont	want	4201
front of the	2647	front of	the	5855
but i think	2711	but i	think	4273
the middle of	2850	the middle	of	5479

and higher algorithm performance.

log_cond_prob <- frequency - root_freq
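
To see why integer subtraction can stand in for a probability ratio, the sketch below encodes two counts on the `2000*log10` scale and converts the difference back. The counts 27 and 126 are made up for illustration. The last lines apply the same conversion to the first row of the table above.

```r
# Hypothetical raw counts, for illustration only.
n_gram_count <- 27
root_count   <- 126

# Integer log-scale encoding, as in the deck.
frequency <- as.integer(2000 * log10(n_gram_count))
root_freq <- as.integer(2000 * log10(root_count))

# Subtracting the integer logs approximates the log of the count ratio.
log_cond_prob <- frequency - root_freq

# Converting back shows how little is lost by truncating to integers.
recovered <- 10^(log_cond_prob / 2000)
exact     <- n_gram_count / root_count
print(c(recovered = recovered, exact = exact))

# Same conversion for the table row "i dont want" (2868 and 4201).
p_row <- 10^((2868 - 4201) / 2000)
print(p_row)   # roughly 0.22
```

Because each value is truncated by less than one unit on a scale of 2000 units per decade, the recovered ratio stays within a fraction of a percent of the exact one.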

Word RippeR: Example Results

Phrase: There's a lady who's sure all that glitters is gold and …
…silver (word based prediction)
…medals (context based prediction)

In the root-stem graph below, a lower log value corresponds to a higher probability.

[Plot: root-stem graph]