Word RippeR



Winston A. Saunders
April 19, 2015


R version 3.2.0 (2015-04-16)




Word RippeR: Use Instructions

Toolkit:

  • Corpus-RippeR: creates computable corpus-based text samples.
  • n-gram-creatoR: builds n-gram frequency tables.
  • n-gram-reduceR: combines and reduces multiple n-gram tables.
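
To make "n-gram frequency table" concrete, here is a minimal base-R sketch of how such a table could be built. The function name count_ngrams, the column name word_count, and the simple tokenizer are assumptions for illustration; they are not taken from the toolkit itself.

```r
# Hypothetical helper (not the toolkit's code): count n-grams in a
# character vector of sentences.
count_ngrams <- function(sentences, n = 3) {
  grams <- unlist(lapply(sentences, function(s) {
    # Lower-case, drop everything except letters and spaces, split on spaces.
    words <- strsplit(gsub("[^a-z ]", "", tolower(s)), " +")[[1]]
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # Slide a window of n words across the sentence.
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  data.frame(n_gram = names(counts),
             word_count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input, for illustration only.
tab <- count_ngrams(c("I don't want to go", "I don't want that"), n = 3)
```

In this toy input the tri-gram "i dont want" occurs in both sentences, so its word_count is 2.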

Web Interface:

Agile, Flexible, and Compact Natural Language Prediction

Word RippeR: Algorithm

The Word-Match Algorithm has the following steps:


1. Extract the last three words from the text string.
2. Count matches to four-gram “stems”.
3. Repeat for three- and two-grams.
4. Calculate conditional probabilities and sort results.
5. Select the highest-probability match from the highest-order matched n-gram as the best match.
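
The steps above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the presentation's implementation: the function predict_next, the table layout (a list of data frames keyed by n-gram order, with columns stem, next_word, and log_cond_prob), and all toy values are hypothetical. The sketch stops at the first order that has a match instead of scoring every order, which should give the same best match under these assumptions.

```r
# Hypothetical back-off lookup. `tables` is an assumed list of data frames,
# one per n-gram order, each with columns stem, next_word, log_cond_prob.
predict_next <- function(text, tables) {
  words <- strsplit(tolower(trimws(text)), " +")[[1]]
  # Try the highest order first: a 3-word stem (four-gram), then 2, then 1.
  for (k in 3:1) {
    if (length(words) < k) next
    stem <- paste(tail(words, k), collapse = " ")
    tab <- tables[[as.character(k + 1)]]
    hits <- tab[tab$stem == stem, , drop = FALSE]
    if (nrow(hits) > 0) {
      # Sort matches by conditional probability and return the best one.
      hits <- hits[order(hits$log_cond_prob, decreasing = TRUE), ]
      return(hits$next_word[1])
    }
  }
  NA_character_  # no match at any order
}

# Toy tables with made-up values, for illustration only.
tables <- list(
  "4" = data.frame(stem = c("i dont want", "i dont want"),
                   next_word = c("to", "a"),
                   log_cond_prob = c(-300L, -900L),
                   stringsAsFactors = FALSE),
  "3" = data.frame(stem = "dont want", next_word = "to",
                   log_cond_prob = -250L, stringsAsFactors = FALSE),
  "2" = data.frame(stem = c("want", "of"), next_word = c("to", "the"),
                   log_cond_prob = c(-400L, -200L),
                   stringsAsFactors = FALSE)
)

best_high_order <- predict_next("I dont want", tables)   # matched as a four-gram
best_backed_off <- predict_next("in front of", tables)   # falls back to a two-gram
```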



The Context-Match Algorithm works similarly, except that stop words are removed from the n-grams.
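
Stop-word removal can be illustrated with a short base-R sketch. The stop-word list and the helper name remove_stop_words are assumptions; the list actually used for the context tables is not given here.

```r
# Assumed, deliberately short stop-word list; a real list would be longer.
stop_words <- c("a", "an", "and", "is", "of", "that", "the", "to")

# Hypothetical helper: drop stop words from a text string.
remove_stop_words <- function(text, stops = stop_words) {
  words <- strsplit(tolower(trimws(text)), " +")[[1]]
  paste(words[!words %in% stops], collapse = " ")
}

reduced <- remove_stop_words("all that glitters is gold and")
# reduced is "all glitters gold"
```

With the stop words gone, the remaining words carry more of the topic, which is presumably why the context-based prediction can differ from the word-based one.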

Algorithm Adaptability for Different Use-Cases

Word RippeR: n-gram tables

Computed n-gram data tables are stored as integers

freq <- as.integer(2000*log10(word_count))

giving faster lookup-based probability analysis

n_gram           frequency   stem root        root_freq
i dont want      2868        i dont want      4201
front of the     2647        front of the     5855
but i think      2711        but i think      4273
the middle of    2850        the middle of    5479

and higher algorithm performance.

log_cond_prob <- frequency - root_freq
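
As a worked example of the two formulas on this slide, the sketch below uses the first table row ("i dont want"). The helper name encode_freq and the final back-conversion to a plain probability are additions for illustration; the back-conversion simply inverts the 2000*log10 scaling.

```r
# Encode raw counts as scaled integer log-frequencies (formula from the slide).
# Note that as.integer() truncates toward zero.
encode_freq <- function(word_count) as.integer(2000 * log10(word_count))

# Values taken from the first table row ("i dont want").
frequency <- 2868L
root_freq <- 4201L

# Subtracting log-frequencies corresponds to dividing the raw counts,
# so the conditional probability needs only one integer subtraction.
log_cond_prob <- frequency - root_freq      # -1333

# Undo the scaling to recover an approximate probability.
cond_prob <- 10^(log_cond_prob / 2000)      # roughly 0.22
```

Because every entry is non-positive on this scale, candidates can be ranked directly on log_cond_prob without converting back to probabilities.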

Word RippeR: Example Results

Phrase: There's a lady who's sure all that glitters is gold and …
…silver (word based prediction)
…medals (context based prediction)

In the root-stem graph below, a lower log value corresponds to a higher probability.

[Plot: root-stem graph]