Winston A. Saunders
April 19, 2015
R version 3.2.0 (2015-04-16)
Toolkit:
Web Interface:
The Word-Match Algorithm has the following steps:
1. Extract the last three words from the text string.
2. Count matches to four-gram “stems”.
3. Repeat for three- and two-grams.
4. Calculate conditional probabilities and sort results.
5. Select the highest-probability result from the highest-order matched n-gram as the best match.
The Context-Match Algorithm works similarly, except that stop words are removed from the n-grams.
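The five steps can be sketched in a few lines of R. This is a minimal illustration, not the app's actual code: the `ngram_tables` list, its column names, the toy counts, and the `predict_next` function are all hypothetical. It also estimates the conditional probability as a plain count ratio among the matching rows, which is a simplification of the integer log-frequency look-up used in the app.

```r
# Hypothetical toy tables, one per n-gram order. Each row splits an n-gram
# into a "stem" (leading words) and a "root" (the word to predict).
ngram_tables <- list(
  `4` = data.frame(stem = "i dont want", root = "that", count = 12,
                   stringsAsFactors = FALSE),
  `3` = data.frame(stem = c("dont want", "dont want"),
                   root = c("to", "it"), count = c(40, 9),
                   stringsAsFactors = FALSE),
  `2` = data.frame(stem = c("want", "want"),
                   root = c("a", "to"), count = c(300, 120),
                   stringsAsFactors = FALSE)
)

predict_next <- function(text, tables) {
  # Step 1: keep the last three words of the lower-cased input.
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- tail(words[nzchar(words)], 3)

  # Steps 2-3: try the longest stem first, then back off to shorter stems.
  for (k in rev(seq_len(length(words)))) {
    stem <- paste(tail(words, k), collapse = " ")
    tbl <- tables[[as.character(k + 1)]]
    if (is.null(tbl)) next
    hits <- tbl[tbl$stem == stem, , drop = FALSE]
    if (nrow(hits) > 0) {
      # Step 4: conditional probability of each root given the stem
      # (assumes the table lists every n-gram that starts with this stem).
      hits$prob <- hits$count / sum(hits$count)
      hits <- hits[order(-hits$prob), , drop = FALSE]
      # Step 5: best root from the highest-order n-gram that matched.
      return(hits$root[1])
    }
  }
  NA_character_  # no stem matched at any order
}

predict_next("I dont want", ngram_tables)       # matches the four-gram table
predict_next("they really want", ngram_tables)  # backs off to the two-gram table
```

A context-based variant would apply the same function after dropping stop words from both the input and the stored n-grams.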
Computed n-gram data tables are stored as integers
freq <- as.integer(2000*log10(word_count))
giving faster lookup-based probability analysis
n_gram | frequency | stem | root | root_freq |
---|---|---|---|---|
i dont want | 2868 | i dont | want | 4201 |
front of the | 2647 | front of | the | 5855 |
but i think | 2711 | but i | think | 4273 |
the middle of | 2850 | the middle | of | 5479 |
and higher algorithm performance.
log_cond_prob <- frequency - root_freq
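To make the integer arithmetic concrete, the sketch below applies the subtraction to the four rows of the table above and then undoes the `2000 * log10()` scaling. The `ngrams` data frame, the `encode` helper, and the `ratio` column are names introduced here for illustration only; the values are copied from the table.

```r
# Encoding used for the stored tables: scaled log10 of the raw count,
# truncated to an integer (one step is 1/2000 of a decade, about 0.1%).
encode <- function(word_count) as.integer(2000 * log10(word_count))

# Integer log-frequencies copied from the table above.
ngrams <- data.frame(
  n_gram    = c("i dont want", "front of the", "but i think", "the middle of"),
  frequency = c(2868L, 2647L, 2711L, 2850L),
  root_freq = c(4201L, 5855L, 4273L, 5479L),
  stringsAsFactors = FALSE
)

# On the log scale a ratio of counts becomes an integer subtraction.
ngrams$log_cond_prob <- ngrams$frequency - ngrams$root_freq

# Undo the scaling to recover the approximate ratio of raw counts.
ngrams$ratio <- 10^(ngrams$log_cond_prob / 2000)

# Sorting on the integer column gives the same order as sorting on the
# ratio, because 10^x is monotonic in x.
ngrams[order(-ngrams$log_cond_prob), ]
```

Because the ranking only needs the integer column, the conversion back to a ratio can be skipped entirely at prediction time.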
Phrase: There's a lady who's sure all that glitters is gold and …
…silver (word based prediction)
…medals (context based prediction)
In the root-stem graph below, a lower log value corresponds to a higher probability.