Displays the top 30 candidate words, ranked by weighted score
WordCloud display for fast visualization of suggested words
Handles and respects punctuation and special characters in user input (a sketch of this follows the feature list)
The language model's data totals 99.8 MB, averaging 80.5 bytes per N-gram, which keeps loading time short and lookups fast.
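To make the input handling above concrete, here is a hypothetical cleaning helper (not the app's actual code, just one plausible approach): it keeps only the words after the last sentence-ending punctuation mark and preserves apostrophes so contractions survive.

```r
# Hypothetical input-cleaning sketch; the app's real preprocessing is not shown.
clean_input <- function(x) {
  # Respect sentence boundaries: only words after the last ., !, or ? count as context.
  last_sentence <- tail(strsplit(x, "[.!?]")[[1]], 1)
  w <- tolower(last_sentence)
  w <- gsub("[^a-z' ]", " ", w)         # keep letters and apostrophes, drop other symbols
  w <- gsub("\\s+", " ", trimws(w))     # collapse repeated whitespace
  strsplit(w, " ", fixed = TRUE)[[1]]   # return a word vector
}

clean_input("Nice day! Don't you")  # "don't" "you"
```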
Instructions
The Underlying Algorithm
The model was trained on a random 5% sample of the HC Corpora's English documents from Twitter, blogs, and news feeds (descriptive statistics).
The data for each N-gram collection (N = 1 to 4) is stored in hash tables (R data.table with spooky.32 hashes for efficiency).
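A minimal sketch of that storage scheme, assuming a toy 4-gram collection; digest::digest with murmur32 stands in here for the 32-bit SpookyHash, since the app's exact hashing helper isn't shown:

```r
library(data.table)
library(digest)

# Hypothetical helper: hash an (N-1)-word prefix to a short fixed-size key.
# murmur32 stands in for the 32-bit SpookyHash used by the app.
hash_prefix <- function(prefix) digest(prefix, algo = "murmur32")

# Toy 4-gram collection: hashed 3-word prefix -> next word + MLE probability.
ngram4 <- data.table(
  key_hash = sapply(c("one of the", "one of the", "at the end"), hash_prefix),
  word     = c("best", "most", "of"),
  mle      = c(0.40, 0.25, 0.90)
)
setkey(ngram4, key_hash)   # keyed for fast binary-search lookup

# Look up every candidate word following a given 3-word prefix.
ngram4[.(hash_prefix("one of the"))]
```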
WordSeer employs a Markov process based on a modified “Stupid Backoff” algorithm (Brants et al., 2007). For each N-gram ending the input (4-grams down to 1-grams), it computes the maximum-likelihood estimate (MLE) for every word that can follow. These probabilities are summed across all N-gram collections (1 to k) to obtain each word's weight (W):
\[ W_{\text{word}}=\sum_{n=1}^{k}\left(\mathrm{MLE}_{\text{word}}\right)_{n} \]
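The sketch below shows this weighted-sum scoring end to end on toy data (three N-gram orders instead of four, and hashing omitted for readability; a sketch only, not the app's production code):

```r
library(data.table)

# Toy N-gram collections (orders n = 1..3 for brevity): prefix -> next word + MLE.
# Unigrams use an empty prefix so every input matches them.
tables <- list(
  data.table(prefix = c("", ""),       word = c("the", "best"), mle = c(0.05, 0.01)),
  data.table(prefix = c("the", "the"), word = c("best", "end"), mle = c(0.30, 0.10)),
  data.table(prefix = "of the",        word = "best",           mle = 0.40)
)
invisible(lapply(tables, setkeyv, "prefix"))   # key each table for fast joins

# Weight each candidate by summing its MLE over every matching order,
# mirroring W_word = sum over n of (MLE_word)_n.
predict_next <- function(input_words, tables, top = 30) {
  hits <- rbindlist(lapply(seq_along(tables), function(n) {
    ctx <- paste(tail(input_words, n - 1), collapse = " ")  # (n-1)-word context
    tables[[n]][.(ctx), nomatch = 0L]
  }))
  hits[, .(W = sum(mle)), by = word][order(-W)][seq_len(min(top, .N))]
}

predict_next(c("one", "of", "the"), tables)
#>    word    W
#> 1: best 0.71
#> 2:  end 0.10
#> 3:  the 0.05
```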
Further Exploration
Context information is discarded for N-grams where N > 4.
The average bytes per N-gram could be reduced further by storing unique words as integer IDs in the higher-order N-grams (see the sketch after this list).
Implementing interpolation (e.g., Kneser-Ney smoothing) may improve prediction accuracy but decrease performance.
A table of known contractions would allow the model to handle them.
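A rough sketch of the integer-ID idea above (illustrative assumptions only: a tiny vocabulary and a single 4-gram): words are mapped to integer IDs once, so the higher-order tables can store fixed-size integers instead of repeated strings.

```r
library(data.table)

# Illustrative only: map each unique word to an integer ID once.
vocab <- data.table(word = c("one", "of", "the", "best"))
vocab[, id := .I]   # row number serves as the word's integer ID

# Hypothetical helper: translate words to their integer IDs.
word_to_id <- function(w) vocab[match(w, word), id]

# A 4-gram stored as four 4-byte integers instead of four strings.
ngram_ids <- word_to_id(c("one", "of", "the", "best"))
ngram_ids
#> [1] 1 2 3 4

# Rough storage comparison: integers are a fixed 4 bytes each,
# while character strings carry per-element overhead.
object.size(ngram_ids)
object.size(c("one", "of", "the", "best"))
```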