Predictive Text
Goal: Create an algorithm to predict the next word in a series.
Constraints: Must run within a reasonable amount of time
Implementation - Creating N-grams
read each line of the various data sources (the number of lines read is limited by available memory and time)
remove all punctuation and stop words
process a combined corpus from all sources
process in batches to avoid freezing
use Kneser-Ney smoothing components
use the tm create_ngram_model function to create n-grams of orders 1 through 5
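
The steps above can be sketched in a few lines. This is a minimal Python illustration, not the original implementation: the stop-word list, the function names (tokenize, batches, build_ngram_counts, continuation_counts) and the batch size are all assumptions made for the example. It covers cleaning, batching and counting n-grams of orders 1 through 5, plus one Kneser-Ney ingredient (continuation counts); it does not perform full Kneser-Ney smoothing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list would be much larger).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}

def tokenize(line):
    """Lowercase, replace non-letters with spaces, and drop stop words."""
    words = re.sub(r"[^a-z\s]", " ", line.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def batches(lines, size):
    """Yield successive lists of at most `size` lines."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def build_ngram_counts(lines, max_n=5, batch_size=1000):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for batch in batches(lines, batch_size):
        for line in batch:
            tokens = tokenize(line)
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def continuation_counts(bigram_counts):
    """For each word, count the distinct words it follows (a Kneser-Ney component)."""
    cont = Counter()
    for (_, word) in bigram_counts:
        cont[word] += 1
    return cont
```

Calling build_ngram_counts on an iterable of lines (for example an open file) returns one Counter per n-gram order. Because lines are consumed batch by batch, the input can be a generator rather than a list held fully in memory.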
Implementation - Predicting Text
Depending on the length of the user input, check it against the appropriate N-gram table. If possible, match the last four words of the input against the 5-grams first. Back off to lower-order N-grams when there is no match.
Collect all candidate words from the N-gram matches, weighting them so that matches on longer phrases get a better score.
Take the top 5 scoring words and return them to the user.
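
The prediction steps above can be sketched as follows. This is a hedged Python illustration rather than the original code: the function name predict_next, the table layout ({n: Counter of word tuples}) and the weighting rule (count multiplied by n-gram order) are assumptions chosen only to show back-off with a preference for longer matches.

```python
from collections import Counter

def predict_next(counts, text, top_k=5, max_n=5):
    """Suggest up to top_k next words, preferring matches on longer contexts."""
    tokens = text.lower().split()
    scores = Counter()
    # Start with the longest usable context, then back off to shorter ones.
    for n in range(min(max_n, len(tokens) + 1), 1, -1):
        context = tuple(tokens[-(n - 1):])
        for ngram, count in counts.get(n, {}).items():
            if ngram[:-1] == context:
                # Assumed weighting: multiply the count by the n-gram order.
                scores[ngram[-1]] += n * count
    if not scores:
        # No context matched: fall back to the most frequent single words.
        for (word,), count in counts.get(1, {}).items():
            scores[word] += count
    return [word for word, _ in scores.most_common(top_k)]

# Toy tables for illustration only.
toy_counts = {
    1: Counter({("sat",): 3, ("cat",): 2, ("on",): 1}),
    2: Counter({("cat", "sat"): 2, ("cat", "ran"): 1}),
    3: Counter({("the", "cat", "ran"): 1}),
}
suggestions = predict_next(toy_counts, "the cat")
```

With the toy tables, "ran" outranks "sat" even though "cat sat" is the more frequent bigram, because "ran" also matches the longer context "the cat". The linear scan over every n-gram keeps the sketch short; a practical version would index each table by context so that a lookup does not touch the whole table.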