JHU Capstone Project

Stephen O'Connell
4/22/2015

Created an index of all words across all three corpora.
Pre-processed the data by removing profanity, punctuation, numbers, and extra white space
Created 3-grams from the sampled data and counted their frequency of occurrence
Checked all words in the nGrams for misspellings and removed any nGram containing a misspelling
Using the word index, converted the nGram words to indexed values, e.g. 'could' = 99.
Loaded the index and the indexed nGrams, with frequency counts, into data.table and set keys on the table
Created a compressed RData file containing the index and the nGram model
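
The model-building steps above can be sketched roughly as follows. The original was built in R with data.table; this is a small Python illustration, not the author's code. The PROFANITY and DICTIONARY sets, the sample lines, the lowercasing step, and the names preprocess, build_model, and save_model are all assumptions made for the example, and a gzip-compressed pickle stands in for the compressed RData file.

```python
import gzip
import pickle
import re
from collections import Counter

# Hypothetical stand-ins: the source does not name its profanity list or spell checker.
PROFANITY = {"darn"}
DICTIONARY = {"i", "could", "not", "be", "go", "happier", "we", "there"}

def preprocess(text):
    """Lowercase (an assumption), strip punctuation and numbers, drop profanity."""
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    # split() also collapses any runs of white space.
    return [w for w in cleaned.split() if w not in PROFANITY]

def build_model(lines):
    """Return (word index, 3-gram model keyed by the ids of the first two words)."""
    tokenised = [preprocess(line) for line in lines]

    # Word index: every distinct word gets an integer id (1-based, sorted order).
    vocab = sorted({w for words in tokenised for w in words})
    index = {w: i for i, w in enumerate(vocab, start=1)}

    # Count every 3-gram within each line.
    counts = Counter()
    for words in tokenised:
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1

    # Drop 3-grams containing a word outside the dictionary, then store ids.
    model = {}
    for gram, n in counts.items():
        if all(w in DICTIONARY for w in gram):
            id1, id2, id3 = (index[w] for w in gram)
            model.setdefault((id1, id2), {})[id3] = n
    return index, model

def save_model(path, index, model):
    """Write the index and model to one compressed file."""
    with gzip.open(path, "wb") as fh:
        pickle.dump({"index": index, "model": model}, fh)

SAMPLE = [
    "I could not be happier!",
    "I could not go.",
    "We could nott be there 2 darn",  # "nott" is a deliberate misspelling
]
index, model = build_model(SAMPLE)
```

Keying the model on the pair of leading word ids plays the role that the keyed data.table plays in the original: a lookup by the first two words returns every stored continuation with its count.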

Input text is pre-processed, removing profanity, punctuation, numbers, and extra white space
The last two words in the phrase are converted to their indexed values
The indexed values are used as keys into the nGram model, returning all nGrams that start with the keyed words
The result set is sorted by frequency in descending order
Indexed values for predictions are converted back to words
Top 4 words are returned to the UI
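
A minimal sketch of the prediction steps above, again in Python rather than the R used by the app. The toy INDEX and MODEL values, the lowercasing step, and the function name predict are hypothetical and exist only to make the example runnable. Profanity filtering is left out for brevity, and a word missing from INDEX would raise a KeyError here.

```python
import re

# Hypothetical toy data: word -> id, and (id1, id2) -> {id3: frequency}.
INDEX = {"be": 1, "could": 2, "go": 3, "happier": 4,
         "i": 5, "not": 6, "say": 7, "see": 8}
WORDS = {i: w for w, i in INDEX.items()}  # reverse lookup: id -> word
MODEL = {
    (5, 2): {6: 2},
    (2, 6): {1: 5, 3: 2, 4: 1, 7: 3, 8: 4},
}

def predict(phrase, top_n=4):
    """Return up to top_n next-word candidates for the phrase, most frequent first."""
    # Pre-process: lowercase (an assumption), strip punctuation and numbers.
    words = re.sub(r"[^a-z\s]", "", phrase.lower()).split()
    # Convert the last two words to their indexed values.
    key = tuple(INDEX[w] for w in words[-2:])
    # Look up all 3-grams that start with those two words.
    candidates = MODEL.get(key, {})
    # Sort by frequency, descending, and map the ids back to words.
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [WORDS[word_id] for word_id, _ in ranked[:top_n]]

suggestions = predict("I could not")
print(suggestions)  # ['be', 'see', 'say', 'go']
```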

An error will appear if a word is misspelled, i.e. the app won't predict for misspellings.
Only correctly spelled words that appear in the sampled corpus are valid, i.e. a correctly spelled word will still produce an error if it was not in the corpus.
An error will occur if the phrase is too short; it needs at least 3 words
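
The error conditions above could be checked along these lines. This is a hedged Python sketch, not the app's actual R code: KNOWN_WORDS, MIN_WORDS, the message wording, and the function name check_phrase are assumptions. It checks only the two words used as keys; the source does not say whether the app checks every word in the phrase.

```python
import re

# Hypothetical vocabulary: correctly spelled words seen in the sampled corpus.
KNOWN_WORDS = {"i", "could", "not", "be"}
MIN_WORDS = 3  # minimum phrase length stated in the notes above

def check_phrase(phrase):
    """Return an error message, or None when the phrase can be used for prediction."""
    words = re.sub(r"[^a-z\s]", "", phrase.lower()).split()
    if len(words) < MIN_WORDS:
        return f"Phrase too short: need at least {MIN_WORDS} words"
    # A misspelled word and a correct word missing from the corpus look the same here.
    unknown = [w for w in words[-2:] if w not in KNOWN_WORDS]
    if unknown:
        return "Unknown or misspelled word(s): " + ", ".join(unknown)
    return None

print(check_phrase("could not"))        # too short
print(check_phrase("I could nott"))     # misspelled key word
print(check_phrase("I could not"))      # None: usable
```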
Have fun!!