E. Chasen
3/26/2017
The following predictive text model and application were built using three datasets supplied by the Coursera Data Science Capstone class: a blog text file, a Twitter text file, and a news text file.
After the three datasets were combined and cleaned, four new datasets of varying n-gram lengths were created: a bigram, a trigram, a fourgram, and a fivegram dataset. An n-gram is a sequence of n consecutive words found together in a body of text; a bigram is a pair of adjacent words, a trigram is a run of three, and so on.
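As a rough illustration of how such n-gram datasets could be produced, the sketch below counts n-grams in a short sample sentence. The function names (ngrams, count_ngrams), the sample text, and the lowercase-and-split-on-whitespace tokenization are assumptions made for this example; they are not the cleaning steps or code used in the project.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(text, n):
    """Count the n-grams in text (assumed tokenization: lowercase, split on whitespace)."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))

# Hypothetical sample text standing in for the combined, cleaned corpus.
sample = "the cat sat on the mat and the cat slept"

bigram_counts = count_ngrams(sample, 2)
trigram_counts = count_ngrams(sample, 3)

print(bigram_counts[("the", "cat")])  # 2
print(sum(trigram_counts.values()))   # 8
```

Running the same counting step with n set to 2, 3, 4, and 5 would give four tables of the kind described above.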
Larger n-grams capture more context, so the algorithm first searches the fivegram dataset for entries that match the final words of the input. If there are no matches in the fivegram dataset, the algorithm moves on to the fourgram dataset, then the trigram dataset, and lastly the bigram dataset.
If none of the n-gram datasets contains a match, the next-word prediction comes from the list of most frequent unigrams (single words).
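The back-off search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's actual code: build_tables and predict_next are hypothetical names, the toy corpus stands in for the real datasets, and picking the most frequent follower within a matching n-gram table is an assumed ranking rule that the description does not specify.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=5):
    """For each n from 2 to max_n, map an (n-1)-word context to a Counter of following words."""
    tables = {}
    for n in range(2, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
        tables[n] = table
    return tables

def predict_next(text, tables, unigrams, max_n=5):
    """Try the longest n-gram table first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # input too short to form this context
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            # Assumed ranking: the follower seen most often after this context.
            return followers.most_common(1)[0][0]
    # No n-gram table matched: fall back to the most frequent single word.
    return unigrams.most_common(1)[0][0]

# Hypothetical toy corpus standing in for the cleaned blog, Twitter, and news text.
corpus = "the cat sat on the mat and the cat slept on the sofa".split()
tables = build_tables(corpus)
unigrams = Counter(corpus)

print(predict_next("the cat sat on the", tables, unigrams))  # mat  (fivegram match)
print(predict_next("a dog slept on the", tables, unigrams))  # sofa (backs off to fourgram)
print(predict_next("zebra", tables, unigrams))               # the  (unigram fallback)
```

Storing each table as a mapping from context to follower counts keeps every lookup a single dictionary access, so backing off through four tables stays cheap even when the tables are large.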