Kenneth D. Graves
April 26, 2015
This Shiny Predictive Next Word Application demonstrates a basic algorithm written in the R programming language. It was produced for the Capstone course of the Johns Hopkins Data Science Specialization on Coursera.
The goal of the Capstone was to study, design and implement a predictive application based on the following criteria:
The application has a very simple interface. To use it, type or paste a short phrase into the text window and click Predict. The app then predicts the most likely next word. Only the last four words of the phrase are used for prediction. The app also shows which phase of the algorithm produced the guess, labeled the Type.
The application utilizes a cascading algorithm to perform next word prediction:
The 4-gram and 3-gram hash tables and the Naive Bayes classifier's priors were all built from data supplied by SwiftKey as part of this project. The two-stage approach combines the high efficiency of the hash lookup with the Naive Bayes classifier's better use of priors for unseen phrases.
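The cascade can be illustrated with a minimal R sketch. Everything named here is hypothetical: the tables `four_gram` and `three_gram` hold toy entries, `naive_bayes_predict()` is a placeholder rather than a real classifier, and the order of the stages (4-gram lookup, then 3-gram lookup, then Naive Bayes) is an assumption based on the description above, not the application's actual code.

```r
# Toy stand-ins for the real hash tables (assumed layout: the key is the
# preceding words, the value is the predicted next word).
four_gram  <- c("thanks for the" = "follow")
three_gram <- c("for the" = "first", "one of" = "the")

# Placeholder for the second stage. A real model would score candidate
# words using its priors; this stub returns a fixed word so the sketch runs.
naive_bayes_predict <- function(words) "the"

predict_next <- function(phrase) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- utils::tail(words, 4)  # only the last four words are considered

  # Stage 1: try the 4-gram table (3-word key), then the 3-gram table
  # (2-word key).
  for (n in c(3, 2)) {
    if (length(words) >= n) {
      key <- paste(utils::tail(words, n), collapse = " ")
      tbl <- if (n == 3) four_gram else three_gram
      if (key %in% names(tbl)) {
        return(list(word = tbl[[key]], type = paste0(n + 1, "-gram lookup")))
      }
    }
  }

  # Stage 2: fall back to the classifier for phrases not in either table.
  list(word = naive_bayes_predict(words), type = "Naive Bayes")
}

predict_next("Thanks for the")   # found in the 4-gram table
predict_next("waiting for the")  # falls through to the 3-gram table
predict_next("hello")            # falls through to the classifier stub
```

Returning the stage name alongside the word mirrors the Type shown in the interface.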
Both algorithms used by this application were based on three sampled collections of text data from news, blogs and Twitter feeds. The sampled texts were cleaned of profanity, and common contractions were replaced with their non-contracted forms. Further processing included the removal of punctuation, numbers and excess whitespace.
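A minimal sketch of such a cleaning step in base R follows. The function name `clean_text()`, the short `contractions` and `profanity` lists, the lowercasing, and the order of the steps are all assumptions made for illustration. The actual word lists and pipeline used by the application are not given here.

```r
# Hypothetical, deliberately tiny word lists for illustration only.
contractions <- c("can't" = "cannot", "won't" = "will not",
                  "i'm" = "i am", "it's" = "it is", "don't" = "do not")
profanity <- c("darn", "heck")  # placeholder entries

clean_text <- function(x) {
  x <- tolower(x)  # assumption: text is lowercased first

  # Expand contractions before punctuation (including the apostrophe)
  # is stripped.
  for (k in names(contractions)) {
    x <- gsub(k, contractions[[k]], x, fixed = TRUE)
  }

  # Remove profane words, matched on word boundaries.
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)

  x <- gsub("[[:punct:]]", " ", x)  # punctuation
  x <- gsub("[[:digit:]]", " ", x)  # numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

clean_text("I can't believe it's 2015, darn!")
```

Expanding contractions before removing punctuation matters: once the apostrophe is gone, "can't" can no longer be recognized.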
The Naive Bayes classifier model uses only texts from news and blogs, while the 4-gram and 3-gram hash tables were built from all three sources. This selective approach showed better results in cross-validation and testing.
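As an illustration of how an n-gram lookup table might be built from cleaned samples, here is a hedged sketch in base R. The function `build_ngram_table()`, the use of an R environment as the hash table, and the sample vectors `news`, `blogs` and `twitter` are hypothetical and do not come from the application itself.

```r
# Hypothetical helper: maps each (n - 1)-word prefix to the word that most
# often follows it. An R environment serves as the hash table.
build_ngram_table <- function(texts, n) {
  stopifnot(n >= 2)

  # Count every n-gram in the input.
  counts <- new.env(hash = TRUE)
  for (txt in texts) {
    words <- strsplit(txt, " ", fixed = TRUE)[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next
    for (i in seq_len(length(words) - n + 1)) {
      key <- paste(words[i:(i + n - 1)], collapse = " ")
      old <- counts[[key]]
      counts[[key]] <- if (is.null(old)) 1L else old + 1L
    }
  }

  # For each prefix, keep only the most frequent final word.
  lookup <- new.env(hash = TRUE)
  best   <- new.env(hash = TRUE)
  for (key in ls(counts, all.names = TRUE)) {
    words  <- strsplit(key, " ", fixed = TRUE)[[1]]
    prefix <- paste(words[-n], collapse = " ")
    seen   <- best[[prefix]]
    if (is.null(seen) || counts[[key]] > seen) {
      best[[prefix]]   <- counts[[key]]
      lookup[[prefix]] <- words[n]
    }
  }
  lookup
}

# Toy samples standing in for the three cleaned collections.
news    <- c("thanks for the help")
blogs   <- c("see you at the game")
twitter <- c("thanks for the follow", "thanks for the follow")

# The lookup tables draw on all three sources.
trigrams <- build_ngram_table(c(news, blogs, twitter), 3)
trigrams[["for the"]]
```

Storing only the top word per prefix keeps the table small and makes each lookup a single hash access, at the cost of discarding less frequent continuations.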
For further information, you may contact the author here: kgraves@yahoo.com