Prognosticator - A Lexical Prediction Application
Justin R. Papreck
3/22/2019
This presentation highlights the Prognosticator application and its ability to predict the next word of a user-supplied input string. It is the final project of the Coursera Data Science Specialization from Johns Hopkins University, run in cooperation with SwiftKey.
If you have ever tried to explain English grammar, you have probably noticed how complex the language is and how many exceptions there are to almost every rule. This is what makes Natural Language Processing so valuable. By applying algorithms and a data science approach to language, we can use the ever-evolving language data on the internet to better predict the best word to use next.
Perhaps one of the best uses for this application is with idiomatic expressions and correct usage of idiomatic prepositions. For a non-native English speaker, these phrases can be extremely difficult to learn. For these lexical enigmas, we have the Prognosticator!
Prognosticator uses corpora from News, Blogs, and Twitter to analyze modern English text. During preprocessing, the text is converted to lower case and punctuation is removed. The Quanteda package is used to tokenize the text into ngrams (sequences of adjacent words). The last word of each ngram is split off, and this word becomes the prediction for the string of words that precedes it.
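The preprocessing and ngram split described above can be sketched in a few lines. This is a minimal illustration in Python, not the application's Quanteda-based code: the function names `preprocess` and `build_ngram_table` are hypothetical, the two-sentence corpus is made up, and keeping apostrophes inside words is an assumption that the source does not state.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and replace punctuation with spaces.

    Assumption: apostrophes are kept so contractions such as "it's" survive.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return text.split()

def build_ngram_table(sentences, n):
    """Count (prefix, next_word) pairs for every ngram of length n.

    The last word of each ngram is split off as the prediction;
    the preceding n-1 words form the lookup key.
    """
    table = Counter()
    for sentence in sentences:
        tokens = preprocess(sentence)
        for i in range(len(tokens) - n + 1):
            prefix = " ".join(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[(prefix, next_word)] += 1
    return table

# Made-up corpus for illustration only.
corpus = ["It's raining cats and dogs!", "Raining cats and dogs, again."]
trigrams = build_ngram_table(corpus, 3)
```

With this toy corpus, the key ("raining cats", "and") is counted twice, once per sentence, so "and" would be the top prediction after "raining cats".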
Prognosticator's predictive model is a Stupid Backoff Model. Stupid Backoff is less complex and less computationally expensive than other models, which makes it well suited to very large datasets and to this application. The model first scores candidate words using the longest ngram that matches the input, then backs off to each shorter ngram, discounting the score with each step.
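To make the backoff step concrete, here is a minimal Python sketch of Stupid Backoff scoring. It is an illustration under stated assumptions, not Prognosticator's implementation: the names `count_ngrams` and `predict` are hypothetical, the discount factor of 0.4 is the value commonly used for Stupid Backoff rather than one given in this presentation, and the toy token list is made up. Note that the results are relative scores, not normalized probabilities.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count every ngram of length 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(context, counts, max_n, total_tokens, alpha=0.4):
    """Rank next-word candidates with Stupid Backoff.

    alpha is an assumed discount factor. A candidate keeps the score
    from the longest context in which it was seen.
    """
    scores = {}
    penalty = 1.0
    context = tuple(context[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        denom = counts[context] if context else total_tokens
        if denom > 0:
            # Linear scan for clarity; a real system would index by prefix.
            for ngram, c in counts.items():
                if len(ngram) == len(context) + 1 and ngram[:-1] == context:
                    scores.setdefault(ngram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]   # back off to a shorter context
        penalty *= alpha        # discount the score at each step
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up training text for illustration only.
tokens = "once in a blue moon once in a while once in a blue moon".split()
counts = count_ngrams(tokens, 3)
ranked = predict(["once", "in", "a"], counts, 3, len(tokens))
unseen = predict(["purple", "a"], counts, 3, len(tokens))
```

In this toy example, "blue" ranks first after "once in a" because the trigram "in a blue" occurs twice out of three "in a" contexts. For the unseen context "purple a", the trigram lookup finds nothing, so the bigram score for "blue" is used and multiplied by 0.4.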
Fast: The Stupid Backoff only compares the input text with precalculated tokens and their probabilities from an existing data set, so this model is lightweight and fast.
Filtered: Prognosticator will accept profanities as input strings, but the predictive word bank has been purged of all profanities. While the algorithm can make a prediction using profane text, it cannot deliver a profanity as a suggestion.
Forecasts: Prognosticator can predict the ideal preposition or article to use in idiomatic expressions.
Foolproof: In cases of unknown or misspelled words, Prognosticator will work backward through the last word to make predictions based on letter combinations of the input.
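The "Filtered" behavior above can be sketched as a purge of the lookup table at build time. This is a hypothetical Python illustration, not the application's code: the function name `purge_predictions`, the table layout, and the mild placeholder blocklist are all assumptions. The key point is that only the predicted word is checked, so a blocked word can still appear in the input prefix.

```python
def purge_predictions(table, blocklist):
    """Drop entries whose predicted word is blocked.

    Prefixes are left untouched, so blocked words are still accepted
    as input; they just can never be returned as a suggestion.
    """
    return {
        (prefix, word): count
        for (prefix, word), count in table.items()
        if word not in blocklist
    }

# Placeholder blocklist and made-up counts for illustration only.
blocklist = {"heck", "darn"}
table = {
    ("what the", "heck"): 5,   # removed: prediction is blocked
    ("the heck", "is"): 3,     # kept: blocked word is only in the prefix
    ("oh", "darn"): 2,         # removed: prediction is blocked
}
purged = purge_predictions(table, blocklist)
```

After the purge, only the entry keyed by ("the heck", "is") remains, so the input "the heck" still yields a prediction while "heck" itself is never suggested.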