Daniel Alaiev
June 2016
This application was created to satisfy the Capstone requirement for the Data Science Specialization from Johns Hopkins. The application:
The data are a collection of unstructured sentences from US blogs, Twitter, and US news sources. The data were:
The algorithm uses a simple back-off model with a probability adjustment based on novelty. Given an input, the back-off model looks for n-grams one word longer than the input; if none match, it shortens the input and tries again, recursively. Once at least one match is found, the algorithm predicts the full n-gram(s) with the highest novelty-adjusted MLE probability: Freq(n-gram) / Freq((n-1)-gram) * Nov Adj. The novelty of an n-gram is the number of other unique n-grams in which its last word appears, ignoring their frequencies. The algorithm has about 28% accuracy for a 5-option prediction.
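The back-off and scoring steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the application's actual code. The function names (`build_counts`, `build_novelty`, `predict`), the whitespace tokenizer, and the toy corpus are all hypothetical. The source does not say how the novelty adjustment is scaled, so the sketch assumes a simple normalization, (1 + novelty) / (1 + maximum novelty), and counts novelty over all n-grams of length two or more that end with the word.

```python
from collections import Counter


def build_counts(sentences, max_n=3):
    """Count every n-gram of length 1..max_n (whitespace tokens, lowercased)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_novelty(counts):
    """Assumed novelty: number of distinct n-grams (n >= 2) ending in each word.

    Frequencies are ignored; each unique n-gram contributes exactly 1.
    """
    novelty = Counter()
    for gram in counts:
        if len(gram) >= 2:
            novelty[gram[-1]] += 1
    return novelty


def predict(text, counts, novelty, max_n=3, top_k=5):
    """Return up to top_k (word, score) pairs, best first, using back-off."""
    tokens = text.lower().split()
    # Keep at most max_n - 1 trailing words as the context.
    context = tuple(tokens[max(0, len(tokens) - (max_n - 1)):])
    total_words = sum(c for g, c in counts.items() if len(g) == 1)
    max_novelty = max(novelty.values(), default=0)
    while True:
        # Denominator of the MLE: frequency of the (n-1)-gram context.
        denom = counts[context] if context else total_words
        scored = []
        for gram, freq in counts.items():
            if len(gram) == len(context) + 1 and gram[:-1] == context:
                # Assumed scaling of the novelty adjustment (not from the source).
                adj = (1 + novelty[gram[-1]]) / (1 + max_novelty)
                scored.append((gram[-1], freq / denom * adj))
        if scored or not context:
            # Sort by score descending, then alphabetically for stable ties.
            scored.sort(key=lambda pair: (-pair[1], pair[0]))
            return scored[:top_k]
        # No match: back off by dropping the first word of the context.
        context = context[1:]


corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
counts = build_counts(corpus)
novelty = build_novelty(counts)
print(predict("the cat", counts, novelty))
print(predict("zebra sat", counts, novelty))
```

In this toy corpus, "the cat sat" and "the cat ate" have the same MLE probability, so the assumed novelty adjustment is what breaks the tie. The second call finds no trigram starting with "zebra sat" and backs off to bigrams starting with "sat".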
Using the application is easy. Just open the link and type your sentence into the input box. The application automatically processes the input and reports the results: the best prediction, an input check, and an output summary with a table.
Future plans include: