Steven D Rankine
23 August 2015
The goal of the Johns Hopkins University Data Science Coursera capstone project is to build a predictive model for user input.
The dataset used for training was provided by the SwiftKey Corporation in the form of three corpora extracted from blogs, news feeds and Twitter feeds.
The prediction model is based on character-level analysis for single-word phrases and word-level analysis for multi-word phrases. A Shiny app was created to demonstrate the prediction model.
Twenty-five batches of random samples (7000) were taken from each of the three corpora. This collection of samples was used to create a frequency-optimized term-document matrix (TDM) containing terms up to 3-grams.
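The sampling and counting step can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function names (`sample_lines`, `ngrams`, `count_ngrams`), the whitespace tokenizer, and the toy corpus are all assumptions, and the "frequency-optimized" pruning mentioned above is not reproduced because the text does not describe it.

```python
import random
from collections import Counter

def sample_lines(lines, size, seed):
    """Draw `size` lines at random without replacement (all lines if fewer exist)."""
    rng = random.Random(seed)
    return rng.sample(lines, min(size, len(lines)))

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=3):
    """Count all 1-grams up to max_n-grams across the given lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # assumption: naive whitespace tokenizer
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts

# Hypothetical toy corpus standing in for one of the three sources.
blog_lines = ["I love data science", "I love to write", "data science is fun"]

# Accumulate counts over several sampled batches (25 batches in the text above).
blog_counts = Counter()
for batch in range(25):
    blog_counts.update(count_ngrams(sample_lines(blog_lines, 2, seed=batch)))
```

Keeping one counter per source (blog, news, Twitter) would allow later queries to be restricted to a typing context.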
Queries are made against the TDM based on the user's input, the typing context (e.g. blog, news, or Twitter), and the maximum number of matches to search for.
Based on those inputs, the algorithm returns a data frame containing the most frequently occurring predictions for a given input phrase.
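A query of this kind might look like the sketch below. It is written in Python for illustration only and rests on several assumptions: the hypothetical `predict` function, a dictionary of per-context n-gram counters standing in for the TDM, a list of `(word, count)` pairs standing in for the data frame, and a simple back-off from two words of history to one, which the text above does not confirm.

```python
from collections import Counter

def predict(tdm, phrase, context, max_matches=3):
    """Return up to max_matches (next_word, count) pairs, most frequent first."""
    counts = tdm[context]
    tokens = phrase.lower().split()
    # Try the longest history first (2 words, matching 3-grams), then 1 word.
    for k in (2, 1):
        if len(tokens) < k:
            continue
        prefix = " ".join(tokens[-k:]) + " "
        matches = Counter()
        for term, freq in counts.items():
            # Keep only terms of exactly k + 1 words that extend the history.
            if term.startswith(prefix) and term.count(" ") == k:
                matches[term[len(prefix):]] += freq
        if matches:
            return matches.most_common(max_matches)
    return []

# Hypothetical counts for a single context.
tdm = {
    "blog": Counter({
        "i love": 4,
        "i love data": 3,
        "i love to": 1,
        "love data": 3,
        "love to": 1,
        "love you": 5,
    })
}

top_two = predict(tdm, "I love", "blog", max_matches=2)
```

Scanning every term on each query is linear in the size of the table; an index keyed by the history words would be the natural next step if speed mattered.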
The accuracy of my prediction model was directly related to the following constraints:
On the other hand, the computational speed of the model would be enhanced if the following could be implemented: