John Sinues
April 2015
Johns Hopkins Data Science Capstone Project
Goal
Develop an application that takes an input phrase and predicts the next word.
Considerations
Accuracy vs. Performance vs. Resources
Three files of US blog, news, and Twitter entries provided the data used to develop the model. Combined, they totaled 558 MB, contained over 4.2 million unique terms, and included lines longer than 40K characters.
Prior to building the model, the data was cleaned: offensive words and profanity were removed, spelling was checked, and non-printable characters and punctuation were expunged.
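As a rough illustration of the cleaning steps described above, here is a minimal Python sketch. It is not the code used in the project: the function name `clean_line` and the tiny `BANNED_WORDS` set are hypothetical placeholders, the spell-checking step is omitted, and dropping apostrophes and all non-ASCII characters is an assumption made for simplicity.

```python
import re
import string

# Hypothetical placeholder list; the project's actual profanity list is not given.
BANNED_WORDS = {"darn", "heck"}

def clean_line(line, banned=BANNED_WORDS):
    """Return a list of cleaned, lowercase tokens from one line of raw text."""
    # Drop anything outside printable ASCII (control characters, stray bytes).
    printable = "".join(ch for ch in line if ch in string.printable)
    lowered = printable.lower()
    # Assumption: apostrophes are removed so contractions stay one token.
    no_apostrophes = lowered.replace("'", "")
    # Replace remaining punctuation with spaces.
    no_punct = re.sub(r"[^a-z0-9\s]", " ", no_apostrophes)
    # Split on whitespace and filter out banned words.
    return [tok for tok in no_punct.split() if tok not in banned]

if __name__ == "__main__":
    print(clean_line("Well, darn! It's\x00 a test."))
```

Returning tokens rather than a string keeps the next step (counting word sequences) simple.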
After the data was cleaned, a word frequency table was created, along with a table of N-grams covering two-, three-, and four-word combinations.
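One way to picture the frequency tables is as counters keyed by word sequences. The sketch below assumes the cleaned text is already split into lists of tokens, one list per line; the helper name `ngram_counts` is hypothetical and the project's own table layout may differ.

```python
from collections import Counter

def ngram_counts(token_lines, n):
    """Count every run of n consecutive tokens across all lines."""
    counts = Counter()
    for tokens in token_lines:
        # Slide a window of width n along the line; short lines yield nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    lines = [["i", "love", "you"], ["i", "love", "pie"]]
    unigrams = ngram_counts(lines, 1)   # word frequency table
    bigrams = ngram_counts(lines, 2)    # two-word combinations
    print(unigrams.most_common(2))
    print(bigrams.most_common(1))
```

Calling the function with `n` set to 1, 2, 3, and 4 yields the word frequency table and the three N-gram tables. N-grams never span line boundaries in this sketch, which is a design assumption.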
Finally, a model was built from this data, and the results were presented as a Shiny application.
Use a back-off N-gram frequency model to estimate the next word.
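A simple form of back-off can be sketched as follows: look up the last three words in the four-word table, and if nothing matches, retry with the last two words, then the last one, and finally fall back to the most frequent single word. This Python sketch is illustrative only. The names `build_tables` and `predict_next` are hypothetical, and choosing the raw most-frequent follower with no discounting or smoothing is an assumption, not necessarily what the application does.

```python
from collections import Counter, defaultdict

def build_tables(token_lines, max_n=4):
    """Map each n to {context tuple: Counter of the words that follow it}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in token_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple when n == 1
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tokens, tables, max_n=4):
    """Return the most frequent follower of the longest matching context."""
    for n in range(max_n, 0, -1):
        if len(tokens) < n - 1:
            continue  # phrase too short for this order; back off
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # no data at all

if __name__ == "__main__":
    corpus = [
        ["thanks", "for", "the", "follow"],
        ["thanks", "for", "the", "follow"],
        ["thanks", "for", "the", "help"],
        ["see", "the", "movie"],
    ]
    tables = build_tables(corpus)
    print(predict_next(["thanks", "for", "the"], tables))
    print(predict_next(["unknown", "the"], tables))
```

Storing followers per context makes each lookup a single dictionary access, which matters for the accuracy, performance, and resource trade-off noted earlier.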
Predicting The Word
Features