Puxin Xu
April 17, 2016
The main goal of this capstone project is to build a Shiny application that predicts the next word.
The work was divided into seven subtasks, including data cleansing, exploratory analysis, and the creation of a predictive model.
All text data used to build the frequency dictionary, and thus to predict the next word, comes from a corpus called HC Corpora.
All text mining and natural language processing was done with a variety of well-known R packages.
A data sample was drawn from the HC Corpora data and cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers, and all kinds of special characters. The cleaned sample was then tokenized into so-called n-grams.
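The cleaning steps listed above can be sketched in a few lines. This is only an illustration in Python, not the project's R code: the function name `clean_text` and the regular expressions are assumptions chosen to mirror the described steps (lowercase, strip links, strip numbers, punctuation and special characters, collapse white space).

```python
import re

def clean_text(text):
    """Hypothetical cleaning helper mirroring the steps described above."""
    text = text.lower()                                 # convert to lowercase
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z\s]", " ", text)               # remove numbers, punctuation, special characters
    text = re.sub(r"\s+", " ", text).strip()            # collapse extra white space
    return text

sample = "Visit http://example.com NOW!!  It costs 20 dollars."
print(clean_text(sample))
```

Note that this simple pattern also splits contractions such as "don't" into two tokens; a real pipeline may want to treat apostrophes separately.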
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. (Source)
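To make the definition concrete, here is a minimal sketch of how n-grams can be extracted from a token sequence and counted into a frequency dictionary that suggests a next word. It is written in Python for illustration and is not the application's actual model: the function names are hypothetical, and the lookup simply returns the most frequent continuation, with no smoothing or back-off.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_frequency_dictionary(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for gram in ngrams(tokens, n):
        table[gram[:-1]][gram[-1]] += 1
    return table

def predict_next(table, context):
    """Return the most frequent word seen after the context, or None if unseen."""
    candidates = table.get(tuple(context))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
table = build_frequency_dictionary(tokens, 2)   # bigram model
print(ngrams(tokens, 2)[:3])
print(predict_next(table, ["the"]))
```

In this toy sentence "the" is followed by "cat" twice and "mat" once, so the bigram lookup suggests "cat".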
As the picture shows, the interface of the application is simple.
There are two steps to predict a word:
submit button.
The next word prediction app is hosted on shinyapps.io: https://creatrol.shinyapps.io/NextWordPredictor
The complete code of this application, along with the milestone report, related scripts, and this presentation, can be found in this GitHub repo: https://github.com/Creatrol/Next-Word-Predictor
This pitch deck is located here: http://rpubs.com/Creatrol/NextWordPredictor
Learn more about the Coursera Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1