Ann Elisa Rijo
25 September, 2018
The objective of this capstone is to develop a Shiny app that predicts the next word, similar to the mobile keyboard applications built by SwiftKey.
The project involves several tasks:
(1) Understanding the problem, getting and cleaning the data.
(2) Performing exploratory data analysis (EDA).
(3) Tokenization of words and predictive text mining.
(4) Writing a milestone report and building a prediction model.
(5) Developing a Shiny application and writing the pitch.
The data came from HC Corpora in three files (Blogs, News and Twitter). The data was cleaned, processed, and tokenized, and n-grams were created. Further details are given in the Milestone Report.
The Shiny application predicts the next possible word in a sentence.
The user enters text in an input box, and in a second box the application returns the most probable next word.
The predicted word is obtained from the n-gram matrices by comparing the input with the tokenized frequencies of 2-, 3- and 4-gram sequences.
While the text is being entered, the predicted-word field refreshes instantaneously, and the predicted word is offered to the user as a choice.
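The lookup described above can be sketched as follows. The app itself is built with Shiny; Python is used here purely for illustration. The function names `build_tables` and `predict_next`, the toy corpus, and the rule of trying the longest context first and then backing off to shorter ones are all assumptions, since the source does not state how the 2-, 3- and 4-gram tables are combined.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=4):
    """Return the most frequent follower of the longest matching context.

    Assumed strategy: try the 4-gram table first, then the 3-gram table,
    then the 2-gram table. Returns None when no table has a match.
    """
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this n-gram order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None

# Toy corpus (hypothetical, not the HC Corpora data).
corpus = ("thank you so much for the follow "
          "thank you so much for the help "
          "thank you for the follow").split()
tables = build_tables(corpus)

print(predict_next("thank you so", tables))
print(predict_next("for the", tables))
```

With only two typed words the 4-gram table is skipped and the 3-gram table answers the query, which is the kind of fallback a keyboard-style predictor needs for short inputs.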
After a sample of the HC Corpora data is created, the following cleansing steps are performed:
(1) Converting to lowercase.
(2) Removing punctuation, links, white space, numbers and any form of special characters.
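A minimal sketch of these cleansing steps, written in Python for illustration only. The function name `clean_text` and the specific regular expressions are assumptions; the source does not give the exact rules that were used.

```python
import re

def clean_text(text):
    """Lowercase the text and strip links, digits, punctuation and extra spaces."""
    text = text.lower()
    # Remove links first, before punctuation removal breaks them apart.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Replace everything that is not a letter or whitespace with a space.
    # Simplification: apostrophes are treated like any other punctuation,
    # so "don't" becomes "don t".
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Check THIS out: http://example.com/a?b=1 in 2018!!  #wow"))
```

Removing links before punctuation matters: once the slashes and dots are gone, the remains of a URL can no longer be recognized and would leak into the token stream.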
The algorithm developed to predict the next word in a user-entered text string is based on a classic N-gram model. As mentioned earlier, a subset of the cleaned data from blogs, Twitter, and news is tokenized into N-grams; that is, Maximum Likelihood Estimates (MLE) of unigrams, bigrams, and trigrams were computed.
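As an illustration of the maximum likelihood estimate for an N-gram, the conditional probability of a word given its preceding words is the count of the full N-gram divided by the count of its prefix. The Python sketch below uses hypothetical helper names (`ngram_counts`, `mle_probability`) and a toy token list; it is not the code behind the app.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(ngram, counts_n, counts_prefix):
    """P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1})."""
    prefix_count = counts_prefix[ngram[:-1]]
    return counts_n[ngram] / prefix_count if prefix_count else 0.0

# Toy token list (hypothetical, not the HC Corpora data).
tokens = "i love new york i love new ideas i love coffee".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# Unigram MLE: relative frequency of the word in the sample.
p_i = unigrams[("i",)] / len(tokens)
# Bigram MLE: P(new | love)
p_new_given_love = mle_probability(("love", "new"), bigrams, unigrams)
# Trigram MLE: P(york | love new)
p_york_given_love_new = mle_probability(("love", "new", "york"), trigrams, bigrams)
```

Unseen N-grams get probability zero under plain MLE, which is one reason N-gram predictors usually fall back to a shorter context when the longer one has no match.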
A profanity filter based on Google's bad words list was also applied to all output. Finally, suggested word completion is based on the unigrams.
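One way these two steps could look is sketched below in Python, for illustration only. The tiny `BLOCKED_WORDS` set is a placeholder standing in for the real bad words list, and the names `filter_profanity` and `complete_word`, the prefix-matching rule, and the sample counts are all assumptions rather than the app's actual logic.

```python
from collections import Counter

# Placeholder stand-in for the real bad words list (assumption).
BLOCKED_WORDS = {"darn", "heck"}

def filter_profanity(candidates, blocked=BLOCKED_WORDS):
    """Drop any candidate word that appears in the blocked list."""
    return [w for w in candidates if w.lower() not in blocked]

def complete_word(prefix, unigram_counts, blocked=BLOCKED_WORDS, k=3):
    """Suggest up to k completions of a partial word, most frequent first."""
    prefix = prefix.lower()
    matches = [(w, c) for w, c in unigram_counts.items()
               if w.startswith(prefix) and w not in blocked]
    # Sort by descending frequency, breaking ties alphabetically.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]

# Hypothetical unigram frequencies.
unigram_counts = Counter({"the": 50, "there": 12, "they": 20,
                          "heck": 30, "hello": 8, "help": 9})

print(complete_word("the", unigram_counts))
print(complete_word("he", unigram_counts))
print(filter_profanity(["heck", "much", "Darn"]))
```

Filtering at the output stage, rather than deleting blocked words from the training sample, keeps the surrounding N-gram counts intact while still preventing blocked words from being suggested.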
I created a Shiny app that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word. The link to my Shiny app: https://rosmin28.shinyapps.io/Data_Science_Capstone/
Screenshot of the user interface, with directions to enter a sentence or a word and get a prediction of the next likely word.