Tyler Byers
December 2014
The word prediction app originally built for the Coursera/Johns Hopkins University Data Science Capstone project
Contact: tybyers@gmail.com
This app, hosted on shinyapps.io, is a prototype that demonstrates the capabilities of the Word Oracle prediction machine. It is not meant to be a stand-alone product. The prediction machine, if accepted, may be useful if further developed for the following:
Application URL: https://tybyers.shinyapps.io/wordoracle/
Note: Initial chart and table load may take several seconds.
Example: The user enters “I love the Denver” into the text box and clicks the “Get Score Charts” button; the top result is “broncos”! The top result and the other, less probable results are shown with their corresponding “scores” in a table and a chart.
Text data for prediction are derived from news articles and tweets in English supplied in the HC Corpora corpus (http://www.corpora.heliohost.org/). The data were cleaned and put into a look-up table using the following process.
Data cleaning:
Tokenization:
RWeka package used to create 4-grams.
The Word Oracle uses a look-up algorithm to determine the most likely word(s) for prediction, drawing on a 4-gram table of 2.51 million lines. The algorithm works fairly quickly, and accuracy should improve with more data capacity. The algorithm is explained below. At this time, all results are returned as lower-case words with no punctuation or numerals.
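The look-up idea described above can be illustrated with a minimal sketch. It is written in Python rather than with the RWeka package, and it is not the app's actual code: the function names (`clean`, `build_table`, `predict`), the cleaning rules, the three-line toy corpus, and the use of relative frequency as the "score" are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lower-case the text and drop punctuation and numerals (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # keep letters, apostrophes, spaces
    return text.split()


def build_table(lines):
    """Map each 3-word context to a Counter of the words observed after it."""
    table = defaultdict(Counter)
    for line in lines:
        words = clean(line)
        for i in range(len(words) - 3):  # one iteration per 4-gram
            context = tuple(words[i:i + 3])
            table[context][words[i + 3]] += 1
    return table


def predict(table, text, top_n=5):
    """Return up to top_n (word, score) pairs for the last three words of text.

    The score here is a relative frequency, which is an assumption; the
    source does not define how its scores are computed.
    """
    words = clean(text)
    if len(words) < 3:
        return []  # the fewer-than-3-words case is not covered by this sketch
    counts = table.get(tuple(words[-3:]))
    if not counts:
        return []
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]


# Toy corpus standing in for the news and tweet data (hypothetical).
corpus = [
    "I love the Denver Broncos!",
    "We love the Denver Broncos.",
    "They love the Denver Nuggets, too.",
]
table = build_table(corpus)
print(predict(table, "I love the Denver"))
```

With this toy corpus, "broncos" ranks above "nuggets" because it follows "love the denver" twice as often. A real table of millions of 4-grams would more likely be stored as a sorted or indexed data frame than as an in-memory dictionary.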
If the user enters 3 or more words into Text Box:
If the user enters fewer than 3 words into Text Box:
When the user clicks the Get Score Charts button: