Len Greski
23 March 2017
Over four million texts, new algorithms and packages to consider, and a new problem domain. The Capstone project for the Johns Hopkins Data Science Specialization is the ultimate experience in the hacker mentality.
Problem: build a Shiny application that predicts the next word in a phrase or sentence, using the Heliohost corpus as the basis for predictions. The work product includes:
Given the constraints of Shiny, the algorithm and supporting data need to fit within 1 gigabyte of memory. Because this is a text prediction application, end users expect subsecond response times once the data is loaded. We therefore chose a backoff model to predict the next word in a string entered by the user: it looks for the longest matching word sequence first and falls back to shorter sequences when no match is found.
Database: an R data frame built with the data.table package, whose indexing feature provides high-performance lookups.
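To make the backoff idea concrete, here is a minimal sketch in Python. It is not the project's R code: the function names `build_tables` and `predict_next`, the maximum order of 3, and the toy sentences are all assumptions made for illustration, and a plain dictionary stands in for the indexed data.table. The sketch uses simple "most frequent continuation" backoff with no probability smoothing, which may differ from the scoring the app actually uses.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_order=3):
    """Map each word history (0 to max_order-1 words) to counts of the next word.

    Hypothetical helper: a dict keyed by history tuples stands in for an
    indexed lookup table.
    """
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for n in range(max_order):
                if i - n < 0:
                    break  # not enough preceding words for this history length
                history = tuple(words[i - n:i])
                tables[history][word] += 1
    return tables


def predict_next(tables, phrase, max_order=3):
    """Return the most frequent next word, backing off to shorter histories."""
    words = phrase.lower().split()
    # Try the longest usable history first, then shorter ones, then no history.
    for n in range(max_order - 1, -1, -1):
        if n > len(words):
            continue  # the phrase is shorter than this history length
        history = tuple(words[len(words) - n:])
        candidates = tables.get(history)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # nothing was ever observed (empty tables)


# Toy data, made up for this example only.
toy_sentences = [
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "we love cake",
]
toy_tables = build_tables(toy_sentences)

print(predict_next(toy_tables, "they love new"))  # two-word history matches: york
print(predict_next(toy_tables, "really love"))    # backs off to one word: new
print(predict_next(toy_tables, "zzz"))            # backs off to overall counts: love
```

Each prediction costs at most `max_order` dictionary lookups, which is why a keyed table suits the subsecond response requirement.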
We used a three-step process to build the database for our prediction algorithm, including:
Due to the large size of the corpus (over 4 million texts), data was processed by type of text (blogs, news, Twitter) and combined in step 3 above. Code for the project is located in the lgreski/dssCapstone GitHub repository: https://github.com/lgreski/dssCapstone
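The per-source processing described above can be sketched as follows, again in Python rather than the project's R. This is only an illustration of counting word sequences separately for each text type and then merging the counts; the helper names `count_ngrams` and `combine_counts` and the sample texts are assumptions, and the real pipeline's cleaning and tokenization steps are not shown here.

```python
from collections import Counter


def count_ngrams(texts, n):
    """Count every run of n consecutive words in a list of texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def combine_counts(counters):
    """Merge per-source counts into a single table by summing frequencies."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


# Tiny made-up samples standing in for the three text types.
sample_sources = {
    "blogs": ["thanks for the follow", "the follow up post"],
    "news": ["thanks for the report"],
    "twitter": ["thanks for the follow"],
}

# Process each source on its own, so only one source is in memory at a time.
per_source = [count_ngrams(texts, 2) for texts in sample_sources.values()]

# Combine the per-source tables into one.
combined = combine_counts(per_source)

print(combined[("for", "the")])     # 3
print(combined[("the", "follow")])  # 3
```

Because the counts are simple sums, combining per-source tables gives the same result as counting the whole corpus in one pass, while keeping peak memory use lower.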
Access the app at https://lgreski.shinyapps.io/textPredictor