Data Science Capstone - Text Prediction App

Ashwin Revo
31st December 2016

The goal of this project is to create a predictive text algorithm for the Data Science Capstone course
The training data provided by the course contains text from Twitter, blogs and news sites
Using the training data, models of the most likely 2-, 3- and 4-word sequences were generated
The models are used to predict the next word the user will type

The entire data set was used to generate data frames of 2-grams, 3-grams and 4-grams using the ngram library in R
The complete data set consists of 4,269,678 lines, which generated more than 20 million unique word sequences
To ensure fast page load times, sequences with a frequency of less than 10 were discarded
The algorithm looks for the longest matching sequence of words: if the end of the input text matches a 4-gram, the 4-gram result is shown first
If no 4-gram matches the input text, the 3-grams are checked, followed by the 2-grams
The algorithm is case insensitive
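
The counting, pruning and longest-match lookup described above can be sketched in base R as follows. This is only an illustrative sketch, not the code in algo.R: the toy corpus, the function names (tokenize, count_ngrams, predict_next) and the data frame layout are assumptions made for the example. It also uses base R instead of the ngram package, and a frequency threshold of 2 instead of 10 so that the four-line corpus leaves something to match.

```r
# Hypothetical toy corpus standing in for the Twitter, blog and news text.
corpus <- c(
  "thank you so much for the help",
  "Thank you so much for coming",
  "thank you so much",
  "thank you for the help"
)

# Lower-case the text (case-insensitive matching) and split it into words.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count n-grams line by line and drop those seen fewer than min_freq times.
# Returns a data frame with the leading n-1 words, the last word and the count.
count_ngrams <- function(lines, n, min_freq = 1) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(prefix = character(0), word = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  gram <- names(counts)
  freq <- as.integer(counts)
  keep <- freq >= min_freq
  data.frame(
    prefix = sub(" [^ ]+$", "", gram[keep]),  # everything but the last word
    word   = sub("^.* ", "", gram[keep]),     # the last word only
    freq   = freq[keep],
    stringsAsFactors = FALSE
  )
}

# One pruned frequency table per n-gram order (threshold 2 is an assumption
# for the toy data; the app described here used 10 on the full data set).
models <- list(
  "4" = count_ngrams(corpus, 4, min_freq = 2),
  "3" = count_ngrams(corpus, 3, min_freq = 2),
  "2" = count_ngrams(corpus, 2, min_freq = 2)
)

# Try the 4-gram table first, then fall back to the 3-gram and 2-gram tables.
predict_next <- function(input, models, top = 3) {
  w <- tokenize(input)
  for (n in c(4, 3, 2)) {
    k <- n - 1                       # number of trailing input words to match
    if (length(w) < k) next
    prefix <- paste(tail(w, k), collapse = " ")
    m <- models[[as.character(n)]]
    hits <- m[m$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]
      return(head(hits$word, top))
    }
  }
  character(0)                       # nothing matched at any order
}

predict_next("Thank you so", models)  # matched by the 4-gram table
predict_next("I need the", models)    # falls back to the 2-gram table
```

In this toy setup the first call is answered by the 4-gram table, while the second finds no 4-gram or 3-gram match and falls back to the 2-gram table.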

The user types text in the sidebar panel, which triggers the server-side code to generate the predicted text
The predicted text is displayed on the main panel in semicolon-separated format
In my testing, the app loaded in around 10 seconds, after which the predictive text output was displayed instantaneously
Instant text predictions were achieved by preprocessing and optimizing the n-gram frequency data
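
The sidebar-input and main-panel-output layout described above could be wired up in Shiny roughly as follows. This is a minimal sketch rather than the deployed app: the input and output IDs (phrase, prediction), the helper format_predictions, the "; " separator and the placeholder predict_stub, which returns fixed words instead of querying the n-gram tables, are all assumptions made for illustration.

```r
library(shiny)

# Hypothetical placeholder for the real n-gram lookup: returns fixed words
# for any non-empty input so the sketch runs on its own.
predict_stub <- function(text) {
  if (!nzchar(trimws(text))) return(character(0))
  c("the", "to", "and")
}

# Join the predicted words into one semicolon-separated string.
format_predictions <- function(words) paste(words, collapse = "; ")

# Text box in the sidebar panel, predictions in the main panel.
ui <- fluidPage(
  titlePanel("Text Predictor"),
  sidebarLayout(
    sidebarPanel(textInput("phrase", "Type some text:")),
    mainPanel(textOutput("prediction"))
  )
)

# Re-runs whenever the text box changes and updates the main panel.
server <- function(input, output) {
  output$prediction <- renderText({
    format_predictions(predict_stub(input$phrase))
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch only from an interactive R session.
if (interactive()) runApp(app)
```

Because renderText depends on input$phrase, Shiny reruns the lookup on every change to the text box, so the time to show a prediction is governed by how fast the lookup itself is.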

Source code for the algorithm can be found here: https://github.com/arevo/DataScienceCapstone/blob/master/algo.R
The working implementation of the algorithm has been uploaded to shinyapps.io here: https://ashwinrevo.shinyapps.io/TextPredictor/
Reference: n-gram CRAN package - https://cran.r-project.org/web/packages/ngram/ngram.pdf