Data Science Capstone - Text Prediction App
Ashwin Revo
31st December 2016
Introduction
- The goal of this project is to build a predictive-text algorithm for the Data Science Capstone course
- The training data, provided by the course, contains text from Twitter, blogs and news sites
- From the training data, models of the most likely 2-, 3- and 4-word sequences were generated
- The models were used to predict the next word to be typed by the user
Algorithm
- The entire data set was used to generate data frames of 2-grams, 3-grams and 4-grams using the ngram package in R
- The complete data set consists of 4269678 lines, which generated more than 20 million unique word sequences
- To ensure fast page load times, sequences with a frequency of less than 10 were discarded
- The algorithm looks for the longest matching sequence of words: if the input text matches a 4-gram, the 4-gram result is shown first. If no 4-gram matches the input text, the 3-grams are checked, followed by the 2-grams
- The algorithm is case-insensitive
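
The steps above can be sketched in base R. This is a minimal illustration, not the app's actual code: the helper names (tokenize, count_ngrams, prune, predict_next) and the toy corpus are assumptions, and plain data frames stand in for the output of the ngram package. The toy corpus is far too small for the frequency threshold of 10, so the example prunes with min_freq = 1.

```r
# Split text into lowercase word tokens, which makes matching case-insensitive.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count n-grams in a character vector of lines. Returns a data frame with
# the first n-1 words (prefix), the final word (nextword) and the count (freq).
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(prefix = character(0), nextword = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  tab <- table(grams)
  data.frame(prefix = sub(" [^ ]+$", "", names(tab)),
             nextword = sub("^.* ", "", names(tab)),
             freq = as.integer(tab),
             stringsAsFactors = FALSE)
}

# Discard rare sequences to keep the lookup tables small.
prune <- function(ngrams, min_freq = 10) {
  ngrams[ngrams$freq >= min_freq, ]
}

# Back off from the longest n-gram model to the shortest, returning the
# most frequent next words from the first model that matches.
predict_next <- function(text, models, top = 3) {
  w <- tokenize(text)
  for (n in sort(as.integer(names(models)), decreasing = TRUE)) {
    if (length(w) < n - 1) next          # not enough typed words for this model
    prefix <- paste(tail(w, n - 1), collapse = " ")
    m <- models[[as.character(n)]]
    hits <- m[m$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]
      return(head(hits$nextword, top))
    }
  }
  character(0)                           # no model matched
}

# Toy corpus (hypothetical); the real app was trained on 4269678 lines.
lines <- c("Thanks for the follow", "thanks for the help",
           "Thanks for the follow back", "thanks for your help")
models <- setNames(
  lapply(2:4, function(n) prune(count_ngrams(lines, n), min_freq = 1)),
  as.character(2:4)
)

predict_next("Thanks for the", models)   # matched by the 4-gram model
predict_next("waiting for the", models)  # falls back to the 3-gram model
```

In this sketch the second call finds no 4-gram starting with "waiting for the", so it backs off to the 3-gram prefix "for the" and returns the same candidates.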
Shiny App
- The user types text in the sidebar panel, which triggers the server-side code to generate the predicted text
- The predicted text is displayed on the main panel in semicolon-separated format
- In my testing, the app loaded in around 10 seconds, after which the predictive text output was displayed instantaneously
- Instant text predictions were achieved by preprocessing and optimizing the n-gram frequency data
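
The layout described above could look roughly like the following Shiny sketch. It is an illustration under assumptions, not the app's source: the input and output IDs, the format_predictions helper, and the tiny hard-coded lookup inside predict_next are all hypothetical stand-ins for the preprocessed n-gram tables.

```r
library(shiny)

# Hypothetical stand-in for the real n-gram lookup; a deployed app would
# load the preprocessed frequency tables once at startup instead.
predict_next <- function(text) {
  lookup <- list("thanks for the" = c("follow", "help"))
  key <- tolower(trimws(text))
  if (key %in% names(lookup)) lookup[[key]] else character(0)
}

# Join the predicted words into a semicolon-separated string.
format_predictions <- function(words) {
  paste(words, collapse = "; ")
}

# Sidebar panel for the typed text, main panel for the predictions.
ui <- fluidPage(
  titlePanel("Text Prediction"),
  sidebarLayout(
    sidebarPanel(textInput("text", "Type some text:")),
    mainPanel(textOutput("prediction"))
  )
)

# The reactive renderText block re-runs whenever input$text changes.
server <- function(input, output) {
  output$prediction <- renderText({
    format_predictions(predict_next(input$text))
  })
}

app <- shinyApp(ui, server)
# To launch interactively: runApp(app)
```

Keeping the lookup tables in memory outside the server function means each keystroke only triggers a table lookup, which is one plausible way to get near-instant predictions after the initial load.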