Data Science Capstone - Text Prediction App

Ashwin Revo
31st December 2016

Introduction

  • The goal of this project is to build a predictive text algorithm for the Data Science Capstone course
  • The training data, provided by the course, contains text from Twitter, blogs and news sites
  • From the training data, models of the most likely 2-, 3- and 4-word sequences were generated
  • The models were used to predict the next word to be typed by the user

Algorithm

  • The entire data set was used to generate data frames of 2-grams, 3-grams and 4-grams using the ngram library in R
  • The complete data set consists of 4,269,678 lines, which generated more than 20 million unique word sequences
  • To ensure fast page load times, sequences with a frequency below 10 were discarded
  • The algorithm matches the longest sequence of words first: if the input text matches a 4-gram, the 4-gram result is shown first. If no 4-gram matches the input text, the 3-grams are checked, followed by the 2-grams
  • The algorithm is case-insensitive
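
The steps above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the app's R code: the function names, whitespace tokenization and the `top_k` parameter are assumptions, and it returns only the results of the longest matching n-gram rather than reproducing how the app orders its output.

```python
from collections import Counter

def tokenize(text):
    # Lowercase so that matching is case-insensitive, then split on whitespace.
    return text.lower().split()

def count_ngrams(lines, n):
    # Count every run of n consecutive words in each line.
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def build_models(lines, min_count=10):
    # One frequency table per n-gram size; sequences seen fewer than
    # min_count times are discarded to keep the tables small.
    models = {}
    for n in (2, 3, 4):
        counts = count_ngrams(lines, n)
        models[n] = {g: c for g, c in counts.items() if c >= min_count}
    return models

def predict(models, text, top_k=3):
    # Try the longest context first (4-gram), then back off to 3-gram and 2-gram.
    words = tokenize(text)
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough typed words for this n-gram size
        matches = [(c, g[-1]) for g, c in models[n].items() if g[:-1] == context]
        if matches:
            # Most frequent next word first; ties broken alphabetically.
            matches.sort(key=lambda m: (-m[0], m[1]))
            return [word for _, word in matches[:top_k]]
    return []
```

Calling `predict(models, "Thanks for the")` looks up the last three words in the 4-gram table and falls back to shorter contexts only when nothing matches. The linear scan inside `predict` keeps the sketch short; a real app would index the tables by context.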

Shiny App

  • The user types text in the sidebar panel, which triggers the server-side code to generate the predicted text
  • The predicted text is displayed on the main panel in semicolon-separated format
  • In my testing, the app loaded in around 10 seconds, after which the predictive text output was displayed instantaneously
  • Instant text predictions were achieved by preprocessing and optimizing the n-gram frequency data
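
One way such preprocessing could work is to reduce the frequency table ahead of time to a lookup from each context to its most frequent next words, so that a prediction is a single dictionary access. The Python sketch below is an assumption about the approach, not the app's actual code: `build_lookup`, `format_predictions`, the `top_k` limit and the exact separator spacing are all hypothetical.

```python
def build_lookup(ngram_counts, top_k=3):
    # Group candidate next words by their context (all words but the last).
    by_context = {}
    for gram, count in ngram_counts.items():
        by_context.setdefault(gram[:-1], []).append((count, gram[-1]))
    # Keep only the top_k most frequent next words for each context.
    lookup = {}
    for context, candidates in by_context.items():
        candidates.sort(key=lambda c: (-c[0], c[1]))
        lookup[context] = [word for _, word in candidates[:top_k]]
    return lookup

def format_predictions(words):
    # Join the predicted words in semicolon-separated form for display.
    return "; ".join(words)
```

The sorting cost is paid once when the table is built, which fits the observed behaviour of a slower initial load followed by immediate predictions.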

Conclusion