Angela Di Serio
May, 2016
Data Science Capstone
Word Prediction
SpeedyWords is a simple word prediction app that suggests the next word after you input a set of characters, words or a phrase.
The prediction model was built from three data sources: blogs, news and Twitter feeds. Because the full dataset was fairly large, the text corpus was created by combining a 20% sample from each of the three sources, which reduced the time needed for preprocessing and cleaning.
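The sampling step could be sketched as follows. This is only an illustration in Python, not the code behind the app: the function names, the fixed seed and the in-memory lists standing in for the three source files are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.2, seed=42):
    """Return a random sample containing `fraction` of the given lines."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)

def build_corpus(sources, fraction=0.2, seed=42):
    """Combine an equal-fraction sample from every source into one corpus.

    `sources` maps a source name (e.g. "blogs") to its list of lines.
    """
    corpus = []
    for name in sorted(sources):       # sorted for a deterministic order
        corpus.extend(sample_lines(sources[name], fraction, seed))
    return corpus
```

Sampling each source separately keeps blogs, news and Twitter represented in proportion to their own size, rather than letting the largest source dominate.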
The corpus was cleaned by removing profanity, punctuation, numbers and all non UTF-8 characters, converting all text to lowercase, and collapsing additional whitespace. Stopwords were left in, since they are part of normal language. In this first version of the app, stemming (reducing words to their root form) was not applied.
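A minimal sketch of these cleaning steps, written in Python for illustration. The profanity list here is a one-word placeholder (the list actually used is not given), and the order of the steps is an assumption.

```python
import re
import string

# Hypothetical placeholder; the real profanity list is not part of this document.
PROFANITY = frozenset({"darn"})

def clean_line(raw: bytes, profanity=PROFANITY) -> str:
    """Clean one raw line of text and return it as a normalized string."""
    # Drop any byte sequences that are not valid UTF-8
    text = raw.decode("utf-8", errors="ignore")
    # Convert to lowercase
    text = text.lower()
    # Replace runs of digits with a space
    text = re.sub(r"\d+", " ", text)
    # Delete ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop profanity; splitting and re-joining also collapses extra whitespace
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)
```

Stopwords are deliberately not filtered and no stemming is applied, matching the choices described above.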
The cleaned data sample was then tokenized into N-grams: 4-grams, 3-grams, 2-grams and 1-grams.
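Tokenizing into N-grams can be sketched as below. This is an illustrative Python version with assumed function names, using whitespace splitting on already-cleaned lines.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-word tuples found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=4):
    """Count 1-grams up to max_n-grams over a list of cleaned lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

counts = count_ngrams(["the cat sat on the mat"])
print(counts[1][("the",)])   # how often the 1-gram "the" occurs
print(len(counts[4]))        # number of distinct 4-grams
```

N-grams are built per line, so no N-gram spans two unrelated lines of the corpus.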
SpeedyWords is based on a Backoff N-gram model with Kneser-Ney smoothing.
In a backoff N-gram model, if the N-gram needed has a zero count, the app approximates it by backing off to the (N-1)-gram. It continues backing off until it reaches a history that has a nonzero count.
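The backoff idea can be illustrated with a short Python sketch. Note the simplification: it ranks candidates by raw counts and does not implement Kneser-Ney smoothing, so it is not the model used by the app. The names build_table and predict are assumptions made for this example.

```python
from collections import Counter, defaultdict

def build_table(sentences, max_n=4):
    """Map each history (a tuple of 0 to max_n-1 words) to next-word counts."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:          # not enough preceding words
                    break
                table[tuple(tokens[i - k:i])][word] += 1
    return table

def predict(table, text, max_n=4, top=3):
    """Suggest up to `top` next words, backing off to shorter histories."""
    words = text.lower().split()
    # Start from the longest usable history: the last max_n-1 words
    history = tuple(words[max(0, len(words) - (max_n - 1)):])
    while True:
        followers = table.get(history)
        if followers:                  # this history has been seen
            return [w for w, _ in followers.most_common(top)]
        if not history:                # even the unigram level is empty
            return []
        history = history[1:]          # back off: drop the oldest word

table = build_table(["i love new york", "i love new shoes", "i love you"])
print(predict(table, "we love"))       # "we love" is unseen, so it backs off to "love"
```

In the example call, the history "we love" never occurs in the training sentences, so the lookup falls back to the single-word history "love" and ranks the words seen after it.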
The SpeedyWords app is only 8.7 MB in size.
A prediction takes between 10 and 250 ms.
To try the application, go to https://adiserio.shinyapps.io/SpeedyWords/
Just start typing a word and SpeedyWords will provide at most three suggestions.
There is no submit button, since the application predicts while you are still in the middle of writing a word.
To predict the next word, include a space at the end of the last word.
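One way to tell the two input modes apart is to check for a trailing space. The sketch below is a hypothetical Python helper, not the app's actual input handling: it separates the completed words from the word still being typed.

```python
def split_input(text):
    """Split typed text into (completed words, partial word).

    A trailing space (or empty input) means the last word is finished,
    so the partial word is empty and the next word should be predicted.
    Otherwise the last token is still being typed and should be completed.
    """
    words = text.split()
    if text == "" or text.endswith(" "):
        return words, ""
    return words[:-1], words[-1]

print(split_input("i love "))   # last word finished: predict the next word
print(split_input("i lo"))      # "lo" is partial: suggest completions
```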