Ethan Bench
3/19/2017
This app aims to answer a common question in modern Natural Language Processing: “Given an input of text, what is the most likely next word?”
As a starting point, I was given a corpus of English text from the HC Corpora data set (www.corpora.heliohost.org), containing 800MB of news articles, blog articles and Twitter data. To keep computing power and memory requirements manageable, I sampled 25% of the data set to use for the app. I then standardized the text by converting it to lowercase and removing punctuation and numbers.
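The sampling and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code; the function names `sample_lines` and `clean_text`, the fixed random seed, and the choice to drop apostrophes (so "it's" becomes "its") are all assumptions made for the example.

```python
import random
import re


def sample_lines(lines, fraction=0.25, seed=42):
    """Keep each line with probability `fraction` (seed is an arbitrary assumption)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_text(line):
    """Lowercase the text and strip punctuation and numbers."""
    line = line.lower()
    # Assumption: apostrophes are dropped so contractions stay one token.
    line = line.replace("'", "")
    # Replace anything that is not a lowercase letter or whitespace with a space.
    line = re.sub(r"[^a-z\s]", " ", line)
    # Collapse repeated whitespace.
    return " ".join(line.split())


if __name__ == "__main__":
    print(clean_text("Hello, World! It's 2017."))
```

Sampling line by line with a fixed probability gives approximately, rather than exactly, 25% of the lines, which is usually acceptable for a corpus of this size.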
My solution to the problem uses the Ngram prediction method in conjunction with Katz's backoff modeling. From the sampled data set, I created lists of all one-, two- and three-word Ngrams that appeared in the data set, keeping only those that occurred more than once to eliminate uncommon word sequences with little predictive power.
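As an illustration of the Ngram lists described above, the following Python sketch counts word sequences of a given length and drops those seen only once. The helper names `count_ngrams` and `drop_singletons` are hypothetical, and the sketch assumes the text has already been cleaned and split into lines.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n consecutive words across the cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def drop_singletons(counts):
    """Keep only Ngrams that occurred more than once."""
    return Counter({gram: c for gram, c in counts.items() if c > 1})


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran"]
    print(drop_singletons(count_ngrams(corpus, 2)))
```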
The app takes any text input, assumes that all text blocks entered are whole words, and then predicts the most likely word to follow in standard English text. The output is a list of the 8 best choices in descending order of relevance. Any length of input can be handled; however, only the last 2 words in the sentence will impact the results of the next word list, which is similar to SwiftKey's text prediction app.
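The lookup described above can be sketched as a simple back-off over the last two words. Note that this is a simplified illustration: it ranks candidates by raw counts and falls through from the longest context to shorter ones, whereas full Katz backoff also applies discounting and back-off weights. The table layout (a dictionary keyed by a tuple of preceding words), the function name `predict_next`, and the choice to fill remaining slots from shorter contexts are assumptions, not the app's actual implementation.

```python
def predict_next(text, tables, k=8):
    """Return up to k candidate next words, trying the longest context first.

    `tables` maps a tuple of preceding words (length 2, 1, or 0) to a
    dictionary of {next_word: count}. This layout is hypothetical.
    """
    words = text.split()
    context = words[-2:]  # only the last two words influence the result
    suggestions = []
    for size in range(len(context), -1, -1):
        prefix = tuple(context[len(context) - size:])
        candidates = tables.get(prefix, {})
        # Most frequent followers first.
        for word, _count in sorted(candidates.items(), key=lambda kv: -kv[1]):
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


if __name__ == "__main__":
    toy_tables = {
        ("i", "love"): {"you": 5, "it": 3},
        ("love",): {"you": 9, "the": 4},
        (): {"the": 100, "to": 80},
    }
    print(predict_next("well i love", toy_tables, k=4))
```

With the toy tables shown, the three-word context contributes its followers first, and the shorter contexts only fill whatever slots remain.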
The app is constructed using 3 data frames that have been compressed. Given that the app only returns a maximum of 8 possible outcomes, any predicted words beyond the 8 most frequent for a given preceding word sequence were removed from each data frame. This reduced my data tables from 34MB to 12MB.
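One way to read the compression step above is as a per-prefix pruning: for each preceding word sequence, keep at most 8 followers. The Python sketch below illustrates that reading only; the function name `prune_top_k` and the dictionary-of-tuples layout are assumptions, and the app's actual data frames may be organized differently.

```python
import heapq
from collections import defaultdict


def prune_top_k(ngram_counts, k=8):
    """Keep only the k most frequent final words for each Ngram prefix."""
    by_prefix = defaultdict(list)
    for gram, count in ngram_counts.items():
        # Group by everything except the final (predicted) word.
        by_prefix[gram[:-1]].append((count, gram[-1]))
    pruned = {}
    for prefix, followers in by_prefix.items():
        for count, word in heapq.nlargest(k, followers):
            pruned[prefix + (word,)] = count
    return pruned


if __name__ == "__main__":
    counts = {("a", "b"): 5, ("a", "c"): 3, ("a", "d"): 1, ("x", "y"): 2}
    print(prune_top_k(counts, k=2))
```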
I tested my app against a test set of 20,646 samples from the original data set (0.5% of original data, not included in my training set) and found a 47.9% accuracy rate. The model is “accurate” when the actual next word is one of the 8 predicted words.
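The accuracy measure defined above (a hit whenever the actual next word is among the 8 predictions) can be expressed as a short function. This Python sketch is illustrative only; `top_k_accuracy`, the `(context, actual_word)` sample format, and the stand-in predictor in the usage example are hypothetical.

```python
def top_k_accuracy(samples, predict, k=8):
    """Fraction of samples whose actual next word is among the top k predictions.

    `samples` is a list of (context_text, actual_next_word) pairs and
    `predict` is any function mapping context text to a ranked word list.
    """
    if not samples:
        return 0.0
    hits = 0
    for context, actual in samples:
        if actual in predict(context)[:k]:
            hits += 1
    return hits / len(samples)


if __name__ == "__main__":
    # Stand-in predictor that always suggests the same two words.
    def toy_predict(_context):
        return ["you", "it"]

    held_out = [("i love", "you"), ("thank", "goodness")]
    print(top_k_accuracy(held_out, toy_predict))
```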
The app is simple to use. It can be found at: https://huskydawg44.shinyapps.io/WordPredictor/. Enter any amount of text into the text box on the left pane, and the prediction algorithm will return a list of 8 probable next words. The predictions update immediately whenever the input changes.
The app takes the input text, runs it through the same cleaning algorithm I used on the training data set, identifies the last two words (or one, if only one is entered), and passes them to the prediction algorithm. Blank and invalid entries produce appropriate error messages.
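The input handling described above might look something like the following Python sketch. The function name `extract_context`, the exact cleaning rules, and the wording of the error messages are assumptions for illustration; the app's own messages and code may differ.

```python
import re


def extract_context(raw_text, max_words=2):
    """Clean the input and return its last one or two words.

    Raises ValueError for blank input or input with no usable words.
    The error messages here are placeholders.
    """
    if not raw_text.strip():
        raise ValueError("Please enter some text.")
    # Same style of cleaning as the training data: lowercase, letters only.
    cleaned = re.sub(r"[^a-z\s]", " ", raw_text.lower().replace("'", ""))
    words = cleaned.split()
    if not words:
        raise ValueError("The input contains no recognizable words.")
    return words[-max_words:]


if __name__ == "__main__":
    print(extract_context("See you at 5, Friday!"))
```

Cleaning the input with the same rules as the training data matters because the lookup tables only contain lowercase, punctuation-free words.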
A fun exercise is to make an initial input, then continue to add one word at a time by typing one of the predicted words, and repeating this process to see what forms.
Ngram - https://en.wikipedia.org/wiki/N-gram
Katz's Backoff model - https://en.wikipedia.org/wiki/Katz%27s_back-off_model
SwiftKey - https://en.wikipedia.org/wiki/SwiftKey
Specific data set used - https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Full Code of project - https://github.com/huskydawg44/WordPrediction-Shiny