Mitch Fawcett
April 14, 2016
A Word Prediction Application Using R
Introduction
A feature seen in many texting applications is the ability to suggest the next word that a user will type based on words they have already entered.
Word Genie is a demonstration of that capability built with R, the statistical programming language.
Feel free to explore the application from here, but be sure to come back for a description of how it works.
https://myshinyacct.shinyapps.io/WordGenie2/
I hope you will consider incorporating Word Genie technology into your text-driven application!
Methodology
To develop Word Genie, I began by analyzing a large corpus of phrases drawn from the Internet, including Twitter messages, news stories, and blogs. The corpus contained approximately 75 million words in total, providing all the raw data needed to train the word prediction algorithm behind Word Genie.
Using the tm package in R, the corpus was first preprocessed to remove punctuation, numbers, extra whitespace, and profanity. The cleaned data was then used to identify approximately 10 million unique n-grams (2-, 3-, and 4-word phrases) and to calculate their frequencies.
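The sketch below shows what this preprocessing pipeline might look like. The file names, the profanity list, and the simple base-R sliding-window tokenizer are placeholders for illustration, not the exact code behind Word Genie.

```r
library(tm)

# Illustrative inputs: the real corpus files and profanity list will differ.
texts     <- readLines("en_US.twitter.txt", skipNul = TRUE)
profanity <- readLines("profanity.txt")

# Clean the corpus: lowercase, then remove punctuation, numbers,
# profanity, and extra whitespace.
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)

# Count n-grams of length n with a simple sliding window.
count_ngrams <- function(words, n) {
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

words <- unlist(strsplit(unlist(lapply(corpus, as.character)), "\\s+"))
words <- words[nzchar(words)]
freq2 <- count_ngrams(words, 2)
freq3 <- count_ngrams(words, 3)
freq4 <- count_ngrams(words, 4)
```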
Backoff Models
The millions of n-grams were loaded into a backoff model. Backoff models rest on the Markov assumption: the probability of the next word in a sentence depends mostly on the few words immediately preceding it (think Markov chains). Words occurring more than 4 or 5 positions earlier can reasonably be ignored when making the prediction.
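Stated symbolically, this is a Markov assumption of order three: only the last three words matter when predicting the next one.

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})$$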
In Word Genie, the last three words the user typed are compared against the first three words of each 4-gram (4-word phrase). If a match is found, the 4th word of the 4-gram is returned as the predicted next word. If multiple matches are found, the most frequently occurring 4th word is returned.
If no match is found in the 4-grams, the process is repeated with the 3-grams, and if no match is found in the 3-grams, the 2-grams are used. Finally, if no match is found even in the 2-grams, a random word is returned from a list of the 25 most frequent words.
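Below is a minimal sketch of this backoff lookup. The lookup tables grams4, grams3, and grams2 (data frames with prefix and word columns, each sorted by descending frequency) and the top25 vector are assumed structures for illustration, not Word Genie's exact internals.

```r
# Predict the next word by backing off from 4-grams to 3-grams to 2-grams.
# grams4/grams3/grams2: data frames with columns `prefix` and `word`,
# sorted by descending frequency. top25: the 25 most common words.
predict_word <- function(input, grams4, grams3, grams2, top25) {
  words <- strsplit(tolower(input), "\\s+")[[1]]
  words <- tail(words[nzchar(words)], 3)   # keep at most the last 3 words
  if (length(words) == 0) return(sample(top25, 1))

  # Try the longest available prefix first, then back off to shorter ones.
  for (n in seq(length(words), 1)) {
    prefix <- paste(tail(words, n), collapse = " ")
    grams  <- switch(n, grams2, grams3, grams4)  # n-word prefix -> (n+1)-grams
    hits   <- grams[grams$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$word[1])     # most frequent completion
  }
  sample(top25, 1)  # nothing matched: fall back to a very common word
}
```

For example, predict_word("thanks for the", grams4, grams3, grams2, top25) would return the most frequent word observed after "thanks for the" in the training data, backing off to "for the" and then "the" if no 4-gram matches.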
Results
When tested against 1,000 randomly selected phrases from Twitter, the model made 158 correct next-word predictions. With news phrases it made 167 correct predictions out of 1,000, and with blog phrases, 168 out of 1,000. These results are very good considering the simplicity and low cost of the model. Accuracy can be improved by expanding the size of the training corpus and identifying additional n-grams. Response time can be improved by replacing sequential scans with binary search and by hashing the n-gram prefixes.
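To illustrate the hashing idea (one possible approach, not the deployed code), R environments provide hash-based storage, so a prefix can be resolved in roughly constant time regardless of how many n-grams are stored:

```r
# Build a hash table mapping each n-gram prefix to its best completion.
lookup <- new.env(hash = TRUE)
assign("to be or", "not", envir = lookup)  # prefix -> predicted word

# Constant-time retrieval; returns NA when the prefix was never seen.
get0("to be or", envir = lookup, ifnotfound = NA)
```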