Seth Green
December 2014
The beauty is in the simplicity: a simple, clean interface with a quick response. WatchaGon takes text input in real time and runs it through a carefully selected and cleaned table of 2-, 3- and 4-grams to determine the most likely next word in the sentence. There is no real-time processing or probability calculation under the hood, just quick matching and sorting.
The secret is that all of the heavy lifting is done in pre-processing. A corpus of close to 3 million news articles, blog entries, and tweets was analysed, and counts of the most common 2-, 3- and 4-grams were tabulated. The n-grams were then sorted by frequency count into a data frame with each word in a separate column.
    Freq   w1   w2    w3   w4 tag
1   4703  the rest    of  the   4
2   4212  for  the first time   4
3  20185 <NA>    a   lot   of   3
4  13571 <NA>   to    be    a   3
5 287913 <NA> <NA>    in  the   2
6  60648 <NA> <NA>     I have   2
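A table of this shape could be built by a pre-processing step along the following lines. This is a minimal sketch, not the actual WatchaGon pipeline: count_ngrams is a hypothetical helper, tokens is assumed to be a character vector of cleaned, tokenised words from the corpus, and the pruning down to only the most common n-grams is omitted.
# Hypothetical helper: count all n-grams in a token vector and
# right-align the words into w1..w4 so that w4 is always the
# word being predicted.
count_ngrams <- function(tokens, n) {
  idx   <- seq_len(length(tokens) - n + 1)
  grams <- vapply(idx, function(i)
    paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
  freq  <- sort(table(grams), decreasing = TRUE)
  parts <- do.call(rbind, strsplit(names(freq), " ", fixed = TRUE))
  pad   <- matrix(NA_character_, nrow = nrow(parts), ncol = 4 - n)
  words <- cbind(pad, parts)   # leading columns are <NA> for short n-grams
  data.frame(Freq = as.integer(freq),
             w1 = words[, 1], w2 = words[, 2],
             w3 = words[, 3], w4 = words[, 4],
             tag = n, stringsAsFactors = FALSE)
}

# Stack the 4-, 3- and 2-gram counts into one lookup data frame.
preds <- rbind(count_ngrams(tokens, 4),
               count_ngrams(tokens, 3),
               count_ngrams(tokens, 2))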
When text is entered into WatchaGon, the last three words typed are assigned to w1, w2 and w3 respectively, while w4 functions as the predicted word. The algorithm then looks for an exact match of each word in its respective column, starting with w3.
# Keep exact w3 matches; if none, drop the 2-grams (tag == 2) instead.
if (any(preds$w3 == pwords[3])) {
  preds <- preds[preds$w3 == pwords[3], ]
} else {
  preds <- preds[preds$tag != 2, ]
}
If no matches of the final word are found in w3, all 2-grams (signified in the data frame as tag==2) are thrown out of the running. However, the algorithm still looks for matches of the second-to-last word in w2 and the third-to-last word in w1. This creates some semi-long-range dependencies, which would be unavailable in a strict “backoff” model.
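The w2 and w1 passes might look something like the following, a sketch under the assumption that they mirror the w3 step (the full body of predictor() is not shown here). Note that a miss in w2 or w1 simply leaves that column unconstrained rather than discarding anything:
# Filter on w2 and w1 only when an exact match exists; a miss leaves
# that column unconstrained, which is what allows the "a NA special"
# style of match described below. which() drops the <NA> rows.
if (any(preds$w2 == pwords[2], na.rm = TRUE)) {
  preds <- preds[which(preds$w2 == pwords[2]), ]
}
if (any(preds$w1 == pwords[1], na.rm = TRUE)) {
  preds <- preds[which(preds$w1 == pwords[1]), ]
}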
The phrase “a really special day” came through in a random test set.
predictor(c("a", "really", "special"))
      Freq w1   w2      w3  w4 tag
61654   15  a very special day   4
As you can see, the results correctly predicted “day” as w4, even though there was no match for w2. However, w1 matched “a”, making the effective match “a NA special”, which triggered the correct 4-gram. A simple backoff would have either reverted to looking for “really” earlier in the sentence, losing the syntactic structure, or settled for a 2-gram of “special __”, which in our model produces “special needs.”
        Freq   w1   w2      w3    w4 tag
1131974  278 <NA> <NA> special needs   2
If no exact match is found in any column, the algorithm starts over, using pattern matching to attempt to find a partial match.
# Second pass: treat the typed word as a pattern rather than an exact string.
grep(pwords[3], preds$w3, ignore.case = TRUE)
This accounts for some stemming and lemma matches, and also saves some misses that would have occurred from misspellings and slang.
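For instance (an illustrative call, not output from the app), an informal form like “goin” still partial-matches the corpus form “going”:
grep("goin", c("going", "gone", "go"))   # returns 1: "going" contains "goin"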
The data frame is ordered first by tag (indicating 4-, 3- or 2-gram) and then by frequency count. Therefore, a matching 4-gram will always trump a 2-gram, even if the 2-gram has a greater frequency. However, the organization and matching by columns, instead of simple backoff or interpolation, gives some interesting results, as shown earlier in the “special day” example.
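That ordering can be produced by a single sort in pre-processing (a sketch; the actual pre-processing code is not shown):
# Higher-order n-grams first (tag 4, 3, 2), then most frequent first.
preds <- preds[order(-preds$tag, -preds$Freq), ]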
The predictor() function at the heart of WatchaGon was run through a testing function which randomly selects strings of four consecutive words from a held-out portion of the original corpus and attempts to predict the fourth from the previous three.
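A testing function along these lines would do the job. This is a minimal sketch, assuming held_out is a character vector of tokenised words from the held-out corpus and that predictor() returns its candidate rows with the best match first; the author's actual testing code is not shown:
# Hypothetical test harness: sample random 4-word windows and check
# whether the top-ranked prediction matches the true fourth word.
test_predictor <- function(held_out, trials = 1000) {
  hits <- 0
  for (i in seq_len(trials)) {
    start <- sample(length(held_out) - 3, 1)   # random 4-word window
    chunk <- held_out[start:(start + 3)]
    top   <- predictor(chunk[1:3])$w4[1]       # top-ranked predicted word
    if (identical(top, chunk[4])) hits <- hits + 1
  }
  hits / trials                                # proportion predicted correctly
}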
WatchaGon averaged between 12.6% and 15% accuracy when randomly tested on the held-out corpus. This is in line with comparable n-gram-based prediction models.
And happy predicting!