Óscar Villa
14 March 2017
Implementation of the Stupid Backoff model for an algorithm that predicts possible next words, based on careful pruning of N-grams to improve speed and reduce RAM consumption.
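For context, the Stupid Backoff score (Brants et al., 2007) is a relative frequency with a fixed backoff penalty rather than a normalized probability; the formula below is the published one, quoted here for reference only:

    S(w_i | w_{i-k+1} ... w_{i-1}) = count(w_{i-k+1} ... w_i) / count(w_{i-k+1} ... w_{i-1})   if that count is > 0
                                   = 0.4 * S(w_i | w_{i-k+2} ... w_{i-1})                      otherwise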
The data was taken from a corpus called HC Corpora at https://www.heliohost.org. It contains text gathered from blogs, news sites, and Twitter, from which 0.6 was taken for “training” the model, 0.2 for validation, and 0.2 for the final performance measurement. Then all hyphens, apostrophes, dots, #, \, ?, / and any other punctuation, as well as “www”, “RT”, numbers, and extra spaces were removed. The ends of sentences (EOS) were marked up and used as phrase splitters. Finally, everything was lowercased. Given the aim of the predictor, the so-called “stop words” were not removed, because we may well want to predict stop words such as “the”, “to”, or “of”.
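A minimal sketch in base R of the kind of cleaning just described (the function name, the regular expressions, and the exact order of steps are illustrative assumptions, not the code actually used):

    clean_text <- function(x) {
      x <- tolower(x)                                 # lowercase everything
      x <- gsub("www[^[:space:]]*|\\brt\\b", " ", x)  # drop web addresses and retweet markers
      x <- gsub("[.!?]+", " EOS ", x)                 # mark the ends of sentences (EOS)
      x <- gsub("[[:punct:]]+|[0-9]+", " ", x)        # strip remaining punctuation and numbers
      gsub("[[:space:]]+", " ", trimws(x))            # collapse extra spaces
    }

    clean_text("By the way: visit www.example.com! RT my dream...")
    # "by the way visit my dream EOS"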
Because of the size of the sample and the PC on which all the preparation was done (an Intel i7 with 8 GB of RAM), the text2vec library (http://text2vec.org/) was very helpful, even though it was still necessary to split the data into chunks and parallelize the code to get the vocabulary (the number of times each N-gram appears in the corpus) for one-grams, two-grams, and tri-grams. Then bigrams and trigrams were split into root and word. For each root we keep only the three most frequent words, reducing the bigrams to 0.10 of the sample and the trigrams to 0.60. Finally, we left out the trigrams that appear just once (see the sketch after the table below).
  class     token        root     word
1 bigram    my_dream     my       dream
2 trigram   by_the_way   by_the   way
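As a sketch of this counting-and-pruning step, assuming text2vec's itoken()/create_vocabulary() API (which joins N-gram words with underscores, as in the table above) and data.table; clean_train (a pre-cleaned training chunk) and the other object names are illustrative:

    library(text2vec)
    library(data.table)

    it <- itoken(strsplit(clean_train, " "), progressbar = FALSE)
    vocab3 <- create_vocabulary(it, ngram = c(3L, 3L))   # trigram counts, e.g. "by_the_way"

    dt <- as.data.table(vocab3)[, .(term, term_count)]
    dt <- dt[term_count > 1]                             # leave out trigrams seen just once
    dt[, root := sub("_[^_]+$", "", term)]               # everything but the last word
    dt[, word := sub("^.*_", "", term)]                  # the last word
    setorder(dt, root, -term_count)                      # most frequent first within each root
    trigrams <- dt[, head(.SD, 3L), by = root]           # keep the three top words per root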
When you type or paste some text into the textbox, the algorithm cleans it up in the same way as the training text. It then takes the last two words of the phrase and looks them up among the trigram roots, returning the three most probable (most frequent) words. If no such words are found, it keeps just the last word (likewise if the phrase is only one word long) and looks that word up among the bigram roots, returning the words found there. If all this fails, the algorithm simply returns the three most frequent one-grams.
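A sketch of this backoff lookup, assuming trigram and bigram tables like the one built above (data.tables with root/word columns, already sorted by frequency within each root) plus a unigram table sorted by frequency; all names are illustrative:

    predict_next <- function(phrase, trigrams, bigrams, unigrams) {
      w <- strsplit(clean_text(phrase), " ")[[1]]
      n <- length(w)
      if (n >= 2) {                       # look up the last two words among the trigram roots
        hits <- trigrams[root == paste(w[n - 1], w[n], sep = "_"), word]
        if (length(hits) > 0) return(head(hits, 3L))
      }
      if (n >= 1) {                       # back off: look up the last word among the bigram roots
        hits <- bigrams[root == w[n], word]
        if (length(hits) > 0) return(head(hits, 3L))
      }
      head(unigrams$term, 3L)             # last resort: the three most frequent unigrams
    }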
First load: around 11 seconds. Refreshing to predict the next three words: less than a second (~0.46 s). Memory on the shinyapps.io instance: 450 MB. Top-3 accuracy (the correct word is among the three offered): 0.14.
Predicting the next word implies a tradeoff between the size of the sample, the computational cost of processing it, the in-memory constraints inherent to R, and the speed and memory consumption of the final deployed app.
I realized that, since we only ever predict the three most frequent/probable words, we do not need the full sample of N-grams; thanks to that, the speed and memory constraints of the final app are solved.
Now, please, just go to https://oscarvilla.shinyapps.io/textpredictor/ and type or paste some text into the textbox. In less than half a second (literally), three words will appear on the right, sorted from most to least probable, each one labeled. It's that easy. Go try it!