Ian Donaldson
December 14th, 2014
You can try it here: https://iandonaldson.shinyapps.io/wordPredict/
The text entered by the user is pre-processed using the same set of regular expressions that were used to pre-process the corpus.
If the user enters: Now iS … :) the “time”
it becomes: now is the time
and this quadragram is compared to an index of all pentagrams that were observed in the training corpus and the word that most frequently occurs after this quadragram is returned.
If the user enters fewer words (say n words), then the corresponding (n+1)gram table is consulted. The result is returned for the longest matched n-gram that was observed at least 5 times.
40% of the blogs.txt corpus was used as a training set.
The index of all unigrams up to pentagrams was generated in about 1 hour and had over 15000 pentagrams that had been observed at least 5 times.
This index is 15 MB
see the index.R file for code (link at the end of the presentation)
An optimization was found that sped up the slowest step of the indexing step by 300 fold - this involved using the fast aggregation capabilities of the data.table package.
Unfortunately, there was not enough time to implement these code changes and test them. This optimization would allow much larger corpus sizes to be generated, tested and compared.
More about the corpus and an interim report on the analysis can be found here: https://rpubs.com/ian/41353
You can find code here: https://github.com/iandonaldson/wordPredict This is not all the code, but it will allow you to reproduce the application and index.
The app is here: https://iandonaldson.shinyapps.io/wordPredict/
Thanks