Jose Gustavo Z. Rosa
May/2017
The project concept is based on SwiftKey's need to "predict" the next word in a text. To do so, a textual dataset was provided in 4 different languages (English, German, Russian and Finnish). The key aspect is to use text mining and some NLP techniques to build a dataset good enough for:
I approached this project from a streamlined point of view, in which I separated different objectives into different files. So at this point the project contains 4 folder structures as follows:
After investing some time researching more sophisticated algorithms and R packages, I found the Katz back-off model, which I believe fits the bill, and there are different implementations of it in R.
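To give a rough idea of how a back-off model picks the next word, here is a minimal sketch. It is written in Python rather than R, it is not the code behind the app, and it makes two simplifying assumptions: it only backs off from bigrams to unigrams, and it uses a fixed absolute discount (the hypothetical `discount=0.5`) in place of the Good-Turing discounting of the original Katz formulation. All function names are made up for illustration.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_prob(word, prev, unigrams, bigrams, discount=0.5):
    """P(word | prev) with a simplified, Katz-style back-off.

    Assumption: a fixed absolute discount stands in for Good-Turing.
    """
    total = sum(unigrams.values())
    # Words observed right after `prev`, with their bigram counts.
    seen = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    if not seen:
        # No bigram evidence at all: fall back to the unigram estimate.
        return unigrams[word] / total
    prev_count = sum(seen.values())
    if word in seen:
        # Observed bigram: use its discounted relative frequency.
        return (seen[word] - discount) / prev_count
    # Unobserved bigram: share the discounted mass among unseen words,
    # in proportion to their unigram counts.
    left_over = discount * len(seen) / prev_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    if unseen_mass == 0:
        return 0.0
    return left_over * unigrams[word] / unseen_mass


def predict_next(prev, unigrams, bigrams, k=3):
    """Return the k most probable next words after `prev`."""
    ranked = sorted(
        unigrams,
        key=lambda w: backoff_prob(w, prev, unigrams, bigrams),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    toy_tokens = "the cat sat on the mat the cat ran".split()
    uni, bi = train(toy_tokens)
    print(predict_next("the", uni, bi))
```

Scanning every bigram on each call is fine for a toy corpus but too slow for real data, where the counts would be indexed by the preceding word first.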
Because of the size of the data files and performance, I had to reduce my data files so that the Katz model would run a bit faster. To do so, I worked on reducing the text sparsity, especially for bigrams, from 0.56 down to 0.28.
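The sketch below illustrates one common way to measure and reduce sparsity. It assumes that sparsity means the share of empty cells in a document-by-bigram count matrix, and that it is reduced by dropping bigrams that occur in too few documents. This write-up does not say how the figures above were actually computed, so treat both the definition and the hypothetical `min_docs` threshold as assumptions. It is Python, not the R code used in the project.

```python
from collections import Counter


def bigram_doc_counts(docs):
    """Return one Counter of bigram counts per document."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()
        rows.append(Counter(zip(tokens, tokens[1:])))
    return rows


def sparsity(rows):
    """Fraction of empty cells in the document-by-bigram matrix."""
    terms = set()
    for row in rows:
        terms.update(row)
    cells = len(terms) * len(rows)
    if cells == 0:
        return 0.0
    filled = sum(len(row) for row in rows)
    return 1 - filled / cells


def drop_rare_terms(rows, min_docs):
    """Keep only bigrams that appear in at least `min_docs` documents."""
    doc_freq = Counter(term for row in rows for term in row)
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    return [
        Counter({t: c for t, c in row.items() if t in keep})
        for row in rows
    ]


if __name__ == "__main__":
    toy_docs = ["the cat sat", "the cat ran", "a dog ran"]
    rows = bigram_doc_counts(toy_docs)
    print("before:", sparsity(rows))
    print("after: ", sparsity(drop_rare_terms(rows, min_docs=2)))
```

The trade-off is that every dropped bigram is one the model can no longer predict directly, so it has to back off to shorter n-grams more often.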
The Shiny app ran OK with that amount of data, so I chose not to brush up the UI and instead invested more time in refining the n-gram generation script, using not only the tm package but also the quanteda package, which I think ran better than tm, especially for term matrix analysis and overall clean-up operations.
And here is my app: TheNextWord
Please remember: patience is a virtue!