Bruno Hanzen
2017/03/21
The Coursera Data Science Specialization provides broad training in the core skills of the discipline, including, but not limited to, the R language, statistics, presentation skills, and web application development.
The program concludes with the Capstone Project. Its objective is to build a predictive text model that "predicts" the next word to be typed at the keyboard, based on the preceding words.
We built a parametrized model that is highly flexible and can be adapted to the needs of different applications. It is made up of two parts, described below: a training phase and a runtime predictor.
Coursera provided a compilation of news articles, blog posts, and tweets prepared by SwiftKey (556 MB).
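For illustration, the corpus can be loaded in R as sketched below. The directory and file names follow the usual SwiftKey dataset layout and are assumptions, not taken from this report:

    # Minimal sketch: read the three English corpus files.
    # File names are assumed from the standard SwiftKey layout.
    blogs  <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news   <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)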
In the training phase, we analyse a text corpus and extract the n-gram information that the runtime will use. This is the most resource-intensive part of the process and can require substantial computing resources (memory, processor).
In our case, we used a PC with 8 GB of RAM for the training phase, and we had to apply some "tricks" to be able to process the files provided by Coursera and SwiftKey.
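One such trick, shown as a hedged sketch below (our exact pipeline may differ), is to train on a random sample of lines and to prune rare n-grams while counting, which keeps the tables within memory. The sampling rate and frequency cutoff are illustrative values:

    # Sample a fraction of the corpus lines read in the previous sketch.
    set.seed(42)
    sample_frac <- 0.10                          # hypothetical sampling rate
    lines <- sample(c(blogs, news, tweets))
    lines <- head(lines, floor(sample_frac * length(lines)))

    # Crude normalization and whitespace tokenization; a real pipeline
    # would also handle punctuation, numbers, profanity, etc.
    tokens <- strsplit(gsub("[^a-z' ]", " ", tolower(lines)), "\\s+")

    # Count n-grams of a given order, dropping rare ones to save memory.
    count_ngrams <- function(tokens, n, min_count = 2) {
      grams <- unlist(lapply(tokens, function(w) {
        w <- w[nzchar(w)]
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "),
               character(1))
      }))
      counts <- table(grams)
      counts[counts >= min_count]
    }

    unigrams <- count_ngrams(tokens, 1, min_count = 1)
    bigrams  <- count_ngrams(tokens, 2)
    trigrams <- count_ngrams(tokens, 3)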
The runtime part contains the prediction algorithm proper, an implementation of Katz back-off. The exact description can be found on Wikipedia (https://en.wikipedia.org/wiki/Katz%27s_back-off_model).
The basic principle is to look up the last 1, 2, 3, … typed words in the n-gram tables and to assign each candidate successor word a probability based on its occurrence frequency in the corpus. The highest-order n-grams take precedence, and a "discount" process makes the probabilities of different n-gram orders comparable.
We used constant discount coefficients. A more sophisticated scheme such as Good-Turing would have been possible, but it would have added complexity, and our tests showed that precision is rather insensitive to the discount parameters.
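The sketch below illustrates the back-off lookup with a single constant discount d, assuming the count tables built above. It is a simplified illustration, not our exact code: a full Katz implementation would also redistribute the discounted probability mass to lower orders via back-off weights, which we omit here for brevity:

    # Hedged sketch of the back-off lookup. Tables map "w1 w2 w3" -> count;
    # d is an illustrative constant discount.
    predict_next <- function(history, trigrams, bigrams, unigrams, d = 0.5) {
      w <- tail(strsplit(tolower(history), "\\s+")[[1]], 2)

      # Try the trigram table first: the highest order has precedence.
      if (length(w) == 2) {
        prefix <- paste(w, collapse = " ")
        hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
        ctx  <- bigrams[prefix]
        if (length(hits) > 0 && !is.na(ctx)) {
          p <- (hits - d) / ctx                 # discounted probabilities
          return(sub(".* ", "", names(which.max(p))))
        }
      }

      # Back off to the bigram table.
      prefix <- tail(w, 1)
      hits <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
      ctx  <- unigrams[prefix]
      if (length(hits) > 0 && !is.na(ctx)) {
        p <- (hits - d) / ctx
        return(sub(".* ", "", names(which.max(p))))
      }

      # Last resort: the most frequent word in the corpus.
      names(which.max(unigrams))
    }

A call such as predict_next("thanks for the", trigrams, bigrams, unigrams) would then return the most likely continuation found in the tables.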
We used the Coursera-supplied benchmark to test precision and runtime. We achieved an "overall top-3 precision" of 19.34% with an average runtime of 57.41 ms. The maximum precision we could reach was 20.5%, at the cost of more than 1,000 ms runtime.
We also ran a baseline in which the predicted word was simply the most frequent word in the corpus ("the"). Its precision was 10.78%, with an 8 ms runtime; our application's precision is nearly twice as good.
The application is deployed at https://bruno-hanzen.shinyapps.io/NextWord/.