Bastian Huntgeburth
2019-05-11
This presentation gives a short overview of the results of my capstone project for the Data Science Specialization offered by Johns Hopkins University. The goal of the project was to develop a data product that predicts the next word in human-written text, similar to the well-known SwiftKey product. The resulting application can be accessed with this link.
The following main steps were carried out during the data science project.
After the data was loaded into a corpus and cleaned, it was ready for building a model.
As a first approach, we chose a Markov chain as an appropriate model for predicting the next word. A Markov chain assumes that the next state depends only on a limited number of previous states: \[ P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-n} \dots w_{i-1}) \]
The states of the textual data were represented by the tokens (words) of the corpus. We built every possible Markov chain of up to 4 consecutive words (bi-, tri- and fourgrams). The last word of each n-gram is the prediction, based on the words before it. The count of each chain was then calculated, and the resulting frequency was converted into a relative probability. All chains were saved in large lookup tables for later prediction.
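The table-building step described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function name `build_lookup`, the toy sentence, and the choice to normalize each count by the total count of its context are all assumptions.

```python
from collections import Counter, defaultdict


def build_lookup(tokens, max_n=4):
    """Map each context (1 to max_n - 1 words) to {next_word: relative probability}.

    Hypothetical helper: probabilities are assumed to be normalized per context.
    """
    counts = defaultdict(Counter)
    # Collect bi-, tri- and fourgrams; the last word of each n-gram is the target.
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, target = tokens[i:i + n]
            counts[tuple(context)][target] += 1

    # Convert raw counts into relative probabilities within each context.
    table = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        table[context] = {word: c / total for word, c in counter.items()}
    return table


# Toy corpus (made up for illustration only).
tokens = "i like tea and i like coffee and i drink tea".split()
table = build_lookup(tokens)
print(table[("i", "like")])
```

In this toy corpus the context `("i", "like")` is followed once by "tea" and once by "coffee", so both receive a relative probability of 0.5.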
The application is available at this link.
To use the data product, just type your text into the text area. Don't forget the space after your last word. The next word is predicted automatically. The predicted words are shown as clickable buttons, ordered by their probability. If you click a button, the word is appended to your text. The application was tested with the Firefox browser.
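How the ranked suggestions could be read from a lookup table is sketched below in Python. The function `predict_next`, the toy table, and especially the fallback from longer to shorter contexts are assumptions made for illustration; the presentation does not state how contexts missing from the tables are handled.

```python
def predict_next(text, table, max_context=3, top_k=3):
    """Return up to top_k candidate next words, most probable first.

    Hypothetical sketch: tries the longest available context first and
    falls back to shorter ones (an assumption, not confirmed by the source).
    """
    words = text.lower().split()
    for n in range(min(max_context, len(words)), 0, -1):
        candidates = table.get(tuple(words[-n:]))
        if candidates:
            # Sort by descending probability; ties are broken alphabetically.
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:top_k]]
    return []


# Toy lookup table (made-up probabilities for illustration only).
toy_table = {
    ("i",): {"like": 0.6, "drink": 0.3, "am": 0.1},
    ("i", "like"): {"tea": 0.5, "coffee": 0.5},
}

suggestions = predict_next("and I like ", toy_table)
print(suggestions)
```

Each returned word would correspond to one clickable button, in the same order.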
For reproducibility, I am sharing all my material.
Further steps