B. Schwenk
2-1-2019
For the Data Science Capstone project I created a word prediction app. In general the following steps were taken to get to the end result: exploring the corpora, extracting N-grams, filtering and saving the N-gram tables, and predicting the next word.
The app is easy to use: just enter a sentence and click the “Predict” button. The algorithm automatically predicts the best next word, and also shows a few alternative words along with some quality metrics.
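To make the interaction concrete, here is a minimal sketch of such an interface. The text does not state which framework the app uses; Shiny is assumed here, and `predict_next()` is a hypothetical prediction function sketched at the end of this document.

```r
library(shiny)

ui <- fluidPage(
  textInput("sentence", "Enter a sentence:"),
  actionButton("predict", "Predict"),
  tableOutput("result")   # best word, alternatives, and their probabilities
)

server <- function(input, output) {
  # Run the prediction only when the button is clicked.
  prediction <- eventReactive(input$predict, {
    predict_next(input$sentence, lookup)  # hypothetical; sketched at the end
  })
  output$result <- renderTable(prediction())
}

shinyApp(ui, server)
```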
The first step was data exploration of the very large Twitter, Blogs, and News corpora.
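A minimal sketch of that exploration step; the file names are the ones from the standard capstone dataset and should be treated as assumptions:

```r
library(stringi)

# File names from the standard capstone dataset (an assumption).
files <- c(twitter = "en_US.twitter.txt",
           blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt")

# Line and word counts per corpus.
stats <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines = length(lines), words = sum(stri_count_words(lines)))
})
t(stats)
```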
The main step in data extraction is the creation of N-grams with the R tm package. After basic text cleaning and preparation, I focused on extracting 2-, 3-, 4-, and 5-grams. NB: unigrams are less useful for prediction and are only used to break ties between equally probable candidates, or as a last resort when the input is not in the dataset at all.
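The extraction could look roughly like the sketch below. The text only confirms that the tm package was used; the RWeka tokenizer and the exact cleaning steps are assumptions.

```r
library(tm)
library(RWeka)   # NGramTokenizer; an assumption, any n-gram tokenizer would do

# Assume `texts` is a character vector of lines sampled from the corpora.
corpus <- VCorpus(VectorSource(texts))

# Basic cleaning/preparation, as described above.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Tokenizer for bigrams; change min/max to 3, 4, or 5 for the longer N-grams.
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
bigram_counts <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(bigram_counts)
```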
The 3-, 4-, and 5-grams proved most useful for getting next words that really fit the sentence. Because the N-gram set is kept limited in size, to strike a good balance between speed and prediction performance, not every sentence has a 3-, 4-, or 5-gram match; the bigrams cover these cases and give a higher hit rate on rare words. Counts: 3-, 4-, and 5-grams: 300k; bigrams: 111k; unigrams: 28k. A random sample (about 10%) was set aside to serve as a test set.
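The hold-out split itself is straightforward. A sketch, assuming the sampled lines live in a character vector `texts` and using an arbitrary seed:

```r
set.seed(123)  # arbitrary seed, for reproducibility
n <- length(texts)
test_idx  <- sample(n, size = round(0.10 * n))  # roughly 10%, per the text
test_set  <- texts[test_idx]
train_set <- texts[-test_idx]
```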
I chose to gather all N-grams, filter them down to only the most important ones, and save these to file. The file also contains, for each N-gram, the next word to predict, its count, and its probability. This saves considerable execution time when running the model.
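A sketch of how such a precomputed lookup file could be built from raw N-gram counts; the column names, the top-5 cutoff, and the dplyr/tidyr approach are assumptions, not the original implementation:

```r
library(dplyr)
library(tidyr)

# Assumed input: one data frame of gathered n-grams and their raw counts,
# e.g. built from the term-document matrices above.
ngrams <- data.frame(ngram = names(bigram_counts),
                     count = as.integer(bigram_counts))

lookup <- ngrams %>%
  # Split each n-gram at its last space: prefix (context) vs. word to predict.
  separate(ngram, into = c("prefix", "next_word"), sep = " (?=[^ ]+$)") %>%
  group_by(prefix) %>%
  mutate(prob = count / sum(count)) %>%  # probability of next_word given prefix
  slice_max(count, n = 5) %>%            # keep only the most important candidates
  ungroup()

saveRDS(lookup, "ngram_lookup.rds")      # precomputed table loaded by the app
```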
The app uses these N-gram tables to predict the next word, starting from the longest matching N-gram and backing off to shorter ones when no match is found.
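Finally, a minimal sketch of that prediction step over the `lookup` table built above; the original app's exact back-off rules and quality metrics are not described in the text, so this is an illustration, not the author's implementation.

```r
# Hypothetical back-off predictor over the `lookup` table built above.
predict_next <- function(sentence, lookup, max_context = 4) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  # Try the longest available context first (a 5-gram match uses 4 words),
  # then back off to shorter contexts down to a single word (bigram match).
  for (n in seq(min(max_context, length(words)), 1)) {
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- lookup[lookup$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(hits[order(-hits$prob), c("next_word", "prob")])
    }
  }
  NULL  # nothing matched in any n-gram table
}

predict_next("thanks for the", lookup)
```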