B. Schwenk
2-1-2019
For the Data Science Capstone project I created a word prediction app. In general, the following steps were taken to get to the end result:
The app is easy to use: just enter a sentence and click the “Predict” button. The algorithm automatically predicts the most likely next word, and also shows a few alternative words and some quality metrics.
The first step was data exploration of the very large Twitter, Blogs and News corpora. A few results:
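As an illustration, a minimal exploration sketch in R could look like the following (the file names assume the standard en_US capstone files; this is a sketch, not the original analysis):

```r
# Minimal corpus exploration sketch; file names assume the en_US capstone dataset
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")

for (f in files) {
  lines   <- readLines(f, skipNul = TRUE, encoding = "UTF-8")
  n_words <- sum(lengths(strsplit(lines, "\\s+")))
  cat(sprintf("%s: %d lines, %d words, longest line: %d chars\n",
              f, length(lines), n_words, max(nchar(lines))))
}
```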
The main step in data extraction is the creation of N-grams with the R tm package. After basic text cleaning and preparation, I focused on extracting 2-, 3-, 4-, and 5-grams. NB: unigrams are less useful for prediction; they are only used to resolve equal-probability ties, or are left out of the dataset altogether.
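A minimal sketch of this extraction step, assuming tm-based cleaning as described above (the simple base-R tokenizer here stands in for whatever tokenizer was actually used, e.g. RWeka's NGramTokenizer):

```r
library(tm)

# Cleaning/preparation with tm (illustrative subset of the real pipeline)
corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Simple base-R n-gram tokenizer (illustrative)
make_ngrams <- function(text, n) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

texts          <- sapply(corpus, as.character)
trigrams       <- unlist(lapply(texts, make_ngrams, n = 3))
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
head(trigram_counts)
```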
The 3-, 4-, and 5-grams proved most useful for finding next words that really fit the sentence. However, because the stored set of N-grams is limited to balance speed against prediction performance, not every sentence has a 3-, 4-, or 5-gram match. The bigrams are needed to get around this, and they also have a higher hit rate on rare words. The N-gram counts are: 3-, 4-, and 5-grams: 300k; bigrams: 111k; unigrams: 28k.
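A simplified sketch of this longest-match-first lookup with fallback to shorter N-grams (the function and column names are illustrative; the table layout with a context, a next word, and a probability follows the saved file described below):

```r
# Backoff-style lookup sketch: try the longest context first, then shorter ones.
# 'ngram_table' is assumed to be a data frame with columns context, next_word, prob.
predict_next <- function(sentence, ngram_table, n_alternatives = 3) {
  words <- tolower(unlist(strsplit(sentence, "\\s+")))
  for (n in 4:1) {                       # context of 4 words (5-gram) down to 1 (bigram)
    if (length(words) < n) next
    context <- paste(tail(words, n), collapse = " ")
    hits <- ngram_table[ngram_table$context == context, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$prob), ]
      return(head(hits$next_word, n_alternatives))
    }
  }
  character(0)                           # no match at any level
}

predict_next("thanks for the", ngram_table)   # top prediction plus alternatives
```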
I chose to gather all N-grams, filter them down to only the most important ones, and save these to file. The file also contains, for example, the next word to predict, counts, and probabilities. This saves considerable execution time when running the model. It takes about 20 seconds to download the dataset when the app loads, which I think is acceptable given the resulting prediction performance.
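A sketch of this precompute-and-load approach (dplyr and the cutoff per context are assumptions, not taken from the original pipeline):

```r
library(dplyr)   # an assumption; the original pipeline may use base R instead

# Keep only the top continuations per context, then save the table once, offline
pruned <- ngram_table %>%
  group_by(context) %>%
  slice_max(prob, n = 5) %>%   # a cutoff of 5 per context is illustrative
  ungroup()

saveRDS(pruned, "ngrams.rds")

# At app startup: load the precomputed table instead of rebuilding it (~20 seconds)
ngram_table <- readRDS("ngrams.rds")
```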
A random sample (about 10%) was set aside to serve as a test set.
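For example, a split along these lines (the seed is illustrative):

```r
set.seed(42)   # illustrative seed for a reproducible split
n <- length(lines)
test_idx  <- sample(n, size = round(0.10 * n))  # ~10% held out
test_set  <- lines[test_idx]
train_set <- lines[-test_idx]
```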
The app uses these N-grams, with the backoff lookup sketched above, to predict the next word.