Guennie59
March 18th, 2018
The starting point is the objective of predicting the next word in an app hosted on the shinyapps.io platform. The main considerations and design ideas are as follows:
For reference see also http://rpubs.com/Guennie59/361223
Dictionaries have been prepared by splitting the cleaned-up 3% text samples into 4-, 3- and 2-grams across all sources (blogs, news, Twitter). The n-grams have then been aggregated and sorted by decreasing frequency, i.e. with the most frequent at the top. For these samples basic values were
The algorithm is therefore basically a 'grep' of the input word sequence through the n-gram column of the dictionary data frame (data.tables appear to be slower in this case). The first left-anchored grep result is used, since it corresponds to the most frequent matching n-gram in the list.
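The dictionary preparation and the anchored lookup described above can be sketched as follows. This is a minimal illustration in Python, not the app's own code (which appears to work on R data frames); the function names `build_ngram_dict` and `predict_exact` and the sample sentences are hypothetical.

```python
from collections import Counter

def build_ngram_dict(lines, n):
    """Count all n-grams in the text lines and return them, most frequent first."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # most_common() sorts by decreasing count, like the sorted dictionary
    return [gram for gram, _ in counts.most_common()]

def predict_exact(ngrams, prefix):
    """Return the last word of the first n-gram that starts with `prefix`.

    `prefix` is expected to hold n-1 words. The linear scan plays the role
    of a left-anchored grep; the first hit is the most frequent match.
    """
    anchor = prefix.lower().strip() + " "
    for gram in ngrams:
        if gram.startswith(anchor):
            return gram.rsplit(" ", 1)[1]
    return None

# Made-up sample text, for illustration only
sample = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the help",
]
four_grams = build_ngram_dict(sample, 4)
print(predict_exact(four_grams, "thanks for the"))  # -> follow
```

Because the list is sorted once when the dictionary is built, each prediction only needs to stop at the first match and never has to compare counts.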
Furthermore, partial matches are included as well: if a 3-gram has no match, 2-grams are built from it, first from the rightmost two words, then from the leftmost two words, then from the left and the right word. For 2-grams, unigrams are used as a fall-back in the same way.
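The fall-back order can be illustrated with a short Python sketch. It rests on some assumptions that the text does not spell out: shortened word sequences are looked up in the next smaller dictionary, only one level of fall-back is tried, and for two-word input the right word is tried before the left one. All names (`fallback_prefixes`, `lookup`, `predict`) and the toy dictionaries are hypothetical.

```python
def fallback_prefixes(words):
    """Shorter word sequences to try, in order, once the full input has failed."""
    if len(words) == 3:
        w1, w2, w3 = words
        # rightmost two words, leftmost two words, then left and right word
        return [(w2, w3), (w1, w2), (w1, w3)]
    if len(words) == 2:
        w1, w2 = words
        return [(w2,), (w1,)]  # assumed order: right word first
    return []

def lookup(dicts, prefix):
    """dicts maps prefix length to n-grams sorted by decreasing frequency."""
    anchor = " ".join(prefix) + " "
    for gram in dicts.get(len(prefix), []):
        if gram.startswith(anchor):  # left-anchored match
            return gram.rsplit(" ", 1)[1]
    return None

def predict(dicts, text):
    """Try the last (up to) three words, then their shortened variants."""
    words = tuple(text.lower().split()[-3:])
    if not words:
        return None
    for prefix in [words] + fallback_prefixes(words):
        hit = lookup(dicts, prefix)
        if hit is not None:
            return hit
    return None

# Toy dictionaries keyed by prefix length (3 -> 4-grams, 2 -> 3-grams, 1 -> 2-grams)
dicts = {
    3: ["thanks for the follow"],
    2: ["for the first", "at the end"],
    1: ["the same"],
}
print(predict(dicts, "thanks for the"))   # exact 4-gram match -> follow
print(predict(dicts, "waiting for the"))  # rightmost two words -> first
print(predict(dicts, "at zzz the"))       # left and right word -> end
print(predict(dicts, "into the"))         # unigram fall-back -> same
```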
The n-gram data frames have been split into training sets (90%, used to build the dictionaries) and test sets (10%). The results below show the performance of the algorithm based on 1000 calls to the prediction function. Both the full dictionary and a more compact version have been used; in the compact version, n-grams that occur only once are removed (these are the majority of the 4-grams, so more compute power and a larger sample size would have been helpful here).
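A test of this kind could look roughly like the Python sketch below. It assumes that a "match" means the predicted last word equals the true last word of a test n-gram, which the text does not state explicitly, and it swaps the grep for a plain dictionary lookup to keep the example short. The names `split_ngrams`, `build_lookup` and `hit_rate` and the toy data are hypothetical.

```python
import random
from collections import Counter

def split_ngrams(ngrams, test_share=0.1, seed=1):
    """Shuffle the n-gram occurrences and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ngrams)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

def build_lookup(train):
    """Map each prefix to the most frequent last word seen in training."""
    best = {}
    for gram, _ in Counter(train).most_common():
        prefix, last = gram.rsplit(" ", 1)
        best.setdefault(prefix, last)  # first entry per prefix is the most frequent
    return best

def hit_rate(best, test, calls=1000, seed=1):
    """Share of `calls` randomly drawn test n-grams whose last word is predicted."""
    if not test:
        return 0.0
    rng = random.Random(seed)
    hits = 0
    for _ in range(calls):
        prefix, last = rng.choice(test).rsplit(" ", 1)
        hits += best.get(prefix) == last
    return hits / calls

# Toy data: 100 n-gram occurrences, for illustration only
ngrams = ["a b c"] * 50 + ["a b d"] * 30 + ["x y z"] * 20
train, test = split_ngrams(ngrams)
best = build_lookup(train)
rate = hit_rate(best, test)
print(f"{len(train)} training / {len(test)} test n-grams, hit rate {rate:.1%}")
```

Dropping the entries with a count of one before `build_lookup` would give the compact variant of the dictionary.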
| Full dictionary | Size [MB] | Lines | Exact match | CPU time (exact) | All match | CPU time (all) |
|---|---|---|---|---|---|---|
| 4-gram | 218 | 2512869 | 3.4% | 0.810 s | 41.2% | 1.394 s |
| 3-gram | 168 | 2088126 | 8.1% | 0.601 s | 14.1% | 0.692 s |
| 2-gram | 76 | 1029126 | 10.7% | 0.261 s | 11.0% | 0.250 s |

| Compact dictionary | Size [MB] | Lines | Exact match | CPU time (exact) | All match | CPU time (all) |
|---|---|---|---|---|---|---|
| 4-gram | 17.0 | 61383 | 2.4% | 0.020 s | 12.6% | 0.121 s |
| 3-gram | 14.2 | 177651 | 6.8% | 0.055 s | 13.4% | 0.075 s |
| 2-gram | 5.3 | 222606 | 10.2% | 0.053 s | 10.5% | 0.055 s |
Truly remarkable are the high “all-match” rate of the final algorithm with the full dictionary on the test set and the short processing times overall. (Please note: the dictionaries of about 100-200 MB in size did not run on shinyapps.io.)
Three variations of the word-prediction app have been prototyped:
An app with dictionaries based on 10% samples, which allows dictionary selection and mixing: https://guenter59.shinyapps.io/word_prediction/
A lean app with dictionaries based on 3% samples, which is usually fast enough for real-time prediction (i.e. without a submit button): https://guenter59.shinyapps.io/lean_application/
A self-learning app, in which phrases entered by the user are added to the dictionaries with a 'learning rate'. With high rates it should be possible to test the effect (the app needs a restart): https://guenter59.shinyapps.io/self_learn_app/
The appearance of the word cloud indicates that the dictionary (or dictionaries) has finished loading.
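How a 'learning rate' might act on the dictionaries in the self-learning variant is sketched below in Python. The update rule is an assumption, not taken from the app: each n-gram of an entered phrase has its count increased by the learning rate, so that high rates quickly move a user's phrases to the top. The names `learn_phrase` and `top_prediction` and the counts are hypothetical.

```python
from collections import Counter

def learn_phrase(counts, phrase, n, learning_rate=1):
    """Add `learning_rate` to the count of every n-gram in the entered phrase."""
    words = phrase.lower().split()
    for i in range(len(words) - n + 1):
        counts[" ".join(words[i:i + n])] += learning_rate
    return counts

def top_prediction(counts, prefix):
    """Last word of the most frequent n-gram that starts with `prefix`."""
    anchor = prefix.lower().strip() + " "
    matches = [(count, gram) for gram, count in counts.items()
               if gram.startswith(anchor)]
    if not matches:
        return None
    return max(matches)[1].rsplit(" ", 1)[1]

# Made-up starting counts for two competing 3-grams
counts = Counter({"see you later": 5, "see you soon": 3})
before = top_prediction(counts, "see you")  # "later"
learn_phrase(counts, "see you soon", n=3, learning_rate=10)
after = top_prediction(counts, "see you")   # "soon", its count is now 13
print(before, "->", after)
```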
Business Potential:
My request to you as investors: A seed fund of $ 250 k in order to: