Word Prediction

Guennie59
March 18th, 2018

Application Design

The starting point is the objective of predicting the next word in an app hosted on the shinyapps.io platform. The main considerations and design ideas are:

Response time is crucial for end-user acceptance
The prediction is based on n-grams of cleaned-up text
The only sensible solution appeared to be returning the last word of the most frequent n-gram that starts with the given n-1 words
The algorithm must be simple and fast in execution
The dictionary for looking up words/n-grams needs to be based on 'good' statistics, but using more than 10% of the sample texts proved unrealistic. Most work was done with 3%.

For reference see also http://rpubs.com/Guennie59/361223

Algorithm and Dictionaries

Dictionaries have been prepared by splitting the cleaned-up 3% text samples into 4-, 3- and 2-grams across all sources (blogs, news, twitter). Subsequently the n-grams have been aggregated and sorted by decreasing frequency, i.e. with the most frequent at the top. For these samples the basic values were:

4-gram occurrences: most frequent “the end of the” x 216, mean count = 1.05, 2451486 occurring only once

3-gram occurrences: most frequent “one of the” x 852, mean count = 1.26, 1910475 occurring only once

2-gram occurrences: most frequent “of the” x 11030, mean count = 2.56, 806520 occurring only once
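
The dictionary preparation described above can be sketched as follows. This is a minimal illustration in Python, not the code behind the apps; the function names and the three-line toy corpus are hypothetical stand-ins for the cleaned-up 3% sample.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(lines, n):
    """Count n-grams over all lines; return (ngram, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.lower().split(), n))
    # most_common() sorts by decreasing count
    return counts.most_common()

def summarize(dictionary):
    """Top entry, mean count per distinct n-gram, and number of n-grams seen only once."""
    counts = [c for _, c in dictionary]
    return {
        "top": dictionary[0],
        "mean": sum(counts) / len(counts),
        "single": sum(1 for c in counts if c == 1),
    }

# Hypothetical toy corpus (assumption, not the real sample texts)
sample = [
    "one of the best days",
    "one of the worst days",
    "this is one of the reasons",
]
bigrams = build_dictionary(sample, 2)
stats = summarize(bigrams)
print(stats)
```

The three values returned by `summarize` correspond to the three figures listed per n-gram order above: the most frequent entry, the mean count, and the number of single occurrences.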

That way the algorithm is basically a 'grep' of the input word sequence over the n-gram column of the dictionary data-frame (data.tables appeared to be slower in this case). The first left-anchored grep result is used, since it corresponds to the most frequent matching n-gram in the list.
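
A left-anchored lookup of this kind might look like the sketch below, written in Python for illustration. The dictionary is a toy list; only the “one of the” count is taken from the figures above, the other entries and counts are invented for the example, and `predict` is a hypothetical name.

```python
import re

# Toy 3-gram dictionary, already sorted by decreasing frequency.
# Only ("one of the", 852) comes from the text; the rest is made up.
trigram_dict = [
    ("one of the", 852),
    ("a lot of", 40),
    ("one of my", 12),
    ("lot of the", 5),
]

def predict(words, dictionary):
    """Return the last word of the first n-gram that starts with `words`."""
    # '^' anchors the match on the left; the trailing space enforces whole words
    pattern = re.compile("^" + re.escape(" ".join(words)) + " ")
    for ngram, _count in dictionary:
        if pattern.match(ngram):
            # the list is sorted, so the first hit is the most frequent n-gram
            return ngram.split()[-1]
    return None

print(predict(["one", "of"], trigram_dict))
print(predict(["lot", "of"], trigram_dict))
```

Because of the anchor, the input "lot of" does not match "a lot of" even though it occurs inside it; only n-grams that begin with the input words qualify.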

Furthermore, partial matches are included as well: if a 3-gram input finds no match, 2-grams are built from it, first from the rightmost two words, then from the leftmost two words, then from the left and right word. Similarly, for 2-gram inputs unigrams are used as a fall-back.
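
The fall-back order can be sketched as below (Python, for illustration only). Two points are assumptions rather than statements from the text: that each shortened word sequence is looked up in the dictionary of the next lower order, and that for two-word inputs the right word is tried before the left one. All names are hypothetical, and the "end of my" entry is invented.

```python
def lookup(words, dictionary):
    """Last word of the first (most frequent) n-gram starting with `words`."""
    prefix = " ".join(words) + " "
    for ngram, _count in dictionary:
        if ngram.startswith(prefix):
            return ngram.split()[-1]
    return None

def candidates(words):
    """Shorter word sequences to try when the full input has no match."""
    if len(words) == 3:
        left, mid, right = words
        # rightmost two words, leftmost two words, then left and right word
        return [[mid, right], [left, mid], [left, right]]
    if len(words) == 2:
        return [[words[1]], [words[0]]]  # assumed order: right word first
    return []

def predict_with_fallback(words, dicts):
    """`dicts` maps n to an n-gram list sorted by decreasing frequency."""
    hit = lookup(words, dicts.get(len(words) + 1, []))
    if hit is not None:
        return hit
    for shorter in candidates(words):
        # assumption: shorter inputs are looked up in the lower-order dictionary
        hit = lookup(shorter, dicts.get(len(shorter) + 1, []))
        if hit is not None:
            return hit
    return None

# Toy dictionaries; "end of my" and its count are made up for the example
dicts = {
    4: [("the end of the", 216)],
    3: [("one of the", 852), ("end of my", 3)],
    2: [("of the", 11030)],
}
print(predict_with_fallback(["best", "one", "of"], dicts))
```

In this example the three-word input has no 4-gram match, so the rightmost two words "one of" are tried next and resolve through the 3-gram list.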

Performance

The n-gram data-frames have been split into training sets (90%, used to build the dictionaries) and test sets (10%). The following tables show the performance of the algorithm based on 1000 calls to the prediction function. Both the full dictionary and a more compact version, with n-grams that occur only once removed, have been used. Such single occurrences make up the majority of the 4-grams, so more compute power and a larger sample size would have been helpful here.

Full-dict	Size [MB]	Lines	Exact match	CPU-time (exact)	All match	CPU-time (all)
4-gram	218	2512869	3.4%	0.810 s	41.2%	1.394 s
3-gram	168	2088126	8.1%	0.601 s	14.1%	0.692 s
2-gram	76	1029126	10.7%	0.261 s	11.0%	0.250 s

Comp-dict	Size [MB]	Lines	Exact match	CPU-time (exact)	All match	CPU-time (all)
4-gram	17.0	61383	2.4%	0.020 s	12.6%	0.121 s
3-gram	14.2	177651	6.8%	0.055 s	13.4%	0.075 s
2-gram	5.3	222606	10.2%	0.053 s	10.5%	0.055 s
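
An evaluation along these lines could be set up as in the following Python sketch. It rests on assumptions: the split is taken over individual n-gram occurrences, a "match" means that the predicted word equals the actual last word of a test n-gram, and a hash map replaces the grep for brevity. The function names are hypothetical.

```python
import random
from collections import Counter

def split_train_test(ngrams, test_fraction=0.1, seed=1):
    """Shuffle the n-gram occurrences and split off a test set."""
    rng = random.Random(seed)
    shuffled = ngrams[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def build_dictionary(train, drop_singles=False):
    """Sorted (ngram, count) list; optionally drop n-grams seen only once."""
    counts = Counter(train)
    return [(g, c) for g, c in counts.most_common()
            if not (drop_singles and c == 1)]

def match_rate(dictionary, test):
    """Share of test n-grams whose last word is predicted from the words before it."""
    best = {}
    for ngram, _count in dictionary:
        prefix, _, last = ngram.rpartition(" ")
        best.setdefault(prefix, last)  # first entry per prefix is the most frequent
    hits = 0
    for ngram in test:
        prefix, _, last = ngram.rpartition(" ")
        if best.get(prefix) == last:
            hits += 1
    return hits / len(test)

# Toy data (made up): 20 bigram occurrences
occurrences = ["of the"] * 12 + ["of a"] * 4 + ["in the"] * 3 + ["in a"]
train, test = split_train_test(occurrences)
full = build_dictionary(train)
compact = build_dictionary(train, drop_singles=True)
print(len(full), len(compact), match_rate(full, test))
```

Passing `drop_singles=True` gives the counterpart of the compact dictionary, so both variants can be scored against the same test set.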

Truly remarkable are the high “all match” rate of the final algorithm with the full dictionary on the test set (41.2% for 4-grams) and the short processing times overall. (Please note: the dictionaries of about 100-200 MB in size did not run on shinyapps.io.)
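
As a cross-check, the line counts of the compact dictionaries follow directly from the figures reported earlier: the lines of the full dictionary minus the n-grams that occur only once. The short Python calculation below uses only numbers from this document; the variable names are arbitrary.

```python
# (lines in full dictionary, n-grams occurring only once), as reported above
full = {
    "4-gram": (2512869, 2451486),
    "3-gram": (2088126, 1910475),
    "2-gram": (1029126, 806520),
}
# lines in the compact dictionaries, as reported in the second table
compact_reported = {"4-gram": 61383, "3-gram": 177651, "2-gram": 222606}

for name, (lines, singles) in full.items():
    compact = lines - singles
    share = singles / lines
    print(f"{name}: {compact} lines remain, {share:.1%} of entries occur once")
    assert compact == compact_reported[name]
```

Roughly 98% of the 4-gram entries, 91% of the 3-gram entries and 78% of the 2-gram entries occur only once, which explains why the compact dictionaries are so much smaller.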

3 Application Prototypes

Three variations of the word-prediction app have been prototyped:

A version with dictionaries based on 10% of the samples, which allows dictionary selection and mixing: https://guenter59.shinyapps.io/word_prediction/
A lean version with dictionaries based on 3% of the samples, which is usually fast enough to perform real-time prediction (i.e. without a submit button): https://guenter59.shinyapps.io/lean_application/
A self-learning app, in which phrases entered by the user are added to the dictionaries with a 'learning rate'. With high rates it should be possible to test the effect (the app needs a restart): https://guenter59.shinyapps.io/self_learn_app/
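
How the 'learning rate' enters the dictionaries is not spelled out here. One plausible reading, shown in the Python sketch below as an assumption, is that every n-gram of an entered phrase has its count increased by the learning rate. The function names and the "of a" entry are hypothetical.

```python
from collections import Counter

def learn(counts, phrase, n, learning_rate=1):
    """Add every n-gram of a user phrase to the counts, weighted by learning_rate."""
    tokens = phrase.lower().split()
    for i in range(len(tokens) - n + 1):
        counts[" ".join(tokens[i:i + n])] += learning_rate
    return counts

def predict(counts, words):
    """Last word of the most frequent n-gram that starts with `words`."""
    prefix = " ".join(words) + " "
    matches = [(c, g) for g, c in counts.items() if g.startswith(prefix)]
    return max(matches)[1].split()[-1] if matches else None

# Toy 2-gram counts; ("of a", 500) is made up for the example
counts = Counter({"of the": 11030, "of a": 500})
before = predict(counts, ["of"])

# A very high learning rate lets a single entered phrase win immediately
learn(counts, "king of hearts", 2, learning_rate=20000)
after = predict(counts, ["of"])
print(before, after)
```

With a low rate the same phrase would have to be entered many times before it overtakes an established n-gram, which is why high rates make the effect easy to test.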

The appearance of the word cloud indicates that the dictionary (or dictionaries) has finished loading.

Summary and Proposal

It has been demonstrated that, with limited effort during evenings and weekends and very little computing power, a word-prediction application (in 3 variants) can be developed that performs well on low-capacity devices.

Business Potential:

The app could be developed further with user interfaces optimized for various devices and user preferences; it could be made available for free with ad-based sponsoring.

My request to you as investors: a seed fund of $250k in order to:

implement language detection and prepare multiple mixed dictionaries
explore openNLP, Markov-chain and deep learning models
investigate methods of transfer learning to adjust to personal language styles
improve the statistical base, performance and memory footprint of the application
improve the look and feel of the user interface
run a marketing campaign to increase usage and ad-based revenue