Word Prediction

Guennie59
March 18th, 2018

Application Design

The objective is to predict the next word of a typed phrase in an app hosted on the shinyapps.io platform. The main considerations and design ideas are:

  • Response time is crucial for end-user acceptance.
  • The prediction is based on n-grams of cleaned-up text.
  • The only sensible solution appeared to be to return the last word of the most frequent n-gram that starts with the given n-1 words.
  • The algorithm must be simple and fast in execution.
  • The dictionary for looking up words/n-grams needs to be based on 'good' statistics, but using more than 10% of the sample texts proved unrealistic. Most work was done with 3%.

For reference see also http://rpubs.com/Guennie59/361223

Algorithm and Dictionaries

Dictionaries were prepared by splitting the cleaned-up 3% text samples into 4-, 3- and 2-grams across all sources (blogs, news, twitter). The n-grams were then aggregated and sorted by frequency in decreasing order, i.e. with the most frequent at the top. For these samples the basic statistics (most frequent n-gram, mean count per n-gram, number of n-grams occurring only once) were:

  • 4-gram occurrences: “the end of the” x 216, mean = 1.05, 2451486 occurring once
  • 3-gram occurrences: “one of the” x 852, mean = 1.26, 1910475 occurring once
  • 2-gram occurrences: “of the” x 11030, mean = 2.56, 806520 occurring once
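
The dictionary preparation described above can be sketched as follows. This is a minimal Python illustration, not the code used for the app; the function names `ngrams` and `build_dictionary`, the whitespace tokenization and the three sample lines are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(lines, n):
    """Count n-grams over cleaned lines; return (ngram, count) pairs,
    most frequent first."""
    counts = Counter()
    for line in lines:
        # Assumption: text is already cleaned, so splitting on whitespace suffices.
        counts.update(ngrams(line.lower().split(), n))
    return counts.most_common()

# Hypothetical toy sample standing in for the 3% text sample.
sample = ["one of the best", "one of the few", "some of the best"]
trigrams = build_dictionary(sample, 3)

top_ngram, top_count = trigrams[0]
mean_count = sum(c for _, c in trigrams) / len(trigrams)
singletons = sum(1 for _, c in trigrams if c == 1)
```

The three derived values correspond to the three figures listed per n-gram size: the most frequent n-gram with its count, the mean count, and the number of n-grams that occur only once.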

With the dictionary sorted this way, the algorithm is basically a 'grep' of the input word sequence over the n-gram column of the dictionary data frame (data.tables appeared to be slower in this case). The first left-anchored match is used, since it corresponds to the most frequent n-gram in the list.
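
A minimal sketch of this lookup, again in Python for illustration only. It assumes the dictionary is a list of `(ngram, count)` pairs already sorted by decreasing count; `predict_exact` is a hypothetical name, and a plain prefix test stands in for the left-anchored grep.

```python
def predict_exact(dictionary, context):
    """Return the last word of the first n-gram that starts with `context`.

    `dictionary` must be sorted by decreasing frequency, so the first hit
    is also the most frequent continuation. Returns None if nothing matches.
    """
    # The trailing space anchors the match on whole words ("of " must not match "off ...").
    prefix = context + " "
    for ngram, _count in dictionary:
        if ngram.startswith(prefix):
            return ngram.rsplit(" ", 1)[-1]
    return None

# Hypothetical toy 2-gram dictionary, most frequent first.
bigrams = [("of the", 5), ("of a", 3), ("off the", 2), ("one of", 2)]
next_word = predict_exact(bigrams, "of")
no_match = predict_exact(bigrams, "zzz")
```

Because the scan stops at the first hit, frequent contexts are resolved quickly, while a context with no match costs a full pass over the dictionary.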

Furthermore, partial matches are included as well: if a 3-gram yields no match, 2-grams are built from it and tried in turn, first the rightmost two words, then the leftmost two words, then the left and right word. Similarly, for 2-grams the unigrams are used as fall-back.

Performance

The n-gram data frames were split into a training set (90%, used to build the dictionaries) and a test set (10%). The following tables show the performance of the algorithm based on 1000 calls to the prediction function. Both the full dictionary and a more compact version were used; in the compact version the n-grams occurring only once are removed (these are the majority of the 4-grams, so more compute power and a larger sample size would have been helpful here).

Full-dict   Size [MB]   lines     exact match   CPU-time   all match   CPU-time
4-gram      218         2512869   3.4%          0.810 s    41.2%       1.394 s
3-gram      168         2088126   8.1%          0.601 s    14.1%       0.692 s
2-gram      76          1029126   10.7%         0.261 s    11.0%       0.250 s

comp-dict   Size [MB]   lines     exact match   CPU-time   all match   CPU-time
4-gram      17.0        61383     2.4%          0.020 s    12.6%       0.121 s
3-gram      14.2        177651    6.8%          0.055 s    13.4%       0.075 s
2-gram      5.3         222606    10.2%         0.053 s    10.5%       0.055 s

What is truly remarkable is the high “all-match” rate of the final algorithm with the full dictionary on the test set, as well as the short processing times overall. (Please note: the dictionaries of about 100-200 MB in size did not run on shinyapps.io.)
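
The compact dictionary and a match rate of the kind reported above could be computed along the following lines. This is a Python sketch under assumptions, not the evaluation code behind the tables: `compact` and `exact_match_rate` are hypothetical names, an "exact match" is taken to mean that the word predicted for the full context equals the actual last word of a test n-gram, and a lookup table replaces the linear grep scan for brevity.

```python
def compact(dictionary):
    """Drop all n-grams that occur only once."""
    return [(ngram, count) for ngram, count in dictionary if count > 1]

def exact_match_rate(dictionary, test_ngrams):
    """Share of test n-grams (n >= 2) whose last word is predicted correctly."""
    best = {}
    for ngram, _count in dictionary:
        context, last = ngram.rsplit(" ", 1)
        # The dictionary is sorted by frequency, so the first entry per context wins.
        best.setdefault(context, last)
    hits = 0
    for ngram in test_ngrams:
        context, last = ngram.rsplit(" ", 1)
        if best.get(context) == last:
            hits += 1
    return hits / len(test_ngrams)

# Hypothetical toy data, sorted by decreasing count.
full = [("one of the", 3), ("one of a", 1), ("end of the", 1)]
small = compact(full)
test_set = ["one of the", "end of the", "one of a", "lots of fun"]

full_rate = exact_match_rate(full, test_set)
small_rate = exact_match_rate(small, test_set)
```

In this toy case the compact dictionary loses the hit for "end of the", which mirrors the trade-off in the tables: a much smaller dictionary at the cost of a lower match rate.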

3 Application Prototypes

Three variations of the word prediction app have been prototyped.

The appearance of the word cloud indicates that the dictionary (or dictionaries) has finished loading.

Summary and Proposal

  • It has been demonstrated that a word prediction application (in 3 variants) that performs well on low-capacity devices can be developed with limited effort during evenings and weekends and with very little computing power.

Business Potential:

  • The app could be developed further with user interfaces optimized for various devices and user preferences; it could be made available for free with ad-based sponsoring.

My request to you as investors: a seed fund of $250k in order to:

  • implement language detection and prepare multiple mixed dictionaries
  • explore openNLP, Markov chain and deep learning models
  • investigate methods of transfer learning to adjust to personal language styles
  • improve the statistical base, performance and memory footprint of the application
  • improve the look and feel of the user interface
  • run a marketing campaign to increase usage and ad-based revenue