Janos Brezniczky
25/04/2016 GMT
Use word prediction to help users entering text, potentially on a mobile device. Limited resources are to be expected.
Collecting unigrams, bigrams etc. was the first step.
The unigram statistics shed light on the repetitive nature of the terms
(The initial report is here.)
In the further steps, only the top terms, which still cover 95% of the corpus, were considered
Bad words are ignored (as if they weren't there)
2-, 3- and 4-grams consisting entirely of the filtered terms were then counted over the corpus
To reduce the size of these frequency tables, less frequent 2-, 3- and 4-grams (count < 3) were dropped
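The filtering and counting steps above can be sketched as follows. This is a minimal Python illustration, not the original R code: the function names, the `bad_words` parameter and the toy corpus are all hypothetical, and the coverage is lowered to 0.9 only so that the tiny example actually drops a term.

```python
from collections import Counter


def top_terms_for_coverage(tokens, coverage=0.95):
    """Return the most frequent terms that together cover `coverage` of all tokens."""
    counts = Counter(tokens)
    target = coverage * sum(counts.values())
    kept, covered = [], 0
    for term, count in counts.most_common():
        if covered >= target:
            break
        kept.append(term)
        covered += count
    return kept


def count_ngrams(sentences, vocab, n, bad_words=frozenset(), min_count=3):
    """Count n-grams made only of vocabulary terms, then drop the rare ones."""
    vocab = set(vocab)
    counts = Counter()
    for words in sentences:
        # Bad words are removed first, as if they weren't there.
        words = [w for w in words if w not in bad_words]
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if all(w in vocab for w in gram):
                counts[gram] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus (assumption): sentences already split into lowercase words.
sentences = (
    [["i", "like", "tea"]] * 3
    + [["i", "like", "darn", "coffee"], ["tea", "i"]]
)
tokens = [w for s in sentences for w in s if w != "darn"]

# 0.9 instead of 0.95 so that the rare term "coffee" falls outside the vocabulary.
vocab = top_terms_for_coverage(tokens, coverage=0.9)
bigrams = count_ngrams(sentences, vocab, n=2, bad_words={"darn"})
print(vocab)
print(bigrams)
```

The same `count_ngrams` call with `n=3` or `n=4` would give the larger tables; pruning by `min_count` is what keeps them small enough for a limited-memory deployment.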
These are based on co-occurrence of terms in a single sentence. The classic three (. ! ?) as well as line breaks were considered sentence endings. (A complete document-term matrix seemed too big to start with; I had to consider RAM size limitations.)
This yielded a 400 MB CSV, which was reduced to 100 MB by dropping rare entries, similarly to the above.
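Sentence-level co-occurrence counting of this kind might look like the sketch below. It is a Python illustration under stated assumptions: the function names are hypothetical, pairs are stored unordered, and each pair is counted once per sentence. The source does not say how the pairs were actually stored.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Split on '.', '!', '?' and line breaks; drop empty pieces."""
    parts = re.split(r"[.!?\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def cooccurrence_counts(text):
    """Count unordered term pairs that appear together in one sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        # Sorting makes ("is", "tea") and ("tea", "is") the same key.
        terms = sorted(set(sentence.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


text = "Tea is hot. Is tea good?\nYes!"
pairs = cooccurrence_counts(text)
print(split_sentences(text))
print(pairs[("is", "tea")])
```

Counting only within a sentence keeps the table far smaller than a full document-term matrix, at the cost of missing associations that span sentences.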
The files are stored together in a single compressed .RData file on the server, less than 50 MB in size.
The data is loaded on app launch.
On each keystroke: text -> words, words -> word_ids. The word_ids are matched against the 4-, 3- and 2-gram tables, using indexed data.table objects for performance. (A stupid backoff model is used, without smoothing.)
The 4 best matches are shown, ranked by probability.
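A backoff lookup along these lines can be sketched as below. This is a Python illustration rather than the app's data.table code: plain dictionaries stand in for the indexed tables, words are used directly instead of word ids, and the backoff factor `alpha=0.4` is the value commonly quoted for stupid backoff, not one stated in the source. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_tables(sentences, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to counts of the next word."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables


def predict(words, tables, k=4, alpha=0.4, max_n=4):
    """Rank next-word candidates, preferring matches on the longest context."""
    scores = {}
    penalty = 1.0
    for n in range(max_n, 1, -1):
        if len(words) >= n - 1:
            context = tuple(words[-(n - 1):])
            followers = tables[n].get(context)
            if followers:
                total = sum(followers.values())
                for word, count in followers.items():
                    # Keep the score from the highest-order table that saw the word.
                    scores.setdefault(word, penalty * count / total)
        # Each step down to a shorter context is discounted by alpha.
        penalty *= alpha
    return sorted(scores, key=scores.get, reverse=True)[:k]


sentences = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "tea"],
    ["we", "like", "green", "apples"],
]
tables = build_tables(sentences)
print(predict(["i", "like", "green"], tables))
```

Here "tea" is found by the 4-gram table, while "apples" only appears after backing off to the 3-gram context and is therefore ranked lower. No smoothing is applied, so a context that was never seen simply yields no candidates.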
Associations are looked up similarly, but as they gave very silly results on their own, only those that form a valid 2-gram according to the 2-gram table are returned. A potential improvement is to use 3-grams for this.
The 8 best matches appear on the bottom buttons.
Here prediction is attempted from the very last sentence.
Stopwords (like “the”, “I”, “etc.”) are ignored, so these suggestions update less frequently.
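One possible reading of this filtering step is sketched below in Python. The interpretation is an assumption: a candidate is kept only if (last typed word, candidate) exists in the 2-gram table. The stopword list, the data layout and the function name are hypothetical as well.

```python
from collections import Counter

# Hypothetical stopword list; the app's actual list is not given in the source.
STOPWORDS = {"the", "i", "etc"}


def associated_words(sentence_words, cooc, bigrams, k=8):
    """Suggest associated terms that also form a known 2-gram with the last word."""
    content = [w for w in sentence_words if w not in STOPWORDS]
    if not content:
        return []
    # Sum co-occurrence counts over all non-stopword terms of the sentence.
    scores = Counter()
    for word in content:
        for other, count in cooc.get(word, {}).items():
            scores[other] += count
    last = sentence_words[-1]
    ranked = [w for w, _ in scores.most_common()]
    # Assumed validity check: (last word, candidate) must be in the 2-gram table.
    return [w for w in ranked if (last, w) in bigrams][:k]


cooc = {"tea": {"hot": 5, "cup": 3, "green": 2}}
bigrams = {("tea", "hot"), ("tea", "cup")}
suggestions = associated_words(["i", "drink", "tea"], cooc, bigrams)
print(suggestions)
```

In this toy case "green" is strongly associated with "tea" but is dropped because "tea green" is not a known 2-gram, which is the kind of silly suggestion the filter is meant to remove.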
After the application has loaded the data, the buttons appear.
Type some text into the middle edit box - after a little while the predictions should update.
Using the mouse (or touch screen), further words can be entered quickly just by clicking the buttons.
The hits are ranked starting with the best candidate, left-to-right, top-to-bottom.
Please find the application at
https://brezniczky.shinyapps.io/deployed/
Thank you!