Entry assistant presentation

Janos Brezniczky
25/04/2016 GMT

Objective and concept

The objective

Use word prediction to assist users entering text, potentially on a mobile device. Limited resources are envisaged.

The concept

  1. for simplicity, I chose the English language (a limited character set saves a lot of trouble)
  2. n-gram based modelling forms the core of the prediction
  3. additionally, some term associations were mined from the corpus and used for prediction - in practice this does not seem too useful (yet?)

Prior data processing 1. (n-grams)

Collecting unigrams, bigrams etc. was the first step.

  • The unigram statistics shed light on the repetitive nature of the terms

    (The initial report is here.)

  • In the further steps, only the top terms, which still cover 95% of the corpus, were considered

  • Bad words are ignored (as if they weren't there)

  • 2-, 3-, and 4-grams consisting entirely of the retained terms were then counted over the corpus

  • to shrink these frequency tables, infrequent 2-, 3-, and 4-grams (count < 3) were dropped - a sketch of these steps follows below
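A minimal sketch of the counting and filtering in R - the names tokens, kept and count_ngrams are illustrative, not the original code:

```r
# "tokens": the corpus as a character vector of words
# "kept": the top terms covering 95% of the corpus
library(data.table)

count_ngrams <- function(tokens, n, kept, min_count = 3) {
  # build consecutive n-tuples by shifting the token vector
  cols <- lapply(seq_len(n), function(i) tokens[i:(length(tokens) - n + i)])
  dt <- as.data.table(setNames(cols, paste0("w", seq_len(n))))
  # keep only n-grams made up entirely of the retained terms
  dt <- dt[Reduce(`&`, lapply(cols, `%in%`, kept))]
  # count, then drop infrequent n-grams (count < min_count)
  dt[, .N, by = names(dt)][N >= min_count]
}

bigrams  <- count_ngrams(tokens, 2, kept)
trigrams <- count_ngrams(tokens, 3, kept)
```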

Prior data processing 2.

The “associations”

These are based on co-occurrences of terms within a single sentence. The classic three (. ! ?) as well as line breaks were treated as sentence endings. (A complete document-term matrix seemed too big to start with - I had to consider RAM size limitations.)

This yielded a 400 MB CSV, which was reduced down to 100 MB by dropping rare pairs, similarly to the above.
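A rough sketch of the co-occurrence mining, assuming the corpus has already been split into a list of per-sentence word-ID vectors; all names are illustrative, not the deployed code:

```r
# "sentences": a list of word-ID vectors, one per sentence,
# split on ., !, ? and line breaks
library(data.table)

pairs <- rbindlist(lapply(sentences, function(ids) {
  ids <- sort(unique(ids))
  if (length(ids) < 2) return(NULL)
  # all unordered pairs of distinct terms within the sentence
  as.data.table(t(combn(ids, 2)))
}))
setnames(pairs, c("w1", "w2"))
assoc <- pairs[, .N, by = .(w1, w2)]
assoc <- assoc[N >= 3]  # drop rare pairs, as with the n-grams
```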

The tables are stored together in a single .RData file on the server, which provides compression; the file is under 50 MB.
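Bundling the tables is a one-liner in R; the object names below are illustrative:

```r
# save all model tables into one compressed .RData file
save(unigrams, bigrams, trigrams, quadgrams, assoc,
     file = "model.RData", compress = TRUE)
# the app later restores everything with load("model.RData")
```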

In action

The data is loaded on app launch.

On each key entry, the text is split into words and the words are mapped to word IDs. The word IDs are matched against the 4-, 3-, and 2-gram tables using indexed data.table objects for performance. (A stupid backoff model is used, without smoothing; a sketch follows below.)
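A minimal sketch of such a lookup, assuming keyed data.table objects named quadgrams, trigrams and bigrams (word-ID columns w1, w2, ... plus a count N); the names are placeholders, not the deployed code:

```r
library(data.table)

# setkey() builds the index that makes the joins fast
setkey(quadgrams, w1, w2, w3)
setkey(trigrams, w1, w2)
setkey(bigrams, w1)

predict_next <- function(ids, k = 4) {
  n <- length(ids)
  # stupid backoff: try the longest available context first,
  # fall back to a shorter one on a miss
  hits <- if (n >= 3) quadgrams[.(ids[n - 2], ids[n - 1], ids[n])] else NULL
  if (is.null(hits) || all(is.na(hits$N)))
    hits <- if (n >= 2) trigrams[.(ids[n - 1], ids[n])] else NULL
  if (is.null(hits) || all(is.na(hits$N)))
    hits <- bigrams[.(ids[n])]
  head(hits[!is.na(N)][order(-N)], k)  # best candidates first
}
```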

The 4 best matches are shown, ranked by probability.

Associations work similarly, but as they gave very silly results, only those candidates are returned which also form a valid 2-gram according to the 2-gram table (see the sketch below). A potential improvement is using 3-grams for this check. The 8 best matches appear over the bottom buttons. Here prediction is attempted from the very last sentence only. Stopwords (like “the”, “I”, etc.) are ignored, so these suggestions update less frequently.
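A sketch of that filter, reusing the keyed bigram table from the previous sketch; assoc and last_id are illustrative names:

```r
suggest_assoc <- function(last_id, k = 8) {
  # association candidates for the last (non-stopword) word
  # (for brevity, only the w1 side of each stored pair is checked)
  cand <- assoc[w1 == last_id]
  # continuations of last_id that the 2-gram table has actually seen
  valid <- bigrams[.(last_id)][!is.na(N), w2]
  head(cand[w2 %in% valid][order(-N)], k)
}
```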

How to use

After the application has loaded the data, the buttons appear.

Type some text into the middle edit box - after a little while the predictions should update.

Using the mouse (or a touch screen), further words can be entered quickly just by clicking the buttons.

The hits are ranked starting with the best candidate, left-to-right, top-to-bottom.

Please find the application at

https://brezniczky.shinyapps.io/deployed/

Thank you!