The Typing Caddie

Tomás A. Maccor
14-Jun-2020

Using NLP to predict text

The challenge

  • NLP (natural language processing) is the automatic computational processing of human languages; it is the technology that enables computers to understand and work with natural language.

  • Swiftkey, a leading company in text-prediction technology, provided a 600 MB corpus (a collection of computer-readable texts) taken from Twitter, internet blogs & news posts.

  • The task was to teach the computer to “learn” English from this corpus, use a prediction algorithm (via a model) to predict the next word given a phrase of 1 to N words, and finally build a text-prediction ShinyApp from it.

Text data cleaning/analysis

  • 6% of the text from the corpus was used to build the model (trial & error showed that this amount was sufficient to achieve good predictive power)
  • This data subset was cleaned & tokenized (texts were split into smaller units, words in this case), and profane language was removed; a minimal sketch of this step follows this list
  • English stopwords were left in & no stemming was performed
  • The resulting text was used to generate 1-grams to 4-grams, as perplexity evaluation showed that incorporating 5-grams and above did not significantly reduce perplexity and used much more RAM
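As a rough illustration, the sampling, cleaning and n-gram generation step could look like the sketch below, here written with the quanteda package. The seed, 6% sampling fraction, profanity list and exact options are illustrative assumptions, not necessarily the exact pipeline used.

```r
# Minimal sketch of the sampling/cleaning/tokenization step, using quanteda.
library(quanteda)

build_ngram_counts <- function(lines, profanity, sample_frac = 0.06) {
  set.seed(1234)
  sampled <- sample(lines, floor(length(lines) * sample_frac))

  toks <- tokens(sampled,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE,
                 remove_url     = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_remove(toks, pattern = profanity)  # drop profane words
  # English stopwords are kept and no stemming is applied, as described above

  # Build 1- to 4-gram frequency tables
  lapply(1:4, function(n) {
    colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " ")))
  })
}
```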

The Model

  • The Typing Caddie uses an n-gram language model that assigns probabilities to sequences of words.
  • Markov chain assumptions (whereby the probability of a word depends only on the preceding few words) are also used. This keeps the model smaller and more efficient.
  • In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (n-1)-gram. We keep backing off until we reach a lower-order n-gram history that has some counts (a simplified lookup sketch follows this list).
  • The backoff method used for this model was Katz's backoff.
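The backoff idea can be illustrated with a simplified lookup: try the longest available history first and fall back to shorter histories when nothing matches. The sketch below only shows that lookup logic; the list of keyed tables and the column names (prefix, prediction, prob) are hypothetical, and the full Katz method additionally applies discounted probabilities and backoff weights.

```r
# Simplified backoff lookup (not the full Katz computation). `tables` is assumed
# to be a list of keyed data.tables, tables[[n]] holding the n-gram rows with
# columns prefix, prediction and prob (illustrative names).
library(data.table)

predict_next <- function(phrase, tables, top_n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
  for (n in rev(seq_len(length(words)))) {         # try 3-word history, then 2, then 1
    prefix_str <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n + 1]][prefix == prefix_str]  # look up in the (n+1)-gram table
    if (nrow(hits) > 0)
      return(head(hits[order(-prob), prediction], top_n))
  }
  # Nothing matched: back off all the way to the most frequent unigrams
  head(tables[[1]][order(-prob), prediction], top_n)
}
```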

The Algorithm

Discounting keeps the language model from assigning zero probability to unseen sequences of words. It works by taking a little probability mass from more frequent word sequences & assigning it to unseen ones. This model uses Simple Good-Turing discounting (Gale and Sampson, 1995), which is derived from the original Good-Turing algorithm:

\[ \tiny{ P_{GT}=\frac{c^*}{N} \;; \qquad c^*=(c+1)\frac{N_{c+1}}{N_c} \;; \qquad \textrm{and} \qquad Z_c=\frac{N_c}{0.5\,(t-q)} } \]
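As a toy illustration of the adjusted-count formula above, the frequency-of-frequencies table N_c can be used directly to compute c*; the numbers below are made up purely for the example.

```r
# Toy Good-Turing adjusted counts c* = (c + 1) * N_{c+1} / N_c, computed from
# an invented frequency-of-frequencies table.
library(data.table)

fof <- data.table(c = 1:5, Nc = c(120, 40, 18, 10, 6))
fof[, c_star := (c + 1) * shift(Nc, type = "lead") / Nc]
fof
# e.g. n-grams seen once get an adjusted count of 2 * 40 / 120 = 0.67
```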
  • data.tables are used to store the frequency tables of all the n-grams obtained, which improves computing speed.
  • SETKEY is also used: it sorts the table by the chosen key columns, marks it as sorted with a 'sorted' attribute, and changes the table only by reference, giving roughly 20x faster performance than a data frame.
  • This is also very memory efficient because (1) binary search and joins are faster when they can use an existing key, and (2) grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM.
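A minimal sketch of what such a keyed frequency table could look like (the column names are illustrative assumptions):

```r
# Sketch of a keyed data.table holding bigram frequencies.
library(data.table)

bigrams <- data.table(prefix     = c("happy", "happy", "good"),
                      prediction = c("birthday", "hour", "morning"),
                      count      = c(50L, 20L, 35L))

setkey(bigrams, prefix)   # sorts by prefix (by reference) and marks it as the key

bigrams["happy"]          # keyed subset uses binary search instead of a full scan
```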

ShinyApp

  • The Typing Caddie is a simple, user-friendly app
  • User instructions are self-explanatory & are included in the app's main panel.
  • Just open it up, start typing, and “the Caddie” will assist you with a choice of up to 3 suggested next words


    Try it out at The Typing Caddie ShinyApp…and enjoy your typing session!