Tomás A. Maccor
14-Jun-2020
The challenge
Natural language processing (NLP) is the automatic computational processing of human languages: the technology that helps computers understand natural language.
SwiftKey, a leading company in text-prediction technology, provided a 600 MB corpus (a collection of computer-readable texts) taken from Twitter, blogs and news posts.
The task was to teach a computer to “learn” the English language, use a prediction algorithm (via a language model) to predict the next word given a phrase or sentence of 1 to N words, and finally build a text-prediction Shiny app around it.
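As a rough illustration of what "predicting the next word" means, here is a minimal Python sketch of a bigram lookup. It is not the app's code (the app is built in R with higher-order n-grams); the toy corpus and the function names `build_bigram_table` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_table(sentences):
    """Count how often each word follows each preceding word."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            table[prev][nxt] += 1
    return table

def predict_next(table, phrase, k=3):
    """Return up to k of the most frequent followers of the phrase's last word."""
    tokens = phrase.lower().split()
    if not tokens:
        return []
    followers = table.get(tokens[-1], Counter())
    return [word for word, _ in followers.most_common(k)]

# Hypothetical toy corpus, standing in for the Twitter/blog/news texts.
toy_corpus = [
    "thanks for the follow",
    "thanks for the great day",
    "thanks for reading",
]
table = build_bigram_table(toy_corpus)
suggestions = predict_next(table, "Thanks for")
```

A phrase whose last word never appeared in the corpus returns an empty list here, which is exactly the gap that discounting and back-off to shorter n-grams are meant to close.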
Discounting keeps our language model from assigning zero probability to unseen sequences of words. It works by taking a bit of probability mass away from the more frequent word sequences and assigning it to unseen ones. This model uses Simple Good-Turing (Gale and Sampson 1995), which is derived from the original Good-Turing algorithm:
\[
\tiny{
P_{GT}={c^*\over N}; \qquad
c^*=(c+1){N_{c+1}\over N_c}; \qquad
\textrm{and} \qquad
Z_c={N_c\over 0.5\,(t-q)}
}
\]
Here \(c\) is the observed count of an n-gram, \(N_c\) is the number of distinct n-grams seen exactly \(c\) times, \(N\) is the total number of n-grams observed, and \(q\) and \(t\) are the nearest counts below and above \(c\) for which \(N_q\) and \(N_t\) are non-zero.
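The quantities in these formulas can be sketched on a toy frequency table. This is a minimal Python illustration under stated assumptions, not the app's implementation: the counts in `toy_counts` are made up, the function names are hypothetical, and the log-linear regression on \(Z_c\) that completes Simple Good-Turing is left out. The boundary handling in `averaged_z` (q = 0 for the smallest count, t = 2c - q for the largest) follows Gale and Sampson.

```python
from collections import Counter

def frequency_of_frequencies(ngram_counts):
    """Map each count c to N_c, the number of distinct n-grams seen c times."""
    return Counter(ngram_counts.values())

def turing_adjusted_counts(n_c):
    """c* = (c + 1) * N_{c+1} / N_c, wherever N_{c+1} is non-zero."""
    return {c: (c + 1) * n_c[c + 1] / n_c[c]
            for c in sorted(n_c) if n_c.get(c + 1, 0) > 0}

def averaged_z(n_c):
    """Z_c = N_c / (0.5 * (t - q)), q and t being the neighbouring non-zero counts."""
    cs = sorted(n_c)
    z = {}
    for i, c in enumerate(cs):
        q = cs[i - 1] if i > 0 else 0                     # no lower neighbour: q = 0
        t = cs[i + 1] if i + 1 < len(cs) else 2 * c - q   # no upper neighbour
        z[c] = n_c[c] / (0.5 * (t - q))
    return z

# Hypothetical bigram counts, for illustration only.
toy_counts = {"of the": 3, "in the": 2, "to be": 2,
              "a lot": 1, "at all": 1, "so far": 1}

n_c = frequency_of_frequencies(toy_counts)   # N_1 = 3, N_2 = 2, N_3 = 1
total = sum(toy_counts.values())             # N = 10
c_star = turing_adjusted_counts(n_c)         # discounted counts for c = 1 and c = 2
z = averaged_z(n_c)
p_unseen = n_c[1] / total                    # mass reserved for unseen n-grams: N_1 / N
```

The largest count (c = 3 here) gets no adjusted count from the raw formula because \(N_4 = 0\); gaps like this are the reason Simple Good-Turing smooths \(N_c\) through \(Z_c\) before applying the estimate.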
- Storing the frequency tables of all n-grams as data.tables improves computing speed
- setkey() was also used: a keyed data.table performs 20x faster than a data frame. setkey() sorts the table and marks it as sorted with a 'sorted' attribute; the sorted columns are the key. It is memory efficient because the table is reordered by reference rather than copied, and fast because (1) binary search and joins can reuse an existing key, and (2) grouping by a leading subset of the key columns finds each group already gathered contiguously in RAM
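The reasoning behind keyed tables can be sketched outside R. The Python snippet below is only an analogy, not data.table's implementation: it sorts a small table once, then uses binary search for exact lookups and a contiguous slice for all rows sharing a leading prefix. The rows and the names `set_key`, `keyed_lookup` and `rows_with_prefix` are hypothetical.

```python
from bisect import bisect_left

def set_key(rows):
    """Sort (ngram, count) rows by n-gram in place and return the sorted key column."""
    rows.sort(key=lambda row: row[0])
    return [row[0] for row in rows]

def keyed_lookup(rows, keys, ngram):
    """Binary search on the sorted key: O(log n) instead of a full scan."""
    i = bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return rows[i][1]
    return None

def rows_with_prefix(rows, keys, prefix):
    """Rows sharing a leading prefix sit next to each other once the table is sorted."""
    start = bisect_left(keys, prefix)
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return rows[start:end]

# Hypothetical trigram frequency table.
rows = [("thanks for the", 2), ("a lot of", 5),
        ("thanks for reading", 1), ("one of the", 4)]
keys = set_key(rows)

count = keyed_lookup(rows, keys, "one of the")
candidates = rows_with_prefix(rows, keys, "thanks for ")
```

Here `candidates` holds every trigram that continues "thanks for", which mirrors how the prediction step retrieves all next-word candidates for a typed phrase in one contiguous read.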