Data sets - provided by SwiftKey and drawn from blogs, news, and Twitter sources;
Pre-processing - the text was cleaned with standard text-analysis steps from the text processing pipeline: conversion to lower case, removal of punctuation, and removal of stop words and profanity;
Tokenization, DFMs, N-grams - split the text into word tokens (tokenization), built a document-feature matrix (DFM) holding the frequency of each token, and created sequences of two, three, and four consecutive tokens (N-grams) to improve the prediction model;
The model - a sequence of R functions that take the input (a word), pre-process it, look it up in the different N-gram tables, and return the most frequent word related to that input.
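
The steps listed above can be illustrated with a minimal sketch in base R. It is not the project's actual code: the function names (preprocess, make_ngrams, count_ngrams, build_model, predict_next), the tiny stop-word and profanity lists, the three-sentence corpus, and the choice to fall back from 4-grams to 2-grams are all assumptions made for illustration. Plain named count vectors stand in for the DFM.

```r
# Illustrative sketch only; every name below is hypothetical.

# Tiny placeholder lists (assumptions, not the lists used in the project).
stop_words <- c("the", "a", "an", "of", "to", "and", "is")
profanity  <- c("darn")

# Lower-case, strip punctuation, split on whitespace, drop unwanted words.
preprocess <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[[:punct:]]", "", text)
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  tokens[nzchar(tokens) & !(tokens %in% c(stop_words, profanity))]
}

# All runs of n consecutive tokens, each joined by single spaces.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Named integer vector: n-gram -> frequency across all documents.
count_ngrams <- function(docs, n) {
  grams <- unlist(lapply(docs, function(d) make_ngrams(preprocess(d), n)))
  if (length(grams) == 0) return(integer(0))
  tab    <- table(grams)
  counts <- as.integer(tab)
  names(counts) <- names(tab)
  counts
}

# One frequency table per n-gram size (2, 3, 4), keyed by "2", "3", "4".
build_model <- function(docs) {
  model <- lapply(2:4, function(n) count_ngrams(docs, n))
  names(model) <- as.character(2:4)
  model
}

# Try the longest usable context first, then fall back to shorter ones.
predict_next <- function(input, model) {
  tokens <- preprocess(input)
  for (n in 4:2) {
    k      <- n - 1                      # context length for this n-gram size
    counts <- model[[as.character(n)]]
    if (length(tokens) < k || length(counts) == 0) next
    prefix  <- paste(c(tail(tokens, k), ""), collapse = " ")  # e.g. "love data "
    matches <- counts[startsWith(names(counts), prefix)]
    if (length(matches) > 0) {
      best <- names(matches)[which.max(matches)]  # ties: first alphabetically
      return(sub(".* ", "", best))                # keep only the last word
    }
  }
  NA_character_                                   # nothing matched
}

docs  <- c("I love data science!",
           "I love data analysis.",
           "We love data science and R.")
model <- build_model(docs)

predict_next("love data", model)   # "science"
predict_next("Data", model)        # "science"
predict_next("unseen words", model) # NA
```

On this toy corpus "love data science" occurs twice and "love data analysis" once, so the trigram table decides the first call; the second call has only one word of context and is answered from the bigram table. A real model would need far larger tables and some tie-breaking or smoothing strategy, which this sketch leaves out.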