The training data

The data was made available through HC corpora, a text collection data base from numerous sources in a variety of languages.

Our specific data set consisted of text data from three sources:

Twitter (159.4 MB)
News (196.3 MB)
Blogs (200.4 MB)

Noteworthy observations: 1. The lines of text from these sources differ vastly in vocabulary (formality vs informality). 1. The text size of the different sources differs vastly, since tweets, for example, had a character limit of 140 up until 2017.

These three datasets were combined, from which 100.000 lines were sampled as training data for the acquisition of our model.

Preprocessing

The samples were preprocessed through the use of two major Natural Language Programming (NLP) packages in R, namely RWeka and tm.

The tm package allowed for the removal of undesired elements in the textlines:

removal of whitespace, numbers, punctuation and sparse occurrences
adjusting all text to be lower case
creation of text document matrices to assess frequencies of n-grams

The RWeka package allowed for the creation of n-gram tokenizers:

Example text: ‘I went shopping today’:

unigrams - ‘I’—–‘went’—–‘shopping’—–‘today’
bigrams -‘I went’—–‘went shopping’—–‘shopping today’
trigrams - ‘I went shopping’—–‘went shopping today’
…

The preprocessing steps are documented on github.

Custom dependencies

Three additional functions were created to process the Text Document Matrices (TDM) into parsable n-gram tables. These are look-up tables, designed to match a phrase with the prov column and to extract a predicted word from the pred column. The n-gram is divided in two part, the last word and the stretch of word(s) before that. Therefore quadgrams, as shown below, have a three word sequence, for which they can predict a fourth.

            prov   pred freq
1 thanks for the follow  172
2     the end of    the  150
3    the rest of    the  122
4     at the end     of  114
5  thank you for    the  113
6   cant wait to    see  111
7  for the first   time   98
8    is going to     be   97

wordCount(x) – counts the words in a provided string
ngram_df_maker(x) – transforms the DTM frequency matrix into a parsable table
predictWord(x,uni,bi,tri,quad) – a table parsing function that grabs the best match for the largest perceivable gram it can find

The app

Two files are loaded into the shiny app:

ngram_data.RData (contains the n-gram data frames, 487.1 KB)
Functions.R (contains the three custom functions for word prediction)

The computational bottleneck is entirely on the acquisition of the n-gram data frames. This calculation takes several minutes for the 100.000 samples in this training set, but would take hours for the complete data. The resulting data frames can however be tightly cached. Parsing the data tables for a word prediction is however only a matter of milliseconds.

system.time(print(predictWord('This presentation is great',
                              uni = unigram_DF,
                              bi = bigram_DF, 
                              tri = trigram_DF,
                              quad = quadgram_DF)))

[1] "because"

   user  system elapsed 
   0.04    0.00    0.05

The shiny app has a singular text input box and a submission button. The predicted word appears in the right hand panel.

Capstone project: Development of a text prediction algorithm through the use of an ngram model

The training data

Preprocessing

Custom dependencies

The app

Capstone project:
Development of a text prediction algorithm
through the use of an ngram model