Lukas
2024-03-01
The data was made available through HC Corpora, a database of text collected from numerous sources in a variety of languages.
Our specific data set consisted of text data from three sources:
Noteworthy observations:
1. The lines of text from these sources differ vastly in vocabulary (formal vs. informal).
2. The sources also differ vastly in text length, since tweets, for example, had a character limit of 140 until 2017.
These three datasets were combined, and 100,000 lines were sampled from the combined set as training data for building our model.
The samples were preprocessed with two major Natural Language Processing (NLP) packages in R, namely RWeka and tm.
The tm package allowed for the removal of undesired elements from the lines of text:
The RWeka package allowed for the creation of n-gram tokenizers:
Example text: ‘I went shopping today’:
The preprocessing steps are documented on GitHub.
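To illustrate what n-gram tokenization does to the example text above, here is a minimal base R sketch. It is not the RWeka tokenizer used in the project; `make_ngrams` is a hypothetical helper that simply splits on whitespace and slides a window of n words over the result.

```r
# Hypothetical helper (not part of RWeka): split text on whitespace and
# return every run of n consecutive words as one string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Fewer than n words: no n-gram can be formed.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

example_text <- "I went shopping today"
bigrams   <- make_ngrams(example_text, 2)
trigrams  <- make_ngrams(example_text, 3)
quadgrams <- make_ngrams(example_text, 4)

print(bigrams)
print(trigrams)
print(quadgrams)
```

A four-word sentence therefore yields three bigrams, two trigrams and a single quadgram.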
Three additional functions were created to process the Term Document Matrices (TDM) into parsable n-gram tables. These are look-up tables, designed to match a phrase against the prov column and to extract the predicted word from the pred column. Each n-gram is split into two parts: the last word and the word(s) preceding it. Quadgrams, as shown below, therefore hold a three-word sequence for which they can predict a fourth word.
  prov            pred    freq
1 thanks for the  follow   172
2 the end of      the      150
3 the rest of     the      122
4 at the end      of       114
5 thank you for   the      113
6 cant wait to    see      111
7 for the first   time      98
8 is going to     be        97
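The split into prov, pred and freq can be sketched in base R as follows. This is only an illustration under stated assumptions: `build_ngram_table` is a hypothetical helper, it starts from a plain character vector of n-grams (n of 2 or more) rather than from a Term Document Matrix, and the toy input is made up.

```r
# Hypothetical helper: turn a character vector of n-grams (one entry per
# occurrence) into a look-up table with columns prov, pred and freq.
build_ngram_table <- function(ngrams) {
  counts  <- table(ngrams)
  phrases <- names(counts)
  # Everything before the last word is the preceding sequence (prov);
  # the final word is the prediction (pred).
  prov <- sub("\\s+\\S+$", "", phrases)
  pred <- sub("^.*\\s", "", phrases)
  out <- data.frame(prov = prov, pred = pred,
                    freq = as.integer(counts),
                    stringsAsFactors = FALSE)
  # Most frequent n-grams first.
  out <- out[order(-out$freq, out$prov, out$pred), ]
  rownames(out) <- NULL
  out
}

# Toy input (made up for illustration).
toy <- c("thanks for the follow", "thanks for the follow",
         "the end of the", "thanks for the follow",
         "the end of the", "at the end of")
quad_table <- build_ngram_table(toy)
print(quad_table)
```

Sorting by descending frequency keeps the most likely prediction for a given prov at the top of its group of rows.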
Two files are loaded into the shiny app:
The computational bottleneck lies entirely in building the n-gram data frames. This calculation takes several minutes for the 100,000 samples in this training set, and would take hours for the complete data. The resulting data frames can, however, be cached. Looking up a word prediction in the data tables then takes only milliseconds.
system.time(print(predictWord('This presentation is great',
                              uni = unigram_DF,
                              bi = bigram_DF,
                              tri = trigram_DF,
                              quad = quadgram_DF)))
[1] "because"
   user  system elapsed
   0.04    0.00    0.05
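One way such a look-up could work is a simple back-off: try the longest context first and fall back to shorter ones. The sketch below is an assumption, not the app's actual `predictWord`. The function `predict_word_sketch`, the back-off order, the input cleaning, the layout of the unigram table (columns pred and freq) and all toy frequencies are hypothetical.

```r
# Hypothetical stand-in for the app's predictWord(); the back-off order and
# the unigram table layout (columns pred and freq) are assumptions.
predict_word_sketch <- function(phrase, uni, bi, tri, quad) {
  # Assumed cleaning: lower case, keep letters and spaces only.
  cleaned <- gsub("[^a-z ]", "", tolower(phrase))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  words <- words[nzchar(words)]

  tables      <- list(quad, tri, bi)  # longest context first
  context_len <- c(3, 2, 1)           # words of context each table needs
  for (k in seq_along(tables)) {
    n <- context_len[k]
    if (length(words) < n) next
    key  <- paste(utils::tail(words, n), collapse = " ")
    hits <- tables[[k]][tables[[k]]$prov == key, ]
    if (nrow(hits) > 0) return(hits$pred[which.max(hits$freq)])
  }
  # No context matched: fall back to the most frequent single word.
  uni$pred[which.max(uni$freq)]
}

# Toy tables with made-up frequencies.
quad_toy <- data.frame(prov = c("cant wait to", "is going to"),
                       pred = c("see", "be"), freq = c(30, 20),
                       stringsAsFactors = FALSE)
tri_toy  <- data.frame(prov = c("going to", "going to"),
                       pred = c("be", "the"), freq = c(40, 25),
                       stringsAsFactors = FALSE)
bi_toy   <- data.frame(prov = "to", pred = "the", freq = 300,
                       stringsAsFactors = FALSE)
uni_toy  <- data.frame(pred = c("the", "and"), freq = c(1000, 600),
                       stringsAsFactors = FALSE)

p1 <- predict_word_sketch("I can't wait to", uni_toy, bi_toy, tri_toy, quad_toy)
p2 <- predict_word_sketch("We are going to", uni_toy, bi_toy, tri_toy, quad_toy)
p3 <- predict_word_sketch("Hello", uni_toy, bi_toy, tri_toy, quad_toy)
print(c(p1, p2, p3))
```

Because each step is a filter on an in-memory data frame, a look-up of this kind stays fast once the tables are built.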
The shiny app has a single text input box and a submit button. The predicted word appears in the right-hand panel.