Beatriz Jiménez Franco
Link shinyapp:
Given the training corpus, the workflow for building the text-predictor consists of the following steps:
Also, i built a priori a model dictionary during the training process. It covers 74 % of the training corpus with a length of 1000 words.
Word coverage fraction
Dictionary length: 1000
Coverage fraction (w/ EOS): 74.2 %
Coverage fraction (w/o EOS): 72.4 %
The first two columns correspond to the words of 2-gram prefixes observed in the training corpus, and the other columns correspond to the top three predictions for these 2-grams. When a 2-gram w1w2 is given as input for text prediction, it is first looked for in the prefix columns of trigrams table. If not found, w2 is looked for in the prefix column of bigrams table. If this also fails, the prediction is performed without any prefix, that is, we simply predict the most frequent words in unigrams.
w1 w2 prediction1 prediction2 prediction3
[1,] 657 529 1 4 86
[2,] 713 238 1 4 10
[3,] 78 832 1 3 35
[4,] 904 127 1 3 2
[5,] 749 367 1 4 10
[6,] 105 600 1 9 2
Enter “i love” and press the submit buttom, it will returns the top 3 next word predictions:(NOT ONLY ONE!). Word “the” has the higher probability, “you” has the second top probability and “it” the third top probability.
[1] "the" "you" "it"
You will also see in the shiny the top 5 predictions and their probabilities:
# A tibble: 5 x 2
completion probability
<chr> <dbl>
1 the 0.135
2 you 0.0693
3 it 0.0613
4 to 0.0560
5 this 0.0480