Johan
July 11th, 2018
Summary
This small piece of software takes a sequence of words and predicts the most likely next word. The model was trained on a sample of blog articles, news stories and tweets. Its prediction performance is poor in absolute terms, but encouraging given the simplicity of the prediction model.
It tokenizes the text to create n-grams
It counts how many times each n-gram appears in the text and ranks them by frequency (a sketch of these two steps follows below)
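The training code itself is not reproduced in this summary. The sketch below shows, in base R only, one way such a frequency table could be built; the function name build_ngram_model and the text-cleaning rules are assumptions, but the columns mirror the table shown next.

# Minimal sketch (assumption, base R only): build an n-gram frequency table
# from a character vector of documents.
build_ngram_model <- function(texts, n = 5) {
  clean <- tolower(texts)
  clean <- gsub("[^a-z' ]", " ", clean)            # keep letters and apostrophes
  words <- unlist(strsplit(clean, "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(NULL)
  # slide an n-word window over the (concatenated) word stream
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(
    input      = sub(" \\S+$", "", names(freq)),   # all words but the last
    prediction = sub("^.* ", "", names(freq)),     # the last word
    Freq       = as.integer(freq),
    stringsAsFactors = FALSE
  )
}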
head(head5GramsModel)
                   input prediction Freq
1      at the end of the        day  102
2   on the other side of        the   74
3   i just finished a mi        run   60
4 just finished a mi run       with   59
5  thank you so much for        the   59
6   in the middle of the      night   58
word <- "This is a short sequence of"
gramsPred(word,
model1=grams1model,
model2=grams2model,
model3=grams3model,
model4=grams4model,
model5=grams5model,
top=3)
              input prediction Freq    acc
 138361 sequence of     events    6 23.077
 176548 sequence of        the    5 19.231
5179014 sequence of         10    1  3.846
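The implementation of gramsPred is not reproduced here. The sketch below illustrates one plausible back-off scheme over the n-gram tables: it queries the longest model first and falls back to shorter ones. The function name predict_next, the back-off order and the computation of acc as a relative frequency in percent are assumptions.

# Hypothetical back-off predictor over the n-gram tables (assumption): query
# the longest model first and fall back to shorter ones until `top`
# candidates are found.
predict_next <- function(phrase, models, top = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  hits <- NULL
  for (n in rev(seq_along(models))) {        # e.g. 5-gram table down to 2-gram
    if (n == 1 || length(words) < n - 1) next
    key   <- paste(tail(words, n - 1), collapse = " ")
    found <- models[[n]][models[[n]]$input == key, ]
    hits  <- rbind(hits, found)
    if (!is.null(hits) && nrow(hits) >= top) break
  }
  if (is.null(hits) || nrow(hits) == 0) return(NULL)
  hits$acc <- round(100 * hits$Freq / sum(hits$Freq), 3)
  head(hits[order(-hits$Freq), ], top)
}

# Example call, mirroring gramsPred above:
# predict_next("This is a short sequence of",
#              models = list(grams1model, grams2model, grams3model,
#                            grams4model, grams5model))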
time correctPred
Min. :0.02898 Mode :logical
1st Qu.:0.40270 FALSE:797
Median :0.56838 TRUE :174
Mean :1.76139 NA's :12
3rd Qu.:3.51138
Max. :8.01202
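The summary above reports, for each test sentence, the prediction time in seconds and whether the true next word was among the proposed predictions. A benchmark with that shape could be collected roughly as follows; the loop, the names benchmark and test_sentences, and the reuse of the predict_next sketch above are hypothetical.

# Sketch of a possible benchmark loop (assumption): for each held-out
# sentence, time the prediction and record whether the true next word
# appears among the top candidates.
benchmark <- function(sentences, models, top = 3) {
  rows <- lapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(s), "\\s+"))
    if (length(words) < 2) return(NULL)
    target <- tail(words, 1)                       # word to be predicted
    query  <- paste(head(words, -1), collapse = " ")
    t0     <- Sys.time()
    pred   <- predict_next(query, models, top = top)
    data.frame(
      time        = as.numeric(difftime(Sys.time(), t0, units = "secs")),
      correctPred = if (is.null(pred)) NA else target %in% pred$prediction
    )
  })
  do.call(rbind, rows)
}

# summary(benchmark(test_sentences, models))  # test_sentences is hypothetical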
This relatively low performance (about 18% correct predictions in the benchmark above) can be explained by the simplicity of the model, which relies only on n-gram frequencies. The model uses n-grams of 1 to 5 words and is trained on a sample of 20% of the original text documents. Execution time is also too long: an acceptable prediction time should be below 0.5 seconds, while the mean observed above is about 1.76 seconds. Keep in mind that this is an exploratory exercise; with more resources, better results can be expected.
The side panel: adjusting the performance
The main panel: inputting the word sequence and getting the prediction
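As a rough illustration of that layout, a minimal Shiny app with a side panel and a main panel could look like the sketch below. The widget names, the use of a slider as the performance control, and the call to the predict_next sketch above are assumptions, not the deployed app's code.

library(shiny)

# Rough sketch of the described layout (assumption): a side panel with a
# performance control and a main panel with the text input and predictions.
ui <- fluidPage(
  titlePanel("Next-word prediction"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("top", "Number of predictions to return",
                  min = 1, max = 5, value = 3)
    ),
    mainPanel(
      textInput("phrase", "Type a word sequence"),
      tableOutput("predictions")
    )
  )
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$phrase)
    # `models` would be loaded at startup; predict_next is the sketch above
    predict_next(input$phrase, models, top = input$top)
  })
}

# shinyApp(ui, server)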