We have created a web application that predicts the next word a person is going to write. It works in a very simple way:
The training data contains about 550 MB of text from tweets, news, and blogs. It comes to 4,000,000 lines, which we process as follows:
The most popular word is "the", so if there is no other data we provide it as the default next word. We treat the 2-grams as follows: we group them by the first word and then find the most popular second word. This word is the prediction when the previous word is provided. We treat 3-grams similarly: we group them by the first and second words and find the most popular third word.
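The lookup described above can be sketched in a few lines. This is only an illustration in Python, not the application's actual code: the function names `build_model` and `predict`, the lowercasing, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_model(lines):
    """Count 1-, 2- and 3-grams and keep the most frequent continuation per context."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # first word -> counts of the second word
    trigrams = defaultdict(Counter)  # (first, second) -> counts of the third word
    for line in lines:
        # Assumption: naive lowercasing and whitespace tokenization.
        words = line.lower().split()
        unigrams.update(words)
        for a, b in zip(words, words[1:]):
            bigrams[a][b] += 1
        for a, b, c in zip(words, words[1:], words[2:]):
            trigrams[(a, b)][c] += 1
    # Most frequent single word, used when no context matches.
    default = unigrams.most_common(1)[0][0] if unigrams else None
    best2 = {ctx: cnt.most_common(1)[0][0] for ctx, cnt in bigrams.items()}
    best3 = {ctx: cnt.most_common(1)[0][0] for ctx, cnt in trigrams.items()}
    return default, best2, best3


def predict(model, text):
    """Try the last two words, then the last word, then fall back to the default."""
    default, best2, best3 = model
    words = text.lower().split()
    if len(words) >= 2 and tuple(words[-2:]) in best3:
        return best3[tuple(words[-2:])]
    if words and words[-1] in best2:
        return best2[words[-1]]
    return default


corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the mat",
]
model = build_model(corpus)
```

With this toy corpus, `predict(model, "on the")` uses the 3-gram table, `predict(model, "dog")` uses the 2-gram table, and `predict(model, "")` returns the default word.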
Since the free Shiny server restricts file sizes, we consider only n-grams that have appeared at least twice.
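The pruning step can be illustrated as follows. This is a hedged sketch in Python rather than the application's own code; the helper name `prune` and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def prune(counts, min_count=2):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({gram: n for gram, n in counts.items() if n >= min_count})


# Toy example: count 2-grams in a short text, then drop the ones seen once.
words = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(words, words[1:]))
kept = prune(bigram_counts)
```

Here only the pair `("the", "cat")` occurs twice, so it is the only 2-gram that survives. In a large corpus most n-grams occur once, which is why a threshold like this reduces the stored tables considerably.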
The accuracy of the model is 20%. We did not remove stop words, since keeping them leads to better accuracy.