Data
Train data contains about 550MB of text that are tweets, news, blogs. It is 4,000,000 lines that we process by:
- removing everything that is not a letter,
- replace upper case letters for lower case,
- from each line of data we create 1, 2, 3, 4 grams (it was hard to procces data using R so we have written a C code that creates Binary Search Trees that contains grams),
- we create csv files with created grams and frequecies of each gram.
Predictive Algorithm: Simple model
- The most popular word is βtheβ. So if there is no other data we provide it as the default next word.
- The 2-grams we treat as follows. We group then by the first word and then we find the most popular second word. This word is the predicted word if the prevous word is provided.
- Similary we treat 3-grams and 4-grams. We group them by firsts words and we find the next most popular word.
Since the free shiny server has restictions of size of files we consider only grams that had appeared at least twice in the train data.
Accuracy
The accuracy of the model is 20%. We did not remove stop words since that leads to better accuracy.