The objective is to predict the next word of an unfinished sentence. Text from Twitter, blogs, and news sources was used. The data were cleaned and used to build an n-gram model (a natural language model), which relies on how frequently words appear together.
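To make the cleaning and counting step concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the code used in the project: the helper names `clean` and `ngram_counts`, the cleaning rules (lowercasing, keeping only letters and apostrophes), and the three sample sentences are all hypothetical.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text, replace anything that is not a letter,
    apostrophe, or space with a space, then split on whitespace.
    (Assumed cleaning rules; the original rules are not stated.)"""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngram_counts(lines, n=3):
    """Count how often each run of n consecutive words appears."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for lines drawn from the text files.
sample = [
    "Thanks for the follow!",
    "Thanks for the great day.",
    "thanks for coming",
]
trigrams = ngram_counts(sample, n=3)

# The most frequent three-word associations come first.
for words, freq in trigrams.most_common(3):
    print(words, freq)
```

On this sample, the trigram ("thanks", "for", "the") is counted twice and every other trigram once, which is the kind of frequency information the model relies on.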
From the three text files, I created “sampled files”, and with an n-gram model I built a data frame with three columns, X1, X2, and Y, holding the most frequent three-word associations, where Y is the word predicted from the other two. I then trained a naive Bayes classifier on this table, which was a good way to build a “light” model.
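The X1/X2/Y table and the naive Bayes step can be sketched as follows. This is a hedged illustration in Python, not the project's implementation: the class `TrigramNaiveBayes`, the add-one (Laplace) smoothing parameter `alpha`, and the toy rows are assumptions introduced here. The classifier scores each candidate Y as P(Y) · P(X1 | Y) · P(X2 | Y) and returns the highest-scoring word.

```python
import math
from collections import Counter, defaultdict

class TrigramNaiveBayes:
    """Categorical naive Bayes over (X1, X2) -> Y with Laplace smoothing.
    Hypothetical helper written for illustration only."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength (assumed)
        self.y_counts = Counter()               # how often each Y occurs
        self.x1_counts = defaultdict(Counter)   # Y -> counts of X1
        self.x2_counts = defaultdict(Counter)   # Y -> counts of X2
        self.vocab1 = set()
        self.vocab2 = set()

    def fit(self, rows):
        for x1, x2, y in rows:
            self.y_counts[y] += 1
            self.x1_counts[y][x1] += 1
            self.x2_counts[y][x2] += 1
            self.vocab1.add(x1)
            self.vocab2.add(x2)
        return self

    def _log_score(self, x1, x2, y):
        n_y = self.y_counts[y]
        total = sum(self.y_counts.values())
        a = self.alpha
        log_prior = math.log(n_y / total)
        log_x1 = math.log((self.x1_counts[y][x1] + a) / (n_y + a * len(self.vocab1)))
        log_x2 = math.log((self.x2_counts[y][x2] + a) / (n_y + a * len(self.vocab2)))
        return log_prior + log_x1 + log_x2

    def predict(self, x1, x2):
        # Pick the candidate word with the highest log score.
        return max(self.y_counts, key=lambda y: self._log_score(x1, x2, y))

# Toy rows in the X1 / X2 / Y layout (made-up data, one row per occurrence).
rows = [
    ("thanks", "for", "the"),
    ("thanks", "for", "the"),
    ("thanks", "for", "coming"),
    ("for", "the", "follow"),
    ("for", "the", "great"),
    ("the", "great", "day"),
]
model = TrigramNaiveBayes().fit(rows)
print(model.predict("thanks", "for"))
print(model.predict("the", "great"))
```

Because the model stores only per-word counts rather than every full trigram probability, it stays small, which fits the goal of a “light” model. The trade-off is that naive Bayes treats X1 and X2 as independent given Y, so word order information is only partly captured.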