In this project, only the English files retrieved from blogs, Twitter, and news were used, and a random sample of 1,500 lines from each file was considered due to computational cost constraints.
Three text files were sampled with the function LaF::sample_lines (see the sketch below):
en_US.blogs.txt: 1,500 random lines from 899,288 total lines
en_US.twitter.txt: 1,500 random lines from 2,360,148 total lines
en_US.news.txt: 1,500 random lines from 1,010,242 total lines
The three files together contain 4,269,678 lines, of which only the 4,500 randomly sampled lines were used. This corresponds to approximately 0.1% of the data.
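A minimal sketch of the sampling step, assuming the raw files sit in a local final/en_US/ directory (the path and the seed are assumptions, not taken from the original analysis):

```r
library(LaF)

set.seed(1234)  # assumed seed, only so the sample is reproducible

# Draw 1,500 random lines from each corpus file
blogs   <- sample_lines("final/en_US/en_US.blogs.txt",   n = 1500)
twitter <- sample_lines("final/en_US/en_US.twitter.txt", n = 1500)
news    <- sample_lines("final/en_US/en_US.news.txt",    n = 1500)

corpus_lines <- c(blogs, twitter, news)  # 4,500 lines in total
```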
In the data cleaning step, lines containing profanity were removed. The list of profanity words was retrieved from the lexicon package. After this filter, the number of lines was reduced to 4,323.
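The report does not state which of the lexicon package's profanity lists was used; the sketch below uses lexicon::profanity_alvarez as one plausible choice and only checks single-word matches:

```r
library(lexicon)

# corpus_lines: the sampled lines from the previous sketch
profanity <- unique(tolower(profanity_alvarez))

# Flag a line if any of its tokens appears in the profanity list
has_profanity <- vapply(
  strsplit(tolower(corpus_lines), "[^a-z']+"),
  function(tokens) any(tokens %in% profanity),
  logical(1)
)

clean_lines <- corpus_lines[!has_profanity]
length(clean_lines)  # 4,323 lines remained in the original analysis
```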
Also in the text cleaning steps, numbers and special characters were removed and white space was stripped. Stop words were not removed because, in this type of prediction problem, stop words can serve as features and labels in the model, i.e. as predictors or targets.
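A base R version of these cleaning steps (the original analysis may have used a text-mining package such as tm instead; this is only a sketch):

```r
# clean_lines: profanity-filtered lines from the previous sketch
clean_text <- gsub("[0-9]+", " ", clean_lines)      # remove numbers
clean_text <- gsub("[^A-Za-z' ]", " ", clean_text)  # remove special characters
clean_text <- gsub("\\s+", " ", clean_text)         # collapse runs of white space
clean_text <- trimws(clean_text)                    # strip leading/trailing spaces
```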
The word frequency distributions of the three files follow the pattern common to natural-language text (a small number of words accounts for most occurrences), and the problem suggests the use of an n-gram model to predict the next word as an option for the user who will be typing text in the application.
Some sentence endings are more likely than others, conditioned on the words that came before.
An n-gram model assigns a probability score to each candidate word based on the corpus text provided to the model as training data. From the many word combinations observed in that corpus, the model learns to correctly predict the most common next word given the previous words of a sentence.
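For a bigram model, for example, the maximum-likelihood estimate is P(w2 | w1) = count(w1 w2) / count(w1). Below is a toy illustration of that idea on the cleaned sample; the function name is illustrative, and a usable model would also need smoothing or back-off for unseen word pairs:

```r
# clean_text: cleaned lines from the previous sketch
words   <- unlist(strsplit(tolower(clean_text), "\\s+"))
bigrams <- paste(head(words, -1), tail(words, -1))

bigram_counts  <- table(bigrams)
unigram_counts <- table(words)

# Most likely next word after `previous`, by raw bigram counts
predict_next <- function(previous) {
  candidates <- bigram_counts[startsWith(names(bigram_counts),
                                         paste0(previous, " "))]
  if (length(candidates) == 0) return(NA_character_)
  best <- names(which.max(candidates))  # e.g. "of the"
  strsplit(best, " ")[[1]][2]           # return the second word
}

predict_next("of")  # typically "the" in English text

# MLE estimate of P("the" | "of"), assuming the pair occurs in the sample
bigram_counts[["of the"]] / unigram_counts[["of"]]
```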