The data come in three files: one from blogs, one from news sources, and one from various Twitter handles. The files were read in and the number of lines and words in each was counted:
The number of lines in the Twitter file is 2,360,148 and the number of words is 30,218,125.
The number of lines in the blogs file is 899,288 and the number of words is 38,154,238.
The number of lines in the news file is 77,259 and the number of words is 2,693,898.
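As an illustration of how counts like these can be obtained, here is a minimal Python sketch. It is not the code that produced the figures above. The file names are hypothetical placeholders, and a "word" is assumed to be any whitespace-separated token, so the totals may differ slightly from those of other tokenizers.

```python
from pathlib import Path

def count_lines_and_words(path, encoding="utf-8"):
    """Return (line_count, word_count) for a text file, reading it line by line."""
    lines = 0
    words = 0
    # errors="ignore" skips bytes that cannot be decoded (an assumption about the data)
    with open(path, encoding=encoding, errors="ignore") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace-separated tokens
    return lines, words

# Hypothetical file names; adjust them to wherever the three files are stored.
for name in ("twitter.txt", "blogs.txt", "news.txt"):
    if Path(name).exists():
        n_lines, n_words = count_lines_and_words(name)
        print(f"{name}: {n_lines:,} lines, {n_words:,} words")
```

Reading line by line keeps memory use flat, which matters for files with millions of lines.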
Ideally, the next step would be a histogram of the most frequently occurring words. A naive way to build it is to loop over all the lines for each word. Considering that there are about 2.3 million lines in the Twitter file alone, and assuming about 10 words per line, that would mean over 20 million iterations with a basic “for” loop (algorithm: for each word, go through each line and count the number of occurrences). To keep the run time manageable, only 5,000 lines were taken from each file.
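One possible alternative to the nested loop is a single pass that keeps a running tally in a hash table. The Python sketch below assumes a simple regular-expression tokenizer and uses a two-line toy sample in place of the real data. The function name `word_counts` and the `max_lines` parameter are made up for this illustration.

```python
import re
from collections import Counter
from itertools import islice

# Lowercase words, optionally with one internal apostrophe (e.g. "don't")
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def word_counts(lines, max_lines=5000):
    """Count word frequencies in one pass over at most max_lines lines."""
    counts = Counter()
    for line in islice(lines, max_lines):
        counts.update(TOKEN.findall(line.lower()))
    return counts

# Toy sample standing in for the first lines of a file
sample = ["The cat and the dog", "the dog and THE bird"]
counts = word_counts(sample)

# Text histogram of the three most frequent words
for word, n in counts.most_common(3):
    print(f"{word:<6}{'#' * n} {n}")
```

Each word is touched once, so the work grows with the total number of words rather than with words times lines. An open file handle can be passed in place of `sample`.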
The next step would be to build an n-gram model. N-gram models help predict the next word a person is going to use, given the n-1 words he/she has already typed. However, the exploratory analysis above shows that words like ‘The’, ‘And’, ‘To’ etc. occur very frequently in the data provided. These might or might not be useful while training the n-gram model. The first run will include these words; if that does not give good accuracy, they will be excluded. Also, the model will have to handle n-grams that did not appear in the data provided, so that it does not fail when a new n-gram comes up.
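To make the idea concrete, here is a minimal Python sketch of a bigram model (n = 2) that falls back to the most frequent word overall when it meets a context it has never seen. The fallback rule, the toy corpus, and the function names `train_bigrams` and `predict_next` are assumptions made for illustration, not the model this report will eventually use.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count single words and (previous word -> next word) pairs."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams

def predict_next(prev_word, unigrams, bigrams):
    """Return the most likely next word, backing off for unseen contexts."""
    followers = bigrams.get(prev_word.lower())
    if followers:
        return followers.most_common(1)[0][0]
    # Unseen context: fall back to the most frequent word overall
    return unigrams.most_common(1)[0][0] if unigrams else None

# Toy corpus standing in for the sampled lines
corpus = ["i like green tea", "i like black coffee", "you like green apples"]
unigrams, bigrams = train_bigrams(corpus)

print(predict_next("like", unigrams, bigrams))   # seen context
print(predict_next("zebra", unigrams, bigrams))  # unseen context, uses the fallback
```

Because the fallback always returns something, the predictor does not fail on a new n-gram. Removing words such as ‘The’ or ‘And’ before training would only require filtering `tokens` inside `train_bigrams`.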