We are given three files: one containing blogs, one containing news articles, and one containing Twitter tweets.
The blogs file contains 899288 lines and 36815824 words, of which 397690 are unique. The average number of words per line is 40.9389.
The news file contains 1010242 lines and 33468015 words, of which 311115 are unique. The average number of words per line is 33.1287.
The twitter file contains 2360148 lines and 29354754 words, of which 478662 are unique. The average number of words per line is 12.4377.
In total there are 99638593 words, of which 904589 are unique across the three files. This gives an average of 110.1479 uses per unique word.
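The per-file figures above can be reproduced with a short script. The following is a minimal sketch in Python rather than R. It assumes a simple tokenizer (lowercase runs of letters and apostrophes); the tokenization behind the numbers in this report is not stated, so counts from this sketch would likely differ somewhat. The names `tokenize` and `corpus_stats` are illustrative only.

```python
import re

# Assumed tokenization: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return TOKEN_RE.findall(line.lower())


def corpus_stats(lines):
    """Count lines, words, unique words, and average words per line."""
    n_lines = 0
    n_words = 0
    vocab = set()
    for line in lines:
        tokens = tokenize(line)
        n_lines += 1
        n_words += len(tokens)
        vocab.update(tokens)
    return {
        "lines": n_lines,
        "words": n_words,
        "unique": len(vocab),
        "words_per_line": n_words / n_lines if n_lines else 0.0,
    }


# Toy input standing in for one of the three files.
sample = ["The cat sat on the mat", "The dog barked"]
stats = corpus_stats(sample)

# The overall average quoted above is total words divided by unique words.
average_uses = 99638593 / 904589
```

For a real file, the same function can be fed an open file handle, for example `corpus_stats(open(path, encoding="utf-8"))`, since it only iterates over lines once.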
For a more detailed analysis, we remove the stop words defined by the R tm package. We can then look at the top 100 words excluding these stop words:
will said just one like can get time new good now day know love people us back go see first also make going think last great much year two really way today got even want work still right years thanks need many rt life say take come made little never home may best u next week night things school something game lol always around another happy better every state world look show big long since man feel city help three sure hope thing follow find use days getting lot keep says ever house place team put tonight family part give
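A ranking like the one above can be produced by counting tokens and skipping stop words. The sketch below is in Python and uses a tiny stand-in stop word set, not the tm package's list, so it only illustrates the procedure. The function name `top_words` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stop word set (assumption); the report uses the R tm list.
STOP_WORDS = {"the", "a", "of", "to", "and", "i", "is", "in"}


def top_words(lines, stop_words, n):
    """Return the n most frequent non-stop words with their counts."""
    counts = Counter()
    for line in lines:
        for token in re.findall(r"[a-z']+", line.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts.most_common(n)


sample_lines = [
    "I love the game",
    "The game is on tonight",
    "love to see the game",
]
top = top_words(sample_lines, STOP_WORDS, 2)
print(top)
```

Here "game" appears three times and "love" twice, so they are the two words returned.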
We can also plot the usage count of the top 1000 words excluding stop words.
This shows us that the most frequently used 20% of words account for 98.3008% of all word usage.
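The coverage figure is the summed count of the most frequent words divided by the summed count of all words. A minimal Python sketch of that calculation follows; the function name `coverage_of_top` and the toy counts are made up for illustration and are not the report's data.

```python
import math
from collections import Counter


def coverage_of_top(counts, fraction):
    """Share of all word uses covered by the top `fraction` of distinct words."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    k = math.ceil(len(ordered) * fraction)  # number of words in the top slice
    return sum(ordered[:k]) / total


# Toy counts: 5 distinct words, 100 uses in total.
toy_counts = Counter({"game": 50, "love": 30, "see": 10, "home": 6, "team": 4})
share = coverage_of_top(toy_counts, 0.2)
print(f"{share:.2%}")
```

In the toy data the top 20% is a single word ("game") that covers half of all uses.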
We can also compare the number of stop words and non-stop words by file.
For building a predictive algorithm, the most important feature appears to be the word immediately before the word being predicted. If that word is a stop word, it may help to include earlier words as well, going back until a word that is not a stop word is found. We can then use occurrences of that word or sequence of words in our training corpus to find candidate predictions. When multiple candidate predictions exist, we could build a model that uses the counts of the preceding words in our corpus to predict the subsequent word.
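One way the idea above could be prototyped is sketched below in Python. This is only an illustration under several assumptions, not a finished design: the stop word set is a small stand-in, the context is capped at `MAX_CONTEXT` words, candidates are ranked by raw follow-up counts, and shorter suffixes of each context are also stored so that an unseen context can fall back to a shorter one. All names here are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Small stand-in stop word set (assumption); the report uses the R tm list.
STOP_WORDS = {"the", "a", "of", "to", "and", "for", "at", "in", "on"}
MAX_CONTEXT = 4  # assumed cap on how far back the context may extend


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def context_before(tokens, end):
    """Words before index `end`, extended left until a non-stop word is included."""
    start = end - 1
    while start > 0 and tokens[start] in STOP_WORDS and end - start < MAX_CONTEXT:
        start -= 1
    return tuple(tokens[start:end])


def train(lines):
    """Map each context (and each of its suffixes) to counts of the next word."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for i in range(1, len(tokens)):
            context = context_before(tokens, i)
            for k in range(len(context)):
                model[context[k:]][tokens[i]] += 1
    return model


def predict(model, text, n=3):
    """Return up to n candidate next words, most frequent first."""
    tokens = tokenize(text)
    if not tokens:
        return []
    context = context_before(tokens, len(tokens))
    while context:
        if context in model:
            return [word for word, _ in model[context].most_common(n)]
        context = context[1:]  # fall back to a shorter context
    return []


training_lines = [
    "happy new year to you",
    "happy new year to everyone",
    "a new year to you",
    "thanks for the follow",
    "thanks for the follow back",
    "see you at the game",
]
model = train(training_lines)
print(predict(model, "year to"))         # context is ("year", "to")
print(predict(model, "thanks for the"))  # context is ("thanks", "for", "the")
print(predict(model, "watch the", n=1))  # unseen context, falls back to ("the",)
```

Ranking by raw counts is the simplest choice; a smoothed n-gram model would be a natural refinement once the candidate lookup works.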