The three text files in our data set contain blog posts, news posts, and tweets.
The blog file contains 899,288 posts and 37,546,806 words, of which 319,546 are unique.
The news file contains 77,259 posts and 2,674,561 words, of which 86,601 are unique.
The Twitter file contains 2,360,148 tweets and 30,096,649 words, of which 367,972 are unique.
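
Counts like these can be computed in a single pass over each file. The following is a minimal sketch, not the procedure used to produce the figures above: it assumes one post or tweet per line, and it assumes a "word" is a lowercase run of letters, digits, or apostrophes. A different tokenization rule would give different totals. The function names `summarize_lines` and `summarize_file` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a word is a run of letters, digits, or apostrophes,
# compared case-insensitively. The rule behind the reported counts is not stated.
WORD_RE = re.compile(r"[a-z0-9']+")


def summarize_lines(lines):
    """Return (entries, total words, unique words) for an iterable of text lines."""
    n_entries = 0
    n_words = 0
    vocabulary = set()
    for line in lines:
        n_entries += 1  # assumption: each line is one post or tweet
        words = WORD_RE.findall(line.lower())
        n_words += len(words)
        vocabulary.update(words)
    return n_entries, n_words, len(vocabulary)


def summarize_file(path):
    """Summarize one text file, reading it line by line to limit memory use."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return summarize_lines(handle)
```

Calling `summarize_file` on each of the three files (with whatever paths they have locally) returns the entry count, word count, and vocabulary size as a tuple. Streaming line by line means only the vocabulary set has to fit in memory, which matters for files of this size.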