The three text files in our data set contain blog posts, news posts, and tweets.
The blog file contains 899,288 posts and 37,546,806 words, of which 319,546 are unique. It takes 115 unique words to reach 50% coverage of the blog posts, and 6,778 unique words to reach 90% coverage.
The news file contains 77,259 posts and 2,674,561 words, of which 86,601 are unique. It takes 220 unique words to reach 50% coverage of the news posts, and 8,440 unique words to reach 90% coverage.
The Twitter file contains 2,360,148 tweets and 30,096,649 words, of which 367,972 are unique. It takes 131 unique words to reach 50% coverage of the tweets, and 5,555 unique words to reach 90% coverage.
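
The coverage figures above can be read as the number of most frequent words whose combined occurrences account for a given share of all word occurrences in a file. The report does not state how the words were tokenized or how coverage was computed, so the following is only a minimal sketch under that assumed definition. The helper names `tokenize` and `words_for_coverage`, the simple regular-expression tokenizer, and the sample sentence are all hypothetical and are not taken from the original analysis.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Assumption: a word is a run of letters and apostrophes. The
    original analysis may have used a different tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def words_for_coverage(counts, fraction):
    """Return how many of the most frequent words are needed so that
    their occurrences make up at least `fraction` of all occurrences."""
    total = sum(counts.values())
    target = fraction * total
    running = 0
    # Walk the vocabulary from most to least frequent word,
    # accumulating occurrences until the target share is reached.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target:
            return rank
    return len(counts)


# Toy input standing in for one of the text files.
sample = "the cat sat on the mat and the dog sat on the rug"
counts = Counter(tokenize(sample))

print("total words:", sum(counts.values()))
print("unique words:", len(counts))
print("words for 50% coverage:", words_for_coverage(counts, 0.5))
print("words for 90% coverage:", words_for_coverage(counts, 0.9))
```

Applied to a whole file, the same function would be called on a `Counter` built from every token in that file. The large gap between the 50% and 90% figures reported above is what one would expect from this definition, since a small set of very common words accounts for much of the text while the remaining coverage has to come from many rarer words.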