The number of texts available from each source is as follows: blogs had 899288 entries; news had 77259 entries; and Twitter had 2360148 entries.
In a 25% sample of the available texts, there were 146603 distinct words, accounting for a total of 7140031 word occurrences across the whole sample.
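As an illustration of how two such figures relate, the sketch below builds a word frequency table and reads off both the number of distinct words and the total number of word occurrences. It is a minimal sketch, not the method used for the numbers above: the tokenizer (lowercased runs of letters and apostrophes), the function name word_counts, and the toy input are all assumptions made for the example.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count how often each word occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase the text, then keep runs of
        # letters and apostrophes. A real pipeline may differ.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Toy stand-in for the sampled texts (hypothetical data).
sample = ["The cat sat on the mat.", "The dog sat too."]

counts = word_counts(sample)
unique_words = len(counts)          # number of distinct words
total_words = sum(counts.values())  # number of word occurrences

print(unique_words, total_words)
```

The distribution of word counts is then simply the distribution of the values in the frequency table.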
The distribution of word counts (restricted to counts of 20 or fewer) looks as follows:
Another way to look at the data is to count the n-grams that occur in the texts. An n-gram is a sequence of n consecutive words: a unigram is 1 word, a bigram is 2, a trigram is 3, and so forth. Given the large number of n-grams that occur only once, some trimming will be needed.
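The n-gram counting and trimming described above can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in this analysis: the helper names (ngrams, count_ngrams, trim), the min_count threshold of 2, and the pre-tokenized toy sentences are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples of n consecutive words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over a collection of token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts


def trim(counts, min_count=2):
    """Keep only n-grams seen at least min_count times (assumed threshold)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


# Toy, already tokenized input (hypothetical data).
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]

bigrams = count_ngrams(sentences, 2)
singletons = sum(1 for c in bigrams.values() if c == 1)  # n-grams seen once
trimmed = trim(bigrams)

print(len(bigrams), singletons, len(trimmed))
```

Dropping n-grams that occur only once shrinks the table considerably, at the cost of losing rare but valid word sequences, so the threshold is a trade-off to tune rather than a fixed rule.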