I had three files of US English text, corresponding to blogs, Twitter and news. First I read these files and computed some simple statistics on each of them.
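The write-up does not say which statistics were computed or which tool was used, so the following is only a minimal Python sketch of this kind of summary. The chosen measures (line, word and character counts, longest line) and the file names in `FILES` are assumptions for illustration, not the original setup.

```python
from pathlib import Path


def file_stats(path):
    """Return line, word and character counts plus the longest line length."""
    n_lines = n_words = n_chars = longest = 0
    # Read line by line so large files never have to fit in memory at once.
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            text = line.rstrip("\n")
            n_lines += 1
            n_words += len(text.split())  # whitespace-separated tokens
            n_chars += len(text)
            longest = max(longest, len(text))
    return {
        "lines": n_lines,
        "words": n_words,
        "chars": n_chars,
        "longest_line": longest,
    }


# Hypothetical file names; replace with the real paths.
FILES = {
    "blogs": "en_US.blogs.txt",
    "twitter": "en_US.twitter.txt",
    "news": "en_US.news.txt",
}


def summarize(files):
    """Compute stats for every file in the mapping that exists on disk."""
    return {name: file_stats(p) for name, p in files.items() if Path(p).exists()}


if __name__ == "__main__":
    for name, stats in summarize(FILES).items():
        print(name, stats)
```

Counting words by splitting on whitespace is a rough approximation; a proper tokenizer would give somewhat different numbers.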
Then I split each of the three files into quantiles and kept only the observations between the 25th and 75th percentiles, which cut the data size roughly in half. Finally, I appended all the remaining data (blogs, Twitter and news) into one file.
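The text does not state which variable the quantiles were computed on. The sketch below assumes it was line length in characters, which is only a guess. The function names `middle_half` and `build_corpus` are hypothetical, and the sources are passed in as lists of lines to keep the example self-contained.

```python
import statistics


def middle_half(lines):
    """Keep lines whose length lies between the 25th and 75th percentile.

    Assumption: the quantiles are taken over line length in characters.
    """
    lengths = [len(line) for line in lines]
    if len(lengths) < 2:
        # Quartiles are not meaningful for fewer than two observations.
        return list(lines)
    # n=4 yields the three quartile cut points; only Q1 and Q3 are needed.
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return [line for line in lines if q1 <= len(line) <= q3]


def build_corpus(sources):
    """Filter each source separately, then append the results in order."""
    corpus = []
    for lines in sources:
        corpus.extend(middle_half(lines))
    return corpus


if __name__ == "__main__":
    # Tiny made-up inputs standing in for the blogs, Twitter and news files.
    blogs = ["a" * k for k in range(1, 10)]
    twitter = ["b" * k for k in range(1, 10)]
    news = ["c" * k for k in range(1, 10)]
    combined = build_corpus([blogs, twitter, news])
    print(len(combined), "lines kept out of", len(blogs) + len(twitter) + len(news))
```

Filtering each source on its own quartiles, rather than on quartiles of the pooled data, keeps about half of every source. Because the bounds are inclusive, many lines of identical length can push the kept share above one half.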