Intro

This report presents an exploratory analysis of English text data from three different sources. Due to the vast amount of data, a random 1% sample of lines was taken from each source for the analysis (a sketch of the sampling step is given after the table):

Source    Lines in original source    Words in 1% sample
Blog      899288                      187753
News      77259                       14055
Twitter   2360148                     160120
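For concreteness, a minimal Python sketch of the sampling step is given below. The file names (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are assumptions and may differ from the actual corpus files; the fixed random seed is only there to make the sample reproducible.

```python
# Minimal sketch of the 1% line sampling step.
# File names are assumptions; adjust the paths to the actual corpus files.
import random

random.seed(42)  # fixed seed so the sample can be reproduced

def sample_lines(path, fraction=0.01):
    """Return roughly `fraction` of the lines of a text file, chosen at random."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return [line.rstrip("\n") for line in f if random.random() < fraction]

samples = {
    "Blog":    sample_lines("en_US.blogs.txt"),
    "News":    sample_lines("en_US.news.txt"),
    "Twitter": sample_lines("en_US.twitter.txt"),
}

for source, lines in samples.items():
    n_words = sum(len(line.split()) for line in lines)
    print(source, len(lines), "lines,", n_words, "words in sample")
```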

Analysis

Each sample was cleaned of words that either add little value to the content (e.g. “ohhh”) or have an offensive meaning. Word frequencies were then calculated; the top 20 most frequent words per source can be found below:
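A minimal sketch of the cleaning and counting step is shown below, reusing the `samples` dictionary from the sampling snippet above. The filler and offensive word lists are placeholders, since the actual exclusion lists used in the analysis are not specified here.

```python
# Sketch of the cleaning and word-frequency step; assumes `samples` from the
# sampling snippet above. The exclusion lists below are placeholders only.
import re
from collections import Counter

FILLER    = {"ohhh", "hmm", "lol"}    # words adding little value (placeholder list)
OFFENSIVE = {"badword1", "badword2"}  # offensive words (placeholder list)
EXCLUDED  = FILLER | OFFENSIVE

def top_words(lines, n=20):
    """Lower-case, tokenise, drop excluded words, return the n most frequent words."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(t for t in tokens if t not in EXCLUDED)
    return counts.most_common(n)

for source, lines in samples.items():
    print(source, top_words(lines))
```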

The different frequency distributions above suggest that word choice depends heavily on the environment. Most probably this is conditioned by the socio-economic background of the authors and by the generally different purposes of the platforms: news articles are written mostly by professional journalists with the purpose of informing, while a tweet can be posted by anyone. This specificity must be taken into account during modeling.

Further analysis covers the per-line frequencies of two-word combinations, so-called bigrams; a sketch of the counting step is shown below:
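The sketch reuses the `samples` dictionary and the same simple tokenisation as above; it is an illustration of per-line bigram counting, not the exact procedure used in the report.

```python
# Sketch of the per-line bigram frequency step; assumes `samples` from the
# sampling snippet above. Bigrams are counted within each line, so pairs
# never span line boundaries.
import re
from collections import Counter

def top_bigrams(lines, n=20):
    """Count adjacent word pairs within each line and return the n most frequent."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

for source, lines in samples.items():
    print(source, top_bigrams(lines))
```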

The initial assumption was that word combinations (bigrams) would repeat more often on Twitter, since its users' vocabulary is not as rich as that of professional writers. In news, on the other hand, clichés are not appreciated, which explains the low level of repeated word combinations there.

Conclusions

Forecasting based on the Twitter data seems easier to do, although spelling errors remain a significant problem there. Forecasting for the news, on the other hand, will require much more work, including synonym analysis. Blogs lie somewhere in between.