The current report presents an exploratory analysis of English text data from three different sources. Due to the vast amount of data, a random 1% sample was taken from each source for the analysis:
| Source | Number of lines in original source | Number of words in 1% sample |
|---|---|---|
| Blog | 899288 | 187753 |
| News | 77259 | 14055 |
| Twitter | 2360148 | 160120 |
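The sampling itself is straightforward; below is a minimal sketch of how the 1% samples could be drawn. The file names (`en_US.blogs.txt`, `en_US.news.txt`, `en_US.twitter.txt`) and the fixed random seed are assumptions for illustration, not part of the original pipeline.

```python
import random

random.seed(42)  # assumed seed, only for reproducibility of the sketch

def sample_lines(path, fraction=0.01):
    """Keep each line with probability `fraction` (an approximate 1% sample)."""
    sampled = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if random.random() < fraction:
                sampled.append(line.rstrip("\n"))
    return sampled

blog_sample = sample_lines("en_US.blogs.txt")
news_sample = sample_lines("en_US.news.txt")
twitter_sample = sample_lines("en_US.twitter.txt")
```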
The samples were cleaned of words that either add little value to the content (e.g. “ohhh”) or have an offensive meaning. Word frequencies were then calculated; the top 20 most frequent words per source can be found below:
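The cleaning and frequency counting can be sketched as follows. This assumes the `blog_sample` list from the snippet above and a hypothetical `stopwords.txt` file holding the filler and offensive words, one per line; neither name comes from the original analysis.

```python
import re
from collections import Counter

# Assumed list of filler and offensive words to exclude, one word per line.
with open("stopwords.txt", encoding="utf-8") as f:
    excluded = {w.strip().lower() for w in f if w.strip()}

def word_frequencies(lines, top_n=20):
    """Count lowercase alphabetic tokens, skipping excluded words."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        counts.update(t for t in tokens if t not in excluded)
    return counts.most_common(top_n)

top20_blog = word_frequencies(blog_sample)
```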
The different frequency distributions above suggest that word choice depends strongly on the environment. Most probably this is conditioned by the socio-economic background of the authors and by the different purposes of the platforms: while news texts are written mostly by professional journalists with the aim of informing, a tweet can be posted by anyone. This specificity must be taken into account during modeling.
Further analysis covers the frequencies of two-word combinations, so-called bigrams, computed per line:
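A sketch of the per-line bigram counting is given below; it reuses the tokenisation and the `excluded` word set from the frequency snippet above and treats each line independently, so bigrams never cross line boundaries.

```python
import re
from collections import Counter

def bigram_frequencies(lines, top_n=20):
    """Count pairs of adjacent (non-excluded) words within each line."""
    counts = Counter()
    for line in lines:
        tokens = [t for t in re.findall(r"[a-z']+", line.lower()) if t not in excluded]
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(top_n)

top20_bigrams_twitter = bigram_frequencies(twitter_sample)
```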
The initial assumption was that word combinations (bigrams) would repeat more often on Twitter, since its users' vocabulary is not as rich as that of professional writers. In news texts, on the other hand, clichés are not appreciated, which explains the low level of repeated word combinations there.
Forecasting based on the Twitter data seems easier to do; however, the problem of spelling errors is still significant there. Forecasting for the news, on the other hand, will require much more work, including analysis of synonyms. Blogs are somewhere in between.