Synopsis

Data set summary

The following summary describes the data from the English news, blogs, and Twitter sources. Each pct_ column gives a source's share of the corresponding column total.

f_names    f_size  f_lines     n_char   n_words  pct_n_char  pct_lines  pct_words
blogs    200.4242   899288  208361438  37334131        0.54       0.27       0.53
news     196.2775    77259   15683765   2643969        0.04       0.02       0.04
twitter  159.3641  2360148  162384825  30373543        0.42       0.71       0.43
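The percentage columns can be reproduced from the raw counts. The sketch below is a minimal illustration, not the report's own code: the helper name `share_of_total` is hypothetical, and it assumes each pct_ value is the source's count divided by the column total, rounded to two decimals.

```python
def share_of_total(counts):
    """Return each source's fraction of the column total, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(value / total, 2) for name, value in counts.items()}


# Counts copied from the summary table above.
f_lines = {"blogs": 899288, "news": 77259, "twitter": 2360148}
n_char = {"blogs": 208361438, "news": 15683765, "twitter": 162384825}
n_words = {"blogs": 37334131, "news": 2643969, "twitter": 30373543}

# Assumption: pct_* = count / column total.
pct_lines = share_of_total(f_lines)
pct_n_char = share_of_total(n_char)
pct_words = share_of_total(n_words)

for name in f_lines:
    print(name, pct_n_char[name], pct_lines[name], pct_words[name])
```

Under that assumption the computed shares agree with the table: Twitter contributes most of the lines, while blogs contribute just over half of the characters and words.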

Uni-gram models

Word cloud depicting the uni-grams

Uni-gram models by source

The three sources are news, blogs, and Twitter.

Uni-gram Distributions

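The uni-gram distribution is plotted by relative frequency, that is, each uni-gram's count divided by the total number of uni-grams. The same kind of plot is produced for each set of n-grams: bi-grams, tri-grams, and quad-grams.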
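As a rough illustration of how such relative frequencies might be computed, the sketch below counts n-grams in a tokenized text and divides each count by the total. The function names `ngrams` and `relative_frequencies`, the whitespace tokenizer, and the sample sentence are all assumptions made for this example; the report does not show its own tokenization or counting code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(tokens, n):
    """Map each n-gram to its count divided by the total n-gram count."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    # most_common() orders the result from most to least frequent.
    return {gram: count / total for gram, count in counts.most_common()}


# Hypothetical sample text; a real run would use the sampled corpus.
tokens = "the end of the day is the end of the week".lower().split()

unigram_freq = relative_frequencies(tokens, 1)
bigram_freq = relative_frequencies(tokens, 2)

print(list(unigram_freq.items())[:3])
print(list(bigram_freq.items())[:3])
```

The same function serves every order of n-gram by changing `n`, which is why one plotting routine can be reused for the uni-, bi-, tri-, and quad-gram distributions.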

Bi-gram Distribution

Tri-gram Distribution

Quad-gram Distribution

N-gram Prediction Model

word1  word2  word3  word4    n  proportion   coverage
the    end    of     the    497    8.00e-05  0.0000800
the    rest   of     the    454    7.31e-05  0.0001531
at     the    end    of     405    6.52e-05  0.0002183
for    the    first  time   397    6.39e-05  0.0002822
thank  you    for    the    359    5.78e-05  0.0003401
is     going  to     be     358    5.76e-05  0.0003977
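The report does not describe how the table is used for prediction, so the following is only a sketch under stated assumptions: the next word is taken to be the most frequent word4 whose first three words match the last three words typed, and coverage is taken to be the running sum of the proportion column. The function name `predict_next` is hypothetical.

```python
from itertools import accumulate

# Rows copied from the quad-gram table above: (word1, word2, word3, word4, n).
quadgrams = [
    ("the", "end", "of", "the", 497),
    ("the", "rest", "of", "the", 454),
    ("at", "the", "end", "of", 405),
    ("for", "the", "first", "time", 397),
    ("thank", "you", "for", "the", 359),
    ("is", "going", "to", "be", 358),
]


def predict_next(history, table):
    """Return the most frequent word4 after the last three words, or None."""
    context = tuple(history[-3:])
    candidates = [
        (n, w4) for (w1, w2, w3, w4, n) in table if (w1, w2, w3) == context
    ]
    return max(candidates)[1] if candidates else None


# Assumption: coverage is the cumulative sum of the proportion column.
proportions = [8.00e-05, 7.31e-05, 6.52e-05, 6.39e-05, 5.78e-05, 5.76e-05]
coverage = list(accumulate(proportions))

print(predict_next("see you at the end".split(), quadgrams))
print(coverage)
```

With the rounded proportions shown in the table, the running sum matches the coverage column up to rounding in the last digit. A fuller model would also need a fallback to tri-grams and bi-grams when no quad-gram matches, which this sketch signals by returning `None`.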

Conclusion