Introduction

In the following we perform some basic exploratory data analysis on the files en_US.twitter.txt, en_US.news.txt, and en_US.blogs.txt. Ultimately, we wish to use these to build a predictive text analytics app; this report begins by developing a basic understanding of the data. We look at 1-grams, 2-grams, and 3-grams, stripping all punctuation and normalizing the text by lower-casing everything. This preprocessing, as well as aggregating counts by n-gram, was done in Python; the results are then loaded into R for the analysis below.
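
The preprocessing scripts themselves are not reproduced in this report. As a rough sketch, the Python logic for lower-casing, stripping punctuation, and aggregating counts per n-gram might look like the following (the function names and file handling here are illustrative, not the actual code used):

    import re
    from collections import Counter

    def tokenize(line):
        # Lower-case and drop everything except letters, digits, and whitespace.
        return re.sub(r"[^a-z0-9\s]", "", line.lower()).split()

    def ngram_counts(path, n):
        # Aggregate counts of n-grams, joined with "_" as in the tables below.
        counts = Counter()
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                tokens = tokenize(line)
                counts.update("_".join(tokens[i:i + n])
                              for i in range(len(tokens) - n + 1))
        return counts

    # Example: aggregate 2-gram counts for the Twitter file.
    # bigrams = ngram_counts("en_US.twitter.txt", 2)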

Counts

We now report the total and distinct n-gram counts, together with how many distinct n-grams are needed to cover 50% and 90% of all n-gram occurrences. We also provide logarithmically scaled plots of the n-gram frequency distributions, restricted to frequencies above a threshold chosen to give 90% total n-gram coverage. This is done for n = 1, 2, and 3 and for each of the three data sets. Finally, we list the ten most frequent n-grams.
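
The coverage figures below come from sorting the n-gram counts in descending order and accumulating them until the desired fraction of all occurrences is reached. The actual computation was done in R, but the logic is simple; here is a minimal sketch in Python, reusing the hypothetical Counter produced by the preprocessing sketch above:

    def distinct_needed(counts, fraction):
        # How many of the most frequent n-grams are needed to cover
        # `fraction` of all n-gram occurrences.
        total = sum(counts.values())
        running = 0
        for i, c in enumerate(sorted(counts.values(), reverse=True), start=1):
            running += c
            if running >= fraction * total:
                return i
        return len(counts)

    # Example (hypothetical): distinct_needed(bigrams, 0.5), distinct_needed(bigrams, 0.9)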

News

Number of lines = 1010242

distinct 1-grams = 399550
distinct 2-grams = 6761240
distinct 3-grams = 18702509

total 1-grams = 34253633
total 2-grams = 33243404
total 3-grams = 32237106

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the      1967260   of_the       185993   one_of_the           14512
to        900704   in_the       176066   a_lot_of             11459
and       883594   to_the        83692   as_well_as            6233
a         873408   on_the        72710   part_of_the           5688
of        770956   for_the       68797   the_end_of            5617
in        673415   at_the        57798   according_to_the      5601
for       350846   and_the       51799   out_of_the            5552
that      345662   in_a          50507   some_of_the           5440
is        283823   to_be         46821   to_be_a               5363
on        266482   with_the      43438   in_the_first          5132
## [1] "Distinct 1-grams needed for 50%:" "219"
## [1] "Distinct 1-grams needed for 90%:" "9756"
## [1] "Distinct 2-grams needed for 50%:" "26781"
## [1] "Distinct 2-grams needed for 90%:" "793377"
## [1] "Distinct 3-grams needed for 50%:" "241646"
## [1] "Distinct 3-grams needed for 90%:" "2128053"

Twitter

Number of lines = 2360148

distinct 1-grams = 544991
distinct 2-grams = 5548330
distinct 3-grams = 13851470

total 1-grams = 29760155
total 2-grams = 27400007
total 3-grams = 25041177

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the       933427   in_the        77991   thanks_for_the       23528
to        786379   for_the       73833   looking_forward_to    8711
i         712377   of_the        56811   thank_you_for         8588
a         606543   on_the        48294   cant_wait_to          8244
you       542155   to_be         46841   i_love_you            7974
and       433528   to_the        43298   for_the_follow        7777
for       384311   thanks_for    42747   going_to_be           7393
in        376726   at_the        37100   i_want_to             7030
of        358876   i_love        35397   a_lot_of              6224
is        357433   going_to      34170   to_be_a               5981
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "12306"
## [1] "Distinct 2-grams needed for 90%:" "503841"
## [1] "Distinct 3-grams needed for 50%:" "117201"
## [1] "Distinct 3-grams needed for 90%:" "1354424"

Blogs

Number of lines = 899288

distinct 1-grams = 517843
distinct 2-grams = 6841290
distinct 3-grams = 19576346

total 1-grams = 37214322
total 2-grams = 36315098
total 3-grams = 35427699

10 most frequent n-grams

1-gram     count   2-gram        count   3-gram               count
the      1848597   of_the       186623   one_of_the           14512
and      1084582   in_the       152515   a_lot_of             11459
to       1064762   to_the        85712   as_well_as            6233
a         894545   on_the        75010   part_of_the           5688
of        874700   to_be         67879   the_end_of            5617
i         762769   and_the       58154   according_to_the      5601
in        592243   for_the       57839   out_of_the            5552
that      458155   i_was         49075   some_of_the           5440
is        430735   and_i         48745   to_be_a               5363
it        397611   i_have        47439   in_the_first          5132
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "16041"
## [1] "Distinct 2-grams needed for 90%:" "636227"
## [1] "Distinct 3-grams needed for 50%:" "176523"
## [1] "Distinct 3-grams needed for 90%:" "1965105"

Conclusion

In all cases, a small number of n-grams occur extremely frequently, and the frequency distribution falls off quickly toward a long tail of rare n-grams. The distribution is concentrated at lower frequencies for 2-grams than for 1-grams, and even more so for 3-grams; that is, individual n-grams occur less often for larger n. This is also why the number of distinct n-grams is so much larger for n = 2 and n = 3.

This indicates that incorporating 2-gram context into the model should outperform a purely 1-gram (i.e., word-count) approach, and 3-grams should do better still. The 50% and 90% coverage numbers will be useful in that they identify a subset of the data that we can opt to work with to increase efficiency.