This report presents a basic exploratory data analysis of three files: en_US.twitter.txt, en_US.news.txt, and en_US.blogs.txt. Ultimately, we wish to use these to build a predictive text analytics app; the goal here is to get a basic understanding of the data. We look at 1-grams, 2-grams, and 3-grams. We strip all punctuation and normalize by lower-casing everything. This preprocessing, as well as the aggregation of counts by n-gram, was done in Python. The results are then loaded into R for the analysis below.
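The preprocessing code itself is not shown in this report, so the following is only a minimal sketch of what such a step could look like. The function names `tokenize` and `count_ngrams` are hypothetical, and the sketch assumes that "punctuation" means any character other than a lower-case ASCII letter, a digit, or whitespace, that punctuation is deleted rather than replaced by a space (which would turn "can't" into "cant"), and that n-grams do not span line boundaries.

```python
import re
from collections import Counter

# Assumption: everything that is not a-z, 0-9, or whitespace counts as punctuation.
_NON_WORD = re.compile(r"[^a-z0-9\s]")


def tokenize(line):
    """Lower-case a line, delete punctuation, and split on whitespace."""
    return _NON_WORD.sub("", line.lower()).split()


def count_ngrams(lines, n):
    """Count n-grams within each line; tokens are joined with underscores."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # A line with fewer than n tokens contributes no n-grams.
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    sample = ["I can't wait to see you!", "Can't wait."]
    for gram, count in count_ngrams(sample, 2).most_common(3):
        print(gram, count)
```

Counting within lines is at least consistent with the totals reported below, where the total number of 2-grams is smaller than the total number of 1-grams by roughly the number of lines.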
For each of the three data sets and for n = 1, 2, and 3, we report the number of total and distinct n-grams, as well as how many distinct n-grams are needed to cover 50% and 90% of the total n-grams. We also provide (logarithmically scaled) plots of the n-gram distributions, restricted to frequencies above a threshold chosen to give 90% coverage of the total n-grams. Finally, we give the ten most frequent n-grams.
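The coverage figures can be read as follows: sort the distinct n-grams by decreasing count and find how many of the most frequent ones are needed before their counts add up to the target share of all occurrences. A minimal sketch of that calculation, with a hypothetical helper `distinct_needed` (the report's own computation was done in R and is not shown):

```python
from collections import Counter


def distinct_needed(counts, fraction):
    """Smallest number of most-frequent n-grams whose counts sum to
    at least `fraction` of all n-gram occurrences."""
    target = fraction * sum(counts.values())
    running = 0
    for k, c in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += c
        if running >= target:
            return k
    return 0  # only reached for an empty input


if __name__ == "__main__":
    # Toy counts, not taken from the data sets in this report.
    toy = Counter({"the": 50, "to": 30, "and": 15, "a": 5})
    print(distinct_needed(toy, 0.5), distinct_needed(toy, 0.9))
```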
Number of lines = 1010242
distinct 1-grams = 399550
distinct 2-grams = 6761240
distinct 3-grams = 18702509
total 1-grams = 34253633
total 2-grams = 33243404
total 3-grams = 32237106
10 most frequent n-grams
| 1-gram | count | 2-gram | count | 3-gram | count |
| --- | --- | --- | --- | --- | --- |
| the | 1967260 | of_the | 185993 | one_of_the | 14512 |
| to | 900704 | in_the | 176066 | a_lot_of | 11459 |
| and | 883594 | to_the | 83692 | as_well_as | 6233 |
| a | 873408 | on_the | 72710 | part_of_the | 5688 |
| of | 770956 | for_the | 68797 | the_end_of | 5617 |
| in | 673415 | at_the | 57798 | according_to_the | 5601 |
| for | 350846 | and_the | 51799 | out_of_the | 5552 |
| that | 345662 | in_a | 50507 | some_of_the | 5440 |
| is | 283823 | to_be | 46821 | to_be_a | 5363 |
| on | 266482 | with_the | 43438 | in_the_first | 5132 |
## [1] "Distinct 1-grams needed for 50%:" "219"
## [1] "Distinct 1-grams needed for 90%:" "9756"
## [1] "Distinct 2-grams needed for 50%:" "26781"
## [1] "Distinct 2-grams needed for 90%:" "793377"
## [1] "Distinct 3-grams needed for 50%:" "241646"
## [1] "Distinct 3-grams needed for 90%:" "2128053"
Number of lines = 2360148
distinct 1-grams = 544991
distinct 2-grams = 5548330
distinct 3-grams = 13851470
total 1-grams = 29760155
total 2-grams = 27400007
total 3-grams = 25041177
10 most frequent n-grams
| 1-gram | count | 2-gram | count | 3-gram | count |
| --- | --- | --- | --- | --- | --- |
| the | 933427 | in_the | 77991 | thanks_for_the | 23528 |
| to | 786379 | for_the | 73833 | looking_forward_to | 8711 |
| i | 712377 | of_the | 56811 | thank_you_for | 8588 |
| a | 606543 | on_the | 48294 | cant_wait_to | 8244 |
| you | 542155 | to_be | 46841 | i_love_you | 7974 |
| and | 433528 | to_the | 43298 | for_the_follow | 7777 |
| for | 384311 | thanks_for | 42747 | going_to_be | 7393 |
| in | 376726 | at_the | 37100 | i_want_to | 7030 |
| of | 358876 | i_love | 35397 | a_lot_of | 6224 |
| is | 357433 | going_to | 34170 | to_be_a | 5981 |
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "12306"
## [1] "Distinct 2-grams needed for 90%:" "503841"
## [1] "Distinct 3-grams needed for 50%:" "117201"
## [1] "Distinct 3-grams needed for 90%:" "1354424"
Number of lines = 899288
distinct 1-grams = 517843
distinct 2-grams = 6841290
distinct 3-grams = 19576346
total 1-grams = 37214322
total 2-grams = 36315098
total 3-grams = 35427699
10 most frequent n-grams
| 1-gram | count | 2-gram | count | 3-gram | count |
| --- | --- | --- | --- | --- | --- |
| the | 1848597 | of_the | 186623 | one_of_the | 14512 |
| and | 1084582 | in_the | 152515 | a_lot_of | 11459 |
| to | 1064762 | to_the | 85712 | as_well_as | 6233 |
| a | 894545 | on_the | 75010 | part_of_the | 5688 |
| of | 874700 | to_be | 67879 | the_end_of | 5617 |
| i | 762769 | and_the | 58154 | according_to_the | 5601 |
| in | 592243 | for_the | 57839 | out_of_the | 5552 |
| that | 458155 | i_was | 49075 | some_of_the | 5440 |
| is | 430735 | and_i | 48745 | to_be_a | 5363 |
| it | 397611 | i_have | 47439 | in_the_first | 5132 |
## [1] "Distinct 1-grams needed for 50%:" "127"
## [1] "Distinct 1-grams needed for 90%:" "6252"
## [1] "Distinct 2-grams needed for 50%:" "16041"
## [1] "Distinct 2-grams needed for 90%:" "636227"
## [1] "Distinct 3-grams needed for 50%:" "176523"
## [1] "Distinct 3-grams needed for 90%:" "1965105"
In all cases, a few n-grams show up extremely frequently, and the distribution then drops off quickly toward zero. The distribution is narrower for 2-grams than for 1-grams, and narrower still for 3-grams; that is, the frequencies with which individual n-grams occur are smaller for larger n. This is also why the number of distinct n-grams grows so much larger for n = 2 and n = 3.
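One rough way to see this narrowing is to divide the total number of n-grams by the number of distinct n-grams, which gives the average number of times each distinct n-gram occurs. The sketch below does this for the first data set reported above, using only the totals and distinct counts listed there; the helper name `mean_occurrences` is hypothetical. The averages come out at roughly 86 for 1-grams, 4.9 for 2-grams, and 1.7 for 3-grams.

```python
# Totals and distinct counts copied from the first data set reported above.
total = {1: 34253633, 2: 33243404, 3: 32237106}
distinct = {1: 399550, 2: 6761240, 3: 18702509}


def mean_occurrences(total, distinct):
    """Average number of occurrences per distinct n-gram, keyed by n."""
    return {n: total[n] / distinct[n] for n in total}


means = mean_occurrences(total, distinct)

if __name__ == "__main__":
    for n, m in sorted(means.items()):
        print(f"{n}-grams: {m:.2f} occurrences per distinct n-gram on average")
```

An average hides the heavy skew described above, so it complements the 50% and 90% coverage numbers rather than replacing them.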
This suggests that incorporating 2-gram information into the model should improve on a purely 1-gram (i.e., word count) approach, and that 3-grams should improve on it further. The 50% and 90% coverage numbers will be useful in that they indicate a subset of the data that we can opt to work with to increase