Background

In partnership with SwiftKey, the Coursera Data Science Capstone project challenges students to apply what they have learned in the data science certification program, along with their own independence, creativity, and initiative, to the development of a data product built around predictive text models. (I bet you could have predicted that last word, “models”…)

The texts used for this project include US blogs, news, and tweets. Note that prior to the exploratory analysis, the texts were cleaned by converting everything to lowercase, removing extra white space, removing special characters, removing offensive words, converting all end-of-sentence characters to periods, converting all numbers to the symbol “#”, and separating the data into sentences. In addition, only the first 25,000 lines of each text source were included for exploratory analysis, yielding a total of 1.4926 × 10^5 lines for analysis (6.4517 × 10^4 from blogs, 5.67 × 10^4 from news, and 2.8039 × 10^4 from tweets).
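Concretely, the cleaning amounts to something like the following base R sketch (the variable names and the `badwords` profanity list are placeholders, not the exact code used):

```r
# Sketch of the cleaning steps described above (base R).
# `raw_lines` holds the first 25,000 lines of one text source;
# `badwords` is a placeholder for whatever profanity list is applied.
clean_text <- function(raw_lines, badwords = character(0)) {
  x <- tolower(raw_lines)               # lowercase everything
  x <- gsub("[0-9]+", "#", x)           # numbers -> "#"
  x <- gsub("[.!?]+", ".", x)           # end-of-sentence characters -> "."
  x <- gsub("[^a-z#. ']", " ", x)       # drop remaining special characters
  if (length(badwords) > 0) {           # remove offensive words
    x <- gsub(paste0("\\b(", paste(badwords, collapse = "|"), ")\\b"), "", x)
  }
  x <- gsub("\\s+", " ", trimws(x))     # collapse extra white space
  sentences <- unlist(strsplit(x, "\\s*\\.\\s*"))  # split into sentences
  sentences[nzchar(sentences)]
}
```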

Objectives

The objectives of this milestone report are to…

Distribution of Word Frequencies

The first step in exploring the data is to simply identify the distribution of word frequencies.

Figure 1 below plots the log10 frequency of words in the corpus. Note that the distribution is highly skewed, suggesting that a relatively small number of unique words accounts for the bulk of all word occurrences.

[Figure 1: distribution of log10 word frequencies (chunk Explore2)]
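The frequencies behind Figure 1 can be computed along the following lines (a sketch; `sentences` is assumed to hold the cleaned sentences from all three sources):

```r
# Tokenize the cleaned sentences on spaces and tabulate word frequencies,
# then plot the distribution of log10 frequencies (Figure 1).
words     <- unlist(strsplit(sentences, " "))
words     <- words[nzchar(words)]
word_freq <- sort(table(words), decreasing = TRUE)

hist(log10(as.numeric(word_freq)), breaks = 50,
     main = "Distribution of word frequencies", xlab = "log10(frequency)")
```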

Figure 2 below makes this point directly: a relatively small number of unique words accounts for a large proportion of all the words in the corpus. In fact, about 145 words account for 50% of all word occurrences. Interestingly, we need about 8,500 words to account for 90% of all words in the corpus.

[Figure 2: cumulative proportion of word occurrences vs. number of unique words (chunk Explore4)]

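These coverage figures come from the cumulative sum of the sorted word counts, roughly:

```r
# How many of the most frequent words cover 50% / 90% of all occurrences?
coverage <- cumsum(word_freq) / sum(word_freq)
min(which(coverage >= 0.5))   # about 145 words for 50% coverage
min(which(coverage >= 0.9))   # about 8,500 words for 90% coverage
```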

Most Frequent Words

The next logical question is: which words are used most frequently? The table below summarizes the top 20 most frequent words. Notice that these tend to be short words of only a few letters: conjunctions, pronouns, prepositions, or conjugations of the verb “to be”.

##  Word   Freq
##   the 108511
##    to  58944
##   and  56481
##     a  52546
##    of  46641
##    in  36370
##     i  31377
##  that  23194
##    is  22801
##   for  22213
##     #  21149
##    it  19057
##    on  16821
##  with  15897
##   you  15610
##   was  14497
##    at  11874
##  this  11642
##    as  11330
##    be  11161
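In terms of the sketch above, this table is just the head of the sorted frequency table:

```r
head(data.frame(Word = names(word_freq), Freq = as.integer(word_freq)), 20)
```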

Distribution of Word Combinations (2-Grams)

Next, we consider the most frequent two-word combinations (i.e., “2-grams”). As with individual words, we can quantify the number and frequency of 2-grams.

Figure 3 below plots the log10 frequency of 2-grams in the corpus. The distribution of 2-grams is even more highly skewed than the distribution of individual word frequencies.

[Figure 3: distribution of log10 2-gram frequencies (chunk Explore6)]
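One way to build the 2-grams (a sketch, not necessarily the exact implementation) is to slide a two-word window over each sentence:

```r
# Build the n-grams of a single sentence by sliding an n-token window.
make_ngrams <- function(sentence, n) {
  tokens <- strsplit(sentence, " ")[[1]]
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

bigrams     <- unlist(lapply(sentences, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
hist(log10(as.numeric(bigram_freq)), breaks = 50,
     main = "Distribution of 2-gram frequencies", xlab = "log10(frequency)")
```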

Most Frequent 2-Grams

The table below summarizes the top 20 most frequent 2-grams. Not surprisingly, they tend to be combinations of short conjunctions, pronouns, prepositions, and conjugations of the verb “to be”.

##          Blogs News Tweets Total
## of the    5197 4459    523 10179
## in the    4263 4357    734  9354
## to the    2417 2076    388  4881
## on the    2044 1747    451  4242
## for the   1574 1691    687  3952
## to be     1870 1187    449  3506
## and the   1635 1354    132  3121
## at the    1282 1484    353  3119
## in a      1219 1222    206  2647
## with the  1171 1032    169  2372
## is a      1315  722    249  2286
## it was    1263  707    228  2198
## from the  1083  958    105  2146
## for a      909  748    286  1943
## with a     922  873    112  1907
## it is     1293  420    171  1884
## i was     1371  268    232  1871
## and i     1352  259    228  1839
## of a       936  772    115  1823
## i have    1281  158    291  1730
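The per-source columns can be reproduced by counting each source against a common list of 2-grams; the sketch below assumes `corpus_by_source` is a named list of cleaned sentence vectors (Blogs, News, Tweets):

```r
# Count each source's 2-grams against the combined 2-gram vocabulary,
# then add a Total column and sort by it.
counts <- sapply(corpus_by_source, function(s) {
  grams <- unlist(lapply(s, make_ngrams, n = 2))
  table(factor(grams, levels = names(bigram_freq)))
})
counts <- cbind(counts, Total = rowSums(counts))
head(counts[order(-counts[, "Total"]), ], 20)
```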

Distribution of Word Triplets (3-Grams)

Next, we consider the most frequent three-word combinations (i.e., “3-grams”).

Figure 4 below plots the log10 frequency of 3-grams in the corpus. The distribution of 3-grams is even more highly skewed than the distributions of individual words and 2-grams.

[Figure 4: distribution of log10 3-gram frequencies (chunk Explore9)]
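The same window helper extends directly to 3-grams:

```r
trigrams     <- unlist(lapply(sentences, make_ngrams, n = 3))
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
```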

Most Frequent 3-Grams

The table below summarizes the top 20 most frequent 3-grams. Again, the most frequent 3-grams tend to be combinations of short conjunctions, pronouns, prepositions, and conjugations of the verb “to be”.

##                Blogs News Tweets Total
## one of the       423  344     52   819
## a lot of         393  282     55   730
## # # #             16  358      4   378
## the end of       176  144     35   355
## as well as       186  153      8   347
## more than #       43  295      5   343
## to be a          167  119     52   338
## going to be      116  133     76   325
## out of the       167  130     26   323
## be able to       166  106     28   300
## it was a         155  111     31   297
## some of the      173  112     11   296
## part of the      152  126     11   289
## i want to        160   46     74   280
## a couple of      162   68     12   242
## the first time   102  107     29   238
## the rest of      130   72     24   226
## thanks for the     0    2    223   225
## there is a       139   61     23   223
## i have to        152   20     46   218

Modeling Strategy

This exploratory analysis suggests the broad outline of a predictive text model: build frequency tables of words, 2-grams, and 3-grams, and exploit the fact that a small number of frequent n-grams accounts for most usage.

While these observations are helpful, I am admittedly unsure how to turn these frequency tables into an actual predictive text model. Suggestions for a modeling strategy are welcome and appreciated!
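As one possible starting point (a sketch, not a finished model), the n-gram tables above already support a simple frequency-based backoff predictor: look up the last two words of a phrase in the 3-gram table, back off to the last word in the 2-gram table, and finally fall back to the single most frequent word. The function below assumes the `word_freq`, `bigram_freq`, and `trigram_freq` tables built earlier:

```r
# Simple backoff prediction: 3-gram -> 2-gram -> most frequent word.
# The *_freq tables are sorted by frequency, so the first regex match
# is also the most frequent continuation.
predict_next <- function(phrase) {
  tokens    <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  last_word <- function(g) sub(".* ", "", g)   # final token of an n-gram

  if (length(tokens) == 2) {
    prefix <- paste0("^", paste(tokens, collapse = " "), " ")
    hits   <- names(trigram_freq)[grepl(prefix, names(trigram_freq))]
    if (length(hits) > 0) return(last_word(hits[1]))
  }
  prefix <- paste0("^", tail(tokens, 1), " ")
  hits   <- names(bigram_freq)[grepl(prefix, names(bigram_freq))]
  if (length(hits) > 0) return(last_word(hits[1]))

  names(word_freq)[1]                          # unconditional fallback
}

predict_next("thanks for")   # likely "the", given the 3-gram table above
```

A fuller model would add smoothing for unseen n-grams and prune rare entries to keep the lookup tables small, but even this crude backoff gives a working baseline.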