In partnership with SwiftKey, the Coursera Data Science Capstone project challenges students to apply what they have learned in the data science certification program, along with their independence, creativity, and initiative, to the development of a data product highlighting predictive text models. (I bet you could have predicted that last word “models”…).
The texts used for this project include US blogs, news, and tweets. Note that prior to the exploratory analysis, the texts were cleaned by converting everything to lowercase, removing extra white space, removing special characters, removing offensive words, converting all end-of-sentence characters to periods, converting all numbers to the symbol “#”, and separating the data into sentences. In addition, only the first 25,000 lines of each text source were included for exploratory analysis, yielding a total of 1.4926 × 10^5 lines for analysis (6.4517 × 10^4 from blogs, 5.67 × 10^4 from news, and 2.8039 × 10^4 from tweets).
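A minimal sketch of this cleaning pipeline in base R is shown below. The file names, the profanity list, and the exact order of the substitutions are placeholders standing in for the actual values and steps used, not a record of the real code.

```r
# Illustrative cleaning pipeline -- file names and the profanity list are placeholders.
profanity <- c("badword1", "badword2")   # stand-in for the actual offensive-word list

clean_text <- function(path, bad_words, n_lines = 25000) {
  raw <- readLines(path, n = n_lines, encoding = "UTF-8", skipNul = TRUE)
  txt <- tolower(raw)                              # lowercase everything
  txt <- gsub("[.!?]+", ".", txt)                  # unify end-of-sentence characters
  txt <- gsub("[0-9]+", "#", txt)                  # replace numbers with "#"
  txt <- gsub("[^a-z#. ']", " ", txt)              # drop remaining special characters
  txt <- gsub("\\s+", " ", trimws(txt))            # collapse extra white space
  sentences <- unlist(strsplit(txt, "\\. *"))      # separate into sentences
  keep <- !grepl(paste(bad_words, collapse = "|"), sentences)  # drop offensive sentences
  sentences[keep & nchar(sentences) > 0]
}

blogs  <- clean_text("en_US.blogs.txt",   profanity)
news   <- clean_text("en_US.news.txt",    profanity)
tweets <- clean_text("en_US.twitter.txt", profanity)
```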
The objectives of this milestone report are to…
The first step in exploring the data is to simply identify the distribution of word frequencies.
Figure 1 below plots the Log10 frequency of words in the corpus. Note that the distribution is highly skewed, further suggesting that…
Figure 2 (below) further summarizes the idea that a relatively small number of unique words account for a large proportion of all the words in the corpus. In fact, about 145 words account for 50% of all the words in the corpus. Interestingly, we need about 8500 words to account for 90% of all words in the corpus.
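For reference, the 50% and 90% coverage figures quoted above can be computed directly from the sorted word-frequency table; a short sketch, building on the cleaned `blogs`, `news`, and `tweets` vectors above, is:

```r
# Cumulative coverage of the corpus by the most frequent words.
words    <- unlist(strsplit(c(blogs, news, tweets), " ", fixed = TRUE))
freq     <- sort(table(words), decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)

hist(log10(as.numeric(freq)), breaks = 50,
     xlab = "log10(word frequency)", main = "")   # roughly reproduces Figure 1

which(coverage >= 0.5)[1]   # ~145 words cover 50% of the corpus
which(coverage >= 0.9)[1]   # ~8500 words cover 90% of the corpus
```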
The next logical question is: “which words are used most frequently?” The table below summarizes the top 20 most frequent words. Notice that these frequent words tend to be short conjunctions, pronouns, prepositions, or conjugations of the verb “to be”.
## Word Freq
## the 108511
## to 58944
## and 56481
## a 52546
## of 46641
## in 36370
## i 31377
## that 23194
## is 22801
## for 22213
## # 21149
## it 19057
## on 16821
## with 15897
## you 15610
## was 14497
## at 11874
## this 11642
## as 11330
## be 11161
Next, we consider the most frequent two-word combinations (i.e., “2-grams”). As with individual words, we can quantify the number and frequency of 2-grams.
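One way to tabulate these counts in base R (a sketch, again building on the cleaned sentence vectors above; the actual tokenization used may differ) is to slide a window across each sentence and count the pasted results:

```r
# Count n-grams in a vector of cleaned sentences (illustrative helper).
ngram_counts <- function(sentences, n = 2) {
  grams <- unlist(lapply(strsplit(sentences, " ", fixed = TRUE), function(tokens) {
    if (length(tokens) < n) return(character(0))
    # slide a window of width n across the token vector
    sapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    })
  }))
  sort(table(grams), decreasing = TRUE)
}

bigrams <- ngram_counts(c(blogs, news, tweets), n = 2)
head(bigrams, 20)
```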
Figure 3 below plots the Log10 frequency of 2-grams in the corpus. The distribution of 2-grams is even more highly skewed than the distribution of individual word frequencies.
The table below summarizes the top 20 most frequent 2-grams. Not surprisingly, the most frequent 2-grams tend to be combinations of small conjunctions, pronouns, prepositions, and/or conjugations of the verb “to be”.
## 2-gram Blogs News Tweets Total
## of the 5197 4459 523 10179
## in the 4263 4357 734 9354
## to the 2417 2076 388 4881
## on the 2044 1747 451 4242
## for the 1574 1691 687 3952
## to be 1870 1187 449 3506
## and the 1635 1354 132 3121
## at the 1282 1484 353 3119
## in a 1219 1222 206 2647
## with the 1171 1032 169 2372
## is a 1315 722 249 2286
## it was 1263 707 228 2198
## from the 1083 958 105 2146
## for a 909 748 286 1943
## with a 922 873 112 1907
## it is 1293 420 171 1884
## i was 1371 268 232 1871
## and i 1352 259 228 1839
## of a 936 772 115 1823
## i have 1281 158 291 1730
Next, we consider the most frequent three-word combinations (i.e., “3-grams”).
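The same `ngram_counts` helper sketched above extends directly to 3-grams:

```r
trigrams <- ngram_counts(c(blogs, news, tweets), n = 3)
head(trigrams, 20)
```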
Figure 4 below plots the Log10 frequency of 3-grams in the corpus. The distribution of 3-grams is even more highly skewed than the distribution of individual word frequencies and 2-grams.
The table below summarizes the top 20 most frequent 3-grams. Again, the most frequent 3-grams tend to be combinations of small conjunctions, pronouns, prepositions, and/or conjugations of the verb “to be”.
## 3-gram Blogs News Tweets Total
## one of the 423 344 52 819
## a lot of 393 282 55 730
## # # # 16 358 4 378
## the end of 176 144 35 355
## as well as 186 153 8 347
## more than # 43 295 5 343
## to be a 167 119 52 338
## going to be 116 133 76 325
## out of the 167 130 26 323
## be able to 166 106 28 300
## it was a 155 111 31 297
## some of the 173 112 11 296
## part of the 152 126 11 289
## i want to 160 46 74 280
## a couple of 162 68 12 242
## the first time 102 107 29 238
## the rest of 130 72 24 226
## thanks for the 0 2 223 225
## there is a 139 61 23 223
## i have to 152 20 46 218
This exploratory analysis suggests that the following strategies will be needed to develop a predictive text model:
While these steps may be helpful, I am admittedly struggling with how to actually develop a predictive text model from this information. Suggestions for a modeling strategy are welcome and appreciated!
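One direction I am considering (purely a sketch at this point, not a finished strategy) is to use the 2-gram and 3-gram tables themselves as lookup tables: given the last one or two words typed, return the most frequent continuation, backing off from 3-grams to 2-grams when the longer context has not been seen.

```r
# Naive next-word lookup over the n-gram tables from the exploration above,
# backing off from 3-grams to 2-grams when the longer context is unseen.
predict_next <- function(context, trigrams, bigrams) {
  tokens <- unlist(strsplit(tolower(context), " ", fixed = TRUE))
  if (length(tokens) >= 2) {
    prefix  <- paste(tail(tokens, 2), collapse = " ")
    matches <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    if (length(matches) > 0) {
      return(sub(".* ", "", names(matches)[1]))   # last word of the best-matching 3-gram
    }
  }
  prefix  <- tail(tokens, 1)
  matches <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
  if (length(matches) > 0) return(sub(".* ", "", names(matches)[1]))
  "the"   # fall back to the most frequent single word
}

predict_next("one of", trigrams, bigrams)   # likely "the", given the 3-gram table above
```

This ignores smoothing, pruning, and memory constraints entirely, so it is at best a starting point.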