In partnership with SwiftKey, the Coursera Data Science Capstone project challenges students to apply what they have learned in the data science certification program, along with their independence, creativity, and initiative, to the development of a data product highlighting predictive text models. (I bet you could have predicted that last word “models”…).
The texts used for this project include US blogs, news, and tweets. Note that prior to the exploratory analysis, the texts were cleaned by converting everything to lowercase, removing extra white space, removing special characters, removing offensive words, converting all end-of-sentence characters to periods, converting all numbers to the symbol “#”, and separating the data into sentences. In addition, only the first 25,000 lines of each text source were included for exploratory analysis, yielding a total of 1.4926 × 10^5 lines for analysis (6.4517 × 10^4 from blogs, 5.67 × 10^4 from news, and 2.8039 × 10^4 from tweets).
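A minimal sketch of this cleaning pipeline in base R is shown below. The file names, the profanity list, and the exact order of the substitutions are placeholders standing in for the actual values and steps used, not a record of the real code.

```r
# Illustrative cleaning pipeline -- file names and the profanity list are placeholders.
profanity <- c("badword1", "badword2")   # stand-in for the actual offensive-word list

clean_text <- function(path, bad_words, n_lines = 25000) {
  raw <- readLines(path, n = n_lines, encoding = "UTF-8", skipNul = TRUE)
  txt <- tolower(raw)                              # lowercase everything
  txt <- gsub("[.!?]+", ".", txt)                  # unify end-of-sentence characters
  txt <- gsub("[0-9]+", "#", txt)                  # replace numbers with "#"
  txt <- gsub("[^a-z#. ']", " ", txt)              # drop remaining special characters
  txt <- gsub("\\s+", " ", trimws(txt))            # collapse extra white space
  sentences <- unlist(strsplit(txt, "\\. *"))      # separate into sentences
  keep <- !grepl(paste(bad_words, collapse = "|"), sentences)  # drop offensive sentences
  sentences[keep & nchar(sentences) > 0]
}

blogs  <- clean_text("en_US.blogs.txt",   profanity)
news   <- clean_text("en_US.news.txt",    profanity)
tweets <- clean_text("en_US.twitter.txt", profanity)
```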
The objectives of this milestone report are to…
The first step in exploring the data is to simply identify the distribution of word frequencies.
Figure 1 below plots the Log10 frequency of words in the corpus. Note that the distribution is highly skewed, further suggesting that…
Figure 2 (below) further summarizes the idea that a relatively small number of unique words account for a large proportion of all the words in the corpus. In fact, about 145 words account for 50% of all the words in the corpus. Interestingly, we need about 8500 words to account for 90% of all words in the corpus.
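For reference, the 50% and 90% coverage figures quoted above can be computed directly from the sorted word-frequency table; a short sketch, building on the cleaned `blogs`, `news`, and `tweets` vectors above, is:

```r
# Cumulative coverage of the corpus by the most frequent words.
words    <- unlist(strsplit(c(blogs, news, tweets), " ", fixed = TRUE))
freq     <- sort(table(words), decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)

hist(log10(as.numeric(freq)), breaks = 50,
     xlab = "log10(word frequency)", main = "")   # roughly reproduces Figure 1

which(coverage >= 0.5)[1]   # ~145 words cover 50% of the corpus
which(coverage >= 0.9)[1]   # ~8500 words cover 90% of the corpus
```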
The next logical question is: “which words are used most frequently?” The table below summarizes the top 20 most frequent words. Notice that these frequent words tend to be short conjunctions, pronouns, prepositions, or conjugations of the verb “to be”.
## Word Freq
## the 108511
## to 58944
## and 56481
## a 52546
## of 46641
## in 36370
## i 31377
## that 23194
## is 22801
## for 22213
## # 21149
## it 19057
## on 16821
## with 15897
## you 15610
## was 14497
## at 11874
## this 11642
## as 11330
## be 11161
Next, we consider the most frequent two-word combinations (i.e., “2-grams”). As with individual words, we can quantify the number and frequency of 2-grams.
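One way to tabulate these counts in base R (a sketch, again building on the cleaned sentence vectors above; the actual tokenization used may differ) is to slide a window across each sentence and count the pasted results:

```r
# Count n-grams in a vector of cleaned sentences (illustrative helper).
ngram_counts <- function(sentences, n = 2) {
  grams <- unlist(lapply(strsplit(sentences, " ", fixed = TRUE), function(tokens) {
    if (length(tokens) < n) return(character(0))
    # slide a window of width n across the token vector
    sapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    })
  }))
  sort(table(grams), decreasing = TRUE)
}

bigrams <- ngram_counts(c(blogs, news, tweets), n = 2)
head(bigrams, 20)
```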
Figure 3 below plots the Log10 frequency of 2-grams in the corpus. The distribution of 2-grams is even more highly skewed than the distribution of individual word frequencies.
The table below summarizes the top 20 most frequent 2-grams. Not surprisingly, the most frequent 2-grams tend to be combinations of small conjunctions, pronouns, prepositions, and/or conjugations of the verb “to be”.
## 2-gram Blogs News Tweets Total
## of the 5197 4459 523 10179
## in the 4263 4357 734 9354
## to the 2417 2076 388 4881
## on the 2044 1747 451 4242
## for the 1574 1691 687 3952
## to be 1870 1187 449 3506
## and the 1635 1354 132 3121
## at the 1282 1484 353 3119
## in a 1219 1222 206 2647
## with the 1171 1032 169 2372
## is a 1315 722 249 2286
## it was 1263 707 228 2198
## from the 1083 958 105 2146
## for a 909 748 286 1943
## with a 922 873 112 1907
## it is 1293 420 171 1884
## i was 1371 268 232 1871
## and i 1352 259 228 1839
## of a 936 772 115 1823
## i have 1281 158 291 1730
Next, we consider the most frequent three-word combinations (i.e., “3-grams”).
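The same `ngram_counts` helper sketched above extends directly to 3-grams:

```r
trigrams <- ngram_counts(c(blogs, news, tweets), n = 3)
head(trigrams, 20)
```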
Figure 4 below plots the Log10 frequency of 3-grams in the corpus. The distribution of 3-grams is even more highly skewed than the distribution of individual word frequencies and 2-grams.
The table below summarizes the top 20 most frequent 3-grams. Again, the most frequent 3-grams tend to be combinations of small conjunctions, pronouns, prepositions, and/or conjugations of the verb “to be”.
## 3-gram Blogs News Tweets Total
## one of the 423 344 52 819
## a lot of 393 282 55 730
## # # # 16 358 4 378
## the end of 176 144 35 355
## as well as 186 153 8 347
## more than # 43 295 5 343
## to be a 167 119 52 338
## going to be 116 133 76 325
## out of the 167 130 26 323
## be able to 166 106 28 300
## it was a 155 111 31 297
## some of the 173 112 11 296
## part of the 152 126 11 289
## i want to 160 46 74 280
## a couple of 162 68 12 242
## the first time 102 107 29 238
## the rest of 130 72 24 226
## thanks for the 0 2 223 225
## there is a 139 61 23 223
## i have to 152 20 46 218
This exploratory analysis suggests that the following strategies will be needed to develop a predictive text model:
While these steps may be helpful, I am admittedly struggling with how to actually develop a predictive text model from this information. Suggestions for a modeling strategy are welcome and appreciated!
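One direction I am considering (purely a sketch at this point, not a finished strategy) is to use the 2-gram and 3-gram tables themselves as lookup tables: given the last one or two words typed, return the most frequent continuation, backing off from 3-grams to 2-grams when the longer context has not been seen.

```r
# Naive next-word lookup over the n-gram tables from the exploration above,
# backing off from 3-grams to 2-grams when the longer context is unseen.
predict_next <- function(context, trigrams, bigrams) {
  tokens <- unlist(strsplit(tolower(context), " ", fixed = TRUE))
  if (length(tokens) >= 2) {
    prefix  <- paste(tail(tokens, 2), collapse = " ")
    matches <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    if (length(matches) > 0) {
      return(sub(".* ", "", names(matches)[1]))   # last word of the best-matching 3-gram
    }
  }
  prefix  <- tail(tokens, 1)
  matches <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
  if (length(matches) > 0) return(sub(".* ", "", names(matches)[1]))
  "the"   # fall back to the most frequent single word
}

predict_next("one of", trigrams, bigrams)   # likely "the", given the 3-gram table above
```

This ignores smoothing, pruning, and memory constraints entirely, so it is at best a starting point.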