This is the first milestone report for the Data Science Specialization Capstone. It describes the exploratory data analysis of the Coursera-SwiftKey English-language text files, the cleaning and tokenization process, and the plan for building a prediction model. The analysis code is deliberately not included. All analysis was done in R, mainly with the following packages: stringi/stringr, quanteda, and data.table.
Although the provided download also contains text in German, Finnish and Russian, only the English-language files are considered here. The table below shows the number of lines and the number of words in each unprocessed corpus.
##   text_type number_lines number_words
## 1      news      1010242     34762395
## 2      blog       899288     37546239
## 3   twitter      2360148     30093413
Each corpus (blogs, news, and twitter) was split into six chunks, with one chunk per corpus held out as a test set. This keeps at least 15% of the text for testing model accuracy (one sixth is about 17%) and keeps the n-gram tables and document-feature matrices (dfms) manageable within my computer’s limited RAM (6 GB). Three dice rolls were simulated to decide which chunk of each corpus would be held out.
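The split described above can be sketched as follows. This is only an illustration in Python, not the R code used for the report. It assumes contiguous chunks (the report does not say whether lines were shuffled first), and the function names `split_into_chunks` and `hold_out_chunk` are hypothetical.

```python
import random

def split_into_chunks(lines, n_chunks=6):
    """Split lines into n_chunks contiguous parts of near-equal size."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one leftover line.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks

def hold_out_chunk(lines, n_chunks=6, rng=None):
    """Pick one chunk by a simulated die roll; return (train, test, roll)."""
    rng = rng or random.Random()
    chunks = split_into_chunks(lines, n_chunks)
    roll = rng.randint(1, n_chunks)  # a fair die: 1..n_chunks inclusive
    test = chunks[roll - 1]
    train = [line for i, chunk in enumerate(chunks)
             if i != roll - 1 for line in chunk]
    return train, test, roll

# Toy corpus standing in for one of the text files.
corpus = [f"line {i}" for i in range(20)]
train, test, roll = hold_out_chunk(corpus, rng=random.Random(1))
print(roll, len(train), len(test))
```

Calling `hold_out_chunk` once per corpus corresponds to the three dice rolls mentioned above.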
A number of cleaning steps were taken. In each case, the effects of these cleaning steps were first tested on a very small sample of text to ensure that they did not have unforeseen consequences.
The clean data was then tokenized into words and n-grams of up to five words. Skip-grams will not be used in the model. Tokenization was done with the quanteda package, with most of the options for removing elements such as URLs or numbers enabled.
Token compounding was also performed to join tokens that belong together, such as San Francisco or Los Angeles. These pairs were found by examining bigram frequencies, and later trigram frequencies, looking specifically for words that are preceded or followed by a very limited set of words a high proportion of the time.
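To make the tokenization step concrete, here is a minimal Python sketch of word and n-gram counting. It is not the quanteda pipeline used for the report: the regular expressions, the lowercasing, and the helper names `tokenize`, `ngrams` and `count_ngrams` are assumptions chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop URLs, and keep alphabetic words (apostrophes allowed)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text.lower())
    # Digits and punctuation are skipped because only letters are matched.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=5):
    """Count n-grams for n = 1..max_n, never crossing line boundaries."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(tokens, n))
    return counts

counts = count_ngrams(["Visit http://example.com in 2016, it's great",
                       "It's great to visit"])
print(counts[2].most_common(3))
```

Counting per line keeps n-grams from spanning two unrelated tweets or sentences, which is one reasonable design choice among several.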
Based on the patterns identified, a bigger list of tokens to compound was made by looking at bigrams that started or ended with specific words, for example bigrams starting with North or New, or ending with Bay or Island.
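The filtering idea can be sketched in Python as below. The column names are borrowed from the report's bigram table; their meaning is inferred from its values (for "united states", 279/447 ≈ 0.624 and 279/608 ≈ 0.459). The thresholds `min_freq` and `min_ratio` are hypothetical, as is the function `compound_candidates`; the report does not state the cut-offs that were used.

```python
from collections import Counter

def bigram_ratios(tokens):
    """One row per bigram with the statistics used for filtering."""
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    first_freq, second_freq = Counter(), Counter()
    for (w1, w2), f in bigram_freq.items():
        first_freq[w1] += f   # how often w1 opens a bigram
        second_freq[w2] += f  # how often w2 closes a bigram
    rows = []
    for (w1, w2), f in bigram_freq.items():
        rows.append({
            "V1": w1, "V2": w2, "freqvector": f,
            "fv1": first_freq[w1], "fv2": second_freq[w2],
            "followratio": f / first_freq[w1],     # share of w1 followed by w2
            "precederatio": f / second_freq[w2],   # share of w2 preceded by w1
        })
    return rows

def compound_candidates(rows, min_freq=2, min_ratio=0.5):
    """Bigrams that are frequent and tightly bound in either direction."""
    return [(r["V1"], r["V2"]) for r in rows
            if r["freqvector"] >= min_freq
            and max(r["followratio"], r["precederatio"]) >= min_ratio]

tokens = "los angeles is big and los angeles is far from san francisco".split()
rows = bigram_ratios(tokens)
candidates = compound_candidates(rows)
print(candidates)
```

On this toy input "angeles is" also passes the filter, which shows why such ratios can only shortlist candidates and a manual decision is still needed.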
##          V1        V2 freqvector fv1 fv2 followratio precederatio
##  1:  united    states        279 447 608   0.6241611    0.4588816
##  2:     ice     cream        209 474 724   0.4409283    0.2886740
##  3:   olive       oil        167 222 743   0.7522523    0.2247645
##  4:   jesus    christ        162 795 514   0.2037736    0.3151751
##  5:   south    africa        158 727 296   0.2173315    0.5337838
##  6:     san francisco         99 249 105   0.3975904    0.9428571
##  7:   prime  minister         93 162 206   0.5740741    0.4514563
##  8:     los   angeles         93 112  97   0.8303571    0.9587629
##  9:  amazon        eu         76 292 163   0.2602740    0.4662577
## 10:  highly recommend         71 354 305   0.2005650    0.2327869
## 11:   lemon     juice         70 270 265   0.2592593    0.2641509
## 12:    blah      blah         63  84 105   0.7500000    0.6000000
## 13: chicago  illinois         60 289 115   0.2076125    0.5217391
## 14:    pale       ale         60 201 299   0.2985075    0.2006689
## 15:  martha   stewart         55  98  85   0.5612245    0.6470588
## 16:     tim     holtz         46 142  49   0.3239437    0.9387755
## 17:      ha        ha         46  93 114   0.4946237    0.4035088
## 18:    hong      kong         46  56  65   0.8214286    0.7076923
## 19:   harry    potter         46 117  63   0.3931624    0.7301587
## 20:      et        al         45  70 143   0.6428571    0.3146853
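The table above shows 20 of the most commonly co-occurring bigrams before the tokens were compounded. The numeric columns are the statistics used to filter the data so that co-occurrences could be found rapidly: freqvector is the count of the bigram, fv1 and fv2 appear to be the counts of the first word in first position and of the second word in second position, followratio is freqvector divided by fv1, and precederatio is freqvector divided by fv2. For "united states", for example, 279/447 ≈ 0.624 and 279/608 ≈ 0.459. Here’s a summary of the logic behind the decision to compound or not:
The most common words are almost always the same regardless of the source of the text: the top 20 words of the three corpora together comprise only 27 distinct words.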
Below are the 13 words that appear in the top 20 in each corpus:
## [1] "the" "to" "a" "and" "for" "in" "of" "is" "it" "on"
## [11] "that" "be" "with"
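The comparison of top-20 lists comes down to a set union and a set intersection. The Python sketch below illustrates this on toy data; the helper names `top_words` and `compare_top_words` and the three tiny corpora are invented for the example.

```python
from collections import Counter

def top_words(tokens, k=20):
    """The k most frequent words of one corpus, as a set."""
    return {word for word, _ in Counter(tokens).most_common(k)}

def compare_top_words(corpora, k=20):
    """Return (words in any top-k list, words in every top-k list)."""
    tops = [top_words(tokens, k) for tokens in corpora.values()]
    return set.union(*tops), set.intersection(*tops)

# Toy corpora; k=2 keeps the example readable.
corpora = {
    "blogs": "the the the a a cat".split(),
    "news": "to to to the the dog".split(),
    "twitter": "the the the i i lol".split(),
}
in_any, in_all = compare_top_words(corpora, k=2)
print(sorted(in_any), sorted(in_all))
```

With the real corpora and `k=20`, `in_any` would correspond to the 27 distinct words and `in_all` to the 13 shared words reported above.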
The three corpora (blogs, news and twitter) have vocabularies of similar size: roughly 255 to 322 thousand unique words in the five sixths of each file used for training, the remaining sixth being kept for testing. In each case, about half of the vocabulary (49% to 60%) appears only once, and 42000 to 55000 words appear at least 10 times. Between 2100 and 2800 words appear at least 1000 times in each corpus.
Although twitter has the largest vocabulary, it also has the largest share of words that appear only once (about 60%). News has the most words at every threshold from 2 occurrences upward, which suggests a more diverse vocabulary in regular use.
##   Number_words_appearing_at_least  blogs   news twitter
## 1                               1 254605 304716  322336
## 2                               2 121032 154305  130131
## 3                               5  65639  82273   65040
## 4                              10  44501  55053   42228
## 5                             100  11960  14384   10883
## 6                            1000   2185   2790    2143
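A table like the one above can be produced by counting, for each threshold, how many distinct words reach it. The Python sketch below shows the idea on toy data; `vocabulary_profile` is a hypothetical helper name, and the last lines only redo arithmetic on the blogs figures from the table.

```python
from collections import Counter

def vocabulary_profile(tokens, thresholds=(1, 2, 5, 10, 100, 1000)):
    """Number of distinct words appearing at least t times, per threshold t."""
    freq = Counter(tokens)
    return {t: sum(1 for c in freq.values() if c >= t) for t in thresholds}

# Toy corpus: "a" x10, "b" x5, "c" x2, "d" x1.
tokens = ["a"] * 10 + ["b"] * 5 + ["c"] * 2 + ["d"]
profile = vocabulary_profile(tokens)
print(profile)

# Share of words seen exactly once = 1 - (words seen >= 2) / (words seen >= 1).
# Using the blogs column of the table above:
blogs_singleton_share = 1 - 121032 / 254605
print(round(blogs_singleton_share, 2))
```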
The news and blog corpora are quite similar: of the 11960 words that appear at least 100 times in blogs, 9731 (81%) also appear at least 100 times in news. Only 13 of the more than 2000 words that occur at least 1000 times in blogs appear fewer than 100 times in news, and only 50 of the almost 2800 words that occur at least 1000 times in news appear fewer than 100 times in blogs.
Below are the 50 words that were frequent in news but comparatively rare in blogs. Most relate to sports, politics, or local government.
## [1] "sen" "sheriff's" "county's" "teammates"
## [5] "indianapolis" "calif" "longtime" "state's"
## [9] "newark" "analysts" "vikings" "analyst"
## [13] "counties" "cuyahoga" "team's" "christie"
## [17] "nation's" "gov" "innings" "lakers"
## [21] "spokeswoman" "prosecutors" "prosecutor" "quarterback"
## [25] "superintendent" "firefighters" "assists" "timbers"
## [29] "cardinals" "broncos" "ncaa" "postseason"
## [33] "playoff" "company's" "linebacker" "medicare"
## [37] "inning" "interstate" "blazers" "lawmakers"
## [41] "legislators" "sacramento" "sophomore" "nonprofit"
## [45] "testified" "township" "rutgers" "rams"
## [49] "rebounds" "kasich"
Twitter and blogs are also quite similar: only 5 words that appear at least 1000 times in blogs appear fewer than 100 times in twitter, whereas 75 words appear at least 1000 times in twitter and fewer than 100 times in blogs. Even twitter and news share most of their common words: 70 words appear at least 1000 times in news and fewer than 100 times in twitter, and 129 words appear at least 1000 times in twitter and fewer than 100 times in news.
The number of distinct n-grams far exceeds the size of the vocabulary: while the blogs training set has around 250 thousand unique words, it has over 4 million unique bigrams, over 11 million unique trigrams and over 15 million unique quadgrams. Similar growth is seen in the other corpora.
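The pairwise comparisons above all follow one pattern: keep the words that clear a high threshold in one corpus and stay under a low threshold in another. A minimal Python sketch, with invented counts and a hypothetical function name `frequent_here_rare_there`:

```python
from collections import Counter

def frequent_here_rare_there(freq_a, freq_b, high=1000, low=100):
    """Words with count >= high in freq_a and count < low in freq_b."""
    return sorted(w for w, c in freq_a.items()
                  if c >= high and freq_b.get(w, 0) < low)

# Invented word counts, for illustration only.
news = Counter({"the": 5000, "quarterback": 1200, "said": 3000})
blogs = Counter({"the": 6000, "quarterback": 40, "said": 900, "recipe": 1500})

news_only = frequent_here_rare_there(news, blogs)
blogs_only = frequent_here_rare_there(blogs, news)
print(news_only, blogs_only)
```

Words absent from the second corpus count as zero occurrences, so they are reported as rare there.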
Most of these n-grams, however, occur exactly once. The graph below shows the percentage of words, bigrams, trigrams and quadgrams that occur more than once for each of the three sets: these proportions are almost the same for each n.
This has an important consequence for building the prediction model: although n-gram tables can be very large, dropping the n-grams that occur only once shrinks them drastically with a simple query condition.
The planned model will return five candidate next words for any input. It will work as follows:
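The trimming step amounts to a single filter on the count column. The Python sketch below shows it on a toy trigram table; the function name `prune_rare` and the cut-off `min_count=2` are assumptions, and in R with data.table the equivalent would be a one-line row filter on the frequency column.

```python
from collections import Counter

def prune_rare(ngram_counts, min_count=2):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({gram: c for gram, c in ngram_counts.items()
                    if c >= min_count})

# Toy trigram table, for illustration only.
trigrams = Counter({
    ("i", "love", "you"): 3,
    ("love", "you", "too"): 1,
    ("thank", "you", "so"): 2,
    ("you", "so", "much"): 1,
})
pruned = prune_rare(trigrams)
kept_share = len(pruned) / len(trigrams)
print(len(trigrams), len(pruned), kept_share)
```

The trade-off is that n-grams seen once can no longer contribute predictions, so the model has to fall back on shorter n-grams in those cases.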