Executive summary

This is the first milestone report for the Data Science Specialization Capstone. Its goal is to describe the exploratory data analysis carried out on the Coursera-SwiftKey English language text files, the cleaning and tokenization process, and the plan for building a prediction model. The full analysis code is deliberately not reproduced here; only short, illustrative sketches are included where they help clarify a step. All analysis was done in R, mainly using the following packages: stringi/stringr, quanteda, and data.table.

Basic summaries of the three files

While the provided download link also contains text in German, Finnish, and Russian, only the English language text files are considered here. The table below shows the number of lines and number of words in each unprocessed corpus.

##   text_type number_lines number_words
## 1      news      1010242     34762395
## 2      blog       899288     37546239
## 3   twitter      2360148     30093413
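
A minimal sketch of how counts like these can be obtained with readLines and stringi (the file paths below are assumed to be the standard local locations of the downloaded files):

library(stringi)

files <- c(news    = "final/en_US/en_US.news.txt",      # assumed local paths
           blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(number_lines = length(lines),
             number_words = sum(stri_count_words(lines)))
}

do.call(rbind, lapply(files, summarise_file))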

Processing the files for cleaning and tokenization

Each separate corpus (blogs, news, and twitter) was split into six chunks, in order to keep at least 15% of the text for testing model accuracy and to keep the sizes of the n-gram tables and dfms manageable with my computer’s limited RAM (6 GB). Three dice rolls (one per corpus) were then simulated to decide which chunk would be held out as the test set in each case.
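
A minimal sketch of this splitting step, assuming the lines of a corpus have already been read into memory (the chunk assignment and seed shown here are illustrative, not the exact ones used):

split_corpus <- function(lines, n_chunks = 6, seed = 1234) {
  set.seed(seed)                                           # illustrative seed
  chunk_id  <- rep_len(seq_len(n_chunks), length(lines))   # assign each line to one of six chunks
  test_roll <- sample(seq_len(n_chunks), 1)                # simulated die roll picks the test chunk
  list(train = lines[chunk_id != test_roll],
       test  = lines[chunk_id == test_roll])
}

blogs_split <- split_corpus(blogs_lines)   # blogs_lines: character vector of raw lines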

Cleaning process

A number of cleaning steps were taken; they are listed below, and a short code sketch of a few of them follows the list. In each case, the effect of each step was first tested on a very small sample of text to ensure that it did not have unforeseen consequences.

  • All text was split into sentences.
  • Profanity was removed, using a list of common rude words found on GitHub and shortened to speed up the process. The profanity filter took quite a bit of refining to avoid catching false positives (we want to remove ass but not assess, semen but not basement…). Whenever a sentence contained profanity, the whole sentence was removed.
  • All text was converted to lowercase.
  • Special characters were removed, and extended Latin characters were transliterated to ASCII: for example, café becomes cafe.
  • Smiley faces, shrugs, hearts, etc. were removed.
  • Whenever a sentence contained Unicode characters that quanteda could not tokenize, the whole sentence was removed.
  • Numbers used in place of letters or words (“happy birthday 2 u”, “4give”) were replaced as much as possible. These instances were found both by sampling the corpus and by checking the most frequent tokens for unwanted common occurrences.
  • Similarly, letters used as abbreviations were expanded back into their full word forms as much as possible.
  • In tweets, rt (retweet) was a very common but unwanted token, so it was removed.
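
A minimal sketch of a few of these steps with stringi (the profanity list and the regular expressions below are short illustrative placeholders, not the ones actually used):

library(stringi)

profanity <- c("ass", "damn")   # placeholder; the real list was longer and GitHub-sourced
prof_rx   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")   # word boundaries avoid "assess"

clean_sentences <- function(sentences) {
  x <- stri_trans_tolower(sentences)
  x <- x[!stri_detect_regex(x, prof_rx)]            # drop whole sentences containing profanity
  x <- stri_trans_general(x, "Latin-ASCII")         # café -> cafe
  x <- stri_replace_all_regex(x, "\\b2\\b", "to")   # e.g. "happy birthday 2 u"
  x <- stri_replace_all_regex(x, "\\brt\\b", "")    # drop the retweet marker
  stri_trim_both(x)
}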

Tokenizing

The clean data was then tokenized into words and n-grams of length up to 5. Skipgrams will not be used in the model. The tokenization was done with the quanteda package, selecting most of the options that remove things like URLs or numbers.
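
A minimal sketch of this step with quanteda (the exact option choices shown are plausible assumptions rather than a record of the real ones):

library(quanteda)

toks <- tokens(clean_sentences,                 # clean_sentences: character vector after cleaning
               what           = "word",
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

ngrams <- tokens_ngrams(toks, n = 1:5, concatenator = "_")   # unigrams up to 5-grams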

Token compounding was also performed to join tokens that belong together, like San Francisco or Los Angeles. These pairs were found by looking at bigram frequencies, and later trigram frequencies, and specifically at words that are preceded or followed by a very limited set of words a high proportion of the time.
Based on the patterns identified, a larger list of tokens to compound was built by looking at bigrams that start or end with specific words, for example bigrams starting with North or New, or ending with Bay or Island.
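
A minimal sketch of how such candidates could be screened with data.table (the column names mirror the table below; the thresholds are illustrative assumptions):

library(data.table)

# bigrams:  data.table with columns V1, V2 (the two words) and freqvector (bigram count)
# unigrams: data.table with columns word and freq (unigram count)
bigrams[unigrams, fv1 := i.freq, on = .(V1 = word)]   # count of the first word
bigrams[unigrams, fv2 := i.freq, on = .(V2 = word)]   # count of the second word
bigrams[, followratio  := freqvector / fv1]           # share of V1 occurrences followed by V2
bigrams[, precederatio := freqvector / fv2]           # share of V2 occurrences preceded by V1
candidates <- bigrams[freqvector >= 40 & (followratio > 0.2 | precederatio > 0.2)]
setorder(candidates, -freqvector)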

Table of commonly co-occurring bigrams (before compounding)

##          V1        V2 freqvector fv1 fv2 followratio precederatio
##  1:  united    states        279 447 608   0.6241611    0.4588816
##  2:     ice     cream        209 474 724   0.4409283    0.2886740
##  3:   olive       oil        167 222 743   0.7522523    0.2247645
##  4:   jesus    christ        162 795 514   0.2037736    0.3151751
##  5:   south    africa        158 727 296   0.2173315    0.5337838
##  6:     san francisco         99 249 105   0.3975904    0.9428571
##  7:   prime  minister         93 162 206   0.5740741    0.4514563
##  8:     los   angeles         93 112  97   0.8303571    0.9587629
##  9:  amazon        eu         76 292 163   0.2602740    0.4662577
## 10:  highly recommend         71 354 305   0.2005650    0.2327869
## 11:   lemon     juice         70 270 265   0.2592593    0.2641509
## 12:    blah      blah         63  84 105   0.7500000    0.6000000
## 13: chicago  illinois         60 289 115   0.2076125    0.5217391
## 14:    pale       ale         60 201 299   0.2985075    0.2006689
## 15:  martha   stewart         55  98  85   0.5612245    0.6470588
## 16:     tim     holtz         46 142  49   0.3239437    0.9387755
## 17:      ha        ha         46  93 114   0.4946237    0.4035088
## 18:    hong      kong         46  56  65   0.8214286    0.7076923
## 19:   harry    potter         46 117  63   0.3931624    0.7301587
## 20:      et        al         45  70 143   0.6428571    0.3146853

The table above shows 20 of the most commonly co-occurring bigrams before the tokens were compounded. In the table, V1 and V2 are the two words of the bigram, freqvector is the bigram count, fv1 and fv2 are the individual counts of each word, followratio is the share of V1’s occurrences that are followed by V2 (freqvector/fv1), and precederatio is the share of V2’s occurrences that are preceded by V1 (freqvector/fv2). These statistics were used to filter the data so that strong co-occurrences could be found rapidly. Here’s a summary of the logic behind the decision to compound or not:

  • If the parts of the item make sense individually, the item is not compounded: for example, ice cream and olive oil remain separate.
  • Geographical locations: “Hong Kong” and “San Francisco” obviously belong together. But the decision was made to compound “North Carolina” but not “North America”, and “South Africa” but not “South America”, based on whether the name refers to an official country or state.
  • First names and last names were not compounded, i.e. “Harry Potter” remains two words.
  • Last names which include spaces were compounded, e.g. “Da Vinci” or “Van Damme”.
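
A minimal sketch of the compounding itself with quanteda (the list below is a short illustrative subset of the full list that was built):

library(quanteda)

to_compound <- phrase(c("san francisco", "los angeles", "hong kong",
                        "north carolina", "south africa", "prime minister",
                        "da vinci", "van damme"))

toks <- tokens_compound(toks, pattern = to_compound, concatenator = "_")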

Features of the data

Stopwords and most common tokens in each corpus

The most common words are almost always the same regardless of the origin of the text: across the top 20 words of each corpus, only 27 distinct words appear in total.

Below are the 13 words that appear in the top 20 of every corpus:

##  [1] "the"  "to"   "a"    "and"  "for"  "in"   "of"   "is"   "it"   "on"  
## [11] "that" "be"   "with"
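
A minimal sketch of how these overlaps could be computed, assuming dfm_blogs, dfm_news, and dfm_twitter are the unigram document-feature matrices:

library(quanteda)

top20 <- lapply(list(blogs = dfm_blogs, news = dfm_news, twitter = dfm_twitter),
                function(d) names(topfeatures(d, 20)))

Reduce(intersect, top20)        # the 13 words shared by all three top-20 lists
length(unique(unlist(top20)))   # 27 distinct words in total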

Vocabulary and word frequencies

The three corpora (blogs, news, and twitter) have vocabularies of similar size: 254 to 323 thousand unique words when considering 5/6 of each provided file and keeping the rest for testing. In each case, about half of the words in the vocabulary appear only once, while 42,000 to 55,000 words appear at least 10 times and between 2,100 and 2,800 words appear at least 1,000 times.
While twitter has the most unique words, it also has the largest share of words that appear only once, and news actually has the most vocabulary that appears frequently.

##   Number_words_appearing_at_least  blogs   news twitter
## 1                               1 254605 304716  322336
## 2                               2 121032 154305  130131
## 3                               5  65639  82273   65040
## 4                              10  44501  55053   42228
## 5                             100  11960  14384   10883
## 6                            1000   2185   2790    2143
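
A minimal sketch of how the counts in this table could be derived from a dfm (dfm_blogs is an assumed unigram document-feature matrix; the same code applies to the other two corpora):

word_freqs <- colSums(dfm_blogs)                       # total count of each unique word
thresholds <- c(1, 2, 5, 10, 100, 1000)
sapply(thresholds, function(k) sum(word_freqs >= k))   # number of words appearing at least k times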

Similarities and differences in vocabulary between corpora

The news and blog corpora are quite similar: out of the 11,960 words that appeared at least 100 times in blogs, 9,731 (81%) also appeared at least 100 times in news. Only 13 words (out of over 2,000) that occurred at least 1,000 times in blogs did not appear at least 100 times in news, and only 50 words (out of almost 2,800) that appeared at least 1,000 times in news did not appear at least 100 times in blogs.

Below is the list of words that were frequently seen in news but not in blogs: as we can see, most of these words relate to sports, politics, or local government.

##  [1] "sen"            "sheriff's"      "county's"       "teammates"     
##  [5] "indianapolis"   "calif"          "longtime"       "state's"       
##  [9] "newark"         "analysts"       "vikings"        "analyst"       
## [13] "counties"       "cuyahoga"       "team's"         "christie"      
## [17] "nation's"       "gov"            "innings"        "lakers"        
## [21] "spokeswoman"    "prosecutors"    "prosecutor"     "quarterback"   
## [25] "superintendent" "firefighters"   "assists"        "timbers"       
## [29] "cardinals"      "broncos"        "ncaa"           "postseason"    
## [33] "playoff"        "company's"      "linebacker"     "medicare"      
## [37] "inning"         "interstate"     "blazers"        "lawmakers"     
## [41] "legislators"    "sacramento"     "sophomore"      "nonprofit"     
## [45] "testified"      "township"       "rutgers"        "rams"          
## [49] "rebounds"       "kasich"

Twitter and blogs are also quite similar: only 5 words that appear at least 1,000 times in blogs appear fewer than 100 times in twitter, whereas 75 words appear at least 1,000 times in twitter while appearing fewer than 100 times in blogs. Even between twitter and news the common words are quite similar: 70 words appear at least 1,000 times in news while appearing fewer than 100 times in twitter, and 129 words appear at least 1,000 times in twitter and fewer than 100 times in news.
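
A minimal sketch of these comparisons, assuming freq_blogs and freq_news are named vectors of word counts (e.g. from colSums() on the dfms):

frequent_blogs <- names(freq_blogs[freq_blogs >= 1000])   # frequent in blogs
common_news    <- names(freq_news[freq_news >= 100])      # at least somewhat common in news

setdiff(frequent_blogs, common_news)   # frequent in blogs but rare or absent in news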

N-Gram numbers and frequencies

The number of unique n-grams is far larger than the size of the vocabulary: while the training set for blogs has around 250 thousand unique words, it has over 4 million unique bigrams, over 11 million unique trigrams, and over 15 million unique quadgrams. Similar growth is seen for the other corpora.

Most of these n-grams, however, occur exactly once. The graph below shows the percentage of words, bigrams, trigrams and quadgrams that occur more than once for each of the three sets: these proportions are almost the same for each n.

This has an important consequence for building the prediction model: although the n-gram tables can be very large, discarding the n-grams that occur only once trims them drastically with a simple query condition.
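
A minimal sketch of that trimming step, assuming the n-gram counts are stored in a data.table with columns ngram and count:

library(data.table)

ngram_dt <- ngram_dt[count > 1]   # drop n-grams seen only once
setkey(ngram_dt, ngram)           # index for fast lookups at prediction time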

Plan for building the prediction model

A model that offers five word predictions for any input is planned. This model will work in the following way: