Loading Data

The three English-language data sets are downloaded, unzipped to the working directory, and loaded into R. Because the eventual predictive model must accept any uni-, bi-, or tri-gram without knowing whether it came from a blog, news, or Twitter source, the three data sets are combined into a single file for further analysis. Finally, the combined data is sampled to make testing practical.
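
For reference, a minimal sketch of this step in base R; the archive URL and file paths shown are the commonly used Coursera/SwiftKey locations and are assumptions here rather than a record of the exact commands used.

```r
# Download and unzip the corpus (URL and paths assumed; adjust as needed)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) download.file(zip_url, zip_file, mode = "wb")
unzip(zip_file, exdir = ".")

# Read the three English-language files; skipNul avoids embedded-NUL warnings
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Combine into a single object, keeping a source label for per-source comparisons
combined <- c(blogs, news, twitter)
source   <- rep(c("Blogs", "News", "Twitter"),
                c(length(blogs), length(news), length(twitter)))
```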

A prebuilt list of profane words is downloaded to support a profanity filter. A prebuilt list of English words is also downloaded, to help identify non-English words that appear in the texts.
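
Loading the word lists might look like the sketch below; the URLs are illustrative examples of publicly available profanity and English dictionary lists, not necessarily the ones actually used.

```r
# Example word lists (illustrative URLs; substitute the lists actually used)
profanity_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
english_url   <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"

profanity     <- readLines(profanity_url, warn = FALSE)
english_words <- readLines(english_url,   warn = FALSE)
```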

Sampling & Summary

To make exploration of the data set feasible, we sample 20% of each file.
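
The sampling step might look like the following sketch, which reuses the `combined` and `source` objects from the loading sketch above; the seed value is arbitrary.

```r
# Reproducible 20% sample (rbinom gives an independent keep/drop decision per line,
# so roughly 20% of each source is retained)
set.seed(1234)
keep <- rbinom(length(combined), size = 1, prob = 0.20) == 1

sample_text   <- combined[keep]
sample_source <- source[keep]
```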

We can then summarize key attributes of each data file. Blogs and News documents have similar attributes (though there is less variability in News documents), but Twitter documents are markedly different. This suggests that a model built for Blogs and News may not necessarily apply to Twitter; this is a topic for later investigation.

##    Corpus Document.Count Max.Document.Length Min.Document.Length Mean.Document.Length Median.Document.Length
## 1   Blogs         179858               19795                   1            230.32868                    156
## 2    News          15452                1929                   3            201.75692                    185
## 3 Twitter         472030                 140                   3             68.66478                     64
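
The 140 maximum for Twitter suggests that document length here is measured in characters; under that assumption, a sketch of how such a table could be computed from the sampled data:

```r
# Per-source document statistics (document length measured in characters)
doc_len   <- nchar(sample_text)
by_source <- split(doc_len, sample_source)

summary_df <- data.frame(
  Corpus                 = names(by_source),
  Document.Count         = sapply(by_source, length),
  Max.Document.Length    = sapply(by_source, max),
  Min.Document.Length    = sapply(by_source, min),
  Mean.Document.Length   = sapply(by_source, mean),
  Median.Document.Length = sapply(by_source, median),
  row.names = NULL
)
summary_df
```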

Preprocessing

Texts are pre-processed to facilitate analysis.
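
The specific cleaning steps are not detailed here; as an illustration only, a typical base-R pipeline (lower-casing, stripping non-letter characters, collapsing whitespace, and dropping documents that contain a word from the profanity list) might look like this:

```r
# Illustrative cleaning pipeline (the exact steps used may differ)
clean_text <- tolower(sample_text)
clean_text <- gsub("[^a-z' ]", " ", clean_text)      # keep letters, apostrophes and spaces
clean_text <- gsub("\\s+", " ", trimws(clean_text))  # collapse runs of whitespace

# Profanity filter: drop any document containing a word from the profanity list
doc_tokens    <- strsplit(clean_text, " ", fixed = TRUE)
has_profanity <- vapply(doc_tokens, function(tok) any(tok %in% profanity), logical(1))
clean_text    <- clean_text[!has_profanity]
clean_source  <- sample_source[!has_profanity]
```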

Exploration

The texts may include non-English words, which could be a problem for a predictive model. A cursory examination (comparing tokenized text against the prebuilt English dictionary) reveals that the vast majority of flagged “non-English” words are proper names, acronyms, typos, or slang. Only a very small number, less than 1%, appear to be genuinely in languages other than English, so for now we do not attempt to remove non-English words.
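
A sketch of that check, assuming the `clean_text` and `english_words` objects from the earlier sketches:

```r
# Tokenize the cleaned text and flag tokens absent from the English word list
tokens      <- unlist(strsplit(clean_text, " ", fixed = TRUE))
tokens      <- tokens[nzchar(tokens)]
non_english <- tokens[!(tokens %in% tolower(english_words))]

length(non_english) / length(tokens)   # proportion of flagged tokens
head(non_english, 20)                  # first 20 flagged tokens
```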

To illustrate, below are the first 20 “non-English” words.

##  [1] "melissa"                     "trinitarian"                
##  [3] "winnie"                      "winnie"                     
##  [5] "decals"                      "ferber"                     
##  [7] "alex"                        "someplace"                  
##  [9] "aa"                          "microclimate"               
## [11] "se"                          "se"                         
## [13] "se"                          "se"                         
## [15] "th"                          "kirsten"                    
## [17] "boehner"                     "kirstenlegiblelandscapesorg"
## [19] "mother's"                    "performerartist"

We plot the total number of words and the number of unique words in Blogs, News, Twitter, and all sources combined. From those plots we can see that Blogs and Twitter are both substantially larger than News and also use a more varied vocabulary. This lends further credence to the structural differences between the sources, and suggests that predictive models should probably be “contextual” to the intent of the writer.
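
The counts behind such plots can be computed along these lines (a sketch; the plotting code itself is omitted):

```r
# Total and unique word counts for each source and for all sources combined
word_stats <- function(txt) {
  tok <- unlist(strsplit(txt, " ", fixed = TRUE))
  tok <- tok[nzchar(tok)]
  c(Total.Words = length(tok), Unique.Words = length(unique(tok)))
}

t(sapply(c(split(clean_text, clean_source), list(All = clean_text)), word_stats))
```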

An examination of histograms supports the assertion that Blogs and Twitter have more varied vocabulary: News texts use the same words at higher relative frequencies than Blogs and Twitter do, while Blogs and Twitter have “fatter” right tails, indicating a broader distribution of terms.

The histograms also highlight that the vast majority of terms appear only once. We can therefore infer that predictive modelling solutions will need to be robust to rare terms.
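
A sketch of one such histogram for the Blogs source, using base graphics; the log scale is an illustrative choice that makes the long right tail visible.

```r
# Term-frequency distribution for Blogs
blog_tokens <- unlist(strsplit(clean_text[clean_source == "Blogs"], " ", fixed = TRUE))
blog_tokens <- blog_tokens[nzchar(blog_tokens)]
term_freq   <- table(blog_tokens)

hist(log10(as.numeric(term_freq)),
     breaks = 50,
     main   = "Blogs: distribution of term frequencies",
     xlab   = "log10(count of each term)")

mean(term_freq == 1)   # share of terms that appear exactly once
```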

We can also view the top n-grams for each source; the plots below show the top 10 n-grams overall, with counts broken out by source. The most illuminating finding from this visual analysis is that the “thanks for the” tri-gram is extremely common on Twitter but relatively uncommon elsewhere. This type of phrase is highly conversational, and thus unlikely to appear in a Blog or News document unless quoted from a source. Again, this supports the earlier evidence that a predictive model should be able to “sense” the context of its user and adapt its suggestions accordingly.
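
A sketch of how per-source tri-gram counts could be tallied in base R; this simple version ignores document boundaries, which the real analysis may handle differently.

```r
# Build n-grams with a simple sliding window over the token stream
ngrams <- function(txt, n = 3) {
  tok <- unlist(strsplit(txt, " ", fixed = TRUE))
  tok <- tok[nzchar(tok)]
  if (length(tok) < n) return(character(0))
  idx   <- seq_len(length(tok) - n + 1)
  grams <- tok[idx]
  for (k in 2:n) grams <- paste(grams, tok[idx + k - 1])
  grams
}

# Top 10 tri-grams within each source
trigram_counts <- lapply(split(clean_text, clean_source), function(txt) {
  sort(table(ngrams(txt, 3)), decreasing = TRUE)
})
lapply(trigram_counts, head, 10)
```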