Loading libraries and some home-grown text utilities (with warnings turned off).

library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(magrittr)
#library(Rgraphviz) # Correlation plots  #no package available for 3.1.1 from cran

source("text_utils.R")

First, I gathered some basic stats on the files. Because of memory issues, I did the initial word counting with ‘wc’ rather than the tm package. (Note: llin = line length.)

## parsing  en_US.blogs.txt ...
## parsing  en_US.news.txt ...
## parsing  en_US.twitter.txt ...
##                name    size lines    words llin.avg llen.sd llin.max
## 1   en_US.blogs.txt 2185838  9109 37334131    238.0   256.5     3404
## 2    en_US.news.txt 2103975 10206 34372530    204.2   135.3     2595
## 3 en_US.twitter.txt 1646838 23668 30373583     68.6    37.3      155

I am treating all three files as a single corpus for this project. I may do more analysis to see whether they differ significantly, but I don’t see large differences in content beyond things like hashtags and links, which are more prevalent in one source than in the others. I expect my methodology to be robust enough to handle these non-English tokens, so modeling the sources separately should not improve my model.

For the remainder of the EDA, I created random subsets of the files to speed up the analysis.
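As a rough illustration of the subsetting step, here is a minimal base R sketch. The helper name `sample_lines`, the sampling fraction, and the seed are my own assumptions for the example; the actual sampling code lives in the utilities and is not shown here.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  set.seed(seed)                                   # fixed seed so the subset is reproducible
  n_keep <- max(1, round(length(lines) * fraction))
  keep <- sort(sample.int(length(lines), n_keep))  # sort() keeps the original line order
  lines[keep]
}

# Toy input standing in for something like readLines("en_US.blogs.txt")
toy <- paste("line", 1:1000)
subset_lines <- sample_lines(toy, fraction = 0.05)
length(subset_lines)
```

Fixing the seed makes the subset repeatable between runs, which helps when comparing cleaning rules against the same sample.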

I began by cleaning the corpus. This was trial and error, and I continue to add cleaning rules as I notice anomalies during the modeling process. Profanity has been removed by pattern matching for vulgar words, and punctuation has been selectively removed.
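The kind of cleaning described above might look roughly like the following base R sketch. The function `clean_text`, the placeholder `bad_words` list, and the specific rules (lowercasing, keeping apostrophes, dropping other punctuation) are assumptions for illustration, not the actual rules in my utilities.

```r
# Placeholder list; the real profanity list is not reproduced here.
bad_words <- c("darn", "heck")

# Hypothetical cleaning function using only base R regular expressions.
clean_text <- function(x, banned = bad_words) {
  x <- tolower(x)
  # Remove banned words, matching whole words only
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  # Selective punctuation removal: keep apostrophes, drop everything else
  # that is not a lowercase ASCII letter, digit, or space
  x <- gsub("[^a-z0-9' ]", " ", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_text("Well, DARN it... that's #awesome!")
# [1] "well it that's awesome"
```

Note that this sketch also drops non-ASCII letters, which may or may not be desirable depending on how foreign words should be handled.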

I plan to replace rare words with a single common placeholder term so that I can filter it out of the n-grams later; however, I have left this cleansing step out of the EDA.
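A minimal sketch of the rare-word idea, assuming the text has already been split into tokens. The function name `replace_rare`, the threshold `min_count`, and the placeholder string `"UNK"` are hypothetical choices for this example.

```r
# Hypothetical helper: replace every token seen fewer than min_count times
# with a single placeholder so it can be filtered out of n-grams later.
replace_rare <- function(tokens, min_count = 2, placeholder = "UNK") {
  counts <- table(tokens)
  rare <- names(counts)[counts < min_count]
  tokens[tokens %in% rare] <- placeholder
  tokens
}

tokens <- c("the", "cat", "the", "zziiinnng", "cat", "zygi")
replace_rare(tokens)
# [1] "the" "cat" "the" "UNK" "cat" "UNK"
```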

## <<TermDocumentMatrix (terms: 8, documents: 3)>>
## Non-/sparse entries: 9/15
## Sparsity           : 62%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##            Docs
## Terms       en_US.blogs.txt en_US.news.txt en_US.twitter.txt
##   a'bunadh                1              0                 0
##   a's                     0              4                 3
##   a"war                   0              0                 1
##   zygi                    0              2                 0
##   zynga                   0              0                 1
##   zynga's                 0              1                 0
##   zziiinnng               0              0                 1
##   zzzzzzzzz               1              0                 0
##   the 
## 48898

The sample corpus contains 52080 distinct terms and 814873 total words.

Looking at a few of the first and last terms, we see that many non-English words remain. In fact, only “zynga” is something I would expect from a predictive algorithm. I anticipate that my rare.words strategy will take care of this.

A listing of the top 10 words follows with an accounting of their frequency in the corpus.

##  have   are   was  with   not   you  that   for   and   the 
##  5538  6188  6520  7306  8453 10355 10754 11319 24888 48898

We can see that these ten words account for over 17% of the total number of words in the corpus. Many of them are stop words, but because they make up so much of the text that a predictive model must reproduce, we clearly don’t want to remove them from the corpus.
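The 17% figure can be checked directly from the counts listed above and the total word count of the sample corpus. The variable names here are just for illustration.

```r
# Frequencies of the top 10 words, copied from the listing above
top10 <- c(5538, 6188, 6520, 7306, 8453, 10355, 10754, 11319, 24888, 48898)
total_words <- 814873            # total words in the sample corpus

share <- sum(top10) / total_words
sum(top10)                       # 140219
round(100 * share, 1)            # 17.2
```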

The following plot shows how much of the entire corpus is covered by the most frequent terms.

Here we see 50% coverage is achieved by the top 325 terms, and 90% coverage with the top 9114 terms.
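One way to compute this kind of coverage number is to sort the term frequencies, take the cumulative share, and find the first rank that reaches the target. The sketch below uses a hypothetical helper `terms_for_coverage` and toy frequencies, not the real corpus counts.

```r
# Hypothetical helper: how many of the most frequent terms are needed
# to cover at least `target` (a proportion) of all word occurrences?
terms_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)   # running share of all occurrences
  unname(which(cum_share >= target)[1])   # first rank reaching the target
}

# Toy frequencies (not the real data)
toy_freq <- c(the = 50, and = 20, cat = 15, dog = 10, zygi = 5)
terms_for_coverage(toy_freq, 0.5)   # 1
terms_for_coverage(toy_freq, 0.9)   # 4
```

Applied to the real term frequencies, the same calculation would give the ranks read off the plot.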

Looking at the common word lengths:

The most common word length is 7 letters. For fun, and as a bit of a gut check (to make sure these are real words), I plotted the top 10 words of this length.
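A word-length tally like this can be sketched with `nchar()` and `table()`. The example below counts distinct terms per length on a toy vector; whether the real tally should count distinct terms or every occurrence is an assumption on my part.

```r
# Toy term list (not the real corpus terms)
terms <- c("zynga", "zygi", "the", "and", "have", "with", "corpus")

len_counts <- table(nchar(terms))             # number of terms per word length
most_common_len <- as.integer(names(len_counts)[which.max(len_counts)])
most_common_len                               # 4

# Terms of the most common length, for a quick sanity check
terms[nchar(terms) == most_common_len]
```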

I created a function to perform these interrogations on a corpus for any n-gram size and reproduced them (not shown here). As we start looking at word combinations, the percent of coverage by the most frequent combinations decreases dramatically, indicating that we will need a large amount of data for predictions.
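To make the n-gram step concrete, here is a small base R sketch that builds n-grams from a token vector. The function `make_ngrams` is a hypothetical stand-in, not the function I actually used, and it ignores sentence and document boundaries.

```r
# Hypothetical helper: build all n-grams (as space-separated strings)
# from a vector of tokens, using a sliding window of width n.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("i", "plan", "to", "build", "ngrams")
make_ngrams(tokens, 2)
# [1] "i plan"       "plan to"      "to build"     "build ngrams"

# Frequency table of the bigrams, most frequent first
sort(table(make_ngrams(tokens, 2)), decreasing = TRUE)
```

Because the number of distinct combinations grows quickly with n, the frequency table becomes much flatter than for single words, which is consistent with the drop in coverage noted above.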

From here I plan to build 2-, 3-, and 4-grams. My biggest obstacle is the amount of time and memory it takes to process these. I am still researching the best way to do this, but I hope that eliminating the rare words will shrink the problem dramatically.