Load the libraries and some home-grown text utilities (warnings are turned off).
library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(magrittr)
#library(Rgraphviz) # Correlation plots #no package available for 3.1.1 from cran
source("text_utils.R")
Gather some basic stats on the files. Due to memory issues, I did the initial word counting with ‘wc’ rather than the tm package. (Note: llin = line length)
## parsing en_US.blogs.txt ...
## parsing en_US.news.txt ...
## parsing en_US.twitter.txt ...
##                name    size lines    words llin.avg llen.sd llin.max
## 1   en_US.blogs.txt 2185838  9109 37334131    238.0   256.5     3404
## 2    en_US.news.txt 2103975 10206 34372530    204.2   135.3     2595
## 3 en_US.twitter.txt 1646838 23668 30373583     68.6    37.3      155
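The counts above came from `wc`. As a rough R-only alternative for the line-length columns, something like the following could work on a file small enough to read into memory. The helper name `line_stats` and its column names are assumptions for illustration and are not taken from text_utils.R.

```r
# Hypothetical helper (not from text_utils.R): summarise line lengths
# for one text file that fits in memory.
line_stats <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  len <- nchar(lines, type = "chars", allowNA = TRUE)
  data.frame(
    name     = basename(path),
    lines    = length(lines),
    llin.avg = round(mean(len, na.rm = TRUE), 1),
    llin.sd  = round(sd(len, na.rm = TRUE), 1),
    llin.max = max(len, na.rm = TRUE)
  )
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("a short line", "a somewhat longer line of text", "x"), tmp)
stats <- line_stats(tmp)
print(stats)
```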
I am treating all three files as a single corpus for this project. I may do more analysis to see whether they differ significantly, but I don’t see large differences in content beyond things like hashtags and links, which are more prevalent in one source than in the others. I expect my methodology to be robust enough to handle these non-English tokens, so splitting the sources up should not improve my model.
For the remainder of the EDA, I created random subsets of the files to speed up the analysis.
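A minimal sketch of how such a subset could be drawn, assuming the lines are already in a character vector. The helper name `sample_lines`, the seed, and the 5% fraction are illustrative assumptions, not the values used for this report.

```r
# Hypothetical helper: keep each line independently with probability `frac`.
sample_lines <- function(lines, frac, seed = 1234) {
  set.seed(seed)  # fixed seed so the subset is reproducible
  keep <- rbinom(length(lines), size = 1, prob = frac) == 1
  lines[keep]
}

# Usage on made-up input
lines <- sprintf("line %d", 1:1000)
sampled <- sample_lines(lines, frac = 0.05)
length(sampled)
```

Drawing one coin flip per line keeps the original line order, which makes the sample easy to compare against the source file.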
I began by cleaning the corpus. This was a matter of trial and error, and I am continuing to add cleaning rules as I notice anomalies during the modeling process. Profanity has been removed by pattern matching for vulgar words, and punctuation has been selectively removed.
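The two cleaning steps described here can be sketched with base R regular expressions. This is only an illustration: the word list `bad_words` is a harmless placeholder, and the rule of keeping apostrophes while dropping other punctuation is an assumption about what "selectively removed" means.

```r
# Placeholder word list (assumption); a real list would be much longer.
bad_words <- c("darn", "heck")

# Hypothetical cleaning function, not the one in text_utils.R.
clean_text <- function(x, profanity = bad_words) {
  x <- tolower(x)
  # Remove profanity by matching whole words only
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  # Selective punctuation removal: keep letters, digits, apostrophes, spaces
  x <- gsub("[^a-z0-9' ]", " ", x)
  # Collapse runs of spaces and trim both ends
  x <- gsub(" +", " ", x)
  gsub("^ | $", "", x)
}

cleaned <- clean_text("Well, DARN it -- I don't know!")
print(cleaned)
```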
I plan to replace the rare words with a single common placeholder term so that I can filter it out of the n-grams later. However, I have left this cleansing step out of the EDA.
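One possible shape for that rare-word step, assuming the text has already been split into tokens. The function name `replace_rare`, the `<unk>` placeholder, and the count threshold are all assumptions chosen for illustration.

```r
# Hypothetical helper: swap every token seen fewer than `min_count`
# times for a single placeholder token.
replace_rare <- function(tokens, min_count = 2, placeholder = "<unk>") {
  counts <- table(tokens)
  rare <- names(counts)[counts < min_count]
  tokens[tokens %in% rare] <- placeholder
  tokens
}

# Usage on made-up tokens
tokens <- c("the", "cat", "saw", "the", "zziiinnng", "cat")
replaced <- replace_rare(tokens)
print(replaced)
```

Any n-gram containing the placeholder can then be dropped in one pass, which is what should shrink the n-gram tables.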
## <<TermDocumentMatrix (terms: 8, documents: 3)>>
## Non-/sparse entries: 9/15
## Sparsity : 62%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
##             Docs
## Terms        en_US.blogs.txt en_US.news.txt en_US.twitter.txt
##   a'bunadh                 1              0                 0
##   a's                      0              4                 3
##   a"war                    0              0                 1
##   zygi                     0              2                 0
##   zynga                    0              0                 1
##   zynga's                  0              1                 0
##   zziiinnng                0              0                 1
##   zzzzzzzzz                1              0                 0
## the
## 48898
The sample corpus contains 52080 distinct terms and 814873 total words.
Looking at a few of the first and last terms, we see that many non-English words remain. In fact, only “zynga” is something I would expect from a predictive algorithm. I anticipate that my rare.words strategy will take care of this.
A listing of the top 10 words follows with an accounting of their frequency in the corpus.
##  have   are   was  with   not   you  that   for   and   the
##  5538  6188  6520  7306  8453 10355 10754 11319 24888 48898
We can see that these ten words account for over 17% of all words in the corpus (140219 of 814873). Many of them are stop words, but because they make up such a large share of the text a predictive model has to reproduce, we clearly don’t want to remove them from the corpus.
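The share can be checked directly from the figures reported above. The variable names below are illustrative; the counts and the corpus total are the ones printed in this report.

```r
# Top-10 word counts and corpus size, copied from the output above
top10 <- c("have" = 5538, "are" = 6188, "was" = 6520, "with" = 7306,
           "not" = 8453, "you" = 10355, "that" = 10754, "for" = 11319,
           "and" = 24888, "the" = 48898)
total_words <- 814873

# Fraction of all word occurrences covered by these ten terms
share <- sum(top10) / total_words
round(100 * share, 1)
```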
The following plot can be used to determine how much of the entire corpus is covered by the most frequent terms.
Here we see 50% coverage is achieved by the top 325 terms, and 90% coverage with the top 9114 terms.
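Coverage numbers like these can be read off a cumulative sum of the sorted term frequencies. The sketch below assumes a named frequency vector; `terms_for_coverage` is a hypothetical helper and the toy counts are made up.

```r
# Hypothetical helper: how many of the most frequent terms are needed
# to cover at least `target` (a fraction) of all word occurrences?
terms_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

# Usage on made-up counts (total = 100)
freq <- c(the = 48, and = 25, cat = 15, dog = 8, zygi = 4)
n50 <- terms_for_coverage(freq, 0.5)
n90 <- terms_for_coverage(freq, 0.9)
c(n50, n90)
```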
Looking at the common word lengths:
The most common word length is 7 letters. For fun, and as a bit of a gut check (to make sure these are real words), I plotted the top 10 words of this length.
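A small sketch of this word-length check, again on a named frequency vector with made-up counts. It assumes that "most common" refers to the number of distinct terms of each length rather than the number of occurrences, which the report does not specify.

```r
# Made-up term frequencies for illustration
freq <- c(the = 50, because = 9, through = 7, and = 25,
          another = 5, cat = 3, between = 4)

# Number of distinct terms at each word length
word_len <- nchar(names(freq))
len_dist <- table(word_len)
most_common_len <- as.integer(names(which.max(len_dist)))

# Most frequent terms of that length (up to 10), as a sanity check
top_of_len <- head(sort(freq[word_len == most_common_len],
                        decreasing = TRUE), 10)
print(top_of_len)
```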
I created a function to perform these interrogations on the n-grams of a corpus and reproduced the analysis for higher-order n-grams (not shown here). As we start looking at word combinations, the percentage of coverage provided by the most frequent combinations decreases dramatically, indicating that we will need a large amount of data for predictions.
From here I plan to build 2-, 3-, and 4-grams. My biggest obstacle is the amount of time and memory it takes to process these. I am still researching the best way to do this, but I hope that eliminating the rare words will shrink the problem dramatically.
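For reference, n-gram construction itself can be sketched in a few lines of base R. This is a naive illustration on a single token vector, with a hypothetical `ngrams` helper; it is not the approach chosen for the project and would not scale to the full corpus.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Usage on made-up tokens
tokens <- c("i", "do", "not", "know", "i", "do", "not", "care")
bigrams <- ngrams(tokens, 2)
bigram_freq <- table(bigrams)
fourgrams <- ngrams(tokens, 4)
print(bigram_freq)
```

Even in this tiny example the 4-grams are all unique while some bigrams repeat, which mirrors the drop in coverage seen as n grows.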