Summary

In this report we provide an overview of the word corpora from which we plan to build word prediction algorithms. This milestone report shows basic facts about the raw data and then demonstrates the cleaning steps taken before organizing the documents into a structured representation called a “term document matrix”. The term document matrix allows us to clean and ‘purify’ the text, making it suitable for statistical analyses such as word frequency counts.

The corpora are taken from English blogs, news articles and Twitter messages. The filenames we are working with are the following:

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

We then load the files using read_lines from the readr package for speed.
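
The loading code itself is not echoed in this report; a minimal sketch, assuming the three files sit in the working directory and using the object names that appear later, would be:

library(readr)
# read_lines from readr is considerably faster than base readLines for files of this size
enus_blogs <- read_lines("en_US.blogs.txt")
enus_news  <- read_lines("en_US.news.txt")
enus_twits <- read_lines("en_US.twitter.txt")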

Now that the raw files are loaded, we can gather basic statistics on the size, in lines and megabytes, of each corpus:

Statistics            en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt
Lines/msgs per file           899,288        1,010,242           2,360,148
Size of file                 248.5 Mb         249.6 Mb            301.4 Mb
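
The code behind this table is not shown; the counts and sizes can be gathered with base functions, for example:

# Number of lines/messages per corpus
length(enus_blogs); length(enus_news); length(enus_twits)

# Size of each corpus as loaded into memory
format(object.size(enus_blogs), units = "Mb")
format(object.size(enus_news),  units = "Mb")
format(object.size(enus_twits), units = "Mb")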

These are just the raw statistics before any character cleanup or organization of the data. The steps we will take with each file are:

Word removal and reduction decisions

Sampling

To make the generation of statistics easier and faster, we sample just 10% of each blog/news/Twitter corpus. Even though this is a fraction of the data, the total number of records or messages is still > 300,000, which should be sufficient for representative frequency statistics.

enus_blogs10 <- sample(enus_blogs, size = length(enus_blogs)/10)
enus_news10  <- sample(enus_news,  size = length(enus_news)/10)
enus_twits10 <- sample(enus_twits, size = length(enus_twits)/10)
# remove the original full-size vectors to manage memory
rm(enus_blogs)
rm(enus_news)
rm(enus_twits)

Corpus Building

The document corpus is then translated into a structured object called a Term Document Matrix, which accumulates the frequency of every word across the corpus. This allows extraction of the most common words and associations, and subsequently the calculation of higher-level bi-grams, tri-grams, etc.
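
The construction step is not echoed here; below is a minimal sketch of how such a matrix could be built from the sampled blog vector with the tm package (the exact cleaning transformations used for this report are an assumption):

library(tm)
# Build a corpus from the sampled blog lines
corp_blogs <- VCorpus(VectorSource(enus_blogs10))
# Basic cleaning: lower case, strip punctuation, numbers and extra whitespace
corp_blogs <- tm_map(corp_blogs, content_transformer(tolower))
corp_blogs <- tm_map(corp_blogs, removePunctuation)
corp_blogs <- tm_map(corp_blogs, removeNumbers)
corp_blogs <- tm_map(corp_blogs, stripWhitespace)
# One row per document, one column per term, cells hold term frequencies
dtm_blogs <- DocumentTermMatrix(corp_blogs)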

Removing sparse terms

If you look at the initial corpus entries you can see tokens that are clearly expressions of emotion rather than actual words. If these tokens occur rarely, they can be removed from the corpus, since predicting them is neither possible nor useful. We can remove sparse terms by, for instance, dropping any word that is missing from at least 99% of the documents in the corpus:

setwd("~/R/Capstone/final/en_US") # set again for knitr, since above is lost
# The unfiltered corpus contains non-word exclamations
dtm_blogs <- readRDS("dtm_blogs_10pct.Rds")
dtm_news <- readRDS("dtm_news_10pct.Rds")
dtm_twits <- readRDS("dtm_twits_10pct.Rds")
inspect(dtm_blogs[1:4,1:4])
## <<DocumentTermMatrix (documents: 4, terms: 4)>>
## Non-/sparse entries: 0/16
## Sparsity           : 100%
## Maximal term length: 21
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs aaa aaaa aaaaa aaaaaaaaahhhhhhhhhhhh
##    1   0    0     0                     0
##    2   0    0     0                     0
##    3   0    0     0                     0
##    4   0    0     0                     0
# After filtering, only recognizable words are seen
dtm_blogs_nonSpse <- removeSparseTerms(dtm_blogs, 0.99)
dtm_news_nonSpse <- removeSparseTerms(dtm_news, 0.99)
dtm_twits_nonSpse <- removeSparseTerms(dtm_twits, 0.99)
inspect(dtm_blogs_nonSpse[1:4,1:4])
## <<DocumentTermMatrix (documents: 4, terms: 4)>>
## Non-/sparse entries: 1/15
## Sparsity           : 94%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs able about actually add
##    1    0     0        0   0
##    2    0     0        0   0
##    3    0     0        0   0
##    4    0     1        0   0

As stated above, after filtering at the 99% sparsity level only recognizable words remain.

Matrices and Frequencies

Each of the Term Document Matrices is then combined into a single Term Document Matrix covering all three samples, from which the most frequent words are extracted.
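
The combining step that produced dtm_all_10pct.Rds is not shown; one simple way to obtain a combined matrix, assuming the sampled vectors are still in memory (cleaning steps omitted here for brevity), is to concatenate them and build a single matrix over the result:

library(tm)
# Assumption: concatenate the three sampled vectors and build one matrix over them
enus_all10 <- c(enus_blogs10, enus_news10, enus_twits10)
corp_all   <- VCorpus(VectorSource(enus_all10))
dtm_all    <- DocumentTermMatrix(corp_all)
saveRDS(dtm_all, "dtm_all_10pct.Rds")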

setwd("~/R/Capstone/final/en_US") # set again for knitr, since above is lost
dtm_all <- readRDS("dtm_all_10pct.Rds")
dtm_all_mtx <- as.matrix(dtm_all)
# Optionally, the saved dense matrix (> 1.4 GB) could be loaded here instead
freq_all <- colSums(dtm_all_mtx)
freq_all_desc <- sort(freq_all, decreasing=TRUE)
# The 20 most frequent words in the sampled Corpus
head(freq_all_desc,20)
##    the    and    for   that    you   with    was   this   have    are 
## 478137 242259 110002 104478  94353  71325  62556  54556  52874  48727 
##    but    not   from    its    all   they   will   just   your    one 
##  48421  41398  38353  35875  33954  32002  31723  30492  30458  30281
library(lattice)
# A Barchart of the 25 most frequent words in the sampled Corpus, with frequencies
barchart(freq_all_desc[1:25])

Conclusion

Words normally considered stopwords dominate the frequency distribution of the combined corpus. For model development, a key question will be whether stop words are useful in bi-grams, tri-grams, etc., or whether they will simply clutter the model.
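
If the decision is to drop them, a minimal sketch using tm's built-in English stopword list, applied to the combined corpus assumed in the earlier sketch, would be:

library(tm)
# Assumption: corp_all is the cleaned combined corpus from the sketch above
corp_all_nostop <- tm_map(corp_all, removeWords, stopwords("english"))
dtm_all_nostop  <- DocumentTermMatrix(corp_all_nostop)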

Note on Bi-grams

Bi-gram distributions are not shown here, but one challenge, once they are computed, is deciding whether to keep a large proportion of them. As can be seen below, seeing the letter ‘a’ printed gives no clue as to what the next word might be. This suggests that bi-grams with leading stopwords should be discarded, unless they offer some useful ‘guesses’.
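
The code that builds bi_tdm is not included above; a minimal sketch of how such a bi-gram term-document matrix could be constructed, using the NLP package's tokenizer utilities together with tm and the combined corpus assumed earlier, is:

library(tm)
library(NLP)
# Tokenizer that turns each document into its overlapping bi-grams
BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
# Assumption: corp_all is the cleaned combined corpus from the earlier sketch
bi_tdm <- TermDocumentMatrix(corp_all, control = list(tokenize = BigramTokenizer))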

terms = findFreqTerms(bi_tdm, lowfreq = 6)[1:25]
terms
 [1] "a a"          "a about"      "a above"      "a absolute"   "a abundant"   "a abv"       
 [7] "a acceptable" "a acre"       "a action"     "a actually"   "a added"      "a advance"   
[13] "a after"      "a again"      "a age"        "a album"      "a all"        "a also"      
[19] "a always"     "a am"         "a amazing"    "a amazon"     "a american"   "a amp"       
[25] "a an"