In this report we provide an overview of the word corpora from which we plan to build word prediction algorithms. This milestone report shows basic facts about the raw data and then demonstrates the cleaning steps taken before the documents are built into a structured representation called a "term document matrix". The term document matrix allows us to clean and 'purify' the text, making it suitable for statistical analyses such as word frequency counts.
The corpora are taken from English blogs, news articles and Twitter messages. The filenames we are working with are the following:
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
We then load the files, using read_lines from the readr package for speed:
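The loading step is along these lines (a sketch; the object names and the working directory match those used in the code later in this report):

library(readr)
setwd("~/R/Capstone/final/en_US")   # directory containing the three raw text files
enus_blogs <- read_lines("en_US.blogs.txt")
enus_news  <- read_lines("en_US.news.txt")
enus_twits <- read_lines("en_US.twitter.txt")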
Now that the raw files are loaded, we can gather basic statistics on the number of lines and the size in MB of each corpus:
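A sketch of how these figures can be gathered (it is assumed the sizes below refer to the loaded character vectors rather than the files on disk):

length(enus_blogs)                              # number of lines/messages in the blogs corpus
format(object.size(enus_blogs), units = "Mb")   # approximate size of the loaded vector
# file.size("en_US.blogs.txt") / 1024^2         # alternative: size of the raw file on disk, in Mb

The same calls are repeated for enus_news and enus_twits.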
| Statistic | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|---|---|---|---|
| Lines/messages per file | 899,288 | 1,010,242 | 2,360,148 |
| Size (Mb) | 248.5 | 249.6 | 301.4 |
These are just the raw statistics, before any character cleanup or reorganization of the data. For each file we will: sample 10% of the lines, clean the text, build a term document matrix, remove sparse terms, and examine word and n-gram frequencies.
To make the generation of statistics easier and faster, we sample just 10% of each of the blog, news and twitter corpora. Even though this is only a fraction of the data, the total number of sampled records is still greater than 300,000, which should be sufficient for representative frequency statistics.
# Sample 10% of each corpus (floor() makes the integer sample size explicit)
enus_blogs10 <- sample(enus_blogs, size = floor(length(enus_blogs)/10))
enus_news10  <- sample(enus_news,  size = floor(length(enus_news)/10))
enus_twits10 <- sample(enus_twits, size = floor(length(enus_twits)/10))
# remove the originals to manage memory
rm(enus_blogs)
rm(enus_news)
rm(enus_twits)
The document corpus is then translated into a structured object called a term document matrix (stored here as a DocumentTermMatrix from the tm package), which accumulates the frequency of every word across the corpus. This allows extraction of the most common words and their associations, and subsequently allows calculation of higher-level bi-grams, tri-grams and so on.
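The cleaning and matrix-building code is not reproduced in this report; a minimal sketch with the tm package, assuming standard transformations (lower-casing, removing punctuation and numbers, stripping whitespace), is shown below. The news and twitter samples are processed the same way to produce dtm_news_10pct.Rds and dtm_twits_10pct.Rds.

library(tm)
# Build a corpus from the sampled blog lines, apply basic cleaning,
# convert it to a document-term matrix and save it for later chunks
blogs_corpus <- VCorpus(VectorSource(enus_blogs10))
blogs_corpus <- tm_map(blogs_corpus, content_transformer(tolower))
blogs_corpus <- tm_map(blogs_corpus, removePunctuation)
blogs_corpus <- tm_map(blogs_corpus, removeNumbers)
blogs_corpus <- tm_map(blogs_corpus, stripWhitespace)
dtm_blogs <- DocumentTermMatrix(blogs_corpus)
saveRDS(dtm_blogs, "dtm_blogs_10pct.Rds")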
If you look at the initial corpus entries you can see terms that are clearly exclamations or expressions of emotion rather than actual words. If these terms occur rarely, they can be removed from the corpus, since predicting them is neither possible nor useful. We can remove such sparse terms by, for instance, dropping any word that is missing from more than 99% of the documents in the corpus:
setwd("~/R/Capstone/final/en_US") # set again for knitr, since above is lost
# The unfiltered corpus contains non-word exclamations
dtm_blogs <- readRDS("dtm_blogs_10pct.Rds")
dtm_news <- readRDS("dtm_news_10pct.Rds")
dtm_twits <- readRDS("dtm_twits_10pct.Rds")
inspect(dtm_blogs[1:4,1:4])
## <<DocumentTermMatrix (documents: 4, terms: 4)>>
## Non-/sparse entries: 0/16
## Sparsity : 100%
## Maximal term length: 21
## Weighting : term frequency (tf)
##
## Terms
## Docs aaa aaaa aaaaa aaaaaaaaahhhhhhhhhhhh
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
# After filtering, only recognizable words are seen
dtm_blogs_nonSpse <- removeSparseTerms(dtm_blogs, 0.99)
dtm_news_nonSpse <- removeSparseTerms(dtm_news, 0.99)
dtm_twits_nonSpse <- removeSparseTerms(dtm_twits, 0.99)
inspect(dtm_blogs_nonSpse[1:4,1:4])
## <<DocumentTermMatrix (documents: 4, terms: 4)>>
## Non-/sparse entries: 1/15
## Sparsity : 94%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Terms
## Docs able about actually add
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 1 0 0
As stated above: after filtering at the 99% sparsity level, only recognizable words remain.
The three term document matrices are then combined into a single term document matrix for the whole sample, from which the most frequent words are extracted.
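The combination step itself is not shown; one possibility (a sketch, assuming the combined matrix is rebuilt from the pooled 10% samples rather than merged from the three existing matrices) is:

library(tm)
# Pool the three samples into one corpus (cleaning steps as above omitted for brevity)
all_text   <- c(enus_blogs10, enus_news10, enus_twits10)
all_corpus <- VCorpus(VectorSource(all_text))
dtm_all    <- DocumentTermMatrix(all_corpus)
saveRDS(dtm_all, "dtm_all_10pct.Rds")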
setwd("~/R/Capstone/final/en_US") # set again for knitr, since above is lost
dtm_all <- readRDS("dtm_all_10pct.Rds")
dtm_all_mtx <- as.matrix(dtm_all)
# Optionally the dense matrix can be saved and reloaded instead of recomputed, but it is > 1.4 GB
freq_all <- colSums(dtm_all_mtx)  # total count of each term across the sampled corpus
freq_all_desc <- sort(freq_all, decreasing=TRUE)
# The 20 most frequent words in the sampled corpus
head(freq_all_desc, 20)
## the and for that you with was this have are
## 478137 242259 110002 104478 94353 71325 62556 54556 52874 48727
## but not from its all they will just your one
## 48421 41398 38353 35875 33954 32002 31723 30492 30458 30281
library(lattice)
# A Barchart of the 25 most frequent words in the sampled Corpus, with frequencies
barchart(freq_all_desc[1:25])
Words normally considered stopwords dominate the frequency distribution of the combined corpus. A key question for model development will be whether stopwords are useful within bi-grams, tri-grams and higher-order n-grams, or whether they will simply clutter the model.
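If stopwords prove to add clutter, one option (a sketch only, not a decision taken in this report) is to strip them with tm's built-in English stopword list before building the matrices:

library(tm)
# Hypothetical step: drop common English stopwords from the (sketched) combined corpus
all_corpus_nostop <- tm_map(all_corpus, removeWords, stopwords("english"))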
Full bi-gram distributions are not shown here, but once they are computed, a challenge is deciding what proportion to keep. As the sample below shows, a leading 'a' gives almost no clue as to what the next word might be. This suggests that bi-grams beginning with stopwords should be discarded unless they provide some useful 'guesses'.
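The bi-gram matrix bi_tdm used below is not constructed in the code shown in this report; a minimal sketch, assuming the RWeka NGramTokenizer and the combined corpus object from the earlier sketch, is:

library(tm)
library(RWeka)  # assumed tokenizer package; the report does not show how bi_tdm was built
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi_tdm <- TermDocumentMatrix(all_corpus, control = list(tokenize = BigramTokenizer))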
# The first 25 bi-grams (alphabetically) that occur at least 6 times in the sample
terms <- findFreqTerms(bi_tdm, lowfreq = 6)[1:25]
terms
##  [1] "a a"          "a about"      "a above"      "a absolute"   "a abundant"   "a abv"
##  [7] "a acceptable" "a acre"       "a action"     "a actually"   "a added"      "a advance"
## [13] "a after"      "a again"      "a age"        "a album"      "a all"        "a also"
## [19] "a always"     "a am"         "a amazing"    "a amazon"     "a american"   "a amp"
## [25] "a an"