Overview

This document contains a brief overview of the Corpora dataset (http://www.corpora.heliohost.org/aboutcorpus.html), which is used as the data source for building a word prediction model.

The dataset was collected by a web crawler and is available in several languages; English is used as the main language here. This exploratory analysis covers a data overview, cleaning and tokenization. As an auxiliary dictionary, the US English offensive words list from Carnegie Mellon University's Luis von Ahn Research Group (http://www.cs.cmu.edu/~biglou/resources/bad-words.txt) is used.

Prerequisites

The following libraries are used for the exploratory analysis:

library(tm)
library(RWeka)
library(wordcloud)
library(ggplot2)
library(scales)
set.seed(404)

Data downloading

The list of files used for the analysis:

file.info(list.files("./final/en_US", full.names = TRUE))[1]
##                                      size
## ./final/en_US/en_US.blogs.txt   210160014
## ./final/en_US/en_US.news.txt    205811883
## ./final/en_US/en_US.twitter.txt 167105338
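
The files above are assumed to have already been downloaded and unpacked into ./final/en_US. For completeness, a minimal sketch of that step; the archive URL below is a placeholder, not taken from this document:

datasetUrl <- "https://example.com/dataset.zip"  # placeholder URL -- replace with the actual archive location
if (!dir.exists("./final")) {
  download.file(datasetUrl, destfile = "dataset.zip", mode = "wb")
  unzip("dataset.zip")  # expected to extract the final/en_US/*.txt files listed above
}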

Each file is larger than 150 MB, so it is preferable to use a file connection to read the data into memory:

con <- file("./final/en_US/en_US.twitter.txt", "r") 
tweetsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.news.txt", "r") 
newsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.blogs.txt", "r") 
blogsVector <- readLines(con)
close(con)
con <- file("./src/en_US.profanity.txt", "r") 
profanityVector <- readLines(con)
close(con)

Review of data

The files contain a large volume of data:

Corpus         NLines     NWords    NSymbols
Tweets        2360148   30373543   162384825
News          1010242   34372528   203791399
Blogs          899288   37334131   208361438
Corpus total  4269678  102080202   574537662
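
These counts can be reproduced directly from the vectors read above; a minimal sketch using base R (the word count here is approximated by splitting on whitespace, so it may differ slightly from the figures in the table):

countStats <- function(v) {
  c(NLines   = length(v),
    NWords   = sum(vapply(strsplit(v, "\\s+"), length, integer(1))),
    NSymbols = sum(nchar(v)))
}
rbind(Tweets = countStats(tweetsVector),
      News   = countStats(newsVector),
      Blogs  = countStats(blogsVector))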

Based on these results, the total word count exceeds 100 million. Therefore, it makes sense to continue with a sample of the data. Given the available computing capacity, 0.5% of the total will be used.

Data sampling

The tm library is applied to convert the data into a corpus, using a 0.5% sample:

sampleSize <- 0.005  # 0.5% of each source
sampledTweets <- sample(tweetsVector, length(tweetsVector) * sampleSize)
sampledNews <- sample(newsVector, length(newsVector) * sampleSize)
sampledBlogs <- sample(blogsVector, length(blogsVector) * sampleSize)
corpus <- VCorpus(VectorSource(c(sampledTweets, sampledNews, sampledBlogs)))

A brief overview of the sampled corpus:

Corpus        NLines   NWords   NSymbols
Tweets         11800   151332     810563
News            5051   173369    1027178
Blogs           4496   187417    1043878
Corpus total   21347   512118    2881619

The resulting dataset can be processed without additional computing capacity.

Data cleaning

The original data comes from web crawling and therefore includes misspelled words, non-existent words and other “noise” that should be stripped out before building the model. For this purpose, the following filters are applied:

corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, from="latin1", to="ASCII", sub="")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanityVector)
corpus <- tm_map(corpus, stripWhitespace)

Word statistics

The first analysis is a word frequency count using a TermDocumentMatrix from the tm package. Let's find the 20 most frequent words:

options(mc.cores=1) 
tdm <- TermDocumentMatrix(corpus, control=list(tokenize =  RWeka::WordTokenizer, 
                                              wordLengths = c(1, Inf)))
tdm  
## <<TermDocumentMatrix (terms: 38647, documents: 21347)>>
## Non-/sparse entries: 416475/824581034
## Sparsity           : 100%
## Maximal term length: 71
## Weighting          : term frequency (tf)
head(sort(slam::row_sums(tdm), decreasing=TRUE),20)
##   the    to   and     a    of    in     i    is   for  that    it   you 
## 24046 13725 12055 12007 10182  8345  8335  5512  5454  5261  4760  4606 
##    on  with   was    my    at  this    be  have 
##  4119  3600  3148  3009  2935  2814  2726  2566

The results are dominated by stop words that carry little standalone meaning. Let's strip these words out of the result and present it as a word cloud:

tdmWS <- TermDocumentMatrix(corpus, control=list(tokenize = RWeka::WordTokenizer, 
                                              wordLengths = c(1, Inf),
                                              stopwords = stopwords("english")))
tdmWS  
## <<TermDocumentMatrix (terms: 38523, documents: 21347)>>
## Non-/sparse entries: 261453/822089028
## Sparsity           : 100%
## Maximal term length: 71
## Weighting          : term frequency (tf)
head(sort(slam::row_sums(tdmWS), decreasing=TRUE),20)
##   will   said    one   just   like    can    get     im   time    new 
##   1589   1555   1511   1501   1339   1222   1210   1193   1057   1030 
##   dont   good    day    now   love   know people     us   back    see 
##    884    884    878    844    814    806    802    758    709    699
v <- sort(slam::row_sums(tdm[findFreqTerms(tdm,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

The frequencies of the top 10 words are shown below:

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
ggplot(data = data.frame(names=factor(names(v[1:10]), levels = names(v[1:10])), frequencies = v[1:10]), aes(x = names, y = frequencies, fill=names)) + geom_bar(stat = "identity") 

From this observation, it can be concluded that word frequency decreases roughly as a power law of rank, which complies with Zipf's law (http://en.wikipedia.org/wiki/Zipf%27s_law).
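
To make this visible for single words as well (the same check is done for bigrams and trigrams below), the unigram frequencies can be plotted against rank on log-log scales; a minimal sketch reusing the tdmWS matrix built above:

v <- sort(slam::row_sums(tdmWS), decreasing=TRUE)
ggplot(data=data.frame(freq=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=freq), colour="red") + scale_x_log10() + scale_y_log10() + ggtitle("Rank-frequency plot for unigrams") + xlab("Rank") + ylab("Frequency")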

The next step is to find how many words are needed to cover the corpus, as required by the course task:

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)
ggplot(data = data.frame(frequencies = cumsum(v / sum(v)), pos = c(1:length(v))), aes(x = pos, y = frequencies, color="Freq")) + geom_step() + scale_x_continuous(trans = log1p_trans()) + xlab("Number of words") + ylab("Coverage")

The calculation shows that 26000 words are enough to cover 50% of the provided documents and around 50000 words to cover 90%.
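
These thresholds can also be read directly from the cumulative frequencies rather than from the plot; a minimal sketch reusing the same v vector:

coverage <- cumsum(v) / sum(v)
# number of most frequent terms needed to reach 50% and 90% coverage
which(coverage >= 0.5)[1]
which(coverage >= 0.9)[1]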

N-gram analysis

The next step of the analysis involves n-grams, starting with the generation of bigrams and trigrams:

bigramTokenizer  <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))

ngrams <- list( 
  "2gram"=TermDocumentMatrix(corpus, control=list(tokenize = bigramTokenizer, 
                                                  wordLengths=c(1, Inf))), 
  "3gram"=TermDocumentMatrix(corpus, control=list(tokenize = trigramTokenizer, 
                                                  wordLengths=c(1, Inf))) 
)

The most frequent bigrams are:

head(sort(slam::row_sums(ngrams$`2gram`), decreasing=TRUE),20)
##   of the   in the   to the  for the   on the    to be   at the  and the 
##     2232     2121     1085     1009      958      822      761      639 
##     in a with the   it was     is a    it is from the    i was    for a 
##      616      541      521      517      466      462      447      438 
##   i have    and i     of a   with a 
##      416      409      406      406

The most frequent trigrams are:

head(sort(slam::row_sums(ngrams$`3gram`), decreasing=TRUE),20)
##         one of the           a lot of     thanks for the 
##                183                136                121 
##            to be a         the end of        going to be 
##                 95                 81                 79 
##        part of the          i want to         out of the 
##                 74                 73                 72 
##           it was a        some of the         as well as 
##                 70                 70                 68 
##          i have to        i dont know           i have a 
##                 61                 60                 55 
##        the rest of        a couple of     the first time 
##                 55                 53                 53 
##         there is a looking forward to 
##                 53                 52

Word cloud representations of the most frequent bigrams and trigrams:

v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

Next, a check is run to see how well Zipf's law holds for bigrams and trigrams:

v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) + ggtitle("Zipf's law check for 2-grams") + xlab("Rank") + ylab("Frequency")

v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) + ggtitle("Zipf's law check for 3-grams") + xlab("Rank") + ylab("Frequency")

Conclusion

The exploratory analysis of the files shows that technical limitations prevent using the entire dataset for research. However, trimming the data to a 0.5% sample makes processing manageable. This document covers the data cleaning and initial analysis that will be applied in the next phase of the class project.