Overview

This document contains a brief overview of the Corpora dataset (http://www.corpora.heliohost.org/aboutcorpus.html), which is used as the data source for building a word prediction model.

The dataset was collected by a web crawler and is available in several languages; English is used as the main language here. This exploratory analysis covers a data overview, cleaning and tokenization. As an auxiliary dictionary, the US English offensive words list from Carnegie Mellon University's Luis von Ahn Research Group (http://www.cs.cmu.edu/~biglou/resources/bad-words.txt) is used.

Prerequisites

The following libraries are used for the exploratory analysis:

library(tm)
library(RWeka)
library(wordcloud)
library(ggplot2)
library(scales)
set.seed(404)

Data downloading

The list of files used for the analysis:

file.info(list.files("./final/en_US", full.names = TRUE))[1]
##                                      size
## ./final/en_US/en_US.blogs.txt   210160014
## ./final/en_US/en_US.news.txt    205811883
## ./final/en_US/en_US.twitter.txt 167105338
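
The files above are assumed to have already been downloaded and unpacked into ./final/en_US. For completeness, a minimal sketch of that step; the archive URL below is a placeholder, not taken from this document:

datasetUrl <- "https://example.com/dataset.zip"  # placeholder URL -- replace with the actual archive location
if (!dir.exists("./final")) {
  download.file(datasetUrl, destfile = "dataset.zip", mode = "wb")
  unzip("dataset.zip")  # expected to extract the final/en_US/*.txt files listed above
}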

Each file is larger than 150 MB, so it is preferable to use a file connection to read the data into memory:

con <- file("./final/en_US/en_US.twitter.txt", "r") 
tweetsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.news.txt", "r") 
newsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.blogs.txt", "r") 
blogsVector <- readLines(con)
close(con)
con <- file("./src/en_US.profanity.txt", "r") 
profanityVector <- readLines(con)
close(con)

Review of data

The files contain a large volume of data:

Corpus         NLines     NWords    NSymbols
Tweets        2360148   30373543   162384825
News          1010242   34372528   203791399
Blogs          899288   37334131   208361438
Corpus total  4269678  102080202   574537662
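
These counts can be reproduced directly from the vectors read above; a minimal sketch using base R (the word count here is approximated by splitting on whitespace, so it may differ slightly from the figures in the table):

countStats <- function(v) {
  c(NLines   = length(v),
    NWords   = sum(vapply(strsplit(v, "\\s+"), length, integer(1))),
    NSymbols = sum(nchar(v)))
}
rbind(Tweets = countStats(tweetsVector),
      News   = countStats(newsVector),
      Blogs  = countStats(blogsVector))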

Based on these results, the total word count exceeds 100 million. Therefore, it makes sense to continue with a sample of the data. Given the available computing capacity, 0.5% of the total will be used.

Data sampling

The tm library is applied to convert the data into a corpus, using a 0.5% sample:

sampleSize <- 0.005  # 0.5% of each source
sampledTweets <- sample(tweetsVector, length(tweetsVector) * sampleSize)
sampledNews <- sample(newsVector, length(newsVector) * sampleSize)
sampledBlogs <- sample(blogsVector, length(blogsVector) * sampleSize)
corpus <- VCorpus(VectorSource(c(sampledTweets, sampledNews, sampledBlogs)))

A brief overview of the sampled corpus:

Corpus        NLines   NWords   NSymbols
Tweets         11800   151332     810563
News            5051   173369    1027178
Blogs           4496   187417    1043878
Corpus total   21347   512118    2881619

The resulting dataset can be processed without additional computing capacity.

Data cleaning

The original data comes from web crawling and therefore includes misspelled words, non-existent words and other “noise” that should be stripped out before building the model. For this purpose, the following filters are applied:

corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, from="latin1", to="ASCII", sub="")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanityVector)
corpus <- tm_map(corpus, stripWhitespace)

Word statistics

The first analysis is a word frequency count using a TermDocumentMatrix from the tm package. Let's find the 20 most frequent words:

options(mc.cores=1) 
tdm <- TermDocumentMatrix(corpus, control=list(tokenize =  RWeka::WordTokenizer, 
                                              wordLengths = c(1, Inf)))
tdm  
## <<TermDocumentMatrix (terms: 38647, documents: 21347)>>
## Non-/sparse entries: 416475/824581034
## Sparsity           : 100%
## Maximal term length: 71
## Weighting          : term frequency (tf)
head(sort(slam::row_sums(tdm), decreasing=TRUE),20)
##   the    to   and     a    of    in     i    is   for  that    it   you 
## 24046 13725 12055 12007 10182  8345  8335  5512  5454  5261  4760  4606 
##    on  with   was    my    at  this    be  have 
##  4119  3600  3148  3009  2935  2814  2726  2566

The results are dominated by stop words that carry little standalone meaning. Let's strip these words out of the result and present it as a word cloud:

tdmWS <- TermDocumentMatrix(corpus, control=list(tokenize = RWeka::WordTokenizer, 
                                              wordLengths = c(1, Inf),
                                              stopwords = stopwords("english")))
tdmWS  
## <<TermDocumentMatrix (terms: 38523, documents: 21347)>>
## Non-/sparse entries: 261453/822089028
## Sparsity           : 100%
## Maximal term length: 71
## Weighting          : term frequency (tf)
head(sort(slam::row_sums(tdmWS), decreasing=TRUE),20)
##   will   said    one   just   like    can    get     im   time    new 
##   1589   1555   1511   1501   1339   1222   1210   1193   1057   1030 
##   dont   good    day    now   love   know people     us   back    see 
##    884    884    878    844    814    806    802    758    709    699
v <- sort(slam::row_sums(tdm[findFreqTerms(tdm,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

The frequencies of the top 10 words are shown below:

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
ggplot(data = data.frame(names=factor(names(v[1:10]), levels = names(v[1:10])), frequencies = v[1:10]), aes(x = names, y = frequencies, fill=names)) + geom_bar(stat = "identity") 

From this observation, it can be concluded that word frequency decreases roughly as a power law of rank, which complies with Zipf's law (http://en.wikipedia.org/wiki/Zipf%27s_law).
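
To make this visible for single words as well (the same check is done for bigrams and trigrams below), the unigram frequencies can be plotted against rank on log-log scales; a minimal sketch reusing the tdmWS matrix built above:

v <- sort(slam::row_sums(tdmWS), decreasing=TRUE)
ggplot(data=data.frame(freq=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=freq), colour="red") + scale_x_log10() + scale_y_log10() + ggtitle("Rank-frequency plot for unigrams") + xlab("Rank") + ylab("Frequency")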

The next step is to find how many words are needed to cover the corpus, as required by the course task:

v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)
ggplot(data = data.frame(frequencies = cumsum(v / sum(v)), pos = c(1:length(v))), aes(x = pos, y = frequencies, color="Freq")) + geom_step() + scale_x_continuous(trans = log1p_trans()) + xlab("Number of words") + ylab("Coverage")

The calculation shows that 26000 words are enough to cover 50% of the provided documents and around 50000 words to cover 90%.
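
These thresholds can also be read directly from the cumulative frequencies rather than from the plot; a minimal sketch reusing the same v vector:

coverage <- cumsum(v) / sum(v)
# number of most frequent terms needed to reach 50% and 90% coverage
which(coverage >= 0.5)[1]
which(coverage >= 0.9)[1]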

N-gram analysis

The next step of the analysis involves n-grams, starting with the generation of bigrams and trigrams:

bigramTokenizer  <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))

ngrams <- list( 
  "2gram"=TermDocumentMatrix(corpus, control=list(tokenize = bigramTokenizer, 
                                                  wordLengths=c(1, Inf))), 
  "3gram"=TermDocumentMatrix(corpus, control=list(tokenize = trigramTokenizer, 
                                                  wordLengths=c(1, Inf))) 
)

The most frequent bigrams are:

head(sort(slam::row_sums(ngrams$`2gram`), decreasing=TRUE),20)
##   of the   in the   to the  for the   on the    to be   at the  and the 
##     2232     2121     1085     1009      958      822      761      639 
##     in a with the   it was     is a    it is from the    i was    for a 
##      616      541      521      517      466      462      447      438 
##   i have    and i     of a   with a 
##      416      409      406      406

The most frequent trigrams are:

head(sort(slam::row_sums(ngrams$`3gram`), decreasing=TRUE),20)
##         one of the           a lot of     thanks for the 
##                183                136                121 
##            to be a         the end of        going to be 
##                 95                 81                 79 
##        part of the          i want to         out of the 
##                 74                 73                 72 
##           it was a        some of the         as well as 
##                 70                 70                 68 
##          i have to        i dont know           i have a 
##                 61                 60                 55 
##        the rest of        a couple of     the first time 
##                 55                 53                 53 
##         there is a looking forward to 
##                 53                 52

Word cloud representations of the most frequent bigrams and trigrams:

v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))

Next, a check is run to see how well Zipf's law holds for bigrams and trigrams:

v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) + ggtitle("Zipf's law check for 2-grams") + xlab("Rank") + ylab("Frequency")

v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) + ggtitle("Zipf's law check for 3-grams") + xlab("Rank") + ylab("Frequency")

Conclusion

The exploratory analysis of the files shows that technical limitations prevent using the entire dataset for research. However, trimming the data to a 0.5% sample makes processing manageable. This document covers the data cleaning and initial analysis that will be applied in the next phase of the class project.