This document contains a brief overview of the Corpora corpus (http://www.corpora.heliohost.org/aboutcorpus.html), the data source used as the ground for building a word prediction model.
The dataset was collected by a web crawler and is available in several languages; English is used as the main language here. This exploratory analysis covers data cleaning, an overview of the data, and tokenization. As an auxiliary dictionary, the US English offensive words list from Carnegie Mellon University's Luis von Ahn Research Group is used (http://www.cs.cmu.edu/~biglou/resources/bad-words.txt).
The following libraries are used for the exploratory analysis:
library(tm)
library(RWeka)
library(wordcloud)
library(ggplot2)
library(scales)
set.seed(404)
The list of files for the analysis:
file.info(list.files("./final/en_US", full.names = TRUE))[1]
## size
## ./final/en_US/en_US.blogs.txt 210160014
## ./final/en_US/en_US.news.txt 205811883
## ./final/en_US/en_US.twitter.txt 167105338
Each file is larger than 150 MB, so it is preferable to use a file connection to read the data into memory:
con <- file("./final/en_US/en_US.twitter.txt", "r")
tweetsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.news.txt", "r")
newsVector <- readLines(con)
close(con)
con <- file("./final/en_US/en_US.blogs.txt", "r")
blogsVector <- readLines(con)
close(con)
con <- file("./src/en_US.profanity.txt", "r")
profanityVector <- readLines(con)
close(con)
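The profanity list read above is the CMU bad-words dictionary mentioned earlier. If it is not already present locally, it can be fetched with something like the following sketch (the local path ./src/en_US.profanity.txt is an assumption matching the read above):
# Download the CMU offensive-words list if it is not already on disk
dir.create("./src", showWarnings = FALSE)
if (!file.exists("./src/en_US.profanity.txt")) {
    download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                  destfile = "./src/en_US.profanity.txt")
}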
The files contain a large volume of data (a sketch of how these counts can be computed follows the table):
| Corpus | NLines | NWords | NSymbols |
|---|---|---|---|
| Tweets | 2360148 | 30373543 | 162384825 |
| News | 1010242 | 34372528 | 203791399 |
| Blogs | 899288 | 37334131 | 208361438 |
| Corpus total | 4269678 | 102080202 | 574537662 |
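For reference, the line, word, and symbol counts in the table above can be reproduced with base R summaries along the lines of the sketch below; countCorpus is a helper introduced here purely for illustration, not part of the original report:
# Hypothetical helper: count lines, whitespace-separated words, and characters
countCorpus <- function(lines) {
    c(NLines   = length(lines),
      NWords   = sum(sapply(strsplit(lines, "\\s+"), length)),
      NSymbols = sum(nchar(lines)))
}
rbind(Tweets = countCorpus(tweetsVector),
      News   = countCorpus(newsVector),
      Blogs  = countCorpus(blogsVector))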
Based on the results, the total number of words is more than 100M, so it makes sense to work with just a sample of the data. Given the available computing capacity, 0.5% of each source is used.
Applying the "tm" library to convert the sampled data into a corpus:
sampleSize <- 0.005  # 0.5% of each source
sampledTweets <- sample(tweetsVector, length(tweetsVector) * sampleSize)
sampledNews <- sample(newsVector, length(newsVector) * sampleSize)
sampledBlogs <- sample(blogsVector, length(blogsVector) * sampleSize)
corpus <- VCorpus(VectorSource(c(sampledTweets, sampledNews, sampledBlogs)))
A brief overview of the sampled corpus:
| Corpus | NLines | NWords | NSymbols |
|---|---|---|---|
| Tweets | 11800 | 151332 | 810563 |
| News | 5051 | 173369 | 1027178 |
| Blogs | 4496 | 187417 | 1043878 |
| Corpus total | 21347 | 512118 | 2881619 |
The resulting dataset can be processed without additional computing capacity.
The original data comes from web crawling and therefore includes misspelled words, non-existent words, and other noise that should be stripped out before building the model. To do so, the following filters are applied:
# Drop non-ASCII characters
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, from="latin1", to="ASCII", sub="")))
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation, keeping intra-word dashes
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove profanity using the CMU list
corpus <- tm_map(corpus, removeWords, profanityVector)
# Collapse extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
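A quick way to sanity-check the cleaning is to inspect a couple of documents from the transformed corpus (an inspection step added here for illustration, not part of the original pipeline):
# Show the cleaned text of the first two documents
lapply(corpus[1:2], as.character)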
The first analysis counts word frequencies using a term-document matrix from the tm package. Let's find the 20 most frequent words:
options(mc.cores=1)
tdm <- TermDocumentMatrix(corpus, control=list(tokenize = RWeka::WordTokenizer,
wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 38647, documents: 21347)>>
## Non-/sparse entries: 416475/824581034
## Sparsity : 100%
## Maximal term length: 71
## Weighting : term frequency (tf)
head(sort(slam::row_sums(tdm), decreasing=TRUE),20)
## the to and a of in i is for that it you
## 24046 13725 12055 12007 10182 8345 8335 5512 5454 5261 4760 4606
## on with was my at this be have
## 4119 3600 3148 3009 2935 2814 2726 2566
The results are dominated by stop words that carry little meaning on their own. Let's strip these out and present the result as a word cloud:
tdmWS <- TermDocumentMatrix(corpus, control=list(tokenize = RWeka::WordTokenizer,
wordLengths = c(1, Inf),
stopwords = stopwords("english")))
tdmWS
## <<TermDocumentMatrix (terms: 38523, documents: 21347)>>
## Non-/sparse entries: 261453/822089028
## Sparsity : 100%
## Maximal term length: 71
## Weighting : term frequency (tf)
head(sort(slam::row_sums(tdmWS), decreasing=TRUE),20)
## will said one just like can get im time new
## 1589 1555 1511 1501 1339 1222 1210 1193 1057 1030
## dont good day now love know people us back see
## 884 884 878 844 814 806 802 758 709 699
v <- sort(slam::row_sums(tdm[findFreqTerms(tdm,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))
v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))
The change in frequency across the top 10 words:
v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)[1:100]
ggplot(data = data.frame(names=factor(names(v[1:10]), levels = names(v[1:10])), frequencies = v[1:10]), aes(x = names, y = frequencies, fill=names)) + geom_bar(stat = "identity")
From this observation, we can conclude that word frequency falls off steadily with rank, which complies with Zipf's law (http://en.wikipedia.org/wiki/Zipf%27s_law).
The next step is to find how many words are needed to cover the documents, according to the course task:
v <- sort(slam::row_sums(tdmWS[findFreqTerms(tdmWS,1),]), decreasing=TRUE)
ggplot(data = data.frame(frequencies = cumsum(v / sum(v)), pos = c(1:length(v))), aes(x = pos, y = frequencies, color="Freq")) + geom_step(stat="ecdf") + scale_x_continuous(trans = log1p_trans()) + xlab("Number of words") + ylab("Coverage")
The calculation shows that about 26000 words are enough to cover 50% of the provided documents and around 50000 words to cover 90%.
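These thresholds can be read directly off the cumulative distribution computed above; a minimal sketch using the vector v from the previous chunk:
coverage <- cumsum(v) / sum(v)
min(which(coverage >= 0.5))  # number of words needed for 50% coverage
min(which(coverage >= 0.9))  # number of words needed for 90% coverage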
The next step of the analysis concerns n-grams, starting with generating bigrams and trigrams:
bigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))
ngrams <- list(
"2gram"=TermDocumentMatrix(corpus, control=list(tokenize = bigramTokenizer,
wordLengths=c(1, Inf))),
"3gram"=TermDocumentMatrix(corpus, control=list(tokenize = trigramTokenizer,
wordLengths=c(1, Inf)))
)
The most frequently used bigrams are:
head(sort(slam::row_sums(ngrams$`2gram`), decreasing=TRUE),20)
## of the in the to the for the on the to be at the and the
## 2232 2121 1085 1009 958 822 761 639
## in a with the it was is a it is from the i was for a
## 616 541 521 517 466 462 447 438
## i have and i of a with a
## 416 409 406 406
The most popular trigrams are:
head(sort(slam::row_sums(ngrams$`3gram`), decreasing=TRUE),20)
## one of the a lot of thanks for the
## 183 136 121
## to be a the end of going to be
## 95 81 79
## part of the i want to out of the
## 74 73 72
## it was a some of the as well as
## 70 70 68
## i have to i dont know i have a
## 61 60 55
## the rest of a couple of the first time
## 55 53 53
## there is a looking forward to
## 53 52
Word-cloud representations of the most frequent bigrams and trigrams:
v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))
v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)[1:100]
wordcloud(names(v), v, scale = c(5,2), colors = brewer.pal(8, "Accent"))
Next, we check how well Zipf's law holds for bigrams and trigrams:
v <- sort(slam::row_sums(ngrams$`2gram`[findFreqTerms(ngrams$`2gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) +ggtitle("Zipf's law probability for 2grams") +xlab("Rank") + ylab("Probability Mass")
v <- sort(slam::row_sums(ngrams$`3gram`[findFreqTerms(ngrams$`3gram`,1),]), decreasing=TRUE)
ggplot(data=data.frame(words=v, rnk = rank(-v))) + geom_line(aes(x=rnk, y=words, colour="red")) + scale_x_log10() + scale_y_log10() + stat_smooth(method="lm", linetype = 2, aes(x=rnk, y=words)) +ggtitle("Zipf's law probability for 3grams") +xlab("Rank") + ylab("Probability Mass")
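To quantify how closely the n-gram frequencies follow a power law, the slope of the fitted line can also be extracted with lm, using the trigram vector v from the last chunk; a slope near -1 would be consistent with Zipf's law (an illustrative check, not part of the original analysis):
zipfData <- data.frame(words = v, rnk = rank(-v))
zipfFit <- lm(log10(words) ~ log10(rnk), data = zipfData)
coef(zipfFit)  # slope close to -1 suggests Zipf-like behaviour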
After the exploratory analysis of the files, the conclusion is that technical limitations prevent using the entire dataset for research; however, trimming the data to a 0.5% sample keeps processing manageable. In this document, data cleaning and an initial analysis of the data have been performed, which will be applied in the following phase of the class project.