The goal of this document is to parse and analyse text data from the SwiftKey dataset. All documents are grouped by language and source type. There are three source types: blogs, news and Twitter. The main goal of the analysis is to answer the following questions:
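# Common.R is assumed to load tm, RWeka, data.table and ggplot2
# and to define SampleCopyFilesInDirectory()
source("Common.R")
inputDir <- "train"
# Sample the source files into the training directory on the first run only
if (!dir.exists(inputDir)) SampleCopyFilesInDirectory("source", inputDir, sampleSize = 100000)
# Read every file of the en_US folder as one document of the corpus
rawTrainCorpus <- VCorpus(DirSource(file.path(inputDir, "en_US")))
# Fold the cleaning transformations into a single pass over the corpus
transFuncs <- list(removePunctuation, stripWhitespace, content_transformer(tolower), stemDocument)
trainCorpus <- tm_map(rawTrainCorpus, tm_reduce, tmFuns = transFuncs)
# Drop English stop words (the default of stopwords())
trainCorpus <- tm_map(trainCorpus, removeWords, stopwords())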
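To clean the data we used several methods: punctuation removal, whitespace stripping, conversion to lower case, stemming and removal of English stop words.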
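To illustrate what such cleaning does to a single line of text, here is a minimal base R sketch. It is not the code used in this report, which relies on tm transformations. The helper `clean_line` and its tiny stop-word list are hypothetical, and stemming is left out because base R has no stemmer.

```r
# Hypothetical helper for illustration only (base R, no tm required).
# The stop-word list is a made-up toy list, not the one returned by stopwords().
clean_line <- function(x, stop_words = c("the", "a", "is", "of", "and")) {
  x <- tolower(x)                               # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)              # remove punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]     # split on runs of whitespace
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop empties and stop words
  paste(words, collapse = " ")                  # rejoin with single spaces
}

clean_line("The  quick, brown fox is FAST!")
```

For the sample sentence the result is `"quick brown fox fast"`: case, punctuation, extra whitespace and the two stop words are gone.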
Size of the parsed and cleaned data in memory, in bytes:
sapply(trainCorpus, object.size)
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt
##            220496            208504            118368
# Tokenize the cleaned corpus into one-, two- and three-grams
one_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min = 1, max = 1))
two_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min = 2, max = 2))
tri_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min = 3, max = 3))
# Count each distinct n-gram (column N) and sort by descending frequency
one_gram <- as.data.table(table(one_gramVector))[order(-N)]
two_gram <- as.data.table(table(two_gramVector))[order(-N)]
tri_gram <- as.data.table(table(tri_gramVector))[order(-N)]
ggplot(one_gram[1:30], aes(x = reorder(one_gramVector, 30:1), y = N)) + geom_bar(stat = 'identity', fill = "grey") + coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent one-grams")
ggplot(two_gram[1:30], aes(x = reorder(two_gramVector, 30:1), y = N)) + geom_bar(stat = 'identity', fill = "grey") + coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent two-grams")
ggplot(tri_gram[1:30], aes(x = reorder(tri_gramVector, 30:1), y = N)) + geom_bar(stat = 'identity', fill = "grey") + coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent three-grams")
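# Running total of occurrences over one-grams sorted by descending frequency
one_gram[, cumN := cumsum(N)]
# Number of most frequent one-grams needed to cover 50% of all occurrences
which.min(one_gram$cumN < one_gram[.N, cumN] * 0.5)
## [1] 759
# Number of most frequent one-grams needed to cover 90% of all occurrences
which.min(one_gram$cumN < one_gram[.N, cumN] * 0.9)
## [1] 7962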
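The two numbers above are the counts of most frequent one-grams needed to cover 50% and 90% of all one-gram occurrences in the sample. The same calculation can be sketched on toy data in base R. The token vector below is made up for illustration, and the helper name `words_needed` is hypothetical.

```r
# Toy token vector (made-up data, not taken from the SwiftKey corpus)
tokens <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d", "e")

# Frequency of each distinct token, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

# Share of all occurrences covered by the k most frequent tokens
coverage <- cumsum(as.vector(freq)) / sum(freq)

# Smallest k whose cumulative share reaches the target
words_needed <- function(coverage, target) which(coverage >= target)[1]

words_needed(coverage, 0.5)
words_needed(coverage, 0.9)
```

Here the counts are 4, 3, 2, 1 and 1 out of 11 occurrences, so two tokens reach 50% coverage and four tokens reach 90%.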
Further research has to be done. Memory and performance optimization are very important for this analysis, because even building a DocumentTermMatrix is a very slow process. Our model should be prepared, calculated and stored on powerful computers. The calculated model should be small and simple enough to use even on mobile devices.