Synopsis

This document parses and analyses text data from the SwiftKey dataset. The documents are grouped by language and by source type; there are three source types: blogs, news and Twitter. The analysis aims to answer the following questions: what are the distributions of word frequencies, and how many unique words does a frequency-sorted dictionary need to cover 50% and 90% of all word instances?

Parse and clean data

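source("Common.R")  # project helpers; presumably defines SampleCopyFilesInDirectory
library(tm)         # VCorpus, tm_map, tm_reduce (may already be loaded by Common.R)

# Create the sampled working directory on the first run only
inputDir <- "train"
if (!dir.exists(inputDir)) SampleCopyFilesInDirectory("source", inputDir, sampleSize = 100000)

rawTrainCorpus <- VCorpus(DirSource(file.path(inputDir, "en_US")))

# Note: tm_reduce applies tmFuns from right to left, so stemDocument runs first
transFuncs <- list(removePunctuation, stripWhitespace, content_transformer(tolower), stemDocument)
trainCorpus <- tm_map(rawTrainCorpus, tm_reduce, tmFuns = transFuncs)
# stopwords() defaults to the English list
trainCorpus <- tm_map(trainCorpus, removeWords, stopwords())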

To clean the data we used several methods:

  • Remove all punctuation
  • Collapse repeated whitespace
  • Remove stop words
  • Transform all capital letters to lower case
  • Stem words
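
The order of these steps affects the result: if punctuation is stripped before stop words are removed, a stop word such as "don't" becomes "dont" and no longer matches the list. The following is a minimal base-R sketch of that idea, not the pipeline used in this report. The helper `clean_text` and its four-word stop list are hypothetical, and stemming is left out because it needs an external stemmer.

```r
# Hypothetical helper: a base-R approximation of the cleaning steps.
# The stop-word list is a tiny illustrative subset, not tm::stopwords().
clean_text <- function(x, stop_words = c("the", "a", "is", "don't")) {
  # 1. Lower-case first so that stop-word matching is case-insensitive
  x <- tolower(x)
  # 2. Remove stop words while apostrophes are still present
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  # 3. Remove punctuation
  x <- gsub("[[:punct:]]+", "", x)
  # 4. Collapse repeated whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- clean_text("The  cat is NOT amused; don't laugh!")
print(cleaned)
# [1] "cat not amused laugh"
```

With the steps in this order, both "The" and "don't" are dropped before punctuation removal can alter them.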

Inspect data

Size of parsed and cleaned data in memory (in bytes)

 sapply(trainCorpus, object.size)
##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##            220496            208504            118368

Some words are more frequent than others - what are the distributions of word frequencies?

one_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min=1, max=1))
two_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min=2, max=2))
tri_gramVector <- NGramTokenizer(trainCorpus, Weka_control(min=3, max=3))

one_gram<-as.data.table(table(one_gramVector))[order(-N)]
two_gram<-as.data.table(table(two_gramVector))[order(-N)]
tri_gram<-as.data.table(table(tri_gramVector))[order(-N)]

ggplot(one_gram[1:30], aes(x = reorder(one_gramVector, 30:1), y = N)) + geom_bar(stat = "identity", fill = "grey") +
  coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent one-grams")

ggplot(two_gram[1:30], aes(x = reorder(two_gramVector, 30:1), y = N)) + geom_bar(stat = "identity", fill = "grey") +
  coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent two-grams")

ggplot(tri_gram[1:30], aes(x = reorder(tri_gramVector, 30:1), y = N)) + geom_bar(stat = "identity", fill = "grey") +
  coord_flip() + ylab("Frequency") + xlab("tokens") + ggtitle("The most frequent three-grams")
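
The tokenize-then-count step behind these frequency tables can be pictured on a toy sentence. This is a base-R sketch only: the helper `make_ngrams` is hypothetical and much simpler than the RWeka tokenizer used above, which also handles delimiters and whole corpora.

```r
# Hypothetical helper: build n-grams from a vector of tokens by sliding
# a window of width n over it.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens  <- strsplit("to be or not to be", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(tokens, 2)

# Count each distinct bigram and sort by descending frequency
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

Here "to be" occurs twice and every other bigram once, so it sorts to the top, which is the same ordering the bar charts show for the real corpus.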

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

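# Cumulative token count over the frequency-sorted one-gram table
one_gram[, cumN := cumsum(N)]
# which.min() on a logical vector returns the first FALSE,
# i.e. the first rank whose cumulative count reaches the target share
which.min(one_gram$cumN < one_gram[.N, cumN] * 0.5)
## [1] 759
which.min(one_gram$cumN < one_gram[.N, cumN] * 0.9)
## [1] 7962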
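
The coverage calculation can be checked by hand on a toy frequency table. The sketch below uses made-up counts and a hypothetical helper `coverage_size`; it returns the smallest number of top-ranked words whose cumulative share reaches the target, which is the same quantity as the `which.min` expression above.

```r
# Hypothetical helper: number of most frequent words needed to cover
# at least `target` (a share between 0 and 1) of all word instances.
coverage_size <- function(counts, target) {
  counts    <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(counts) / sum(counts)
  unname(which(cum_share >= target)[1])
}

# Made-up counts for illustration; cumulative shares are
# 0.55, 0.75, 0.87, 0.95, 1.00
counts <- c(the = 55, cat = 20, sat = 12, mat = 8, hat = 5)

print(coverage_size(counts, 0.5))  # 1 word covers at least 50%
print(coverage_size(counts, 0.9))  # 4 words are needed for 90%
```

The gap between the two results mirrors the real data: a few very frequent words cover half of the text, while the long tail of rare words makes 90% coverage far more expensive.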

Conclusions

Further work is needed. Memory and performance optimization are critical for this analysis, because even building a DocumentTermMatrix is very slow. The model should be prepared, computed and stored on powerful machines, while the resulting model should be small and simple enough to use even on mobile devices.